LANDER:as_to_org_mapping-20101019

README version: 4050, last modified: 2014-06-6.

This file describes the trace dataset "as_to_org_mapping-20101019" provided by the LANDER project.

This is a derived dataset processed on 2011-11-11, with data obtained from the source below:

     Regional Internet Registry (RIR) WHOIS database. http://www.afrinic.net/, http://www.apnic.net/,
     http://www.arin.net/, http://www.lacnic.net/, http://www.ripe.net/, October 2010.

Contents

  • 1 LANDER Metadata
  • 2 Dataset Contents
  • 3 Data Format
      • 3.1 Syntax
      • 3.2 Schema
      • 3.3 How AS cluster vs. AS files relate
  • 4 Clustering Method
  • 5 Citation
  • 6 Results Using This Dataset
  • 7 User Annotations

LANDER Metadata

┌───────────────────────────┬────────────────────────────────────────────────────────────────────────────────────┐
│ dataSetName               │ as_to_org_mapping-20101019                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ status                    │ usc-web-and-predict                                                                │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ shortDesc                 │ Mapping from ASes to organizations                                                 │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ longDesc                  │ This dataset provides a mapping from ASes to organizations, i.e., identifies which │
│                           │ ASes belong to which organizations. We determined the mapping by automatic         │
│                           │ clustering methods. The general idea of the methods is to cluster ASes by their    │
│                           │ attributes found in WHOIS databases obtained from five Regional Internet           │
│                           │ Registries (RIR).                                                                  │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ datasetClass              │ Quasi-Restricted                                                                   │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ commercialAllowed         │ true                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ requestReviewRequired     │ true                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ productReviewRequired     │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ ongoingMeasurement        │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ submissionMethod          │ Upload                                                                             │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionStartDate       │ 2010-10-19                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionStartTime       │ 00:00:00                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionEndDate         │ 2010-10-19                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionEndTime         │ 00:00:00                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityStartDate     │ 2013-03-04                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityStartTime     │ 18:10:00                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityEndDate       │ 2030-01-01                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityEndTime       │ 00:00:00                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ anonymization             │ none                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ archivingAllowed          │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ keywords                  │ category:internet-topology-data, subcategory:as-organizational-data, internet,     │
│                           │ topology, AS, organization, mapping                                                │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ format                    │ text                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ access                    │ https                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ hostName                  │ USC-LANDER                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ providerName              │ USC                                                                                │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ groupingId                │                                                                                    │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ groupingSummaryFlag       │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ retrievalInstructions     │ download                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ byteSize                  │ 10485760                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ expirationDays            │ 14                                                                                 │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ uncompressedSize          │ 9696934                                                                            │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ impactDoi                 │ 10.23721/109/1353793                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ useAgreement              │ dua-ni-160816                                                                      │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ irbRequired               │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ privateAccessInstructions │ See https://ant.isi.edu/datasets/#getting-datasets for information on obtaining    │
│                           │ this dataset.                                                                      │
│                           │ See                                                                                │
└───────────────────────────┴────────────────────────────────────────────────────────────────────────────────────┘

Dataset Contents

as_to_org_mapping-20101019.README.txt
     copy of this README
as_clusters.fsdb
     The IDs, attributes and labels of AS clusters.
ases.fsdb
     ASNs, attributes, sources and names of ASes. Join with as_clusters.fsdb by "clusterid" to see
     which ASes belong to the same cluster.
.sha1sum
     SHA-1 checksum

The file ".sha1sum" contains SHA-1 checksums of the individual compressed files. The integrity of the
distribution can thus be checked by independently calculating the SHA-1 sums of the files and comparing
them with those listed in the file. If you have the sha1sum utility installed on your system, you can
do that by executing:

     sha1sum --check .sha1sum

Data Format

Syntax

Each of the *.fsdb files is in FSDB file format, a simple, white-space-separated text database format
where each line is a database row and whitespace separates columns.

Schema

Each file is a simple database.

In as_clusters.fsdb, each row is an AS cluster, and the 8 columns provide information about it.
┌────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ clusterid  │ the unique identifier that identifies an AS cluster.                                              │
├────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ nases      │ the number of ASes in the AS cluster.                                                             │
├────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ /* the following fields are derived from WHOIS database */                                                     │
├────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────┤
│            │ the OrgIDs of all ASes in the cluster, each followed by a number in the brackets. The number      │
│ orgidc     │ tells how many ASes have this OrgID, and thus it can be used to infer the importance of the       │
│            │ OrgID.                                                                                            │
├────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│            │ the contact IDs of all ASes in the cluster, each followed by a number in the brackets. The number │
│ contactidc │ tells how many ASes have this contact ID, and thus it can be used to infer the importance of the  │
│            │ contact ID.                                                                                       │
├────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│            │ the phone numbers of all ASes in the cluster, each followed by a number in the brackets. The      │
│ phonec     │ number tells how many ASes have this phone number, and thus it can be used to infer the           │
│            │ importance of the phone number. Due to privacy concerns, phones are prefix-preserving anonymized  │
│            │ (the last 4 digits have been replaced with 'x').                                                  │
├────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│            │ the email domains of all ASes in the cluster, each followed by a number in the brackets. The      │
│ emailc     │ number tells how many ASes have this email domain, and thus it can be used to infer the           │
│            │ importance of the email domain.                                                                   │
├────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ label      │ the human-readable label of the AS cluster, derived from names of ASes in the cluster.            │
├────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ domain     │ the domain address of the AS cluster, derived from email domains of ASes in the cluster.          │
└────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘

In ases.fsdb, each row is an AS; 7 columns provide information about it, and the remaining column
(clusterid) provides the link between ases.fsdb and as_clusters.fsdb.

┌───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ clusterid │ the unique identifier that identifies an AS cluster.                                               │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ asn       │ the AS Number, unique identifier of an Autonomous System (AS).                                     │
├───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ /* the following fields are derived from WHOIS database */                                                     │
├───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ orgid     │ the OrgID of the AS. There is only one (or none) OrgID for each AS.                                │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ contactid │ the contact IDs of the AS. There can be multiple (or none) contact IDs for each AS.                │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│           │ the phone numbers of the AS. There can be multiple (or none) phone numbers for each AS. Due to     │
│ phone     │ privacy concerns, phones are prefix-preserving anonymized (the last 4 digits have been replaced    │
│           │ with 'x').                                                                                         │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ email     │ the email domains of the AS. There can be multiple (or none) email domains for each AS.            │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ rir       │ the Regional Internet Registry (RIR) the AS belongs to; should be one of {ARIN, RIPE, APNIC,       │
│           │ LACNIC, AFRINIC}.                                                                                  │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ name      │ the name of the AS.                                                                                │
└───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘

If the value in a column is "-", the information is not available for that AS cluster or AS.

How AS cluster vs. AS files relate

The cluster file (as_clusters.fsdb) and the AS file (ases.fsdb) relate to each other. The cluster file
lists each cluster we find, one cluster per row. The AS file lists all ASes and the cluster each AS
belongs to.

To see which ASes belong to the same cluster (sharing the same clusterid), join the AS cluster file
with the AS file by clusterid.

Clustering Method

This dataset provides a mapping from ASes to organizations, i.e., identifies which ASes belong to which
organizations. We determined the mapping through a combination of two methods: automatic clustering on
a structured data source (AS registration data in WHOIS databases obtained from five Regional Internet
Registries (RIRs)) and a semi-automatic method on a less structured data source (company subsidiary
data contained in U.S. SEC Form 10-K filings). We extract organization-related attributes from WHOIS
for each AS and cluster ASes based on how similar their attributes are.
Because WHOIS lacks information on company mergers and acquisitions, WHOIS-based clustering alone fails
to group the AS clusters of different subsidiaries together. We therefore also extract the subsidiary
lists of 50 important companies from 10-K filings (this dataset will be released in the future) and
group related AS clusters accordingly.

Details about our methodology are in the technical report:

  • Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. An Organization-Level
    View of the Internet and its Implications (Extended). Technical Report ISI-TR-2012-679,
    USC/Information Sciences Institute, June, 2012. ftp://ftp.isi.edu/isi-pubs/tr-679.pdf

Parameters used in this methodology include the weights for attribute types and the cutting threshold
for the clustering algorithm. The weight assignment is as follows:

  1) OrgID (orgid), 0.75;
  2) contactID (contactid), 0.1;
  3) phones (phone), 0.1;
  4) emails (email), 0.05.

The cutting threshold is 0.0025.

Currently there is no formal way to associate an AS cluster with an organization, but one can infer the
organization identity of an AS cluster from its label or domain. The label is derived from text names
in the WHOIS database, and the domain is derived from the email domains we used for clustering. Email
domains are usually human-readable and can provide an accurate reference to organizations' websites.
Keywords in text names can provide additional information and are especially useful for manual
searching.

Citation

If you use this trace to conduct additional research, please cite it as:

     PREDICT ID: USC-LANDER/as_to_org_mapping-20101019/rev4050. Traces generated on 2011-11-11.
     Provided by the USC/LANDER project (http://www.isi.edu/ant/lander).

Results Using This Dataset

This dataset is a significant revision of the previous dataset
(LANDER:as_to_org_mapping_preliminary_results-20100507) used in the following previously published
work:

  • Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. Towards an
    AS-to-Organization Map. In Proceedings of the ACM Internet Measurement Conference (IMC).
    Melbourne, Australia, ACM. November, 2010. http://www.isi.edu/~johnh/PAPERS/Cai10c.pdf

Below is the technical report about this new dataset and the new methodology.

  • Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. An Organization-Level
    View of the Internet and its Implications (Extended). Technical Report ISI-TR-2012-679,
    USC/Information Sciences Institute, June, 2012. ftp://ftp.isi.edu/isi-pubs/tr-679.pdf

User Annotations

This dataset has 49,604 ASes grouped into 31,231 clusters. Johnh 10:19, 7 December 2011 (PST)
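
The join by clusterid described under "How AS cluster vs. AS files relate" could be sketched in Python
roughly as follows. This is only an illustration under stated assumptions, not a tool shipped with the
dataset: it assumes each FSDB file starts with a "#fsdb" header line that names the columns, that an
optional "-F t" flag in that header means tab-separated fields, and that other lines starting with "#"
are comments. The helper names (read_fsdb, ases_by_cluster) and the inline sample rows are hypothetical;
the sample uses AS numbers reserved for documentation and only a subset of the real columns.

```python
from collections import defaultdict


def read_fsdb(text):
    """Parse FSDB-style text into a list of dicts keyed by column name.

    Assumptions (not taken from this README): the header line starts with
    "#fsdb" and lists the column names, "-F t" means tab-separated fields
    (any whitespace otherwise), and other "#" lines are comments.
    """
    columns, sep, rows = None, None, []
    for line in text.splitlines():
        if line.startswith("#fsdb"):
            tokens = line.split()[1:]
            # Consume leading option flags such as "-F t" or "-Ft".
            while tokens and tokens[0].startswith("-"):
                flag = tokens.pop(0)
                value = flag[2:] or (tokens.pop(0) if tokens else "")
                if flag.startswith("-F") and value == "t":
                    sep = "\t"
            columns = tokens
        elif line.startswith("#") or not line.strip():
            continue  # comment or blank line
        elif columns is not None:
            rows.append(dict(zip(columns, line.split(sep))))
    return rows


def ases_by_cluster(clusters, ases):
    """Join the two tables on clusterid: clusterid -> (label, [asn, ...])."""
    members = defaultdict(list)
    for row in ases:
        members[row["clusterid"]].append(row["asn"])
    return {
        c["clusterid"]: (c["label"], members.get(c["clusterid"], []))
        for c in clusters
    }


# Hypothetical sample rows standing in for as_clusters.fsdb and ases.fsdb.
CLUSTERS = (
    "#fsdb -F t clusterid nases label\n"
    "1\t2\texample-net\n"
    "2\t1\tsample-isp\n"
)
ASES = (
    "#fsdb -F t clusterid asn rir name\n"
    "1\t64496\tARIN\tEXAMPLE-A\n"
    "1\t64497\tRIPE\tEXAMPLE-B\n"
    "2\t64511\tAPNIC\tSAMPLE\n"
)

if __name__ == "__main__":
    joined = ases_by_cluster(read_fsdb(CLUSTERS), read_fsdb(ASES))
    for clusterid, (label, asns) in joined.items():
        print(clusterid, label, ",".join(asns))
```

On the real files, the number of rows read from ases.fsdb and as_clusters.fsdb should correspond to the
AS and cluster counts noted above, and each cluster's nases value should equal the number of ASes joined
to it. A value of "-" in any column means the information is not available.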