MEGA: Modern Graph Analysis for Dynamic Networks

Project Description

The goal of the MEGA research project is to develop new models and algorithms to examine dynamic, multi-modal, large-scale social and computer networks.

MEGA is a joint research effort of Stanford University and USC’s Information Sciences Institute and Computer Science Department, and is part of the ANT (Analysis of Network Traffic) research group. It is supported by the U.S. Air Force Office of Scientific Research (AFOSR) and the Defense Advanced Research Projects Agency (DARPA) under grant #FA9550-12-1-0411.

ISI’s activities on MEGA ran from July 2012 to January 2014.

People

  • Calvin Ardi, PhD student (USC CS Dept. and ISI)
  • Ashish Goel, professor (Stanford, Management Science and Engineering; previously on the faculty of the USC CS Dept.)
  • John Heidemann, co-PI on this project, project leader and professor (USC/ISI)

Publications

  • Calvin Ardi and John Heidemann 2019. Precise Detection of Content Reuse in the Web. ACM SIGCOMM Computer Communication Review. 49, 2 (Apr. 2019), 9–24. [DOI] [PDF] Details
  • Calvin Ardi and John Heidemann 2016. AuntieTuna: Personalized Content-Based Phishing Detection. Proceedings of the NDSS Workshop on Usable Security (San Diego, California, USA, Feb. 2016). [PDF] [Code] Details
  • Calvin Ardi and John Heidemann 2015. Poster: Lightweight Content-based Phishing Detection. Technical Report ISI-TR-2015-698. USC/Information Sciences Institute. [PDF] Details
  • Calvin Ardi and John Heidemann 2014. Web-scale Content Reuse Detection (extended). Technical Report ISI-TR-2014-692. USC/Information Sciences Institute. [PDF] Details

For related publications, please see the ANT publications web page.

Software

See also the ANT software page.

Datasets

  • commoncrawl_crawl_002_hashes-20130425: contains hashes of all documents and chunks of documents in Common Crawl’s crawl-002 dataset (1619 .bz2 files, 1.8TB compressed). Hashes were generated by content-reuse-detection (under “Software”).

Dataset Format

Each .bz2 file decompresses to plain text consisting of tab-separated (key, value) pairs. There are multiple schemas for keys and values. Fields within a key are generally delimited by dashes (-), and fields within a value are delimited by colons (:). The basic schema is:

{SHA1}              source-arc:begin-offset:length:url
HashFile-{SHA1}     source-arc:url
DocVector-{URL}     {SHA1}:{SHA1}:...:{SHA1}:
  • {SHA1}: a SHA-1 hash of a paragraph chunk in url. The source-arc is the path to the ARC file containing url; begin-offset and length give the byte range of the paragraph chunk (from begin-offset to begin-offset+length).
  • HashFile-{SHA1}: a SHA-1 hash of the entire document/file at url, located inside source-arc.
  • DocVector-{URL}: a list of SHA-1 hashes of all chunks in the document/file at {URL}, delimited by colons (:).
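
The three schemas above can be told apart by their key prefix. The following Python is a minimal sketch of a reader, not the project's own tooling: the names Chunk, HashFile, DocVector, parse_record, and iter_records are hypothetical, and it assumes that files are UTF-8 text and that source-arc paths contain no colon. Because URLs contain colons (and may contain dashes), values are split a bounded number of times rather than on every delimiter. Digest length is deliberately not validated; the digests in the sanitized example entry are 64 hexadecimal characters, whereas a standard SHA-1 hex digest is 40.

```python
import bz2
from typing import Iterator, List, NamedTuple, Union


class Chunk(NamedTuple):
    digest: str
    source_arc: str
    begin_offset: int
    length: int
    url: str


class HashFile(NamedTuple):
    digest: str
    source_arc: str
    url: str


class DocVector(NamedTuple):
    url: str
    digests: List[str]


Record = Union[Chunk, HashFile, DocVector]


def parse_record(line: str) -> Record:
    """Parse one tab-separated (key, value) line into a typed record."""
    key, value = line.rstrip("\n").split("\t", 1)
    if key.startswith("DocVector-"):
        # The URL may itself contain dashes, so only strip the prefix.
        url = key[len("DocVector-"):]
        # The value ends with a trailing colon; drop the empty last field.
        return DocVector(url, [d for d in value.split(":") if d])
    if key.startswith("HashFile-"):
        # Split once: the URL contains colons (e.g. "http://").
        # Assumption: the source-arc path itself has no colon.
        source_arc, url = value.split(":", 1)
        return HashFile(key[len("HashFile-"):], source_arc, url)
    # Bare digest key: a paragraph chunk. Split at most three times so
    # that colons inside the URL are preserved.
    source_arc, begin, length, url = value.split(":", 3)
    return Chunk(key, source_arc, int(begin), int(length), url)


def iter_records(path: str) -> Iterator[Record]:
    """Stream records from one .bz2 file without decompressing it to disk."""
    # Assumption: UTF-8 text; undecodable bytes are replaced, not fatal.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.strip():
                yield parse_record(line)
```

Streaming line by line keeps memory use flat, which matters for a dataset of this size (1.8TB compressed); a caller would typically loop over iter_records(path) for each file and dispatch on the record type with isinstance.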

An example entry (URL sanitized):

HashFile-b1a4d1dd9df7d3f2d66c03e5b4bdc679234dfa964189911840100631aec6f3fe       common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:http://example.com/example.html
f467127944e1709e35fb11afaf1f44cc97b33870d04dd678b4a6d8d4524f3d36                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:0:10236:http://example.com/example.html
f576fa846498366aecebeda872f7144d095df69f92adc1ad23e7efdb4624f6a9                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10236:63:http://example.com/example.html
35f55e4abf51d4cc825d8f2014b9eeb2e023b1dc7a1de3a7e0a4d1b0af3be72a                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10299:172:http://example.com/example.html
6dba2c195989a0c5049b7b3b2527b41c22370ba6a73e384a62d196976cc7db34                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10471:100:http://example.com/example.html
5cdf26e417454027a864774fa1195eb92550c98ff10e9b8eb25343e22153e30c                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10571:123:http://example.com/example.html
d1de190ed24e8587c34c416dad5bb656df2d1a3dd6a507f89eb0198ec9780978                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10694:366:http://example.com/example.html
5820c825c4e16108b77f7ea91f3416e1c7718dc64811688ecb4f426af7a4dbce                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:11060:131:http://example.com/example.html
21f3915561bfa6b11b832f22cd946ac2ca46e88ece75136aa29fed3edd2a7ef3                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:11191:222:http://example.com/example.html
e13bccabffc2f3b9cc9f0faf50735f70de84adc93b0122df37e5890848be7619                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:11413:725:http://example.com/example.html
0d1b59bee1c80d57cfb96e8ed0aa041534d1115eb6f90021a941ada06329dcd2                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:12138:102:http://example.com/example.html
300508ba33e4cff10bb9f89381bc28392a710ca74898a766d9554b4928f129e1                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:12240:94:http://example.com/example.html
d209e54ecec7e9ec39cd9682e421d8c1bcbcd14996de9dd61b6f78fe4028d603                common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:12334:558:http://example.com/example.html
DocVector-http://example.com/example.html                                       f467127944e1709e35fb11afaf1f44cc97b33870d04dd678b4a6d8d4524f3d36:f576fa846498366aecebeda872f7144d095df69f92adc1ad23e7efdb4624f6a9:35f55e4abf51d4cc825d8f2014b9eeb2e023b1dc7a1de3a7e0a4d1b0af3be72a:6dba2c195989a0c5049b7b3b2527b41c22370ba6a73e384a62d196976cc7db34:5cdf26e417454027a864774fa1195eb92550c98ff10e9b8eb25343e22153e30c:d1de190ed24e8587c34c416dad5bb656df2d1a3dd6a507f89eb0198ec9780978:5820c825c4e16108b77f7ea91f3416e1c7718dc64811688ecb4f426af7a4dbce:21f3915561bfa6b11b832f22cd946ac2ca46e88ece75136aa29fed3edd2a7ef3:e13bccabffc2f3b9cc9f0faf50735f70de84adc93b0122df37e5890848be7619:0d1b59bee1c80d57cfb96e8ed0aa041534d1115eb6f90021a941ada06329dcd2:300508ba33e4cff10bb9f89381bc28392a710ca74898a766d9554b4928f129e1:d209e54ecec7e9ec39cd9682e421d8c1bcbcd14996de9dd61b6f78fe4028d603:
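
In the example entry above, each chunk begins at the byte where the previous chunk ends, and the DocVector lists the chunk hashes in that same order. The sketch below checks the first property using the (begin-offset, length) pairs copied from the example; the function names is_contiguous and covered_bytes are hypothetical, and whether contiguity holds for every document in the dataset is an assumption that the format description does not state.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin-offset, length)

# (begin-offset, length) pairs copied from the example entry above.
EXAMPLE_SPANS: List[Span] = [
    (0, 10236), (10236, 63), (10299, 172), (10471, 100),
    (10571, 123), (10694, 366), (11060, 131), (11191, 222),
    (11413, 725), (12138, 102), (12240, 94), (12334, 558),
]


def is_contiguous(spans: List[Span]) -> bool:
    """Return True if every span starts exactly where the previous one ends."""
    return all(
        prev_begin + prev_len == begin
        for (prev_begin, prev_len), (begin, _) in zip(spans, spans[1:])
    )


def covered_bytes(spans: List[Span]) -> int:
    """Bytes from the start of the first span to the end of the last."""
    first_begin, _ = spans[0]
    last_begin, last_len = spans[-1]
    return last_begin + last_len - first_begin


print(is_contiguous(EXAMPLE_SPANS))   # True
print(covered_bytes(EXAMPLE_SPANS))   # 12892
```

A check like this is a cheap sanity test when reassembling a document's chunks from separate records, since a gap or overlap would indicate missing or misordered entries.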