The goal of the MEGA research project is to develop new models and algorithms to examine dynamic, multi-modal, large-scale social and computer networks.
MEGA is a joint research effort of Stanford University and USC’s Information Sciences Institute and Computer Science Department, and is part of the ANT (Analysis of Network Traffic) research group. It is supported by the U.S. Air Force Office of Scientific Research (AFOSR) and the Defense Advanced Research Projects Agency (DARPA) under grant #FA9550-12-1-0411.
ISI’s activities on MEGA ran from July 2012 to January 2014.
For related publications, please see the ANT publications web page.
See also the ANT software page.
Each .bz2 file decompresses to plaintext containing tab-separated (key, value) pairs. There are multiple schemas for keys and values. Fields within a key are generally delimited by dashes (-), and fields within a value are delimited by colons (:). The basic schema is:
{SHA1}	source-arc:begin-offset:length:url
HashFile-{SHA1}	source-arc:url
DocVector-{URL}	{SHA1}:{SHA1}:...:{SHA1}:
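The schema above can be parsed with a few lines of Python. The following is a minimal sketch, not the project's own tooling: the names `Chunk`, `HashFile`, `DocVector`, `parse_record`, and `iter_records` are hypothetical, and it assumes one record per line and that the source-arc path never contains a colon. Because URLs contain colons and may contain dashes, the sketch strips the key prefix by length and limits the number of splits on the value so the URL stays intact.

```python
import bz2
from typing import NamedTuple


class Chunk(NamedTuple):       # {SHA1} <TAB> source-arc:begin-offset:length:url
    digest: str
    source_arc: str
    offset: int
    length: int
    url: str


class HashFile(NamedTuple):    # HashFile-{SHA1} <TAB> source-arc:url
    digest: str
    source_arc: str
    url: str


class DocVector(NamedTuple):   # DocVector-{URL} <TAB> {SHA1}:{SHA1}:...:{SHA1}:
    url: str
    digests: list


def parse_record(line):
    """Parse one tab-separated (key, value) line into a typed record."""
    key, value = line.rstrip("\n").split("\t", 1)
    if key.startswith("HashFile-"):
        # Split once: everything after the first colon is the URL.
        source_arc, url = value.split(":", 1)
        return HashFile(key[len("HashFile-"):], source_arc, url)
    if key.startswith("DocVector-"):
        # The value ends with a trailing colon, so drop empty fields.
        digests = [d for d in value.split(":") if d]
        return DocVector(key[len("DocVector-"):], digests)
    # Plain digest key: split three times so the URL keeps its colons.
    source_arc, offset, length, url = value.split(":", 3)
    return Chunk(key, source_arc, int(offset), int(length), url)


def iter_records(path):
    """Yield parsed records from a .bz2 file (assumes one record per line)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)
```

Splitting the value with a bounded count is the main design choice here: an unbounded `split(":")` would break every URL at `http:`.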
An example entry (URL sanitized):
HashFile-b1a4d1dd9df7d3f2d66c03e5b4bdc679234dfa964189911840100631aec6f3fe	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:http://example.com/example.html
f467127944e1709e35fb11afaf1f44cc97b33870d04dd678b4a6d8d4524f3d36	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:0:10236:http://example.com/example.html
f576fa846498366aecebeda872f7144d095df69f92adc1ad23e7efdb4624f6a9	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10236:63:http://example.com/example.html
35f55e4abf51d4cc825d8f2014b9eeb2e023b1dc7a1de3a7e0a4d1b0af3be72a	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10299:172:http://example.com/example.html
6dba2c195989a0c5049b7b3b2527b41c22370ba6a73e384a62d196976cc7db34	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10471:100:http://example.com/example.html
5cdf26e417454027a864774fa1195eb92550c98ff10e9b8eb25343e22153e30c	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10571:123:http://example.com/example.html
d1de190ed24e8587c34c416dad5bb656df2d1a3dd6a507f89eb0198ec9780978	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:10694:366:http://example.com/example.html
5820c825c4e16108b77f7ea91f3416e1c7718dc64811688ecb4f426af7a4dbce	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:11060:131:http://example.com/example.html
21f3915561bfa6b11b832f22cd946ac2ca46e88ece75136aa29fed3edd2a7ef3	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:11191:222:http://example.com/example.html
e13bccabffc2f3b9cc9f0faf50735f70de84adc93b0122df37e5890848be7619	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:11413:725:http://example.com/example.html
0d1b59bee1c80d57cfb96e8ed0aa041534d1115eb6f90021a941ada06329dcd2	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:12138:102:http://example.com/example.html
300508ba33e4cff10bb9f89381bc28392a710ca74898a766d9554b4928f129e1	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:12240:94:http://example.com/example.html
d209e54ecec7e9ec39cd9682e421d8c1bcbcd14996de9dd61b6f78fe4028d603	common-crawl/crawl-002/2010/09/24/20/1285393861218_20.arc.gz:12334:558:http://example.com/example.html
DocVector-http://example.com/example.html	f467127944e1709e35fb11afaf1f44cc97b33870d04dd678b4a6d8d4524f3d36:f576fa846498366aecebeda872f7144d095df69f92adc1ad23e7efdb4624f6a9:35f55e4abf51d4cc825d8f2014b9eeb2e023b1dc7a1de3a7e0a4d1b0af3be72a:6dba2c195989a0c5049b7b3b2527b41c22370ba6a73e384a62d196976cc7db34:5cdf26e417454027a864774fa1195eb92550c98ff10e9b8eb25343e22153e30c:d1de190ed24e8587c34c416dad5bb656df2d1a3dd6a507f89eb0198ec9780978:5820c825c4e16108b77f7ea91f3416e1c7718dc64811688ecb4f426af7a4dbce:21f3915561bfa6b11b832f22cd946ac2ca46e88ece75136aa29fed3edd2a7ef3:e13bccabffc2f3b9cc9f0faf50735f70de84adc93b0122df37e5890848be7619:0d1b59bee1c80d57cfb96e8ed0aa041534d1115eb6f90021a941ada06329dcd2:300508ba33e4cff10bb9f89381bc28392a710ca74898a766d9554b4928f129e1:d209e54ecec7e9ec39cd9682e421d8c1bcbcd14996de9dd61b6f78fe4028d603:
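In the example entry, each chunk record's begin-offset equals the previous record's begin-offset plus its length, so the chunks appear to cover the document without gaps or overlaps. The short Python sketch below checks that property for a list of (begin-offset, length) pairs. The function names are hypothetical, the offsets and lengths are copied from the example above, and whether every document in the dataset has this property is an assumption worth verifying rather than a documented guarantee.

```python
def chunks_are_contiguous(chunks):
    """Return True if the (offset, length) pairs, sorted by offset,
    start at 0 and each begin exactly where the previous one ended."""
    expected = 0
    for offset, length in sorted(chunks):
        if offset != expected:
            return False
        expected = offset + length
    return True


def total_length(chunks):
    """Sum of all chunk lengths, in the same units as the offsets."""
    return sum(length for _, length in chunks)


# (begin-offset, length) pairs taken from the example entry.
EXAMPLE_CHUNKS = [
    (0, 10236), (10236, 63), (10299, 172), (10471, 100),
    (10571, 123), (10694, 366), (11060, 131), (11191, 222),
    (11413, 725), (12138, 102), (12240, 94), (12334, 558),
]

print(chunks_are_contiguous(EXAMPLE_CHUNKS))  # True
print(total_length(EXAMPLE_CHUNKS))           # 12892
```

A failed check on real data would point to either a missing chunk record or a document whose chunks are not laid out sequentially.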