Web-scale Content Reuse Detection (extended)
Calvin Ardi and John Heidemann
USC/Information Sciences Institute


Calvin Ardi and John Heidemann. Web-scale Content Reuse Detection (extended). Technical Report ISI-TR-2014-692. USC/Information Sciences Institute. [PDF] [alt PDF]


With the vast amount of accessible, online content, it is not surprising that unscrupulous entities “borrow” from the web to provide filler for advertisements, link farms, and spam, making a quick profit. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms: one to automatically discover previously unknown duplicate content in the web, and a second to detect copies of discovered or manually identified content in the web. Our detection can also find bad neighborhoods, clusters of pages where copied content is frequent. We verify our approach with controlled experiments on two large datasets: a subset of the web from Common Crawl, and a copy of Geocities, an older set of user-provided web content. We then demonstrate that we can discover otherwise unknown examples of duplication for spam, and detect both discovered and expert-identified content in these large datasets. Using an original copy of Wikipedia as identified content, we find 40 sites that reuse this content, 86% of them for commercial benefit.
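The abstract's core insight, hashing content pieces to find reuse across a corpus, can be illustrated with a minimal sketch. This is not the paper's exact algorithm; it assumes paragraph-level chunking and SHA-1 hashing purely for illustration, building an inverted index from chunk hash to the pages containing that chunk, and flagging chunks that appear on multiple pages as candidate reused content.

```python
import hashlib
from collections import defaultdict

def chunk_hashes(text):
    """Split a page into paragraph-level chunks and hash each one.
    (Paragraph splitting and SHA-1 are illustrative choices, not
    necessarily those used in the paper.)"""
    for chunk in text.split("\n\n"):
        chunk = chunk.strip()
        if chunk:
            yield hashlib.sha1(chunk.encode("utf-8")).hexdigest()

def discover_duplicates(pages, min_pages=2):
    """Build an inverted index from chunk hash to the set of pages
    containing that chunk, then return hashes seen on at least
    `min_pages` distinct pages -- candidates for reused content."""
    index = defaultdict(set)
    for url, text in pages.items():
        for h in chunk_hashes(text):
            index[h].add(url)
    return {h: urls for h, urls in index.items() if len(urls) >= min_pages}

# Hypothetical two-page corpus sharing one paragraph.
pages = {
    "http://a.example/1": "original essay\n\ncopied paragraph",
    "http://b.example/2": "unrelated text\n\ncopied paragraph",
}
dups = discover_duplicates(pages)
```

Because each chunk is reduced to a fixed-size hash, the index grows linearly with corpus size and lookups are constant-time, which is what makes this style of detection plausible at web scale.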

BibTeX Citation

@techreport{ardi14webscale,
  author = {Ardi, Calvin and Heidemann, John},
  title = {Web-scale Content Reuse Detection (extended)},
  institution = {USC/Information Sciences Institute},
  year = {2014},
  sortdate = {2014-06-01},
  project = {ant, mega},
  jsubject = {www},
  number = {ISI-TR-2014-692},
  month = jun,
  jlocation = {johnh: pafile},
  keywords = {hashing, content reuse, wikipedia, copying},
  url = {},
  pdfurl = {},
  otherurl = {},
  myorganization = {USC/Information Sciences Institute},
  copyrightholder = {authors}
}
Copyright © by John Heidemann