Precise Detection of Content Reuse in the Web

Ardi, Calvin and Heidemann, John
USC/Information Sciences Institute


Calvin Ardi and John Heidemann. 2019. Precise Detection of Content Reuse in the Web. ACM SIGCOMM Computer Communication Review. 49, 2 (Apr. 2019), 9–24.


With the vast amount of content online, it is not surprising that unscrupulous entities "borrow" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically *discover* previously unknown duplicate content in the web, and the second to *precisely detect* copies of discovered or manually identified content. We show that *bad neighborhoods*, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% of copies are commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without taking on intentional cloaking.
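The core idea of the abstract, exact matching of content by cryptographic hash plus a per-neighborhood summary, can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of SHA-256, the blank-line paragraph chunking, the use of a page's directory as its "neighborhood", the two thresholds, and all function names are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict
from posixpath import dirname


def chunk_hashes(text):
    """Hash each non-empty paragraph (assumed: blank-line separated).

    Whitespace is collapsed before hashing; any other difference,
    even one character, yields a different digest.
    """
    chunks = (" ".join(p.split()) for p in text.split("\n\n"))
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks if c}


def copied_fraction(text, known):
    """Fraction of a page's distinct chunks whose hash is in `known`."""
    hashes = chunk_hashes(text)
    return len(hashes & known) / len(hashes) if hashes else 0.0


def bad_neighborhoods(pages, known, page_threshold=0.5, hood_threshold=0.5):
    """Return directories in which copied pages are frequent.

    `pages` maps a URL path to page text. A page counts as copied when
    its copied fraction reaches `page_threshold`; a directory is flagged
    when the share of copied pages reaches `hood_threshold`.
    Both thresholds are hypothetical values, not taken from the paper.
    """
    marks = defaultdict(list)
    for path, text in pages.items():
        marks[dirname(path)].append(copied_fraction(text, known) >= page_threshold)
    return {d for d, m in marks.items() if sum(m) / len(m) >= hood_threshold}
```

Because a cryptographic hash matches only identical input, a chunk that differs by a single character does not match at all. That is the source of the precision the abstract contrasts with locality-sensitive hashing, at the cost of missing lightly edited copies.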


  author = {Ardi, Calvin and Heidemann, John},
  title = {Precise Detection of Content Reuse in the Web},
  journal = {{ACM} SIGCOMM Computer Communication Review},
  project = {ant, mega},
  sortdate = {2019-05-22},
  issue_date = {April 2019},
  volume = {49},
  number = {2},
  month = apr,
  year = {2019},
  issn = {0146-4833},
  pages = {9--24},
  numpages = {16},
  url = {},
  pdfurl = {},
  blogurl = {},
  doi = {},
  acmid = {3336940},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {content duplication, content reuse, duplicate detection, phishing},
  institution = {USC/Information Sciences Institute},
  myorganization = {USC/Information Sciences Institute},
  copyrightholder = {authors}