{"id":1311,"date":"2019-05-22T10:54:42","date_gmt":"2019-05-22T17:54:42","guid":{"rendered":"https:\/\/ant.isi.edu\/blog\/?p=1311"},"modified":"2020-10-14T15:33:48","modified_gmt":"2020-10-14T22:33:48","slug":"new-paper-precise-detection-of-content-reuse-in-the-web-to-appear-in-acm-sigcomm-computer-communication-review-volume-49-issue-3-july-2019","status":"publish","type":"post","link":"https:\/\/ant.isi.edu\/blog\/?p=1311","title":{"rendered":"new paper &#8220;Precise Detection of Content Reuse in the Web&#8221; to appear in ACM SIGCOMM Computer Communication Review"},"content":{"rendered":"<p>We have published a new paper &#8220;<a href=\"https:\/\/ccronline.sigcomm.org\/wp-content\/uploads\/2019\/02\/acmdl19-299.pdf\">Precise Detection of Content Reuse in the Web<\/a>&#8221; by Calvin Ardi and John Heidemann, in the <a href=\"http:\/\/ccronline.sigcomm.org\">ACM SIGCOMM Computer Communication Review<\/a> (<a href=\"https:\/\/dl.acm.org\/citation.cfm?id=J101\">Volume 49 Issue 2, April 2019<\/a>) newsletter.<\/p>\n<p>From the abstract:<\/p>\n<blockquote><p><a href=\"https:\/\/ant.isi.edu\/blog\/wp-content\/uploads\/2019\/04\/hashing-validation.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright size-medium wp-image-1319\" src=\"https:\/\/ant.isi.edu\/blog\/wp-content\/uploads\/2019\/04\/hashing-validation-300x240.png\" alt=\"\" width=\"300\" height=\"240\" srcset=\"https:\/\/ant.isi.edu\/blog\/wp-content\/uploads\/2019\/04\/hashing-validation-300x240.png 300w, https:\/\/ant.isi.edu\/blog\/wp-content\/uploads\/2019\/04\/hashing-validation.png 537w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a>With the vast amount of content online, it is not surprising that unscrupulous entities &#8220;borrow&#8221; from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. 
We develop two related algorithms, one to automatically <em>discover<\/em> previously unknown duplicate content in the web, and the second to <em>precisely detect<\/em> copies of discovered or manually identified content. We show that <em>bad neighborhoods<\/em>, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009\/10), GeoCities (1990s\u20132000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false-positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6\u201311% are commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without taking on intentional cloaking.<\/p><\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>We have published a new paper &#8220;Precise Detection of Content Reuse in the Web&#8221; by Calvin Ardi and John Heidemann, in the ACM SIGCOMM Computer Communication Review (Volume 49 Issue 2, April 2019) newsletter. 
From the abstract: With the vast amount of content online, it is not surprising that unscrupulous entities &#8220;borrow&#8221; from the web to [&hellip;]<\/p>\n","protected":false},"author":621,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[284,282],"tags":[228,141,231,71,219,232,58,191,180,229,5,140,18,57,95,230],"class_list":["post-1311","post","type-post","status-publish","format-standard","hentry","category-papers-publications","category-publications","tag-acm-sigcomm-ccr","tag-ant","tag-ccr","tag-datasets","tag-gawseed","tag-hashing","tag-isi","tag-lacanic","tag-measurement","tag-newsletter","tag-papers","tag-phishing","tag-security","tag-usc","tag-web","tag-www"],"_links":{"self":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1311","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/users\/621"}],"replies":[{"embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1311"}],"version-history":[{"count":6,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1311\/revisions"}],"predecessor-version":[{"id":1557,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1311\/revisions\/1557"}],"wp:attachment":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1311"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1311"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1311"}],"curies":[{"name":"wp"
,"href":"https:\/\/api.w.org\/{rel}","templated":true}]}}