Poster: Lightweight Content-based Phishing Detection

Ardi, Calvin and Heidemann, John
USC/Information Sciences Institute

Calvin Ardi and John Heidemann. 2015. Poster: Lightweight Content-based Phishing Detection. Technical Report ISI-TR-2015-698. USC/Information Sciences Institute.


Increasing use of Internet banking and shopping by a broad spectrum of users results in greater potential profits from phishing attacks via websites that masquerade as legitimate sites to trick users into sharing passwords or financial information. Most browsers today detect potential phishing with URL blacklists; while effective at stopping previously known threats, blacklists must react to new threats as they are discovered, leaving users vulnerable for a period of time. Alternatively, whitelists can identify “known-good” websites so that off-list sites (including possible phish) can never be accessed, but they are too restrictive for many users. Our goal is proactive detection of phishing websites with neither the delay of blacklist identification nor the strict constraints of whitelists. Our approach is to list known phishing targets, index the content at their correct sites, and then look for this content to appear at incorrect sites. Our insight is that cryptographic hashing of page contents allows for efficient bulk identification of content reuse at phishing sites. Our contribution is a system to detect phish by comparing hashes of visited websites to the hashes of the original, known-good, legitimate websites. We implement our approach as a browser extension in Google Chrome and show that our algorithms detect a majority of phish, even with minimal countermeasures to page obfuscation. A small number of alpha users have been using the extension without issues for several weeks, and we will be releasing our extension and source code upon publication.
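
The hash-comparison idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' extension code: splitting pages into blank-line-separated chunks, the choice of SHA-256, the `min_matches` threshold, and the function names `chunk_hashes`, `build_index`, and `looks_like_phish` are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlparse


def chunk_hashes(page_text):
    """Split text into blank-line-separated chunks and hash each one.

    Chunking by paragraph and using SHA-256 are illustrative assumptions.
    """
    chunks = [" ".join(c.split()) for c in page_text.split("\n\n")]
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks if c}


def build_index(known_good_pages):
    """Map each chunk hash to the set of legitimate hosts that serve it."""
    index = {}
    for url, text in known_good_pages.items():
        host = urlparse(url).hostname
        for h in chunk_hashes(text):
            index.setdefault(h, set()).add(host)
    return index


def looks_like_phish(url, page_text, index, min_matches=2):
    """Flag a page that reuses indexed content on a host that does not own it.

    min_matches is a hypothetical threshold, not a value from the report.
    """
    host = urlparse(url).hostname
    hits = 0
    for h in chunk_hashes(page_text):
        owners = index.get(h)
        if owners and host not in owners:
            hits += 1
    return hits >= min_matches
```

Under these assumptions, the index is built once from the pages of known phishing targets, and each visited page then costs only a few hash computations and dictionary lookups. Pages served by the legitimate host never count as matches, because that host appears among the owners of its own chunks.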


  author = {Ardi, Calvin and Heidemann, John},
  title = {Poster: Lightweight Content-based Phishing Detection},
  institution = {USC/Information Sciences Institute},
  year = {2015},
  sortdate = {2015-05-01},
  number = {ISI-TR-2015-698},
  month = may,
  location = {johnh: pafile},
  keywords = {hashing, content reuse, wikipedia, copying, phishing},
  url = {},
  pdfurl = {},
  otherurl = {},
  myorganization = {USC/Information Sciences Institute},
  copyrightholder = {authors},
  project = {ant, retrofuture, mega}