Categories
Papers Publications

new paper “Precise Detection of Content Reuse in the Web” to appear in ACM SIGCOMM Computer Communication Review

We have published a new paper “Precise Detection of Content Reuse in the Web” by Calvin Ardi and John Heidemann, in the ACM SIGCOMM Computer Communication Review (Volume 49 Issue 2, April 2019) newsletter.

From the abstract:

With vast amount of content online, it is not surprising that unscrupulous entities “borrow” from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically discover previously unknown duplicate content in the web, and the second to precisely detect copies of discovered or manually identified content. We show that bad neighborhoods, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false-positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% are commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without taking on intentional cloaking.

Categories
Papers Publications

new conference paper “Who Knocks at the IPv6 Door? Detecting IPv6 Scanning” at ACM IMC 2018

We have published a new paper “Who Knocks at the IPv6 Door? Detecting IPv6 Scanning” by Kensuke Fukuda and John Heidemann, in the ACM Internet Measurements Conference (IMC 2018) in Boston, Mass., USA.

DNS backscatter from IPv4 and IPv6 ([Fukuda18a], figure 1).
From the abstract:

DNS backscatter detects internet-wide activity by looking for common reverse DNS lookups at authoritative DNS servers that are high in the DNS hierarchy. Both DNS backscatter and monitoring unused address space (darknets or network telescopes) can detect scanning in IPv4, but with IPv6’s vastly larger address space, darknets become much less effective. This paper shows how to adapt DNS backscatter to IPv6. IPv6 requires new classification rules, but these reveal large network services, from cloud providers and CDNs to specific services such as NTP and mail. DNS backscatter also identifies router interfaces suggesting traceroute-based topology studies. We identify 16 scanners per week from DNS backscatter using observations from the B-root DNS server, with confirmation from backbone traffic observations or blacklists. After eliminating benign services, we classify another 95 originators in DNS backscatter as potential abuse. Our work also confirms that IPv6 appears to be less carefully monitored than IPv4.

Categories
Publications Technical Report

new technical report “Plumb: Efficient Processing of Multi-Users Pipelines (Extended)”

We released a new technical report “Plumb: Efficient Processing of Multi-Users Pipelines (Extended)”, by Abdul Qadeer and John Heidemann, as ISI-TR-727.  It is available at https://www.isi.edu/publications/trpublic/pdfs/isi-tr-727.pdf

Benefits of processing de-duplication
Benefits of data de-duplication

From the abstract:

Services such as DNS and websites often produce streams of data that are consumed by analytics pipelines operated by multiple teams. Often this data is processed in large chunks (megabytes) to allow analysis of a block of time or to amortize costs. Such pipelines pose two problems: first, duplication of computation and storage may occur when parts of the pipeline are operated by different groups. Second, processing can be lumpy, with structural lumpiness occurring when different stages need different amounts of resources, and data lumpiness occurring when a block of  input requires increased resources. Duplication and structural lumpiness both can result in inefficient processing. Data lumpiness can cause pipeline failure or deadlock, for example if differences in DDoS traffic compared to normal can require 6× CPU. We propose Plumb, a framework to abstract file processing for a multi-stage pipeline. Plumb integrates pipelines contributed by multiple users, detecting and eliminating duplication of computation and intermediate storage. It tracks and adjusts computation of each stage, accommodating both structural and data lumpiness. We exercise Plumb with the processing pipeline for B-Root DNS traffic, where it will replace a hand-tuned system to provide one third the original latency by utilizing 22% fewer CPU and will address limitations that occur as multiple users process data and when DDoS traffic causes huge shifts in performance.

 

Categories
Papers Publications

new conference paper “LDplayer: DNS Experimentation at Scale” at ACM IMC 2018

We have published a new paper LDplayer: DNS Experimentation at Scale by Liang Zhu and John Heidemann, in the ACM Internet Measurements Conference (IMC 2018) in Boston, Mass., USA.

Figure 14a: Evaluation of server memory with different TCP timeouts and minimal RTT (<1 ms). Trace: B-Root-17a. Protocol: TLS

From the abstract:

DNS has evolved over the last 20 years, improving in security and privacy and broadening the kinds of applications it supports. However, this evolution has been slowed by the large installed base and the wide range of implementations. The impact of changes is difficult to model due to complex interactions between DNS optimizations, caching, and distributed operation. We suggest that experimentation at scale is needed to evaluate changes and facilitate DNS evolution. This paper presents LDplayer, a configurable, general-purpose DNS experimental framework that enables DNS experiments to scale in several dimensions: many zones, multiple levels of DNS hierarchy, high query rates, and diverse query sources. LDplayer provides high fidelity experiments while meeting these requirements through its distributed DNS query replay system, methods to rebuild the relevant DNS hierarchy from traces, and efficient emulation of this hierarchy on minimal hardware. We show that a single DNS server can correctly emulate multiple independent levels of the DNS hierarchy while providing correct responses as if they were independent. We validate that our system can replay a DNS root traffic with tiny error (± 8 ms quartiles in query timing and ± 0.1% difference in query rate). We show that our system can replay queries at 87k queries/s while using only one CPU, more than twice of a normal DNS Root traffic rate. LDplayer’s trace replay has the unique ability to evaluate important design questions with confidence that we capture the interplay of caching, timeouts, and resource constraints. As an example, we demonstrate the memory requirements of a DNS root server with all traffic running over TCP and TLS, and identify performance discontinuities in latency as a function of client RTT.

Categories
DNS Papers Publications

new conference paper “When the Dike Breaks: Dissecting DNS Defenses During DDoS” at ACM IMC 2018

We have published a new paper “When the Dike Breaks: Dissecting DNS Defenses During DDoS” in the ACM Internet Measurements Conference (IMC 2018) in Boston, Mass., USA.

From the abstract:

Caching and retries protect half of clients even with 90% loss and an attack twice the cache duration. (Figure 7c from [Moura18b].)

The Internet’s Domain Name System (DNS) is a frequent target of Distributed Denial-of-Service (DDoS) attacks, but such attacks have had very different outcomes—some attacks have disabled major public websites, while the external effects of other attacks have been minimal. While on one hand the DNS protocol is relatively simple, the \emph{system} has many moving parts, with multiple levels of caching and retries and replicated servers. This paper uses controlled experiments to examine how these mechanisms affect DNS resilience and latency, exploring both the client side’s DNS \emph{user experience}, and server-side traffic. We find that, for about 30\% of clients, caching is not effective. However, when caches are full they allow about half of clients to ride out server outages that last less than cache lifetimes, Caching and retries together allow up to half of the clients to tolerate DDoS attacks longer than cache lifetimes, with 90\% query loss, and almost all clients to tolerate attacks resulting in 50\% packet loss. While clients may get service during an attack, tail-latency increases for clients. For servers, retries during DDoS attacks increase normal traffic up to $8\times$. Our findings about caching and retries help explain why users see service outages from some real-world DDoS events, but minimal visible effects from others.

Datasets from this paper are available at no cost and are listed at https://ant.isi.edu/datasets/dns/#Moura18b_data.

 

Categories
Papers Publications

new workshop paper “Leveraging Controlled Information Sharing for Botnet Activity Detection”

We have published a new paper “Leveraging Controlled Information Sharing for Botnet Activity Detection” in the Workshop on Traffic Measurements for Cybersecurity (WTMC 2018) in Budapest, Hungary, co-located with ACM SIGCOMM 2018.

The sensitivity of BotDigger’s detection is im- proved with controlled data sharing. All three domain/IP sets meet or pass the detection threshold.

From the abstract of our paper:

Today’s malware often relies on DNS to enable communication with command-and-control (C&C). As defenses that block traffic improve, malware use sophisticated techniques to hide this traffic, including “fast flux” names and Domain-Generation Algorithms (DGAs). Detecting this kind of activity requires analysis of DNS queries in network traffic, yet these signals are sparse. As bot countermeasures grow in sophistication, detecting these signals increasingly requires the synthesis of information from multiple sites. Yet *sharing security information across organizational boundaries* to date has been infrequent and ad hoc because of unknown risks and uncertain benefits. In this paper, we take steps towards formalizing cross-site information sharing and quantifying the benefits of data sharing. We use a case study on DGA-based botnet detection to evaluate how sharing cybersecurity data can improve detection sensitivity and allow the discovery of malicious activity with greater precision.

The relevant software is open-sourced and freely available at https://ant.isi.edu/retrofuture.

This paper is joint work between Calvin Ardi and John Heidemann from USC/ISI, with additional support from collaborators and Colorado State University and Los Alamos National Laboratory.

Categories
Papers Publications

new conference paper “The Policy Potential of Measuring Internet Outages” at TPRC

We have published a new paper “The Policy Potential of Measuring Internet Outages” in TPRC46, the Research Conference on Communications, Information and Internet Policy, to be presented on September 21, 2018 at the American University, Washington College of Law.

Outages from Hurricane Irma after landfall in Florida on 2017-09-11, observed with Trinocular.

From the abstract of our paper:

Today it is possible to evaluate the reliability of the Internet. Prior approaches to measure network reliability required telecommunications providers reporting the status of their own networks, resulting in limits on the precision, timeliness, and availability of the results. Recent work in Internet measurement has shown that network outages can be observed with active measurements from a few sites, and from passive measurements of network telescopes (large, unused address space) or large network services such as content-delivery networks. We suggest that these kinds of *third-party* observations of network outages can provide data that is precise and timely. We discuss early results of Trinocular, an outage detection system using active probing developed at the University of Southern California. Trinocular has been operating continuously since November 2013, and we provide (at no charge) data covering about 4 million network blocks from around the world. This paper describes some results of Trinocular showing outages in a large U.S. Internet Service Provider, and those resulting from the 2017 Hurricane Irma in Florida. Our data shows the impact of the Broadband America policy for always-on networks, and we discuss how it might be used to address future policy questions and assist in disaster planning and recovery.

Data we describe in this paper is at https://ant.isi.edu/datasets/outage/, with visualizations at https://ant.isi.edu/outage/world/.

This paper is joint work of John Heideman, Yuri Pradkin, and Guillermo Baltra from USC/ISI, with work carried out as part of LACANIC and DIVOICE projects with DHS S&T/CSD support.

Categories
Publications Technical Report

new technical report “Detecting IoT Devices in the Internet (Extended)”

We have released a new technical report “Detecting IoT Devices in the Internet (Extended)” as ISI-TR-726.

ISP-Level Deployment for  26 IoT Device Types. From Figure 2 of [Guo18c].
From the abstract of our technical report:

Distributed Denial-of-Service (DDoS) attacks launched from compromised Internet-of-Things (IoT) devices have shown how vulnerable the Internet is to large-scale DDoS attacks. To understand the risks of these attacks requires learning about these IoT devices: where are they? how many are there? how are they changing? This paper describes three new methods to find IoT devices on the Internet: server IP addresses in traffic, server names in DNS queries, and manufacturer information in TLS certificates. Our primary methods (IP addresses and DNS names) use knowledge of servers run by the manufacturers of these devices. We have developed these approaches with 10 device models from 7 vendors. Our third method uses TLS certificates obtained by active scanning. We have applied our algorithms to a number of observations. Our IP-based algorithms see at least 35 IoT devices on a college campus, and 122 IoT devices in customers of a regional IXP. We apply our DNSbased algorithm to traffic from 5 root DNS servers from 2013 to 2018, finding huge growth (about 7×) in ISPlevel deployment of 26 device types. DNS also shows similar growth in IoT deployment in residential households from 2013 to 2017. Our certificate-based algorithm finds 254k IP cameras and network video recorders from 199 countries around the world.

We make operational traffic we captured from 10 IoT devices we own public at https://ant.isi.edu/datasets/iot/. We also use operational traffic of 21 IoT devices shared by University of New South Wales at http://149.171.189.1/.

This technical report is joint work of Hang Guo and  John Heidemann from USC/ISI.

Categories
Papers Publications

New workshop paper “IP-Based IoT Device Detection”

We have published a new paper “IP-Based IoT Device Detection” in the Second ACM Workshop on Internet-of-Things Security and Privacy (IoTS&P 2018) in Budapest, Hungary, co-located with SIGCOMM 2018.

IoT devices we detect in use at a campus (Table 3 from [Guo18b])
From the abstract of our  paper:

Recent IoT-based DDoS attacks have exposed how vulnerable the Internet can be to millions of insufficiently secured IoT devices. To understand the risks of these attacks requires
learning about these IoT devices—where are they, how many are there, how are they changing? In this paper, we propose
a new method to find IoT devices in Internet to begin to assess this threat. Our approach requires observations of flow-level network traffic and knowledge of servers run by
the manufacturers of the IoT devices. We have developed our approach with 10 device models by 7 vendors and controlled
experiments. We apply our algorithm to observations from 6 days of Internet traffic at a college campus and partial traffic
from an IXP to detect IoT devices.

We make operational traffic we captured from 10 IoT devices we own public at https://ant.isi.edu/datasets/iot/. We also use operational traffic of 21 IoT devices shared by University of New South Wales at http://149.171.189.1/.

This paper is joint work of Hang Guo and  John Heidemann from USC/ISI.

Categories
Publications Technical Report

new technical report “When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended)”

We released a new technical report “When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended)”, ISI-TR-725, available at https://www.isi.edu/~johnh/PAPERS/Moura18a.pdf.

Moura18a Figure 6a, Answers received during a DDoS attack causing 100% packet loss with pre-loaded caches.

From the abstract:

The Internet’s Domain Name System (DNS) is a frequent target of Distributed Denial-of-Service (DDoS) attacks, but such attacks have had very different outcomes—some attacks have disabled major public websites, while the external effects of other attacks have been minimal. While on one hand the DNS protocol is a relatively simple, the system has many moving parts, with multiple levels of caching and retries and replicated servers. This paper uses controlled experiments to examine how these mechanisms affect DNS resilience and latency, exploring both the client side’s DNS user experience, and server-side traffic. We find that, for about about 30% of clients, caching is not effective. However, when caches are full they allow about half of clients to ride out server outages, and caching and retries allow up to half of the clients to tolerate DDoS attacks that result in 90% query loss, and almost all clients to tolerate attacks resulting in 50% packet loss. The cost of such attacks to clients are greater median latency. For servers, retries during DDoS attacks increase normal traffic up to 8x. Our findings about caching and retries can explain why some real-world DDoS cause service outages for users while other large attacks have minimal visible effects.

Datasets from this paper are available at no cost and are listed at https://ant.isi.edu/datasets/dns/#Moura18a_data.