
congratulations to Guillermo Baltra for his PhD

I would like to congratulate Dr. Guillermo Baltra for defending his PhD at the University of Southern California in August 2023 and completing his doctoral dissertation “Improving network reliability using a formal definition of the Internet core”.

Guillermo Baltra (right) and his thesis advisor.

From the abstract:

After 50 years, the Internet is still defined as “a collection of interconnected networks”. Yet seamless, universal connectivity is challenged in several ways. Political pressure threatens fragmentation due to de-peering; architectural changes such as carrier-grade NAT and the cloud make connectivity indirect; firewalls impede connectivity; and operational problems and commercial disputes all challenge the idea of a single set of “interconnected networks”. We propose that a new, conceptual definition of the Internet core helps disambiguate questions in analysis of network reliability and address space usage.


We prove this statement through three studies. First, we improve coverage of outage detection by dealing with sparse sections of the Internet, increasing from a nominal 67% responsive /24 blocks coverage to 96% of the responsive Internet. Second, we provide a new definition of the Internet core, and use it to resolve partial reachability ambiguities. We show that the Internet today has peninsulas of persistent, partial connectivity, and that some outages cause islands where the Internet at the site is up, but partitioned from the main Internet. Finally, we use our definition to identify ISP trends, with applications to policy and improving outage detection accuracy. We show how these studies together thoroughly prove our thesis statement. We provide a new conceptual definition of “the Internet core” in our second study about partial reachability. We use our definition in our first and second studies to disambiguate questions about network reliability and in our third study, to ISP address space usage dynamics.
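The island and peninsula concepts are easy to make concrete. Below is a minimal sketch (our own illustration, not the dissertation’s algorithm) of how the definitions classify one address block, assuming we know whether local probing succeeds and which remote vantage points can reach the block:

```python
# Illustrative sketch (not the dissertation's algorithm): classify a block's
# connectivity using the island/peninsula definitions from the abstract.
# Assumed inputs: local_up (does probing inside the site succeed?) and
# reachable_from (which remote vantage points can reach the block).

def classify_block(local_up: bool, reachable_from: set, all_vps: set) -> str:
    """Classify one address block's connectivity state."""
    if not local_up:
        return "down"        # the site itself is not responding
    if not reachable_from:
        return "island"      # up locally, but partitioned from the main Internet
    if reachable_from != all_vps:
        return "peninsula"   # persistent, partial connectivity
    return "up"              # fully reachable: part of the core

vps = {"vp1", "vp2", "vp3"}
print(classify_block(True, set(), vps))    # island
print(classify_block(True, {"vp1"}, vps))  # peninsula
print(classify_block(True, vps, vps))      # up
```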

Guillermo’s PhD work was supported by NSF grants CNS-1806785, CNS-2007106, and NSF-2028279, by DHS S&T Cyber Security Division contract 70RSAT18CB0000014 and a DHS contract administered by AFRL as contract FA8750-18-2-0280 (all to USC Viterbi), and by the Armada de Chile and the Agencia Nacional de Investigación y Desarrollo de Chile (ANID).

Please see his individual publications for what data is available from his research; his results are also in use in ongoing Trinocular outage detection datasets.


congratulations to Basileal Imana for his PhD

I would like to congratulate Dr. Basileal Imana for defending his PhD at the University of Southern California in August 2023 and completing his doctoral dissertation “Methods for Auditing Social Media Algorithms in the Public Interest”.

Basi at his PhD hooding in May 2023 with his thesis advisors.

From the abstract:

Social-media platforms are entering a new era of increasing scrutiny by public interest groups and regulators. One reason for the increased scrutiny is platform-induced bias in how they deliver ads for life opportunities. Certain ad domains are legally protected against discrimination, and even when not, some domains have societal interest in equitable ad delivery. Platforms use relevance-estimator algorithms to optimize the delivery of ads. Such algorithms are proprietary and therefore opaque to outside evaluation, and early evidence suggests these algorithms may be biased or discriminatory. In response to such risks, the U.S. and the E.U. have proposed policies to allow researchers to audit platforms while protecting users’ privacy and platforms’ proprietary information. Currently, no technical solution exists for implementing such audits with rigorous privacy protections and without putting significant constraints on researchers. In this work, our thesis is that relevance-estimator algorithms bias the delivery of opportunity ads, but new auditing methods can detect that bias while preserving privacy.


We support our thesis statement through three studies. In the first study, we propose a black-box method for measuring gender bias in the delivery of job ads with a novel control for differences in job qualification, as well as other confounding factors that influence ad delivery. Controlling for qualification is necessary since qualification is a legally acceptable factor to target ads with, and we must separate it from bias introduced by platforms’ algorithms. We apply our method to Meta and LinkedIn, and demonstrate that Meta’s relevance estimators result in discriminatory delivery of job ads by gender. In our second study, we design a black-box methodology that is the first to propose a means to draw out potential racial bias in the delivery of education ads. Our method employs a pair of ads that are seemingly identical education opportunities but one is of inferior quality tied with a historical societal disparity that ad delivery algorithms may propagate. We apply our method to Meta and demonstrate their relevance estimators racially bias the delivery of education ads. In addition, we observe that the lack of access to demographic attributes is a growing challenge for auditing bias in ad delivery. Motivated by this challenge, we make progress towards enabling use of inferred race in black-box audits by analyzing how inference error can lead to incorrect measurement of skew in ad delivery. Going beyond the domain-specific and black-box methods we used in our first two studies, our final study proposes a novel platform-supported framework to allow researchers to audit relevance estimators that is generalizable to studying various categories of ads, demographic attributes and target platforms. The framework allows auditors to get privileged query-access to platforms’ relevance estimators to audit for bias in the algorithms while preserving the privacy interests of users and platforms. Overall, our first two studies show relevance-estimator algorithms bias the delivery of job and education ads, and thus motivate making these algorithms the target of platform-supported auditing in our third study. Our work demonstrates a platform-supported means to audit these algorithms is the key to increasing public oversight over ad platforms while rigorously protecting privacy.
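To make the black-box idea concrete, here is a hedged sketch (our own toy example, not the dissertation’s actual statistical machinery) of how one might test whether two paired ads were delivered to different audiences more than chance allows, using a standard two-proportion z-test:

```python
# Hedged sketch of the black-box auditing idea (not Imana's exact method):
# compare who two paired ads were delivered to, with a two-proportion z-test.
from math import sqrt

def delivery_skew(ad_a: tuple, ad_b: tuple) -> float:
    """Each ad is (impressions_to_group1, total_impressions).
    Returns the z statistic for the difference in group-1 share."""
    x1, n1 = ad_a
    x2, n2 = ad_b
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)            # pooled proportion under H0
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Invented example: two "identical" ads; group-1 share differs 62% vs 48%.
z = delivery_skew((6200, 10000), (4800, 10000))
print(f"z = {z:.1f}")   # |z| > 1.96 suggests skew beyond chance at p < 0.05
```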

Basi’s PhD work was co-advised by Aleksandra Korolova and John Heidemann, and supported by grants from the Rose Foundation and the NSF (CNS-1755992, CNS-1916153, CNS-1943584, CNS-1956435, and CNS-1925737). Please see his individual publications for what data is available from his research.


USC/Viterbi and ISI news about “Anycast Agility” paper

USC Viterbi and ISI have both posted news articles about our paper “Anycast Agility: Network Playbooks to Fight DDoS”.

Please see our blog entry for the abstract and the full technical paper for the real details, but their posts are very accessible. And with the hacker in the hoodie, you know it’s serious :-)

The canonical hacker in the hoodie, testifying to serious security work.

new paper “Differences in Monitoring the DNS Root Over IPv4 and IPv6” to appear at the IEEE National Symposium for NSF REU Research in Data Science, Systems, and Security

On December 15, 2022, Tarang Saluja will present the paper “Differences in Monitoring the DNS Root Over IPv4 and IPv6” (by Tarang Saluja, John Heidemann, and Yuri Pradkin) at the IEEE National Symposium for NSF REU Research in Data Science, Systems, and Security.

From the abstract:

Figure 9 from [Saluja22a], showing fraction of query failures in RIPE Atlas after we remove observers that are islands (unable to reach any of the 13 DNS root identifiers). Blue is IPv4, red is IPv6, with data for each of the 13 DNS root identifiers. We believe this data is a better representation of what people expect to see than Atlas results that include these “broken” observers.

The Domain Name System (DNS) is an essential service for the Internet which maps host names to IP addresses. The DNS Root Server System operates the top of this namespace. RIPE Atlas observes DNS from more than 11k vantage points (VPs) around the world, reporting the reliability of the DNS Root Server System in DNSmon. DNSmon shows that loss rates for queries to the DNS Root are nearly 10% for IPv6, much higher than the approximately 2% loss seen for IPv4. Although IPv6 is “new,” as an operational protocol available to a third of Internet users, it ought to be just as reliable as IPv4. We examine this difference at a finer granularity by investigating loss at individual VPs. We confirm that specific VPs are the source of this difference and identify two root causes: VP islands with routing problems at the edge which leave them unable to access IPv6 outside their LAN, and VP peninsulas which indicate routing problems in the core of the network. These problems account for most of the loss and nearly all of the difference between IPv4 and IPv6 query loss rates. Islands account for most of the loss (half of IPv4 failures and 5/6ths of IPv6 failures), and we suggest these measurement devices should be filtered out to get a more accurate picture of loss rates. Peninsulas account for the main differences between root identifiers, suggesting routing disagreements root operators need to address. We believe that filtering out both of these known problems provides a better measure of underlying network anomalies and loss and will result in more actionable alerts.
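As a rough illustration of the filtering the paper proposes, the sketch below (our own toy example; the input format is an assumption, not the RIPE Atlas API) removes island VPs, those that reach none of the root identifiers, before computing a loss rate:

```python
# Hedged sketch of the paper's filtering idea: drop "island" VPs (those
# that cannot reach ANY root identifier) before computing loss rates.
# Input format is an assumption for illustration:
# results[vp][root] = number of successful queries out of ATTEMPTS.

ATTEMPTS = 10

def loss_rate(results: dict) -> float:
    """Query loss rate over all VPs, after removing island VPs."""
    non_islands = {
        vp: roots for vp, roots in results.items()
        if any(ok > 0 for ok in roots.values())  # reached at least one root
    }
    sent = ok = 0
    for roots in non_islands.values():
        for successes in roots.values():
            sent += ATTEMPTS
            ok += successes
    return 1 - ok / sent if sent else 0.0

results = {
    "vp1": {"a-root": 10, "b-root": 9},  # healthy VP
    "vp2": {"a-root": 0, "b-root": 0},   # island: filtered out
}
print(f"filtered loss rate: {loss_rate(results):.1%}")  # 5.0%
```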

Original data from this paper is available from RIPE Atlas (measurement ids are in the paper). We are publishing new results daily on our website (from the RIPE data).

This work was done while Tarang was on his Summer 2022 undergraduate research internship at USC/ISI, with support from NSF grant 2051101 (PI: Jelena Mirkovic). John Heidemann and Yuri Pradkin’s work is supported by NSF through the EIEIO project (CNS-2007106). We thank Guillermo Baltra for his work on islands and peninsulas, as seen in his arXiv report.


new paper “Anycast Agility: Network Playbooks to Fight DDoS” at USENIX Security Symposium 2022

We will publish a new paper titled “Anycast Agility: Network Playbooks to Fight DDoS” by A S M Rizvi (USC/ISI), Leandro Bertholdo (University of Twente), João Ceron (SIDN Labs), and John Heidemann (USC/ISI) at the 31st USENIX Security Symposium in Aug. 2022.

A sample anycast playbook for a 3-site anycast deployment. Different routing configurations provide different traffic mixes. From [Rizvi22a, Table 5].

From the abstract:

IP anycast is used for services such as DNS and Content Delivery Networks (CDN) to provide the capacity to handle Distributed Denial-of-Service (DDoS) attacks. During a DDoS attack, service operators redistribute traffic between anycast sites to take advantage of sites with unused or greater capacity. Depending on site traffic and attack size, operators may instead concentrate attackers in a few sites to preserve operation in others. Operators use these actions during attacks, but how to do so has not been described systematically or publicly. This paper describes several methods to use BGP to shift traffic when under DDoS, and shows that a response playbook can provide a menu of responses that are options during an attack. To choose an appropriate response from this playbook, we also describe a new method to estimate true attack size, even though the operator’s view during the attack is incomplete. Finally, operator choices are constrained by distributed routing policies, and not all are helpful. We explore how specific anycast deployments can constrain options in this playbook, and are the first to measure how generally applicable they are across multiple anycast networks.
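To give a feel for what consulting a playbook might look like, here is a toy sketch; the sites, capacities, traffic mixes, and selection rule are all invented for illustration (the paper’s attack-size estimator is not shown), not taken from the paper:

```python
# Toy illustration of using an anycast playbook (names and numbers are
# invented, not from the paper): each entry maps a BGP configuration to
# the fraction of traffic each site would absorb; pick the entry that
# minimizes total overload for an estimated attack size.

SITE_CAPACITY = {"site-A": 40, "site-B": 50, "site-C": 50}  # Gb/s, assumed

PLAYBOOK = {
    "baseline":         {"site-A": 0.50, "site-B": 0.30, "site-C": 0.20},
    "prepend-at-A":     {"site-A": 0.20, "site-B": 0.45, "site-C": 0.35},
    "concentrate-at-A": {"site-A": 0.80, "site-B": 0.10, "site-C": 0.10},
}

def best_entry(attack_gbps: float) -> str:
    def overload(mix):
        return sum(max(0.0, attack_gbps * f - SITE_CAPACITY[s])
                   for s, f in mix.items())
    return min(PLAYBOOK, key=lambda name: overload(PLAYBOOK[name]))

print(best_entry(60))    # -> "baseline": every site stays under capacity
print(best_entry(150))   # -> "prepend-at-A": shifting load off A overloads least
```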

Datasets used in this paper are listed at https://ant.isi.edu/datasets/anycast/anycast_against_ddos/index.html, and the software used in our work is at https://ant.isi.edu/software/anygility. Both are provided as part of the Call for Artifacts.

Acknowledgments: A S M Rizvi and John Heidemann’s work on this paper is supported, in part, by the DHS HSARPA Cyber Security Division via contract number HSHQDC-17-R-B0004-TTA.02-0006-I. João Ceron and Leandro Bertholdo’s work on this paper is supported by the Netherlands Organisation for Scientific Research (4019020199) and the European Union’s Horizon 2020 research and innovation program (830927). We would like to thank our anonymous reviewers for their valuable feedback. We are also grateful to the PEERING and Tangled admins who allowed us to run measurements. We thank the Dutch National Scrubbing Center for sharing DDoS data with us. We also thank Yuri Pradkin for his help releasing our datasets.


new symposium paper “Visualizing Internet Measurements of Covid-19 Work-from-Home” at IEEE Symposium on REU Research in Data Science, Systems, and Security

We published a new paper “Visualizing Internet Measurements of Covid-19 Work-from-Home” by Erica Stutz (Swarthmore College), Yuri Pradkin, Xiao Song, and John Heidemann (USC/ISI) at the Symposium for REU Research in Data Science, Systems, and Security, co-located with IEEE BigData 2021.

A change in Internet use seen in Malaysia on 2020-04-02, present in our Covid-WFH data but discovered through our website.

From the abstract:

The Covid-19 pandemic disrupted the world as businesses and schools shifted to work-from-home (WFH), and comprehensive maps have helped visualize how those policies changed over time and in different places. We recently developed algorithms that infer the onset of WFH based on changes in observed Internet usage. Measurements of WFH are important to evaluate how effectively policies are implemented and followed, or to confirm policies in countries with less transparent journalism. This paper describes a web-based visualization system for measurements of Covid-19-induced WFH. We build on a web-based world map, showing a geographic grid of observations about WFH. We extend typical map interaction (zoom and pan, plus animation over time) with two new forms of pop-up information that allow users to drill down to investigate our underlying data. We use sparklines to show changes over the first 6 months of 2020 for a given location, supporting identification and navigation to hot spots. Alternatively, users can report particular networks (Internet Service Providers) that show WFH on a given day. We show that these tools help us relate our observations to news reports of Covid-19-induced changes and, in some cases, lockdowns due to other causes. Our visualization is publicly available at https://covid.ant.isi.edu, as is our underlying data.
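As a small illustration of the sparkline idea (our own sketch, not the site’s actual code), a compact trend can be rendered with Unicode block characters:

```python
# Hedged sketch of the sparkline idea (not the website's implementation):
# render a compact trend of a daily metric as Unicode block bars.
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))]
                   for v in values)

# Invented example: a metric that jumps when WFH begins.
daily_metric = [3, 3, 4, 3, 4, 9, 12, 14, 13, 14, 15, 14]
print(sparkline(daily_metric))  # the jump in the bars marks the WFH onset
```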

Datasets from this work will be available from our website and can be seen now at https://covid.ant.isi.edu. We thank the NSF for supporting this work through grants 2028279 and CNS-2007106.


congratulations to Manaf Gharaibeh for his PhD

I would like to congratulate Dr. Manaf Gharaibeh for defending his PhD at Colorado State University in February 2020 and completing his doctoral dissertation “Characterizing the Visible Address Space to Enable Efficient, Continuous IP Geolocation” in March 2020.

Manaf Gharaibeh’s PhD defense, with Christos Papadopoulos.

From the abstract:

Internet Protocol (IP) geolocation is vital for location-dependent applications and many network research problems. The benefits to applications include enabling content customization, proximal server selection, and management of digital rights based on the location of users, to name a few. The benefits to networking research include providing geographic context useful for several purposes, such as to study the geographic deployment of Internet resources, bind cloud data to a location, and to study censorship and monitoring, among others.
Measurement-based IP geolocation is widely considered the state-of-the-art client-independent approach to estimate the location of an IP address. However, full measurement-based geolocation is prohibitive when applied continuously to the entire Internet to maintain up-to-date IP-to-location mappings. Furthermore, many IP address blocks rarely move, making it unnecessary to perform such full geolocation.
The thesis of this dissertation states that “we can enable efficient, continuous IP geolocation by identifying clusters of co-located IP addresses and their location stability from latency observations.” In this statement, a cluster indicates a group of an arbitrary number of adjacent co-located IP addresses (from a few up to a /16). Location stability indicates a measure of how often an IP block changes location. We gain efficiency by allowing IP geolocation systems to geolocate IP addresses as units, and by detecting when a geolocation update is required, optimizations not explored in prior work. We present several studies to support this thesis statement.
We first present a study to evaluate the reliability of router geolocation in popular geolocation services, complementing prior work that evaluates end-host geolocation in such services. The results show the limitations of these services and the need for better solutions, motivating our work to enable more accurate approaches. Second, we present a method to identify clusters of co-located IP addresses by the similarity in their latency. Identifying such clusters allows us to geolocate them efficiently as units without compromising accuracy. Third, we present an efficient delay-based method to identify IP blocks that move over time, allowing us to recognize when geolocation updates are needed and avoid frequent geolocation of the entire Internet to maintain up-to-date geolocation. In our final study, we present a method to identify cellular blocks by their distinctive variation in latency compared to WiFi and wired blocks. Our method to identify cellular blocks allows a better interpretation of their latency estimates and to study their geographic properties without the need for proprietary data from operators or users.
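To illustrate the clustering idea (a simplified sketch of our own, not the dissertation’s method), adjacent blocks can be merged into one geolocation unit when their latency vectors from a common set of landmarks agree within a tolerance:

```python
# Hedged sketch of the clustering idea (not the dissertation's algorithm):
# merge adjacent /24 blocks into one geolocation unit when their latency
# vectors (RTTs in ms from the same landmarks) are similar.

def similar(lat_a, lat_b, tol_ms=5.0):
    """Adjacent blocks look co-located if every landmark RTT is close."""
    return all(abs(a - b) <= tol_ms for a, b in zip(lat_a, lat_b))

def cluster_blocks(blocks):
    """blocks: list of (prefix, latency_vector), ordered by address.
    Returns lists of prefixes to geolocate as single units."""
    clusters, current = [], [blocks[0]]
    for prev, cur in zip(blocks, blocks[1:]):
        if similar(prev[1], cur[1]):
            current.append(cur)
        else:
            clusters.append([p for p, _ in current])
            current = [cur]
    clusters.append([p for p, _ in current])
    return clusters

blocks = [("192.0.2.0/24", [10, 40, 80]),
          ("192.0.3.0/24", [12, 38, 79]),  # close RTTs: same cluster
          ("192.0.4.0/24", [55, 90, 20])]  # far RTTs: new cluster
print(cluster_blocks(blocks))
```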


congratulations to Lan Wei for her new PhD

I would like to congratulate Dr. Lan Wei for defending her PhD in September 2020 and completing her doctoral dissertation “Anycast Stability, Security and Latency in the Domain Name System (DNS) and Content Delivery Networks (CDNs)” in December 2020.

From the abstract:

Clients’ performance is important for both Content-Delivery Networks (CDNs) and the Domain Name System (DNS). Operators would like the service to meet the expectations of their users. CDNs providing stable connections will prevent users from experiencing download pauses due to connection breaks. Users expect DNS traffic to be secure, without being intercepted or injected. Both CDN and DNS operators care about short network latency, since users become frustrated by slow replies.


Many CDNs and DNS services (such as the DNS root) use IP anycast to bring content closer to users. Anycast-based services announce the same IP address(es) from globally distributed sites. In an anycast infrastructure, Internet routing protocols will direct users to a nearby site naturally. The path between a user and an anycast site is formed on a hop-by-hop basis: at each hop (a network device such as a router), routing protocols like the Border Gateway Protocol (BGP) make the decision about which next hop to use. ISPs at each hop impose their routing policies to influence BGP’s decisions. Without global knowledge of (and without the ability to modify) the BGP routing tables of every ISP on the path, anycast infrastructure operators are unable to predict and control in real time which specific site a user will visit and what the routing path will look like. Also, any change in routing policy along the path may change both the path and the site visited by a user. We refer to this minimal control over routing towards an anycast service as the uncertainty of anycast routing. Using anycast spares operators extra traffic management to map users to sites, but can operators provide a good anycast-based service without precise control over the routing?


This routing uncertainty raises three concerns: routing can change, breaking connections; uncertainty about global routing means spoofing can go undetected; and lack of knowledge of global routing can lead to suboptimal latency. In this thesis, we show how we confirm the stability, how we confirm the security, and how we improve the latency of anycast to answer these three concerns. First, routing changes can cause users to switch sites, and therefore immediately break a stateful connection such as a TCP connection. We study routing stability and demonstrate that connections in anycast infrastructure are rarely broken by routing instability. Of all vantage points (VPs), fewer than 0.15% of VPs’ TCP connections frequently break due to timeout within 5 s during the 17 hours we observed, and we see such frequent TCP connection breaks in only 1 of the 12 anycast services studied. A second problem is DNS spoofing, where a third party can intercept a DNS query and return a false answer. We examine DNS spoofing to study two aspects of security, integrity and privacy, and we design an algorithm to detect spoofing and distinguish the different mechanisms used to spoof anycast-based DNS. We show that DNS spoofing is uncommon, happening to only 1.7% of all VPs, although increasing over the years. Among the three ways to spoof DNS (injections, proxies, and third-party anycast sites, i.e., prefix hijacks), we show that third-party anycast sites are the least popular. Last, diagnosing poor latency and improving the latency can be difficult for CDNs. We develop a new approach, BAUP (bidirectional anycast unicast probing), which detects inefficient routing and provides better routing replacements. We use BAUP to study anycast latency. By applying BAUP and changing peering policies, a commercial CDN is able to significantly reduce latency, cutting median latency in half from 40 ms to 16 ms for regional users.
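The gist of BAUP can be sketched in a few lines (a simplified illustration under assumed measurements, not the thesis’s implementation): compare the latency a user sees via the anycast address against the best unicast latency to any site, and flag large gaps as routing inefficiency:

```python
# Hedged sketch of the BAUP idea (not the thesis implementation): probe a
# service both by its anycast address and by each site's unicast address;
# a big gap between the two flags routing inefficiency worth fixing.

def routing_inefficiency(anycast_rtt_ms, unicast_rtts_ms):
    """Return (excess latency in ms, best site) for one vantage point."""
    best_site = min(unicast_rtts_ms, key=unicast_rtts_ms.get)
    return anycast_rtt_ms - unicast_rtts_ms[best_site], best_site

# Invented measurements: anycast lands this user at a 40 ms site even
# though the site 'LHR' answers in 16 ms over unicast.
excess, site = routing_inefficiency(40.0,
                                    {"LHR": 16.0, "FRA": 22.0, "IAD": 75.0})
if excess > 10:  # arbitrary threshold for the example
    print(f"inefficient: {excess:.0f} ms could be saved by steering to {site}")
```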

Lan defended her PhD when USC was on work-from-home due to COVID-19; she is the third ANT student with a fully on-line PhD defense.


new journal paper “Detecting IoT Devices in the Internet” in IEEE/ACM Transactions on Networking

We have published a new journal paper “Detecting IoT Devices in the Internet” in IEEE/ACM Transactions on Networking, available at https://www.isi.edu/~johnh/PAPERS/Guo20c.pdf

Figure 5 from [Guo20c] showing per-device-type AS penetration from 2013 to 2018 for 16 of the 23 device types we studied (omitting 7 device types appearing in fewer than 10 ASes).

From the abstract of our journal paper:

Distributed Denial-of-Service (DDoS) attacks launched from compromised Internet-of-Things (IoT) devices have shown how vulnerable the Internet is to large-scale DDoS attacks. To understand the risks of these attacks requires learning about these IoT devices: where are they? how many are there? how are they changing? This paper describes three new methods to find IoT devices on the Internet: server IP addresses in traffic, server names in DNS queries, and manufacturer information in TLS certificates. Our primary methods (IP addresses and DNS names) use knowledge of servers run by the manufacturers of these devices. Our third method uses TLS certificates obtained by active scanning. We have applied our algorithms to a number of observations. With our IP-based algorithm, we report detections from a university campus over 4 months and from traffic transiting an IXP over 10 days. We apply our DNS-based algorithm to traffic from 8 root DNS servers from 2013 to 2018 to study AS-level IoT deployment. We find substantial growth (about 3.5×) in AS penetration for 23 types of IoT devices and modest increase in device type density for ASes detected with these device types (at most 2 device types in 80% of these ASes in 2018). DNS also shows substantial growth in IoT deployment in residential households from 2013 to 2017. Our certificate-based algorithm finds 254k IP cameras and network video recorders from 199 countries around the world.
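To illustrate the DNS-based method, here is a hedged sketch; the device-to-server mappings below are invented placeholders, not the paper’s curated lists:

```python
# Hedged sketch of the DNS-based detection idea: if a network's DNS
# queries include a manufacturer's device-specific server name, infer
# that the device type is present in that network. Mappings are
# hypothetical placeholders, not the paper's data.

DEVICE_SERVERS = {
    "example-camera":   {"devices.camera-vendor.example"},
    "example-doorbell": {"api.doorbell-vendor.example"},
}

def detect_devices(queried_names: set) -> set:
    """Return device types whose server names appear in observed queries."""
    return {dev for dev, names in DEVICE_SERVERS.items()
            if names & queried_names}

seen = {"api.doorbell-vendor.example", "www.example.com"}
print(detect_devices(seen))  # {'example-doorbell'}
```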

We have made public the operational traffic we captured from 10 IoT devices we own, at https://ant.isi.edu/datasets/iot/. We also use operational traffic of 21 IoT devices shared by the University of New South Wales at http://149.171.189.1/.

This journal paper is joint work of Hang Guo and John Heidemann from USC/ISI.


fighting bit rot in long-term data archives with babarchive

As part of research at ANT we generate a lot of data, and our goal is to keep it safe even in the face of an imperfect world of data storage.

When we say a lot, we mean hundreds of terabytes: as of May 2020, we have 860 releasable datasets making up 134 TB of storage (510 TB if uncompressed). We provide this data at no cost to researchers, and since 2008 we’ve provided 2049 datasets (338 TB, or 1.1 PB uncompressed!) to 406 researchers!

These datasets range from packet captures of “normal” traffic to curated captures of DDoS attacks, plus dozens of research-paper-specific datasets, 16 years of Internet censuses, 7 years of Internet outages, and target lists for IPv4 that are regularly used for traffic studies and tools like Verfploeter anycast mapping.

As part of keeping this data, our goal is to keep it intact: we want to fight bit rot and data loss. That means RAID-6 for primary storage, with monitoring and timely disk replacement. It means off-site backup (with a big thanks to our collaborators at Colorado State University, Christos Papadopoulos, Craig Partridge, and Dimitrios Kounalakis, for their help). And it means watching the bits to make sure they don’t spontaneously change.

One might think that bits at rest stay at rest, but… not always. Three times over the last 20 years we’ve seen disks spontaneously change a byte: in 2011 and 2012 I had bit flips on my personal files, and in 2020 we had a byte flip on a packet capture.

How do we know? We have application-level checksums of every file, and every day we take 10 minutes to check at least one dataset against its checksums. (Over time, we cover all datasets and then start all over.)

Our checksumming software is babarchive, our own wrapper around collecting SHA-256 checksums over a directory tree. We encourage other researchers interested in long-term data curation to carry out active content monitoring (in addition to backups and RAID); a sketch of the idea appears below.
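Here is a minimal sketch of the approach (our own illustration, not babarchive’s actual code): record SHA-256 digests for every file under a directory tree, then re-verify them later to catch silent changes. The paths in the usage comment are hypothetical.

```python
# Minimal sketch of the idea behind babarchive (not its actual code):
# record SHA-256 checksums for every file under a directory tree, then
# re-verify them later to catch silent bit flips.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

def snapshot(root: Path) -> dict:
    """Map every file (relative path) under root to its SHA-256 digest."""
    return {str(p.relative_to(root)): checksum(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root: Path, recorded: dict) -> list:
    """Return files whose bytes changed (or vanished) since the snapshot."""
    current = snapshot(root)
    return [name for name, digest in recorded.items()
            if current.get(name) != digest]

# Usage (hypothetical paths): record once, verify on a daily rotation.
# recorded = snapshot(Path("/data/datasets/example"))
# bad = verify(Path("/data/datasets/example"), recorded)
```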

A huge thanks to our research sponsors: DHS (through the LANDER, LACREND, and LACANIC projects), NSF (through the MADCAT and MR-Net projects), and DARPA (through GAWSEED).