Papers Publications

new journal paper “Plumb: Efficient Stream Processing of Multi-User Pipelines” in the Journal of Software: Practice and Experience

We have published a new journal paper “Plumb: Efficient Stream Processing of Multi-User Pipelines” in Wiley’s Journal of Software: Practice and Experience, available at

Plumb provides a new pipeline-graph abstraction that allows multiple users to specify workflows in which Plumb can detect and elimiate duplicate processing and handle processing skew due to unbalanced data or stages. The end result is that users get their results faster and a shared cluster is efficiently utilized.

From the abstract of our journal paper:

Operational services run 24×7 and require analytics pipelines to evaluate performance. In mature services such as DNS, these pipelines often grow to many stages developed by multiple, loosely-coupled teams. Such pipelines pose two problems: first, computation and data storage may be duplicated across components developed by different groups, wasting resources. Second, processing can be skewed, with structural skew occurring when different pipeline stages need different amounts of resources, and computational skew occurring when a block of input data requires increased resources. Duplication and structural skew both decrease efficiency, increasing cost, latency, or both. Computational skew can cause pipeline failure or deadlock when resource consumption balloons; we have seen cases where pessimal traffic increases CPU requirements 6-fold. Detecting duplication is challenging when components from multiple teams evolve independently and require fault isolation. Skew management is hard due to dynamic workloads coupled with the conflicting goals of both minimizing latency and maximizing utilization. We propose Plumb, a framework to abstract stream processing as large-block streaming (LBS) for a multi-stage, multi-user workflow. Plumb users express analytics as a DAG of processing modules, allowing Plumb to integrate and optimize workflows from multiple users. Many real-world applications map to the LBS abstraction. Plumb detects and eliminates duplicate computation and storage, and it detects and addresses both structural and computational skew by tracking computation across the pipeline. We exercise Plumb using the analytics pipeline for B-Root DNS. We compare Plumb to a hand-tuned system, cutting latency to one-third the original, and requiring 39% fewer container hours, while supporting more flexible, multi-user analytics and providing greater robustness to DDoS-driven demands.

This journal paper is joint work of Abdul Qadeer and  John Heidemann from USC/ISI.

Plumb is open source software and we will be interested in beta testers. Please contact us if you think it would be useful to manage your workflows over one or a cluster of computers.


fighting bit rot in log-term data archives with babarchive

As part of research at ANT we generate a lot of data, and our goal is to keep it safe even in the face of an imperfect world of data storage.

When we say a lot, we mean hundreds of terabytes: As of May 2020, we have releasable 860 datasets making up 134 TB of storage (510TB if we uncompressed it). We provide this data at no cost to researchers, and since 2008 we’ve provided 2049 datasets (338 TB, or 1.1PB if uncompressed!) to 406 researchers!

These datasets range from packet captures of “normal” traffic, to curated captures of DDoS attacks, as well as dozens of research paper-specific datasets, 16 years of Internet censuses and 7 years of Internet outages, plus target lists for IPv4 that are regularly used for traffic studies and tools like Verfploeter anycast mapping.

As part of keeping this data, our goal is to keep this data. We want to fight bit rot and data loss. That means the RAID-6 for primary storage, with monitoring and timely disk replacement. It means off site backup (with a big thanks to our collaborators at Colorado State University, Christos Papadopoulos, Craig Partridge, and Dimitrios Kounalakis for their help). And it means watching bits to make sure they don’t spontaneously change.

One might think that bits at rest stay at rest, but… not always. We’ve seen three times when disks have spontaneously changed a byte over the last 20 years. In 2011 and 2012 I had bit flips on my personal files, and in 2020 we had a byte flip on a packet capture.

How do we know? We have application-level checksums of every file, and every day we take 10 minutes to check at least one dataset against its checksums. (Over time, we cover all datasets and then start all over.)

Our checksumming software is babarchive–our own wrapper around collecting SHA-256 checksums over a directory tree. We encourage other researchers interested in long-term data curation to carry out active content monitoring (in addition to backups and RAID).

A huge thanks to our research sponsors: DHS (through the LANDER, LACREND, and LACANIC projects), NSF (through the MADCAT, MR-Net), and DARPA (through GAWSEED).


the ns family of network simulators are awarded the SIGCOMM Networking Systems Award for 2020

From the SIGCOMM mailing list and Facebook feed, Dina Papagiannaki posted on June 3, 2020:

I hope that everyone in the community is safe. A brief announcement that we have our winner for the networking systems award 2020. The committee comprised of Anja Feldmann (Max-Planck-Institut für Informatik), Srinivasan Keshav (University of Cambridge, chair), and Nick McKeown (Stanford University) have decided to present the award to the ns family of network simulators (ns-1, ns-2, and ns-3). Congratulations to all the contributors!


“ns” is a well-known acronym in networking research, referring to a series of network simulators (ns-1, ns-2, and ns-3) developed over the past twenty five years. ns-1 was developed at Lawrence Berkeley National Laboratory (LBNL) between 1995-97 based on an earlier simulator (REAL, written by S. Keshav). ns-2 was an early open source project, developed in the 1997-2004 timeframe and led by collaborators from USC Information Sciences Institute, LBNL, UC Berkeley, and Xerox PARC. A companion network animator (nam) was also developed during this time [Est00]. Between 2005-08, collaborators from the University of Washington, Inria Sophia Antipolis, Georgia Tech, and INESC TEC significantly rewrote the simulator to create ns-3, which continues today as an active open source project.

All of the ns simulators can be characterized as packet-level, discrete-event network simulators, with which users can build models of computer networks with varying levels of fidelity, in order to conduct performance evaluation studies. The core of all three versions is written in C++, and simulation scripts are written directly in a native programming language: for ns-1, in the Tool Command Language (Tcl), for ns-2, in object-oriented Tcl (OTcl), and for ns-3, in either C++ or Python. ns is a full-stack simulator, with a high degree of abstraction at the physical and application layers, and varying levels of modeling detail between the MAC and transport layers. ns-1 was released with a BSD software license, ns-2 with a collection of licenses later consolidated into a GNU GPLv2-compatible framework, and ns-3 with the GNU GPLv2 license.

ns-3 [Hen08, Ril10] can be viewed as a synthesis of three predecessor tools: yans [Lac06], GTNetS [Ril03], and ns-2 [Bre00]. ns-3 contains extensions to allow distributed execution on parallel processors, real-time scheduling with emulation capabilities for packet exchange with real systems, and a framework to allow C and C++ implementation (application and kernel) code to be compiled for reuse within ns-3 [Taz13]. Although ns-3 can be used as a general-purpose discrete-event simulator, and as a simulator for non-Internet-based networks, by far the most active use centers around Internet-based simulation studies, particularly those using its detailed models of Wi-Fi and 4G LTE systems. The project is now focused on developing models to allow ns-3 to support research and standardization activities involving several aspects of 5G NR, next-generation Wi-Fi, and the IETF Transport Area.

The ns-3-users Google Groups forum has over 9000 members (with several hundred monthly posts), and the developer mailing list contains over 1500 subscribers. Publication counts (as counted annually) in the ACM and IEEE digital libraries, as well as search results in Google Scholar, describing research work using or extending ns-2 and ns-3, continue to increase each year, and usage also appears to be growing within the networking industry and government laboratories. The project’s home page is at, and software development discussion is conducted on the mailing list.


The main authors of ns-1 were (in alphabetical order): Kevin Fall, Sally Floyd, Steve McCanne, and Kannan Varadhan.

ns-2 had a larger number of contributors. Space precludes listing all authors, but the following people were leading source code committers to ns-2 (in alphabetical order): Xuan Chen, Kevin Fall, Sally Floyd, Padma Haldar, John Heidemann, Tom Henderson, Polly Huang, K.C. Lan, Steve McCanne, Giao Ngyuen, Venkat Padmanabhan, Yuri Pryadkin, Kannan Varadhan, Ya Xu, and Haobo Yu. A more complete list of ns-2 contributors can be found at:

The ns-3 simulator has been developed by over 250 contributors over the past fifteen years. The original main development team consisted of (in alphabetical order): Raj Bhattacharjea, Gustavo Carneiro, Craig Dowell, Tom Henderson, Mathieu Lacage, and George Riley.

Recognition is also due to the long list of ns-3 software maintainers, many of which made significant contributions to ns-3, including (in alphabetical order): John Abraham, Zoraze Ali, Kirill Andreev, Abhijith Anilkumar, Stefano Avallone, Ghada Badawy Nicola Baldo, Peter D. Barnes, Jr., Biljana Bojovic, Pavel Boyko, Junling Bu, Elena Buchatskaya, Daniel Camara, Matthieu Coudron, Yufei Cheng, Ankit Deepak, Sebastien Deronne, Tom Goff, Federico Guerra, Budiarto Herman, Mohamed Amine Ismail, Sam Jansen, Konstantinos Katsaros, Joe Kopena, Alexander Krotov, Flavio Kubota, Daniel Lertpratchya, Faker Moatamri, Vedran Miletic, Marco Miozzo, Hemanth Narra, Natale Patriciello, Tommaso Pecorella, Josh Pelkey, Alina Quereilhac, Getachew Redieteab, Manuel Requena, Matias Richart, Lalith Suresh, Brian Swenson, Cristiano Tapparello, Adrian S.W. Tam, Hajime Tazaki, Frederic Urbani, Mitch Watrous, Florian Westphal, and Dizhi Zhou.

The full list of ns-3 authors is maintained in the AUTHORS file in the top-level source code directory, and full commit attributions can be found in the git commit logs.


[Bre00] Lee Breslau et al., Advances in network simulation, IEEE Computer, vol. 33, no. 5, pp. 59-67, May 2000.

[Est00] Deborah Estrin et al., Network Visualization with Nam, the VINT Network Animator, IEEE Computer, vol. 33, no.11, pp. 63-68, November 2000.

[Hen08] Thomas R. Henderson, Mathieu Lacage, and George F. Riley, Network simulations with the ns-3 simulator, In Proceedings of ACM Sigcomm Conference (demo), 2008.

[Lac06] Mathieu Lacage and Thomas R. Henderson. 2006. Yet another network simulator. In Proceeding from the 2006 workshop on ns-2: the IP network simulator (WNS2 ’06). Association for Computing Machinery, New York, NY, USA, 12–es.

[Ril03] George F. Riley, The Georgia Tech Network Simulator, In Proceedings of the ACM SIGCOMM Workshop on Models, Methods and Tools for Reproducible Network Research (MoMeTools) , Aug. 2003.

[Ril10] George F. Riley and Thomas Henderson, The ns-3 Network Simulator. In Modeling and Tools for Network Simulation, SpringerLink, 2010.

[Taz13] Hajime Tazaki et al. Direct code execution: revisiting library OS architecture for reproducible network experiments. In Proceedings of the ninth ACM conference on Emerging networking experiments and technologies (CoNEXT ’13). Association for Computing Machinery, New York, NY, USA, 217–228.

USC/ISI had multiple projects and was very active in ns-2 development for many years, first lead by Deborah Estrin (with the VINT project), then by John Heidemann (with the SAMAN and SCADDS projects), with Tom Henderson took over leadership and evolved it (with others) into ns-3. All of these efforts have been open source collaborations with key players at other institutions as well. Sally Floyd, Steve McCanne, and Kevin Fall were all leaders.

I would particularly like to thank the several USC students who did PhDs on ns-2 related topics: Polly Huang, Kun-Chan Lan, Debojyoti Dutta, and earlier Kannan Varadhan. Ns-2 also benefited from external code contributions from David B. Johnson’s Monarch group (then at CMU) and Elizabeth Belding and Charles Perkins. (My apologies for other contributors I’m sure I’m missing.)

A huge thanks to the ns-1 authors (Kannan Varadhan was a USC student at the time), and a huge thanks to the ns-3 authors for taking over maintainership and evolution and keeping it vibrant.


congratulations to Ryan Bogutz for his summer undergraduate internship

Ryan Bogutz completed his summer undergraduate research internship at ISI this summer, working with John Heidemann and Yuri Pradkin on his project “Identifying Interesting Outages”.

Ryan Bogutz with his poster at the ISI summer undergraduate research poster session.

In this project, Ryan examined Internet Outage data from Trinocular, developing an outage report that summarized the most “interesting” outages each day. Yuri integrated this report into our outage website where is available as a left side panel.

We hope Ryan’s new report makes it easier to evaluate Internet outages on a given day, and we look forward to continue to work with Ryan on this topic.

Ryan visited USC/ISI in summer 2019 as part of the (ISI Research Experiences for Undergraduates. We thank Jelena Mirkovic (PI) for coordinating the second year of this great program, and NSF for support through award #1659886.

See also ISI’s post about this summer undergradate program.

Publications Technical Report

new technical report “Plumb: Efficient Processing of Multi-User Pipelines (Poster)”

We released a new technical report “Plumb: Efficient Processing of Multi-User Pipelines (Poster)”, by Abdul Qadeer and John Heidemann, as ISI-TR-731.  This work was originally presented at ACM Symposium on Cloud Computing (the poster abstract is available at ACM). The poster abstract with a small version of the poster is available at

aqadeer at SoCC 2018 Carlsbad CA

From the abstract:

As the field of big data analytics matures, workflows are increasingly complex and often include components that are shared by different users. Individual workflows often include multiple stages, and when groups build on each other’s work it is easy to lose track of computation that may be shared across different groups.

The contribution of this poster is to provide an organization-wide processing substrate Plumb that can be used to solve commonly occurring problems and to achieve a common goal. Plumb makes multi-user sharing a first-class concern by providing pipeline-graph abstraction. This abstraction is simple and based on fundamental model of input-processing-output but is powerful to capture processing and data duplication. Plumb then employs best available solutions to tackle problems of large-block processing under structural and computational skew without user intervention.

We expect to release the Plumb software this fall; please contact us if you have questions or interest in using it.

Publications Technical Report

new technical report “Plumb: Efficient Processing of Multi-Users Pipelines (Extended)”

We released a new technical report “Plumb: Efficient Processing of Multi-Users Pipelines (Extended)”, by Abdul Qadeer and John Heidemann, as ISI-TR-727.  It is available at

Benefits of processing de-duplication

Benefits of data de-duplication

From the abstract:

Services such as DNS and websites often produce streams of data that are consumed by analytics pipelines operated by multiple teams. Often this data is processed in large chunks (megabytes) to allow analysis of a block of time or to amortize costs. Such pipelines pose two problems: first, duplication of computation and storage may occur when parts of the pipeline are operated by different groups. Second, processing can be lumpy, with structural lumpiness occurring when different stages need different amounts of resources, and data lumpiness occurring when a block of  input requires increased resources. Duplication and structural lumpiness both can result in inefficient processing. Data lumpiness can cause pipeline failure or deadlock, for example if differences in DDoS traffic compared to normal can require 6× CPU. We propose Plumb, a framework to abstract file processing for a multi-stage pipeline. Plumb integrates pipelines contributed by multiple users, detecting and eliminating duplication of computation and intermediate storage. It tracks and adjusts computation of each stage, accommodating both structural and data lumpiness. We exercise Plumb with the processing pipeline for B-Root DNS traffic, where it will replace a hand-tuned system to provide one third the original latency by utilizing 22% fewer CPU and will address limitations that occur as multiple users process data and when DDoS traffic causes huge shifts in performance.


Announcements Students

congratulations to Liang Zhu for his new PhD

I would like to congratulate Dr. Liang Zhu for defending his PhD in August 2018 and completing his doctoral dissertation “Balancing Security and Performance of Network Request-Response Protocols” in September 2018.

Liang Zhu (left) and John Heidemann, after Liang’s PhD defense.

From the abstract:

The Internet has become a popular tool to acquire information and knowledge. Usually information retrieval on the Internet depends on request-response protocols, where clients and servers exchange data. Despite of their wide use, request-response protocols bring challenges for security and privacy. For example, source-address spoofing enables denial-of-service (DoS) attacks, and eavesdropping of unencrypted data leaks sensitive information in request-response protocols. There is often a trade-off between security and performance in request-response protocols. More advanced protocols, such as Transport Layer Security (TLS), are proposed to solve these problems of source spoofing and eavesdropping. However, developers often avoid adopting those advanced protocols, due to performance costs such as client latency and server memory requirement. We need to understand the trade-off between security and performance for request-response protocols and find a reasonable balance, instead of blindly prioritizing one of them.
This thesis of this dissertation states that it is possible to improve security of network request-response protocols without compromising performance, by protocol and deployment optimizations, that are demonstrated through measurements of protocol developments and deployments. We support the thesis statement through three specific studies, each of which uses measurements and experiments to evaluate the development and optimization of a request-response protocol. We show that security benefits can be achieved with modest performance costs. In the first study, we measure the latency of OCSP in TLS connections. We show that OCSP has low latency due to its wide use of CDN and caching, while identifying certificate revocation to secure TLS. In the second study, we propose to use TCP and TLS for DNS to solve a range of fundamental problems in DNS security and privacy. We show that DNS over TCP and TLS can achieve favorable performance with selective optimization. In the third study, we build a configurable, general-purpose DNS trace replay system that emulates global DNS hierarchy in a testbed and enables DNS experiments at scale efficiently. We use this system to further prove the reasonable performance of DNS over TCP and TLS at scale in the real world.

In addition to supporting our thesis, our studies have their own research contributions. Specifically, In the first work, we conducted new measurements of OCSP by examining network traffic of OCSP and showed a significant improvement of OCSP latency: a median latency of only 20ms, much less than the 291ms observed in prior work. We showed that CDN serves 94% of the OCSP traffic and OCSP use is ubiquitous. In the second work, we selected necessary protocol and implementation optimizations for DNS over TCP/TLS, and suggested how to run a production TCP/TLS DNS server [RFC7858]. We suggested appropriate connection timeouts for DNS operations: 20s at authoritative servers and 60s elsewhere. We showed that the cost of DNS over TCP/TLS can be modest. Our trace analysis showed that connection reuse can be frequent (60%-95% for stub and recursive resolvers). We showed that server memory is manageable (additional 3.6GB for a recursive server), and latency of connection-oriented DNS is acceptable (9%-22% slower than UDP). In the third work, we showed how to build a DNS experimentation framework that can scale to emulate a large DNS hierarchy and replay large traces. We used this experimentation framework to explore how traffic volume changes (increasing by 31%) when all DNS queries employ DNSSEC. Our DNS experimentation framework can benefit other studies on DNS performance evaluations.

Papers Publications

new conference paper “LDplayer: DNS Experimentation at Scale” at ACM IMC 2018

We have published a new paper LDplayer: DNS Experimentation at Scale by Liang Zhu and John Heidemann, in the ACM Internet Measurements Conference (IMC 2018) in Boston, Mass., USA.

Figure 14a: Evaluation of server memory with different TCP timeouts and minimal RTT (<1 ms). Trace: B-Root-17a. Protocol: TLS

From the abstract:

DNS has evolved over the last 20 years, improving in security and privacy and broadening the kinds of applications it supports. However, this evolution has been slowed by the large installed base and the wide range of implementations. The impact of changes is difficult to model due to complex interactions between DNS optimizations, caching, and distributed operation. We suggest that experimentation at scale is needed to evaluate changes and facilitate DNS evolution. This paper presents LDplayer, a configurable, general-purpose DNS experimental framework that enables DNS experiments to scale in several dimensions: many zones, multiple levels of DNS hierarchy, high query rates, and diverse query sources. LDplayer provides high fidelity experiments while meeting these requirements through its distributed DNS query replay system, methods to rebuild the relevant DNS hierarchy from traces, and efficient emulation of this hierarchy on minimal hardware. We show that a single DNS server can correctly emulate multiple independent levels of the DNS hierarchy while providing correct responses as if they were independent. We validate that our system can replay a DNS root traffic with tiny error (± 8 ms quartiles in query timing and ± 0.1% difference in query rate). We show that our system can replay queries at 87k queries/s while using only one CPU, more than twice of a normal DNS Root traffic rate. LDplayer’s trace replay has the unique ability to evaluate important design questions with confidence that we capture the interplay of caching, timeouts, and resource constraints. As an example, we demonstrate the memory requirements of a DNS root server with all traffic running over TCP and TLS, and identify performance discontinuities in latency as a function of client RTT.

Papers Publications

new workshop paper “Leveraging Controlled Information Sharing for Botnet Activity Detection”

We have published a new paper “Leveraging Controlled Information Sharing for Botnet Activity Detection” in the Workshop on Traffic Measurements for Cybersecurity (WTMC 2018) in Budapest, Hungary, co-located with ACM SIGCOMM 2018.

The sensitivity of BotDigger’s detection is im- proved with controlled data sharing. All three domain/IP sets meet or pass the detection threshold.

From the abstract of our paper:

Today’s malware often relies on DNS to enable communication with command-and-control (C&C). As defenses that block traffic improve, malware use sophisticated techniques to hide this traffic, including “fast flux” names and Domain-Generation Algorithms (DGAs). Detecting this kind of activity requires analysis of DNS queries in network traffic, yet these signals are sparse. As bot countermeasures grow in sophistication, detecting these signals increasingly requires the synthesis of information from multiple sites. Yet *sharing security information across organizational boundaries* to date has been infrequent and ad hoc because of unknown risks and uncertain benefits. In this paper, we take steps towards formalizing cross-site information sharing and quantifying the benefits of data sharing. We use a case study on DGA-based botnet detection to evaluate how sharing cybersecurity data can improve detection sensitivity and allow the discovery of malicious activity with greater precision.

The relevant software is open-sourced and freely available at

This paper is joint work between Calvin Ardi and John Heidemann from USC/ISI, with additional support from collaborators and Colorado State University and Los Alamos National Laboratory.

Papers Publications

new conference paper “The Policy Potential of Measuring Internet Outages” at TPRC

We have published a new paper “The Policy Potential of Measuring Internet Outages” in TPRC46, the Research Conference on Communications, Information and Internet Policy, to be presented on September 21, 2018 at the American University, Washington College of Law.

Outages from Hurricane Irma after landfall in Florida on 2017-09-11, observed with Trinocular.

From the abstract of our paper:

Today it is possible to evaluate the reliability of the Internet. Prior approaches to measure network reliability required telecommunications providers reporting the status of their own networks, resulting in limits on the precision, timeliness, and availability of the results. Recent work in Internet measurement has shown that network outages can be observed with active measurements from a few sites, and from passive measurements of network telescopes (large, unused address space) or large network services such as content-delivery networks. We suggest that these kinds of *third-party* observations of network outages can provide data that is precise and timely. We discuss early results of Trinocular, an outage detection system using active probing developed at the University of Southern California. Trinocular has been operating continuously since November 2013, and we provide (at no charge) data covering about 4 million network blocks from around the world. This paper describes some results of Trinocular showing outages in a large U.S. Internet Service Provider, and those resulting from the 2017 Hurricane Irma in Florida. Our data shows the impact of the Broadband America policy for always-on networks, and we discuss how it might be used to address future policy questions and assist in disaster planning and recovery.

Data we describe in this paper is at, with visualizations at

This paper is joint work of John Heideman, Yuri Pradkin, and Guillermo Baltra from USC/ISI, with work carried out as part of LACANIC and DIVOICE projects with DHS S&T/CSD support.