the tsuNAME vulnerability in DNS

On 2021-05-06, researchers at SIDN Labs (the .nl registry), InternetNZ (the .nz registry), and the Information Sciences Institute at the University of Southern California publicly disclosed tsuNAME, a vulnerability in some DNS resolver software that can be weaponized to carry out DDoS attacks against authoritative DNS servers.

TsuNAME is a problem that results from cyclic dependencies in DNS records, where two NS records point at each other. We found that some recursive resolvers would follow this cycle, greatly amplifying an initial query and stressing the authoritative servers providing those records.
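To make the idea of a cyclic dependency concrete, here is a minimal sketch in Python that looks for cycles among NS records. It is an illustration only and not how CycleHunter works: the `find_cycles` and `enclosing_zone` helpers, the input format (a dictionary from zone name to NS host names), and the example zones are all assumptions made for this example. It also simplifies by reporting any cycle, even if a zone has other, healthy name servers.

```python
from typing import Dict, List, Optional


def enclosing_zone(hostname: str, zones: List[str]) -> Optional[str]:
    """Return the longest known zone that is a suffix of hostname, if any."""
    labels = hostname.rstrip(".").lower().split(".")
    known = {z.rstrip(".").lower() for z in zones}
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in known:
            return candidate
    return None


def find_cycles(ns_records: Dict[str, List[str]]) -> List[List[str]]:
    """Find cycles in the 'zone depends on zone' graph implied by NS records.

    Hypothetical input format: {zone: [name server host names]}.
    """
    zones = list(ns_records)
    deps: Dict[str, List[str]] = {}
    for zone, servers in ns_records.items():
        z = zone.rstrip(".").lower()
        targets = set()
        for ns in servers:
            t = enclosing_zone(ns, zones)
            # In-zone name servers (t == z) are assumed to be covered by glue.
            if t is not None and t != z:
                targets.add(t)
        deps[z] = sorted(targets)

    cycles: List[List[str]] = []
    state: Dict[str, int] = {}  # missing = unvisited, 1 = on stack, 2 = done
    stack: List[str] = []

    def visit(node: str) -> None:
        state[node] = 1
        stack.append(node)
        for nxt in deps[node]:
            if nxt not in state:
                visit(nxt)
            elif state[nxt] == 1:
                # nxt is still on the stack, so we closed a loop.
                cycles.append(stack[stack.index(nxt):])
        stack.pop()
        state[node] = 2

    for z in sorted(deps):
        if z not in state:
            visit(z)
    return cycles


# Example: two zones whose NS records point at each other.
records = {
    "example.org": ["ns1.example.com"],
    "example.com": ["ns1.example.org"],
    "example.net": ["ns1.example.net"],
}
print(find_cycles(records))
```

A resolver that keeps chasing the `example.com` and `example.org` delegations in this example never reaches an answer, which is the loop that vulnerable resolvers turn into repeated queries.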

Our technical report describes a tsuNAME-related event observed in 2020 at the .nz authoritative servers, when two domains were misconfigured with cyclic dependencies, causing total traffic to grow by 50%. In the report, we also show how an EU-based ccTLD experienced 10x traffic growth due to cyclic dependency misconfigurations.

We refer DNS operators and developers to our security advisory that provides recommendations for how to mitigate or detect tsuNAME.

We have also created a tool, CycleHunter, for detecting cyclic dependencies in DNS zones. Following responsible disclosure practices, we provided operators and software vendors time to address the problem first. We are happy that Google Public DNS and Cisco OpenDNS both took steps to protect their public resolvers, and that PowerDNS and NLnet Labs have confirmed their current software is not affected.


congratulations to Abdul Qadeer for his PhD

I would like to congratulate Dr. Abdul Qadeer for defending his PhD at the University of Southern California in March 2021 and completing his doctoral dissertation “Efficient Processing of Streaming Data in Multi-User and Multi-Abstraction Workflows”.

Abdul Qadeer after his defense.

From the abstract:

Ever-increasing data and evolving processing needs force enterprises to scale-out expensive computational resources to prioritize processing for timely results. Teams process their organization’s data either independently or using ad hoc sharing mechanisms. Often different users start with the same data and the same initial stages (decrypt, decompress, clean, anonymize). As their workflows evolve, later stages often diverge, and different stages may work best with different abstractions. The result is workflows with some overlap, some variations, and multiple transitions where data handling changes between continuous, windowed, and per-block. The system processing this diverse, multi-user, multi-abstraction workflow should be efficient and safe, but also must cope with fault recovery.

Analytics from multiple users can cause redundant processing and data, or encounter performance anomalies due to skew. Skew arises due to static or dynamic imbalance in the workflow stages. Both redundancy and skew waste compute resources and add latency to results. When users bridge between multiple abstractions, such as from per-block processing to windowed processing, they often employ custom code. These transitions can be error prone due to corner cases, can easily add latency as an inefficiency, and custom code is often a source of errors and maintenance difficulty. We need new solutions to manage the above challenges and to expose opportunities for data sharing explicitly. Our thesis is: new methods enable efficient processing of multi-user and multi-abstraction workflows of streaming data. We present two new methods for efficient stream processing—optimizations for multi-user workflows, and multiple abstractions for application coverage and efficient bridging.

These algorithms use a pipeline-graph to detect duplication of code and data across multiple users and cleanly delineate workflow stages for skew management. The pipeline-graph is our job description language that allows developers to specify their need easily and enables our system to automatically detect duplication and manage skew. The pipeline-graph acts as a shared canvas for collaboration amongst users to extend each other’s work. To efficiently implement our deduplication and skew management algorithms, we present streaming data to processing stages as fixed-sized but large blocks. Large-blocks have low meta-data overhead per user, provide good parallelism, and help with fault recovery.

Our second method enables applications to use a different abstraction on a different workflow stage. We provide three key abstractions and show that they cover many classes of analytics and our framework can bridge them efficiently. We provide Block-Streaming, Windowed-Streaming, and Stateful-Streaming abstractions. Block-Streaming is suitable for single-pass applications that care about temporal or spatial locality. Windowed-Streaming allows applications to process accumulated data (time-aligned blocks to sync with external information) and reductions like summation, averages, or other MapReduce-style analytics. We believe our three abstractions allow many classes of analytics and enable processing of one block, many blocks, or infinite stream. Plumb allows multiple abstractions in different parts of the workflow and provides efficient bridging between them so that users could make complex analytics from individual stages without worrying about data movement.

Our methods aim for good throughput, low latency, and clean and easy-to-use support for more applications to achieve better efficiency than our prior hand-tuned but often brittle system. The Plumb framework is the implementation of our solutions and a testbed to validate them. We use real-world workloads from the B-Root DNS domain to demonstrate effectiveness of our solutions. Our processing deduplication increases throughput up to 6×, reduces storage by 75%, as compared to their pre-Plumb counterparts. Plumb reduces CPU wastage due to structural skew up to half and reduces latency due to computational skew by 50%. Plumb has cut per-block latency by 74% and latency of daily statistics by 97%, while reducing code size by 58% and lowering manual intervention to handle problems by 73% as compared to pre-Plumb system.

The operational use of Plumb for the B-Root service provides a multi-year validation of our design choices under many traffic conditions. Over the last three years, Plumb has processed more than 12PB of DNS packet data and daily statistics. We show that our abstractions apply to many applications in the domain of networking big-data and beyond.
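The duplicate detection that the abstract attributes to the pipeline-graph can be illustrated with a small sketch. This is not Plumb's job description language or its algorithm: the `Stage` tuple, the `merge_pipelines` function, and the stage and user names are hypothetical, chosen only to show how identical stages submitted by different users could be recognized and run once.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Stage(NamedTuple):
    """One workflow stage: which data it reads, what it runs, what it writes."""
    input: str
    program: str
    output: str


def merge_pipelines(pipelines: Dict[str, List[Stage]]) -> Dict[Stage, List[str]]:
    """Group identical stages from different users so each needs to run once."""
    users_by_stage: Dict[Stage, List[str]] = defaultdict(list)
    for user, stages in pipelines.items():
        for stage in stages:
            users_by_stage[stage].append(user)
    return dict(users_by_stage)


# Two hypothetical users share their first two stages and then diverge.
pipelines = {
    "alice": [
        Stage("raw", "decompress", "plain"),
        Stage("plain", "anonymize", "anon"),
        Stage("anon", "daily_stats", "stats"),
    ],
    "bob": [
        Stage("raw", "decompress", "plain"),
        Stage("plain", "anonymize", "anon"),
        Stage("anon", "detect_anomalies", "alerts"),
    ],
}

merged = merge_pipelines(pipelines)
submitted = sum(len(stages) for stages in pipelines.values())
shared = [stage for stage, users in merged.items() if len(users) > 1]
print(f"{submitted} stages submitted, {len(merged)} unique, {len(shared)} shared")
```

In this toy case six submitted stages collapse to four unique ones, so the shared decompress and anonymize work would be done once rather than twice.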

Publications Students

congratulations to Lan Wei for her new PhD

I would like to congratulate Dr. Lan Wei for defending her PhD in September 2020 and completing her doctoral dissertation “Anycast Stability, Security and Latency in The Domain Name System (DNS) and Content Deliver Networks (CDNs)” in December 2020.

From the abstract:

Clients’ performance is important for both Content-Delivery Networks (CDNs) and the Domain Name System (DNS). Operators would like the service to meet expectations of their users. CDNs providing stable connections will prevent users from experiencing downloading pause from connection breaks. Users expect DNS traffic to be secure without being intercepted or injected. Both CDN and DNS operators care about a short network latency, since users can become frustrated by slow replies.

Many CDNs and DNS services (such as the DNS root) use IP anycast to bring content closer to users. Anycast-based services announce the same IP address(es) from globally distributed sites. In an anycast infrastructure, Internet routing protocols will direct users to a nearby site naturally. The path between a user and an anycast site is formed on a hop-to-hop basis: at each hop (a network device such as a router), routing protocols like Border Gateway Protocol (BGP) make the decision about which next hop to go to. ISPs at each hop will impose their routing policies to influence BGP’s decisions. Without globally knowing (also unable to modify) the distributed information of BGP routing table of every ISP on the path, anycast infrastructure operators are unable to predict and control in real-time which specific site a user will visit and what the routing path will look like. Also, any change in routing policy along the path may change both the path and the site visited by a user. We refer to such minimal control over routing towards an anycast service, the uncertainty of anycast routing. Using anycast spares extra traffic management to map users to sites, but can operators provide a good anycast-based service without precise control over the routing?

This routing uncertainty raises three concerns: routing can change, breaking connections; uncertainty about global routing means spoofing can go undetected, and lack of knowledge of global routing can lead to suboptimal latency. In this thesis, we show how we confirm the stability, how we confirm the security, and how we improve the latency of anycast to answer these three concerns. First, routing changes can cause users to switch sites, and therefore break a stateful connection such as a TCP connection immediately. We study routing stability and demonstrate that connections in anycast infrastructure are rarely broken by routing instability. Of all vantage points (VPs), fewer than 0.15% VP’s TCP connections frequently break due to timeout in 5s during all 17 hours we observed. We only observe such frequent TCP connection break in 1 service out of all 12 anycast services studied. A second problem is DNS spoofing, where a third-party can intercept the DNS query and return a false answer. We examine DNS spoofing to study two aspects of security–integrity and privacy, and we design an algorithm to detect spoofing and distinguish different mechanisms to spoof anycast-based DNS. We show that DNS spoofing is uncommon, happening to only 1.7% of all VPs, although increasing over the years. Among all three ways to spoof DNS–injections, proxies, and third-party anycast site (prefix hijack), we show that third-party anycast site is the least popular one. Last, diagnosing poor latency and improving the latency can be difficult for CDNs. We develop a new approach, BAUP (bidirectional anycast unicast probing), which detects inefficient routing with better routing replacement provided. We use BAUP to study anycast latency. By applying BAUP and changing peering policies, a commercial CDN is able to significantly reduce latency, cutting median latency in half from 40ms to 16ms for regional users.

Lan defended her PhD when USC was on work-from-home due to COVID-19; she is the third ANT student with a fully on-line PhD defense.


Deep Dive into DNS at IETF108

The Domain Name System (DNS) is responsible for handling the initial steps of almost all connections on the Internet. USC/ISI’s Wes Hardaker, along with Geoff Huston and Joao Damas from APNIC, gave a “Deep Dive” presentation on how the DNS works at the 108th IETF conference. The recording is available on YouTube for those who missed it.

DNS Internet

APNIC Blog Post on the effects of Chromium-generated DNS traffic on the root server system

During the summer of 2019, Haoyu Jiang and Wes Hardaker studied the effects of DNS traffic sent to the root server system by Chromium-based web browsers. The results of this short research effort were posted to the APNIC blog.

DNS Internet

B-root’s new sites reduce latency

B-Root, one of the 13 root DNS servers, deployed three new sites in January 2020, doubling its footprint and adding its first sites in Asia and Europe. How did this growth lower latency to users? We looked at the B-Root deployment with Verfploeter to answer this question. The end result was that new sites in Asia and Europe allowed users there to resolve DNS names with B-Root with lower latency (see the catchment map below). For more details please review our anycast catchment page.

B-Root added three new sites in Singapore, Washington, DC, and Amsterdam to its three existing sites in Los Angeles, Chile, and Miami. The graph below shows anycast catchments after these sites were deployed (each color in the pie charts shows traffic to a different site).

Announcements DNS Internet

Early longitudinal results in measuring the usage of Mozilla’s DNS Canary

Mozilla announced the creation of a “Canary Domain” that could be configured within ISPs to disable Firefox’s default use of DNS over HTTPS. On 2019/09/21 Wes Hardaker created a RIPE Atlas measurement to study resolvers within ISPs that had been configured to return an NXDOMAIN response. This measurement is configured to have 1000 Atlas probes query for the name once a day.

The full description of methodology is on Wes’ ISI site, which should receive regular updates to the graph.



Talks at DNS-OARC 31

Wes Hardaker gave two presentations at DNS-OARC on November 1st, 2019. The first was a presentation about the previously announced “Cache me if you can” paper, which is on YouTube, and the slides are available as well. The second talk presented Haoyu Jiang’s work during the summer of 2019 analyzing B-Root DNS traffic in the 2018 DITL data for levels of traffic sent by the Chrome web browser, levels of traffic associated with different languages, and levels of traffic sent by different label lengths. It is available on YouTube with the slides here.

Papers Publications

new conference paper “Cache Me If You Can: Effects of DNS Time-to-Live” at ACM IMC 2019

We will publish a new paper “Cache Me If You Can: Effects of DNS Time-to-Live” by Giovane C. M. Moura, John Heidemann, Ricardo de O. Schmidt, and Wes Hardaker, in the ACM Internet Measurement Conference (IMC 2019) in Amsterdam, the Netherlands.

Figure 10a from [Moura19b], showing the distribution of latency with small TTLs before (right in blue) and with larger TTLs after (left in red) the .uy domain reviewed our work and lengthened their domain’s cache lifetimes to reduce latency to their customers.

From the abstract:

DNS depends on extensive caching for good performance, and every DNS zone owner must set Time-to-Live (TTL) values to control their DNS caching. Today there is relatively little guidance backed by research about how to set TTLs, and operators must balance conflicting demands of caching against agility of configuration. Exactly how TTL value choices affect operational networks is quite challenging to understand due to interactions across the distributed DNS service, where resolvers receive TTLs in different ways (answers and hints), TTLs are specified in multiple places (zones and their parent’s glue), and while DNS resolution must be security-aware. This paper provides the first careful evaluation of how these multiple, interacting factors affect the effective cache lifetimes of DNS records, and provides recommendations for how to configure DNS TTLs based on our findings. We provide recommendations in TTL choice for different situations, and for where they must be configured. We show that longer TTLs have significant promise in reducing latency, reducing it from 183ms to 28.7ms for one country-code TLD.
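As a rough intuition for why longer TTLs help, the following toy model counts cache hits and misses at a single resolver. It is not the paper's methodology: the `simulate_cache` function and its parameters are assumptions for illustration, and it assumes one client querying one name at a fixed interval, with no prefetching, TTL capping, or cache eviction.

```python
from typing import Tuple


def simulate_cache(ttl: int, query_interval: int, duration: int) -> Tuple[int, int]:
    """Count (hits, misses) for one name queried every query_interval seconds.

    A miss stands for a full lookup at the authoritative servers; the answer
    is then cached and reused until ttl seconds after it was fetched.
    """
    hits = 0
    misses = 0
    expires_at = None  # time at which the cached answer stops being valid
    for now in range(0, duration, query_interval):
        if expires_at is not None and now < expires_at:
            hits += 1
        else:
            misses += 1
            expires_at = now + ttl
    return hits, misses


# One query every 10 seconds for 10 minutes, under three TTL choices.
for ttl in (5, 60, 3600):
    hits, misses = simulate_cache(ttl=ttl, query_interval=10, duration=600)
    print(f"TTL {ttl:>4}s: {hits} hits, {misses} misses")
```

With a TTL shorter than the query interval every query goes to the authoritative servers, while a one-hour TTL leaves a single miss. Each miss costs a full resolution, which is the latency that longer TTLs avoid, at the price of slower propagation of configuration changes.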

We have also reported on this work at the RIPE and APNIC blogs.

Publications Technical Report

new technical report “Plumb: Efficient Processing of Multi-User Pipelines (Poster)”

We released a new technical report “Plumb: Efficient Processing of Multi-User Pipelines (Poster)”, by Abdul Qadeer and John Heidemann, as ISI-TR-731. This work was originally presented at the ACM Symposium on Cloud Computing (the poster abstract is available at ACM). The poster abstract with a small version of the poster is available at

Abdul Qadeer at SoCC 2018 in Carlsbad, CA.

From the abstract:

As the field of big data analytics matures, workflows are increasingly complex and often include components that are shared by different users. Individual workflows often include multiple stages, and when groups build on each other’s work it is easy to lose track of computation that may be shared across different groups.

The contribution of this poster is to provide an organization-wide processing substrate Plumb that can be used to solve commonly occurring problems and to achieve a common goal. Plumb makes multi-user sharing a first-class concern by providing pipeline-graph abstraction. This abstraction is simple and based on fundamental model of input-processing-output but is powerful to capture processing and data duplication. Plumb then employs best available solutions to tackle problems of large-block processing under structural and computational skew without user intervention.

We expect to release the Plumb software this fall; please contact us if you have questions or interest in using it.