These outages were nation-wide, apparently affecting most of Italy. However, it looks like they “only” affected 20-30% of networks, and not all Italian ISPs. We’re happy they were able to recover so quickly.
This event shows the importance of global network monitoring.
We had a good poster session for semester research projects in the CSci551/651 computer networking class at USC on November 30, 2022. The class enjoyed the brief presentations and the extended poster session!
Thanks to the four students with research projects this semester: Kishore Ganesh on evaluating caching entire DNS zones, Siddhant Gupta on the effects of radio reliability on vehicle-to-vehicle communication, Sulyab Thottungal Valapu on evaluating IPv6 usage, and Kicho Yu on detecting encrypted network traffic. Also thanks to our external mentors Wes Hardaker (who worked with Kishore Ganesh) and Hang Qiu (who worked with Siddhant Gupta).
The Domain Name System (DNS) is an essential service for the Internet which maps host names to IP addresses. The DNS Root Server System operates the top of this namespace. RIPE Atlas observes DNS from more than 11k vantage points (VPs) around the world, reporting the reliability of the DNS Root Server System in DNSmon. DNSmon shows that loss rates for queries to the DNS Root are nearly 10% for IPv6, much higher than the approximately 2% loss seen for IPv4. Although IPv6 is “new,” as an operational protocol available to a third of Internet users, it ought to be just as reliable as IPv4. We examine this difference at a finer granularity by investigating loss at individual VPs. We confirm that specific VPs are the source of this difference and identify two root causes: VP islands with routing problems at the edge which leave them unable to access IPv6 outside their LAN, and VP peninsulas which indicate routing problems in the core of the network. These problems account for most of the loss and nearly all of the difference between IPv4 and IPv6 query loss rates. Islands account for most of the loss (half of IPv4 failures and 5/6ths of IPv6 failures), and we suggest these measurement devices should be filtered out to get a more accurate picture of loss rates. Peninsulas account for the main differences between root identifiers, suggesting routing disagreements root operators need to address. We believe that filtering out both of these known problems provides a better measure of underlying network anomalies and loss and will result in more actionable alerts.
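To make the filtering idea concrete, here is a minimal sketch of how setting islands aside changes an aggregate loss rate. It is not the paper's method: the per-VP counts are toy numbers invented for illustration, the names loss_rate and drop_islands are hypothetical, and treating "lost every query" as the test for an island is a simplifying assumption.

```python
def loss_rate(counts):
    """Overall query loss rate for {vp: (sent, lost)} counts."""
    sent = sum(s for s, _ in counts.values())
    lost = sum(l for _, l in counts.values())
    return lost / sent if sent else 0.0


def drop_islands(counts):
    """Keep only VPs that got at least one answer.

    Assumption: a VP that lost every query (or sent none) is treated
    as an island and set aside.
    """
    return {vp: (s, l) for vp, (s, l) in counts.items() if l < s}


# Toy data, invented for illustration only: vp2 loses everything.
toy = {
    "vp1": (100, 2),
    "vp2": (100, 100),
    "vp3": (100, 1),
    "vp4": (100, 1),
}

raw = loss_rate(toy)                     # all four VPs
filtered = loss_rate(drop_islands(toy))  # vp2 set aside
print(f"raw loss {raw:.1%}, filtered loss {filtered:.1%}")
```

In this toy case a single island dominates the aggregate, which is why setting such VPs aside can give a clearer view of loss in the rest of the network.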
This work was done while Tarang was on his Summer 2022 undergraduate research internship at USC/ISI, with support from NSF grant 2051101 (PI: Jelena Mirkovic). John Heidemann and Yuri Pradkin’s work is supported by NSF through the EIEIO project (CNS-2007106). We thank Guillermo Baltra for his work on islands and peninsulas, as seen in his arXiv report.
Tarang Saluja completed his undergraduate research internship at ISI this summer, working with John Heidemann and Yuri Pradkin on his project “Differences in Monitoring the DNS Root Over IPv4 and IPv6”.
In his project, Tarang examined RIPE Atlas’s DNSmon, a measurement system that monitors the Root Server System. DNSmon examines both IPv4 and IPv6, and its IPv6 reports show query loss rates that are consistently higher than IPv4, often 4-6% IPv6 loss vs. no or 2% IPv4 loss. Prior results by researchers at RIPE suggested these differences were due to problems at specific Atlas Vantage Points (VPs, also called Atlas Probes).
Building on Guillermo Baltra’s studies of partial connectivity in the Internet, Tarang classified Atlas VPs with problems as islands and peninsulas. Islands think they are on IPv6, but cannot reach any of the 13 Root DNS “letters” over IPv6, indicating that the VP has a local network configuration problem. Peninsulas can reach some letters, but not others, indicating a routing problem somewhere in the core of the Internet.
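The two definitions can be sketched as a small classifier. This is a simplified illustration only, not Tarang's actual analysis: the function name classify_vp, the "connected" label, and the input (the set of root letters a VP reached over IPv6 during some observation window) are assumptions made for the example.

```python
ROOT_LETTERS = frozenset("abcdefghijklm")  # the 13 Root DNS letters


def classify_vp(reachable):
    """Classify one VP by which root letters answered it over IPv6.

    reachable: iterable of letters ("a".."m") the VP got answers from.
    """
    reached = frozenset(reachable) & ROOT_LETTERS
    if not reached:
        return "island"       # reaches none of the 13 letters
    if reached != ROOT_LETTERS:
        return "peninsula"    # reaches some letters but not others
    return "connected"        # reaches all 13 letters


print(classify_vp([]))               # island
print(classify_vp("abc"))            # peninsula
print(classify_vp("abcdefghijklm"))  # connected
```

A real analysis would also need to decide how long a VP must fail before it counts as unreachable, which this sketch leaves out.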
Tarang’s work is important because these observations lead to potential solutions. Islands suggest VPs that do not support IPv6 and so should not be used for monitoring. Peninsulas point to IPv6 routing problems that need to be addressed by ISPs. Setting VPs with these problems aside provides a more accurate view of what IPv6 should be, and allows us to use DNSmon to detect more subtle problems. Together, his work points the way to improving IPv6 for everyone and improving Root DNS access over IPv6.
Tarang’s work was part of the ISI Research Experiences for Undergraduates program at USC/ISI. We thank Jelena Mirkovic (PI) for coordinating another year of this great program, and NSF for support through award #2051101.
It’s big! Maybe 30% of Toronto and southern Ontario networks, plus a lot of outages in New Brunswick.
An update: Newfoundland also sees a lot of outages. Quebec looks in pretty good shape, though.
And it’s lasting a long time. It looks like it started at 5am Eastern time (2022-07-08t09:00Z), and it has lasted 9.5 hours so far!
We wish Rogers personnel and our Canadian neighbors the best.
Update at 2022-07-09t06:15Z (2:15am Eastern time): Toronto is doing much better, with “only” 10% of blocks unreachable (22808 of 21.5k in the 43.8N,79.3W 0.5 grid cell). New Brunswick and Newfoundland still look the same, with outages in about 50% of blocks.
Update at 2022-07-09t21:10Z (5:10pm Eastern time): It looks like many Rogers networks recovered at 2022-07-09t05:15Z (1:15am Eastern time). This includes all of New Brunswick and Newfoundland and most of Ontario. Trinocular has about a one-hour delay while it computes results, so I did not see this result when I checked in the prior update–I needed to wait 15 minutes more.
We recently added timeline support to our Outage World map–clicking on an outage bubble pops up a window with a sparkline (a small graph) showing maximum outages on each day for the current quarter, and clicking on the “daily timeline” tab shows outages for the current 24 hours. These graphs help provide context for how long an outage lasts, and whether there were other outages in the same quarter.
On March 29, 2022 the paper “Old but Gold: Prospecting TCP to Engineer and Live Monitor DNS Anycast” by Giovane C. M. Moura, John Heidemann, Wes Hardaker, Pithayuth Charnsethikul, Jeroen Bulten, João M. Ceron, and Cristian Hesselman appeared at the 2022 Passive and Active Measurement Conference. We’re happy that it was awarded Best Paper for this year’s conference!
From the abstract:
DNS latency is a concern for many service operators: CDNs exist to reduce service latency to end-users but must rely on global DNS for reachability and load-balancing. Today, DNS latency is monitored by active probing from distributed platforms like RIPE Atlas, with Verfploeter, or with commercial services. While Atlas coverage is wide, its 10k sites see only a fraction of the Internet. In this paper we show that passive observation of TCP handshakes can measure live DNS latency, continuously, providing good coverage of current clients of the service. Estimating RTT from TCP is an old idea, but its application to DNS has not previously been studied carefully. We show that there is sufficient TCP DNS traffic today to provide good operational coverage (particularly of IPv6), and very good temporal coverage (better than existing approaches), enabling near-real time evaluation of DNS latency from real clients. We also show that DNS servers can optionally solicit TCP to broaden coverage. We quantify coverage and show that estimates of DNS latency from TCP is consistent with UDP latency. Our approach finds previously unknown, real problems: DNS polarization is a new problem where a hypergiant sends global traffic to one anycast site rather than taking advantage of the global anycast deployment. Correcting polarization in Google DNS cut its latency from 100ms to 10ms; and from Microsoft Azure cut latency from 90ms to 20ms. We also show other instances of routing problems that add 100-200ms latency. Finally, real-time use of our approach for a European country-level domain has helped detect and correct a BGP routing misconfiguration that detoured European traffic to Australia. We have integrated our approach into several open source tools: Entrada, our open source data warehouse for DNS, a monitoring tool (ANTS), which has been operational for the last 2 years on a country-level top-level domain, and a DNS anonymization tool in use at a root server since March 2021.
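The core idea of estimating RTT from a TCP handshake can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's tooling: the Packet record, the flow key, and the handshake_rtts function are hypothetical, and the sketch assumes packets captured at the server, where the RTT is the time from the server's SYN-ACK to the client's final ACK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    time: float   # capture timestamp at the server, in seconds
    client: str   # hypothetical flow key, e.g. "addr:port"
    flags: str    # "SYN", "SYN-ACK", or "ACK"


def handshake_rtts(packets):
    """Estimate one RTT per client flow from server-side handshake packets.

    RTT = time of the client's first ACK minus time of the first SYN-ACK.
    Retransmitted SYN-ACKs make the estimate ambiguous; this sketch
    simply keeps the first SYN-ACK and the first ACK after it.
    """
    synack_time = {}
    rtts = {}
    for p in sorted(packets, key=lambda p: p.time):
        if p.flags == "SYN-ACK":
            synack_time.setdefault(p.client, p.time)
        elif p.flags == "ACK" and p.client in synack_time and p.client not in rtts:
            rtts[p.client] = p.time - synack_time[p.client]
    return rtts


# Toy trace, invented for illustration: one complete handshake.
trace = [
    Packet(0.000, "client1", "SYN"),
    Packet(0.001, "client1", "SYN-ACK"),
    Packet(0.021, "client1", "ACK"),
]
rtts = handshake_rtts(trace)
print({flow: round(rtt * 1000, 1) for flow, rtt in rtts.items()})  # RTT in ms
```

Because the measurement comes from connections that clients open anyway, it needs no active probing, which is what allows continuous coverage of a service's real clients.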
This paper was made possible in part through DHS HSARPA Cyber Security Division via contract number HSHQDC-17-R-B0004-TTA.02-0006-I (PAADDOS) and by NWO, NSF CNS-1925737 (DIINER), and the Concordia Project, a European Union Horizon 2020 Research and Innovation program under Grant Agreement No 830927.
We hope Erica’s new website makes it easier to evaluate COVID-19 WFH changes, and we look forward to continuing to work with Erica on this topic.
Erica worked virtually at USC/ISI in summer 2021 as part of the ISI Research Experiences for Undergraduates program. We thank Jelena Mirkovic (PI) for coordinating the second year of this great program, and NSF for support through award #2051101.