Categories
Presentations

new talk “Collecting and Visualizing Outages Over the Long Haul” at the AIMS Workshop 2017

John Heidemann gave the talk “Collecting and Visualizing Outages Over the Long Haul” at CAIDA’s Active Internet Measurement (AIMS) Workshop in San Diego, California, USA on March 2, 2017.  Slides are available at http://www.isi.edu/~johnh/PAPERS/Heidemann17b.pdf.
From the abstract:

Unmeasurable blocks over time, a challenge in long-haul outage measurement, from [Alwabel15a]
We have been collecting data about outages in the Internet since Oct. 2014. Our outage detection system, Trinocular, uses active probing from four sites to study about 4 million /24 IPv4 address blocks. Long-duration measurements bring challenges that don’t occur in short observations. Most importantly, our target (“the Internet”) changes as we measure it, as new blocks come on-line, old blocks are reused in different ways, and ISPs observe and sometimes block our traffic. Our measurement platform also sees occasional hardware failures. Visualization can assist detection of these problems, allowing human perception to detect changes in data collection that have not previously been anticipated. This talk will discuss the challenges of long-term outage measurement and describe our new algorithm that scales to support clustering of 4M blocks and 3 months of observations for visualization.
Our visualization is joint work with Yuri Pradkin, and analysis of our long-term outages includes work with Abdulla Alwabel.

This talk draws on work from [Alwabel15a].  Data from this talk is available at https://ant.isi.edu/datasets/outage/, and visualizations can be found at https://ant.isi.edu/outage/browse/.

Categories
Publications Technical Report

new technical report “Does Anycast hang up on You? (extended)”

We have released a new technical report “Does Anycast hang up on you?(extended)”, ISI-TR-716, available at http://www.isi.edu/~weilan/PAPER/anycast_instability.pdf

From the abstract:

In each anycast-based DNS root service, there are about 1% VPs see a route flip happens every one or two observation during a week with an observation interval as 4 min.

Anycast-based services today are widely used commercially, with several major providers serving thousands of important websites. However, to our knowledge, there has been only limited study of how often anycast fails because routing changes interrupt connections between users and their current anycast site. While the commercial success of anycast CDNs means anycast usually work well, do some users end up shut out of anycast? In this paper we examine data from more than 9000 geographically distributed vantage points (VPs) to 11 anycast services to evaluate this question. Our contribution is the analysis of this data to provide the first quantification of this problem, and to explore where and why it occurs. We see that about 1% of VPs are anycast unstable, reaching a different anycast site frequently sometimes every query. Flips back and forth between two sites in 10 seconds are observed in selected experiments for given service and VPs.
Moreover, we show that anycast instability is persistent for some VPs—a few VPs never see a stable connections to certain anycast services during a week or even longer. The vast majority of VPs only saw unstable routing towards one or two services instead of instability with all services, suggesting the cause of the instability lies somewhere in the path to the anycast sites. Finally, we point out that for highly-unstable VPs, their probability to hit a given site is constant, which means the flipping are happening at a fine granularity —per packet level, suggesting load balancing might be the cause to anycast routing flipping. Our findings confirm the common wisdom that anycast almost always works well, but provide evidence that a small number of locations in the Internet where specific anycast services are never stable.

This technical report is joint work of  Lan Wei,  John Heidemann, from USC/ISI.

Categories
Presentations

new talk “Anycast vs. DDoS: Evaluating Nov. 30” at DNS-OARC

John Heidemann gave the talk “Anycast vs. DDoS: Evaluating Nov. 30” at DNS-OARC in Dallas, Texas, USA on October 16, 2016.  Slides are available at http://www.isi.edu/~johnh/PAPERS/Heidemann16c.pdf.

 

Possible outcomes of anycast under stress, a slide from a talk about the Nov. 30, 2015 Root DNS event.
Possible outcomes of anycast under stress, a slide from a talk about the Nov. 30, 2015 Root DNS event.

From the abstract:

Distributed Denial-of-Service (DDoS) attacks continue to be a major threat in the Internet today. DDoS attacks overwhelm target services with requests or other “bogus” traffic, causing requests from legitimate users to be shut out. A common defense against DDoS is to replicate the service in multiple physical locations or sites. If all sites announce a common IP address, BGP will associate users around the Internet with a nearby site, defining the catchment of that site. Anycast adds resilience against DDoS both by increasing capacity to the aggregate of many sites, and allowing each catchment to contain attack traffic leaving other sites unaffected. IP anycast is widely used for commercial CDNs and essential infrastructure such as DNS, but there is little evaluation of anycast under stress.

This talk will provide a first evaluation of several anycast services under stress with public data. Our subject is the Internet’s Root Domain Name Service, made up of 13 independently designed services (“letters”, 11 with IP anycast) running at more than 500 sites. Many of these services were stressed by sustained traffic at 100x normal load on Nov. 30 and Dec. 1, 2015. We use public data for most of our analysis to examine how different services respond to the these events. In our analysis we identify two policies by operators: (1) sites may absorb attack traffic, containing the damage but reducing service to some users, or (2) they may withdraw routes to shift both legitimate and bogus traffic to other sites. We study how these deployment policies result in different levels of service to different users, during and immediately after the attacks.

We also show evidence of collateral damage on other services located near the attack targets. The work is based on analysis of DNS response from around 9000 RIPE Atlas vantage points (or “probes”), agumented by RSSAC-002 reports from 5 root letters and BGP data from BGPmon. We examine DNS performance for each Root Letter, for anycast sites inside specific letters, and for specific servers at one site.

This talk is based on the work in the paper “Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event” at appear at  IMC 2016, by Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller,  Lan Wei, and Christian Hesselman.

Datasets from the paper are available at https://ant.isi.edu/datasets/anycast/

Categories
Papers Publications

new conference paper “Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event” in IMC 2016

The paper “Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event” will appear at ACM Internet Measurement Conference in November 2016 in Santa Monica, California, USA. (available at http://www.isi.edu/~weilan/PAPER/IMC2016camera.pdf)

From the abstract:

RIPE Atlas VPs going to different anycast sites when under stress. Colors indicate different sites, with black showing unsuccessful queries. [Moura16b, figure 11b]

Distributed Denial-of-Service (DDoS) attacks continue to be a major threat in the Internet today. DDoS attacks overwhelm target services with requests or other traffic, causing requests from legitimate users to be shut out. A common defense against DDoS is to replicate the service in multiple physical locations or sites. If all sites announce a common IP address, BGP will associate users around the Internet with a nearby site,defining the catchment of that site. Anycast addresses DDoS both by increasing capacity to the aggregate of many sites, and allowing each catchment to contain attack traffic leaving other sites unaffected. IP anycast is widely used for commercial CDNs and essential infrastructure such as DNS, but there is little evaluation of anycast under stress. This paper provides the first evaluation of several anycast services under stress with public data. Our subject is the Internet’s Root Domain Name Service, made up of 13 independently designed services (“letters”, 11 with IP anycast) running at more than 500 sites. Many of these services were stressed by sustained traffic at 100 times normal load on Nov.30 and Dec.1, 2015. We use public data for most of our analysis to examine how different services respond to the these events. We see how different anycast deployments respond to stress, and identify two policies: sites may absorb attack traffic, containing the damage but reducing service to some users, or they may withdraw routes to shift both good and bad traffic to other sites. We study how these deployments policies result in different levels of service to different users. We also show evidence of collateral damage on other services located near the attacks.

This IMC paper is joint work of  Giovane C. M. Moura, Moritz Müller, Cristian Hesselman (SIDN Labs), Ricardo de O. Schmidt, Wouter B. de Vries (U. Twente), John Heidemann, Lan Wei (USC/ISI). Datasets in this paper are derived from RIPE Atlas and are available at http://traces.simpleweb.org/ and at https://ant.isi.edu/datasets/anycast/.

Categories
Students

congratulations to Xun Fan for his new PhD

I would like to congratulate Dr. Xun Fan for defending his PhD in May 2015 and completing his doctoral dissertation “Enabling Efficient Service Enumeration Through Smart Selection of Measurements” in July 2015.

Xun Fan (left) and John Heidemann, after Xun's PhD defense.
Xun Fan (left) and John Heidemann, after Xun’s PhD defense.

From the abstract:

The Internet is becoming more and more important in our daily lives. Both the government and industry invest in the growth of the Internet, bringing more users to the world of networks. As the Internet grows, researchers and operators need to track and understand the behavior of global Internet services to achieve smooth operation. Active measurements are often used to study behavior of large Internet service, and efficient service enumeration is required. For example, studies of Internet topology may need active probing to all visible network prefixes; monitoring large replicated service requires periodical enumeration of all service replicas. To achieve efficient service enumeration, it is important to select probing sources and destinations wisely. However, there are challenges for making smart selection of probing sources and destinations. Prior methods to select probing destinations are either inefficient or hard to maintain. Enumerating replicas of large Internet services often requires many widely distributed probing sources. Current measurement platforms don’t have enough probing sources to approach complete enumeration of large services.

This dissertation makes the thesis statement that smart selection of probing sources and destinations enables efficient enumeration of global Internet services to track and understand their behavior. We present three studies to demonstrate this thesis statement. First, we propose new automated approach to generate a list of destination IP addresses that enables efficient enumeration of Internet edge links. Second, we show that using large number of widely distributed open resolvers enables efficient enumeration of anycast nodes which helps study abnormal behavior of anycast DNS services. In our last study, we efficiently enumerate Front-End (FE) Clusters of Content Delivery Networks (CDNs) and use the efficient enumeration to track and understand the dynamics of user-to-FE Cluster mapping of large CDNs. We achieve the efficient enumeration of CDN FE Clusters by selecting probing sources from a large set of open resolvers. Our selected probing sources have smaller number of open resolvers but provide same coverage on CDN FE Cluster as the larger set.

In addition to our direct results, our work has also been used by several published studies to track and understand the behavior of Internet and large network services. These studies not only support our thesis as additional examples but also suggest this thesis can further benefit many other studies that need efficient service enumeration to track and understand behavior of global Internet services.

Categories
Presentations

new animation: the August 2014 Time Warner outage

Global network outages on 2014-08-27 during the Time Warner event in the U.S.
Global network outages on 2014-08-27 during the Time Warner event in the U.S.

On August 27, 2014, Time Warner suffered a network outage that affected about 11 million customers for more than two hours (making national news). We have observing global network outages since December 2013, including this outage.

We recently animated this August Time Warner outage.

We see that the Time Warner outage lasted about two hours and affected a good swath of the United States. We caution that all large network operators have occasional outages–this animation is not intended to complain about Time Warner, but to illustrate the need to have tools that can detect and visualize national-level outages.  It also puts the outage into context: we can see a few other outages in Uruguay, Brazil, and Saudi Arabia.

This analysis uses dataset usc-lander /internet_outage_adaptive_a17all-20140701, available for research use from PREDICT, or by request from us if PREDICT access is not possible.

This animation was first shown at the Dec. 2014 DHS Cyber Security Division R&D Showcase and Technical Workshop as part of the talk “Towards Understanding Internet Reliability” given by John Heidemann. This work was supported by DHS, most recently through the LACREND project.

Categories
Presentations

new animation: a sample of U.S. networks, before and after Hurricane Sandy

In October 2012, Hurricane Sandy made landfall on the U.S. East Coast causing widespread power outages. We were able to see the effects of Hurricane Sandy by analyzing active probing of the Internet. We first reported this work in a technical report and then with more refined analysis in a peer-reviewed paper.

Network outages for a sample of U.s. East Coast networks on the day after Hurricane Sandy made landfall.
Network outages for a sample of U.s. East Coast networks on the day after Hurricane Sandy made landfall.

We recently animated our data showing Hurricane Sandy landfall.

These 4 days before landfall and 7 after show some intersting results: On the day of landfall we see about three-times the number of outages relative to “typical” U.S. networks. Finally, we see it takes about four days to recover back to typical conditions.

This analysis uses dataset usc-lander / internet_address_survey_reprobing_it50j, available for research use from PREDICT, or by request from us if PREDICT access is not possible.

This animation was first shown at the Dec. 2014 DHS Cyber Security Division R&D Showcase and Technical Workshop as part of the talk “Towards Understanding Internet Reliability” given by John Heidemann. This work was supported by DHS, most recently through the LACREND project.

Categories
Publications Technical Report

new technical report “When the Internet Sleeps: Correlating Diurnal Networks With External Factors (extended)”

We released a new technical report “When the Internet Sleeps: Correlating Diurnal Networks With External Factors (extended)”, ISI-TR-2014-691, by Lin Quan, John Heidemann, and Yuri Pradkin, available as http://www.isi.edu/~johnh/PAPERS/Quan14b.
pdf

Comparing observed diurnal phase and geolocation longitude for 287k geolocatable, diurnal blocks ([Quan14b], figure 14b)
Comparing observed diurnal phase and geolocation longitude for 287k geolocatable, diurnal blocks ([Quan14b], figure 14b)
From the abstract:

As the Internet matures, policy questions loom larger in its operation. When should an ISP, city, or government invest in infrastructure? How do their policies affect use? In this work, we develop a new approach to evaluate how policies, economic conditions and technology correlates with Internet use around the world. First, we develop an adaptive and accurate approach to estimate block availability, the fraction of active IP addresses in each /24 block over short timescales (every 11 minutes). Our estimator provides a new lens to interpret data taken from existing long-term outage measurements, this requiring no no additional traffic. (If new collection was required, it would be lightweight, since on average, outage detection requires less than 20 probes per hour per /24 block; less than 1% of background radiation.) Second, we show that spectral analysis of this measure can identify diurnal usage: blocks where addresses are regularly used during part of the day and idle in other times. Finally, we analyze data for the entire responsive Internet (3.7M /24 blocks) over 35 days. These global observations show when and where the Internet sleeps—networks are mostly always-on in the US and Western Europe, and diurnal in much of Asia, South America, and Eastern Europe. ANOVA testing shows that diurnal networks correlate negatively with country GDP and electrical consumption, quantifying that national policies and economics relate to networks.

Data from this paper is available from http://www.isi.edu/ant/traces/internet_otuages/index.html, and from http://www.predict.org as dataset internet_outage_adaptive_a12w-20130424.

Categories
Presentations

new video “A Retrospective on an Australian Routing Event”

On 2012-02-23, hardware problems in an Australian ISP (Dodo) router caused it to announce many global routes to their ISP (Telstra), and from there to others.

The result: for 45 minutes, millions of Australians lost international Internet connectivity.

While this problem was detected and corrected in less than an hour, this kind of problem can reoccur.

In this video we show the Internet address space (IPv4) from Sydney, Australia.   Colors show estimated physical location (blue: North America, Red: Europe, Green: Asia).   Addresses map to a Hilbert Curve, and nearby addresses form squares.  White boxes show routing changes, with bursts after 02:40 UTC.

In the visualization we see there are many, many routing changes for much of Internet (the many white boxes)–evidence of routing instability in Sydney.

A copy of this video is also available at Vimeo (some system may have problems viewing the above embedded video, but Vimeo is a good alternative).

This video was made by Kaustubh Gadkari, John Heidemann, Cathie Olschanowsky, Christos Papadopoulos, Yuri Pradkin, and Lawrence Weikum at University of Southern California/Information Sciences Institute (USC/ISI) and Colorado State University/Computer Science (CSU).

This video uses software developed at USC/ISI and CSU:  Retro-future Time Travel, the LANDER IPv4 Web Address Browser, and BGPMon, the BGP logging and monitor.  Data from this video is available from BGPMon and PREDICT (or the authors).

This work was supported by DHS S&T (BGPMon, contract N66001-08-C-2028; LANDER, contract D08PC75599, admin. by SPAWAR; LACREND, contract FA8750-12-2-0344, admin. by AFRL; Retro-future, contract N66001-13-C-3001, admin. by SPAWAR), and NSF/CISE (BGPMon, grant CNS-1305404).  Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of funding and administrative agencies.

Categories
Papers Publications

new conference paper “The Need for End-to-End Evaluation of Cloud Availability” in PAM 2014

The paper “The Need for End-to-End Evaluation of Cloud Availability” was published by PAM 2014 in Marina del Rey, CA (available at http://www.isi.edu/~zihu/paper/cloud_availability.pdf).

From the abstract:cloud_availability_blog

People’s computing lives are moving into the cloud, making understanding cloud availability increasingly critical. Prior studies of Internet outages have used ICMP-based pings and traceroutes. While these studies can detect network availability, we show that they can be inaccurate at estimating cloud availability. Without care, ICMP probes can underestimate availability because ICMP is not as robust as application-level measurements such as HTTP. They can overestimate availability if they measure reachability of the cloud’s edge, missing failures in the cloud’s back-end. We develop methodologies sensitive to five “nines” of reliability, and then we compare ICMP and end-to-end measurements for both cloud VM and storage services. We show case studies where one fails and the other succeeds, and our results highlight the importance of application-level retries to reach high precision. When possible, we recommend end-to-end measurement with application-level protocols to evaluate the availability of cloud services.

Citation: Zi Hu, Liang Zhu, Calvin Ardi, Ethan Katz-Bassett, Harsha Madhyastha, John Heidemann, Minlan Yu. The Need for End-to-End Evaluation of Cloud Availability. Passive and Active Measurements Conference (PAM). Los Angeles, CA, USA, March 2014.