Back Out: End-to-end Inference of Common Points-of-Failure in the Internet (extended)

Heidemann, John and Pradkin, Yuri and Nisar, Aqib
USC/Information Sciences Institute


John Heidemann, Yuri Pradkin and Aqib Nisar 2018. Back Out: End-to-end Inference of Common Points-of-Failure in the Internet (extended). Technical Report ISI-TR-724. USC/Information Sciences Institute. [PDF]


Internet reliability has many potential weaknesses: fiber rights-of-way at the physical layer, exchange-point congestion from DDoS at the network layer, settlement disputes between organizations at the financial layer, and government intervention at the political layer. This paper shows that we can discover common points-of-failure at any of these layers by observing correlated failures. We use end-to-end observations of data-plane-level connectivity of edge hosts in the Internet. We identify correlations in connectivity: networks that usually fail and recover at the same time suggest a common point-of-failure. We define two new algorithms to meet these goals. First, we define a computationally efficient algorithm that creates a linear ordering of blocks to make correlated failures apparent to a human analyst. Second, we develop an event-based clustering algorithm that directly identifies networks with correlated failures, suggesting common points-of-failure. Our algorithms scale to real-world datasets of millions of networks and observations: linear ordering takes O(n log n) time, and event-based clustering parallelizes with Map/Reduce. We demonstrate them on three months of outages for 4 million /24 network prefixes, showing high recall (0.83 to 0.98) and precision (0.72 to 1.0) for blocks that respond. We also show that our algorithms generalize to identify correlations in anycast catchments and routing.
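
To make the idea of "networks that usually fail and recover at the same time" concrete, the following is a minimal sketch and not the report's algorithm. It assumes outages arrive as per-block lists of (start, end) timestamps in seconds, and it treats two events as simultaneous when they fall in the same fixed time bin. The function name `cluster_by_events`, the input layout, and the `bin_seconds` default are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def cluster_by_events(outages, bin_seconds=600):
    """Group blocks whose outages start and end in the same time bins.

    outages: dict mapping a block name (e.g. a /24 prefix string) to a
             list of (start, end) outage timestamps in seconds.
    bin_seconds: hypothetical tolerance; events in the same bin are
                 treated as happening "at the same time".
    Returns a sorted list of clusters, each a sorted list of block names.
    """
    groups = defaultdict(list)
    for block, events in outages.items():
        if not events:
            continue  # a block with no outages carries no correlation signal
        # The set of binned (down, up) events is the block's signature.
        signature = frozenset(
            (start // bin_seconds, end // bin_seconds) for start, end in events
        )
        groups[signature].append(block)
    # Keep only signatures shared by at least two blocks.
    return sorted(sorted(members) for members in groups.values() if len(members) > 1)


if __name__ == "__main__":
    example = {
        "192.0.2.0/24": [(1000, 4000)],
        "198.51.100.0/24": [(1010, 4020)],
        "203.0.113.0/24": [(9000, 9500)],
    }
    print(cluster_by_events(example))
```

Fixed bins are a deliberate simplification: two events a few seconds apart can land in different bins, so a real implementation would need a more careful notion of matching events. Because each block's signature is computed independently and then grouped by key, this shape maps naturally onto a map step and a reduce step, which is in the spirit of the Map/Reduce parallelization the abstract mentions.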


  author = {Heidemann, John and Pradkin, Yuri and Nisar, Aqib},
  title = {Back Out: End-to-end Inference of Common
                    Points-of-Failure in the Internet (extended)},
  institution = {USC/Information Sciences Institute},
  year = {2018},
  sortdate = {2018-02-02},
  project = {ant, lacanic, retrofuturebridge, duoi},
  jsubject = {routing},
  number = {ISI-TR-724},
  month = feb,
  jlocation = {johnh: pafile},
  keywords = {network outage detection, clustering, visualization},
  url = {},
  pdfurl = {},
  myorganization = {USC/Information Sciences Institute},
  copyrightholder = {authors}