We’ve been taking Internet IPv4 censuses regularly since 2006. In each census, we probe the entire allocated IPv4 address space. You may browse 8 years of data at our IPv4 address browser.
These eight years show some interesting events, from an early “open” Internet in 2006, to full allocation of IPv4 by ICANN in 2011, to higher utilization in 2014.
This animation was first shown at the Dec. 2014 DHS Cyber Security Division R&D Showcase and Technical Workshop as part of the talk “Towards Understanding Internet Reliability” given by John Heidemann. This work was supported by DHS, most recently through the LACREND project.
The paper “Measuring DANE TLSA Deployment” will appear at the Traffic Monitoring and Analysis Workshop in April 2015 in Barcelona, Spain (available at http://www.isi.edu/~liangzhu/papers/dane_tlsa.pdf).
From the abstract:
The DANE (DNS-based Authentication of Named Entities) framework uses DNSSEC to provide a source of trust, and with TLSA it can serve as a root of trust for TLS certificates. This serves to complement traditional certificate authentication methods, which is important given the risks inherent in trusting hundreds of organizations—risks already demonstrated with multiple compromises. The TLSA protocol was published in 2012, and this paper presents the first systematic study of its deployment. We studied TLSA usage, developing a tool that actively probes all signed zones in .com and .net for TLSA records. We find the TLSA use is early: in our latest measurement, of the 485k signed zones, we find only 997 TLSA names. We characterize how it is being used so far, and find that around 7–13% of TLSA records are invalid. We find 33% of TLSA responses are larger than 1500 Bytes and will very likely be fragmented.
The work in the paper is by Liang Zhu (USC/ISI), Duane Wessels and Allison Mankin (both of Verisign Labs), and John Heidemann (USC/ISI).
Digit-1.1 has been released (available at https://ant.isi.edu/software/tdns/index.html). Digit is a DNS client side tool that can perform DNS queries via different protocols such as UDP, TCP, TLS. This tool is primarily designed to evaluate the client side latency of using DNS over TCP/TLS, as described in the technical report “T-DNS: Connection-Oriented DNS to Improve Privacy and Security” (http://www.isi.edu/~johnh/PAPERS/Zhu14b/index.html).
A README in the package has detailed instructions about how to use this software.
Our research studies the Internet’s public face. Since 2006 we have been taking censuses of the Internet address space (pinging all IPv4 addresses) every 3 months. Since 2012 we have studied network outages and events like Hurricane Sandy, using probes of much of the Internet every 11 minutes. Most recently we have evaluated the diurnal Internet, finding countries where most people turn off their computers at night. Finally, we have looked at network reputation, identifying how spam generation correlates with network location, and others have studies multiple measurements of “network reputation”.
A common theme across this work is one must estimate characteristics of the edge of the Internet in spite of noisy measurements and a underlying changes. One also need to compare and correlate these imperfect measurements with other factors (from GDP to telecommunications policies).
How do these applications relate to the mathematics of census taking and measurement, estimation, and correlation? Are there tools we should be using that we aren’t? Do the properties of the Internet suggest new approaches (for example where rapid full enumeration is possible)? Does correlation and estimates of network “badness” help us improve cybersecurity by treating different parts of the network differently?
We have recently put together a video showing 35 days of Internet address usage as observed from Trinocular, our outage detection system.
The Internet sleeps: address use in South America is low (blue) in the early morning, while India is high (red) in afternoon. When we look at address usage over time, we see that some parts of the globe have daily swings of +/-10% to 20% in the number of active addresses. In China, India, eastern Europe and much of South America, the Internet sleeps.
Understanding when the Internet sleeps is important to understand how different country’s network policies affect use, it is part of outage detection, and it is a piece of improving our long-term goal of understanding exactly how big the Internet is.
This work is partly supported by DHS S&T, Cyber Security division, agreement FA8750-12-2-0344 (under AFRL) and N66001-13-C-3001 (under SPAWAR). The views contained
herein are those of the authors and do not necessarily represent those of DHS or the U.S. Government. This work was classified by USC’s IRB as non-human subjects research (IIR00001648).
The differences in the census are small, as one would hope, since it’s a global Internet. However, when we look at latency (the time it takes for an IP address to reply to our requests), Greece gives us a European view.
Compare the lower-left corner of the Internet, since that is European IPv4 address space:
In addition to big thanks to George Xylomenos and George Polyzos of AUEB (σας ευχαριστώ!) and AUEB for institutional funding for this work. We also thank Christos Papadopoulos (Colorado State) for helping with many details, and Colin Perkins (U. Glasgow) for discussions about potential European hosts.
Data from our Greece census is available to researchers at no cost on the same terms as our existing census data. See our datasets page for details. Greek data starts with it61 as of 2014-08-29.
The DANE (DNS-based Authentication of Named Entities) framework uses DNSSEC to provide a source of trust, and with TLSA it can serve as a root of trust for TLS certificates. This alternative to traditional certificate authorities is important given the risks inherent in trusting hundreds of organizations—risks already demonstrated with multiple compromises. The TLSA protocol was published in 2012, and this talk presents the first systematic study of its deployment. We studied TLSA usage, developing a tool that actively probes all signed zones in .com and .net for TLSA records. We find the TLSA use is early: in our latest measurement, of the 461k signed zones, we find only 701 TLSA names. We characterize how it is being used so far, and find that around 7–12% of TLSA records are invalid. We find 31% of TLSA responses are larger than 1500 Bytes and get IP fragmented.
The work in the talk is by Liang Zhu (USC/ISI), Duane Wessels and Allison Mankin (both of Verisign), and John Heidemann (USC/ISI).
The paper “When the Internet Sleeps: Correlating Diurnal Networks With External Factors” will appear at ACM Internet Measurements Conference 2014 in Vancouver, Canada (available at http://www.isi.edu/~johnh/PAPERS/Quan14c/ with cite and pdf, or direct pdf).
From the abstract:
As the Internet matures, policy questions loom larger in its operation. When should an ISP, city, or government invest in infrastructure? How do their policies affect use? In this work, we develop a new approach to evaluate how policies, economic conditions and technology correlates with Internet use around the world. First, we develop an adaptive and accurate approach to estimate block availability, the fraction of active IP addresses in each /24 block over short timescales (every 11 minutes). Our estimator provides a new lens to interpret data taken from existing long-term outage measurements, thus requiring no additional traffic. (If new collection was required, it would be lightweight, since on average, outage detection requires less than 20 probes per hour per /24 block; less than 1% of background radiation.) Second, we show that spectral analysis of this measure can identify diurnal usage: blocks where addresses are regularly used during part of the day and idle in other times. Finally, we analyze data for the entire responsive Internet (3.7M /24 blocks) over 35 days. These global observations show when and where the Internet sleeps—networks are mostly always-on in the US and Western Europe, and diurnal in much of Asia, South America, and Eastern Europe. ANOVA (Analysis of Variance) testing shows that diurnal networks correlate negatively with country GDP and electrical consumption, quantifying that national policies and economics relate to networks.
All data in this paper is available to researchers at no cost, and source code to our analysis tools is available on request; see our diurnal datasets webpage.
This work is partly supported by DHS S&T, Cyber Security division, agreement FA8750-12-2-0344 (under AFRL) and N66001-13-C-3001 (under SPAWAR). The views contained
herein are those of the authors and do not necessarily represent those of DHS or the U.S. Government. This work was classified by USC’s IRB as non-human subjects research (IIR00001648).
With the vast amount of accessible, online content, it is not surprising that unscrupulous entities “borrow” from the web to provide filler for advertisements, link farms, and spam and make a quick profit. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically discover previously unknown duplicate content in the web, and the second to detect copies of discovered or manually identified content in the web. Our detection can also bad neighborhoods, clusters of pages where copied content is frequent. We verify our approach with controlled experiments with two large datasets: a Common Crawl subset the web, and a copy of Geocities, an older set of user-provided web content. We then demonstrate that we can discover otherwise unknown examples of duplication for spam, and detect both discovered and expert-identified content in these large datasets. Utilizing an original copy of Wikipedia as identified content, we find 40 sites that reuse this content, 86% for commercial benefit.
We released a new technical report “T-DNS: Connection-Oriented DNS to Improve Privacy and Security (extended)”, ISI-TR-2014-693, available as http://www.isi.edu/~johnh/PAPERS/Zhu14b.pdf
From the abstract:
DNS is the canonical protocol for connectionless UDP. Yet DNS today is challenged by eavesdropping that compromises privacy, source-address spoofing that results in denial-of-service (DoS) attacks on the server and third parties, injection attacks that exploit fragmentation, and size limitations that constrain policy and operational choices. We propose T-DNS to address these problems. It uses TCP to smoothly support large payloads and to mitigate spoofing and amplification for DoS. T-DNS uses transport-layer security (TLS) to provide privacy from users to their DNS resolvers and optionally to authoritative servers. Expectations about DNS suggest connections will balloon client latency and overwhelm server with state, but our evaluation shows costs are modest: end-to-end latency from TLS to the recursive resolver is only about 9% slower when UDP is used to the authoritative server, and 22% slower with TCP to the authoritative. With diverse traces we show that frequent connection reuse is possible (60–95% for stub and recursive resolvers, although half that for authoritative servers), and after connection establishment, we show TCP and TLS latency is equivalent to UDP. With conservative timeouts (20 s at authoritative servers and 60 s elsewhere) and conservative estimates of connection state memory requirements, we show that server memory requirements match current hardware: a large recursive resolver may have 24k active connections requiring about 3.6 GB additional RAM. We identify the key design and implementation decisions needed to minimize overhead: query pipelining, out-of-order responses, TLS connection resumption, and plausible timeouts.
This paper is a major revision of the prior technical report ISI-TR-2014-688. Since that work we have improved our understanding of the availability of TCP fast open and TLS resumption, and we have tightened our estimates on memory based on external reports (section 5.2). This additional information has allowed us to conduct additional experiments, improve our modeling, and provide a more accurate view of what is possible today; our estimates of latency and memory consumption are both lower than in our prior technical report as a result. We have also added additional information about packet size limitations (Figure 2), experiments evaluating DNSCrypt/DNSCurve (section 6.1), analysis of DTLS, and covered a broader range of RTTs in our experiments. We believe these additions strengthen our central claims: that connectionless DNS causes multiple problems and that T-DNS addresses those problems with modest increase in latency and memory suitable for current hardware.