Categories
Papers Publications

new conference paper “Understanding Block-level Address Usage in the Visible Internet” at SIGCOMM

The paper “Understanding Block-level Address Usage in the Visible Internet” was accepted and presented at SIGCOMM’10 in New Delhi, India (available at http://www.isi.edu/~johnh/PAPERS/Cai10a.html).

From the abstract:

Although the Internet is widely used today, we have little information about the edge of the network. Decentralized management, firewalls, and sensitivity to probing prevent easy answers and make measurement difficult. Building on frequent ICMP probing of 1% of the Internet address space, we develop clustering and analysis methods to estimate how Internet addresses are used. We show that adjacent addresses often have similar characteristics and are used for similar purposes (61% of addresses we probe are consistent blocks of 64 neighbors or more). We then apply this block-level clustering to provide data to explore several open questions in how networks are managed. First, we provide information about how effectively network address blocks appear to be used, finding that a significant number of blocks are only lightly used (most addresses in about one-fifth of /24 blocks are in use less than 10% of the time), an important issue as the IPv4 address space nears full allocation. Second, we provide new measurements about dynamically managed address space, showing nearly 40% of /24 blocks appear to be dynamically allocated, and dynamic addressing is most widely used in countries more recent to the Internet (more than 80% in China, while less than 30% in the U.S.). Third, we distinguish blocks with low-bitrate last-hops and show that such blocks are often underutilized.
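
As a rough illustration of the block-level view the paper takes, the sketch below groups individual ping results into blocks of 64 adjacent addresses and averages their responsiveness. It is not the paper’s clustering method; the input format and the 10% “lightly used” cutoff are assumptions made only for this example.

```python
# A minimal sketch (not the paper's actual pipeline): group probed addresses
# into blocks of 64 neighbors (/26) and estimate per-block availability from
# ping responses.  `probes` is a hypothetical dict mapping an IPv4 address to
# the fraction of probes that address answered.
import ipaddress
from collections import defaultdict

def block_availability(probes, prefix_len=26):
    """Average responsiveness of each /26 block (64 adjacent addresses)."""
    blocks = defaultdict(list)
    for addr, frac_responsive in probes.items():
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        blocks[net].append(frac_responsive)
    return {net: sum(v) / len(v) for net, v in blocks.items()}

probes = {"192.0.2.1": 0.9, "192.0.2.2": 0.85, "192.0.2.200": 0.05}
for net, avail in block_availability(probes).items():
    tag = "lightly used" if avail < 0.1 else "active"   # illustrative cutoff
    print(net, round(avail, 2), tag)
```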

Citation: Xue Cai and John Heidemann. Understanding Block-level Address Usage in the Visible Internet. In Proceedings of the ACM SIGCOMM Conference, to appear. New Delhi, India, ACM. August, 2010. <http://www.isi.edu/~johnh/PAPERS/Cai10a.html>.

Categories
Publications Technical Report

New tech report “Selecting Representative IP Addresses for Internet Topology Studies”

We just published a new technical report “Selecting Representative IP Addresses for Internet Topology Studies” (available at ftp://ftp.isi.edu/isi-pubs/tr-666.pdf).

From the abstract:

An Internet hitlist is a set of addresses that cover and can represent the Internet as a whole. Hitlists have long been used in studies of Internet topology, reachability, and performance, serving as the destinations of traceroute or performance probes. Most early topology studies used manually generated lists of prominent addresses, but evolution and growth of the Internet make human maintenance untenable. Random selection scales to today’s address space, but most random addresses fail to respond. In this paper we present what we believe is the first automatic generation of hitlists informed by censuses of Internet addresses. We formalize the desirable characteristics of a hitlist: reachability, each representative responds to pings; completeness, they cover all the allocated IPv4 address space; and stability, list evolution is minimized when possible. We quantify the accuracy of our automatic hitlists, showing that only one-third of the Internet allows informed selection of representatives. Of informed representatives, 50–60% are likely to respond three months later, and we show that causes for non-responses are likely due to dynamic addressing (so no stable representative exists) or firewalls. In spite of these limitations, we show that the use of informed hitlists can add 1.7 million edge links (a 5% growth) to traceroute-based Internet topology studies. Our hitlists are available free-of-charge and are in use by several other research projects.
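
As a rough sketch of the representative-selection idea (not the report’s actual algorithm), the example below keeps, for each /24 block, the address that responded most often in past censuses; the input format and scores are hypothetical.

```python
# A simplified illustration: for each /24 block, pick as its hitlist
# representative the address that responded most often in past censuses.
# The input format and the "must have responded at least once" rule are
# assumptions for this sketch.
import ipaddress

def build_hitlist(history):
    """history: hypothetical mapping of IPv4 address -> response score in [0, 1]."""
    best = {}  # /24 network -> (score, address)
    for addr, score in history.items():
        net = ipaddress.ip_network(f"{addr}/24", strict=False)
        if net not in best or score > best[net][0]:
            best[net] = (score, addr)
    # keep only blocks where some address actually responded
    return {net: addr for net, (score, addr) in best.items() if score > 0}

history = {"198.51.100.7": 0.95, "198.51.100.99": 0.4, "203.0.113.5": 0.0}
print(build_hitlist(history))  # one representative per responsive /24
```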

Citation: Xun Fan and John Heidemann. Selecting Representative IP Addresses for Internet Topology Studies. Technical Report N. ISI-TR-666, USC/Information Sciences Institute, June, 2010. http://www.isi.edu/~johnh/PAPERS/Fan10a.html.

Categories
Papers Publications

New conference paper “Correlating Spam Activity with IP Address Characteristics” at Global Internet

The paper “Correlating Spam Activity with IP Address Characteristics” (available in PDF format) was accepted and presented at Global Internet 2010. The focus of this paper is to quantify the collateral damage (legitimate mail servers incorrectly blacklisted) caused by the practice of blocking /24 address blocks based on the presence of spammers. The paper also revisits the differences in IP address characteristics and domain names between spammers and non-spammers.

From the abstract:

It is well known that spam bots mostly utilize compromised machines with certain address characteristics, such as dynamically allocated addresses, machines in specific geographic areas, and IP ranges from ASes with more tolerant spam policies. Such machines tend to be less diligently administered and may exhibit less stability, more volatility, and shorter uptimes. However, few studies have attempted to quantify how such spam bot address characteristics compare with non-spamming hosts. Quantifying these characteristics may help provide important information for comprehensive spam mitigation.

We use two large datasets, namely a commercial blacklist and an Internet-wide address visibility study to quantify address characteristics of spam and non-spam networks. We find that spam networks exhibit significantly less availability and uptime, and higher volatility than non-spam networks. In addition, we conduct a collateral damage study of a common practice where an ISP blocks the entire /24 prefix if spammers are detected in that range.  We find that such a policy blacklists a significant portion of legitimate mail servers belonging to the same prefix.
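
The collateral-damage question can be illustrated with a small sketch: count the known-legitimate mail servers that share a /24 prefix with at least one blacklisted spammer. This is only a toy version of the paper’s measurement, with made-up addresses.

```python
# A back-of-the-envelope sketch of the collateral-damage question: which
# known-legitimate mail servers sit in a /24 that also contains a blacklisted
# spammer?  All addresses below are hypothetical documentation addresses.
import ipaddress

def collateral_damage(spammers, legit_servers):
    """Return the legitimate servers that would be blocked by /24 blacklisting."""
    spam_prefixes = {ipaddress.ip_network(f"{a}/24", strict=False) for a in spammers}
    return [a for a in legit_servers
            if ipaddress.ip_network(f"{a}/24", strict=False) in spam_prefixes]

spammers = ["192.0.2.13", "198.51.100.77"]
legit_servers = ["192.0.2.25", "203.0.113.25"]
print(collateral_damage(spammers, legit_servers))  # ['192.0.2.25'] is collateral
```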

Citation: Chris Wilcox, Christos Papadopoulos, and John Heidemann. Correlating Spam Activity with IP Address Characteristics. In Proceedings of the IEEE Global Internet Conference, San Diego, CA, USA, IEEE. March, 2010.

Categories
Papers Publications

New conference paper “Improved Internet Traffic Analysis via Optimized Sampling”

The paper “Improved Internet Traffic Analysis via Optimized Sampling” (available in PDF format) was accepted to ICASSP 2010. The focus of this paper is on the best down-sampling methods to use when measuring Internet traffic in order to preserve signal information for traffic analysis techniques such as anomaly detection.

From the abstract:

Applications to evaluate Internet quality-of-service and increase network security are essential to maintaining reliability and high performance in computer networks. These applications typically use very accurate, but high-cost, hardware measurement systems. Alternate, less expensive software-based systems are often impractical for use with analysis applications because they reduce the number and accuracy of measurements using a technique called interrupt coalescence, which can be viewed as a form of sampling. The goal of this paper is to optimize the way interrupt coalescence groups packets into measurements so as to retain as much of the packet timing information as possible. Our optimized solution produces estimates of timing distributions much closer to those obtained using hardware-based systems. Further, we show that for a real Internet analysis application, periodic signal detection, using measurements generated with our method improved detection times by at least 36%.
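
For readers unfamiliar with interrupt coalescence, the toy model below shows the effect the paper is optimizing around: packets that arrive within one coalescence window are reported with a single timestamp, so fine-grained timing inside the window is lost. The window size and grouping rule here are illustrative assumptions, not the paper’s model.

```python
# A toy model of interrupt coalescence viewed as sampling: packets arriving
# within one coalescence window share a single reported timestamp, so the
# per-packet timing inside the window is lost.  The 1 ms window is arbitrary.
def coalesce(timestamps, window=0.001):
    """Group packet timestamps into windows and stamp each group once."""
    groups, current = [], []
    for t in sorted(timestamps):
        if current and t - current[0] > window:
            groups.append(current)
            current = []
        current.append(t)
    if current:
        groups.append(current)
    # every packet in a group inherits the group's last arrival time
    return [(g[-1], len(g)) for g in groups]

pkts = [0.0000, 0.0003, 0.0009, 0.0042, 0.0044]
print(coalesce(pkts))  # [(0.0009, 3), (0.0044, 2)]
```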

Citation: Sean McPherson and Antonio Ortega. Improved Internet Traffic Analysis via Optimized Sampling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, to appear. Dallas, TX, USA, IEEE. March, 2010.

Categories
Publications Technical Report

New tech report “Analysis of Internet Measurement Systems for Optimized Anomaly Detection System Design”

A new tech report has been posted to the arXiv database at http://arxiv.org/abs/0907.5233. This paper shows the effect of a software-based measurement system on the timing of the measurements obtained. Additionally, this paper develops a periodic signal detection method specific to software-based measurement.

From the abstract:

Although there exist very accurate hardware systems for measuring traffic on the Internet, their widespread use for analysis tasks is limited by their high cost. On the other hand, less expensive, software-based systems exist that are widely available and can be used to perform a number of simple analysis tasks. The caveat with using such software systems is that standard analysis methods cannot be applied blindly, because inherent distortions exist in the measurements obtained from software systems. The goal of this paper is to analyze common Internet measurement systems to discover the effect of these distortions on common analysis tasks. Then, by selecting one specific task, periodic signal detection, a more in-depth analysis is conducted: we derive a signal representation that captures the salient features of the measurement and develop a periodic detection mechanism, designed for the measurement system, which outperforms an existing detection method not optimized for that system. Finally, experiments emphasize the importance of understanding the relationship between the input traffic, the measurement system configuration, and detection method performance.
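
As a generic illustration of the periodic-signal-detection task (not the report’s detector), the sketch below finds the strongest periodic component in a per-interval packet-count series with a simple periodogram; real software-based measurements would first need to account for the coalescence distortions discussed above.

```python
# A generic periodogram-based detector, included only to illustrate the task;
# the report's method and its measurement-system corrections are not shown.
import numpy as np

def dominant_period(counts, interval=0.01):
    """Return the strongest non-DC period (seconds) in a packet-count series."""
    spectrum = np.abs(np.fft.rfft(counts - np.mean(counts)))
    freqs = np.fft.rfftfreq(len(counts), d=interval)
    k = int(np.argmax(spectrum[1:]) + 1)   # skip the DC bin
    return 1.0 / freqs[k]

# synthetic example: a 0.5 s periodic component buried in noisy counts
t = np.arange(0, 10, 0.01)
counts = 100 + 20 * np.sin(2 * np.pi * t / 0.5) + np.random.randn(len(t))
print(round(dominant_period(counts), 2))  # ~0.5 s period recovered
```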

Citation: Sean McPherson and Antonio Ortega. Analysis of Internet Measurement Systems for Optimized Anomaly Detection System Design. Technical Report N. arXiv:0907.5233v1, University of Southern California, Department of Electrical Engineering, July, 2009. http://arxiv.org/abs/0907.5233.

Categories
Announcements

multiple views in browsable Internet address map

We’re happy to announce an update to our browsable Internet map at http://www.isi.edu/ant/address/browse/. Our map now includes FIND ME and MULTIPLE VIEWS.

[screenshot of browsing RTTs in the Internet]

FIND ME: To locate any host on the map, click in the IP address box (at the top right) and type in a hostname. A pushpin will appear at that address, with a bubble indicating the hostname and IP address, and the map will scroll to the location. No more manually finding addresses!

MULTIPLE VIEWS allow users to flip between different data types, census dates, and source locations:

  1. DATA TYPES: We now plot round-trip times in addition to prior ping responsiveness. See how far away the Internet is! (At least from our probing sites.)
  2. CENSUS DATES: We currently plot five datasets from Nov 2006 to June 2009. Travel through time to see the Internet of yesteryear!
  3. SOURCE LOCATIONS: We collect data from two different locations, Los Angeles and Colorado State University, to help understand whether we have observation bias. See the Internet from sea level, or a mile high!

To select different views, click the +-sign on the right of the screen and pick from the menus.

Data collection for this work is through the LANDER project http://www.isi.edu/ant/lander/, and the visualization improvements are due to AMITE http://www.isi.edu/ant/amite/, both supported by DHS.  We thank OpenLayers.org for the customizable front-end.

Categories
Publications Technical Report

new tech report “Parametric Methods for Anomaly Detection in Aggregate Traffic”

We just posted a tech report “Parametric Methods for Anomaly Detection in Aggregate Traffic” at <ftp://ftp.isi.edu/isi-pubs/tr-663.pdf>. This paper represents quite a bit of work looking at how to apply parametric detection as part of the NSF-sponsored MADCAT project.

From the abstract:

This paper develops parametric methods to detect network anomalies using only aggregate traffic statistics in contrast to other works requiring flow separation, even when the anomaly is a small fraction of the total traffic.  By adopting simple statistical models for anomalous and background traffic in the time-domain, one can estimate model parameters in real-time, thus obviating the need for a long training phase or manual parameter tuning.  The detection mechanism uses a sequential probability ratio test, allowing for control over the false positive rate while examining the trade-off between detection time and the strength of an anomaly.  Additionally, it uses both traffic-rate and packet-size statistics, yielding a bivariate model that eliminates most false positives.  The method is analyzed using the bitrate SNR metric, which is shown to be an effective metric for anomaly detection.  The performance of the bPDM is evaluated in three ways:  first, synthetically generated traffic provides for a controlled comparison of detection time as a function of the anomalous level of traffic.  Second, the approach is shown to be able to detect controlled artificial attacks over the USC campus network in varying real traffic mixes.  Third, the proposed algorithm achieves rapid detection of real denial-of-service attacks as determined by the replay of previously captured network traces.  The method developed in this paper is able to detect all attacks in these scenarios in a few seconds or less.
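
To give a flavor of the sequential testing that underlies the approach, here is a textbook univariate SPRT on a per-second traffic rate. It is not the bPDM itself, which is bivariate (traffic rate and packet size) and estimates its parameters online; all distributions and numbers below are made up for illustration.

```python
# A textbook univariate SPRT on per-second traffic rate, shown only to
# illustrate the sequential-test idea; the bPDM is bivariate and estimates
# its parameters in real time.  All parameters here are made up.
import math

def sprt(rates, mu0=100.0, mu1=130.0, sigma=15.0, alpha=0.01, beta=0.01):
    """Decide 'anomaly' (H1, mean mu1) vs 'normal' (H0, mean mu0) for Gaussian rates."""
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    llr = 0.0
    for i, x in enumerate(rates, 1):
        # log-likelihood ratio increment for equal-variance Gaussians
        llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "anomaly", i
        if llr <= lower:
            return "normal", i
    return "undecided", len(rates)

print(sprt([102, 99, 135, 140, 138, 142]))  # detects the rate shift within a few samples
```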

Citation: Gautam Thatte, Urbashi Mitra, and John Heidemann. Parametric Methods for Anomaly Detection in Aggregate Traffic. Technical Report N. ISI-TR-2009-663, USC/Information Sciences Institute, August, 2009. http://www.isi.edu/~johnh/PAPERS/Thatte09a.html.

Categories
Announcements Collaborations Software releases

ANT extensions for bzip2-splitting to appear in Hadoop

The ANT project is happy to announce that our extensions to Hadoop to support splitting of bzip2-compressed files have been accepted to appear in the next Hadoop release (which will be 0.21.0).

Support for compression is important in map/reduce because it reduces the amount of I/O, and because important input files (for us, our Internet address censuses) are provided in compressed format.

Splitting is important in map/reduce, because splitting allows many computers to process parts of a few big files.  Since the whole point of Hadoop and map/reduce is processing big files (for us, 4GB or more) with many computers (for us, dozens to hundreds), splitting is really essential.

Until now, Hadoop did not support splitting of compressed files. Instead, if input data was compressed, you got at most one computer per file. Some work-arounds were possible, but they were basically unpleasant, often requiring that one rewrite all the input data in some other format.

Our extensions (see HADOOP-4012 and MAPREDUCE-830, plus HADOOP-3646 that went into 0.19.0) support Hadoop execution over bzip2 files with automatic splitting. Getting this done was trickier than one might expect: Hadoop really wants to decide where to split files, yet bzip2 can only support splits at its own compressed-block boundaries, which rarely match Hadoop’s choices, and users don’t care about either of these but only about their record boundaries. Fortunately, we were able to align all of these constraints, and deal with the corner cases that inevitably arise. (What if the bzip2 marker appears in normal data? What happens when markers exactly align, or are off-by-one?)
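
To see why splitting bzip2 is feasible at all, note that every compressed block begins with the 48-bit magic 0x314159265359, so a reader handed an arbitrary split offset can scan forward to the next block start. The Python sketch below is only a byte-aligned approximation of that scan; the actual patch is Java and must cope with headers that are not byte-aligned, plus the corner cases mentioned above.

```python
# Conceptual sketch only: scan for the next byte-aligned bzip2 block magic
# after a split offset.  Real bzip2 block headers are generally NOT
# byte-aligned, and a spurious occurrence of the magic inside compressed data
# must also be handled; the Hadoop patch deals with both.
import bz2

BZ2_BLOCK_MAGIC = bytes.fromhex("314159265359")  # 48-bit block-start magic

def next_block_offset(data: bytes, split_offset: int) -> int:
    """First byte-aligned block-magic position at or after split_offset (-1 if none)."""
    return data.find(BZ2_BLOCK_MAGIC, split_offset)

# toy input: two independently compressed streams concatenated, a valid
# multi-stream bzip2 file that keeps the second block magic byte-aligned
data = bz2.compress(b"first block\n" * 1000) + bz2.compress(b"second block\n" * 1000)
print(next_block_offset(data, 10))  # offset where a reader could resume decompression
```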

Abdul Qadeer did this work in 2008, working with Yuri Pradkin and me (John Heidemann), and continued to work with the patch through its getting committed. We especially thank Chris Douglas at Yahoo for shepherding the patch through the Hadoop bug tracking system, including helping clean it up and add test cases. And we thank Doug Cutting for initially suggesting bzip2 as a splittable compression scheme.

This work was supported by NSF through the MR-Net research project (CNS-0823774).

Categories
Publications Technical Report

new tech report “Understanding Address Usage in the Visible Internet”

We just posted a tech report “Understanding Address Usage in the Visible Internet” at <ftp://ftp.isi.edu/isi-pubs/tr-656.pdf>.

The abstract summarizes the tech report:

Although the Internet is widely used today, there are few sound estimates of network demographics. Decentralized network management means questions about Internet use cannot be answered by a central authority, and firewalls and sensitivity to probing mean that active measurements must be done carefully and validated against known data. Building on frequent ICMP probing of 1% of the Internet address space, we develop a clustering algorithm to estimate how Internet addresses are used. We show that adjacent addresses often have similar characteristics and are used for similar purposes (61% of addresses we probe are consistent blocks of 64 neighbors or more). We then apply this block-level clustering to provide data to explore several open questions in how networks are managed. First, the nearing full allocation of IPv4 addresses makes it increasingly important to estimate the costs of better management of the IPv4 space as a component of an IPv6 transition. We provide information about how effectively network address blocks appear to be used, finding that a significant number of blocks are only lightly used (about one-fifth of /24 blocks have most addresses in use less than 10% of the time). Second, we provide new measurements about dynamically managed address space, showing nearly 40% of /24 blocks appear to be dynamically allocated, and dynamic addressing is most widely used in countries more recent to the Internet (more than 80% in China, while less than 30% in the U.S.).

Citation: Xue Cai and John Heidemann. Understanding Address Usage in the Visible Internet. Technical Report N. ISI-TR-2009-656, USC/Information Sciences Institute, February, 2009. http://www.isi.edu/~johnh/PAPERS/Cai09b.html.

Categories
Papers Publications

new paper “Uses and Challenges for Network Datasets”

We just posted a pre-print of the paper “Uses and Challenges for Network Datasets”, to appear at IEEE CATCH in March.  The pre-print is at <http://www.isi.edu/~johnh/PAPERS/Heidemann09a.html>.

The abstract summarizes the paper:

Network datasets are necessary for many types of network research.  While there has been significant discussion about specific datasets, there has been less about the overall state of network data collection.  The goal of this paper is to explore the research questions facing the Internet today, the datasets needed to answer those questions, and the challenges to using those datasets.  We suggest several practices that have proven important in use of current data sets, and open challenges to improve use of network data.

More specifically, the paper tries to answer the question Jody Westby put to PREDICT PIs, which is “why take data, what is it good for”? While a simple question, it’s not easy to answer (at least, my attempt to dash off a quick answer in e-mail failed). The paper is an attempt at a more thoughtful answer.

The paper tries to summarize and point to a lot of ongoing work, but I know that our coverage was insufficient.  We welcome feedback about what we’re missing.

Citation: John Heidemann and Christos Papadopoulos. Uses and Challenges for Network Datasets. In Proceedings of the IEEE Cybersecurity Applications and Technologies Conference for Homeland Security (CATCH), pp. 73-82. Washington, DC, USA, IEEE. March, 2009. http://www.isi.edu/~johnh/PAPERS/Heidemann09a.html.