Categories: Publications, Technical Report

new tech report “Parametric Methods for Anomaly Detection in Aggregate Traffic”

We just posted a tech report “Parametric Methods for Anomaly Detection in Aggregate Traffic” at <ftp://ftp.isi.edu/isi-pubs/tr-663.pdf>. This paper represents quite a bit of work, done as part of the NSF-sponsored MADCAT project, on how to apply parametric detection to aggregate network traffic.

From the abstract:

This paper develops parametric methods to detect network anomalies using only aggregate traffic statistics, in contrast to other works requiring flow separation, even when the anomaly is a small fraction of the total traffic.  By adopting simple statistical models for anomalous and background traffic in the time domain, one can estimate model parameters in real time, thus obviating the need for a long training phase or manual parameter tuning.  The detection mechanism uses a sequential probability ratio test, allowing for control over the false positive rate while examining the trade-off between detection time and the strength of an anomaly.  Additionally, it uses both traffic-rate and packet-size statistics, yielding a bivariate parametric detection mechanism (bPDM) that eliminates most false positives.  The method is analyzed using the bitrate SNR metric, which is shown to be an effective metric for anomaly detection.  The performance of the bPDM is evaluated in three ways.  First, synthetically generated traffic provides a controlled comparison of detection time as a function of the anomalous level of traffic.  Second, the approach is shown to be able to detect controlled artificial attacks over the USC campus network in varying real traffic mixes.  Third, the proposed algorithm achieves rapid detection of real denial-of-service attacks, as determined by the replay of previously captured network traces.  The method developed in this paper is able to detect all attacks in these scenarios in a few seconds or less.
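To give a rough sense of the kind of test the abstract describes, here is a minimal, generic sketch of a sequential probability ratio test over a stream of per-interval traffic rates.  It is not the paper’s bPDM: the bPDM estimates its model parameters online and also uses packet-size statistics, whereas the means, variance, and sample values below are simply made up for illustration.

    // A generic sequential probability ratio test (SPRT) sketch, NOT the bPDM.
    // It decides between background traffic (Gaussian, mean mu0) and background
    // plus an anomaly (mean mu1 > mu0), assuming a known, shared variance.
    public class SprtSketch {
        public static void main(String[] args) {
            double mu0 = 1000.0, mu1 = 1200.0, sigma = 150.0;  // hypothetical rates (packets/interval)
            double alpha = 0.01, beta = 0.01;                  // target false-positive and miss rates
            // Wald's stopping thresholds on the accumulated log-likelihood ratio.
            double upper = Math.log((1 - beta) / alpha);
            double lower = Math.log(beta / (1 - alpha));

            double[] observed = {1010, 995, 1180, 1230, 1205, 1250, 1190};  // made-up samples
            double llr = 0.0;
            for (int n = 0; n < observed.length; n++) {
                double x = observed[n];
                // Log-likelihood ratio of one Gaussian sample under H1 versus H0.
                llr += ((x - mu0) * (x - mu0) - (x - mu1) * (x - mu1)) / (2 * sigma * sigma);
                if (llr >= upper) {
                    System.out.println("anomaly declared after " + (n + 1) + " samples");
                    return;
                }
                if (llr <= lower) {
                    System.out.println("traffic declared normal after " + (n + 1) + " samples");
                    return;
                }
            }
            System.out.println("undecided; keep sampling");
        }
    }

The stopping thresholds come directly from the target false-positive and miss rates, which is where a sequential test gets its control over the false positive rate while trading detection time against the strength of the anomaly.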

Citation: Gautam Thatte, Urbashi Mitra, and John Heidemann. Parametric Methods for Anomaly Detection in Aggregate Traffic. Technical Report ISI-TR-2009-663, USC/Information Sciences Institute, August 2009. http://www.isi.edu/~johnh/PAPERS/Thatte09a.html.

Categories: Announcements, Collaborations, Software releases

ANT extensions for bzip2-splitting to appear in Hadoop

The ANT project is happy to announce that our extensions to Hadoop to support splitting of bzip2-compressed files have been accepted and will appear in the next Hadoop release (0.21.0).

Support for compression is important in map/reduce because it reduces the amount of I/O, and because important input files (for us, our Internet address censuses) are provided in compressed format.

Splitting is important in map/reduce because it allows many computers to process parts of a few big files.  Since the whole point of Hadoop and map/reduce is processing big files (for us, 4 GB or more) with many computers (for us, dozens to hundreds), splitting is really essential.

Until now, Hadoop did not support splitting of compressed files.  Instead, if input data was compressed, you got at most one computer per file.  Some work-arounds were possible, but they were basically unpleasant and often required rewriting all the input data in some other format.
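With the new support in place, none of these workarounds are needed; from the user’s side, a job over bzip2 input looks like any other job.  The sketch below is a hypothetical, minimal map-only driver (the paths and class names are illustrative, not from our code): the .bz2 suffix selects the bzip2 codec, and starting with 0.21.0 the framework divides the one compressed file into multiple splits on its own.

    // Hypothetical minimal driver: reading a bzip2-compressed text file with the
    // stock TextInputFormat.  Nothing here mentions splitting; with the 0.21.0
    // codec changes, the framework produces multiple splits from the .bz2 file
    // automatically, so many map tasks can share one large compressed input.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Bzip2SplitDemo {
        // Pass-through mapper: emits each input line unchanged.
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(offset, line);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "bzip2-split-demo");
            job.setJarByClass(Bzip2SplitDemo.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0);                        // map-only job
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/data/census.txt.bz2"));   // hypothetical input
            FileOutputFormat.setOutputPath(job, new Path("/data/census-lines"));   // hypothetical output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }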

Our extensions (see HADOOP-4012 and MAPREDUCE-830, plus HADOOP-3646 that went into 0.19.0) support Hadoop execution over bzip2 files with automatic splitting.  Getting this done was trickier than one might expect:  Hadoop really wants to decide where to split files, yet bzip2 only supports splits at specific locations (its compressed-block boundaries) that generally do not match Hadoop’s choices, and users care about neither of these, only about their record boundaries.  Fortunately, we were able to align all of these constraints and deal with the corner cases that inevitably arise.  (What if the bzip2 marker appears in normal data?  What happens when markers exactly align, or are off-by-one?)
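As a conceptual illustration of the first of these constraints (this is a sketch, not the actual patch), finding a usable split point means scanning, bit by bit, for bzip2’s 48-bit block-start marker, because compressed blocks are not byte-aligned.  The real code must then also rule out the marker pattern appearing by chance inside compressed data, and advance further to the next record boundary before handing the split to a map task.

    import java.io.IOException;
    import java.io.InputStream;

    // Conceptual sketch: locate the next bzip2 compressed-block boundary at or
    // after the framework's chosen split offset.  bzip2 marks each block with
    // the 48-bit magic number 0x314159265359, and blocks are not byte-aligned,
    // so the scan must consider every bit position.
    public class Bzip2BlockFinder {
        private static final long BLOCK_MAGIC = 0x314159265359L;  // block-start marker
        private static final long MAGIC_MASK  = 0xFFFFFFFFFFFFL;  // low 48 bits

        // Returns the bit offset (from the start of the stream) where the first
        // block-start marker begins, or -1 if the stream ends first.
        public static long findNextBlockStart(InputStream in) throws IOException {
            long window = 0;   // sliding window of the 48 most recent bits
            long bitPos = 0;   // bits consumed so far
            int b;
            while ((b = in.read()) != -1) {
                for (int i = 7; i >= 0; i--) {
                    window = ((window << 1) | ((b >>> i) & 1)) & MAGIC_MASK;
                    bitPos++;
                    // Caution: this pattern can also occur inside compressed data,
                    // so a real reader must verify the candidate block.
                    if (bitPos >= 48 && window == BLOCK_MAGIC) {
                        return bitPos - 48;
                    }
                }
            }
            return -1;
        }
    }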

Abdul Qadeer did this work in 2008, working with Yuri Pradkin and me (John Heidemann), and continued to work on the patch until it was committed.  We especially thank Chris Douglas at Yahoo for shepherding the patch through the Hadoop bug tracking system, including helping clean it up and add test cases.  And we thank Doug Cutting for initially suggesting bzip2 as a splittable compression scheme.

This work was supported by NSF through the MR-Net research project (CNS-0823774).