The ANT project is happy to announce that our extensions to Hadoop to support splitting of bzip2-compressed files have been accepted to appear in the next Hadoop release (will be 0.21.0).
Support for compression is important in map/reduce because it reduces the amount of I/O, and because important input files (for us, our Internet address censuses) are provided in compressed format.
Splitting is important in map/reduce, because splitting allows many computers to process parts of a few big files. Since the whole point of Hadoop and map/reduce is processing big files (for us, 4GB or more) with many computers (for us, dozens to hundreds), splitting is really essential.
Until now, Hadoop did not support splitting of compressed files. Instead, if input data was compressed, you get at most one computer per file. Some work-arounds were possible, but basically unpleasant, and often requiring that one rewrite all the input data is some other format.
Our extensions (see HADOOP-4012 and MAPREDUCE-830, plus HADOOP-3646 that went into 0.19.0) support Hadoop execution over bzip2 files with automatic splitting. Getting this done was trickier than one might expect: Hadoop really wants to decide where to split files, yet bzip2 can only support splits at specific locations that are different, and users don’t care about either of these but instead only about their record boundaries. Fortunately, we were able to align all of these constraints, and deal with the corner cases that inevitably arise. (What if the bzip2 marker appears in normal data? What happens when markers exactly align, or are off-by-one?)
Abdul Qadeer did this work in 2008, working with Yuri Pradkin and me (John Heidemann), and continued to work with the patch through its getting committed. We especially thank Chris Douglas at Yahoo for shepherding patch through the Hadoop bug tracking system, including helping clean it up and add test cases. And we thank Doug Cutting for initially suggesting bzip2 as a splittable compression scheme.
This work was supported by NSF through the MR-Net research project (CNS-0823774).