{"id":19,"date":"2009-09-11T09:49:53","date_gmt":"2009-09-11T16:49:53","guid":{"rendered":"http:\/\/ant.isi.edu\/blog\/?p=19"},"modified":"2020-10-19T10:45:06","modified_gmt":"2020-10-19T17:45:06","slug":"ant-extensions-for-bzip2-splitting-to-appear-in-hadoop","status":"publish","type":"post","link":"https:\/\/ant.isi.edu\/blog\/?p=19","title":{"rendered":"ANT extensions for bzip2-splitting to appear in Hadoop"},"content":{"rendered":"<p>The ANT project is happy to announce that our extensions to Hadoop to support splitting of bzip2-compressed files have been accepted to appear in the next Hadoop release (will be 0.21.0).<\/p>\n<p>Support for compression is important in map\/reduce because it reduces the amount of I\/O, and because important input files (for us, our <a href=\"http:\/\/www.isi.edu\/ant\/address\/\">Internet address censuses<\/a>) are provided in compressed format.<\/p>\n<p>Splitting is important in map\/reduce, because splitting allows many computers to process <em>parts<\/em> of a few big files.\u00a0 Since the whole point of Hadoop and map\/reduce is processing <em>big<\/em> files (for us, 4GB or more) with <em>many<\/em> computers (for us, dozens to hundreds), splitting is really <em>essential<\/em>.<\/p>\n<p>Until now, Hadoop did not support splitting of compressed files.\u00a0 Instead, if input data was compressed, you got at most one computer per file.\u00a0 Some work-arounds were possible, but they were unpleasant and often required rewriting all the input data in some other format.<\/p>\n<p>Our extensions (see <a href=\"https:\/\/issues.apache.org\/jira\/browse\/HADOOP-4012\">HADOOP-4012<\/a> and <a href=\"https:\/\/issues.apache.org\/jira\/browse\/MAPREDUCE-830\">MAPREDUCE-830<\/a>, plus <a href=\"https:\/\/issues.apache.org\/jira\/browse\/HADOOP-3646\">HADOOP-3646<\/a> that went into 0.19.0) support <strong>Hadoop execution over bzip2 files with automatic splitting<\/strong>.\u00a0 Getting this done was trickier than one might 
expect:\u00a0 Hadoop really wants to decide where to split files, yet bzip2 can only support splits at specific locations, which differ from those Hadoop would choose, and users don&#8217;t care about either of these but instead only about <em>their<\/em> record boundaries.\u00a0 Fortunately, we were able to align all of these constraints, and deal with the corner cases that inevitably arise.\u00a0 (What if the bzip2 marker appears in normal data?\u00a0 What happens when markers exactly align, or are off-by-one?)<\/p>\n<p>Abdul Qadeer did this work in 2008, working with Yuri Pradkin and me (John Heidemann), and continued to work on the patch until it was committed.\u00a0 We especially thank Chris Douglas at Yahoo for shepherding the patch through the Hadoop bug tracking system, including helping clean it up and add test cases.\u00a0 And we thank Doug Cutting for initially <a href=\"http:\/\/www.mail-archive.com\/hadoop-user@lucene.apache.org\/msg01971.html\">suggesting bzip2<\/a> as a splittable compression scheme.<\/p>\n<p>This work was supported by NSF through the <a href=\"http:\/\/www.isi.edu\/ant\/mrnet\/index.html\">MR-Net research project<\/a> (CNS-0823774).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The ANT project is happy to announce that our extensions to Hadoop to support splitting of bzip2-compressed files have been accepted to appear in the next Hadoop release (will be 0.21.0). 
Support for compression is important in map\/reduce because it reduces the amount of I\/O, and because important input files (for us, our Internet address [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[288,292,162],"tags":[14,12,13,16,15],"class_list":["post-19","post","type-post","status-publish","format-standard","hentry","category-announcements","category-collaborations","category-software-releases","tag-bzip2","tag-hadoop","tag-mapreduce","tag-software","tag-splitting"],"_links":{"self":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts\/19","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=19"}],"version-history":[{"count":6,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts\/19\/revisions"}],"predecessor-version":[{"id":1680,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=\/wp\/v2\/posts\/19\/revisions\/1680"}],"wp:attachment":[{"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=19"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=19"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ant.isi.edu\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=19"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}