map/reduce – ANT Research News

I would like to congratulate Dr. Abdul Qadeer for defending his PhD at the University of Southern California in March 2021 and completing his doctoral dissertation “Efficient Processing of Streaming Data in Multi-User and Multi-Abstraction Workflows”.

From the abstract:

Ever-increasing data and evolving processing needs force enterprises to scale-out expensive computational resources to prioritize processing for timely results. Teams process their organization’s data either independently or using ad hoc sharing mechanisms. Often different users start with the same data and the same initial stages (decrypt, decompress, clean, anonymize). As their workflows evolve, later stages often diverge, and different stages may work best with different abstractions. The result is workflows with some overlap, some variations, and multiple transitions where data handling changes between continuous, windowed, and per-block. The system processing this diverse, multi-user, multi-abstraction workflow should be efficient and safe, but also must cope with fault recovery.

Analytics from multiple users can cause redundant processing and data, or encounter performance anomalies due to skew. Skew arises due to static or dynamic imbalance in the workflow stages. Both redundancy and skew waste compute resources and add latency to results. When users bridge between multiple abstractions, such as from per-block processing to windowed processing, they often employ custom code. These transitions can be error prone due to corner cases, can easily add latency as an inefficiency, and custom code is often a source of errors and maintenance difficulty. We need new solutions to manage the above challenges and to expose opportunities for data sharing explicitly. Our thesis is: new methods enable efficient processing of multi-user and multi-abstraction workflows of streaming data. We present two new methods for efficient stream processing—optimizations for multi-user workflows, and multiple abstractions for application coverage and efficient bridging.

These algorithms use a pipeline-graph to detect duplication of code and data across multiple users and cleanly delineate workflow stages for skew management. The pipeline-graph is our job description language that allows developers to specify their need easily and enables our system to automatically detect duplication and manage skew. The pipeline-graph acts as a shared canvas for collaboration amongst users to extend each other’s work. To efficiently implement our deduplication and skew management algorithms, we present streaming data to processing stages as fixed-sized but large blocks. Large-blocks have low meta-data overhead per user, provide good parallelism, and help with fault recovery.

Our second method enables applications to use a different abstraction on a different workflow stage. We provide three key abstractions and show that they cover many classes of analytics and our framework can bridge them efficiently. We provide Block-Streaming, Windowed-Streaming, and Stateful-Streaming abstractions. Block-Streaming is suitable for single-pass applications that care about temporal or spatial locality. Windowed-Streaming allows applications to process accumulated data (time-aligned blocks to sync with external information) and reductions like summation, averages, or other MapReduce-style analytics. We believe our three abstractions allow many classes of analytics and enable processing of one block, many blocks, or infinite stream. Plumb allows multiple abstractions in different parts of the workflow and provides efficient bridging between them so that users could make complex analytics from individual stages without worrying about data movement.

Our methods aim for good throughput, low latency, and clean and easy-to-use support for more applications to achieve better efficiency than our prior hand-tuned but often brittle system. The Plumb framework is the implementation of our solutions and a testbed to validate them. We use real-world workloads from the B-Root DNS domain to demonstrate effectiveness of our solutions. Our processing deduplication increases throughput up to $6\times$, reduces storage by 75%, as compared to their pre-Plumb counterparts. Plumb reduces CPU wastage due to structural skew up to half and reduces latency due to computational skew by 50%. Plumb has cut per-block latency by 74% and latency of daily statistics by 97%, while reducing code size by 58% and lowering manual intervention to handle problems by 73% as compared to pre-Plumb system.

The operational use of Plumb for the B-Root service provides a multi-year validation of our design choices under many traffic conditions. Over the last three years, Plumb has processed more than 12PB of DNS packet data and daily statistics. We show that our abstractions apply to many applications in the domain of networking big-data and beyond.

The ANT project is happy to announce that our extensions to Hadoop to support splitting of bzip2-compressed files have been accepted to appear in the next Hadoop release (will be 0.21.0).

Support for compression is important in map/reduce because it reduces the amount of I/O, and because important input files (for us, our Internet address censuses) are provided in compressed format.

Splitting is important in map/reduce, because splitting allows many computers to process parts of a few big files. Since the whole point of Hadoop and map/reduce is processing big files (for us, 4GB or more) with many computers (for us, dozens to hundreds), splitting is really essential.

Until now, Hadoop did not support splitting of compressed files. Instead, if input data was compressed, you get at most one computer per file. Some work-arounds were possible, but basically unpleasant, and often requiring that one rewrite all the input data is some other format.

Our extensions (see HADOOP-4012 and MAPREDUCE-830, plus HADOOP-3646 that went into 0.19.0) support Hadoop execution over bzip2 files with automatic splitting. Getting this done was trickier than one might expect: Hadoop really wants to decide where to split files, yet bzip2 can only support splits at specific locations that are different, and users don’t care about either of these but instead only about their record boundaries. Fortunately, we were able to align all of these constraints, and deal with the corner cases that inevitably arise. (What if the bzip2 marker appears in normal data? What happens when markers exactly align, or are off-by-one?)

Abdul Qadeer did this work in 2008, working with Yuri Pradkin and me (John Heidemann), and continued to work with the patch through its getting committed. We especially thank Chris Douglas at Yahoo for shepherding patch through the Hadoop bug tracking system, including helping clean it up and add test cases. And we thank Doug Cutting for initially suggesting bzip2 as a splittable compression scheme.

This work was supported by NSF through the MR-Net research project (CNS-0823774).