congratulations to Xaio Song for receiving a 2021 USC Viterbi award for MS Student Research

Congratulations to Xiao Song for receiving a 2021 USC Viterbi School of Engineering award for Masters Student Research in the Computer Science Department. This award was on the basis of her work observing work-from-home due to Covid-19, as reported in her poster at the NSF PREPARE-VO Workshop and our arXive technical report.

The award was presented at the May 2021 Viterbi Masters Student Awards Ceremony.


congratulations to Abdul Qadeer for his PhD

I would like to congratulate Dr. Abdul Qadeer for defending his PhD at the University of Southern California in March 2021 and completing his doctoral dissertation “Efficient Processing of Streaming Data in Multi-User and Multi-Abstraction Workflows”.

From the abstract:

Abdul Qadeer after his defense.

Ever-increasing data and evolving processing needs force enterprises to scale-out expensive computational resources to prioritize processing for timely results. Teams process their organization’s data either independently or using ad hoc sharing mechanisms. Often different users start with the same data and the same initial stages (decrypt, decompress, clean, anonymize). As their workflows evolve, later stages often diverge, and different stages may work best with different abstractions. The result is workflows with some overlap, some variations, and multiple transitions where data handling changes between continuous, windowed, and per-block. The system processing this diverse, multi-user, multi-abstraction workflow should be efficient and safe, but also must cope with fault recovery.

Analytics from multiple users can cause redundant processing and data, or encounter performance anomalies due to skew. Skew arises due to static or dynamic imbalance in the workflow stages. Both redundancy and skew waste compute resources and add latency to results. When users bridge between multiple abstractions, such as from per-block processing to windowed processing, they often employ custom code. These transitions can be error prone due to corner cases, can easily add latency as an inefficiency, and custom code is often a source of errors and maintenance difficulty. We need new solutions to manage the above challenges and to expose opportunities for data sharing explicitly. Our thesis is: new methods enable efficient processing of multi-user and multi-abstraction workflows of streaming data. We present two new methods for efficient stream processing—optimizations for multi-user workflows, and multiple abstractions for application coverage and efficient bridging.

These algorithms use a pipeline-graph to detect duplication of code and data across multiple users and cleanly delineate workflow stages for skew management. The pipeline-graph is our job description language that allows developers to specify their need easily and enables our system to automatically detect duplication and manage skew. The pipeline-graph acts as a shared canvas for collaboration amongst users to extend each other’s work. To efficiently implement our deduplication and skew management algorithms, we present streaming data to processing stages as fixed-sized but large blocks. Large-blocks have low meta-data overhead per user, provide good parallelism, and help with fault recovery.

Our second method enables applications to use a different abstraction on a different workflow stage. We provide three key abstractions and show that they cover many classes of analytics and our framework can bridge them efficiently. We provide Block-Streaming, Windowed-Streaming, and Stateful-Streaming abstractions. Block-Streaming is suitable for single-pass applications that care about temporal or spatial locality. Windowed-Streaming allows applications to process accumulated data (time-aligned blocks to sync with external information) and reductions like summation, averages, or other MapReduce-style analytics. We believe our three abstractions allow many classes of analytics and enable processing of one block, many blocks, or infinite stream. Plumb allows multiple abstractions in different parts of the workflow and provides efficient bridging between them so that users could make complex analytics from individual stages without worrying about data movement.

Our methods aim for good throughput, low latency, and clean and easy-to-use support for more applications to achieve better efficiency than our prior hand-tuned but often brittle system. The Plumb framework is the implementation of our solutions and a testbed to validate them. We use real-world workloads from the B-Root DNS domain to demonstrate effectiveness of our solutions. Our processing deduplication increases throughput up to $6\times$, reduces storage by 75%, as compared to their pre-Plumb counterparts. Plumb reduces CPU wastage due to structural skew up to half and reduces latency due to computational skew by 50%. Plumb has cut per-block latency by 74% and latency of daily statistics by 97%, while reducing code size by 58% and lowering manual intervention to handle problems by 73% as compared to pre-Plumb system.

The operational use of Plumb for the B-Root service provides a multi-year validation of our design choices under many traffic conditions. Over the last three years, Plumb has processed more than 12PB of DNS packet data and daily statistics. We show that our abstractions apply to many applications in the domain of networking big-data and beyond.

Students Uncategorized

congratulations to Manaf Gharaibeh for his PhD

I would like to congratulate Dr. Manaf Gharaibeh for defending his PhD at Colorado State University in February 2020 and completing his doctoral dissertation “Characterizing the Visible Address Space to Enable Efficient, Continuous IP Geolocation” in March 2020.

From the abstract:

Manaf Gharaibeh’s phd defense, with Christos Papadopoulos.

Internet Protocol (IP) geolocation is vital for location-dependent applications and many network research problems. The benefits to applications include enabling content customization, proximal server selection, and management of digital rights based on the location of users, to name a few. The benefits to networking research include providing geographic context useful for several purposes, such as to study the geographic deployment of Internet resources, bind cloud data to a location, and to study censorship and monitoring, among others.
The measurement-based IP geolocation is widely considered as the state-of-the-art client-independent approach to estimate the location of an IP address. However, full measurement-based geolocation is prohibitive when applied continuously to the entire Internet to maintain up-to-date IP-to-location mappings. Furthermore, many IP address blocks rarely move, making it unnecessary to perform such full geolocation.
The thesis of this dissertation states that \emph{we can enable efficient, continuous IP geolocation by identifying clusters of co-located IP addresses and their location stability from latency observations.} In this statement, a cluster indicates a group of an arbitrary number of adjacent co-located IP addresses (a few up to a /16). Location stability indicates a measure of how often an IP block changes location. We gain efficiency by allowing IP geolocation systems to geolocate IP addresses as units, and by detecting when a geolocation update is required, optimizations not explored in prior work. We present several studies to support this thesis statement.
We first present a study to evaluate the reliability of router geolocation in popular geolocation services, complementing prior work that evaluates end-hosts geolocation in such services. The results show the limitations of these services and the need for better solutions, motivating our work to enable more accurate approaches. Second, we present a method to identify clusters of \emph{co-located} IP addresses by the similarity in their latency. Identifying such clusters allows us to geolocate them efficiently as units without compromising accuracy. Third, we present an efficient delay-based method to identify IP blocks that move over time, allowing us to recognize when geolocation updates are needed and avoid frequent geolocation of the entire Internet to maintain up-to-date geolocation. In our final study, we present a method to identify cellular blocks by their distinctive variation in latency compared to WiFi and wired blocks. Our method to identify cellular blocks allows a better interpretation of their latency estimates and to study their geographic properties without the need for proprietary data from operators or users.