Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics

Joel Wolfrath, Abhishek Chandra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Modern applications are increasingly generating and persisting data across geo-distributed data centers or edge clusters rather than a single cloud. This paradigm introduces challenges for traditional query execution due to increased latency when transferring data over wide-area network links. Join queries in particular are heavily affected, due to their large output size and amount of data that must be shuffled over the network. Join sampling—computing a uniform sample from the join results—is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining on the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
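To make the uniformity problem concrete, here is a minimal single-machine sketch of classic weighted join sampling. It is not the Plexus implementation, and the abstract does not name the existing algorithm it generalizes, so treat this only as an illustration of the general idea. The function name `weighted_join_sample` and the `(key, payload)` row layout are assumptions made for this example. Each left-side row is drawn with probability proportional to its number of matching right-side rows, and then one matching right-side row is chosen uniformly, which makes every tuple of the full join equally likely.

```python
import random
from collections import defaultdict


def weighted_join_sample(left, right, n, seed=0):
    """Draw n tuples, with replacement, uniformly from the equi-join of
    left and right. Rows are (key, payload) pairs (hypothetical layout)."""
    rng = random.Random(seed)

    # Index the right-side payloads by join key.
    right_by_key = defaultdict(list)
    for key, payload in right:
        right_by_key[key].append(payload)

    # Keep only left rows that have at least one match; the weight of a
    # left row is the number of right rows sharing its key.
    candidates = [row for row in left if row[0] in right_by_key]
    if not candidates:
        return []
    weights = [len(right_by_key[key]) for key, _ in candidates]

    # Step 1: pick left rows in proportion to their join fan-out.
    picks = rng.choices(candidates, weights=weights, k=n)
    # Step 2: pick one matching right row uniformly for each left row.
    return [(key, lp, rng.choice(right_by_key[key])) for key, lp in picks]


left = [("a", "l1"), ("b", "l2")]
right = [("a", "r1"), ("a", "r2"), ("a", "r3"), ("b", "r4")]
sample = weighted_join_sample(left, right, 4000, seed=1)
```

In this toy input, key `a` accounts for three of the four join tuples, so the left row for `a` is drawn about three times as often as the row for `b`. Sampling each side independently and joining the samples would not reproduce that ratio, which is the difficulty the abstract describes. In a geo-distributed deployment the weights themselves depend on data held at other sites, so computing them requires communication across the network.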

Original language: English (US)
Title of host publication: SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing
Publisher: Association for Computing Machinery, Inc
Pages: 1-16
Number of pages: 16
ISBN (Electronic): 9798400703874
State: Published - Oct 30 2023
Event: 14th ACM Symposium on Cloud Computing, SoCC 2023 - Santa Cruz, United States
Duration: Oct 30 2023 to Nov 1 2023

Publication series

Name: SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing

Conference

Conference: 14th ACM Symposium on Cloud Computing, SoCC 2023
Country/Territory: United States
City: Santa Cruz
Period: 10/30/23 to 11/1/23

Bibliographical note

Publisher Copyright:
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Keywords

  • Distributed Systems
  • Join Algorithms
  • Query Optimization
  • Wide Area Network
