Abstract
Large-scale data analytics over geographically distributed data sources is challenging, primarily because resource availability, such as wide area network (WAN) bandwidth, is constrained and heterogeneous. In this work, we look at the problem of generating random samples over joins for geo-distributed data sources. Joins are among the most fundamental yet expensive operations in data analytics. To reduce the cost of computing joins, existing techniques efficiently generate a random sample over the join result in centralized environments, where all the data is available in one location. These techniques fail to address the unique challenges posed by geo-distributed environments. To address these challenges, we propose a sampling technique that aims to reduce WAN traffic and latency, thereby reducing the overall latency of generating samples over joins for geo-distributed data sources. We implement our geo-distributed sampling technique on top of Apache Spark and compare it with existing state-of-the-art sampling techniques to identify scenarios where the proposed approach gives significant benefits. Based on this exploration, we provide a detailed outline of additional factors that should be considered when designing a WAN-aware join sampling technique for geo-distributed environments.
Original language | English (US) |
---|---|
Title of host publication | EdgeSys 2022 - Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking, Part of EuroSys 2022 |
Publisher | Association for Computing Machinery, Inc |
Pages | 13-18 |
Number of pages | 6 |
ISBN (Electronic) | 9781450392532 |
DOIs | |
State | Published - Apr 5 2022 |
Event | 5th International Workshop on Edge Systems, Analytics and Networking, EdgeSys 2022, in conjunction with ACM EuroSys 2022 - Virtual, Online, France; Duration: Apr 5 2022 → Apr 8 2022 |
Publication series
Name | EdgeSys 2022 - Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking, Part of EuroSys 2022 |
---|---|
Conference
Conference | 5th International Workshop on Edge Systems, Analytics and Networking, EdgeSys 2022, in conjunction with ACM EuroSys 2022 |
---|---|
Country/Territory | France |
City | Virtual, Online |
Period | 4/5/22 → 4/8/22 |
Bibliographical note
Publisher Copyright: © 2022 ACM.
Keywords
- cloud
- edge
- geo-distributed systems
- join sampling