TY - JOUR
T1 - Network Cost-Aware Geo-Distributed Data Analytics System
AU - Oh, Kwangsung
AU - Zhang, Minmin
AU - Chandra, Abhishek
AU - Weissman, Jon
N1 - Publisher Copyright:
© 1990-2012 IEEE.
PY - 2022/6/1
Y1 - 2022/6/1
N2 - Many geo-distributed data analytics (GDA) systems have focused on the network performance-bottleneck: inter-data center network bandwidth to improve performance. Unfortunately, these systems may encounter a cost-bottleneck ($) because they have not considered data transfer cost ($), one of the most expensive and heterogeneous resources in a multi-cloud environment. In this article, we present Kimchi, a network cost-aware GDA system to meet the cost-performance tradeoff by exploiting data transfer cost heterogeneity to avoid the cost-bottleneck. Kimchi determines cost-aware task placement decisions for scheduling tasks given inputs including data transfer cost, network bandwidth, input data size and locations, and desired cost-performance tradeoff preference. In addition, Kimchi is also mindful of data transfer cost in the presence of dynamics. Kimchi has been applied to two common GDA MapReduce models: synchronous barrier and asynchronous push-based shuffle. A Kimchi prototype has been implemented on Spark, and experiments show that it reduces cost by 5% to 24% without impacting performance and reduces query execution time by 45% to 70% without impacting cost compared to other baseline approaches centralized, vanilla Spark, and bandwidth-aware (e.g., Iridium). More importantly, Kimchi allows applications to explore a much richer cost-performance tradeoff space in a multi-cloud environment.
AB - Many geo-distributed data analytics (GDA) systems have focused on the network performance-bottleneck: inter-data center network bandwidth to improve performance. Unfortunately, these systems may encounter a cost-bottleneck ($) because they have not considered data transfer cost ($), one of the most expensive and heterogeneous resources in a multi-cloud environment. In this article, we present Kimchi, a network cost-aware GDA system to meet the cost-performance tradeoff by exploiting data transfer cost heterogeneity to avoid the cost-bottleneck. Kimchi determines cost-aware task placement decisions for scheduling tasks given inputs including data transfer cost, network bandwidth, input data size and locations, and desired cost-performance tradeoff preference. In addition, Kimchi is also mindful of data transfer cost in the presence of dynamics. Kimchi has been applied to two common GDA MapReduce models: synchronous barrier and asynchronous push-based shuffle. A Kimchi prototype has been implemented on Spark, and experiments show that it reduces cost by 5% to 24% without impacting performance and reduces query execution time by 45% to 70% without impacting cost compared to other baseline approaches centralized, vanilla Spark, and bandwidth-aware (e.g., Iridium). More importantly, Kimchi allows applications to explore a much richer cost-performance tradeoff space in a multi-cloud environment.
KW - Geo-distributed data
KW - data analytics system
KW - multi cloud providers
KW - multi-DCs
UR - http://www.scopus.com/inward/record.url?scp=85115126939&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115126939&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2021.3108893
DO - 10.1109/TPDS.2021.3108893
M3 - Article
AN - SCOPUS:85115126939
SN - 1045-9219
VL - 33
SP - 1407
EP - 1420
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 6
ER -