Exploring MapReduce efficiency with highly-distributed data

Michael Cardosa; Chenyu Wang; Anshuman Nangia; Abhishek Chandra; Jon Weissman

doi:10.1145/1996092.1996100

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

55 Scopus citations

Abstract

MapReduce is a highly popular paradigm for high-performance computing over large data sets on large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed, e.g., when data is collected in separate data center locations, choosing the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show that the traditional single-cluster MapReduce setup may not be suitable for situations in which data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.

Original language	English (US)
Title of host publication	MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications
Pages	27-33
Number of pages	7
DOIs	https://doi.org/10.1145/1996092.1996100
State	Published - 2011
Event	2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011 - San Jose, CA, United States
Duration	Jun 8 2011 → Jun 8 2011

Publication series

Name	MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications

Other

Other	2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011
Country/Territory	United States
City	San Jose, CA
Period	6/8/11 → 6/8/11

Keywords

MapReduce
distributed systems

Access

10.1145/1996092.1996100

Cite this

Cardosa, M., Wang, C., Nangia, A., Chandra, A., & Weissman, J. (2011). Exploring MapReduce efficiency with highly-distributed data. In MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications (pp. 27-33). (MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications). https://doi.org/10.1145/1996092.1996100

Exploring MapReduce efficiency with highly-distributed data. / Cardosa, Michael; Wang, Chenyu; Nangia, Anshuman et al.
MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications. 2011. p. 27-33 (MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications).

Cardosa, M, Wang, C, Nangia, A, Chandra, A & Weissman, J 2011, Exploring MapReduce efficiency with highly-distributed data. in MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications. MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications, pp. 27-33, 2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011, San Jose, CA, United States, 6/8/11. https://doi.org/10.1145/1996092.1996100

@inproceedings{6c880a0344b54a3da3a0ac43f21275c4,
title = "Exploring MapReduce efficiency with highly-distributed data",
abstract = "MapReduce is a highly-popular paradigm for high-performance computing over large data sets in large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed, e.g. data is collected in separate data center locations, the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show the traditional single-cluster MapReduce setup may not be suitable for situations when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.",
keywords = "MapReduce, distributed systems",
author = "Michael Cardosa and Chenyu Wang and Anshuman Nangia and Abhishek Chandra and Jon Weissman",
year = "2011",
doi = "10.1145/1996092.1996100",
language = "English (US)",
isbn = "9781450307000",
series = "MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications",
pages = "27--33",
booktitle = "MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications",
note = "2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011 ; Conference date: 08-06-2011 Through 08-06-2011",
}

TY  - GEN
T1  - Exploring MapReduce efficiency with highly-distributed data
AU  - Cardosa, Michael
AU  - Wang, Chenyu
AU  - Nangia, Anshuman
AU  - Chandra, Abhishek
AU  - Weissman, Jon
PY  - 2011
Y1  - 2011
N2  - MapReduce is a highly-popular paradigm for high-performance computing over large data sets in large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed, e.g. data is collected in separate data center locations, the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show the traditional single-cluster MapReduce setup may not be suitable for situations when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.
AB  - MapReduce is a highly-popular paradigm for high-performance computing over large data sets in large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed, e.g. data is collected in separate data center locations, the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show the traditional single-cluster MapReduce setup may not be suitable for situations when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.
KW  - MapReduce
KW  - distributed systems
UR  - http://www.scopus.com/inward/record.url?scp=79961048998&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=79961048998&partnerID=8YFLogxK
U2  - 10.1145/1996092.1996100
DO  - 10.1145/1996092.1996100
M3  - Conference contribution
AN  - SCOPUS:79961048998
SN  - 9781450307000
T3  - MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications
SP  - 27
EP  - 33
BT  - MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications
T2  - 2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011
Y2  - 8 June 2011 through 8 June 2011
ER  -
