Exploring MapReduce efficiency with highly-distributed data

Michael Cardosa, Chenyu Wang, Anshuman Nangia, Abhishek Chandra, Jon Weissman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

55 Scopus citations

Abstract

MapReduce is a highly popular paradigm for high-performance computing over large data sets on large-scale platforms. However, when both the source data and the computing platform are widely distributed, e.g., when data is collected in separate data center locations, choosing the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show that the traditional single-cluster MapReduce setup may not be suitable when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce configurations, depending on the workload and data set.

Original language: English (US)
Title of host publication: MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications
Pages: 27-33
Number of pages: 7
State: Published - 2011
Event: 2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011 - San Jose, CA, United States
Duration: Jun 8 2011 → Jun 8 2011

Publication series

Name: MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications

Other

Other: 2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011
Country/Territory: United States
City: San Jose, CA
Period: 6/8/11 → 6/8/11

Keywords

  • MapReduce
  • distributed systems
