Cross-phase optimization in MapReduce

Benjamin Heintz, Chenyu Wang, Abhishek Chandra, Jon Weissman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Scopus citations

Abstract

MapReduce has been designed to accommodate large-scale data-intensive workloads running on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed, including for skewed workloads, iterative applications, and heterogeneous computing environments. Our work continues this exploration by applying MapReduce to widely distributed data across distributed computation resources. This problem arises when datasets are generated at multiple sites, as is common in many scientific domains and, increasingly, in e-commerce applications. It also occurs when multi-site resources, such as geographically separated data centers, are applied to the same MapReduce job. Using Hadoop, we show that the absence of network and node homogeneity and of data locality leads to poor performance. The problem is that the interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. In this paper, we propose new cross-phase optimization techniques that enable independent MapReduce phases to influence one another. We propose techniques that optimize the push and map phases to enable push-map overlap and to allow map behavior to feed back into push dynamics. Similarly, we propose techniques that optimize the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We evaluate the benefits of our techniques on both Amazon EC2 and PlanetLab. The experimental results show the potential of these techniques: performance improves by 7% to 18%, depending on the execution environment and application.

Original language: English (US)
Title of host publication: Proceedings of the IEEE International Conference on Cloud Engineering, IC2E 2013
Pages: 338-347
Number of pages: 10
State: Published - Aug 12 2013
Event: 1st IEEE International Conference on Cloud Engineering, IC2E 2013 - San Francisco, CA, United States
Duration: Mar 25 2013 - Mar 28 2013

Publication series

Name: Proceedings of the IEEE International Conference on Cloud Engineering, IC2E 2013

Other

Other: 1st IEEE International Conference on Cloud Engineering, IC2E 2013
Country/Territory: United States
City: San Francisco, CA
Period: 3/25/13 - 3/28/13

Keywords

  • Cloud
  • Distributed
  • MapReduce
  • Scheduling
