The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce

Title: The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce
Publication Type: Journal Article
Year of Publication: 2009
Authors: Jimmy Lin
Journal: Proceedings of the LSDS-IR Workshop
Pagination: 57–62
Date Published: 2009
Abstract

This paper explores the problem of “stragglers” in MapReduce: a common phenomenon where a small number of mappers or reducers take significantly longer than the others to complete. The effects of these stragglers include unnecessarily long wall-clock running times and sub-optimal cluster utilization. In many cases, this problem cannot simply be attributed to hardware idiosyncrasies, but is rather caused by the Zipfian distribution of input or intermediate data. I present a simple theoretical model that shows how such distributions impose a fundamental limit on the amount of parallelism that can be extracted from a large class of algorithms where all occurrences of the same element must be processed together. A case study in parallel ad hoc query evaluation highlights the severity of the stragglers problem. Fortunately, a simple modification of the input data cuts end-to-end running time in half. This example illustrates some of the issues associated with designing efficient MapReduce algorithms for real-world datasets.