Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce

Title	Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce
Publication Type	Conference Papers
Year of Publication	2008
Authors	Jimmy Lin
Conference Name	Proceedings of the Conference on Empirical Methods in Natural Language Processing
Date Published	2008
Publisher	Association for Computational Linguistics
Conference Location	Stroudsburg, PA, USA
Abstract	This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in commercial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. I discuss two barriers contributing to this problem: lack of a suitable programming model for managing concurrency and difficulty in obtaining access to hardware. Hadoop, an open-source implementation of Google's MapReduce framework, provides a compelling solution to both issues. Its simple programming model hides system-level details from the developer, and its ability to run on commodity hardware puts cluster computing within the reach of many academic research groups. This paper illustrates these points with a case study in building word co-occurrence matrices from large corpora. I conclude with an analysis of an alternative computing model based on renting instead of buying computer clusters.
URL	http://dl.acm.org/citation.cfm?id=1613715.1613769
