Abstract | Current high-throughput algorithms for constructing inverted files allfollow the MapReduce framework, which presents a high-level programming
model that hides the complexities of parallel programming. In this
paper, we take an alternative approach and develop a novel strategy that
exploits the current and emerging architectures of multicore processors.
Our algorithm is based on a high-throughput pipelined strategy that
produces parallel parsed streams, which are immediately consumed at the
same rate by parallel indexers. We have performed extensive tests of our
algorithm on a cluster of 32 nodes, and were able to achieve a
throughput close to the peak throughput of the I/O system: a throughput
of 280 MB/s on a single node and a throughput that ranges between 5.15
GB/s (1 Gb/s Ethernet interconnect) and 6.12GB/s (10Gb/s InfiniBand
interconnect) on a cluster with 32 nodes for processing the ClueWeb09
dataset. Such a performance represents a substantial gain over the best
known MapReduce algorithms even when comparing the single node
performance of our algorithm to MapReduce algorithms running on large
clusters. Our results shed a light on the extent of the performance
cost that may be incurred by using the simpler, higher-level MapReduce
programming model for large scale applications.
|