Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, p.495-506

2010

Author(s) / Contributors

Title

Efficient parallel set-similarity joins using MapReduce

Is part of

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, p.495-506

Place / Publisher

New York, NY, USA: ACM

Year of publication

2010

Source

ACM Digital Library

Descriptions/Notes

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
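As a rough illustration of what a set-similarity self-join computes, the sketch below simulates a map step and a reduce step in a single Python process. It is not the paper's algorithm: the abstract names neither the similarity function nor the three stages, so the Jaccard measure, the prefix-filtering partitioning, and the names `jaccard` and `similarity_self_join` are all assumptions chosen for the example. The threshold is assumed to lie in (0, 1].

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_self_join(records, threshold):
    """Return sorted (id1, id2, similarity) triples with Jaccard >= threshold.

    `records` maps a record id to a set of tokens. Assumes 0 < threshold <= 1.
    """
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in records.values() for tok in toks)

    # "Map": send each record to one group per token in its prefix.
    # Prefix filtering (an assumption of this sketch): two sets that meet
    # the threshold must share at least one token within these prefixes.
    groups = defaultdict(list)
    for rid, toks in records.items():
        ordered = sorted(toks, key=lambda tok: (freq[tok], tok))
        # The small epsilon guards against float round-up in the product;
        # a prefix that is slightly too long only adds candidates.
        needed = ceil(threshold * len(ordered) - 1e-9)
        prefix_len = len(ordered) - needed + 1
        for tok in ordered[:prefix_len]:
            groups[tok].append(rid)

    # "Reduce": verify each candidate pair once, inside its group.
    results = {}
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if (a, b) not in results:
                sim = jaccard(records[a], records[b])
                if sim >= threshold:
                    results[(a, b)] = sim
    return sorted((a, b, sim) for (a, b), sim in results.items())
```

Calling `similarity_self_join({"r1": {"a", "b", "c"}, "r2": {"a", "b", "d"}}, 0.5)` would pair `r1` with `r2`, since they share two of four distinct tokens. Routing records only by their prefix tokens, rather than by every token, is one way to limit how often a record is replicated across groups.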

Language: English
Identifiers: ISBN: 1450300324, 9781450300322
DOI: 10.1145/1807167.1807222
Title ID: cdi_acm_books_10_1145_1807167_1807222

Format: –
Subject headings: Information systems -- Data management systems -- Database management system engines -- Database query processing, Information systems -- Data management systems -- Database management system engines -- Parallel and distributed DBMSs, Theory of computation -- Theory and algorithms for application domains -- Database theory -- Database query processing and optimization (theory)