Information systems (Oxford), 2020-02, Vol.88, p.101455, Article 101455
2020

Details

Author(s) / Contributors
Title
Similarity query support in big data management systems
Is part of
  • Information systems (Oxford), 2020-02, Vol.88, p.101455, Article 101455
Place / Publisher
Oxford: Elsevier Ltd
Year of publication
2020
Source
Alma/SFX Local Collection
Descriptions/Notes
  • Similarity query processing is becoming increasingly important in many applications such as data cleaning, record linkage, Web search, and document analytics. In this paper we study how to provide end-to-end similarity query support natively in a parallel database system. We discuss how to express a similarity predicate in its query language, how to build indexes, how to answer similarity queries (selections and joins) efficiently in the runtime engine, possibly using indexes, and how to optimize similarity queries. One particular challenge is how to incorporate existing similarity join algorithms, which often require a series of steps to achieve a high efficiency, including collecting token frequencies, finding matching record id pairs, and reassembling result records based on id pairs. We present a novel approach that uses existing runtime operators to implement such complex join algorithms without reinventing the wheel; doing so positions the system to automatically benefit from future improvements to those operators. The approach includes a technique to transform a similarity join plan into an efficient operator-based physical plan during query optimization by using a template expressed largely in the system’s user-level query language; this technique greatly simplifies the specification of such a transformation rule. We use Apache AsterixDB, a parallel Big Data management system, to illustrate and validate our techniques. We conduct an experimental study using several large, real datasets on a parallel computing cluster to assess the similarity query support. We also include experiments involving three other parallel systems and report the efficacy and performance results. 
  • Extends the existing query language of a parallel DBMS to support similarity queries.
  • Uses existing operators in the system to implement state-of-the-art techniques.
  • Presents a novel framework called “AQL+” to optimize similarity queries.
  • Includes empirical similarity query experiments using several large, real datasets.
  • Compares the approach with three other parallel systems to show its relative efficacy.
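
The three steps the abstract names for similarity joins (collecting token frequencies, finding matching record id pairs, and reassembling result records from the id pairs) can be illustrated outside any database system. The Python below is only a minimal single-machine sketch under stated assumptions: it is not the paper's AQL+ framework and not AsterixDB code. The function names, the whitespace tokenizer, the choice of Jaccard similarity, and the use of prefix filtering in the second step are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens, treated as a set.
    return set(text.lower().split())


def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_self_join(records, threshold):
    """Hypothetical helper. records maps id -> text.

    Returns a list of ((id1, text1), (id2, text2), similarity) tuples
    whose Jaccard similarity is at least `threshold`.
    """
    tokens = {rid: tokenize(text) for rid, text in records.items()}

    # Step 1: collect global token frequencies.
    freq = Counter(tok for toks in tokens.values() for tok in toks)

    # Step 2: find matching record id pairs.
    # Tokens are ordered rarest-first; only a short prefix of each record
    # is indexed (prefix filtering), which prunes most non-matching pairs.
    index = defaultdict(list)
    candidates = set()
    for rid in sorted(tokens):
        ordered = sorted(tokens[rid], key=lambda t: (freq[t], t))
        n = len(ordered)
        # The small epsilon guards against floating-point round-up.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        for tok in ordered[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Verify each candidate pair with the exact similarity.
    id_pairs = []
    for a, b in sorted(candidates):
        sim = jaccard(tokens[a], tokens[b])
        if sim >= threshold:
            id_pairs.append((a, b, sim))

    # Step 3: reassemble full result records from the id pairs.
    return [((a, records[a]), (b, records[b]), sim) for a, b, sim in id_pairs]
```

Calling `similarity_self_join({1: "big data systems", 2: "big data system"}, 0.5)` would run all three steps on two toy records. In a parallel system each step would correspond to one or more distributed operators, which is the part this sketch deliberately leaves out.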
Language
English
Identifiers
ISSN: 0306-4379
eISSN: 1873-6076
DOI: 10.1016/j.is.2019.101455
Title ID: cdi_proquest_journals_2333949439
