Transactions of the Association for Computational Linguistics, 2022-01, Vol.10, p.73-91

2022

Author(s) / Contributors

Title

Canine : Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Is part of

Transactions of the Association for Computational Linguistics, 2022-01, Vol.10, p.73-91

Place / Publisher

One Rogers Street, Cambridge, MA 02142-1209, USA: MIT Press

Year of publication

2022

Source

Elektronische Zeitschriftenbibliothek (Open access)

Descriptions/Notes

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBERT model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
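
The two ideas named in the abstract, vocabulary-free character input and downsampling of the resulting longer sequence, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `to_codepoints` and `downsample` are hypothetical, mean pooling is an assumed stand-in for whatever learned downsampling layer the model uses, the rate of 4 is chosen only for illustration, and the sketch pools raw scalar values rather than learned embedding vectors.

```python
def to_codepoints(text: str) -> list[int]:
    """Map each character to its Unicode code point; no vocabulary lookup is involved."""
    return [ord(ch) for ch in text]


def downsample(values: list[float], rate: int) -> list[float]:
    """Average each consecutive window of `rate` positions (the last window may be shorter).

    Assumption: mean pooling stands in for a learned downsampling layer.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(values), rate):
        window = values[start:start + rate]
        pooled.append(sum(window) / len(window))
    return pooled


codepoints = to_codepoints("Canine")
pooled = downsample([float(c) for c in codepoints], rate=4)
print(codepoints)   # [67, 97, 110, 105, 110, 101]
print(len(pooled))  # 2: six character positions reduced to two
```

Because the input is a sequence of code points, any string in any script can be encoded without an out-of-vocabulary case; the cost is a longer sequence, which is what the downsampling step is meant to offset before the deep transformer stack.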

Language: English
Identifiers: ISSN: 2307-387X
eISSN: 2307-387X
DOI: 10.1162/tacl_a_00448
Title ID: cdi_mit_journals_10_1162_tacl_a_00448

Format: –
Keywords: Algorithms, Coders, Language, Linguistics, Morphology, Training, Vocabulary
