Details

Author(s) / Contributors
Jonathan H. Clark; Dan Garrette; Iulia Turc; John Wiita
Title
Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Is part of
  • Transactions of the Association for Computational Linguistics, 2022-01, Vol.10, p.73-91
Place / Publisher
One Rogers Street, Cambridge, MA 02142-1209, USA: MIT Press
Year of publication
2022
Link to full text
Source
Elektronische Zeitschriftenbibliothek (Open access)
Descriptions/Notes
  • Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBERT model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
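  • Illustrative note: the architectural idea summarized in the abstract (embed raw characters, downsample the sequence, then run a deep transformer stack over the shorter sequence) can be sketched roughly as below. This is a hypothetical, minimal PyTorch sketch and not the authors' Canine implementation; the class name CharDownsampleEncoder, the bucketed embedding, the downsampling rate, and all layer sizes are assumptions chosen only for illustration.

    import torch
    import torch.nn as nn

    class CharDownsampleEncoder(nn.Module):
        """Hypothetical sketch: character embeddings -> strided downsampling -> deep transformer."""

        def __init__(self, d_model=256, rate=4, deep_layers=6, hash_buckets=1024):
            super().__init__()
            # Codepoints are folded into a fixed bucket space, so no subword vocabulary is needed.
            self.char_embed = nn.Embedding(hash_buckets, d_model)
            # A strided convolution shortens the character sequence by `rate`
            # before the expensive deep transformer stack runs over it.
            self.downsample = nn.Conv1d(d_model, d_model, kernel_size=rate, stride=rate)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
            self.deep_stack = nn.TransformerEncoder(layer, num_layers=deep_layers)
            self.hash_buckets = hash_buckets

        def forward(self, codepoints):
            # codepoints: (batch, num_chars) integer Unicode codepoints
            x = self.char_embed(codepoints % self.hash_buckets)      # (B, L, D)
            x = self.downsample(x.transpose(1, 2)).transpose(1, 2)   # (B, L/rate, D)
            return self.deep_stack(x)                                # contextual representations

    text = "tokenization-free encoding"
    ids = torch.tensor([[ord(c) for c in text]])
    print(CharDownsampleEncoder()(ids).shape)  # e.g. torch.Size([1, 6, 256])

    Running self-attention only over the downsampled sequence is what keeps character-level input tractable; the published model also projects back up to character positions for span-prediction tasks, which this sketch omits.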
Language
English
Identifiers
ISSN: 2307-387X
eISSN: 2307-387X
DOI: 10.1162/tacl_a_00448
Title ID: cdi_mit_journals_10_1162_tacl_a_00448
