To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Computational linguistics - Association for Computational Linguistics, 2022-04, Vol.48 (1), p.5-42

Şahin, Gözde Gül

2022

Details

Author(s) / Contributors

Şahin, Gözde Gül

Title

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Is part of

Computational linguistics - Association for Computational Linguistics, 2022-04, Vol.48 (1), p.5-42

Place / Publisher

One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA: MIT Press

Year of publication

2022

Source

ACM Digital Library

Descriptions / Notes

Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counteract this problem is text augmentation, that is, generating new synthetic training data points from existing data. Although NLP has recently witnessed several new textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families using various models, including the architectures that rely on pretrained multilingual contextualized language models such as . Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general rather than analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on , especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and model type (e.g., token-level augmentation provides significant improvements for , while character-level ones give generally higher scores for and based models).
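
As an illustration of two of the augmentation categories named in the abstract, the following is a minimal Python sketch of character swapping (character level) and random word insertion (token level). It is not the article's implementation: the function names, the swap probability, and the toy vocabulary are all assumptions made for this example, and the sketch ignores the tag or label alignment that sequence tagging data would need.

```python
import random


def swap_adjacent_chars(token, rng):
    """Swap one random pair of adjacent inner characters (hypothetical helper)."""
    if len(token) < 4:
        return token  # too short to swap without touching the first or last character
    i = rng.randrange(1, len(token) - 2)  # first and last characters stay fixed
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_chars(tokens, prob, rng):
    """Character-level augmentation: perturb each token with probability `prob`.

    The token count is unchanged, so per-token labels would stay aligned.
    """
    return [swap_adjacent_chars(t, rng) if rng.random() < prob else t for t in tokens]


def insert_random_words(tokens, vocab, n, rng):
    """Token-level augmentation: insert `n` words drawn from `vocab` at random positions.

    The vocabulary is an assumption of this sketch; inserted tokens carry no labels here.
    """
    out = list(tokens)
    for _ in range(n):
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    sentence = ["augmentation", "helps", "low", "resource", "parsing"]
    print(augment_chars(sentence, 0.5, rng))
    print(insert_random_words(sentence, ["very", "quite"], 2, rng))
```

Each call returns a new list and leaves the input sentence untouched, so several synthetic variants can be generated from one original sentence.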

Language: English
Identifiers: ISSN: 0891-2017
eISSN: 1530-9312
DOI: 10.1162/coli_a_00425
Title ID: cdi_mit_journals_coliv48i1_317952_2022_04_06_zip_coli_a_00425

Subjects: Artificial neural networks, Comparative linguistics, Comparative studies, Data points, Labelling, Language modeling, Marking, Morphology, Natural language processing, Neural networks, Parsing, Semantic roles, Semantics, Speech, Syntax, Tagging (Computational linguistics), Vietnamese
