Computational linguistics - Association for Computational Linguistics, 2022-04, Vol.48 (1), p.5-42
2022

Details

Author(s) / Contributors
Title
To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP
Is part of
  • Computational linguistics - Association for Computational Linguistics, 2022-04, Vol.48 (1), p.5-42
Place / Publisher
One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA: MIT Press
Year of publication
2022
Link to full text
Source
ACM Digital Library
Descriptions/Notes
  • Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counterattack this problem is text augmentation, that is, generating new synthetic training data points from existing data. Although NLP has recently witnessed several new textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families using various models, including the architectures that rely on pretrained multilingual contextualized language models such as . Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general rather than analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on , especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and model type (e.g., token-level augmentation provides significant improvements for , while character-level ones give generally higher scores for and based models).
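    The character-level category mentioned in the abstract (character swapping) can be illustrated with a minimal Python sketch. This is not the authors' implementation; the function names `swap_chars` and `augment_chars`, the `prob` parameter, and the choice to leave the first and last character untouched are assumptions made for illustration only. Because the token count does not change, the per-token labels of a sequence tagging example can be copied over unchanged.

```python
import random


def swap_chars(token, rng):
    """Swap two adjacent interior characters (hypothetical helper).

    Tokens shorter than four characters are returned unchanged, and the
    first and last characters are never moved.
    """
    if len(token) < 4:
        return token
    i = rng.randrange(1, len(token) - 2)  # interior index; i + 1 is also interior
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_chars(tokens, tags, prob, rng):
    """Return a noisy copy of one tagged sentence.

    Each token is perturbed with probability `prob`; the tag sequence is
    copied as-is because token boundaries are preserved.
    """
    new_tokens = [swap_chars(t, rng) if rng.random() < prob else t for t in tokens]
    return new_tokens, list(tags)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same sample
    tokens = ["Augmentation", "helps", "dependency", "parsing", "."]
    tags = ["NOUN", "VERB", "NOUN", "NOUN", "PUNCT"]
    print(augment_chars(tokens, tags, prob=0.5, rng=rng))
```

    Each call yields one synthetic training sentence; in this sketch the amount of noise is controlled solely by the assumed `prob` parameter.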
Language
English
Identifiers
ISSN: 0891-2017
eISSN: 1530-9312
DOI: 10.1162/coli_a_00425
Title ID: cdi_mit_journals_coliv48i1_317952_2022_04_06_zip_coli_a_00425
