
Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

Bayer, Markus ; Kaufhold, Marc-André ; Buchhold, Björn ; Keller, Marcel ; Dallmeyer, Jörg ; Reuter, Christian (2022)
Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers.
In: International Journal of Machine Learning and Cybernetics, 2021
doi: 10.26083/tuprints-00022164
Article, Secondary publication, Publisher's Version

Warning: There is a more recent version of this item available.

Abstract

In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.

Item Type: Article
Published: 2022
Creators: Bayer, Markus ; Kaufhold, Marc-André ; Buchhold, Björn ; Keller, Marcel ; Dallmeyer, Jörg ; Reuter, Christian
Type of entry: Secondary publication
Title: Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers
Language: English
Date: 2022
Place of Publication: Darmstadt
Year of primary publication: 2021
Publisher: Springer
Journal or Publication Title: International Journal of Machine Learning and Cybernetics
Collation: 16 pages
DOI: 10.26083/tuprints-00022164
URL / URN: https://tuprints.ulb.tu-darmstadt.de/22164
Origin: Secondary publication service
Uncontrolled Keywords: Textual data augmentation, Small text data analytics, Text generation, Long and short text classifier
Status: Publisher's Version
URN: urn:nbn:de:tuda-tuprints-221643
Classification DDC: 000 Generalities, computers, information > 004 Computer science
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Science and Technology for Peace and Security (PEASEC)
Research Fields
Research Fields > Information and Intelligence
Research Fields > Information and Intelligence > Cybersecurity & Privacy
Date Deposited: 05 Sep 2022 13:19
Last Modified: 07 Sep 2022 09:08
