Investigating Paraphrasing-Based Data Augmentation for Task-Oriented Dialogue Systems

Vogel, Liane ; Flek, Lucie
Eds.: Sojka, Petr ; Horak, Ales ; Kopecek, Ivan ; Pala, Karel (2022)
Investigating Paraphrasing-Based Data Augmentation for Task-Oriented Dialogue Systems.
25th International Conference on Text, Speech, and Dialogue. Brno, Czech Republic (6-9 September 2022)
doi: 10.1007/978-3-031-16270-1_39
Conference publication, Bibliography

Abstract

With synthetic data generation, the required amount of human-generated training data can be reduced significantly. In this work, we explore the usage of automatic paraphrasing models such as GPT-2 and CVAE to augment template phrases for task-oriented dialogue systems while preserving the slots. Additionally, we systematically analyze how far manually annotated training data can be reduced. We extrinsically evaluate the performance of a natural language understanding system on augmented data on various levels of data availability, reducing manually written templates by up to 75% while preserving the same level of accuracy. We further point out that the typical NLG quality metrics such as BLEU or utterance similarity are not suitable to assess the intrinsic quality of NLU paraphrases, and that public task-oriented NLU datasets such as ATIS and SNIPS have severe limitations.

Item type: Conference publication
Published: 2022
Editors: Sojka, Petr ; Horak, Ales ; Kopecek, Ivan ; Pala, Karel
Author(s): Vogel, Liane ; Flek, Lucie
Entry type: Bibliography
Title: Investigating Paraphrasing-Based Data Augmentation for Task-Oriented Dialogue Systems
Language: English
Publication date: 16 September 2022
Publisher: Springer
Book title: Text, Speech, and Dialogue
Series: Lecture Notes in Computer Science
Series volume: 13502
Event title: 25th International Conference on Text, Speech, and Dialogue
Event location: Brno, Czech Republic
Event dates: 6-9 September 2022
DOI: 10.1007/978-3-031-16270-1_39
Department(s)/Research area(s): 20 Department of Computer Science
20 Department of Computer Science > Data and AI Systems
Date deposited: 08 Feb 2023 09:06
Last modified: 11 May 2023 15:17
PPN: 507740270