Klie, Jan-Christoph (2024)
Improving Natural Language Dataset Annotation Quality and Efficiency.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00026580
Dissertation, first publication, publisher's version
Abstract
Annotated data is essential in many scientific disciplines, including natural language processing, linguistics, language acquisition research, bioinformatics, healthcare, and the digital humanities. Datasets are used to train and evaluate machine learning models, to deduce new knowledge, and to suggest appropriate revisions to existing theories. Especially in machine learning, large, high-quality datasets play a crucial role in advancing the field and evaluating new approaches. Two concerns are central when creating such datasets: annotation efficiency and annotation quality. This thesis improves on both.
While annotated data is fundamental and sought after, creating it via manual annotation is expensive, time-consuming, and often requires experts. It is therefore highly desirable to reduce the cost and increase the speed of data annotation, two significant aspects of annotation efficiency. In this thesis, we therefore propose several ways of improving annotation efficiency, including human-in-the-loop label suggestions, interactive annotator training, and community annotation.
To train well-performing models and to evaluate them accurately, the data itself needs to be of the highest quality. Errors in a dataset can degrade downstream task performance and lead to biased or even harmful predictions. In addition, when erroneous data is used to evaluate or compare model architectures, algorithms, training regimes, or other scientific contributions, the relative ranking of the compared approaches might change; dataset errors can thus cause incorrect conclusions to be drawn. Most machine learning work focuses on developing new models and methods, while data quality is often overlooked. To alleviate quality issues, this thesis presents two contributions to improving annotation quality. First, we analyze best practices of annotation quality management, examine how it is conducted in practice, and derive recommendations for future dataset creators on how to structure the annotation process and manage quality. Second, we survey the field of automatic annotation error detection, formalize it, and re-implement and study the effectiveness of the most commonly used methods. Based on extensive experiments, we provide insights and recommendations on which methods should be used in which context.
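As a minimal illustration of the kind of method the annotation error detection survey covers, the sketch below implements a simple model-based detector, often called Retag in the literature: train a classifier with cross-validation and flag instances whose out-of-fold prediction disagrees with the annotated label. The toy dataset, model choice, and scikit-learn pipeline are illustrative assumptions, not the thesis' actual experimental setup.

```python
# A hedged sketch of a model-based annotation error detection baseline
# ("Retag"-style): flag instances where a cross-validated classifier
# disagrees with the gold annotation. Illustrative setup, not the
# thesis' implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Tiny toy corpus; item 4 ("an absolute delight ...") is deliberately mislabeled.
texts = [
    "the movie was fantastic",
    "terrible plot and wooden acting",
    "i loved every minute of it",
    "worst film of the year",
    "an absolute delight from start to finish",
    "painfully boring throughout",
    "a charming and clever story",
    "dull characters and a predictable ending",
]
labels = ["pos", "neg", "pos", "neg", "neg", "neg", "pos", "neg"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Out-of-fold predictions, so no instance is scored by a model that saw it.
predicted = cross_val_predict(model, texts, labels, cv=3)

# Disagreements between prediction and annotation are candidate errors,
# to be passed on for manual review rather than corrected automatically.
suspects = [i for i, (gold, pred) in enumerate(zip(labels, predicted)) if gold != pred]
print("candidate annotation errors:", suspects)
```

On such a tiny toy sample the flagged set is noisy; in practice, methods of this family are run on full datasets and ranked or thresholded before human review.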
Item Type: | Dissertation
---|---
Published: | 2024
Creators: | Klie, Jan-Christoph
Type of entry: | Primary publication
Title: | Improving Natural Language Dataset Annotation Quality and Efficiency
Language: | English
Referees: | Gurevych, Prof. Dr. Iryna ; Webber, Prof. Ph.D. Bonnie
Date of publication: | 7 June 2024
Place of publication: | Darmstadt
Collation: | xi, 242 pages
Date of oral examination: | 18 April 2024
DOI: | 10.26083/tuprints-00026580
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/26580
Status: | Publisher's version
URN: | urn:nbn:de:tuda-tuprints-265805
Dewey Decimal Classification (DDC): | 000 Generalities, computer science, information science > 004 Computer science
Divisions: | 20 Department of Computer Science ; 20 Department of Computer Science > Ubiquitous Knowledge Processing
TU projects: | DFG\|GU798/21-1\|Infrastruktur für in ; DFG\|EC503/1-1\|Infrastruktur für in
Date deposited: | 07 Jun 2024 12:07
Last modified: | 11 Jun 2024 06:12