Lee, Ji-Ung (2024)
Constrained Generation and Adaptive Selection of C-Tests.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00027274
Dissertation, first publication, publisher's version
Abstract
Increasing globalization and immigration are driving the importance of multilingual proficiency. Being able to communicate across different languages is already one of the key competencies that can define success; moreover, various institutions such as the European Council or the United Nations High Commissioner for Refugees predict that this trend will intensify even further with climate change and rising refugee numbers. Despite these concerning developments, a shortage of proficient human translators remains, while existing automated solutions fall far short of the requirements. For instance, current translation tools have been shown to perform substantially worse in low-resource languages or in specialized domains such as law or medicine, causing real-world harm when used uncritically. Large language models (LLMs) still exhibit biases and hallucinations, which renders them unreliable. At the same time, the continuing shortage of teachers leads to a widening gap in language learning opportunities. While self-directed learning and intelligent tutoring systems (ITS) have the potential to alleviate some of these issues, research in this area suffers from limited available data, a result of proprietary software and data protection regulations. This calls for methods that are capable of learning efficiently from little user feedback. The goal of this thesis is to provide new language learning opportunities by devising methods that reduce the workload of teachers and that empower learners to learn in a self-directed manner. For evaluation, we use C-Tests, a type of gap-filling exercise that is similar to cloze tests but less ambiguous. In the first part of this thesis, we develop novel methods for generating C-Tests. In contrast to previous work, our methods, which are based on heuristics and constrained optimization, are capable of generating C-Tests with a specific target difficulty.
Moreover, our method based on mixed-integer programming allows teachers to pose specific constraints that are guaranteed to be satisfied, resulting in C-Tests that better suit their needs. In the second part of this thesis, we devise a new sampling method to interactively train a C-Test selection model. We draw inspiration from active learning, which aims to improve model training by annotating only those instances that presumably help the model most (model objective). At first glance, active learning seems unfit for educational scenarios, as it can select instances that are more difficult to annotate or, likewise, result in C-Tests that do not suit a learner’s current proficiency. Conversely, selecting only instances that suit the learner’s current proficiency, ideally with high certainty (user objective), will result in feedback that is uninformative for the model. We show that it is indeed possible to sample instances that optimize both objectives and that this results in C-Tests that benefit model and learner more than sampling instances for each objective individually. Finally, we explore interactive data annotation as a scenario that could benefit from our joint sampling strategy. We first develop an application that showcases the usefulness of interactive data annotation in a scenario where domain experts can interactively annotate data to ease their work. We then show that annotation studies in general involve a learning process, and devise annotation curricula, a method for reordering annotated instances that significantly reduces annotation time.
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2024 | ||||
Author(s): | Lee, Ji-Ung | ||||
Type of entry: | First publication | ||||
Title: | Constrained Generation and Adaptive Selection of C-Tests | ||||
Language: | English | ||||
Referees: | Gurevych, Prof. Dr. Iryna ; Zesch, Prof. Dr. Torsten | ||||
Date of publication: | 26 July 2024 | ||||
Place: | Darmstadt | ||||
Collation: | xiv, 233 pages | ||||
Date of oral examination: | 9 July 2024 | ||||
DOI: | 10.26083/tuprints-00027274 | ||||
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/27274 | ||||
Status: | Publisher's version | ||||
URN: | urn:nbn:de:tuda-tuprints-272746 | ||||
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science | ||||
Department(s)/field(s): | 20 Department of Computer Science ; 20 Department of Computer Science > Ubiquitous Knowledge Processing |
TU projects: | DFG|GU798/20-1|Argumentationsanalys ; DFG|GU798/27-1|EVIDENCE: Computer-u ; EU/EFRE|20005482|TexPrax - Gurevych ; HA(Hessen Agentur)|521/17-03|a! automated languag |
Date deposited: | 26 Jul 2024 12:10 | ||||
Last modified: | 29 Jul 2024 07:40 | ||||