Jamison, Emily ; Gurevych, Iryna
Eds.: Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai (2014)
Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets.
Phuket, Thailand
Conference publication, Bibliography
Abstract
Crowdsourced data annotation is noisier than annotation from trained workers. Previous work has shown that redundant annotations can eliminate the agreement gap between crowdsource workers and trained workers. Redundant annotation is usually non-problematic because individual crowdsource judgments are inconsequentially cheap in a class-balanced dataset. However, redundant annotation on class-imbalanced datasets requires many more labels per instance. In this paper, using three class-imbalanced corpora, we show that annotation redundancy for noise reduction is very expensive on a class-imbalanced dataset, and should be discarded for instances receiving a single common-class label. We also show that this simple technique produces annotations at approximately the same cost of a metadata-trained, supervised cascading machine classifier, or about 70% cheaper than 5-vote majority-vote aggregation.
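The rule described in the abstract (stop buying labels for an instance once it has received a single common-class label) can be sketched as follows. This is only one illustrative reading of the abstract, not the authors' implementation: the function name `label_with_early_stop`, the `get_label` callback, the class names `"common"` and `"rare"`, and the fallback to a 5-vote majority for the remaining instances are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def label_with_early_stop(
    get_label: Callable[[], str],
    common_class: str = "common",  # hypothetical class name
    max_votes: int = 5,            # assumed redundancy for non-discarded instances
) -> Tuple[str, int]:
    """Return (final_label, number_of_labels_purchased) for one instance."""
    first = get_label()
    # A single common-class label is accepted without further redundancy.
    if first == common_class:
        return first, 1
    # Otherwise keep collecting labels and take a majority vote.
    votes = [first]
    while len(votes) < max_votes:
        votes.append(get_label())
    # With two classes and an odd max_votes there are no ties.
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, len(votes)


def labels_from(seq: Iterable[str]) -> Callable[[], str]:
    """Hypothetical helper: serve a fixed sequence of crowd labels."""
    it = iter(seq)
    return lambda: next(it)


print(label_with_early_stop(labels_from(["common"])))
print(label_with_early_stop(labels_from(["rare", "rare", "common", "rare", "common"])))
```

In this sketch an instance whose first label is the common class costs one label instead of five, which is where the savings on a heavily imbalanced dataset would come from; the actual savings depend on the class ratio and worker accuracy, which the sketch does not model.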
Item type: | Conference publication |
---|---|
Published: | 2014 |
Editors: | Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai |
Author(s): | Jamison, Emily ; Gurevych, Iryna |
Entry category: | Bibliography |
Title: | Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets |
Language: | English |
Date of publication: | December 2014 |
Publisher: | Department of Linguistics, Chulalongkorn University |
Book title: | Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing |
Event location: | Phuket, Thailand |
URL / URN: | http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf |
Uncontrolled keywords: | Secure Data;reviewed;UKP_p_ItForensics;UKP_a_TexMinAn |
ID number: | TUD-CS-2014-0991 |
Department(s)/Division(s): | 20 Department of Computer Science; 20 Department of Computer Science > Ubiquitous Knowledge Processing; LOEWE; LOEWE > LOEWE Centres; LOEWE > LOEWE Centres > CASED – Center for Advanced Security Research Darmstadt |
Date deposited: | 31 Dec 2016 14:29 |
Last modified: | 24 Jan 2020 12:03 |