Jamison, Emily ; Gurevych, Iryna
eds.: Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai (2014)
Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets.
Phuket, Thailand
Conference or Workshop Item, Bibliography
Abstract
Crowdsourced data annotation is noisier than annotation from trained workers. Previous work has shown that redundant annotations can eliminate the agreement gap between crowdsource workers and trained workers. Redundant annotation is usually non-problematic because individual crowdsource judgments are inconsequentially cheap in a class-balanced dataset. However, redundant annotation on class-imbalanced datasets requires many more labels per instance. In this paper, using three class-imbalanced corpora, we show that annotation redundancy for noise reduction is very expensive on a class-imbalanced dataset, and should be discarded for instances receiving a single common-class label. We also show that this simple technique produces annotations at approximately the same cost of a metadata-trained, supervised cascading machine classifier, or about 70% cheaper than 5-vote majority-vote aggregation.
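The early-stopping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one plausible reading of the rule: labels are requested one at a time, an instance is accepted as common-class as soon as any single common-class label arrives, and it is labeled rare-class only if every requested label is rare-class. The names `label_with_early_stop`, `label_with_majority_vote`, and `TOY_VOTES`, as well as the toy votes themselves, are hypothetical and carry no empirical meaning.

```python
from typing import List, Sequence, Tuple

COMMON, RARE = 0, 1  # hypothetical class encoding


def label_with_early_stop(votes: Sequence[int], max_votes: int = 5) -> Tuple[int, int]:
    """Request labels one at a time; stop at the first common-class label.

    Returns (final_label, number_of_labels_requested).
    Assumed rule: a single common-class label discards the instance.
    """
    for requested, vote in enumerate(votes[:max_votes], start=1):
        if vote == COMMON:
            return COMMON, requested
    return RARE, min(max_votes, len(votes))


def label_with_majority_vote(votes: Sequence[int], n_votes: int = 5) -> Tuple[int, int]:
    """Always request n_votes labels and take the majority."""
    used = list(votes[:n_votes])
    final = RARE if sum(used) * 2 > len(used) else COMMON
    return final, len(used)


# Made-up votes for five instances; each inner list is the order in which
# crowd labels would arrive. Only the fourth instance is rare-class.
TOY_VOTES: List[List[int]] = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
]

early = [label_with_early_stop(v) for v in TOY_VOTES]
majority = [label_with_majority_vote(v) for v in TOY_VOTES]

early_cost = sum(cost for _, cost in early)
majority_cost = sum(cost for _, cost in majority)
print("early-stop labels requested:", early_cost)
print("majority-vote labels requested:", majority_cost)
```

Because common-class instances dominate an imbalanced dataset, most instances stop after one label under this rule, which is where the savings would come from. The sketch also makes the trade-off visible: a rare-class instance that receives one erroneous common-class label is discarded.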
Item Type: | Conference or Workshop Item |
---|---|
Published: | 2014 |
Editors: | Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai |
Creators: | Jamison, Emily ; Gurevych, Iryna |
Type of entry: | Bibliography |
Title: | Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets |
Language: | English |
Date: | December 2014 |
Publisher: | Department of Linguistics, Chulalongkorn University |
Book Title: | Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing |
Event Location: | Phuket, Thailand |
URL / URN: | http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf |
Uncontrolled Keywords: | Secure Data;reviewed;UKP_p_ItForensics;UKP_a_TexMinAn |
Identification Number: | TUD-CS-2014-0991 |
Divisions: | 20 Department of Computer Science ; 20 Department of Computer Science > Ubiquitous Knowledge Processing ; LOEWE ; LOEWE > LOEWE-Zentren ; LOEWE > LOEWE-Zentren > CASED – Center for Advanced Security Research Darmstadt |
Date Deposited: | 31 Dec 2016 14:29 |
Last Modified: | 24 Jan 2020 12:03 |