
Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets

Jamison, Emily ; Gurevych, Iryna
eds.: Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai (2014)
Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets.
Phuket, Thailand
Conference or Workshop Item, Bibliography

Abstract

Crowdsourced data annotation is noisier than annotation from trained workers. Previous work has shown that redundant annotations can eliminate the agreement gap between crowdsource workers and trained workers. Redundant annotation is usually non-problematic because individual crowdsource judgments are inconsequentially cheap in a class-balanced dataset. However, redundant annotation on class-imbalanced datasets requires many more labels per instance. In this paper, using three class-imbalanced corpora, we show that annotation redundancy for noise reduction is very expensive on a class-imbalanced dataset, and should be discarded for instances receiving a single common-class label. We also show that this simple technique produces annotations at approximately the same cost as a metadata-trained, supervised cascading machine classifier, or about 70% cheaper than 5-vote majority-vote aggregation.
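A minimal sketch of the cost argument, not from the paper itself: it simulates a 5-vote majority-vote baseline against the early-stopping rule the abstract describes (stop after one label if it is the common class, otherwise collect all five votes). The imbalance ratio (95/5), annotator accuracy (85%), and dataset size are hypothetical placeholders, chosen only to illustrate why savings near 70% are plausible.

import random

# Hypothetical parameters, NOT taken from the paper: a 95/5 class
# imbalance and annotators who label correctly 85% of the time.
COMMON, RARE = "common", "rare"
IMBALANCE = 0.95      # fraction of instances whose true class is common
ACCURACY = 0.85       # per-annotator probability of a correct label
N_INSTANCES = 10_000
FULL_VOTES = 5        # labels bought per instance under majority voting

def noisy_label(true_class: str) -> str:
    """One crowdsourced judgment: correct with probability ACCURACY."""
    if random.random() < ACCURACY:
        return true_class
    return RARE if true_class == COMMON else COMMON

def majority_vote_cost(true_class: str) -> int:
    """Baseline: always buy FULL_VOTES labels per instance."""
    return FULL_VOTES

def early_stop_cost(true_class: str) -> int:
    """Discard redundancy when the first label is the common class;
    otherwise fall back to full 5-vote aggregation."""
    first = noisy_label(true_class)
    return 1 if first == COMMON else FULL_VOTES

random.seed(0)
truths = [COMMON if random.random() < IMBALANCE else RARE
          for _ in range(N_INSTANCES)]
baseline = sum(majority_vote_cost(t) for t in truths)
reduced = sum(early_stop_cost(t) for t in truths)
print(f"5-vote baseline: {baseline} labels")
print(f"early-stop:      {reduced} labels "
      f"({100 * (1 - reduced / baseline):.0f}% cheaper)")

Under these assumed parameters roughly 81% of instances stop after a single label, giving an expected cost near 1.74 labels per instance, about 65% below the 5-vote baseline; this is in the same range as the roughly 70% saving the abstract reports, though the paper's exact figures depend on its corpora.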

Item Type: Conference or Workshop Item
Published: 2014
Editors: Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai
Creators: Jamison, Emily ; Gurevych, Iryna
Type of entry: Bibliography
Title: Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets
Language: English
Date: December 2014
Publisher: Department of Linguistics, Chulalongkorn University
Book Title: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing
Event Location: Phuket, Thailand
URL / URN: http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf

Uncontrolled Keywords: Secure Data; reviewed; UKP_p_ItForensics; UKP_a_TexMinAn
Identification Number: TUD-CS-2014-0991
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
LOEWE
LOEWE > LOEWE-Zentren
LOEWE > LOEWE-Zentren > CASED – Center for Advanced Security Research Darmstadt
Date Deposited: 31 Dec 2016 14:29
Last Modified: 24 Jan 2020 12:03
