Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets

Jamison, Emily and Gurevych, Iryna
Aroonmanakun, Wirote and Boonkwan, Prachya and Supnithi, Thepchai (eds.):

Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets.
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing. Department of Linguistics, Chulalongkorn University.
[Conference or Workshop Item], (2014)

Official URL: http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf

Abstract

Crowdsourced data annotation is noisier than annotation from trained workers. Previous work has shown that redundant annotations can eliminate the agreement gap between crowdsource workers and trained workers. Redundant annotation is usually unproblematic because individual crowdsource judgments are inconsequentially cheap on a class-balanced dataset. However, redundant annotation of class-imbalanced datasets requires many more labels per instance. In this paper, using three class-imbalanced corpora, we show that annotation redundancy for noise reduction is very expensive on a class-imbalanced dataset and should be abandoned for instances that receive a single common-class label. We also show that this simple technique produces annotations at approximately the same cost as a metadata-trained, supervised cascading machine classifier, or about 70% cheaper than 5-vote majority-vote aggregation.
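
The cost argument in the abstract can be illustrated with a short sketch. The Python fragment below is a minimal illustration, not the authors' implementation: the class names, the 10% worker noise rate, and the 5%-rare corpus are assumptions chosen for the example. It compares the labeling cost of plain 5-vote aggregation against the variant in which an instance is dropped from further annotation as soon as its first label is the common class.

    import random

    COMMON, RARE = "common", "rare"   # hypothetical class names for the sketch
    VOTES = 5                         # per-instance budget of the 5-vote baseline

    def noisy_label(true_class, noise=0.1):
        """Simulated crowd worker: returns the true class, flipped with prob. `noise`."""
        if random.random() < noise:
            return RARE if true_class == COMMON else COMMON
        return true_class

    def majority_vote_cost(true_classes):
        """Baseline: every instance always receives all VOTES labels."""
        return VOTES * len(true_classes)

    def early_discard_cost(true_classes):
        """Scheme from the abstract: collect one label first; only instances
        whose first label is the rare class get the remaining redundant votes."""
        cost = 0
        for c in true_classes:
            cost += 1                    # the first label is always collected
            if noisy_label(c) == RARE:   # candidate rare-class instance
                cost += VOTES - 1        # complete the redundant annotation
        return cost

    if __name__ == "__main__":
        random.seed(0)
        data = [RARE] * 50 + [COMMON] * 950   # assumed 5%-rare corpus
        base = majority_vote_cost(data)
        saved = early_discard_cost(data)
        print(f"5-vote baseline: {base} labels")
        print(f"early-discard:   {saved} labels ({1 - saved / base:.0%} cheaper)")

Under these assumed numbers the expected cost per instance is 1 + 4·P(first label is rare) ≈ 1.6 labels instead of 5, i.e. roughly 69% cheaper, which is in the same range as the approximately 70% savings the abstract reports.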

Item Type: Conference or Workshop Item
Published: 2014
Editors: Aroonmanakun, Wirote and Boonkwan, Prachya and Supnithi, Thepchai
Creators: Jamison, Emily and Gurevych, Iryna
Title: Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets
Language: English
Title of Book: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing
Publisher: Department of Linguistics, Chulalongkorn University
Uncontrolled Keywords: Secure Data; reviewed; UKP_p_ItForensics; UKP_a_TexMinAn
Divisions: Department of Computer Science
Department of Computer Science > Ubiquitous Knowledge Processing
LOEWE
LOEWE > LOEWE-Zentren
LOEWE > LOEWE-Zentren > CASED – Center for Advanced Security Research Darmstadt
Event Location: Phuket, Thailand
Date Deposited: 31 Dec 2016 14:29
Identification Number: TUD-CS-2014-0991