
Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets

Jamison, Emily ; Gurevych, Iryna
Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai (eds.) :

Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets.
[Online-Edition: http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf]
In: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing. Department of Linguistics, Chulalongkorn University
[Conference publication], (2014)

Official URL: http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf

Abstract

Crowdsourced data annotation is noisier than annotation from trained workers. Previous work has shown that redundant annotations can eliminate the agreement gap between crowdsource workers and trained workers. Redundant annotation is usually non-problematic because individual crowdsource judgments are inconsequentially cheap in a class-balanced dataset. However, redundant annotation on class-imbalanced datasets requires many more labels per instance. In this paper, using three class-imbalanced corpora, we show that annotation redundancy for noise reduction is very expensive on a class-imbalanced dataset, and should be discarded for instances receiving a single common-class label. We also show that this simple technique produces annotations at approximately the same cost of a metadata-trained, supervised cascading machine classifier, or about 70% cheaper than 5-vote majority-vote aggregation.
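The early-stopping rule described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the label strings "common" and "rare", and the choice to fall back to a full 5-vote majority whenever the first label is not common-class are all assumptions made for illustration, and the toy vote lists are invented solely to show how labels are counted.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent label among the votes."""
    return Counter(votes).most_common(1)[0][0]


def annotate(votes: Sequence[str], common: str = "common") -> Tuple[str, int]:
    """Return (final_label, labels_paid_for) for one instance.

    Assumed rule: buy one label first. If it is the common class,
    accept it and stop. Otherwise buy the remaining labels and
    take the majority vote over all of them.
    """
    if votes[0] == common:
        return common, 1
    return majority_vote(votes), len(votes)


def total_cost(pool: List[Sequence[str]]) -> int:
    """Total number of labels bought across all instances."""
    return sum(annotate(votes)[1] for votes in pool)


# Hypothetical toy pool: each inner list holds up to five crowd labels.
pool = [
    ["common", "common", "common", "common", "common"],
    ["common", "rare", "common", "common", "common"],
    ["rare", "rare", "common", "rare", "rare"],
    ["rare", "common", "common", "common", "common"],
]

results = [annotate(votes) for votes in pool]
early_stop_cost = total_cost(pool)
full_redundancy_cost = sum(len(votes) for votes in pool)
```

In this toy pool, the first two instances stop after one label each, while the last two use all five, so the sketch buys 12 labels where full 5-vote redundancy would buy 20. The savings reported in the abstract come from the authors' experiments on their corpora, not from this example.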

Item type: Conference publication
Published: 2014
Editors: Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai
Author(s): Jamison, Emily ; Gurevych, Iryna
Title: Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets
Language: English
Book title: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing
Publisher: Department of Linguistics, Chulalongkorn University
Uncontrolled keywords: Secure Data;reviewed;UKP_p_ItForensics;UKP_a_TexMinAn
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
LOEWE
LOEWE > LOEWE Centers
LOEWE > LOEWE Centers > CASED – Center for Advanced Security Research Darmstadt
Event location: Phuket, Thailand
Date deposited: 31 Dec 2016 14:29
Official URL: http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf
ID number: TUD-CS-2014-0991