Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets

Jamison, Emily ; Gurevych, Iryna
Eds.: Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai (2014)
Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets.
Phuket, Thailand
Conference publication, Bibliography

Abstract

Crowdsourced data annotation is noisier than annotation from trained workers. Previous work has shown that redundant annotations can eliminate the agreement gap between crowdsource workers and trained workers. Redundant annotation is usually non-problematic because individual crowdsource judgments are inconsequentially cheap in a class-balanced dataset. However, redundant annotation on class-imbalanced datasets requires many more labels per instance. In this paper, using three class-imbalanced corpora, we show that annotation redundancy for noise reduction is very expensive on a class-imbalanced dataset, and should be discarded for instances receiving a single common-class label. We also show that this simple technique produces annotations at approximately the same cost of a metadata-trained, supervised cascading machine classifier, or about 70% cheaper than 5-vote majority-vote aggregation.
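
The stopping rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `get_label` callback, the binary "common"/"rare" label set, and the simple worker-accuracy model are all assumptions made for this example, and the numbers it produces are not results from the paper.

```python
from collections import Counter
from typing import Callable, Tuple


def label_instance(get_label: Callable[[], str],
                   common: str = "common",
                   max_votes: int = 5) -> Tuple[str, int]:
    """Label one instance; return (final label, number of labels bought).

    Hypothetical rule: if the first crowd label is the common class, accept it
    and stop. Otherwise buy the remaining votes and take the majority.
    Assumes binary labels and an odd max_votes, so the majority is never tied.
    """
    first = get_label()
    if first == common:
        return first, 1
    votes = [first] + [get_label() for _ in range(max_votes - 1)]
    return Counter(votes).most_common(1)[0][0], max_votes


def expected_labels_per_instance(rare_rate: float,
                                 worker_accuracy: float,
                                 max_votes: int = 5) -> float:
    """Expected labels bought per instance under the rule above.

    Assumes (for illustration only) that every worker is correct with the
    same probability, independent of the instance's true class.
    """
    # The first label reads "rare" for a correctly labeled rare instance
    # or for a mislabeled common instance.
    p_first_rare = (rare_rate * worker_accuracy
                    + (1 - rare_rate) * (1 - worker_accuracy))
    # One label is always bought; the other max_votes - 1 only after a "rare".
    return 1 + p_first_rare * (max_votes - 1)


if __name__ == "__main__":
    # Illustrative inputs, not values taken from the paper.
    per_instance = expected_labels_per_instance(rare_rate=0.1, worker_accuracy=0.8)
    print(f"expected labels per instance: {per_instance:.2f} (fixed 5-vote scheme: 5)")
```

Under these assumptions the saving comes from the common class dominating the data: most instances stop after one label, and only first labels that read "rare" trigger the full vote.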

Item type: Conference publication
Published: 2014
Editors: Aroonmanakun, Wirote ; Boonkwan, Prachya ; Supnithi, Thepchai
Author(s): Jamison, Emily ; Gurevych, Iryna
Type of entry: Bibliography
Title: Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets
Language: English
Date of publication: December 2014
Publisher: Department of Linguistics, Chulalongkorn University
Book title: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing
Event location: Phuket, Thailand
URL / URN: http://www.aclweb.org/anthology/Y/Y14/Y14-1030.pdf
Uncontrolled keywords: Secure Data;reviewed;UKP_p_ItForensics;UKP_a_TexMinAn
ID number: TUD-CS-2014-0991
Division(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
LOEWE
LOEWE > LOEWE Centers
LOEWE > LOEWE Centers > CASED – Center for Advanced Security Research Darmstadt
Date deposited: 31 Dec 2016 14:29
Last modified: 24 Jan 2020 12:03