Evaluation Discrepancy Discovery: A Sentence Compression Case-study

Puzikov, Yevgeniy (2021)
Evaluation Discrepancy Discovery: A Sentence Compression Case-study.
doi: 10.48550/arXiv.2101.09079
Report, Bibliography

Abstract

Reliable evaluation protocols are of utmost importance for reproducible NLP research. In this work, we show that sometimes neither metric nor conventional human evaluation is sufficient to draw conclusions about system performance. Using sentence compression as an example task, we demonstrate how a system can game a well-established dataset to achieve state-of-the-art results. In contrast with the results reported in previous work that showed correlation between human judgements and metric scores, our manual analysis of state-of-the-art system outputs demonstrates that high metric scores may only indicate a better fit to the data, but not better outputs, as perceived by humans.

Item type: Report
Published: 2021
Author(s): Puzikov, Yevgeniy
Type of entry: Bibliography
Title: Evaluation Discrepancy Discovery: A Sentence Compression Case-study
Language: English
Publication date: 22 January 2021
Publisher: arXiv
Series: Computation and Language
Collation: 15 pages
Event title: arXiv
DOI: 10.48550/arXiv.2101.09079
URL / URN: https://arxiv.org/abs/2101.09079

Additional information:

1st version

Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Date deposited: 17 Feb 2021 08:31
Last modified: 19 Dec 2024 10:04