Evaluation Discrepancy Discovery: A Sentence Compression Case-study

Puzikov, Yevgeniy (2021)
Evaluation Discrepancy Discovery: A Sentence Compression Case-study.
doi: 10.48550/arXiv.2101.09079
Report, Bibliography

Abstract

Reliable evaluation protocols are of utmost importance for reproducible NLP research. In this work, we show that sometimes neither metric nor conventional human evaluation is sufficient to draw conclusions about system performance. Using sentence compression as an example task, we demonstrate how a system can game a well-established dataset to achieve state-of-the-art results. In contrast with the results reported in previous work that showed correlation between human judgements and metric scores, our manual analysis of state-of-the-art system outputs demonstrates that high metric scores may only indicate a better fit to the data, but not better outputs, as perceived by humans.

Item type: Report
Published: 2021
Author(s): Puzikov, Yevgeniy
Type of entry: Bibliography
Title: Evaluation Discrepancy Discovery: A Sentence Compression Case-study
Language: English
Date of publication: 22 January 2021
Publisher: arXiv
Series: Computation and Language
Event title: arXiv
Edition: Version 1
DOI: 10.48550/arXiv.2101.09079
URL / URN: https://arxiv.org/abs/2101.09079

Additional information:

Preprint

Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Date deposited: 17 Feb 2021 08:31
Last modified: 11 Jul 2024 07:29