
Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity

Reimers, Nils ; Beyer, Philip ; Gurevych, Iryna (2016)
Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity.
Osaka, Japan
Conference publication, bibliography

Abstract

Semantic Textual Similarity (STS) is a foundational NLP task and can be used in a wide range of tasks. Hundreds of different STS systems exist to determine the STS of two texts. For an NLP system designer, however, it is hard to decide which system is the best one. To answer this question, an intrinsic evaluation of the STS systems is conducted by comparing the output of each system to human judgments of semantic similarity. The comparison is usually done using Pearson correlation. In this work, we show that relying on intrinsic evaluations with Pearson correlation can be misleading. In three common STS-based tasks, we observed that Pearson correlation was especially ill-suited to detecting the best STS system for the task, and that other evaluation measures were much better suited. We define how the validity of an intrinsic evaluation can be assessed and compare different intrinsic evaluation methods. Understanding the properties of the targeted task is crucial, and we propose a framework for conducting the intrinsic evaluation that takes these properties into account.
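
As a rough illustration of why the choice of evaluation measure matters, the sketch below scores two made-up STS systems against made-up human judgments using Pearson correlation and, as one possible alternative, Spearman rank correlation. The data, the system names, and the choice of Spearman are assumptions for illustration only; they are not taken from the paper, and the abstract does not name the alternative measures the authors evaluated.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data, invented for this sketch (not from the paper).
gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]      # human similarity judgments
system_a = [0.1, 0.2, 0.3, 0.5, 1.0, 5.0]  # correct ordering, nonlinear scale
system_b = [1.0, 0.0, 2.0, 3.0, 5.0, 4.0]  # roughly linear, two swapped pairs

for name, scores in (("A", system_a), ("B", system_b)):
    print(f"system {name}: pearson={pearson(gold, scores):.3f} "
          f"spearman={spearman(gold, scores):.3f}")
```

On this toy data, Pearson correlation prefers system B, while Spearman correlation prefers system A, which orders all pairs correctly. A task that only needs the ranking of text pairs (for example, retrieval) would therefore be served better by A, even though A scores lower on Pearson correlation.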

Item type: Conference publication
Published: 2016
Author(s): Reimers, Nils ; Beyer, Philip ; Gurevych, Iryna
Type of entry: Bibliography
Title: Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity
Language: English
Publication date: December 2016
Book title: Proceedings of the 26th International Conference on Computational Linguistics (COLING)
Event location: Osaka, Japan
URL / URN: http://aclweb.org/anthology/C16-1009

Uncontrolled keywords: UKP_reviewed
ID number: TUD-CS-2016-1451
Division(s)/department(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
DFG Research Training Groups
DFG Research Training Groups > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Date deposited: 31 Dec 2016 14:29
Last modified: 24 Jan 2020 12:03