Reimers, Nils ; Gurevych, Iryna (2017)
Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging.
Copenhagen, Denmark
Conference publication, Bibliography
Abstract
In this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate this for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10^{-4}) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point F1-score depending on the selected seed value, making these systems perceived either as state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we present network architectures that perform superior as well as produce results with higher stability on unseen data.
Entry type: | Conference publication |
---|---|
Published: | 2017 |
Author(s): | Reimers, Nils ; Gurevych, Iryna |
Record type: | Bibliography |
Title: | Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging |
Language: | English |
Publication date: | September 2017 |
Book title: | Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP) |
Event location: | Copenhagen, Denmark |
URL / URN: | http://aclweb.org/anthology/D17-1035 |
ID number: | TUD-CS-2017-0150 |
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Ubiquitous Knowledge Processing; DFG Research Training Groups; DFG Research Training Groups > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources |
Date deposited: | 04 Jul 2017 09:53 |
Last modified: | 24 Jan 2020 12:03 |