Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

Reimers, Nils and Gurevych, Iryna:
Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches.
[Online-Edition: https://arxiv.org/abs/1803.09578]
In: arXiv:1803.09578
[Article], (2018)

Official URL: https://arxiv.org/abs/1803.09578

Abstract

Developing state-of-the-art approaches for specific tasks is a major driving force in our research community. Depending on the prestige of the task, publishing a new state-of-the-art result can come with a lot of visibility. This raises the question: how reliable are our evaluation methodologies for comparing approaches? One common methodology for identifying the state of the art is to partition the data into a train, a development, and a test set. Researchers train and tune their approach on part of the dataset and then select the model that performed best on the development set for a final evaluation on unseen test data. Test scores from different approaches are compared, and performance differences are tested for statistical significance. In this publication, we show that there is a high risk that a statistically significant difference in this type of evaluation is not due to a superior learning approach; instead, there is a high risk that the difference is due to chance. For example, for the CoNLL 2003 NER dataset we observed type I errors (false positives) in up to 26% of the cases with a threshold of p < 0.05, i.e., we falsely concluded a statistically significant difference between two identical approaches. We prove that this evaluation setup is unsuitable for comparing learning approaches. We formalize alternative evaluation setups based on score distributions.
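To make the failure mode concrete, the following is a minimal simulation sketch (not taken from the paper; all constants and names are illustrative assumptions). Two "approaches" are in fact identical, but each is represented by a single trained model whose test-set accuracy wobbles with the random seed. Comparing one model per approach with a paired significance test then frequently reports a "significant" difference, i.e., a type I error at the approach level:

import numpy as np

rng = np.random.default_rng(42)

N_TEST = 3000          # size of the fixed test set
N_EXPERIMENTS = 200    # repetitions of "train one model per approach, compare"
N_PERMUTATIONS = 1000  # resamples for the paired sign-flip test
BASE_ACC = 0.90        # true accuracy of the (single, shared) approach
SEED_NOISE = 0.01      # seed-dependent wobble of a trained model's accuracy

def train_and_predict(rng):
    """Simulate one trained model: per-example correctness on the test set.

    The model's accuracy is BASE_ACC plus seed-dependent noise, so two runs
    of the *same* approach end up with different test scores.
    """
    acc = BASE_ACC + rng.normal(0.0, SEED_NOISE)
    return rng.random(N_TEST) < acc  # boolean array: example answered correctly?

def sign_flip_p_value(correct_a, correct_b, rng):
    """Two-sided paired sign-flip (approximate randomization) test on accuracy."""
    diff = correct_a.astype(float) - correct_b.astype(float)
    observed = abs(diff.mean())
    flips = rng.choice([-1.0, 1.0], size=(N_PERMUTATIONS, N_TEST))
    permuted = np.abs((flips * diff).mean(axis=1))
    return float((permuted >= observed).mean())

false_positives = 0
for _ in range(N_EXPERIMENTS):
    correct_a = train_and_predict(rng)  # "approach A": one training run
    correct_b = train_and_predict(rng)  # "approach B": same approach, new seed
    if sign_flip_p_value(correct_a, correct_b, rng) < 0.05:
        false_positives += 1

print(f"Type I error rate at the approach level: {false_positives / N_EXPERIMENTS:.0%}")

In this toy setting the printed rate typically lands far above the nominal 5%, mirroring the magnitude of the up-to-26% figure the abstract reports for CoNLL 2003. The remedy the abstract points to, comparing score distributions rather than single scores, can be sketched by reusing train_and_predict and rng from above: train each approach several times and compare the resulting samples of test scores (Welch's t-test is used here only as one illustrative choice, not as the paper's prescribed protocol; scipy is assumed to be available):

from scipy import stats

def score_distribution(rng, n_runs=10):
    """Test accuracy of n_runs independent training runs of one approach."""
    return np.array([train_and_predict(rng).mean() for _ in range(n_runs)])

scores_a = score_distribution(rng)
scores_b = score_distribution(rng)  # identical approach, different seeds

t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)
print(f"p-value when comparing two score distributions of the same approach: {p_value:.3f}")

Because the seed-induced variance is now part of the measured distributions, the type I error rate of this comparison stays close to the nominal significance level.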

Item Type: Article
Published: 2018
Creators: Reimers, Nils and Gurevych, Iryna
Title: Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches
Language: English
Journal or Publication Title: arXiv:1803.09578
Divisions: Department of Computer Science
Department of Computer Science > Ubiquitous Knowledge Processing
DFG-Graduiertenkollegs
DFG-Graduiertenkollegs > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Date Deposited: 20 Jun 2018 11:28
Official URL: https://arxiv.org/abs/1803.09578