Automatic Analysis of Flaws in Pre-Trained NLP Models

Eckart de Castilho, Richard:
Automatic Analysis of Flaws in Pre-Trained NLP Models.
[Online-Edition: http://www.aclweb.org/anthology/W16-5203]
Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI3nOIAF2) at COLING 2016
[Conference or Workshop Item], 2016

Official URL: http://www.aclweb.org/anthology/W16-5203

Abstract

Most tools for natural language processing (NLP) today are based on machine learning and come with pre-trained models. In addition, third parties provide pre-trained models for popular NLP tools. The predictive power and accuracy of these tools depend on the quality of these models. Downstream researchers often base their results on pre-trained models instead of training their own. Consequently, pre-trained models are an essential resource for our community. However, to the best of our knowledge, no systematic study of pre-trained models has been conducted so far. This paper reports on the analysis of 274 pre-trained models for six NLP tools and four potential causes of problems: encoding, tokenization, normalization, and change over time. The analysis is implemented in the open-source tool Model Investigator. Our work 1) allows model consumers to better assess whether a model is suitable for their task, 2) enables tool and model creators to sanity-check their models before distributing them, and 3) enables improvements in tool interoperability by performing automatic adjustments of normalization or other pre-processing based on the models used.
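Two of the four flaw categories the abstract names, encoding and normalization, can be illustrated with a minimal sketch. The following is a hypothetical example, not the paper's actual Model Investigator implementation: it scans a model's vocabulary for characters typical of mojibake (UTF-8 bytes decoded with the wrong charset) and for entries that are not in Unicode NFC form, which would fail lookups against NFC-normalized input text.

```python
import unicodedata

# Characters that commonly appear when UTF-8 text is mis-decoded
# as Latin-1, plus the Unicode replacement character.
MOJIBAKE_MARKERS = ("\u00c3", "\u00c2", "\ufffd")

def check_vocabulary(vocab):
    """Return suspicious vocabulary entries grouped by flaw type.

    Illustrative only: real model formats require tool-specific
    extraction of the vocabulary before a check like this can run.
    """
    flaws = {"encoding": [], "normalization": []}
    for token in vocab:
        # Encoding flaw: telltale mojibake characters in the token.
        if any(marker in token for marker in MOJIBAKE_MARKERS):
            flaws["encoding"].append(token)
        # Normalization flaw: token differs from its NFC form, so it
        # can never match NFC-normalized text at prediction time.
        if token != unicodedata.normalize("NFC", token):
            flaws["normalization"].append(token)
    return flaws
```

For example, a decomposed "cafe\u0301" would be flagged under normalization, while "caf\u00c3\u00a9" (the mojibake form of "café") would be flagged under encoding.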

Item Type: Conference or Workshop Item
Published: 2016
Creators: Eckart de Castilho, Richard
Title: Automatic Analysis of Flaws in Pre-Trained NLP Models
Language: English

Title of Book: Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI3nOIAF2) at COLING 2016
Uncontrolled Keywords: CEDIFOR;UKP_s_DKPro_Core;UKP_p_DKPro;UKP_reviewed;UKP_p_OpenMinTeD
Divisions: Department of Computer Science
Department of Computer Science > Ubiquitous Knowledge Processing
DFG-Graduiertenkollegs
DFG-Graduiertenkollegs > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Event Location: Osaka, Japan
Date Deposited: 31 Dec 2016 14:29
Identification Number: TUD-CS-2016-14654