Automatic Analysis of Flaws in Pre-Trained NLP Models

Eckart de Castilho, Richard (2016):
Automatic Analysis of Flaws in Pre-Trained NLP Models.
In: Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI3nOIAF2) at COLING 2016, Osaka, Japan, ISBN 978-4-87974-720-4,
[Online edition: http://www.aclweb.org/anthology/W16-5203],
[Conference or Workshop Item]

Abstract

Most tools for natural language processing (NLP) today are based on machine learning and come with pre-trained models. In addition, third parties provide pre-trained models for popular NLP tools. The predictive power and accuracy of these tools depend on the quality of these models. Downstream researchers often base their results on pre-trained models instead of training their own. Consequently, pre-trained models are an essential resource to our community. However, to the best of our knowledge, no systematic study of pre-trained models has been conducted so far. This paper reports on the analysis of 274 pre-trained models for six NLP tools and four potential causes of problems: encoding, tokenization, normalization, and change over time. The analysis is implemented in the open-source tool Model Investigator. Our work 1) allows model consumers to better assess whether a model is suitable for their task, 2) enables tool and model creators to sanity-check their models before distributing them, and 3) enables improvements in tool interoperability by performing automatic adjustments of normalization or other pre-processing based on the models used.
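The abstract names encoding, tokenization, and normalization as typical sources of flaws in pre-trained models. As a rough illustration only, the following Python sketch shows how such symptoms might be flagged in a vocabulary extracted from a model. The marker lists, the function names classify_token and scan_vocabulary, and the sample data are hypothetical assumptions for this sketch, not the paper's actual Model Investigator implementation.

    # Hypothetical sketch (not the published Model Investigator code):
    # scan a model's extracted token vocabulary for symptoms of three of
    # the flaw classes the paper names: encoding, tokenization, and
    # normalization problems.

    import unicodedata

    # Character patterns that typically surface when UTF-8 text was
    # decoded as Latin-1/Windows-1252 somewhere in the training pipeline
    # (an encoding flaw producing "mojibake").
    MOJIBAKE_MARKERS = ("\u00c3", "\u00e2\u20ac", "\u00c2")

    # PTB-style escapes that betray a specific normalization pipeline.
    PTB_NORMALIZED = {"``", "''", "-LRB-", "-RRB-", "-LSB-", "-RSB-"}

    def classify_token(token):
        """Return a list of flaw indicators found in one vocabulary entry."""
        flags = []
        if any(m in token for m in MOJIBAKE_MARKERS):
            flags.append("possible-mojibake")
        if token in PTB_NORMALIZED:
            flags.append("ptb-normalization")
        if any(ch.isspace() for ch in token):
            flags.append("whitespace-in-token")   # likely tokenization flaw
        if unicodedata.normalize("NFC", token) != token:
            flags.append("not-nfc")               # normalization inconsistency
        return flags

    def scan_vocabulary(vocab):
        """Map each suspicious vocabulary entry to its flaw indicators."""
        report = {}
        for token in vocab:
            flags = classify_token(token)
            if flags:
                report[token] = flags
        return report

    if __name__ == "__main__":
        # "cafe\u0301" is a decomposed (NFD) accent; the third entry is
        # mojibake for "don't" (UTF-8 bytes read as Windows-1252).
        sample = ["the", "``", "don\u00e2\u20ac\u2122t", "New York", "cafe\u0301"]
        for token, flags in scan_vocabulary(sample).items():
            print(token, "->", ", ".join(flags))

Run against the sample vocabulary, this flags the PTB quote token, the whitespace-containing entry, the mojibake entry, and the non-NFC entry, while passing the clean token through; a real analysis along the paper's lines would apply such checks to the full vocabulary extracted from each model.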

Item Type: Conference or Workshop Item
Published: 2016
Creators: Eckart de Castilho, Richard
Title: Automatic Analysis of Flaws in Pre-Trained NLP Models
Language: English
Abstract: see above.

Title of Book: Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI3nOIAF2) at COLING 2016
ISBN: 978-4-87974-720-4
Uncontrolled Keywords: CEDIFOR;UKP_s_DKPro_Core;UKP_p_DKPro;UKP_reviewed;UKP_p_OpenMinTeD
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
DFG-Graduiertenkollegs
DFG-Graduiertenkollegs > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Event Location: Osaka, Japan
Date Deposited: 31 Dec 2016 14:29
Official URL: http://www.aclweb.org/anthology/W16-5203
Identification Number: TUD-CS-2016-14654