TU Darmstadt / ULB / TUbiblio

Automatic Analysis of Flaws in Pre-Trained NLP Models

Eckart de Castilho, Richard (2016):
Automatic Analysis of Flaws in Pre-Trained NLP Models.
In: Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI3nOIAF2) at COLING 2016, pp. 19-27,
Osaka, Japan, ISBN 978-4-87974-720-4,
[Conference or Workshop Item]

Abstract

Most tools for natural language processing (NLP) today are based on machine learning and come with pre-trained models. In addition, third-parties provide pre-trained models for popular NLP tools. The predictive power and accuracy of these tools depends on the quality of these models. Downstream researchers often base their results on pre-trained models instead of training their own. Consequently, pre-trained models are an essential resource to our community. However, to be best of our knowledge, no systematic study of pre-trained models has been conducted so far. This paper reports on the analysis of 274 pre-models for six NLP tools and four potential causes of problems: encoding, tokenization, normalization, and change over time. The analysis is implemented in the open source tool Model Investigator. Our work 1) allows model consumers to better assess whether a model is suitable for their task, 2) enables tool and model creators to sanity-check their models before distributing them, and 3) enables improvements in tool interoperability by performing automatic adjustments of normalization or other pre-processing based on the models used.

Item Type: Conference or Workshop Item
Erschienen: 2016
Creators: Eckart de Castilho, Richard
Title: Automatic Analysis of Flaws in Pre-Trained NLP Models
Language: English
Abstract:

Most tools for natural language processing (NLP) today are based on machine learning and come with pre-trained models. In addition, third-parties provide pre-trained models for popular NLP tools. The predictive power and accuracy of these tools depends on the quality of these models. Downstream researchers often base their results on pre-trained models instead of training their own. Consequently, pre-trained models are an essential resource to our community. However, to be best of our knowledge, no systematic study of pre-trained models has been conducted so far. This paper reports on the analysis of 274 pre-models for six NLP tools and four potential causes of problems: encoding, tokenization, normalization, and change over time. The analysis is implemented in the open source tool Model Investigator. Our work 1) allows model consumers to better assess whether a model is suitable for their task, 2) enables tool and model creators to sanity-check their models before distributing them, and 3) enables improvements in tool interoperability by performing automatic adjustments of normalization or other pre-processing based on the models used.

Title of Book: Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI3nOIAF2) at COLING 2016
ISBN: 978-4-87974-720-4
Uncontrolled Keywords: CEDIFOR;UKP_s_DKPro_Core;UKP_p_DKPro;UKP_reviewed;UKP_p_OpenMinTeD
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
DFG-Graduiertenkollegs
DFG-Graduiertenkollegs > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Event Location: Osaka, Japan
Date Deposited: 31 Dec 2016 14:29
Official URL: http://www.aclweb.org/anthology/W16-5203
Identification Number: TUD-CS-2016-14654
Export:
Suche nach Titel in: TUfind oder in Google
Send an inquiry Send an inquiry

Options (only for editors)
Show editorial Details Show editorial Details