
DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse

Eckart de Castilho, Richard and Gurevych, Iryna (2009):
DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse.
In: Online-proceedings of the First French-speaking meeting around the framework Apache UIMA, Nantes, France, [Online-Edition: https://pdfs.semanticscholar.org/16e0/ae274740aaa1eb9a1c8a36...],
[Conference or Workshop Item]

Abstract

User-generated discourse from Web 2.0 poses particular challenges to natural language processing (NLP) because it is noisy and error-prone. A data cleansing step preceding the analysis steps in an NLP pipeline can reduce these problems. While recent efforts provide general-purpose collections of UIMA-based analysis components, data cleansing does not yet seem to be covered. The five-stage data cleansing approach proposed here offers maximum flexibility in identifying problematic artifacts, deciding how to deal with them, and analysing the cleansed data. It also allowed us to create reusable UIMA-based components for the actual data cleansing and for mapping annotations created on the clean data back to the original representation. These components are released as part of the Darmstadt Knowledge Processing Software Repository (DKPro) under the name DKPro-UGD.

Item Type: Conference or Workshop Item
Published: 2009
Creators: Eckart de Castilho, Richard and Gurevych, Iryna
Title: DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse
Language: English
Title of Book: Online-proceedings of the First French-speaking meeting around the framework Apache UIMA
Uncontrolled Keywords: UKP_p_DKPro;UKP_p_THESEUS;UKP_p_CLARIND;UKP_s_DKPro_Core
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Event Location: Nantes, France
Date Deposited: 31 Dec 2016 14:29
Official URL: https://pdfs.semanticscholar.org/16e0/ae274740aaa1eb9a1c8a36...
Identification Number: TUD-CS-2009-0078