DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse

Eckart de Castilho, Richard ; Gurevych, Iryna (2009)
DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse.
Nantes, France
Conference publication, Bibliography

Abstract

User-generated discourse from Web 2.0 poses particular challenges to natural language processing (NLP) due to its noise and error proneness. A data cleansing step preceding the analysis steps in an NLP pipeline can reduce the problems. While recent efforts provide general-purpose collections of UIMA-based analysis components, data cleansing seems not yet to be covered. The five-stage data cleansing approach proposed here offers a maximum of flexibility in identifying problematic artifacts, deciding how to deal with them and analysing cleansed data. Simultaneously, it allowed us to create reusable UIMA-based components for the actual data cleansing and for mapping annotations created on the clean data back to the original representation. These components are released as part of the Darmstadt Knowledge Processing Software Repository (DKPro) under the name of DKPro-UGD.

Item type: Conference publication
Published: 2009
Author(s): Eckart de Castilho, Richard ; Gurevych, Iryna
Entry type: Bibliography
Title: DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse
Language: English
Publication date: July 2009
Book title: Online-proceedings of the First French-speaking meeting around the framework Apache UIMA
Event location: Nantes, France
URL / URN: https://pdfs.semanticscholar.org/16e0/ae274740aaa1eb9a1c8a36...
Uncontrolled keywords: UKP_p_DKPro;UKP_p_THESEUS;UKP_p_CLARIND;UKP_s_DKPro_Core
ID number: TUD-CS-2009-0078
Division(s)/department(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Date deposited: 31 Dec 2016 14:29
Last modified: 24 Jan 2020 12:03