Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite

Sukhareva, Maria and Fuscagni, Francesco and Daxenberger, Johannes and Görke, Susanne and Prechel, Doris and Gurevych, Iryna (2017):
Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite.
In: LaTeCH-CLfL '17 Proceedings of the 11th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Vancouver, BC, Canada, [Online-Edition: http://www.aclweb.org/anthology/W17-2213],
[Conference or Workshop Item]

Abstract

This paper presents a statistical approach to automatic morphosyntactic annotation of Hittite transcripts. Hittite is an extinct Indo-European language written in the cuneiform script. There are currently no morphosyntactic annotations available for Hittite, so we explored methods of distant supervision. The annotations were projected from parallel German translations of the Hittite texts. To reduce data sparsity, we applied stemming to both the German and the Hittite texts. As there is no off-the-shelf Hittite stemmer, a stemmer for Hittite was developed for this purpose. The resulting annotation projections were used to train a POS tagger, achieving an accuracy of 69% on a test sample. To our knowledge, this is the first attempt at statistical POS tagging of a cuneiform language.
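
The projection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `project_tags`, the tag labels, and the alignment format (pairs of source and target token indices) are assumptions made for illustration only, and the word alignment itself is taken as given.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_length, unaligned_tag=None):
    """Project POS tags from source tokens onto target tokens.

    source_tags   -- one tag per source-language token (e.g. the translation)
    alignment     -- iterable of (source_index, target_index) pairs
    target_length -- number of tokens in the target-language sentence
    unaligned_tag -- value assigned to target tokens with no aligned source token

    Hypothetical helper: when several source tokens align to the same
    target token, the most frequent projected tag is kept.
    """
    votes = [Counter() for _ in range(target_length)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1
    return [v.most_common(1)[0][0] if v else unaligned_tag for v in votes]


# Illustrative input: three tagged source tokens, three target tokens,
# and an assumed alignment that leaves the last target token unaligned.
source_tags = ["DET", "NOUN", "VERB"]
alignment = [(1, 0), (2, 1)]
projected = project_tags(source_tags, alignment, target_length=3)
```

Target tokens that receive no tag stay unlabeled (`None` here) and would have to be filtered or handled separately before the projected tags are used as training data for a tagger.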

Item Type: Conference or Workshop Item
Published: 2017
Creators: Sukhareva, Maria and Fuscagni, Francesco and Daxenberger, Johannes and Görke, Susanne and Prechel, Doris and Gurevych, Iryna
Title: Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite
Language: English
Title of Book: LaTeCH-CLfL '17 Proceedings of the 11th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Uncontrolled Keywords: reviewed;CEDIFOR;UKP_reviewed;UKP_s_DKPro_Core;POS tagging, low resource languages, Hittite
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Event Location: Vancouver, BC, Canada
Date Deposited: 13 Jun 2017 11:45
Official URL: http://www.aclweb.org/anthology/W17-2213
Identification Number: TUD-CS-2017-0133
Projects: CEDIFOR