Matuschek, Michael (2015)
Word Sense Alignment of Lexical Resources.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
Lexical-semantic resources (LSRs) are a cornerstone of many areas of Natural Language Processing (NLP), such as word sense disambiguation and information extraction. LSRs exist in many varieties: they focus on different information types and languages, or are constructed according to different paradigms. However, even this large number of LSRs cannot meet the growing demand for large-scale resources for different languages and application purposes. Thus, the orchestrated usage of different LSRs is necessary in order to cover more words and senses, and to gain access to a richer knowledge representation when a word sense is covered in more than one resource. In this thesis, we address the task of finding equivalent senses in these resources, which is known as Word Sense Alignment (WSA), and report various contributions to this area.
First, we give a formal definition of WSA and describe suitable evaluation metrics and baselines for this task. Then, we position WSA in the broad area of semantic processing by comparing it to related tasks from NLP and other fields, establishing that WSA indeed displays a unique set of properties and challenges which need to be addressed.
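The abstract does not spell out the evaluation metrics at this point, although it later reports precision and F-measure. As an illustration only, the following minimal sketch assumes that an alignment is a set of (source sense, target sense) pairs and scores a predicted alignment against a gold standard. The function name and the sense identifiers are hypothetical and do not come from the thesis.

```python
def alignment_scores(predicted, gold):
    """Return (precision, recall, f1) for a predicted alignment.

    Both arguments are collections of (source_sense_id, target_sense_id) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sense identifiers, for illustration only.
gold = {("wkt:bank#1", "ow:bank#a"), ("wkt:bank#2", "ow:bank#b")}
predicted = {("wkt:bank#1", "ow:bank#a"), ("wkt:bank#2", "ow:bank#c")}
precision, recall, f1 = alignment_scores(predicted, gold)
```

Treating the alignment as a set of pairs keeps the scoring independent of how the pairs were produced, so the same function can score any alignment approach.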
After that, we discuss the resources we employ for WSA, distinguishing between expert-built and collaboratively constructed resources. We give a brief description and refer to related work for each resource, and we discuss the collaboratively constructed, multilingual resource OmegaWiki in greater detail, as it has not been exhaustively covered in previous work and also presents a unique, concept-centered and language-agnostic structure, which makes it interesting for NLP applications. At the same time, we shed light on disadvantages of this approach and gaps in OmegaWiki's content. After the presentation of the resources, we perform a comparative analysis of them which focuses on their suitability for different approaches to WSA. In particular, we analyze their glosses as well as their structure and point out flaws and differences between them. Based on this, we motivate the selection of resource pairs we investigate and describe the WSA gold standard datasets they participate in. On top of the ones presented in previous work, we discuss four new datasets we created, filling gaps in the body of WSA research.
We then go on to present an alignment between Wiktionary and OmegaWiki, using a similarity-based framework. This is the first time such a framework is applied to two collaboratively constructed resources. We improve this framework by adding a machine translation component, which we use to align WordNet and the German part of OmegaWiki. A cross-validation experiment with the English OmegaWiki (i.e., for the monolingual case) shows that both configurations perform comparably, as only a few errors are introduced by the translation component. This confirms the general validity of the idea.
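The abstract does not describe the internals of the similarity-based framework. As a rough sketch of the general idea, the code below assumes that each sense comes with a gloss, compares glosses with a plain bag-of-words cosine similarity, and aligns every pair that reaches a threshold. The similarity measure, the threshold value, and all identifiers are assumptions made for illustration, not the configuration used in the thesis.

```python
import math
import re
from collections import Counter


def bag_of_words(gloss):
    """Lowercase a gloss and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", gloss.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_by_similarity(source_senses, target_senses, threshold=0.3):
    """Align every sense pair whose gloss similarity reaches the threshold."""
    alignment = set()
    for source_id, source_gloss in source_senses.items():
        for target_id, target_gloss in target_senses.items():
            score = cosine(bag_of_words(source_gloss), bag_of_words(target_gloss))
            if score >= threshold:
                alignment.add((source_id, target_id))
    return alignment


# Hypothetical candidate senses of the lemma "bank" in two resources.
source = {
    "wkt:bank#1": "a financial institution that accepts deposits",
    "wkt:bank#2": "the sloping land beside a river",
}
target = {
    "ow:bank#a": "an institution for financial services and deposits",
    "ow:bank#b": "land along the edge of a river",
}
similarity_alignment = align_by_similarity(source, target)
```

In a cross-lingual setting, one would translate the glosses of one resource before calling `align_by_similarity`, which is where translation errors could enter.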
Building on the observation that similarity-based approaches suffer from the insufficient lexical overlap between different glosses, we also present the novel alignment algorithm Dijkstra-WSA. It works on graph representations of LSRs induced, for instance, by semantic relations or links, and exploits the intuition that related senses are concentrated in adjacent regions of the resources. This algorithm performs competitively on six out of eight evaluation datasets, and we also present a combination with the similarity-based approach mentioned above in a backoff configuration. This approach achieves a significant improvement over previous work on all considered datasets.
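The following is a simplified sketch of the graph-based intuition, not the Dijkstra-WSA implementation from the thesis. It assumes that both resources have already been merged into one undirected graph with unit edge weights, that some cross-resource edges are given, and that candidate target senses are known for each source sense. The function names, the `max_length` cutoff, and the stub backoff are hypothetical.

```python
import heapq


def shortest_distances(graph, start, max_length):
    """Dijkstra's algorithm over unit-weight edges, cut off at max_length."""
    distances = {start: 0}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour in graph.get(node, ()):
            new_dist = dist + 1
            if new_dist <= max_length and new_dist < distances.get(neighbour, float("inf")):
                distances[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return distances


def align_by_graph(graph, candidates, max_length=3, backoff=None):
    """Align each source sense to its closest reachable candidate.

    If no candidate is reachable within max_length, ask the optional
    backoff function (for example, a gloss similarity measure) instead.
    """
    alignment = set()
    for source, targets in candidates.items():
        distances = shortest_distances(graph, source, max_length)
        reachable = [(distances[t], t) for t in targets if t in distances]
        if reachable:
            alignment.add((source, min(reachable)[1]))
        elif backoff is not None:
            choice = backoff(source, targets)
            if choice is not None:
                alignment.add((source, choice))
    return alignment


# Toy graph: resource A and resource B, bridged by one assumed cross-resource edge.
graph = {
    "A:bank#1": ["A:money#1"],
    "A:money#1": ["A:bank#1", "B:money#x"],
    "B:money#x": ["A:money#1", "B:bank#a"],
    "B:bank#a": ["B:money#x"],
    "B:bank#b": ["B:river#y"],
    "B:river#y": ["B:bank#b"],
    "A:bank#2": [],
}
candidates = {
    "A:bank#1": ["B:bank#a", "B:bank#b"],
    "A:bank#2": ["B:bank#a", "B:bank#b"],
}


def stub_backoff(source, targets):
    """Stand-in for a similarity-based decision; always picks the last candidate."""
    return targets[-1]


graph_only = align_by_graph(graph, candidates)
with_backoff = align_by_graph(graph, candidates, backoff=stub_backoff)
```

In this toy example the isolated sense `A:bank#2` receives no alignment from the graph alone, which is the kind of case a backoff configuration is meant to cover.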
To further exploit the insight that text similarity-based and graph-based approaches complement each other, we also combine these notions in a machine learning framework. This way, we achieve a further overall improvement in terms of F-measure for four out of eight considered datasets, while for three others we achieve a significant improvement in alignment precision and accuracy. We investigate different machine learning classifiers and conclude that Bayesian Networks show the most robust results across datasets. While we also discuss additional machine learning features, none of these leads to further improvements, which we consider proof that the structure and glosses of the LSRs are sufficiently informative for finding equivalent senses. Moreover, we discuss different approaches to aligning more than two resources at once (N-way alignment), which, however, do not yield satisfactory results. We analyze the reasons for this and identify a great demand for future research.
The unified LSR UBY provides the greater context for this thesis. Its representation format UBY-LMF (based on the Lexical Markup Framework standard) reflects the structure and content of many different LSRs with the greatest possible level of accuracy, making them interoperable and accessible. We demonstrate how the standardization is operationalized, where OmegaWiki serves as a showcase for presenting the properties of UBY-LMF, including the representation of the sense alignments. We also discuss the final, instantiated resource UBY, as well as the Java-based API, which allows easy programmatic access to it, a web interface for conveniently browsing UBY's contents, and the alignment framework we used for our experiments, whose implementation was enabled by the standardization efforts and the API.
To demonstrate that sense alignments are indeed beneficial for NLP, we discuss different applications which make use of them. The first scenario is the clustering of fine-grained GermaNet and WordNet senses by exploiting 1:n alignments to OmegaWiki, Wiktionary and Wikipedia. It significantly improves word sense disambiguation accuracy on standard evaluation datasets for German and English, while the approach is language-independent and does not require external knowledge or resource-specific feature engineering. The second scenario is computer-aided translation. We argue that the multilingual resources OmegaWiki and Wiktionary can be a useful source of knowledge, and especially translations, for this kind of application. In this context, we also further discuss the results of the alignment we produce between them, and we give examples of the additional knowledge that becomes available through their combined usage.
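The clustering idea in the first scenario can be pictured as follows. This is a minimal sketch under the assumption that fine-grained senses aligned to the same sense in another resource end up in one cluster; the abstract does not give the exact procedure. The function name and sense identifiers are hypothetical.

```python
def cluster_senses(alignment):
    """Group fine-grained senses that share an aligned sense in another resource.

    `alignment` holds (fine_sense, coarse_sense) pairs; a 1:n alignment shows
    up as several fine senses paired with the same coarse sense.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for fine, coarse in alignment:
        find(fine)  # register the sense even if it stays alone
        if coarse in first_seen:
            parent[find(fine)] = find(first_seen[coarse])
        else:
            first_seen[coarse] = fine

    clusters = {}
    for sense in parent:
        clusters.setdefault(find(sense), set()).add(sense)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical alignment of three WordNet senses to two OmegaWiki senses.
alignment = [
    ("wn:bank%1", "ow:bank#a"),
    ("wn:bank%2", "ow:bank#a"),
    ("wn:bank%3", "ow:bank#b"),
]
sense_clusters = cluster_senses(alignment)
```

Using union-find means that clusters also merge transitively when a fine-grained sense is aligned to several senses in the other resource.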
Finally, we point out many directions for future work, not only for WSA, but also for the design of aligned resources such as UBY and the applications that benefit from them.
Item type: | Dissertation |
---|---|
Published: | 2015 |
Author(s): | Matuschek, Michael |
Type of entry: | First publication |
Title: | Word Sense Alignment of Lexical Resources |
Language: | English |
Referees: | Gurevych, Prof. Dr. Iryna ; Navigli, PhD Roberto ; Weihe, Prof. Dr. Karsten |
Year of publication: | 2015 |
Place: | Darmstadt |
Date of oral examination: | 29 September 2014 |
URL / URN: | http://tuprints.ulb.tu-darmstadt.de/4355 |
Uncontrolled keywords: | Natural Language Processing, Lexical-Semantic Resources, Word Sense Alignment |
URN: | urn:nbn:de:tuda-tuprints-43555 |
Dewey Decimal Classification (DDC): | 000 Generalities, computer science, information science > 004 Computer science; 400 Language > 400 Language, linguistics |
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Ubiquitous Knowledge Processing |
Date deposited: | 15 Feb 2015 20:55 |
Last modified: | 15 Feb 2015 20:55 |