Zesch, Torsten (2010)
Study of Semantic Relatedness of Words Using Collaboratively Constructed Semantic Resources.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
Computing the semantic relatedness between words is a pervasive task in natural language processing with applications in, for example, word sense disambiguation, semantic information retrieval, or information extraction. Semantic relatedness measures typically use linguistic knowledge resources like WordNet, whose construction is very expensive and time-consuming. So far, insufficient coverage of these linguistic resources has been a major impediment to using semantic relatedness measures in large-scale natural language processing applications. However, the World Wide Web is currently undergoing a major change as more and more people are actively contributing to new resources available in the so-called Web 2.0. Some of these rapidly growing collaboratively constructed resources like Wikipedia and Wiktionary have the potential to be used as a new kind of semantic resource due to their increasing size and significant coverage of past and current developments. In this thesis, we present a comprehensive study aimed at computing semantic relatedness of word pairs using such collaboratively constructed semantic resources. We analyze the properties of the emerging collaboratively constructed semantic resources Wikipedia and Wiktionary and compare them to classical linguistically constructed semantic resources like WordNet and GermaNet. We show that collaboratively constructed semantic resources significantly differ from linguistically constructed semantic resources, and argue why this constitutes both an asset and an impediment for research in natural language processing. For handling the growing number of available semantic resources, we propose a representational interoperability framework that is used to represent and access all semantic resources in a uniform manner.
We give a detailed overview of the state of the art in computing semantic relatedness and categorize semantic relatedness measures into four types according to their working principles and the properties of the semantic resources they use. We investigate how existing semantic relatedness measures can be adapted to collaboratively constructed semantic resources, bridging the observed differences in semantic resources. For that purpose, we perform a graph-theoretic analysis of semantic resources to prove that semantic relatedness measures working on graphs can be correctly adapted. For the first time, we generalize a state-of-the-art vector-based semantic relatedness measure to each semantic resource where we can retrieve or construct a textual description for each concept. This generalized semantic relatedness measure turns out to be the most versatile measure, being easily applicable to all semantic resources. For the first time, we show (on the example of the German Wikipedia) that the growth of a resource has little or no negative effect on the performance of semantic relatedness measures, but that the coverage steadily increases. We intrinsically evaluate the adapted semantic relatedness measures on two tasks: (i) comparison with human judgments, and (ii) solving word choice problems. Additionally, we extrinsically evaluate semantic relatedness measures on the task of keyphrase extraction, and propose a new approach to keyphrase extraction based on semantic relatedness measures with the goal of finding infrequently used words in a document that are semantically connected to many other words in the document. For the purpose of evaluating keyphrase extraction, we developed a new evaluation strategy based on approximate keyphrase matching that accounts for the shortcomings of exact keyphrase matching.
On larger documents, our new approach outperforms all other state-of-the-art unsupervised approaches, and almost reaches the performance of a state-of-the-art supervised approach. From our comprehensive intrinsic and extrinsic evaluations, we conclude that collaboratively constructed semantic resources provide better coverage than linguistically constructed semantic resources while yielding comparable task performance. Thus, collaboratively constructed semantic resources can indeed be used as a proxy for linguistically constructed semantic resources that might not exist for minor languages.
Item type: | Dissertation
---|---
Published: | 2010
Author(s): | Zesch, Torsten
Type of entry: | First publication
Title: | Study of Semantic Relatedness of Words Using Collaboratively Constructed Semantic Resources
Language: | English
Referees: | Gurevych, Prof. Dr. Iryna
Publication date: | 3 February 2010
Place of publication: | Darmstadt
Publisher: | Technische Universität
Date of oral examination: | 1 December 2009
URL / URN: | urn:nbn:de:tuda-tuprints-20413
Uncontrolled keywords: | semantic relatedness, semantic distance, lexical semantic resources, wikipedia, wiktionary, keyphrase extraction, semantic information retrieval, vocabulary gap
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science
Department(s)/field(s): | 20 Department of Computer Science > Ubiquitous Knowledge Processing; 20 Department of Computer Science
Date deposited: | 09 Feb 2010 13:06
Last modified: | 05 Mar 2013 09:31