Study of Semantic Relatedness of Words Using Collaboratively Constructed Semantic Resources

Zesch, Torsten (2010)
Study of Semantic Relatedness of Words Using Collaboratively Constructed Semantic Resources.
Technische Universität Darmstadt
Dissertation, primary publication

URL / URN: urn:nbn:de:tuda-tuprints-20413

Abstract

Computing the semantic relatedness between words is a pervasive task in natural language processing, with applications in, e.g., word sense disambiguation, semantic information retrieval, and information extraction. Semantic relatedness measures typically use linguistic knowledge resources like WordNet, whose construction is very expensive and time-consuming. So far, insufficient coverage of these linguistic resources has been a major impediment to using semantic relatedness measures in large-scale natural language processing applications. However, the World Wide Web is currently undergoing a major change as more and more people actively contribute to new resources available in the so-called Web 2.0. Some of these rapidly growing collaboratively constructed resources, like Wikipedia and Wiktionary, have the potential to be used as a new kind of semantic resource due to their increasing size and significant coverage of past and current developments. In this thesis, we present a comprehensive study aimed at computing semantic relatedness of word pairs using such collaboratively constructed semantic resources. We analyze the properties of the emerging collaboratively constructed semantic resources Wikipedia and Wiktionary and compare them to classical linguistically constructed semantic resources like WordNet and GermaNet. We show that collaboratively constructed semantic resources differ significantly from linguistically constructed semantic resources, and argue why this constitutes both an asset and an impediment for research in natural language processing. To handle the growing number of available semantic resources, we propose a representational interoperability framework that is used to represent and access all semantic resources in a uniform manner.
We give a detailed overview of the state of the art in computing semantic relatedness and categorize semantic relatedness measures into four types according to their working principles and the properties of the semantic resources they use. We investigate how existing semantic relatedness measures can be adapted to collaboratively constructed semantic resources, bridging the observed differences between semantic resources. For that purpose, we perform a graph-theoretic analysis of semantic resources to prove that semantic relatedness measures working on graphs can be correctly adapted. For the first time, we generalize a state-of-the-art vector-based semantic relatedness measure to every semantic resource for which we can retrieve or construct a textual description of each concept. This generalized semantic relatedness measure turns out to be the most versatile measure, being easily applicable to all semantic resources. For the first time, we show (using the example of the German Wikipedia) that the growth of a resource has little or no negative effect on the performance of semantic relatedness measures, while the coverage steadily increases. We intrinsically evaluate the adapted semantic relatedness measures on two tasks: (i) comparison with human judgments, and (ii) solving word choice problems. Additionally, we extrinsically evaluate semantic relatedness measures on the task of keyphrase extraction, and propose a new approach to keyphrase extraction based on semantic relatedness measures, with the goal of finding infrequently used words in a document that are semantically connected to many other words in the document. For the purpose of evaluating keyphrase extraction, we developed a new evaluation strategy based on approximate keyphrase matching that accounts for the shortcomings of exact keyphrase matching.
On larger documents, our new approach outperforms all other state-of-the-art unsupervised approaches, and almost reaches the performance of a state-of-the-art supervised approach. From our comprehensive intrinsic and extrinsic evaluations, we conclude that collaboratively constructed semantic resources provide better coverage than linguistically constructed semantic resources while yielding comparable task performance. Thus, collaboratively constructed semantic resources can indeed be used as a proxy for linguistically constructed semantic resources that might not exist for minor languages.
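
The vector-based measure and the word choice task mentioned in the abstract can be illustrated with a minimal sketch. It is not the implementation from the thesis: the toy concept texts, the function names, and the use of raw term counts with cosine similarity are all assumptions made for illustration. Each word is mapped to a vector over concepts (here, how often the word occurs in each concept's textual description), and two words are related to the degree that their concept vectors point in the same direction.

```python
import math

# Hypothetical toy concepts: short texts standing in for the textual
# descriptions a resource such as Wikipedia or Wiktionary would provide.
CONCEPTS = {
    "Automobile": "car vehicle engine road wheel driver",
    "Bicycle": "bicycle vehicle wheel pedal road rider",
    "Banana": "banana fruit yellow tropical plant",
}


def concept_vector(word):
    """Map a word to a vector over concepts (raw count per concept text)."""
    return {name: text.split().count(word) for name, text in CONCEPTS.items()}


def cosine(u, v):
    """Cosine similarity of two vectors that share the same keys."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a word missing from every concept is unrelated to all
    return dot / (norm_u * norm_v)


def relatedness(w1, w2):
    """Relatedness of two words as the cosine of their concept vectors."""
    return cosine(concept_vector(w1), concept_vector(w2))


def solve_word_choice(target, candidates):
    """Pick the candidate most related to the target word."""
    return max(candidates, key=lambda c: relatedness(target, c))


print(solve_word_choice("car", ["fruit", "vehicle", "pedal"]))  # vehicle
```

Because the sketch needs only a textual description per concept, the same code would run unchanged on any resource that can supply such descriptions, which is the property the abstract highlights.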

Item type:

Dissertation

Published:

2010

Author(s):

Zesch, Torsten

Type of entry:

Primary publication

Title:

Study of Semantic Relatedness of Words Using Collaboratively Constructed Semantic Resources

Language:

English

Referees:

Gurevych, Prof. Dr. Iryna

Publication date:

3 February 2010

Place of publication:

Darmstadt

Publisher:

Technische Universität

Date of oral examination:

1 December 2009

URL / URN:

urn:nbn:de:tuda-tuprints-20413

Alternative or translated abstract:

Alternative abstract (original language: German)

Computing the semantic relatedness between words is of central importance in natural language processing, with applications in, for example, word sense disambiguation, semantic information retrieval, and information extraction. Semantic relatedness measures typically rely on linguistic resources such as WordNet, whose construction is very time-consuming and expensive. Even when such linguistic resources are available, their insufficient coverage remains a major obstacle to using semantic relatedness measures in realistic applications. However, as the World Wide Web transforms into the so-called Web 2.0, more and more collaboratively constructed resources are becoming available. Examples are Wikipedia and Wiktionary, which are growing very quickly and thus have the potential to be used as new semantic resources in language processing. In this dissertation, we comprehensively investigate the use of collaboratively constructed semantic resources for computing the semantic relatedness between words. To this end, we analyze the properties of the collaboratively constructed semantic resources Wikipedia and Wiktionary and compare them with classical, linguistically motivated semantic resources such as WordNet and GermaNet. We show that significant differences exist, which on the one hand offer an opportunity to tap new knowledge from these resources, but on the other hand make it necessary to adapt semantic relatedness measures to the collaboratively constructed resources. To handle the growing number of available semantic resources efficiently, we developed an interoperability framework in which all semantic resources are represented uniformly.
We give a detailed account of the state of research on semantic relatedness and categorize existing measures into four types, each of which uses different properties of the semantic resources to compute semantic relatedness. We investigate how existing semantic relatedness measures can be adapted so that they work optimally with collaboratively constructed semantic resources. For this purpose, we perform a graph-theoretic analysis of the semantic resources and show that graph-based semantic relatedness measures can be adapted correctly. For the first time, we generalize vector-based relatedness measures to all semantic resources that contain a textual description of concepts or from which such a description can be constructed. In experimental studies, this generalized semantic relatedness measure proves to be the most versatile and the easiest to adapt while also performing well. For the first time, we show (using the example of the German Wikipedia) that the growth of a resource has little or no influence on the performance of a semantic relatedness measure, while the coverage of the semantic resource, and with it its usability in realistic applications, grows steadily. We carry out an intrinsic evaluation of the semantic relatedness measures on two established tasks: (i) comparison with human judgments and (ii) solving word choice problems. In addition, we evaluate semantic relatedness measures extrinsically by their suitability for keyphrase extraction. To this end, we propose a new extraction method based on semantic relatedness measures.
This method is intended to also discover, as keyphrases, phrases that occur rarely in the document but have many semantic relations to other words in the document. On longer documents, the new extraction method proves superior to all other unsupervised methods and almost reaches the performance level of supervised methods. In addition, we develop a new evaluation strategy based on approximate matching of extracted keyphrases against the previously annotated correct keyphrases. In an annotation study, we show that this new evaluation strategy agrees better with human judgments of keyphrases. In summary, our comprehensive intrinsic and extrinsic evaluation supports the conclusion that collaboratively constructed semantic resources and linguistically motivated semantic resources lead to comparable results. However, because of their higher coverage, collaboratively constructed semantic resources are considerably better suited to realistic applications. Therefore, collaboratively constructed semantic resources, which are available for almost all languages, can be used as a substitute for linguistically motivated semantic resources, which are available for only a few languages.
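
The idea of approximate keyphrase matching named in the abstracts can be sketched as follows. The matching rules below are assumptions chosen for illustration, not the strategy defined in the thesis: phrases are lowercased, a trailing plural "s" is stripped from each token as a crude stand-in for morphological normalization, and an extracted phrase counts as a match when it equals a gold phrase or when one contains the other as a contiguous token sequence. All function names are hypothetical.

```python
def normalize(phrase):
    """Lowercase and strip a trailing 's' from longer tokens (crude stemming)."""
    tokens = phrase.lower().split()
    return tuple(t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens)


def contains(longer, shorter):
    """True if `shorter` occurs in `longer` as a contiguous token sequence."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def exact_match(extracted, gold):
    """Strict string equality, as in exact keyphrase matching."""
    return extracted == gold


def approx_match(extracted, gold):
    """Match after normalization, or when one phrase includes the other."""
    a, b = normalize(extracted), normalize(gold)
    if not a or not b:
        return False  # an empty phrase never matches
    return a == b or contains(a, b) or contains(b, a)


def precision(extracted, gold, matcher):
    """Share of extracted phrases that match at least one gold phrase."""
    if not extracted:
        return 0.0
    hits = sum(1 for e in extracted if any(matcher(e, g) for g in gold))
    return hits / len(extracted)


gold = ["semantic relatedness", "keyphrase extraction"]
extracted = ["semantic relatedness measures", "Keyphrase extraction", "banana"]

print(precision(extracted, gold, exact_match))   # 0.0
print(precision(extracted, gold, approx_match))
```

In this toy example, exact matching rejects two phrases that a human reader would likely accept, while the approximate matcher credits both. That gap is the kind of shortcoming the abstract attributes to exact keyphrase matching.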

Uncontrolled keywords:

semantic relatedness, semantic distance, lexical semantic resources, Wikipedia, Wiktionary, keyphrase extraction, semantic information retrieval, vocabulary gap

Dewey Decimal Classification (DDC) subject group:

000 Generalities, computer science, information science > 004 Computer science

Department(s)/field(s):

20 Department of Computer Science > Ubiquitous Knowledge Processing
20 Department of Computer Science

Date deposited:

09 Feb 2010 13:06

Last modified:

05 Mar 2013 09:31
