Bär, Daniel (2013)
A Composite Model for Computing Similarity Between Texts.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
Computing text similarity is a foundational technique for a wide range of tasks in natural language processing, such as duplicate detection, question answering, or automatic essay grading. Text similarity recently received widespread attention in the research community with the establishment of the Semantic Textual Similarity (STS) Task at the Semantic Evaluation (SemEval) workshop in 2012, which underscores the importance of text similarity research. The goal of the STS Task is to create automated measures that compute the degree of similarity between two given texts in the same way that humans do. Measures are expected to output continuous text similarity scores, which are then either compared with human judgments or used as a means for solving a particular problem.

We start this thesis with the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well defined in the natural language processing community. No attempt has yet been made to formalize how text similarity between two texts can be computed. Still, text similarity is regarded as a fixed, axiomatic notion in the community. To alleviate this shortcoming, we describe existing formal models of similarity and discuss how we can adapt them to texts. We propose to judge text similarity along multiple text dimensions, i.e., characteristics inherent to texts, and provide empirical evidence based on a set of annotation studies that the proposed dimensions are perceived by humans.

We continue with a comprehensive survey of state-of-the-art text similarity measures previously proposed in the literature. To the best of our knowledge, no such survey has been done yet. We propose a classification into compositional and non-compositional text similarity measures according to their inherent properties. Compositional measures compute pairwise word similarity scores between all words of the two texts and then aggregate them into an overall similarity score, while non-compositional measures project the complete texts onto particular models and then compare the texts based on these models.

Based on our theoretical insights, we then present the implementation of a text similarity system which composes a multitude of text similarity measures along multiple text dimensions using a machine learning classifier. We argue that, depending on the concrete task at hand, such a system may need to address more than a single text dimension in order to best resemble human judgments. Our efforts culminate in the open source framework DKPro Similarity, which streamlines the development of text similarity measures and experimental setups.

We apply our system in two evaluations, one intrinsic and one extrinsic, in which it consistently outperforms prior work and competing systems. In the intrinsic evaluation, the performance of text similarity measures is evaluated in an isolated setting by comparing the algorithmically produced scores with human judgments. We conducted the intrinsic evaluation in the context of the STS Task as part of the SemEval workshop. In the extrinsic evaluation, the performance of text similarity measures is evaluated with respect to a particular task at hand, where text similarity is a means for solving a particular problem. We conducted the extrinsic evaluation in the text classification task of text reuse detection. The results of both evaluations support our hypothesis that a composition of text similarity measures greatly benefits the similarity computation process.

Finally, we stress the importance of text similarity measures for real-world applications. We therefore introduce the application scenario Self-Organizing Wikis, where users of wikis, i.e., web-based collaborative content authoring systems, are supported in their everyday tasks by means of natural language processing techniques in general, and text similarity in particular. We elaborate on two use cases where text similarity computation is particularly beneficial: the detection of duplicates and the semi-automatic insertion of hyperlinks. Moreover, we discuss two further applications where text similarity is a valuable tool: in both question answering and textual entailment recognition, text similarity has been used successfully in experiments and appears to be a promising means for further research in these fields.

We conclude this thesis with an analysis of shortcomings of current text similarity research and formulate challenges which should be tackled by future work. In particular, we believe that computing text similarity along multiple text dimensions, which depend on the specific task at hand, will benefit any other task where text similarity is fundamental, as a composition of text similarity measures has shown superior performance in both the intrinsic and the extrinsic evaluation.
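
To make the distinction between compositional and non-compositional measures in the abstract concrete, the following is a minimal Python sketch. It is not taken from the thesis or from DKPro Similarity. The function names, the character-bigram word similarity, the best-match averaging used for aggregation, and the bag-of-words cosine used as the "model" are all assumptions chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (illustrative simplification)."""
    return text.lower().split()


def char_bigram_sim(a, b):
    """Toy word-level similarity: Dice overlap of character bigrams.

    Stand-in for any word similarity measure; not the thesis's choice.
    """
    if a == b:
        return 1.0
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a or not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def compositional_similarity(text1, text2, word_sim=char_bigram_sim):
    """Compositional: score word pairs, then aggregate to one text score.

    Aggregation here (an assumption): each word keeps its best match in
    the other text, best matches are averaged, and both directions are
    averaged so the result is symmetric.
    """
    words1, words2 = tokenize(text1), tokenize(text2)
    if not words1 or not words2:
        return 0.0

    def directed(source, target):
        return sum(max(word_sim(s, t) for t in target) for s in source) / len(source)

    return 0.5 * (directed(words1, words2) + directed(words2, words1))


def non_compositional_similarity(text1, text2):
    """Non-compositional: project each whole text onto a model, compare models.

    The model here (an assumption) is a bag-of-words count vector,
    compared with cosine similarity.
    """
    vec1, vec2 = Counter(tokenize(text1)), Counter(tokenize(text2))
    dot = sum(count * vec2[word] for word, count in vec1.items())
    norm = math.sqrt(sum(c * c for c in vec1.values())) * math.sqrt(
        sum(c * c for c in vec2.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    a, b = "text similarity", "textual similarity"
    print(compositional_similarity(a, b))
    print(non_compositional_similarity(a, b))
```

In this toy setup the compositional measure gives partial credit to the near-match between "text" and "textual", whereas the bag-of-words cosine treats them as unrelated tokens. A composite system of the kind the abstract describes would feed several such scores into a learned model rather than rely on any single one.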
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2013 | ||||
Author(s): | Bär, Daniel | ||||
Type of entry: | First publication | ||||
Title: | A Composite Model for Computing Similarity Between Texts | ||||
Language: | English | ||||
Referees: | Gurevych, Prof. Dr. Iryna ; Dagan, Prof. Ido ; Zesch, Dr. Torsten | ||||
Date of publication: | 11 October 2013 | ||||
Date of oral examination: | 11 October 2013 | ||||
URL / URN: | http://tuprints.ulb.tu-darmstadt.de/3641 | ||||
Keywords: | text similarity, text relatedness | ||||
URN: | urn:nbn:de:tuda-tuprints-36415 | ||||
Dewey Decimal Classification (DDC) subject group: | 000 General works, computer science, information science > 004 Computer science | ||||
Department(s)/field(s): | 20 Department of Computer Science > Ubiquitous Knowledge Processing; 20 Department of Computer Science | ||||
Date deposited: | 20 Oct 2013 19:55 | ||||
Last modified: | 20 Oct 2013 19:55 | ||||