Composing Measures for Computing Text Similarity

Bär, Daniel and Zesch, Torsten and Gurevych, Iryna
UKP Lab, Technische Universität Darmstadt (Corporate Creator) (2015):
Composing Measures for Computing Text Similarity.
Darmstadt, Germany, [Online-Edition: http://tuprints.ulb.tu-darmstadt.de/4342],
[Report]

Abstract

We present a comprehensive study of computing similarity between texts. We start from the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. We thus define the notion of text similarity and distinguish it from related tasks such as textual entailment and near-duplicate detection. We then identify multiple text dimensions, i.e. characteristics inherent to texts that can be used to judge text similarity, for which we provide empirical evidence. We discuss state-of-the-art text similarity measures previously proposed in the literature, before continuing with a thorough discussion of common evaluation metrics and datasets. Based on the analysis, we devise an architecture which combines text similarity measures in a unified classification framework. We apply our system in two evaluation settings, for which it consistently outperforms prior work and competing systems: (a) an intrinsic evaluation in the context of the Semantic Textual Similarity Task as part of the Semantic Evaluation (SemEval) exercises, and (b) an extrinsic evaluation for the detection of text reuse. As a basis for future work, we introduce DKPro Similarity, an open source software package which streamlines the development of text similarity measures and complete experimental setups.

Item Type: Report
Published: 2015
Creators: Bär, Daniel and Zesch, Torsten and Gurevych, Iryna
Title: Composing Measures for Computing Text Similarity
Language: English
Place of Publication: Darmstadt, Germany
Uncontrolled Keywords: Text Similarity Plagiarism Paraphrase Recognition
Divisions: 20 Department of Computer Science > Ubiquitous Knowledge Processing; 20 Department of Computer Science
Date Deposited: 01 Feb 2015 20:55
Official URL: http://tuprints.ulb.tu-darmstadt.de/4342
URN: urn:nbn:de:tuda-tuprints-43429