Text Reuse Detection Using a Composition of Text Similarity Measures

Bär, Daniel ; Zesch, Torsten ; Gurevych, Iryna (2012)
Text Reuse Detection Using a Composition of Text Similarity Measures.
Mumbai, India
Conference publication, Bibliography

Abstract

Detecting text reuse is a fundamental requirement for a variety of tasks and applications, ranging from journalistic text reuse to plagiarism detection. Text reuse is traditionally detected by computing similarity between a source text and a possibly reused text. However, existing text similarity measures exhibit a major limitation: They compute similarity only on features which can be derived from the content of the given texts, thereby inherently implying that any other text characteristics are negligible. In this paper, we overcome this traditional limitation and compute similarity along three characteristic dimensions inherent to texts: content, structure, and style. We explore and discuss possible combinations of measures along these dimensions, and our results demonstrate that the composition consistently outperforms previous approaches on three standard evaluation datasets, and that text reuse detection greatly benefits from incorporating a diverse feature set that reflects a wide variety of text characteristics.

Item type: Conference publication
Published: 2012
Author(s): Bär, Daniel ; Zesch, Torsten ; Gurevych, Iryna
Entry type: Bibliography
Title: Text Reuse Detection Using a Composition of Text Similarity Measures
Language: English
Publication date: December 2012
Book title: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012)
Event location: Mumbai, India
URL / URN: http://aclweb.org/anthology/C12-1011
Uncontrolled keywords: UKP_a_NLP4Wikis;UKP_p_WIKULU;reviewed;UKP_s_DKPro_Similarity;UKP_p_ItForensics;UKP_a_TexMinAn
ID number: TUD-CS-2012-0218
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Date deposited: 31 Dec 2016 14:29
Last modified: 24 Jan 2020 12:03