Halvani, Oren (2021)
Practice-Oriented Authorship Verification.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00019861
Ph.D. Thesis, Primary publication, Publisher's Version
Abstract
The question of the authorship of texts has occupied numerous areas both within and beyond research for a very long time. Knowing whether a particular person is a potential author of a text is of central importance for countless application scenarios. In practice, there are many examples where a document has an alleged author, but the authorship is disputed by another party. These include, for example, theses suspected to have been written by someone else (e.g., ghostwriters), testaments purportedly written by the individuals involved, messages disseminated from compromised email or social media accounts, or claims for damages from allegedly real policyholders. The underlying task in all these examples is authorship verification (AV), which concentrates on the fundamental question of whether two texts were written by the same person. AV represents an important sub-discipline of authorship analysis that has been researched for decades. However, when looking at much of the existing research in the field of AV, it becomes apparent that the focus is mainly on the detection accuracy of AV methods. Other important issues, such as the robustness, reliability, sensitivity to topic influences, and generalizability of AV methods, as well as the interpretability of verification results, receive much less attention in comparison. Interpretability in particular is of considerable interest, because verification results in the form of probabilities, similarity scores, and binary predictions (same-authorship/different-authorship) are generally insufficient on their own for use in practice (especially in forensic contexts).
The aim of this thesis is to shed light on these and other aspects of AV and to provide definitions, concepts and approaches that can be, and have been, used in realistic scenarios. We first provide an improved systematization of AV, in which we concentrate on relevant key topics. We explain the importance of preprocessing in the context of AV and propose an effective technique that masks topic-related words in documents. In this way, potential biases with respect to verification results can be successfully counteracted. We also propose new categories of features that can be used in AV (and other related disciplines) for different types of texts, e.g., newspaper articles, reviews, chat logs, emails, and scientific texts. Furthermore, we present a taxonomy of characteristics of AV methods that can be used to compare AV approaches in more detail. These characteristics can be used to assess the extent to which AV methods are suitable for practical use, regardless of their detection accuracy. Moreover, we highlight shortcomings of existing evaluation methodologies in the literature and, in particular, address the weaknesses of some performance measures that are still used today to evaluate AV methods. In this context, we present an alternative evaluation approach that aims to reflect a problematic aspect of AV methods in terms of their applicability in practice. After an in-depth analysis of a variety of existing AV methods, we propose alternative approaches that aim to counteract the identified issues. To supplement this, we offer various (visualization) techniques that allow human experts such as investigators to interpret the verification results of our AV methods. Based on a set of self-compiled corpora covering different genres and topics, we then evaluate our AV approaches against competitive baseline methods. These corpora focus on different challenges, such as verification cases with cross-topic conditions, documents written in two widely separated time periods, and documents with excessive use of slang. Thus, the suitability of the methods is evaluated and analyzed from different perspectives.
Item Type: | Ph.D. Thesis | ||||
---|---|---|---|---|---|
Published: | 2021 | ||||
Creators: | Halvani, Oren | ||||
Type of entry: | Primary publication | ||||
Title: | Practice-Oriented Authorship Verification | ||||
Language: | English | ||||
Referees: | Waidner, Prof. Dr. Michael ; Savoy, Prof. Dr. Jacques | ||||
Date: | 2021 | ||||
Place of Publication: | Darmstadt | ||||
Collation: | 244 pages | ||||
Refereed: | 28 June 2021 | ||||
DOI: | 10.26083/tuprints-00019861 | ||||
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/19861 | ||||
Status: | Publisher's Version | ||||
URN: | urn:nbn:de:tuda-tuprints-198617 | ||||
Classification DDC: | 000 Generalities, computers, information > 004 Computer science; 400 Language > 400 Language, linguistics | ||||
Divisions: | 20 Department of Computer Science; 20 Department of Computer Science > Security in Information Technology | ||||
Date Deposited: | 15 Nov 2021 13:03 | ||||
Last Modified: | 22 Nov 2021 09:45 | ||||
Refereed / Defense / Oral examination: | 28 June 2021 | ||||