Igamberdiev, Timour (2023)
Differentially private methods in natural language processing.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00024429
Dissertation, primary publication, publisher's version
Abstract
In today's world, the protection of privacy is increasingly gaining attention, not only among the general public, but also within the fields of machine learning and natural language processing (NLP). An established gold standard for providing a guarantee of privacy protection to all individuals in a dataset is the framework of differential privacy (DP). Intuitively, differential privacy provides a formal theoretical guarantee that the contribution of any individual to some analysis on a dataset is bounded. In other words, no single individual can influence this analysis 'too much'.
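For reference, the standard (ε, δ)-differential privacy guarantee alluded to here can be stated as follows. This is the generic textbook formulation rather than a definition quoted from the thesis:

```latex
% A randomized mechanism M satisfies (epsilon, delta)-differential privacy if,
% for all pairs of adjacent datasets D and D' (differing in one individual's data)
% and for every measurable set of outputs S:
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller values of ε and δ mean that any single individual's data can shift the mechanism's output distribution only slightly, which is the formal sense in which no individual can influence the analysis 'too much'.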
While the application of differential privacy to the fields of statistics and machine learning is becoming more widespread, it is still at a relatively early stage in NLP, with many important issues currently unresolved. These include finding the most favorable methodologies for privatizing the textual data used to train an NLP system, as well as privatizing textual data independently of any NLP system and releasing it for general analysis, such as use in a variety of downstream tasks. In this thesis, we address these and other fundamental questions relevant to applying privacy-preserving methods to the field of NLP.
We first present a detailed theoretical background on differential privacy and NLP. We discuss the problem of defining privacy from a philosophical perspective, fundamental concepts in the framework of differential privacy (e.g. the privacy guarantees it provides and how to achieve them), as well as the application of differential privacy to the fields of machine learning and NLP. This is followed by a description of important concepts in the field of NLP, including the structure of a modern NLP system, common tasks of text classification and generation, as well as relevant neural architectures.
We then delve into the primary investigations of this thesis, starting with the privatization of text classification systems. First, we tackle the problem of applying differential privacy to the graph data structures used in NLP datasets. Specifically, we demonstrate how to successfully apply differentially private stochastic gradient descent (DP-SGD) to graph convolutional networks, which pose theoretical and practical challenges due to their training characteristics. Next, we move to more 'standard' NLP models and textual datasets, answering the question of whether a common strategy exists for incorporating DP-SGD in these varied settings.
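To make the DP-SGD algorithm referenced above concrete, the following is a minimal sketch of one privatized gradient step: each example's gradient is clipped to a fixed L2 norm, Gaussian noise calibrated to that bound is added to the sum, and the result is averaged. The names used here (dp_sgd_step, clip_norm, noise_multiplier) are illustrative assumptions and do not reflect the thesis's actual implementation for graph convolutional networks or other models.

```python
import torch

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD aggregation step (sketch).

    per_example_grads: tensor of shape (batch_size, num_params),
    one flattened gradient per training example.
    """
    # 1. Clip each example's gradient to an L2 norm of at most clip_norm,
    #    bounding any single example's contribution (its sensitivity).
    norms = per_example_grads.norm(dim=1, keepdim=True)
    scale = (clip_norm / (norms + 1e-12)).clamp(max=1.0)
    clipped = per_example_grads * scale

    # 2. Sum the clipped gradients and add Gaussian noise whose standard
    #    deviation is proportional to the clipping bound.
    noise = torch.normal(
        mean=0.0,
        std=noise_multiplier * clip_norm,
        size=clipped.shape[1:],
    )
    noisy_sum = clipped.sum(dim=0) + noise

    # 3. Average over the batch to obtain the privatized gradient.
    return noisy_sum / per_example_grads.shape[0]

# Toy usage: a batch of 8 examples, 10 parameters each.
grads = torch.randn(8, 10)
print(dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0))
```

The overall (ε, δ) guarantee then follows from composing the noisy steps over the whole training run, typically tracked with a privacy accountant.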
In the second principal set of investigations of this thesis, we focus on the privatization of textual data that is independent of a specific NLP system. In particular, we address this problem from the perspective of privatized text rewriting in the setting of local differential privacy (LDP), in which an entire document is rewritten with differentially private guarantees. We first present our modular framework DP-Rewrite, intended to lay a foundation for the NLP community to solve this task in a transparent and reproducible manner. We then tackle the privatized text rewriting problem itself, proposing DP-BART, a model that introduces several techniques applicable to a pre-trained BART model, including a novel clipping method, iterative pruning of the model, and further training of internal representations. Using these techniques, we can drastically reduce the amount of perturbation required to achieve a DP guarantee. We thoroughly examine the feasibility of this approach as a whole, with a focus on the strict adjacency constraint inherent in the LDP setting, which leads to a high amount of perturbation of the original text.
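The general mechanism behind privatized text rewriting in the local DP setting, clipping a document's encoder representation and perturbing it before decoding, can be sketched as follows. This is a simplified illustration under assumed names (privatize_representation, clip_value); DP-BART's actual clipping scheme, pruning, and noise calibration are described in the thesis itself. The sketch also hints at why the strict adjacency constraint is costly: the worst-case sensitivity grows with the size of the representation, so the noise that must be added is large.

```python
import torch

def privatize_representation(z, clip_value=1.0, epsilon=1.0):
    """Clip a document's encoder representation element-wise and add
    Laplace noise (sketch of a local-DP mechanism).

    z: 1-D tensor holding the encoder output for one document.
    """
    # Element-wise clipping to [-clip_value, clip_value] bounds each
    # dimension's contribution to the L1 sensitivity.
    z_clipped = z.clamp(min=-clip_value, max=clip_value)

    # Under the strict adjacency of local DP (any document vs. any other),
    # the worst-case L1 sensitivity is 2 * clip_value per dimension.
    sensitivity = 2.0 * clip_value * z.numel()
    noise_scale = sensitivity / epsilon

    noise = torch.distributions.Laplace(0.0, noise_scale).sample(z.shape)
    return z_clipped + noise

# Toy usage: a 16-dimensional "encoder output" for one document.
z = torch.randn(16)
z_private = privatize_representation(z, clip_value=1.0, epsilon=10.0)
print(z_private)
```

Because the noise scale grows with the dimensionality of the representation, techniques that shrink or restructure that representation (such as the pruning and clipping methods mentioned above) directly reduce the perturbation needed for a given ε.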
Throughout this thesis, we additionally address several crucial points that are important to keep in mind when applying differential privacy to textual data. First is the question of interpretability, such as what exactly is being privatized in a textual dataset when DP is applied to some analysis on it, as well as the exact details of a proposed DP algorithm and the strength of the privacy guarantee that it provides. Furthermore, it is crucial to be aware of the limitations of proposed methodologies that incorporate DP. This includes computational and memory limitations, as well as the trade-off between the level of privacy that can be provided and the utility of an algorithm, with stronger privacy guarantees expected to more negatively impact utility.
Item type: Dissertation
Published: 2023
Author(s): Igamberdiev, Timour
Type of entry: Primary publication
Title: Differentially private methods in natural language processing
Language: English
Referees: Habernal, Dr. Ivan ; Gurevych, Prof. Dr. Iryna ; Wachsmuth, Prof. Dr. Henning
Year of publication: 2023
Place of publication: Darmstadt
Collation: xiii, 170 pages
Date of oral examination: 20 July 2023
DOI: 10.26083/tuprints-00024429
URL / URN: https://tuprints.ulb.tu-darmstadt.de/24429
Status: Publisher's version
URN: urn:nbn:de:tuda-tuprints-244295
Dewey Decimal Classification (DDC): 000 Generalities, computer science, information science > 004 Computer science
Department(s): 20 Department of Computer Science ; 20 Department of Computer Science > Ubiquitous Knowledge Processing
Date deposited: 18 Aug 2023 12:12
Last modified: 22 Aug 2023 09:47