TU Darmstadt / ULB / TUbiblio

Graph Structures in Privacy-Preserving Biomedical Analyses

Kussel, Tobias (2022)
Graph Structures in Privacy-Preserving Biomedical Analyses.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00021828
Ph.D. Thesis, Primary publication, Publisher's Version

Abstract

Graph theory is one of the truly interdisciplinary fields of research. Not only does the application of graphs lead to new insights in a variety of disciplines — physics, biology, sociology, mathematics, and computer science — but all of these disciplines, with their different questions and perspectives, actively influence the development and research of graph theory itself. Being able to abstract interconnected and interdependent systems and describe them in a mathematically exact way offers a different way of describing the problem when facing a diverse array of questions. It favors finding certain “topology-based” approaches while simultaneously providing a methodological toolbox; especially statistical physics offers numerous examples for this.

In this thesis, I consider various research questions in the field of distributed biomedical analysis. In this context, the health data used pose a significant risk to the privacy of the individuals involved — still, many conditions are socially stigmatized, so their disclosure could lead to ostracization, occupational disadvantages, and even physical harm. This is reflected in legislation, which requires special protection for health data — and other data that carry the risk of identifying an individual.

In this dissertation, I present the design, implementation, and empirical analysis of methods and algorithms that are not only relevant for obtaining medical knowledge and developing new treatment methods, but at the same time provide an exceedingly high level of protection concerning sensitive data. The developed methods presented hereafter use the cryptographic techniques of Secure Multi-Party Computation (MPC), a class of cryptographic protocols that enables the joint computation over distributed datasets while simultaneously providing highest security guarantees regarding the secrecy of the (distributed) input data. Graphs play a central role in this context: the desired functionalities are mapped to Boolean or arithmetic Directed Acyclic Graphs (DAGs), where the nodes represent operations or interactive protocols.

However, not only do the methodology and techniques employed in this dissertation make use of approaches from statistical physics and graph theory — many of the medical questions directly relate to graph problems or lead to more efficient solutions when interpreted as graphs. Specifically, a method is developed to efficiently analyze epistasis relationships, i.e., non-linear gene-gene and gene-environment interactions, and the resulting gene regulatory networks in a privacy-preserving manner. To enable the processing of the required amounts of genetic data within feasible time scales, novel generally applicable MPC building blocks were developed. Furthermore, the graph optimization problem of kidney exchanges — the efficient and fair allocation of donors’ kidneys for kidney transplantation — is explored. Since sensitive medical data are involved as well, these data protection considerations are of central interest here. In addition to the many boundary conditions to this problem, for example regarding the robustness of the solutions or the medical compatibility of organ donors and recipients, finding the global solution has been proven to be a problem with superpolynomial complexity. Hence, a locally optimal algorithm to solve this issue is presented in this work. Finally, the problem of probabilistic record linkage, which is highly relevant in medical research, is considered. Based on the personal data such as first name, last name, address, and birthday, the similarity between patients in one or more databases is to be assessed, with the goal of identifying duplicates even in the presence of noisy data, i.e., misspellings and interchanged identifiers. Via the path of graph representation and the associated search for approximated subgraph isomorphisms, record linkage is reduced to a (simpler) database problem with efficiently solvable algorithms. The developed framework Mainzelliste Secure EpiLinker (MainSEL) provides a means for secure, privacy-preserving probabilistic record linkage between different medical institutions. However, it is also capable of comparing other data types for their respective similarity. At the end of this dissertation, three promising but not yet extensively explored extensions are presented that allow the matching of pharmacologically relevant small molecule graphs, the search for similar patients in clinical databases for nonspecific diagnoses, or the identification of disrupted biological regulatory pathways by comparing transcriptome networks.

Item Type: Ph.D. Thesis
Erschienen: 2022
Creators: Kussel, Tobias
Type of entry: Primary publication
Title: Graph Structures in Privacy-Preserving Biomedical Analyses
Language: English
Referees: Hamacher, Prof. Dr. Kay ; Schneider, Prof. Dr. Thomas
Date: 2022
Place of Publication: Darmstadt
Collation: V, 166 Seiten
Refereed: 20 July 2022
DOI: 10.26083/tuprints-00021828
URL / URN: https://tuprints.ulb.tu-darmstadt.de/21828
Abstract:

Graph theory is one of the truly interdisciplinary fields of research. Not only does the application of graphs lead to new insights in a variety of disciplines — physics, biology, sociology, mathematics, and computer science — but all of these disciplines, with their different questions and perspectives, actively influence the development and research of graph theory itself. Being able to abstract interconnected and interdependent systems and describe them in a mathematically exact way offers a different way of describing the problem when facing a diverse array of questions. It favors finding certain “topology-based” approaches while simultaneously providing a methodological toolbox; especially statistical physics offers numerous examples for this.

In this thesis, I consider various research questions in the field of distributed biomedical analysis. In this context, the health data used pose a significant risk to the privacy of the individuals involved — still, many conditions are socially stigmatized, so their disclosure could lead to ostracization, occupational disadvantages, and even physical harm. This is reflected in legislation, which requires special protection for health data — and other data that carry the risk of identifying an individual.

In this dissertation, I present the design, implementation, and empirical analysis of methods and algorithms that are not only relevant for obtaining medical knowledge and developing new treatment methods, but at the same time provide an exceedingly high level of protection concerning sensitive data. The developed methods presented hereafter use the cryptographic techniques of Secure Multi-Party Computation (MPC), a class of cryptographic protocols that enables the joint computation over distributed datasets while simultaneously providing highest security guarantees regarding the secrecy of the (distributed) input data. Graphs play a central role in this context: the desired functionalities are mapped to Boolean or arithmetic Directed Acyclic Graphs (DAGs), where the nodes represent operations or interactive protocols.

However, not only do the methodology and techniques employed in this dissertation make use of approaches from statistical physics and graph theory — many of the medical questions directly relate to graph problems or lead to more efficient solutions when interpreted as graphs. Specifically, a method is developed to efficiently analyze epistasis relationships, i.e., non-linear gene-gene and gene-environment interactions, and the resulting gene regulatory networks in a privacy-preserving manner. To enable the processing of the required amounts of genetic data within feasible time scales, novel generally applicable MPC building blocks were developed. Furthermore, the graph optimization problem of kidney exchanges — the efficient and fair allocation of donors’ kidneys for kidney transplantation — is explored. Since sensitive medical data are involved as well, these data protection considerations are of central interest here. In addition to the many boundary conditions to this problem, for example regarding the robustness of the solutions or the medical compatibility of organ donors and recipients, finding the global solution has been proven to be a problem with superpolynomial complexity. Hence, a locally optimal algorithm to solve this issue is presented in this work. Finally, the problem of probabilistic record linkage, which is highly relevant in medical research, is considered. Based on the personal data such as first name, last name, address, and birthday, the similarity between patients in one or more databases is to be assessed, with the goal of identifying duplicates even in the presence of noisy data, i.e., misspellings and interchanged identifiers. Via the path of graph representation and the associated search for approximated subgraph isomorphisms, record linkage is reduced to a (simpler) database problem with efficiently solvable algorithms. The developed framework Mainzelliste Secure EpiLinker (MainSEL) provides a means for secure, privacy-preserving probabilistic record linkage between different medical institutions. However, it is also capable of comparing other data types for their respective similarity. At the end of this dissertation, three promising but not yet extensively explored extensions are presented that allow the matching of pharmacologically relevant small molecule graphs, the search for similar patients in clinical databases for nonspecific diagnoses, or the identification of disrupted biological regulatory pathways by comparing transcriptome networks.

Alternative Abstract:
Alternative abstract Language

Die Graphtheorie ist eines der wirklich interdisziplinären Forschungsfelder. Nicht nur führt die Anwendung von Graphen zu neuen Erkenntnissen in einer Vielzahl von Disziplinen – Physik, Biologie, Soziologie, Mathematik, Informatik –, sondern all diese Disziplinen wirken mit ihren unterschiedlichen Fragestellungen und Sichtweisen aktiv auf die Entwicklung und Erforschung der Graphtheorie selbst ein. Vernetzte und mit Interdependenzen behaftete Systeme abstrahieren und mathematisch exakt beschreiben zu können, bietet für viele Fragestellungen eine Art der Problembeschreibung, die das Finden gewisser »topologie-basierter« Lösungswege begünstigt und gleichzeitig einen methodischen Werkzeugkoffer zur Verfügung stellt – gerade die statistische Physik bietet zahlreiche Beispiele dafür.

In dieser Arbeit werden verschiedene Fragestellungen aus dem Bereich der verteilten Analysen der Biologie und Medizin betrachtet. Dabei stellen die verwendeten Gesundheitsdaten oft ein erhebliches Risiko für die Privatsphäre der Betroffenen dar – noch immer sind zahlreiche Erkrankungen gesellschaftlich stigmatisiert, sodass ein Bekanntwerden zu Ausgrenzung, beruflichen Nachteilen, bis hin zu körperlichen Gefährdungen führen könnte. Das wird auch von der Gesetzgebung widergespiegelt, die Gesundheitsdaten — und anderen Daten, die das Risiko der Identifikation eines Individuums bergen — einen besonderen Schutzbedarf attestiert.

Daher werden in dieser Dissertation Methoden und Algorithmen entworfen, implementiert und empirisch untersucht, die nicht nur für die Erlangung medizinischer Erkenntnisse und der Entwicklung neuer Behandlungsmethoden relevant sind, sondern die zeitgleich ein überaus hohes Datenschutzniveau bezüglich der sensiblen Daten aufweisen. Dazu bedienen sich die entwickelten Methoden der Techniken des kryptografischen Feldes der Secure Multi-Party-Computation (MPC) – einer Klasse kryptografischer Protokolle, die die gemeinsame Berechnung über verteilte Datenbestände ermöglicht und zeitgleich höchste Sicherheitsgarantien bezüglich der Geheimhaltung der (verteilten) Eingabedaten gibt. Dabei spielen Graphen eine zentrale Rolle: Die abzubildenden Funktionalitäten werden als Boolesche oder arithmetische ungerichtete azyklische Graphen repräsentiert, bei denen die Knoten Operationen oder interaktive Protokolle darstellen.

Doch nicht nur die Methodik und die eingesetzten Techniken in dieser Dissertation bedienen sich der Methoden der statistischen Physik und der Graphtheorie – viele der medizinischen Fragestellungen sind ganz direkt Graphprobleme oder führen über die Betrachtung als Graph zu einer effizienteren Problemdarstellung. So wurde ein Verfahren entwickelt, um Epistasisbeziehungen, also nichtlineare Gen-Gen- und Gen-Umgebungs-Interaktionen, sowie die daraus resultierenden Genregulationsnetzwerke effizient und die Privatsphäre schützend zu analysieren. Um die Verarbeitung der erforderlichen Mengen genetischer Daten in für den Einsatz akzeptablen Zeitskalen zu ermöglichen, wurden neue, allgemein verwendbare MPC-Bausteine entwickelt. Des Weiteren wurde das Graphoptimierungsproblem der effizienten und fairen Zuteilung von Spendernieren für Nierentransplantationen bearbeitet. Auch hier sind sensible medizinische Daten betroffen, sodass Datenschutzaspekte eine bedeutende Rolle spielen. Nicht nur bestehen bei dieser Fragestellung viele Randbedingungen, zum Beispiel bezüglich der Robustheit der Lösungen oder der medizinischen Kompatibilität von Organspendern und -empfängern sondern das Finden der globalen Lösung ist ein Problem mit superpolynomialer Komplexität, sodass in dieser Arbeit ein lokal optimaler Algorithmus vorgestellt wird. Schließlich wird das in der medizinischen Forschung hochrelevante Problem des probabilistischen Record-Linkage betrachtet. Dabei soll basierend auf personenbezogenen Daten wie Vorund Nachname, Adresse und Geburtstag die Ähnlichkeit zwischen Patienten in einer oder mehreren Datenbanken beurteilt werden, mit dem Ziel, selbst bei »verrauschten« Daten, also Schreibfehlern und der Verwechslung von Kennzeichnern, Duplikate zu identifizieren. Über den Weg der Graphrepräsentation und die damit verbundene Suche nach approximierten Subgraph-Isomorphismen wird Record-Linkage auf ein (einfacheres) Datenbank-Problem mit effizient lösbaren Algorithmen zurückgeführt. Das entwickelte Framework Mainzelliste Secure EpiLinker (MainSEL) bietet eine Möglichkeit zum sicheren, die Privatsphäre schützenden probabilistischen Record-Linkage zwischen verschiedenen medizinischen Institutionen. Es ermöglicht zudem weitere Datentypen auf Ähnlichkeit zu vergleichen. Am Ende dieser Dissertation werden drei vielversprechende, jedoch noch nicht umfänglich erforschte Erweiterung vorgestellt, die den Abgleich von pharmakologisch relevanten kleinen Molekülgraphen, die Suche nach ähnlichen Patienten in klinischen Datenbanken für unspezifische Diagnosen oder die Identifikation von gestörten biologischen Regulationspfaden durch den Vergleich von Transkriptomnetzwerken erlauben.

German
Uncontrolled Keywords: Multi-Party Computation, Graphtheorie, Medizin, IT Sicherheit
Status: Publisher's Version
URN: urn:nbn:de:tuda-tuprints-218285
Classification DDC: 000 Generalities, computers, information > 004 Computer science
500 Science and mathematics > 500 Science
500 Science and mathematics > 530 Physics
500 Science and mathematics > 570 Life sciences, biology
600 Technology, medicine, applied sciences > 610 Medicine and health
Divisions: 10 Department of Biology
10 Department of Biology > Computational Biology and Simulation
TU-Projects: Bund/BMBF|01ZZ1802G|HIGHmed
DLR|01ZZ1911F|CORD_MI
Date Deposited: 10 Aug 2022 12:09
Last Modified: 11 Aug 2022 06:03
PPN:
Referees: Hamacher, Prof. Dr. Kay ; Schneider, Prof. Dr. Thomas
Refereed / Verteidigung / mdl. Prüfung: 20 July 2022
Export:
Suche nach Titel in: TUfind oder in Google
Send an inquiry Send an inquiry

Options (only for editors)
Show editorial Details Show editorial Details