Machkour, Jasin (2024)
Development of Fast Machine Learning Algorithms for False Discovery Rate Control in Large-Scale High-Dimensional Data.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00028231
Dissertation, first publication, publisher's version
Abstract
This dissertation develops false discovery rate (FDR) controlling machine learning algorithms for large-scale high-dimensional data. Ensuring the reproducibility of discoveries based on high-dimensional data is pivotal in numerous applications. The developed algorithms perform fast variable selection in large-scale high-dimensional settings where the number of variables may be much larger than the number of samples. This includes large-scale data with up to millions of variables, such as genome-wide association studies (GWAS). Finite-sample FDR-control guarantees based on martingale theory are established, which supports the trustworthiness of the developed methods. The open-source R software packages TRexSelector and tlars, which implement the proposed algorithms, have been published on the Comprehensive R Archive Network (CRAN). Extensive numerical experiments and real-world problems in biomedical and financial engineering demonstrate the performance in challenging use cases. The first three main parts of this dissertation present the methodological and theoretical contributions, while the fourth main part contains the practical contributions.
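For orientation, the quantity being controlled can be written in its standard form. The symbols below are notation introduced here for illustration, not taken from the dissertation: $R$ is the number of selected variables, $V$ the number of selected variables that are not truly active, and $\alpha$ the user-defined target FDR.

```latex
\mathrm{FDP} = \frac{V}{\max(R,\, 1)},
\qquad
\mathrm{FDR} = \mathbb{E}\left[\mathrm{FDP}\right] \le \alpha
```

A finite-sample guarantee means that the inequality holds for the actual number of samples, not only asymptotically.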
The first main part (Chapter 3) is dedicated to the Terminating-Random Experiments (T-Rex) selector, a new fast variable selection framework for high-dimensional data. The proposed T-Rex selector controls a user-defined target FDR while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the candidate variables and multiple independent sets of randomly generated dummy variables. A finite-sample proof of the FDR control property is provided using martingale theory. The computational complexity of the T-Rex selector grows linearly with the number of candidate variables. Furthermore, it is more than two orders of magnitude faster than state-of-the-art benchmark methods in large-scale data settings. The T-Rex selector therefore scales to millions of candidate variables within a reasonable computation time. An important use case of the T-Rex selector is determining reproducible associations between phenotypes and genotypes in GWAS, which is imperative in personalized medicine and drug discovery.
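The idea of early-terminated random experiments with dummy variables can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the dissertation's algorithm and not the TRexSelector API: it uses greedy forward selection (orthogonal matching pursuit) as the selection method, a fixed voting threshold of 0.5, and it omits the calibration step that yields the FDR guarantee. All function names and parameters (`one_experiment`, `relative_occurrences`, `max_dummies`, `n_experiments`) are hypothetical.

```python
import numpy as np


def one_experiment(X, y, max_dummies, rng):
    """Run one random experiment and return the selected candidate indices.

    Assumption: greedy forward selection stands in for the selection method.
    The experiment terminates early, as soon as `max_dummies` dummies are in.
    """
    n, p = X.shape
    dummies = rng.standard_normal((n, p))      # fresh dummy set per experiment
    Z = np.hstack([X, dummies])                # candidates first, then dummies
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardize all columns
    target = y - y.mean()
    residual = target.copy()
    in_model = np.zeros(2 * p, dtype=bool)
    order = []
    dummies_in = 0
    max_steps = min(n - 1, 2 * p)              # guard against overfitting
    while dummies_in < max_dummies and len(order) < max_steps:
        score = np.abs(Z.T @ residual)
        score[in_model] = -np.inf              # never pick a column twice
        j = int(np.argmax(score))
        in_model[j] = True
        order.append(j)
        if j >= p:                             # indices >= p are dummies
            dummies_in += 1
        coef, *_ = np.linalg.lstsq(Z[:, order], target, rcond=None)
        residual = target - Z[:, order] @ coef
    return [j for j in order if j < p]


def relative_occurrences(X, y, n_experiments=20, max_dummies=1, seed=0):
    """Fraction of experiments in which each candidate was selected."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_experiments):
        for j in one_experiment(X, y, max_dummies, rng):
            counts[j] += 1
    return counts / n_experiments


# Toy data: 3 strongly active variables among 30 candidates.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
y = 5.0 * X[:, :3].sum(axis=1) + rng.standard_normal(200)
phi = relative_occurrences(X, y)
selected = np.flatnonzero(phi > 0.5)           # fixed threshold, uncalibrated
```

Fusing the experiments through relative occurrences is what makes the result stable: a null variable may slip in ahead of a dummy in a single experiment, but it rarely does so in most of them.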
The second main part (Chapter 4) concerns dependency-aware FDR-controlling algorithms for large-scale high-dimensional data. In many biomedical and financial applications, the high-dimensional data sets contain highly correlated candidate variables (e.g., gene expression data and stock returns). For such applications, the dependency-aware T-Rex (T-Rex+DA) framework has been developed. It extends the ordinary T-Rex framework by accounting for dependency structures among the candidate variables. This is achieved by integrating graphical models within the T-Rex framework, which makes it possible to harness the dependency structure among variables and to develop variable penalization mechanisms that guarantee FDR control.
In the third main part (Chapter 5), algorithms for joint grouped variable selection and FDR control are proposed. This approach to the challenges posed by groups of highly dependent variables in the data differs from the more conservative variable penalization approach developed in the second part of this dissertation. That is, instead of finding the few true active variables among groups of highly correlated variables, the goal is to select all groups of highly correlated variables that contain at least one true active variable. In genomics research, especially for GWAS, grouped variable selection approaches are highly relevant, since one is not interested in identifying a few single-nucleotide polymorphisms (SNPs) that are associated with a disease of interest but rather the entire groups of correlated SNPs that point to relevant locations on the genome.
The fourth main part of this dissertation (Chapters 6 and 7) demonstrates the application of the developed methods to practical problems in biomedical engineering as well as financial engineering. The biomedical applications include (i) a semi-real-world GWAS, (ii) a human immunodeficiency virus type 1 (HIV-1) data set with associated drug resistance measurements, and (iii) a breast cancer data set with associated survival times of the patients. The financial engineering applications include (i) accurately tracking the S&P 500 index using a quarterly updated and rebalanced tracking portfolio that consists of few stocks and (ii) a factor analysis of S&P 500 stock returns. The common challenge of all considered applications lies in detecting the few true active variables (i.e., SNPs, mutations, genes, stocks) among many non-active variables, including in large-scale high-dimensional settings.
In summary, this dissertation develops and analyses new fast and scalable machine learning algorithms with provable FDR-control guarantees for variable selection tasks in large-scale high-dimensional data. The developed algorithms and the associated open-source software packages have enabled reproducible discoveries in various real-world applications ranging from biomedical to financial engineering.
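The group-level notion of a false discovery described above can be made concrete with a short Python sketch. It is an illustration of the stated definition only (a selected group counts as a true discovery if it contains at least one true active variable); the function name `group_fdp` and the data layout are assumptions introduced here, not the dissertation's notation.

```python
def group_fdp(selected_groups, groups, active_vars):
    """False discovery proportion at the group level.

    groups: dict mapping a group id to the variable indices it contains.
    A selected group is a false discovery if none of its variables is active.
    """
    active = set(active_vars)
    chosen = set(selected_groups)
    false_groups = sum(1 for g in chosen if active.isdisjoint(groups[g]))
    return false_groups / max(len(chosen), 1)


# Toy example: three groups of correlated variables, one active variable.
groups = {"A": [0, 1], "B": [2, 3], "C": [4]}
fdp = group_fdp(["A", "B"], groups, active_vars=[1])
```

Here group A counts as a true discovery even though only one of its two variables is active, which is exactly what distinguishes grouped selection from selecting individual variables.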
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2024 | ||||
Author(s): | Machkour, Jasin | ||||
Type of entry: | First publication | ||||
Title: | Development of Fast Machine Learning Algorithms for False Discovery Rate Control in Large-Scale High-Dimensional Data | ||||
Language: | English | ||||
Referees: | Muma, Prof. Dr. Michael ; Palomar, Prof. Dr. Daniel P. | ||||
Publication date: | 19 November 2024 | ||||
Place of publication: | Darmstadt | ||||
Collation: | xvi, 216 pages | ||||
Date of oral examination: | 23 August 2024 | ||||
DOI: | 10.26083/tuprints-00028231 | ||||
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/28231 | ||||
Status: | Publisher's version | ||||
URN: | urn:nbn:de:tuda-tuprints-282317 | ||||
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science; 500 Science and mathematics > 510 Mathematics; 600 Technology, medicine, applied sciences > 621.3 Electrical engineering, electronics |
Department(s)/field(s): | 18 Fachbereich Elektrotechnik und Informationstechnik; 18 Fachbereich Elektrotechnik und Informationstechnik > Institut für Nachrichtentechnik; 18 Fachbereich Elektrotechnik und Informationstechnik > Institut für Nachrichtentechnik > Robust Data Science; LOEWE; LOEWE > LOEWE-Zentren; LOEWE > LOEWE-Zentren > emergenCITY; Zentrale Einrichtungen; Zentrale Einrichtungen > Hochschulrechenzentrum (HRZ); Zentrale Einrichtungen > Hochschulrechenzentrum (HRZ) > Hochleistungsrechner |
TU projects: | DFG|MU4507/1-1|REFOCUS: Robuste Sch | ||||
Date deposited: | 19 Nov 2024 12:05 | ||||
Last modified: | 27 Nov 2024 08:45 | ||||