Sukhanov, Sergey (2021)
Clustering, classifying and matching patterns with ensemble techniques.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00019897
Dissertation, first publication, publisher's version
Abstract
This thesis addresses three general machine learning and signal processing problems that arise in many scientific and practical settings: clustering, classification, and pattern matching. Despite the many solutions proposed over the last decade, these problems remain difficult: many approaches fail on multidimensional signals; some methods can handle only moderate amounts of data because of their intrinsic complexity; and most frameworks require hard decisions based on assumptions that might not hold in reality. Drawing on the concepts of group learning and the wisdom of the crowds, this thesis applies the ensemble learning paradigm to these fundamental challenges.
The first part of the dissertation addresses the problem of identifying groups of similar objects, also known as clustering. Although widely used in many domains, clustering carries several intrinsic challenges (subjectivity, a large parameter set, method-specific assumptions about the resulting clusters, etc.) that often hinder satisfactory results. To address these challenges, a novel consensus clustering framework is proposed. Operating on multiple clustering outcomes, it provides two scalable ways of approaching the problem. First, to overcome the drawbacks of the Hamming distance in co-occurrence-based consensus clustering methods, the proposed approach constructs an expressive distance measure that operates on data structures called data fragments. A novel consensus function based on hierarchical clustering is then built around this measure and demonstrates stable and accurate results. Second, the consensus clustering problem is formulated as a binary matrix factorization problem, which can be solved efficiently by means of recursive rank-one binary matrix approximation. This yields descriptive, interpretable results and suits large-scale datasets and large numbers of ensemble members.
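As a point of reference for the co-occurrence-based methods mentioned above, the following is a minimal sketch of a baseline consensus function. It is not the data-fragment distance or the matrix factorization approach of the thesis: it simply counts how often two objects share a cluster across ensemble members and merges pairs whose co-association exceeds a threshold. The function names, the `threshold` parameter, and the toy ensemble are assumptions made for illustration.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of ensemble members that place each pair of objects in the same cluster."""
    n = len(labelings[0])
    m = len(labelings)
    co = [[0.0] * n for _ in range(n)]
    for i in range(n):
        co[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in labelings if labels[i] == labels[j])
        co[i][j] = co[j][i] = agree / m
    return co


def consensus_labels(labelings, threshold=0.5):
    """Merge objects whose co-association exceeds `threshold` (single-linkage style)."""
    n = len(labelings[0])
    co = co_association(labelings)
    parent = list(range(n))  # union-find forest over the objects

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel the roots as consecutive integers in order of first appearance.
    mapping = {}
    return [mapping.setdefault(find(i), len(mapping)) for i in range(n)]


# Hypothetical ensemble: three clusterings of six objects.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as above, labels permuted
    [0, 0, 1, 1, 2, 2],  # a member that disagrees on some objects
]
labels = consensus_labels(ensemble)
```

The sketch is invariant to label permutations because it compares only pairwise co-membership, which is the property that makes co-occurrence representations attractive for combining clusterings.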
The second part of the dissertation deals with classification, the task of deciding which of several predefined categories an object belongs to. We solve high-dimensional remote sensing data fusion problems by formulating them as classification tasks and proposing a dynamic classifier and ensemble selection framework. Relying on the concept of multiple classifier systems, the proposed framework selects and combines competent classifiers from an established ensemble to provide reliable and accurate classification. To enable this, a competence estimation and selection methodology is developed.
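To make the idea of dynamic classifier selection concrete, here is a minimal sketch of a generic local-accuracy rule: for each query, the classifier that is most accurate on the query's nearest validation samples is chosen to make the prediction. This is a common textbook scheme and should not be read as the competence estimation methodology developed in the thesis. The classifiers, the validation set, and the parameter `k` are hypothetical.

```python
import math


def local_accuracy(classifier, neighbours):
    """Share of the neighbouring validation samples that the classifier labels correctly."""
    correct = sum(1 for x, y in neighbours if classifier(x) == y)
    return correct / len(neighbours)


def dcs_predict(classifiers, validation, query, k=3):
    """Predict with the classifier that is most accurate on the k nearest validation samples."""
    neighbours = sorted(validation, key=lambda xy: math.dist(xy[0], query))[:k]
    best = max(classifiers, key=lambda clf: local_accuracy(clf, neighbours))
    return best(query)


# Two hypothetical base classifiers, each thresholding a single feature.
def clf_a(x):
    return int(x[0] > 0.5)


def clf_b(x):
    return int(x[1] > 0.5)


# Toy validation set: on the left the label follows the second feature,
# on the right every sample belongs to class 1.
validation = [
    ((0.10, 0.20), 0), ((0.20, 0.80), 1), ((0.15, 0.90), 1),
    ((0.90, 0.10), 1), ((0.80, 0.90), 1), ((0.85, 0.20), 1),
]

left_prediction = dcs_predict([clf_a, clf_b], validation, (0.1, 0.7))
right_prediction = dcs_predict([clf_a, clf_b], validation, (0.9, 0.3))
```

In this toy setup neither base classifier is correct everywhere, but selecting per query by local accuracy picks `clf_b` on the left and `clf_a` on the right.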
In the third part of the dissertation, we address similarity search in data streams, that is, finding similar objects (or events) in a real-time stream of data. Because of outliers, noise, and potential distortions in the amplitude and time dimensions, it is often challenging to correctly retrieve the required patterns from the stream. To overcome this, we propose a dynamic normalization mechanism that brings streaming signal subsequences to the scale of the query template. We also extend it to the case in which multiple examples of a query template are available, which makes it possible to apply wisdom-of-the-crowds concepts in pattern matching settings. This significantly improves pattern retrieval capabilities, especially when sampling variance or time distortions are present.
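The role of normalization in this kind of search can be illustrated with a minimal sketch that uses plain per-window z-normalization, a standard stand-in rather than the dynamic normalization mechanism proposed in the thesis. Each candidate window and the template are rescaled to zero mean and unit variance before comparison, so a match is found even when the pattern appears with a different offset and amplitude. The function names and the toy data are assumptions, and the stream is processed offline as a list for brevity.

```python
import math
from statistics import fmean, pstdev


def z_normalize(values):
    """Rescale a sequence to zero mean and unit variance (constant input maps to zeros)."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def best_match(stream, template):
    """Return (start index, distance) of the window closest to the template after z-normalization."""
    m = len(template)
    q = z_normalize(template)
    best_pos, best_dist = -1, math.inf
    for start in range(len(stream) - m + 1):
        window = z_normalize(stream[start:start + m])
        d = math.dist(q, window)
        if d < best_dist:
            best_pos, best_dist = start, d
    return best_pos, best_dist


# Hypothetical data: the template occurs at index 4, scaled by 20 and shifted by 10.
template = [0, 1, 2, 1, 0]
stream = [5, 5, 5, 5, 10, 30, 50, 30, 10, 5, 5, 5]
position, distance = best_match(stream, template)
```

Without normalization, the raw Euclidean distance between the template and the scaled occurrence would be large. Z-normalization handles only amplitude and offset, not distortions along the time axis.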
The proposed contributions for clustering, classification, and pattern matching are studied and validated on artificially generated datasets as well as on real-world measurement data obtained from open sources or recorded in the laboratory of AGT Group (R&D) GmbH, Darmstadt, Germany. Multiple experiments are conducted to verify the performance consistency of the proposed methods, which have also been partly integrated into real-world solutions.
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2021 | ||||
Author(s): | Sukhanov, Sergey | ||||
Type of entry: | First publication | ||||
Title: | Clustering, classifying and matching patterns with ensemble techniques | ||||
Language: | English | ||||
Referees: | Zoubir, Prof. Dr. Abdelhak M. ; Muma, Dr.-Ing. Michael | ||||
Year of publication: | 2021 | ||||
Place: | Darmstadt | ||||
Collation: | XI, 128 pages | ||||
Date of oral examination: | 31 May 2021 | ||||
DOI: | 10.26083/tuprints-00019897 | ||||
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/19897 | ||||
Status: | Publisher's version | ||||
URN: | urn:nbn:de:tuda-tuprints-198976 | ||||
Dewey Decimal Classification (DDC) subject group: | 600 Technology, medicine, applied sciences > 620 Engineering and mechanical engineering | ||||
Department(s)/field(s): | 18 Department of Electrical Engineering and Information Technology > Institute for Telecommunications > Signal Processing | ||||
Date deposited: | 18 Nov 2021 10:34 | ||||
Last modified: | 19 Nov 2021 09:56 | ||||