Langenecker, Sven (2024)
Towards Learned Metadata Extraction for Data Lakes.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00027469
Dissertation, First publication, Publisher's version
Abstract
In data-driven enterprises, data lakes are used to store and manage massive volumes of diverse data. Unlike traditional data warehouses, which are characterized by rigid structures and predefined schemas, data lakes embrace a more fluid architecture: data arrives in its raw, unaltered form, preserving its inherent complexity and richness. The lack of predefined structures or standardized schemas makes it difficult to identify, find, understand, and use the relevant data sets contained in these repositories. To address this data discovery problem and enable easy navigation, solutions for automatic metadata extraction are essential. Hence, a variety of Machine Learning (ML) based approaches for automatically extracting semantic types from table columns have recently been proposed. While initial results of these learned approaches seem promising, it is still unclear how well they generalize to new, unseen data in real-world enterprise data lakes. This dissertation thus focuses on making semantic type extraction of table columns feasible for real-world enterprise data lakes. First, we studied existing approaches for semantic type extraction of table columns and evaluated how applicable they are in data lake environments to understand their limitations. Based on the finding that existing approaches are not usable out of the box and always need to be adapted to the data lake in which they are to be used, we advocate a weak supervision concept to adapt these learned semantic type detection models to the specific data lake with minimal effort. Thus, as a first contribution of this dissertation, we present a new data programming framework for semantic labeling based on the idea of weak supervision.
Our data programming framework comes with pre-designed Labeling Functions (LFs) that generate new training data covering the new semantic types and data characteristics of the unseen data lake to which the learned semantic type extraction model is to be applied. With the training data generated by our framework, the model can be re-trained or fine-tuned with minimal effort to adapt it to the respective data lake, which eliminates the barrier to applying recent learned semantic type detection approaches to enterprise data lakes. Furthermore, because the semantic labeling of numerical data is more challenging than that of textual data, we present as a second contribution our novel training data generation procedure called Steered-Labeling. Steered-Labeling is integrated as a core component of our data programming framework and makes it possible to generate high-quality training data for textual and numerical table columns. The basic idea of the new procedure is to separate the labeling process into two sequential steps. In the first step, the framework labels the non-numerical columns, which are easier to label. In the second step, the numerical columns are labeled using the previously generated labels of the non-numerical columns as additional information. With this, the LFs achieve a much higher accuracy for numerical columns. We show in an extensive evaluation that our data programming framework with the Steered-Labeling procedure can adapt learned models to unseen data lakes using the automatically generated training data. During the experiments with our framework, we observed that the re-trained or fine-tuned end models performed worse on numerical columns than on non-numerical columns, even though the generated training data for the numerical columns is of adequate quality.
This is mainly because the existing models were designed, trained, and tested with datasets composed mainly of non-numerical data and are therefore optimized to handle these data types. Although we used two data lakes that contain numerical columns in the evaluation of our Steered-Labeling procedure, these datasets could not be used to design a new model that better supports numerical columns because they are too small for this purpose. Thus, as a third contribution, we create and provide SportsTables, a new corpus for the task of semantic type detection of table columns. Because its tables were scraped from various web pages of different sports domains, our corpus contains a much higher proportion of numerical columns than existing corpora. Furthermore, its tables are much larger in both the number of columns and the number of rows. Hence, our new corpus reflects the characteristics of real-world data lakes and poses new challenges to semantic type detection models. We show through an evaluation of several recent semantic type detection models on our corpus that they perform robustly only on textual data. To tackle the shortcomings of the existing models, we finally propose a new semantic type detection approach called Pythagoras, designed to support numerical as well as non-numerical columns. The main idea of the new model is to use Graph Neural Networks (GNNs) together with a new graph representation of tables and their columns. This graph representation includes directed edges that aggregate the necessary context information (e.g., table name, neighboring non-numerical column values) for predicting the correct semantic type of numerical columns using the GNN message passing mechanism. Thus, the model learns which contextual information is relevant for determining the semantic type. With this approach, our model can outperform all existing semantic type detection models on numerical table columns.
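The graph idea described at the end of the abstract can be illustrated with a minimal sketch. It is not the Pythagoras implementation: the names `TableGraph`, `build_table_graph`, and `message_passing_step`, the `__table__` node, the fixed 0.5/0.5 mixing, and the choice of edges (table name and every non-numerical column pointing at each numerical column) are assumptions made for illustration. A real GNN would use learned transformations instead of a plain mean.

```python
from dataclasses import dataclass, field


@dataclass
class TableGraph:
    # node name -> feature vector (stand-in for a learned embedding)
    features: dict
    # directed edges as (source, target) pairs
    edges: list = field(default_factory=list)


def build_table_graph(table_vec, columns):
    """Build a directed graph for one table (hypothetical layout).

    `columns` maps a column name to (is_numeric, feature_vector).
    Context nodes (table name, non-numerical columns) point at
    every numerical column.
    """
    graph = TableGraph(features={"__table__": list(table_vec)})
    for name, (_, vec) in columns.items():
        graph.features[name] = list(vec)
    numeric = [n for n, (is_num, _) in columns.items() if is_num]
    textual = [n for n, (is_num, _) in columns.items() if not is_num]
    for target in numeric:
        graph.edges.append(("__table__", target))
        for source in textual:
            graph.edges.append((source, target))
    return graph


def message_passing_step(graph):
    """One round of mean aggregation over incoming edges (no learned weights)."""
    incoming = {node: [] for node in graph.features}
    for source, target in graph.edges:
        incoming[target].append(graph.features[source])
    updated = {}
    for node, own in graph.features.items():
        messages = incoming[node]
        if not messages:
            # Nodes without incoming edges keep their features unchanged.
            updated[node] = list(own)
            continue
        dim = len(own)
        mean = [sum(m[i] for m in messages) / len(messages) for i in range(dim)]
        # Mix the node's own features with the aggregated context.
        updated[node] = [0.5 * own[i] + 0.5 * mean[i] for i in range(dim)]
    return updated
```

In this sketch only numerical column nodes receive messages, so their representation after one step carries information from the table name and the neighboring non-numerical columns, while the context nodes themselves stay unchanged.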
Item type: | Dissertation |
---|---|
Published: | 2024 |
Author(s): | Langenecker, Sven |
Type of entry: | First publication |
Title: | Towards Learned Metadata Extraction for Data Lakes |
Language: | English |
Referees: | Binnig, Prof. Dr. Carsten ; Papotti, Prof. PhD Paolo |
Publication date: | 7 June 2024 |
Place: | Darmstadt |
Collation: | xxii, 218 pages |
Date of oral examination: | 4 June 2024 |
DOI: | 10.26083/tuprints-00027469 |
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/27469 |
Status: | Publisher's version |
URN: | urn:nbn:de:tuda-tuprints-274697 |
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science |
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Data and AI Systems |
Date deposited: | 07 Jun 2024 12:05 |
Last modified: | 11 Jun 2024 06:13 |