Hättasch, Benjamin (2024)
Democratizing Information Access through Low Overhead Systems.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00026737
Dissertation, first publication, publisher's version
Abstract
Despite its importance, accessing information in storage systems or raw data is challenging or impossible for most people due to the sheer amount and heterogeneity of data as well as the overheads and complexities of existing systems. In this thesis, we propose several approaches to address this and thereby democratize information access.
Data-driven and AI-based approaches make it possible to provide the necessary information access for many tasks at scale. Unfortunately, most existing approaches can only be built and used by IT experts and data scientists, and the current demand for data scientists far exceeds the supply. Furthermore, applying these approaches is expensive. To counter this, approaches with low overhead are needed, i.e., approaches that do not require large amounts of training data, manual annotation or extraction of information, or extensive computation. However, such systems still need to adapt to the special terminology of different domains and to the individual information needs of their users. Moreover, they should be usable without extensive training; we thus aim to create ready-to-use systems that provide intuitive or familiar ways of interaction, e.g., chatbot-like natural language input or graphical user interfaces.
In this thesis, we propose a number of contributions to three important subfields of data exploration and processing: Natural Language Interfaces for Data Access & Manipulation, Personalized Summarizations of Text Collections, and Information Extraction & Integration. These approaches allow data scientists, domain experts and end users to access and manipulate information in a quick and easy way.
First, we propose two natural language interfaces for data access and manipulation. Natural language is a useful alternative interface for relational databases, since it allows users to formulate complex questions without requiring knowledge of SQL. We propose an approach based on weak supervision that augments existing deep learning techniques in order to improve the performance of models for natural-language-to-SQL translation. Moreover, we apply the idea to build a training pipeline for conversational agents (i.e., chatbot-like systems that allow users to interact with a database and perform actions like ticket booking). The pipeline uses weak supervision to generate the training data automatically from a relational database and its set of defined transactions. Our approach is data-aware, i.e., it leverages the data characteristics of the database at runtime to optimize the dialogue flow and reduce the number of necessary interactions. Additionally, we complement this research by presenting a meta-study on the reproducibility and availability of natural language interfaces for databases (NLIDBs) for real-world applications, and a benchmark to evaluate the linguistic robustness of NLIDBs.
Second, we work on personalized summarization and its usage for data exploration. The central idea is to produce summaries that exactly cover the current information need of the users. By creating multiple summaries or shifting the focus during the interactive creation process, these summaries can be used to explore the contents of unknown text collections. We propose an approach to create such personalized summaries at interactive speed; this is achieved by carefully sampling from the inputs. As part of our research on multi-document summarization, we noticed that there is a lack of diverse evaluation corpora for this task. We therefore present a framework that can be used to automatically create new summarization corpora, and we apply and validate it.
Third, we provide ways to democratize information extraction and integration. This becomes relevant when data is scattered across different sources and there is no tabular representation that already contains all information needed. In such cases, it might be necessary to integrate different structured sources, or even to extract the required pieces of information from text collections first and then organize them. To integrate existing structured data sources, we present and evaluate a novel end-to-end approach for schema matching based on neural embeddings. Finally, we tackle the automatic creation of tables from text for situations where no suitable structured source to answer an information need is available. Our proposed approach can execute SQL-like queries on text collections in an ad-hoc manner, both to directly extract facts from text documents and to produce aggregated tables stating information that is not explicitly mentioned in the documents. Our approach works by generalizing user feedback and therefore does not need domain-specific resources for domain adaptation. It runs at interactive speed even on commodity hardware.
Overall, our approaches can provide a quality level comparable to that of state-of-the-art approaches, but often at a fraction of the associated costs. In other fields, such as table extraction, we even provide functionality that is, to our knowledge, not covered by any generic tooling available to end users. There are still many interesting challenges to solve, and the recent rise of large language models has once more shifted what seems possible with regard to dealing with human language. Yet, we hope that our contributions provide a useful step towards the democratization of information access.
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2024 | ||||
Author(s): | Hättasch, Benjamin | ||||
Entry type: | First publication | ||||
Title: | Democratizing Information Access through Low Overhead Systems | ||||
Language: | English | ||||
Referees: | Binnig, Prof. Dr. Carsten ; Michel, Prof. Dr. Sebastian | ||||
Publication date: | 20 March 2024 | ||||
Place: | Darmstadt | ||||
Collation: | xix, 286 pages | ||||
Date of oral examination: | 11 December 2023 | ||||
DOI: | 10.26083/tuprints-00026737 | ||||
URL / URN: | https://tuprints.ulb.tu-darmstadt.de/26737 | ||||
Status: | Publisher's version | ||||
URN: | urn:nbn:de:tuda-tuprints-267378 | ||||
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science | ||||
Department(s)/field(s): | 20 Department of Computer Science ; 20 Department of Computer Science > Data and AI Systems | ||||
Date deposited: | 20 Mar 2024 15:03 | ||||
Last modified: | 28 Mar 2024 08:45 | ||||