Alkhatib, Wael (2020)
Semantically Enhanced and Minimally Supervised Models for Ontology Construction, Text Classification, and Document Recommendation.
Technische Universität Darmstadt
doi: 10.25534/tuprints-00011890
Dissertation, first publication
Abstract
The proliferation of available knowledge on the web, along with the rapidly increasing number of accessible research publications, overwhelms researchers, students, and educators. Linked data platforms like SciGraph reduce this information overload by combining data from heterogeneous information sources and linking them to ontologies that describe how these resources are related. Linked data platforms provide functionalities to improve the accessibility and discoverability of these resources. These functionalities include methods for maintaining and updating the ontologies used, for assigning concepts to resources, and for recommending relevant resources. About 80% of the information on the Internet exists in the form of unstructured content. This creates the need for automated methods that leverage the wealth of information embedded in unstructured content to realize the needed functionalities.
This thesis contributes to three building blocks of constructing linked data platforms from unstructured information sources: ontology construction and enrichment, text classification, and document recommendation. The majority of machine learning methods used for these problems rely heavily on complicated feature engineering, which is a tedious, time-consuming, and domain-specific process. Our work is motivated by the potential of lexical-semantic resources and deep learning to address the research challenges in current approaches. On the one hand, existing lexical-semantic resources encode various types of information about words, such as their meanings and semantic relations. On the other hand, deep learning methods have achieved state-of-the-art performance on challenging NLP problems such as text classification and semantic relation extraction. The rise of distributed representations is the key to the breakthrough of deep learning on various NLP tasks. The focus of this work is to develop, implement, and evaluate new approaches that better leverage the semantic similarities and regularities between words in large text corpora in order to minimize the hand-crafted feature engineering of current approaches.
With regard to ontology construction and enrichment, we present Onto.KOM, a minimally supervised ontology learning system that uses unstructured text as input in addition to existing lexical databases. We study the effectiveness of our approach for semantic relation classification with respect to different influencing aspects, namely the input representation, the deep network structure used, and the types of semantic relations.
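The abstract does not detail the network used in Onto.KOM; as a rough illustration only, the sketch below shows one common way to classify the semantic relation between a term pair from their distributed representations. The relation inventory, the pair encoding (concatenation plus element-wise difference), and the feed-forward architecture are assumptions for illustration, not the thesis design.

```python
# Illustrative sketch only: a minimal relation classifier over term-pair
# embeddings. Label set, pair encoding, and architecture are assumptions.
import torch
import torch.nn as nn

RELATIONS = ["hypernymy", "meronymy", "synonymy", "no_relation"]  # assumed label set

class RelationClassifier(nn.Module):
    def __init__(self, emb_dim: int = 300, hidden: int = 128):
        super().__init__()
        # Represent the pair by concatenating the two term embeddings with
        # their element-wise difference, a common distributed-representation trick.
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, len(RELATIONS)),
        )

    def forward(self, e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
        features = torch.cat([e1, e2, e1 - e2], dim=-1)
        return self.net(features)  # unnormalized scores over relation types

# Usage with random vectors standing in for pre-trained word embeddings:
model = RelationClassifier()
e1, e2 = torch.randn(4, 300), torch.randn(4, 300)
print(model(e1, e2).shape)  # torch.Size([4, 4])
```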
In the scope of multi-label text classification, our contributions fall into three main areas. First, we propose an approach for feature selection that uses the typed dependencies between words as a measure for selecting the most essential features. We compare our approach with multiple statistical and semantic-based techniques to investigate the advantage of leveraging the semantic and syntactic relationships between words to improve the quality of the selected features. Second, we analyse the performance of deep learning structures on a small dataset of long documents, where traditional techniques tend to perform better. In addition, we develop a new model that uses distributed representations of document fragments together with deep learning structures, and we compare it with a wide range of feature selection and text classification techniques. Third, we address the label imbalance problem and the lack of sufficient training samples. In this scope, we develop a training-less classifier that uses lexical-semantic resources as the basis for classification, transforming the classification problem into a graph matching problem.
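As a rough illustration of the first contribution (typed-dependency-based feature selection), the sketch below scores candidate features by how often they participate in selected dependency relations. spaCy is used here only as a convenient dependency parser, and the relation set and scoring scheme are assumptions rather than the measure defined in the thesis.

```python
# Illustrative sketch only: rank candidate features by participation in
# "informative" typed dependency relations. Relation set and scoring are assumed.
from collections import Counter
import spacy

INFORMATIVE_DEPS = {"nsubj", "dobj", "amod", "compound", "nmod"}  # assumed relation set

nlp = spacy.load("en_core_web_sm")

def dependency_feature_scores(texts):
    scores = Counter()
    for doc in nlp.pipe(texts):
        for token in doc:
            # Count a lemma each time it appears in one of the chosen relations.
            if token.is_alpha and not token.is_stop and token.dep_ in INFORMATIVE_DEPS:
                scores[token.lemma_.lower()] += 1
    return scores

docs = [
    "Deep neural networks learn distributed word representations.",
    "Feature selection removes redundant terms from the vocabulary.",
]
for term, score in dependency_feature_scores(docs).most_common(5):
    print(term, score)
```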
Concerning the recommendation of relevant resources, we address citation recommendation as a particular use case of document recommendation. We propose two models for combining heterogeneous information sources, such as the content of papers, co-authorship information, and previously cited papers, to provide personalized citation recommendations.
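As a rough illustration of fusing such heterogeneous signals, the sketch below ranks candidate papers by a weighted combination of content similarity, co-authorship overlap, and prior-citation evidence. The weights, the TF-IDF content model, and the scoring functions are illustrative assumptions; the two models proposed in the thesis are not reproduced here.

```python
# Illustrative sketch only: naive late-fusion ranking of citation candidates.
# Weights and component scores are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query_text, query_authors, query_cited, candidates, w=(0.6, 0.2, 0.2)):
    # Content signal: TF-IDF cosine similarity between the query and each candidate.
    texts = [query_text] + [c["abstract"] for c in candidates]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    content = cosine_similarity(tfidf[0], tfidf[1:]).ravel()

    scored = []
    for sim, c in zip(content, candidates):
        # Co-authorship overlap and whether the candidate was previously cited.
        coauth = len(query_authors & set(c["authors"])) / max(len(c["authors"]), 1)
        cited = 1.0 if c["id"] in query_cited else 0.0
        scored.append((w[0] * sim + w[1] * coauth + w[2] * cited, c["id"]))
    return sorted(scored, reverse=True)

candidates = [
    {"id": "p1", "abstract": "Graph-based citation recommendation", "authors": ["A", "B"]},
    {"id": "p2", "abstract": "Neural text classification with embeddings", "authors": ["C"]},
]
print(recommend("citation recommendation with graphs", {"A"}, {"p2"}, candidates))
```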
Type of entry: Dissertation
Published: 2020
Author(s): Alkhatib, Wael
Kind of entry: First publication
Title: Semantically Enhanced and Minimally Supervised Models for Ontology Construction, Text Classification, and Document Recommendation
Language: English
Referees: Steinmetz, Prof. Dr. Ralf ; Staab, Prof. Dr. Steffen
Year of publication: 2020
Place: Darmstadt
Date of oral examination: 10 June 2020
DOI: 10.25534/tuprints-00011890
URL / URN: https://tuprints.ulb.tu-darmstadt.de/11890
URN: urn:nbn:de:tuda-tuprints-118909
Dewey Decimal Classification (DDC): 000 Generalities, computer science, information science > 004 Computer science
Department(s)/Institute(s): 18 Fachbereich Elektrotechnik und Informationstechnik > Institut für Datentechnik > Multimedia Kommunikation
Date deposited: 02 Sep 2020 12:52
Last modified: 08 Sep 2020 09:14