Chen, Libo (2006)
Automatic Construction of Domain-Specific Concept Structures.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
One of the greatest challenges for search engines and other search tools developed to cope with information overload is the vocabulary mismatch problem: different people usually use different vocabularies to describe the same concepts. This problem can lead to unsatisfactory search results, because the keywords in search queries often do not match the indices of search engines – either the queries are too imprecise to describe the users’ actual needs, or, although correctly formulated, they simply do not contain the keywords with which authors write their documents. There is therefore a clear need to quickly build a concept structure for each topic or knowledge domain of user interest, comprising the most important concepts of that domain and the relationships between them. Such concept structures can serve to standardize vocabularies in various knowledge domains and help to bridge the vocabulary gap between information users, information creators, and search engines.

Since manual approaches often suffer from low coverage and high expense, this dissertation focuses on corpus-based statistical approaches to automatically build domain-specific concept structures. These approaches first select suitable text corpora to represent the domains of interest, then gather statistical evidence about the terms in those corpora, and finally perform statistical analysis on this evidence to construct concept structures. Two main challenges arise in this process: first, how the concepts of a domain can be found and extracted from text corpora (we refer to all important terms in a domain as concepts); second, how the relationships between these concepts can be determined effectively.

For the task of concept extraction, we first introduce a notion of topicality to define the importance of a term, indicating how topical the term is to a specific domain. We divide term topicality into two factors: term representativeness, which indicates how well a term covers the topic area of a domain, and term specificity, which indicates how specific a term is to a certain domain compared to other knowledge domains. For specificity calculation we present a novel approach that collects information not only for the domain of interest but also for a set of reference domains. A statistical measure called the “Distribution Grade” is developed, which compares the distribution of a term across these domains to calculate its specificity more accurately. By combining representativeness and specificity, we can weight and rank the terms of a text corpus by their topicality and choose a limited number of top-ranked terms as the concepts of the domain of interest.
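The abstract fixes only the structure of the score – topicality as a combination of representativeness and specificity computed against reference domains – not the concrete formulas. The following Python sketch illustrates the idea under stated assumptions: documents are token lists, representativeness is approximated by document coverage, a simple probability-mass ratio stands in for the dissertation’s actual “Distribution Grade”, and the geometric-mean combination plus all function names are hypothetical.

```python
def rel_freq(term, docs):
    """Relative frequency of a term in a collection of token lists."""
    total = sum(len(doc) for doc in docs)
    return sum(doc.count(term) for doc in docs) / total if total else 0.0

def representativeness(term, domain_docs):
    """Assumed proxy: fraction of domain documents containing the term,
    i.e. how well the term covers the topic area of the domain."""
    return sum(1 for doc in domain_docs if term in doc) / len(domain_docs)

def specificity(term, domain_docs, reference_domains):
    """Hypothetical stand-in for the "Distribution Grade": the share of the
    term's probability mass that falls into the domain of interest, compared
    with its distribution over a set of reference domains."""
    p_domain = rel_freq(term, domain_docs)
    total = p_domain + sum(rel_freq(term, docs) for docs in reference_domains)
    return p_domain / total if total else 0.0

def topicality(term, domain_docs, reference_domains, alpha=0.5):
    """Combine both factors; a weighted geometric mean is one plausible choice."""
    r = representativeness(term, domain_docs)
    s = specificity(term, domain_docs, reference_domains)
    return (r ** alpha) * (s ** (1.0 - alpha))

def extract_concepts(candidates, domain_docs, reference_domains, n=100):
    """Rank candidate terms by topicality and keep the top n as concepts."""
    scored = {t: topicality(t, domain_docs, reference_domains) for t in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:n]
```

Single-token terms are assumed for simplicity; multi-word terms would need phrase matching, and the dissertation’s actual weighting scheme may differ.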
Relationship determination between concepts is usually based on a notion of the common context of concepts, quantified by a similarity measure that compares the individual contexts of the concepts with their common context. In this work, we first provide formal definitions and a detailed analysis of two existing kinds of context – one counting the frequency of co-occurrences of concepts in texts, the other considering the terms occurring in the neighbourhood of the concepts. We then introduce a new notion of context that overcomes the limitations of previous approaches by combining evidence on both co-occurrences and neighbourhood terms. A mutual conditional probability model is presented as a general framework for formalizing the most successful similarity measures. Each type of context is quantified by the probability model, and the results are combined into a hybrid similarity measure that determines a “Generally Related” relationship. We also investigate the possibility of determining a “Broader/Narrower” relationship, which plays an important role in building hierarchical concept structures. We show that considering the individual conditional probabilities of the mutual conditional probability model, on the premise of a close “Generally Related” relationship, helps to identify “Broader/Narrower” relationships more reliably.

For an automatic evaluation of our approach, we employ widely accepted, manually built concept structures as “gold standards” and automatically compare the extracted concepts and relationships with the entries in these gold standards. Experimental results show that our approaches achieve the best performance for a wide range of candidate terms and relationships, and for different types of text collections.
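The abstract names the mutual conditional probability model but does not spell out its formula. Below is a minimal sketch of the underlying idea, assuming document-level co-occurrence as the only context (the neighbourhood-term context is omitted) and hypothetical thresholds: the two conditionals P(a|b) and P(b|a) are combined symmetrically for “Generally Related”, and their asymmetry suggests a “Broader/Narrower” direction for pairs that are already closely related.

```python
from collections import Counter
from itertools import combinations

def count_contexts(concepts, docs):
    """Document-level occurrence and co-occurrence counts for the concepts."""
    occ, co = Counter(), Counter()
    for doc in docs:
        present = sorted(c for c in set(concepts) if c in doc)
        occ.update(present)
        co.update(frozenset(pair) for pair in combinations(present, 2))
    return occ, co

def conditional(a, b, occ, co):
    """Estimate P(a | b) as the fraction of b's documents that also contain a."""
    return co[frozenset((a, b))] / occ[b] if occ[b] else 0.0

def generally_related(a, b, occ, co):
    """Symmetric combination of the two conditionals (here: their product);
    the abstract does not fix the exact form of the hybrid measure."""
    return conditional(a, b, occ, co) * conditional(b, a, occ, co)

def broader_narrower(a, b, occ, co, related_min=0.05, ratio_min=2.0):
    """Only on the premise of a close "Generally Related" relationship:
    if P(a|b) is much larger than P(b|a), then a appears in most of b's
    contexts but not vice versa, suggesting a is the broader concept."""
    if generally_related(a, b, occ, co) < related_min:
        return None
    p_ab, p_ba = conditional(a, b, occ, co), conditional(b, a, occ, co)
    if p_ba > 0 and p_ab / p_ba >= ratio_min:
        return (a, "broader-than", b)
    if p_ab > 0 and p_ba / p_ab >= ratio_min:
        return (b, "broader-than", a)
    return None
```

The `related_min` and `ratio_min` thresholds are illustrative; in practice they would be tuned against a gold-standard concept structure, as the evaluation in the abstract describes.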
Type of entry: | Dissertation
---|---
Published: | 2006
Author(s): | Chen, Libo
Type of publication: | First publication
Title: | Automatic Construction of Domain-Specific Concept Structures
Language: | English
Referees: | Neuhold, Prof. Dr. Erich J. ; Hofmann, Prof. Dr. Thomas
Advisor: | Neuhold, Prof. Dr. Erich J.
Date of publication: | 13 April 2006
Place: | Darmstadt
Publisher: | Technische Universität
Date of oral examination: | 28 March 2006
URL / URN: | urn:nbn:de:tuda-tuprints-6798
Dewey Decimal Classification (DDC): | 000 Generalities, computer science, information science > 004 Computer science
Department(s): | 20 Department of Computer Science
Date deposited: | 17 Oct 2008 09:22
Last modified: | 05 Mar 2013 09:26