Eckart de Castilho, Richard (2014)
Natural Language Processing: Integration of Automatic and Manual Analysis.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
There is a current trend to combine natural language analysis with research questions from the humanities. This requires an integration of automatic analysis with manual analysis, e.g. to develop a theory behind the analysis, to test the theory against a corpus, to generate training data for automatic analysis based on machine learning algorithms, and to evaluate the quality of the results from automatic analysis. Manual analysis is traditionally the domain of linguists, philosophers, and researchers from other humanities disciplines, who are often not expert programmers. Automatic analysis, on the other hand, is traditionally done by expert programmers, such as computer scientists and more recently computational linguists. It is important to bring these communities, their tools, and data closer together, to produce analysis of a higher quality with less effort. However, promising cooperations involving manual and automatic analysis, e.g. for the purpose of analyzing a large corpus, are hindered by many problems:
- No comprehensive set of interoperable automatic analysis components is available.
- Assembling automatic analysis components into workflows is too complex.
- Automatic analysis tools, exploration tools, and annotation editors are not interoperable.
- Workflows are not portable between computers.
- Workflows are not easily deployable to a compute cluster.
- There are no adequate tools for the selective annotation of large corpora.
- In automatic analysis, annotation type systems are predefined, but manual annotation requires customizability.
- Implementing new interoperable automatic analysis components is too complex.
- Workflows and components are not sufficiently debuggable and refactorable.
- Workflows that change dynamically via parametrization are not readily supported.
- The user has no control over workflows that rely on expert skills from a different domain, undocumented knowledge, or third-party infrastructures, e.g. web services.
In cooperation with researchers from the humanities, we develop innovative technical solutions and designs to facilitate the use of automatic analysis and to promote the integration of manual and automatic analysis. To address these issues, we set foundations in four areas:
- Usability is improved by reducing the complexity of the APIs for building workflows and creating custom components, improving the handling of resources required by such components, and setting up auto-configuration mechanisms.
- Reproducibility is improved through a concept for self-contained, portable analysis components and workflows, combined with a declarative modeling approach for dynamic parametrized workflows that helps avoid unnecessary auxiliary manual steps in automatic workflows.
- Flexibility is achieved by providing an extensive collection of interoperable automatic analysis components. We also compare annotation type systems used by different automatic analysis components to locate design patterns that allow for customization when used in manual analysis tasks.
- Interactivity is achieved through a novel "annotation-by-query" process combining corpus search with annotation in a multi-user scenario. The process is supported by a web-based tool.
We demonstrate the adequacy of our concepts through examples which represent whole classes of research problems. Additionally, we integrated all our concepts into existing open-source projects, or we implemented and published them within new open-source projects.
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2014 | ||||
Author(s): | Eckart de Castilho, Richard | ||||
Type of entry: | First publication | ||||
Title: | Natural Language Processing: Integration of Automatic and Manual Analysis | ||||
Language: | English | ||||
Referees: | Gurevych, Prof. Dr. Iryna ; Henrich, Prof. Dr. Andreas ; Manning, PhD. Christopher D. | ||||
Year of publication: | 2014 | ||||
Place: | Darmstadt | ||||
Date of oral examination: | 10 February 2014 | ||||
URL / URN: | http://tuprints.ulb.tu-darmstadt.de/4151 | ||||
Uncontrolled keywords: | Natural Language Processing, Software Engineering, Automatic Text Analysis, Manual Text Analysis, Annotation Tool, Annotation Type Systems | ||||
URN: | urn:nbn:de:tuda-tuprints-41517 | ||||
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science; 400 Language > 400 Language, linguistics |
||||
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Ubiquitous Knowledge Processing |
||||
Date deposited: | 16 Nov 2014 20:55 | ||||
Last modified: | 23 Aug 2018 12:08 | ||||