Using topic modeling to restructure the archive system of the German Waterways and Shipping Administration

Hoffmann, André ; Shi, Meiling ; Rüppel, Uwe
Eds.: Semenov, Vitaly ; Scherer, Raimar J. (2021)
Using topic modeling to restructure the archive system of the German Waterways and Shipping Administration.
13th European Conference on Product & Process Modelling (ECPPM 2021). Moscow, Russia (15-17 September 2021)
doi: 10.1201/9781003191476-30
Conference publication, Bibliography

Abstract

The German Waterways and Shipping Administration (WSV) is responsible for a large number of technical documents in its archive system. These include the design process in accordance with its administrative regulations (VV-WSV), which covers the entire planning cycle from basic evaluation to implementation planning. In the process of planning, construction and operation of objects of the hydraulic engineering infrastructure, a large and varied number of documents is being accumulated at the responsible authorities. Hierarchical filing systems provided with metadata are often not sufficient to search the documents in a targeted manner. The object of research is therefore machine learning methods that generate new classification systems on the basis of the given document stock and can integrate the existing documents into them. The filing is object-related and the clerk specifies various descriptive attributes. Of interest are now procedures that automatically generate topic models on the basis of the specified texts in the metadata documents in order to assign the documents to them. For this study, the words in the metadata attributes were combined into so-called bag of words and latent Dirichlet allocation (LDA) was applied to automatically find word groups that belong together. With the topic models generated in this way, documents can be searched according to topic composition or, in the case of a keyword search, documents can be displayed which do not contain the keyword but which match the topic. Due to the high number of topics that overlapped within the planning data and the few words per document, the algorithm found it difficult to generate unambiguous topics that could be easily interpreted by humans. In order to generate such topics, so-called Seeded LDA was used. Here the generation of topics can be influenced by setting seed words per topic. With Seeded LDA it is possible to fix certain topics while the algorithm decides others freely and finds new topics.
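
The approach outlined in the abstract (bags of words built from metadata attributes, LDA, and seed words that pin down some topics) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a collapsed Gibbs sampler and a simplified form of seeding in which seed words receive a higher prior weight in their topic, which may differ from the Seeded LDA variant used in the paper. The function name `seeded_lda`, its parameters, the toy documents, and the seed words are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def seeded_lda(docs, n_topics, seeds=None, alpha=0.1, beta=0.01,
               seed_boost=5.0, n_iter=200, rng_seed=0):
    """Collapsed Gibbs sampling for LDA with optional seed words.

    docs:  list of token lists; each list is one bag of words.
    seeds: dict mapping a topic index to its seed words (assumed format).
    Returns (doc_topic, topic_words): topic proportions per document and
    word counts per topic.
    """
    rng = random.Random(rng_seed)
    seeds = seeds or {}
    vocab = sorted({w for doc in docs for w in doc})

    # Topic-word prior: beta everywhere, plus seed_boost for each seed word
    # in the topic it is meant to anchor (simplified seeding, an assumption).
    prior = [{w: beta for w in vocab} for _ in range(n_topics)]
    for k, words in seeds.items():
        for w in words:
            if w in prior[k]:
                prior[k][w] += seed_boost
    prior_sum = [sum(p.values()) for p in prior]
    seed_topic = {w: k for k, words in seeds.items() for w in words}

    n_dk = [[0] * n_topics for _ in docs]        # topic counts per document
    n_kw = [Counter() for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                         # token totals per topic
    z = []                                       # topic of every token

    # Initialisation: seed words start in their topic, all others at random.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = seed_topic[w] if w in seed_topic else rng.randrange(n_topics)
            assignments.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(assignments)

    # Gibbs sweeps: resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + prior[t][w]) / (n_k[t] + prior_sum[t])
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in n_dk[d]]
        for d, doc in enumerate(docs)
    ]
    return doc_topic, n_kw


# Invented toy metadata, one short bag of words per archived document.
docs = [
    ["lock", "gate", "inspection", "report"],
    ["lock", "chamber", "structural", "analysis"],
    ["weir", "structural", "analysis", "report"],
    ["weir", "gate", "drawing"],
    ["survey", "soil", "report"],
    ["soil", "survey", "drawing"],
]
# Topics 0 and 1 are fixed by seed words; topic 2 is left for the sampler.
seeds = {0: ["lock"], 1: ["weir"]}

doc_topic, topic_words = seeded_lda(docs, n_topics=3, seeds=seeds)
for tokens, mix in zip(docs, doc_topic):
    print(tokens, [round(p, 2) for p in mix])
```

Documents can then be compared or retrieved by their topic proportions in `doc_topic`, so a search for a keyword could also return documents that share its dominant topic without containing the word itself. With so few words per document the result depends noticeably on `alpha`, `beta`, and `seed_boost`, which are arbitrary values here.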

Item type: Conference publication
Published: 2021
Editors: Semenov, Vitaly ; Scherer, Raimar J.
Author(s): Hoffmann, André ; Shi, Meiling ; Rüppel, Uwe
Type of entry: Bibliography
Title: Using topic modeling to restructure the archive system of the German Waterways and Shipping Administration
Language: English
Publication date: September 2021
Place: London
Publisher: CRC Press
Book title: ECPPM 2021 – eWork and eBusiness in Architecture, Engineering and Construction: Proceedings of the 13th European Conference on Product & Process Modelling 2021
Event title: 13th European Conference on Product & Process Modelling (ECPPM 2021)
Event location: Moscow, Russia
Event date: 15-17 September 2021
DOI: 10.1201/9781003191476-30
URL / URN: https://www.taylorfrancis.com/books/9781003191476/chapters/1...
ID number: pmid:50217399
Division(s): 13 Department of Civil and Environmental Engineering Sciences
Date deposited: 30 Nov 2022 09:54
Last modified: 23 Dec 2022 10:26