Summarization Beyond News: The Automatically Acquired Fandom Corpora

Hättasch, Benjamin ; Geisler, Nadja ; Meyer, Christian M. ; Binnig, Carsten (2020)
Summarization Beyond News: The Automatically Acquired Fandom Corpora.
Conference or Workshop Item

Abstract

Large state-of-the-art corpora for training neural networks to create abstractive summaries are mostly limited to the news genre, as it is expensive to acquire human-written summaries for other types of text at a large scale. In this paper, we present a novel automatic corpus construction approach to tackle this issue as well as three new large open-licensed summarization corpora based on our approach that can be used for training abstractive summarization models. Our constructed corpora contain fictional narratives, descriptive texts, and summaries about movies, television, and book series from different domains. All sources use a Creative Commons (CC) license, hence we can provide the corpora for download. In addition, we also provide a ready-to-use framework that implements our automatic construction approach to create custom corpora with desired parameters like the length of the target summary and the number of source documents from which to create the summary. The main idea behind our automatic construction approach is to use existing large text collections (e.g., thematic wikis) and automatically classify whether the texts can be used as (query-focused) multi-document summaries and align them with potential source texts. As a final contribution, we show the usefulness of our automatic construction approach by running state-of-the-art summarizers on the corpora and through a manual evaluation with human annotators.

Item Type: Conference or Workshop Item
Published: 2020
Creators: Hättasch, Benjamin ; Geisler, Nadja ; Meyer, Christian M. ; Binnig, Carsten
Type of entry: Bibliography
Title: Summarization Beyond News: The Automatically Acquired Fandom Corpora
Language: English
Date: May 2020
Publisher: European Language Resources Association
Book Title: Proceedings of The 12th Language Resources and Evaluation Conference
URL / URN: https://www.aclweb.org/anthology/2020.lrec-1.827
Uncontrolled Keywords: Text Summarization, corpus construction, multi-document summarization, query-focused summarization, AIPHES_area_d2, dm, dm_vi_ml, dm_fandom
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Data Management (renamed Data and AI Systems in 2022)
DFG Research Training Groups
DFG Research Training Groups > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Date Deposited: 04 Jun 2020 11:56
Last Modified: 04 Jun 2020 11:56
Projects: https://www.informatik.tu-darmstadt.de/datamanagement/datamanagement/dm_research/dm_research_projects/visual_interactive_data_exploration_and_machine_learning/dm_research_fandomcorpora/dm_projects_fandom_corpora.en.jsp