TxT: Crossmodal End-to-End Learning with Transformers

Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan (2021)
TxT: Crossmodal End-to-End Learning with Transformers.
43rd DAGM German Conference on Pattern Recognition 2021. Virtual conference (28.09.-01.10.2021)
doi: 10.1007/978-3-030-92659-5_26
Conference or Workshop Item, Bibliography

Abstract

Reasoning over multiple modalities, e.g., in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today’s multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today’s multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.
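To make the end-to-end idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: all module names, dimensions, and hyperparameters are illustrative assumptions. It contrasts the common two-stage setup, in which frozen region features are pre-extracted from a detector, with TxT-style end-to-end training, in which the multimodal task loss also updates the transformer-based detector.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDetector(nn.Module):
    # Stand-in for a transformer-based detector (DETR-like): maps an
    # image to a fixed set of region descriptors via learned queries.
    def __init__(self, num_regions=36, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.queries = nn.Parameter(torch.randn(num_regions, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, images):                         # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)
        queries = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        return self.decoder(queries, feats)            # (B, num_regions, dim)

class ToyMultimodalEncoder(nn.Module):
    # Stand-in for a BERT-like multimodal encoder with a VQA answer head.
    def __init__(self, vocab_size=1000, dim=256, num_answers=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_answers)

    def forward(self, token_ids, region_feats):
        x = torch.cat([self.embed(token_ids), region_feats], dim=1)
        return self.head(self.encoder(x).mean(dim=1))  # (B, num_answers)

detector = ToyDetector()
model = ToyMultimodalEncoder()
images = torch.randn(2, 3, 64, 64)
token_ids = torch.randint(0, 1000, (2, 12))
answers = torch.randint(0, 10, (2,))

# Two-stage baseline: region features are extracted once under
# no_grad, so the VQA loss never reaches the detector.
with torch.no_grad():
    frozen_feats = detector(images)
loss_two_stage = F.cross_entropy(model(token_ids, frozen_feats), answers)

# End-to-end (TxT-style): gradients flow through the region descriptors
# back into the detector, tuning the visual representation to the task.
loss_end_to_end = F.cross_entropy(model(token_ids, detector(images)), answers)
loss_end_to_end.backward()

The only difference between the two regimes in this sketch is whether gradients are detached at the feature interface; in the end-to-end case they propagate through the region descriptors into the detector, which is what allows the visual representation to adapt to the question-answering task.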

Item Type: Conference or Workshop Item
Published: 2021
Creators: Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan
Type of entry: Bibliography
Title: TxT: Crossmodal End-to-End Learning with Transformers
Language: English
Date: 15 September 2021
Publisher: Springer
Book Title: Pattern Recognition
Series: Lecture Notes in Computer Science
Series Volume: 13024
Event Title: 43rd DAGM German Conference on Pattern Recognition 2021
Event Location: Virtual conference
Event Dates: 28.09.-01.10.2021
DOI: 10.1007/978-3-030-92659-5_26
URL / URN: https://link.springer.com/chapter/10.1007/978-3-030-92659-5_...

Uncontrolled Keywords: UKP_p_emergencity, emergenCITY_INF
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
20 Department of Computer Science > Visual Inference
LOEWE
LOEWE > LOEWE-Zentren
LOEWE > LOEWE-Zentren > emergenCITY
Date Deposited: 21 Sep 2021 14:04
Last Modified: 09 Sep 2022 06:44
PPN: 498944085