TU Darmstadt / ULB / TUbiblio

TxT: Crossmodal End-to-End Learning with Transformers

Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan (2021):
TxT: Crossmodal End-to-End Learning with Transformers.
In: Pattern Recognition. Lecture Notes in Computer Science, vol. 13024, pp. 405-420,
Springer, 43rd DAGM German Conference on Pattern Recognition 2021, virtual conference, 28.09.-01.10.2021, ISBN 978-3-030-92659-5,
DOI: 10.1007/978-3-030-92659-5_26,
[Conference or Workshop Item]

Abstract

Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today’s multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today’s multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.

Item Type: Conference or Workshop Item
Published: 2021
Creators: Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan
Title: TxT: Crossmodal End-to-End Learning with Transformers
Language: English
Book Title: Pattern Recognition
Series: Lecture Notes in Computer Science
Series Volume: 13024
Publisher: Springer
ISBN: 978-3-030-92659-5
Uncontrolled Keywords: UKP_p_emergencity, emergenCITY_INF
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
20 Department of Computer Science > Visual Inference
LOEWE
LOEWE > LOEWE-Zentren
LOEWE > LOEWE-Zentren > emergenCITY
Event Title: 43rd DAGM German Conference on Pattern Recognition 2021
Event Location: Virtual conference
Event Dates: 28.09.-01.10.2021
Date Deposited: 21 Sep 2021 14:04
DOI: 10.1007/978-3-030-92659-5_26
URL / URN: https://link.springer.com/chapter/10.1007/978-3-030-92659-5_...
PPN: 498944085