
TxT: Crossmodal End-to-End Learning with Transformers

Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan (2021)
TxT: Crossmodal End-to-End Learning with Transformers.
43rd DAGM German Conference on Pattern Recognition 2021, virtual conference (28.09.–01.10.2021)
doi: 10.1007/978-3-030-92659-5_26
Conference publication, bibliography

Abstract

Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today’s multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today’s multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.
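
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of such an end-to-end crossmodal model: a transformer-based detector supplies visual tokens that are fused with the question tokens in a joint transformer, so the answer loss can back-propagate into the visual component rather than relying on frozen, pre-extracted Faster R-CNN features. The detector and encoder interfaces, dimensions, and the fusion design below are illustrative assumptions and do not reproduce the paper's actual architecture.

    import torch
    import torch.nn as nn


    class CrossmodalVQA(nn.Module):
        """Illustrative end-to-end crossmodal pipeline in the spirit of TxT.

        All module interfaces, dimensions, and the fusion design are
        assumptions for illustration, not the paper's implementation.
        """

        def __init__(self, detector, text_encoder,
                     det_dim=256, hidden_dim=768, num_answers=3129):
            super().__init__()
            self.detector = detector          # assumed: images -> (B, num_queries, det_dim)
            self.text_encoder = text_encoder  # assumed: BERT-style -> (B, seq_len, hidden_dim)
            self.visual_proj = nn.Linear(det_dim, hidden_dim)
            fusion_layer = nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=12, batch_first=True)
            self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=4)
            self.answer_head = nn.Linear(hidden_dim, num_answers)

        def forward(self, images, input_ids, attention_mask):
            # Visual tokens come from a differentiable transformer detector,
            # not from pre-extracted, fixed region features.
            visual_tokens = self.visual_proj(self.detector(images))
            text_tokens = self.text_encoder(input_ids, attention_mask)
            # Concatenate both modalities and fuse them jointly; every
            # parameter, including the detector's, receives gradients.
            fused = self.fusion(torch.cat([text_tokens, visual_tokens], dim=1))
            return self.answer_head(fused[:, 0])  # predict from the [CLS] position

Under these assumptions, training reduces to a standard classification loop, e.g. cross-entropy between model(images, input_ids, attention_mask) and the answer labels, with a single optimizer over all parameters; this is what "fully end-to-end" means here, in contrast to pipelines that freeze the detector.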

Entry type: Conference publication
Published: 2021
Author(s): Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan
Type of entry: Bibliography
Title: TxT: Crossmodal End-to-End Learning with Transformers
Language: English
Date of publication: 15 September 2021
Publisher: Springer
Book title: Pattern Recognition
Series: Lecture Notes in Computer Science
Series volume: 13024
Event title: 43rd DAGM German Conference on Pattern Recognition 2021
Event location: Virtual conference
Event date: 28.09.–01.10.2021
DOI: 10.1007/978-3-030-92659-5_26
URL / URN: https://link.springer.com/chapter/10.1007/978-3-030-92659-5_...
Free keywords: UKP_p_emergencity, emergenCITY_INF
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
20 Department of Computer Science > Visual Inference
LOEWE
LOEWE > LOEWE centres
LOEWE > LOEWE centres > emergenCITY
Date deposited: 21 Sep 2021 14:04
Last modified: 09 Sep 2022 06:44
PPN: 498944085