
TxT: Crossmodal End-to-End Learning with Transformers

Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan (2021)
TxT: Crossmodal End-to-End Learning with Transformers.
43rd DAGM German Conference on Pattern Recognition 2021, virtual conference (28.09.–01.10.2021)
doi: 10.1007/978-3-030-92659-5_26
Conference publication, bibliography

Abstract

Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today’s multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today’s multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.
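
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of such an end-to-end crossmodal model: a transformer-based detector supplies visual tokens that are fused with the question tokens in a joint transformer, so the answer loss can back-propagate into the visual component rather than relying on frozen, pre-extracted Faster R-CNN features. The detector and encoder interfaces, dimensions, and the fusion design below are illustrative assumptions and do not reproduce the paper's actual architecture.

    import torch
    import torch.nn as nn


    class CrossmodalVQA(nn.Module):
        """Illustrative end-to-end crossmodal pipeline in the spirit of TxT.

        All module interfaces, dimensions, and the fusion design are
        assumptions for illustration, not the paper's implementation.
        """

        def __init__(self, detector, text_encoder,
                     det_dim=256, hidden_dim=768, num_answers=3129):
            super().__init__()
            self.detector = detector          # assumed: images -> (B, num_queries, det_dim)
            self.text_encoder = text_encoder  # assumed: BERT-style -> (B, seq_len, hidden_dim)
            self.visual_proj = nn.Linear(det_dim, hidden_dim)
            fusion_layer = nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=12, batch_first=True)
            self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=4)
            self.answer_head = nn.Linear(hidden_dim, num_answers)

        def forward(self, images, input_ids, attention_mask):
            # Visual tokens come from a differentiable transformer detector,
            # not from pre-extracted, fixed region features.
            visual_tokens = self.visual_proj(self.detector(images))
            text_tokens = self.text_encoder(input_ids, attention_mask)
            # Concatenate both modalities and fuse them jointly; every
            # parameter, including the detector's, receives gradients.
            fused = self.fusion(torch.cat([text_tokens, visual_tokens], dim=1))
            return self.answer_head(fused[:, 0])  # predict from the [CLS] position

Under these assumptions, training reduces to a standard classification loop, e.g. cross-entropy between model(images, input_ids, attention_mask) and the answer labels, with a single optimizer over all parameters; this is what "fully end-to-end" means here, in contrast to pipelines that freeze the detector.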

Entry type: Conference publication
Published: 2021
Author(s): Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan
Type of entry: Bibliography
Title: TxT: Crossmodal End-to-End Learning with Transformers
Language: English
Date of publication: 15 September 2021
Publisher: Springer
Book title: Pattern Recognition
Series: Lecture Notes in Computer Science
Series volume: 13024
Event title: 43rd DAGM German Conference on Pattern Recognition 2021
Event location: Virtual conference
Event date: 28.09.–01.10.2021
DOI: 10.1007/978-3-030-92659-5_26
URL / URN: https://link.springer.com/chapter/10.1007/978-3-030-92659-5_...
Free keywords: UKP_p_emergencity, emergenCITY_INF
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
20 Department of Computer Science > Visual Inference
LOEWE
LOEWE > LOEWE centres
LOEWE > LOEWE centres > emergenCITY
Date deposited: 21 Sep 2021 14:04
Last modified: 09 Sep 2022 06:44
PPN: 498944085