Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan (2021)
TxT: Crossmodal End-to-End Learning with Transformers.
43rd DAGM German Conference on Pattern Recognition 2021, virtual conference (28.09.2021-01.10.2021)
doi: 10.1007/978-3-030-92659-5_26
Conference publication, Bibliography
Abstract
Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today’s multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today’s multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.
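The abstract's central argument is that the answer loss should backpropagate into the visual component instead of stopping at pre-extracted detector features. The following is a minimal, illustrative sketch of that end-to-end coupling, not the authors' implementation: the DETR-style query decoder, the toy patch-embedding backbone, and all sizes (the BERT-sized vocabulary, the VQA-style answer head, 36 object queries) are assumptions made purely for the example.

```python
# Illustrative sketch (NOT the TxT code): a transformer-based detector
# produces region features that feed a multimodal transformer, and the
# answer loss backpropagates into the visual encoder's parameters.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Stand-in for a transformer-based detector (DETR-style): learned
    object queries attend over image features and yield region-level
    embeddings that remain differentiable end to end."""
    def __init__(self, dim=256, num_queries=36):  # sizes are assumptions
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # toy patch embed
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, images):                                # (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)  # (B, HW/256, dim)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        return self.decoder(q, feats)                         # (B, num_queries, dim)

class CrossmodalModel(nn.Module):
    """Joint transformer over [question tokens ; region features]."""
    def __init__(self, vocab=30522, dim=256, num_answers=3129):  # assumed sizes
        super().__init__()
        self.visual = VisualEncoder(dim)
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_answers)

    def forward(self, images, token_ids):
        v = self.visual(images)                    # visual tokens, still in the graph
        t = self.embed(token_ids)                  # text tokens
        fused = self.fusion(torch.cat([t, v], dim=1))
        return self.head(fused[:, 0])              # pool first text token (CLS-like)

model = CrossmodalModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 12)))
loss = nn.functional.cross_entropy(logits, torch.tensor([1, 2]))
loss.backward()  # end-to-end: gradients reach the visual encoder's parameters
```

By contrast, the pre-extracted Faster R-CNN features the abstract criticizes would enter such a model as a fixed tensor, leaving no gradient path through which the multimodal task could tune the visual representation.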
Item type: | Conference publication |
---|---|
Published: | 2021 |
Author(s): | Steitz, Jan-Martin O. ; Pfeiffer, Jonas ; Gurevych, Iryna ; Roth, Stefan |
Type of entry: | Bibliography |
Title: | TxT: Crossmodal End-to-End Learning with Transformers |
Language: | English |
Date of publication: | 15 September 2021 |
Publisher: | Springer |
Book title: | Pattern Recognition |
Series: | Lecture Notes in Computer Science |
Series volume: | 13024 |
Event title: | 43rd DAGM German Conference on Pattern Recognition 2021 |
Event location: | Virtual conference |
Event dates: | 28.09.2021-01.10.2021 |
DOI: | 10.1007/978-3-030-92659-5_26 |
URL / URN: | https://link.springer.com/chapter/10.1007/978-3-030-92659-5_... |
Free keywords: | UKP_p_emergencity, emergenCITY_INF |
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Ubiquitous Knowledge Processing; 20 Department of Computer Science > Visual Inference; LOEWE; LOEWE > LOEWE Centres; LOEWE > LOEWE Centres > emergenCITY |
Date deposited: | 21 Sep 2021 14:04 |
Last modified: | 09 Sep 2022 06:44 |
PPN: | 498944085 |