DAPR: A Benchmark on Document-Aware Passage Retrieval

Wang, Kexin ; Reimers, Nils ; Gurevych, Iryna (2024)
DAPR: A Benchmark on Document-Aware Passage Retrieval.
62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand (11 - 16 August 2024)
doi: 10.18653/v1/2024.acl-long.236
Conference publication, bibliography

Abstract

Work on neural retrieval has so far focused on ranking short texts and is challenged by long documents. In many cases, users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles or research papers. We propose and name this task Document-Aware Passage Retrieval (DAPR). Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the major errors (53.5%) are due to missing document context. This motivates us to build a benchmark for this task that includes multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs strongest on the mixture of easy and hard queries, it completely fails on the hard queries that require document-context understanding. On the other hand, contextualized passage representations (e.g. prepending document titles) achieve good improvement on these hard queries, but overall they also perform rather poorly. Our benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available.

Item type: Conference publication
Published: 2024
Author(s): Wang, Kexin ; Reimers, Nils ; Gurevych, Iryna
Type of entry: Bibliography
Title: DAPR: A Benchmark on Document-Aware Passage Retrieval
Language: English
Publication date: August 2024
Publisher: ACL
Book title: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Event title: 62nd Annual Meeting of the Association for Computational Linguistics
Event location: Bangkok, Thailand
Event dates: 11 - 16 August 2024
DOI: 10.18653/v1/2024.acl-long.236
URL / URN: https://aclanthology.org/2024.acl-long.236/
Uncontrolled keywords: UKP_p_qa_sci_inf
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Date deposited: 20 Aug 2024 09:02
Last modified: 26 Nov 2024 14:11
PPN: 524135843