M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

Wang, Yuxia ; Mansurov, Jonibek ; Ivanov, Petar ; Su, Jinyan ; Shelmanov, Artem ; Tsvigun, Akim ; Whitehouse, Chenxi ; Afzal, Osama Mohammed ; Mahmoud, Tarek ; Sasaki, Toru ; Arnold, Thomas ; Aji, Alham Fikri ; Habash, Nizar ; Gurevych, Iryna ; Nakov, Preslav (2024)
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection.
18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julian's, Malta (17-22 March 2024)
Conference publication, Bibliography

Abstract

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4

Item type: Conference publication
Published: 2024
Author(s): Wang, Yuxia ; Mansurov, Jonibek ; Ivanov, Petar ; Su, Jinyan ; Shelmanov, Artem ; Tsvigun, Akim ; Whitehouse, Chenxi ; Afzal, Osama Mohammed ; Mahmoud, Tarek ; Sasaki, Toru ; Arnold, Thomas ; Aji, Alham Fikri ; Habash, Nizar ; Gurevych, Iryna ; Nakov, Preslav
Type of entry: Bibliography
Title: M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Language: English
Date of publication: March 2024
Publisher: ACL
Book title: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Event title: 18th Conference of the European Chapter of the Association for Computational Linguistics
Event location: St. Julian's, Malta
Event date: 17-22 March 2024
URL / URN: https://aclanthology.org/2024.eacl-long.83/

Division(s)/Research area(s): 20 Department of Computer Science
20 Department of Computer Science > Ubiquitous Knowledge Processing
Date deposited: 12 Apr 2024 11:05
Last modified: 06 Aug 2024 13:03
PPN: 520386973