Purkayastha, Sukannya ; Ruder, Sebastian ; Pfeiffer, Jonas ; Gurevych, Iryna ; Vulić, Ivan (2023)
Romanization-based Large-scale Adaptation of Multilingual Language Models.
2023 Conference on Empirical Methods in Natural Language Processing. Singapore (06.12.2023-10.12.2023)
doi: 10.18653/v1/2023.findings-emnlp.538
Conference publication, Bibliography
Abstract
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. However, their large-scale deployment to many languages, besides pretraining data scarcity, is also hindered by the increase in vocabulary size and limitations in their parameter budget. In order to boost the capacity of mPLMs to deal with low-resource and unseen languages, we explore the potential of leveraging transliteration on a massive scale. In particular, we explore the UROMAN transliteration tool, which provides mappings from UTF-8 to Latin characters for all the writing systems, enabling inexpensive romanization for virtually any language. We first focus on establishing how UROMAN compares against other language-specific and manually curated transliterators for adapting multilingual PLMs. We then study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups: on languages with unseen scripts and with limited training data without any vocabulary augmentation. Further analyses reveal that an improved tokenizer based on romanized data can even outperform non-transliteration-based methods in the majority of languages.
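The key enabler described in the abstract is UROMAN's script-agnostic mapping from UTF-8 text to Latin characters. As a minimal sketch (not taken from the paper), the snippet below shows how such romanization might be invoked through the Python port of uroman (pip install uroman); the Uroman class and romanize_string method follow the isi-nlp/uroman README and may differ across versions, and the example inputs are arbitrary illustrative strings.

```python
# Minimal sketch: romanizing text with the Python port of uroman
# (pip install uroman). API names follow the isi-nlp/uroman README;
# treat them as illustrative, not as the paper's own pipeline.
import uroman as ur

uroman = ur.Uroman()  # loads the character-mapping data tables

# Arbitrary example strings in scripts a Latin-centric tokenizer
# may never have seen; each is mapped to Latin characters.
for text in ["संस्कृतम्", "Ελληνικά", "ქართული"]:
    print(text, "->", uroman.romanize_string(text))
```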
| Item type: | Conference publication |
|---|---|
| Published: | 2023 |
| Author(s): | Purkayastha, Sukannya; Ruder, Sebastian; Pfeiffer, Jonas; Gurevych, Iryna; Vulić, Ivan |
| Entry type: | Bibliography |
| Title: | Romanization-based Large-scale Adaptation of Multilingual Language Models |
| Language: | English |
| Year of publication: | December 2023 |
| Place: | Singapore |
| Publisher: | Association for Computational Linguistics |
| Book title: | Findings of the Association for Computational Linguistics: EMNLP 2023 |
| Event title: | 2023 Conference on Empirical Methods in Natural Language Processing |
| Event location: | Singapore |
| Event dates: | 06.12.2023-10.12.2023 |
| DOI: | 10.18653/v1/2023.findings-emnlp.538 |
| URL / URN: | https://aclanthology.org/2023.findings-emnlp.538 |
| Uncontrolled keywords: | UKP_p_KRITIS |
| Department(s): | 20 Department of Computer Science; 20 Department of Computer Science > Ubiquitous Knowledge Processing |
| Date deposited: | 18 Jan 2024 14:00 |
| Last modified: | 22 Mar 2024 10:38 |
| PPN: | 516506676 |