Data for paper: "Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval"
Item Type: | Dataset |
---|---|
Title: | Data for paper: "Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval" |
Date: | 20 January 2021 |
Creator: | Litschko, Robert |
DDC Classification: | 004 Computer science, internet |
Abstract: | Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR (a setup with no relevance judgments for IR-specific fine-tuning), pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not reached by using the general-purpose multilingual text encoders "off-the-shelf", but rather by relying on their variants that have been further specialized for sentence understanding tasks. |
URL: | https://madata.bib.uni-mannheim.de/361/ |
DOI: | https://doi.org/10.7801/361 |
Availability (Controlled): | Unknown |
File | Filename / Infos | Link |
---|---|---|
Archive | Filename: mbert_iso_layer_0.tar.gz | Download (5GB) |
Archive | Filename: mbert_aoc_layer_9.tar.gz | Download (5GB) |
Archive | Filename: xlm_iso_layer_1.tar.gz | Download (9GB) |
Archive | Filename: xlm_aoc_layer_12.tar.gz | Download (9GB) |
Archive | Filename: xlm_aoc_layer_15.tar.gz | Download (9GB) |
Archive | Filename: checkpoints.tar.gz | Download (7GB) |
Depositing User: | Robert Litschko |
---|---|
Date Deposited: | 22 Jan 2021 10:26 |
Last Modified: | 29 Feb 2024 20:37 |