Data for paper: "Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval"
| Item Type: | Dataset |
|---|---|
| Title: | Data for paper: "Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval" |
| Date: | 20 January 2021 |
| Creator: | Litschko, Robert |
| DDC Classification: | 004 Computer science, internet |
| Abstract: | Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not met using the general-purpose multilingual text encoders "off-the-shelf", but rather by relying on their variants that have been further specialized for sentence understanding tasks. |
| URL: | https://madata.bib.uni-mannheim.de/361/ |
| DOI: | https://doi.org/10.7801/361 |
| Availability (Controlled): | Unknown |

| File | Filename / Infos | Link |
|---|---|---|
| Archive | Filename: mbert_iso_layer_0.tar.gz | Download (5GB) |
| Archive | Filename: mbert_aoc_layer_9.tar.gz | Download (5GB) |
| Archive | Filename: xlm_iso_layer_1.tar.gz | Download (9GB) |
| Archive | Filename: xlm_aoc_layer_12.tar.gz | Download (9GB) |
| Archive | Filename: xlm_aoc_layer_15.tar.gz | Download (9GB) |
| Archive | Filename: checkpoints.tar.gz | Download (7GB) |

| Depositing User: | Robert Litschko |
|---|---|
| Date Deposited: | 22 Jan 2021 10:26 |
| Last Modified: | 29 Feb 2024 20:37 |

