The Cross-Lingual NLI Corpus (XNLI)
Alexis Conneau
Guillaume Lample
Ruty Rinott
Holger Schwenk
Ves Stoyanov
Facebook AI
Adina Williams
Sam Bowman
NYU
Introduction
The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment labels and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs across 15 languages (English plus the 14 translations). Each premise can be paired with the corresponding hypothesis in any of the 15 languages, for a total of more than 1.5M combinations. The corpus is designed to evaluate how well a system can perform inference in any language (including low-resource ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI serves as an evaluation benchmark.
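The counts quoted above follow from simple arithmetic. The short sketch below reproduces them from the figures given in this section. The variable names are illustrative only and are not part of the corpus.

```python
# Figures taken from the introduction above.
TEST_PAIRS = 5_000
DEV_PAIRS = 2_500
NUM_LANGUAGES = 15  # English plus the 14 translation languages

pairs_per_language = TEST_PAIRS + DEV_PAIRS

# Every pair exists once in each language.
annotated_pairs = pairs_per_language * NUM_LANGUAGES

# A premise in any language can be matched with the corresponding
# hypothesis in any language, giving 15 x 15 language combinations.
cross_lingual_combinations = pairs_per_language * NUM_LANGUAGES ** 2

print(annotated_pairs)             # 112500
print(cross_lingual_combinations)  # 1687500
```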
Examples
Genre | Language | Premise | Label | Hypothesis |
Face-to-face conversation | English | There's so much you could talk about on that I'll just skip that. | contradiction | I want to tell you everything I know about that! |
Letters | French | Cet investissement a permis la rénovation et la vente de 60 maisons à des acheteurs modestes et la réhabilitation de plus de 100 appartements abordables et de grande qualité. | entailment | Les appartements étaient des dépotoirs et personne ne les a réparés. |
Telephone Speech | Greek | Το κορίτσι που μπορεί να με βοηθήσει είναι στον δρόμο προς την πόλη. | neutral | Η κοπέλα που θα με βοηθήσει είναι 5 μίλια μακριά. |
9/11 Report | Bulgarian | При измерване на ефективността, съвършенството е недостижимо. | entailment | Можете да бъдете перфектни, ако се опитате достатъчно. |
Fiction | Urdu | دھکےلو، کپتان، اور انہیں ایک کشتی بھیجنے کا اشارہ کرو اور ان کو یقین دلاو کہ مس یہاں ہے۔ | contradiction | کشتی کو بلانے کی کوئی ضرورت نہ تھی کیوں کہ مس کبھی آئی ہی نہیں |
Download
XNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt). The English training data can be found on the MultiNLI website.
Download: XNLI 1.0 (17MB, ZIP)
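As a minimal sketch of how the JSON lines files might be read, the example below parses one JSON object per line and tallies labels per language. The field names `language`, `gold_label`, `sentence1` and `sentence2` are assumptions modeled on the MultiNLI format and should be checked against the files in the ZIP. The two sample records are made up for illustration and are not taken from the corpus.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def label_counts_by_language(records: Iterable[dict]) -> Dict[str, Counter]:
    """Count labels per language (field names are assumptions)."""
    counts: Dict[str, Counter] = {}
    for record in records:
        language = record["language"]
        counts.setdefault(language, Counter())[record["gold_label"]] += 1
    return counts


# Two made-up records standing in for lines of a real file.
sample_lines = [
    '{"language": "en", "gold_label": "neutral", '
    '"sentence1": "A premise.", "sentence2": "A hypothesis."}',
    '{"language": "fr", "gold_label": "entailment", '
    '"sentence1": "Une phrase.", "sentence2": "Une autre phrase."}',
]

counts = label_counts_by_language(read_jsonl(sample_lines))
print(counts["en"]["neutral"])
```

To run this on the actual data, pass an open file handle instead of `sample_lines`, for example `open(path, encoding="utf-8")`, where `path` points to one of the jsonl files extracted from the ZIP.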
Data description paper and citation
A description of the data can be found here (PDF) or in the corpus package zip. If you use the corpus in an academic paper, please cite us:
@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}
Baselines and Code
The XNLI paper presents several baselines for language adaptation.
Code will soon be made available. We also release the machine-translated data for reproducing the TRANSLATE-TRAIN and TRANSLATE-TEST baselines:
Download: XNLI-MT 1.0 (445MB, ZIP)
License
See details in the XNLI paper.
Acknowledgments
This project has benefited from financial support to Samuel R. Bowman by Google, Tencent Holdings, and Samsung Research.