PanLex-based bilingual lexicons for 210 language pairs (15 languages)

Contact: Ivan Vulić (iv250 AT cam DOT ac DOT uk)

This repo contains bilingual lexicons for 210 language pairs (15 languages in total) used in the empirical comparison paper "Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?" (Vulić et al., EMNLP 2019).

The bilingual lexicons were extracted from the PanLex database of translations; more details on the extraction procedure are provided in the paper.

References

If you use these lexicons in your own work, please cite the following paper:

@inproceedings{Vulic:2019clwe,
  author    = {Vuli\'{c}, Ivan and Glava\v{s}, Goran and Reichart, Roi and Korhonen, Anna},
  title     = {Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?},
  booktitle = {Proceedings of the 2019 Conference 
              on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2019},
}

Please also acknowledge the use of PanLex by citing the following paper:

@inproceedings{Kamholz:2014panlex,
  author    = {David Kamholz and Jonathan Pool and Susan M. Colowick},
  title     = {{PanLex: B}uilding a Resource for Panlingual Lexical Translation},
  booktitle = {Proceedings of the 9th International Conference 
              on Language Resources and Evaluation (LREC)},
  pages     = {3145--3150},
  year      = {2014},
}

Languages and Lexicons

The following 15 languages are currently covered in the repo. For each language X, we provide a separate zip archive where X is the source language (L₁) paired with the remaining 14 target (L₂) languages.

Bulgarian (bg): bg-L₂
Catalan (ca): ca-L₂
Esperanto (eo): eo-L₂
Estonian (et): et-L₂
Basque (eu): eu-L₂
Finnish (fi): fi-L₂
Hebrew (he): he-L₂
Hungarian (hu): hu-L₂
Indonesian (id): id-L₂
Georgian (ka): ka-L₂
Korean (ko): ko-L₂
Lithuanian (lt): lt-L₂
Norwegian Bokmål (no): no-L₂
Thai (th): th-L₂
Turkish (tr): tr-L₂

A single bundle containing the lexicons for all 210 language pairs is also available for download from the repository.
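
The count of 210 follows from treating each of the 15 languages as a source (L₁) paired with each of the other 14 as a target (L₂), so the pairs are ordered. A minimal sketch of that enumeration is given below. The `src-trg` identifiers mirror the `bg-L₂` pattern in the list above; they are an assumption for illustration and not necessarily the file names used inside the archives, and `language_pairs` is a hypothetical helper.

```python
from itertools import permutations

# Language codes as they appear in the list above.
LANGS = ["bg", "ca", "eo", "et", "eu", "fi", "he", "hu",
         "id", "ka", "ko", "lt", "no", "th", "tr"]


def language_pairs(langs=LANGS):
    """Return all ordered source-target identifiers with source != target."""
    return [f"{src}-{trg}" for src, trg in permutations(langs, 2)]


pairs = language_pairs()
print(len(pairs))  # 15 sources * 14 targets = 210
```

Because the pairs are ordered, both `bg-ca` and `ca-bg` appear in the result, which matches the one-archive-per-source-language layout.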

Some (Important) Remarks:

The lexicon format should be self-explanatory: the files are tab-delimited. We provide training lexicons of different sizes N (N = 5000, 2000, 1000, 500).
The provided lexicons are not perfect: they have not been manually cleaned or verified, although the extraction process from PanLex was quite strict in order to favor high precision. Some noise may remain, so the lexicons should be treated as a silver standard.
Because of the strict extraction process, some language pairs have fewer than the desired 5K training pairs or 2K test pairs. Please double-check the lexicon sizes before running any size-related analyses of the lexicons and projection-based methods.
For any further questions, please contact Ivan Vulić (iv250 AT cam DOT ac DOT uk).
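
As a rough illustration of the remarks on format and size, the sketch below reads a tab-delimited lexicon and reports how far it falls short of a desired number of pairs. It assumes each line holds a source word and a target word in the first two tab-separated columns, which the remarks above do not state explicitly. `load_lexicon` and `shortfall` are hypothetical helper names, and the two sample entries are toy data, not taken from the lexicons.

```python
import io


def load_lexicon(lines):
    """Parse tab-delimited lines into (source_word, target_word) tuples.

    Assumes the first two columns are the source and target words.
    Blank lines are skipped; lines without a tab raise ValueError.
    """
    pairs = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {lineno} is not tab-delimited: {line!r}")
        pairs.append((fields[0], fields[1]))
    return pairs


def shortfall(pairs, desired):
    """Return how many pairs are missing relative to the desired size."""
    return max(0, desired - len(pairs))


# Toy in-memory sample standing in for a real lexicon file.
sample = io.StringIO("talo\tmaja\nvesi\tvesi\n")
lexicon = load_lexicon(sample)
missing = shortfall(lexicon, desired=5000)
if missing:
    print(f"lexicon has {len(lexicon)} pairs, {missing} fewer than desired")
```

For a file on disk, the same function can be given an open file handle, for example `load_lexicon(open(path, encoding="utf-8"))`, where `path` points to one of the extracted lexicon files.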
