This data is built from Wiktionary and Tatoeba datasets using my Wiktionary Parser and Spanish Tools.
This data is used to build the free, open-source Spanish to English dictionary available in StarDict and Aard2/slob formats in the Release section. It's also used to build my 6001 Spanish Vocab Anki deck, and is provided here in the hope that others may find additional uses for it.
- es-en.data - Spanish to English Wiktionary data formatted for use with enwiktionary_wordlist
- frequency.csv - a list of the most frequently used Spanish lemmas, each with its part of speech, with word forms combined under their lemma
- sentences.tsv - English/Spanish sentence pairs from tatoeba.org with users' self-reported proficiency, part of speech tags, and lemmas
Licenses:
- es-en.data (CC-BY-SA Attribution: wiktionary.org)
- frequency.csv (CC-BY-SA 3.0 github.com/hermitdave/FrequencyWords)
- sentences.tsv (CC-BY 2.0 FR Attribution: tatoeba.org)
Credits:
- Tatoeba user CK for the list of reviewed English sentences
- Tatoeba user arh for the list of reviewed Spanish sentences
- FreeLing for the part-of-speech tagging
Install the build dependencies (the apt command assumes a Debian-based system):
sudo apt install curl bzip2 gawk pv unzip zip pkg-config dictzip make
pip3 install ijson pywikibot mwparserfromhell pyglossary PyICU Levenshtein
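The command-line tools installed above can be checked with a small shell sketch like the following before starting the build. The function name `missing_tools` is illustrative and not part of this repository. It only tests whether each command is on the PATH, and it does not verify package versions or the Python modules installed with pip3.

```shell
#!/bin/sh
# Hypothetical helper (not part of spanish_data): print every command
# from the argument list that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names assumed to match the apt packages listed above.
missing=$(missing_tools curl bzip2 gawk pv unzip zip pkg-config dictzip make)

if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All build tools found"
fi
```

If any names are reported, rerunning the apt command for those packages should resolve them.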
Install FreeLing on Debian (for other distros, check the FreeLing instructions):
wget https://github.com/TALP-UPC/FreeLing/releases/download/4.2/freeling-4.2-buster-amd64.deb
sudo apt install ./freeling-4.2-buster-amd64.deb libboost-chrono1.67.0 libboost-date-time1.67.0
Download the Makefile and build the data:
curl -L https://github.com/doozan/spanish_data/raw/master/Makefile -o Makefile
make