We present a list of languages with their codes, families, regions, etc. We also present a list of multi-lingual corpora (with URLs).

helldog-star/LanguageCodes

 
 

Language Codes

It is hard to state the exact number of human languages on this planet, because the count depends on where one draws the line between a language and a dialect. For example, closely related varieties within one language family may differ only slightly from each other. Under a general definition of language, there are over 7,000 living languages[1], but most of them have not been digitized. Here we list 353 languages with their codes, families, writing systems and regions. The list covers most of the world's major languages and a large number of minority languages. We also collect links to multi-lingual corpora, which may help those who study these languages or develop multi-lingual natural language processing (NLP) systems.

The Language List

Language ›
Chinese Name
ISO 639-1 ISO 639-2 ISO 639-3 Language Family ›
Branch
Writing System
Macro-area
Albanian ›
阿尔巴尼亚语
sq alb(B)
sqi(T)
sqi Indo-European ›
Albanian
Latin
Albanian Braille
Asia
Europe
Arabic ›
阿拉伯语
ar ara ara Afro-Asiatic ›
Semitic
Arabic
Arabic Braille
Arabizi
Africa
Asia
Amharic ›
阿姆哈拉语
am amh amh Afro-Asiatic ›
Semitic
Geʽez
Ge'ez Braille
Africa
Azerbaijani ›
阿塞拜疆语
az aze aze Turkic ›
Common
Turkic
Latin
Perso-Arabic
Cyrillic
Georgian
Asia
Ewe ›
埃维语
ee ewe ewe Niger–Congo ›
Atlantic–Congo
Latin
Ewe Braille
Africa
Irish ›
爱尔兰语
ga gle gle Indo-European ›
Celtic
Latin
Irish Braille
Europe
Estonian ›
爱沙尼亚语
et est est Uralic ›
Finnic
Latin
Estonian Braille
Europe
Oromoo ›
奥罗莫语
om orm orm Afro-Asiatic ›
Cushitic
Latin Africa
Ossetic ›
奥赛梯语
os oss oss Indo-European ›
Indo-Iranian
Cyrillic
Georgian
Latin
Europe
Asia
Tok Pisin ›
巴布亚皮钦语
N/A tpi tpi English
Creole ›
Pacific
Latin
Pidgin Braille
Oceania
Bashkir ›
巴什基尔语
ba bak bak Turkic ›
Common
Turkic
Cyrillic Europe
Basque ›
巴斯克语
eu baq eus Language
isolate
Basque
Basque Braille
Europe
Belarusian ›
白俄罗斯语
be bel bel Indo-European ›
Balto-Slavic
Cyrillic
Belarusian Braille
Belarusian Latin
Europe
Hmong ›
白苗文
N/A hmn mww Hmong–Mien ›
Hmongic
Latin
Pahawh Hmong
Pollard
Asia
Bulgarian ›
保加利亚语
bg bul bul Indo-European ›
Balto-Slavic
Cyrillic
Bulgarian Braille
Latin
Europe
Bislama ›
比斯拉马语
bi bis bis English
Creole ›
Pacific
Latin
Avoiuli
Oceania
Bemba ›
别姆巴语
N/A bem bem Niger–Congo ›
Atlantic–Congo
Latin
Bemba Braille
Africa
Asia
Icelandic ›
冰岛语
is ice(B)
isl(T)
isl Indo-European ›
Germanic
Latin 
Icelandic Braille
Europe
Polish ›
波兰语
pl pol pol Indo-European ›
Balto-Slavic
Latin
Polish Braille
Africa
Europe
Bosnian ›
波斯尼亚语
bs bos bos Indo-European ›
Balto-Slavic
Latin
Cyrillic
Yugoslav Braille
Arabic
Bosnian Cyrillic
Europe
Asia
Persian ›
波斯语
fa per(B)
fas(T)
fas Indo-European ›
Indo-Iranian
Persian
Tajik
Hebrew
Persian Braille
Asia
Tibetan ›
藏语
bo tib(B)
bod(T)
bod Sino-Tibetan ›
Tibeto-Burman
Tibetan Asia
Tswana ›
茨瓦纳语
tn tsn tsn Niger–Congo ›
Atlantic–Congo
Latin
Tswana Braille
Africa
Xitsonga ›
聪加语
ts tso tso Niger–Congo ›
Atlantic–Congo
Latin
Tsonga Braille
Africa
Tatar ›
鞑靼语
tt tat tat Turkic ›
Common
Turkic
Tatar Europe
Danish ›
丹麦语
da dan dan Indo-European ›
Germanic
Latin
Dano-Norwegian
Danish orthography
Danish Braille
Europe
German ›
德语
de ger(B)
deu(T)
deu Indo-European ›
Germanic
Latin
German Braille
North America
Africa
Europe
Asia
Russian ›
俄语
ru rus rus Indo-European ›
Balto-Slavic
Cyrillic
Russian Braille
Europe
Asia
French ›
法语
fr fre(B)
fra(T)
fra Indo-European ›
Italic
Signed French North America
Oceania
Africa
Europe
Asia
Filipino ›
菲律宾语
N/A fil fil Austronesian ›
Malayo-Polynesian
Latin
Philippine Braille
Asia
Fijian ›
斐济语
fj fij fij Austronesian ›
Malayo-Polynesian
Latin-based Oceania
Finnish ›
芬兰语
fi fin fin Uralic ›
Finnic
Latin
Finnish Braille
Europe
Frisian ›
弗里西语
fy fry fry Indo-European ›
Germanic
Latin Europe
Kikongo ›
刚果语
kg kon kon Niger–Congo ›
Atlantic–Congo
Latin
Mandombe
Africa
Khmer ›
高棉语
km khm khm Austroasiatic ›
Proto-Mon-Khmer
Khmer
Khmer Braille
Asia
Georgian ›
格鲁吉亚语
ka geo(B)
kat(T)
kat Kartvelian ›
Karto-Zan
Georgian
Georgian Braille
Europe
Asia
Gujarati ›
古吉拉特语
gu guj guj Indo-European ›
Indo-Iranian
Gujarati
Gujarati Braille
Devanagari
Africa
Asia
Kazakh ›
哈萨克语
kk kaz kaz Turkic ›
Common
Turkic
Arabic Asia
Kazakh(Cyrillic) ›
哈萨克语(西里尔)
kk kaz kaz Turkic ›
Common
Turkic
Cyrillic
Kazakh Braille
Asia
Haitian Creole ›
海地克里奥尔语
ht hat hat French
Creole
Latin North America
Korean ›
韩语
ko kor kor Koreanic Hangul
Hanja
Korean Braille
Asia
Hausa ›
豪萨语
ha hau hau Afro-Asiatic ›
Chadic
Latin
Arabic
Hausa Braille
Africa
Dutch ›
荷兰语
nl dut(B)
nld(T)
nld Indo-European ›
Germanic
Latin
Dutch Braille
Africa
South America
Kyrgyz ›
吉尔吉斯语
ky kir kir Turkic ›
Common
Turkic
Cyrillic
Perso-Arabic
formerly Latin
Kyrgyz Braille
Asia
Galician ›
加利西亚语
gl glg glg Indo-European ›
Italic
Latin
Galician Braille
Europe
Catalan ›
加泰罗尼亚语
ca cat cat Indo-European ›
Italic
Latin
Catalan Braille
Europe
Czech ›
捷克语
cs cze(B)
ces(T)
ces Indo-European ›
Balto-Slavic
Latin
Czech Braille
Europe
Kannada ›
卡纳达语
kn kan kan Dravidian Kannada
Kannada Braille
Tigalari
Asia
Qeqchi ›
凯克其语
N/A N/A kek Mayan ›
Quichean–Mamean
Latin North America
Europe
Corsican ›
科西嘉语
co cos cos Indo-European ›
Italic
Latin Europe
Queretaro Otomi ›
克雷塔罗奥托米语
N/A N/A otq Oto-Manguean ›
Oto-Pamean
Latin North America
Croatian ›
克罗地亚语
hr hrv hrv Indo-European ›
Balto-Slavic
Latin
Yugoslav Braille
Europe
Kurdish ›
库尔德语
ku kur kur Indo-European ›
Indo-Iranian
Hawar
Sorani
Cyrillic
Armenian
Asia
Latin ›
拉丁语
la lat lat Indo-European ›
Italic
Latin Europe
Latvian ›
拉脱维亚语
lv lav lav Indo-European ›
Balto-Slavic
Latin
Latvian Braille
Europe
Lao ›
老挝语
lo lao lao Kra–Dai ›
Tai
Lao
Thai
Thai and Lao Braille
Asia
Lithuanian ›
立陶宛语
lt lit lit Indo-European ›
Balto-Slavic
Latin
Lithuanian Braille
Europe
Lingala ›
林加拉语
ln lin lin Niger–Congo ›
Atlantic–Congo
Latin
Mandombe
Africa
Kirundi ›
隆迪语
rn run run Niger–Congo ›
Atlantic–Congo
Latin Africa
Luganda ›
卢干达语
lg lug lug Niger–Congo ›
Atlantic–Congo
Latin
Ganda Braille
Africa
Luxembourgish ›
卢森堡语
lb ltz ltz Indo-European ›
Germanic
Latin
Luxembourgish Braille
Europe
Kinyarwanda ›
卢旺达语
rw kin kin Niger–Congo ›
Atlantic–Congo
Latin
Arabic
Africa
Romanian ›
罗马尼亚语
ro rum(B)
ron(T)
ron Indo-European ›
Italic
Latin
Cyrillic
Romanian Braille
Europe
Malagasy ›
马尔加什语
mg mlg mlg Austronesian ›
Malayo-Polynesian
Latin
Malagasy Braille
Africa
Maltese ›
马耳他语
mt mlt mlt Afro-Asiatic ›
Semitic
Latin
Maltese Braille
Europe
Marathi ›
马拉地语
mr mar mar Indo-European ›
Indo-Iranian
Devanagari
Devanagari Braille
Modi
Asia
Malayalam ›
马拉雅拉姆语
ml mal mal Dravidian Malayalam
Malayalam Braille
Asia
Malay ›
马来语
ms may(B)
msa(T)
msa Austronesian ›
Malayo-Polynesian
Latin
Arabic
Thai
Malay Braille
Asia
Mari ›
马里语
N/A chm mhr Uralic ›
Finno-Permic
Mari
Cyrillic
Europe
Macedonian ›
马其顿语
mk mac(B)
mkd(T)
mkd Indo-European ›
Balto-Slavic
Cyrillic
Macedonian Braille
Europe
Maori ›
毛利语
mi mao(B)
mri(T)
mri Austronesian ›
Malayo-Polynesian
Latin
Māori Braille
Oceania
Mongolian(Cyrillic) ›
蒙古语(西里尔)
mn mon mon Mongolic Cyrillic
Mongolian Braille
Asia
Bengali ›
孟加拉语
bn ben ben Indo-European ›
Indo-Iranian
Bengali-Assamese
Bengali Braille
Asia
Burmese ›
缅甸语
my bur(B)
mya(T)
mya Sino-Tibetan ›
Lolo-Burmese
Burmese
Burmese Braille
Asia
Afrikaans ›
南非荷兰语
af afr afr Indo-European ›
Germanic
Latin using Afrikaans
Arabic
Afrikaans Braille
Africa
Xhosa ›
南非科萨语
xh xho xho Niger–Congo ›
Atlantic–Congo
Latin
Xhosa Braille
Africa
Zulu ›
南非祖鲁语
zu zul zul Niger–Congo ›
Atlantic–Congo
Latin
Zulu Braille
Africa
Nepali ›
尼泊尔语
ne nep nep Indo-European ›
Indo-Iranian
Devanagari
Devanagari Braille
Asia
Norwegian ›
挪威语
no nor nor Indo-European ›
Germanic
Latin
Norwegian Braille
Europe
Papiamento ›
帕皮阿门托语
N/A pap pap Portuguese
Creole
Latin Europe
Punjabi ›
旁遮普语
pa pan pan Indo-European ›
Indo-Iranian
Gurmukhī
Perso-Arabic
Punjabi Braille
Laṇḍā
Mahajani
Asia
Portuguese ›
葡萄牙语
pt por por Indo-European ›
Italic
Latin
Portuguese Braille
Africa
South America
Europe
Asia
Pashto ›
普什图语
ps pus pus Indo-European ›
Indo-Iranian
Perso-Arabic Asia
Chewa ›
齐切瓦语
ny nya nya Niger–Congo ›
Atlantic–Congo
Latin
Mwangwego
Chewa Braille
Africa
Twi ›
契维语
tw twi twi Niger–Congo ›
Atlantic–Congo
Latin Africa
Japanese ›
日语
ja jpn jpn Japonic Mixed scripts of Kanji and Kana
Japanese Braille
Oceania
Asia
Swedish ›
瑞典语
sv swe swe Indo-European ›
Germanic
Latin
Swedish Braille
Europe
Samoan ›
萨摩亚语
sm smo smo Austronesian ›
Malayo-Polynesian
Latin
Samoan Braille
Oceania
Serbian ›
塞尔维亚语
sr srp srp Indo-European ›
Balto-Slavic
Serbian Cyrillic
Serbian Latin
Yugoslav Braille
Europe
Seychelles Creole ›
塞舌尔克里奥尔语
N/A N/A crs French
Creole ›
Bourbonnais
Creoles
Latin Africa
Sesotho ›
塞索托语
st sot sot Niger–Congo ›
Atlantic–Congo
Latin
Sotho Braille
Africa
Sango ›
桑戈语
sg sag sag Creole Latin Africa
Sinhalese ›
僧伽罗语
si sin sin Indo-European ›
Indo-Iranian
Sinhala
Sinhala Braille
Asia
Hill Mari ›
山地马里语
N/A N/A mrj Uralic ›
Finno-Ugric
Cyrillic Europe
Slovak ›
斯洛伐克语
sk slo(B)
slk(T)
slk Indo-European ›
Balto-Slavic
Latin
Slovak Braille
Europe
Slovenian ›
斯洛文尼亚语
sl slv slv Indo-European ›
Balto-Slavic
Latin
Slovene Braille
Europe
Swahili ›
斯瓦希里语
sw swa swa Niger–Congo ›
Atlantic–Congo
Latin
Arabic
Swahili Braille
Africa
Scottish Gaelic ›
苏格兰盖尔语
gd gla gla Indo-European ›
Celtic
Scottish Gaelic Europe
Somali ›
索马里语
so som som Afro-Asiatic ›
Cushitic
Somali Latin
Wadaad writing
Osmanya
Borama
Kaddare
Africa
Tajik ›
塔吉克语
tg tgk tgk Indo-European ›
Indo-Iranian
Cyrillic
Latin
Persian
Tajik Braille
Asia
Tahitian ›
塔希提语
ty tah tah Austronesian ›
Malayo-Polynesian
Latin Europe
Telugu ›
泰卢固语
te tel tel Dravidian ›
South-Central
Telugu
Telugu Braille
Africa
Asia
Tamil ›
泰米尔语
ta tam tam Dravidian ›
Southern
Tamil
Tamil-Brahmi
Grantha
Vatteluttu
Pallava
Kolezhuthu
Arwi
Tamil Braille
Latin
North America
Africa
Asia
Thai ›
泰语
th tha tha Kra–Dai Thai
Thai Braille
Asia
Tongan ›
汤加语
to ton ton Austronesian ›
Malayo-Polynesian
Latin Oceania
Africa
Tigre ›
提格雷语
N/A tig tig Afro-Asiatic ›
Semitic
Tigre
Arabic
Africa
Turkish ›
土耳其语
tr tur tur Turkic ›
Common
Turkic
Latin
Turkish Braille
Europe
Asia
Turkmen ›
土库曼语
tk tuk tuk Turkic ›
Common
Turkic
Latin
Cyrillic
Arabic
Turkmen Braille
Europe
Asia
Waray ›
瓦瑞语
N/A war war Austronesian ›
Malayo-Polynesian
Latin Asia
Welsh ›
威尔士语
cy wel(B)
cym(T)
cym Indo-European ›
Celtic
Latin
Welsh Braille
Europe
Uyghur ›
维吾尔语
ug uig uig Turkic ›
Common
Turkic
Uyghur
Uyghur Perso-Arabic
Uyghur Cyrillic
Uyghur Latin
Uyghur New
Asia
Udmurt ›
乌德穆尔特语
N/A udm udm Uralic ›
Finno-Ugric
Latin
Cyrillic
Europe
Urdu ›
乌尔都语
ur urd urd Indo-European ›
Indo-Iranian
Perso-Arabic
Roman Urdu
Urdu Braille
Africa
Asia
Ukrainian ›
乌克兰语
uk ukr ukr Indo-European ›
Balto-Slavic
Cyrillic
Ukrainian Braille
Ukrainian Latin
Europe
Uzbek ›
乌兹别克语
uz uzb uzb Turkic ›
Common
Turkic
Latin
Cyrillic
Perso-Arabic
Uzbek Braille
Asia
Spanish ›
西班牙语
es spa spa Indo-European ›
Italic
Latin
Spanish Braille
North America
Africa
South America
Europe
Hebrew ›
希伯来语
he heb heb Afro-Asiatic ›
Semitic
Hebrew
Hebrew Braille
Paleo-Hebrew
Imperial Aramaic
Asia
Greek ›
希腊语
el gre(B)
ell(T)
ell Indo-European ›
Hellenic
Greek Africa
Europe
Hawaiian ›
夏威夷语
N/A haw haw Austronesian ›
Malayo-Polynesian
Latin
Hawaiian Braille
North America
Sindhi ›
信德语
sd snd snd Indo-European ›
Indo-Iranian
Arabic
Devanagari
Roman Sindhi
Asia
Hungarian ›
匈牙利语
hu hun hun Uralic ›
Finno-Ugric
Latin
Hungarian Braille
Old Hungarian
Europe
Shona ›
修纳语
sn sna sna Niger–Congo ›
Atlantic–Congo
Latin
Arabic
Shona Braille
Africa
Cebuano ›
宿务语
N/A ceb ceb Austronesian ›
Malayo-Polynesian
Latin
Philippine Braille
Baybayin
Asia
Armenian ›
亚美尼亚语
hy arm(B)
hye(T)
hye
hyw
Indo-European Armenian
Armenian Braille
Europe
Asia
Igbo ›
伊博语
ig ibo ibo Niger–Congo ›
Atlantic–Congo
Latin
Nwagu Aneke
Igbo Braille
Africa
Italian ›
意大利语
it ita ita Indo-European ›
Italic
Latin
Italian Braille
Africa
Europe
Yiddish ›
意第绪语
yi yid yid Indo-European ›
Germanic
Hebrew
Latin
Europe
Hindi ›
印地语
hi hin hin Indo-European ›
Indo-Iranian
Devanagari
Kaithi
Roman
Devanagari Braille
Africa
Asia
Sundanese ›
印尼巽他语
su sun sun Austronesian ›
Malayo-Polynesian
Latin
Sundanese
Old Sundanese
Sundanese Cacarakan
Sundanese Pégon
Buda
Kawi
Pallava
Pranagari
Vatteluttu
Asia
Indonesian ›
印尼语
id ind ind Austronesian ›
Malayo-Polynesian
Latin
Indonesian Braille
Asia
Javanese ›
印尼爪哇语
jv jav jav Austronesian ›
Malayo-Polynesian
Latin
Javanese
Pegon
Asia
English ›
英语
en eng eng Indo-European ›
Germanic
Latin
Anglo Saxon runes
English Braille
Unified English Braille
North America
Oceania
Africa
South America
Europe
Asia
Yucatec Maya ›
尤卡坦玛雅语
N/A N/A yua Mayan Latin North America
Yoruba ›
约鲁巴语
yo yor yor Niger–Congo ›
Atlantic–Congo
Latin
Yoruba Braille
Arabic
Africa
Vietnamese ›
越南语
vi vie vie Austroasiatic Latin
Vietnamese Braille
Chữ Hán and Chữ Nôm
Europe
Asia
Cantonese ›
粤语
N/A N/A yue Sino-Tibetan ›
Sinitic
Written Cantonese
Cantonese Braille
Written Chinese
Asia
Chinese (Traditional) ›
中文(繁体)
zh zho(T)
chi(B)
zho Sino-Tibetan ›
Sinitic
Traditional Chinese Asia
Chinese (Simplified) ›
中文(简体)
zh zho(T)
chi(B)
zho Sino-Tibetan ›
Sinitic
Simplified Chinese North America
Africa
Asia
Venda ›
文达语
ve ven ven Niger–Congo ›
Atlantic–Congo
Latin
Venda Braille
Africa
Achuar ›
阿丘雅语
N/A N/A acu Jivaroan Latin South America
Aguaruna ›
阿瓜鲁纳语
N/A N/A agr Chicham Latin South America
Akawaio ›
阿卡瓦伊语
N/A N/A ake Cariban ›
Venezuelan
Carib
Latin South America
Amuzgo ›
阿穆斯戈语
N/A N/A amu Oto-Manguean ›
Eastern
Otomanguean
Latin North America
Ndyuka ›
恩都卡语
N/A N/A djk English
Creole
Afaka
Latin
South America
Barasana ›
巴拉萨纳语
N/A N/A bsn Tucanoan ›
Eastern
Tucanoan
Latin South America
Cabecar ›
卡韦卡尔语
N/A N/A cjp Chibchan ›
Core-Chibchan
Latin North America
Cakchiquel ›
卡克奇克尔语
N/A N/A cak Mayan ›
Quichean–Mamean
Latin North America
Campa ›
坎帕语
N/A N/A cni Maipurean ›
Southern
Maipurean
Latin South America
Camsa ›
科奇语
N/A N/A kbh Language
isolate
Latin South America
Chamorro ›
查莫罗语
ch cha cha Austronesian ›
Malayo-Polynesian
Latin North America
Cherokee ›
切诺基语
N/A chr chr Iroquoian ›
Southern
Iroquoian
Cherokee
Latin
North America
Chinantec ›
奇南特克语
N/A N/A chq Oto-Manguean ›
Western
Oto-Mangue
Latin North America
Coptic ›
科普特语
N/A cop cop Afro-Asiatic ›
Egyptian
Coptic Africa
Dinka ›
丁卡语
N/A din dik Nilo-Saharan ›
Eastern
Sudanic
Latin Africa
Galela ›
加莱拉语
N/A N/A gbi West
Papuan ›
North
Halmahera
Latin Asia
Jakalteko ›
雅加达语
N/A N/A jac Mayan ›
Qʼanjobalan–Chujean
Latin North America
Kiche ›
基切语
N/A N/A quc Mayan ›
Eastern
Qʼanjobalan–Chujean
Latin North America
Kabyle ›
卡拜尔语
N/A kab kab Afro-Asiatic ›
Berber
Latin
Tifinagh
Africa
Lukpa ›
卢克帕语
N/A N/A dop Niger–Congo ›
Atlantic–Congo
Latin Africa
Mam ›
马姆语
N/A N/A mam Mayan ›
Eastern
Mayan
Latin North America
Manx ›
马恩岛语
gv glv glv Indo-European ›
Celtic
Latin Europe
Nahuatl ›
纳瓦特尔语
N/A nah nhg Uto-Aztecan ›
Southern
Uto-Aztecan
Latin North America
Ojibwa ›
奥吉布瓦语
oj oji ojb Algic ›
Algonquian
Latin
Ojibwe
Great Lakes Algonquian
North America
Paite ›
派特语
N/A N/A pck Sino-Tibetan ›
Kuki-Chin-Naga
Latin Asia
Potawatomi ›
波塔瓦托米语
N/A N/A pot Algic ›
Algonquian
Latin
Great Lakes Algonquian
North America
Quichua ›
盖丘亚语
qu N/A quw Quechuan Latin South America
Romani ›
罗姆语
N/A rom rmn Indo-European ›
Indo-Iranian
Latin Europe
Shuar ›
舒阿尔语
N/A N/A jiv Chicham Latin South America
Syriac ›
叙利亚语
N/A N/A syc Afro-Asiatic ›
Semitic
Syriac Asia
Berber ›
柏柏尔语
N/A ber ber Afro-Asiatic Latin Africa
Tachelhit ›
希尔哈语
N/A N/A shi Afro-Asiatic ›
North
Afroasiatic
Arabic
Latin
Tifinagh
Africa
Tamajaq ›
图阿雷格语
N/A N/A tmh Afro-Asiatic ›
Berber
Latin Africa
Uma ›
乌玛语
N/A N/A ppk Austronesian ›
Malayo-Polynesian
Latin Asia
Uspanteco ›
乌斯潘坦语
N/A N/A usp Mayan ›
Quichean–Mamean
Latin North America
Wolaytta ›
瓦拉莫语
N/A wal wal Afro-Asiatic ›
Omotic
Latin Africa
Wolof ›
沃洛夫语
wo wol wol Niger–Congo ›
Atlantic–Congo
Latin
Arabic
Garay
Africa
Zarma ›
哲尔马语
N/A N/A dje Nilo-Saharan ›
Songhay
Latin Africa
Oriya ›
奥利亚语
or ori ori Indo-European ›
Indo-Iranian
Odia
Odia Braille
Asia
Aceh ›
亚齐语
N/A ace ace Austronesian ›
Malayo-Polynesian
Latin
Jawi
Asia
Faroese ›
法罗语
fo fao fao Indo-European ›
Germanic
Latin
Faroese Braille
Europe
Tetun ›
德顿语
N/A N/A tet Austronesian ›
Malayo-Polynesian
Latin Asia
Brezhoneg ›
布列塔尼语
br bre bre Indo-European ›
Celtic
Latin Europe
Chuvash ›
楚瓦什语
cv chv chv Turkic ›
Oghur
Cyrillic Europe
Divehi ›
迪维希语
dv div div Indo-European ›
Indo-Iranian
Thaana Asia
Montenegrin ›
黑山语
N/A cnr cnr Indo-European ›
Balto-Slavic
Cyrillic
Latin
Yugoslav Braille
Europe
Dzongkha ›
宗喀语
dz dzo dzo Sino-Tibetan ›
Tibeto-Kanauri
Tibetan
Dzongkha Braille
Asia
Dyula ›
迪尤拉语
N/A dyu dyu Mande ›
Western
Mande
N'Ko
Latin
Arabic
Africa
Northern Kurdish ›
北库尔德语
N/A N/A kmr Indo-European ›
Indo-Iranian
Hawar
Sorani
Arabic
Cyrillic
Asia
Manipuri ›
曼尼普尔语
N/A mni mni Sino-Tibetan ›
Tibeto-Burman
Ancient Meitei
Meetei Mayek
Bengali
Latin
Asia
Wali ›
瓦利语
N/A N/A wlx Niger–Congo ›
Atlantic–Congo
Latin Africa
South Azerbaijani ›
南阿塞拜疆语
N/A N/A azb Turkic ›
Common
Turkic
Latin
Perso-Arabic
Cyrillic
Georgian
Asia
Ika ›
伊卡语
N/A N/A ikk Niger–Congo ›
Atlantic–Congo
Latin Africa
Cañar Highland Quichua ›
卡纳尔高地-基丘亚语
N/A N/A qxr Quechuan Latin South America
Poqomchi’ ›
波孔奇语
N/A N/A poh Mayan ›
Quichean–Mamean
Latin North America
Kuanua ›
库阿努阿语
N/A N/A ksd Austronesian ›
Malayo-Polynesian
Latin
Tolai Braille
Oceania
Central Ifugao ›
中部伊富高语
N/A N/A ifa Austronesian ›
Malayo-Polynesian
Latin Asia
Motu ›
摩图语
N/A N/A meu Austronesian ›
Malayo-Polynesian
Latin
Motu Braille
Oceania
Cusco Quechua ›
库斯科克丘亚语
N/A N/A quz Quechuan Latin South America
Marshallese ›
马绍尔语
mh mah mah Austronesian ›
Malayo-Polynesian
Latin Oceania
Zotung Chin ›
佐通钦语
N/A N/A czt Sino-Tibetan ›
Tibeto-Burman
Latin Asia
Wa ›
佤语
N/A N/A prk Austroasiatic ›
Khasi–Palaungic
Latin Asia
Ayangan Ifugao ›
阿雅安伊富高语
N/A N/A ifb Austronesian ›
Malayo-Polynesian
Latin Asia
Bambara ›
班巴拉语
bm bam bam Niger–Congo ›
Mande
Latin
N'Ko
Africa
Northern Mam ›
北部马姆语
N/A N/A mam Mayan ›
Eastern
Mayan
Latin North America
South Bolivian Quechua ›
南玻利维亚克丘亚语
N/A N/A quh Quechuan Latin South America
Hawaiian Creole English ›
夏威夷克里奥尔英语
N/A N/A hwc English
Creole
Latin North America
Hakha Chin ›
哈卡钦语
N/A N/A cnh Sino-Tibetan ›
Tibeto-Burman
Latin
Burmese
Asia
Lomwe ›
隆韦语
N/A N/A ngl Niger–Congo ›
Atlantic–Congo
Latin Africa
Kiribati ›
基里巴斯语
N/A gil gil Austronesian ›
Malayo-Polynesian
Latin Oceania
Hiri Motu ›
希里莫图语
ho hmo hmo Austronesian ›
Malayo-Polynesian
Latin Oceania
Tampulma ›
坦普尔马语
N/A N/A tpm Niger–Congo ›
Atlantic–Congo
Latin Africa
Enxet ›
恩舍特语
N/A N/A enx Mascoian Latin South America
Maranao ›
马拉瑙语
N/A N/A mrw Austronesian ›
Malayo-Polynesian
Latin
Arabic
Asia
Tedim Chin ›
特丁钦语
N/A N/A ctd Sino-Tibetan ›
Tibeto-Burman
Latin
Pau Cin Hau
Asia
Aymara ›
艾马拉语
ay aym aym Aymaran Latin South America
Acateco ›
阿卡特克语
N/A N/A knj Mayan ›
Qʼanjobalan–Chujean
Latin North America
Ditammari ›
迪塔马利语
N/A N/A tbz Niger–Congo ›
Atlantic–Congo
Latin Africa
Jingpho ›
景颇语
N/A N/A kac Sino-Tibetan ›
Sal
Latin
Burmese
Asia
Maale ›
马勒语
N/A N/A mdy Afro-Asiatic ›
Omotic
Ethiopic Africa
Western Lawa ›
西部拉威语
N/A N/A lcp Austroasiatic ›
Khasi–Palaungic
Thai Asia
Sidamo ›
锡达莫语
N/A N/A sid Afro-Asiatic ›
Cushitic
Ethiopic
Latin
Africa
Bariba ›
巴里巴语
N/A N/A bba Niger–Congo ›
Atlantic–Congo
Latin Africa
Izi ›
伊兹语
N/A N/A izz Niger–Congo ›
Atlantic–Congo
Latin Africa
Roviana ›
罗维那语
N/A N/A rug Austronesian ›
Malayo-Polynesian
Latin Oceania
Dadibi ›
达迪比语
N/A N/A mps Papuan
Gulf
Latin Oceania
Lun Bawang ›
弄巴湾语
N/A N/A lnd Austronesian ›
Malayo-Polynesian
Latin Asia
Chechen ›
车臣语
ce che che Northeast
Caucasian ›
Nakh
Cyrillic Europe
Kapingamarangi ›
卡平阿马朗伊语
N/A N/A kpg Austronesian ›
Malayo-Polynesian
Latin Oceania
Western Bukidnon Manobo ›
西布基农马诺布语
N/A N/A mbb Austronesian ›
Malayo-Polynesian
Latin Asia
Crimean Tatar ›
克里米亚鞑靼语
N/A crh crh Turkic ›
Common
Turkic
Cyrillic
Latin
Europe
Guajajára ›
瓜哈哈拉语
N/A N/A gub Tupian ›
Tupí–Guaraní
Latin South America
Timugon Murut ›
蒂穆贡-穆鲁特语
N/A N/A tih Austronesian ›
Malayo-Polynesian
Latin Asia
Lacid ›
勒期语
N/A N/A lsi Sino-Tibetan ›
Tibeto-Burman
Latin Asia
Huli ›
胡里语
N/A N/A hui Engan ›
South
Engan
Latin Oceania
Antipolo Ifugao ›
安蒂波洛伊富高语
N/A N/A ify Austronesian ›
Malayo-Polynesian
Latin Asia
Central Dusun ›
中部杜顺语
N/A N/A dtp Austronesian ›
Malayo-Polynesian
Latin Asia
Madurese ›
马都拉语
N/A N/A mad Austronesian ›
Malayo-Polynesian
Latin
Carakan
Pegon
Asia
Yom ›
约姆语
N/A N/A pil Niger–Congo ›
Atlantic–Congo
Latin Africa
Tuvan ›
图瓦语
N/A N/A tyv Turkic ›
Common
Turkic
Cyrillic
Old Turkic
Europe
Bokobaru ›
博科巴鲁语
N/A N/A bus Niger–Congo ›
Mande
Latin Africa
Busa ›
布萨语
N/A N/A bqp Niger–Congo ›
Mande
Latin Africa
Achi ›
阿奇语
N/A N/A acr Mayan ›
Quichean–Mamean
Latin North America
Mossi ›
莫西语
N/A mos mos Niger–Congo ›
Atlantic–Congo
Latin Africa
Nigerian Fulfulde ›
尼日利亚富拉语
ff ful fuv Niger–Congo ›
Atlantic–Congo
Latin
Adlam
Arabic
Africa
Goffa ›
果发语
N/A N/A gof Afro-Asiatic ›
Omotic
Ethiopic
Latin
Africa
Kasem ›
格森语
N/A N/A xsm Niger–Congo ›
Atlantic–Congo
Latin Africa
Eastern Cagayan Agta ›
东部卡加延-阿格塔语
N/A N/A duo Austronesian ›
Malayo-Polynesian
Latin Oceania
Shipibo ›
西皮沃语
N/A N/A shp Panoan ›
Mainline
Panoan
Latin South America
Bola ›
波拉语
N/A N/A bnp Austronesian ›
Malayo-Polynesian
Latin Oceania
Ambai ›
安拜语
N/A N/A amk Austronesian ›
Malayo-Polynesian
Latin Asia
Yabem ›
雅比姆语
N/A N/A jae Austronesian ›
Malayo-Polynesian
Latin Oceania
Numanggang ›
努曼干语
N/A N/A nop Trans–New
Guinea ›
Finisterre–Huon
Latin Oceania
Yongkom ›
永贡语
N/A N/A yon Trans–New
Guinea ›
Central
&
South
New
Guinea
Latin Oceania
Kalmyk-Oirat ›
卡尔梅克卫拉特语
N/A xal xal Mongolic ›
Central
Mongolic
Cyrillic
Latin
Europe
Tuma-Irumu ›
图马伊鲁穆语
N/A N/A iou Trans–New
Guinea ›
Finisterre–Huon
Latin Oceania
Siroi ›
西罗伊语
N/A N/A ssd Trans–New
Guinea ›
Madang
Latin Oceania
Lingao ›
临高语
N/A N/A onb Kra–Dai ›
Be–Tai
N/A Asia
Waskia ›
瓦吉语
N/A N/A wsk Trans–New
Guinea ›
Madang
Latin Oceania
Halbi ›
亥比语
N/A N/A hlb Indo-European ›
Indo-Iranian
Devanagari Asia
Nateni ›
纳特尼语
N/A N/A ntm Niger–Congo ›
Atlantic–Congo
Latin Africa
Yongbei Zhuang ›
邕北壮语
N/A N/A zyb Kra–Dai N/A Asia
Bariai ›
巴里亚语
N/A N/A bch Austronesian ›
Malayo-Polynesian
Latin Oceania
Bantoanon ›
班通安隆语
N/A N/A bno Austronesian ›
Malayo-Polynesian
Latin Asia
Gbaya ›
格巴亚语
N/A N/A krs Niger–Congo ›
Atlantic–Congo
Latin Africa
Keliko ›
克利科语
N/A N/A kbo Nilo-Saharan ›
Central
Sudanic
Latin Africa
Tennet ›
腾内特语
N/A N/A tex Nilo-Saharan ›
Eastern
Sudanic
Latin Africa
Oroko ›
奥罗科语
N/A N/A bdu Niger–Congo ›
Atlantic–Congo
Latin Africa
Bandial ›
班迪亚勒语
N/A N/A bqj Niger–Congo ›
Atlantic–Congo
Latin Africa
Tungag ›
通加格语
N/A N/A lcm Austronesian ›
Malayo-Polynesian
Latin Oceania
Baka ›
巴卡语
N/A N/A bdh Ubangian ›
Sere–Mba
Latin Africa
Suau ›
苏奥语
N/A N/A swp Austronesian ›
Malayo-Polynesian
Latin Oceania
Muthuvan ›
穆图凡语
N/A N/A muv Dravidian Tamil Asia
Pele-Ata ›
佩勒-阿塔语
N/A N/A ata West
New
Britain
Latin Oceania
Samberigi ›
桑贝里吉语
N/A N/A ssx Engan Latin Oceania
Western Bolivian Guarani ›
西部玻利维亚瓜拉尼语
N/A N/A gnw Tupian ›
Tupi–Guarani
Latin South America
Sabaot ›
萨鲍特语
N/A N/A spy Nilo-Saharan ›
Eastern
Sudanic
Latin Africa
Bambam ›
邦邦语
N/A N/A ptu Austronesian ›
Malayo-Polynesian
Latin Asia
Tsimané ›
齐马内语
N/A N/A cas Moseten–Chonan ›
Chimane
Latin South America
Waris ›
瓦里斯语
N/A N/A wrs Border ›
Bewani
Range
Latin Oceania
Yipma ›
伊普马语
N/A N/A byr Trans–New
Guinea ›
Angan
Latin Oceania
Adhola ›
阿多拉语
N/A N/A adh Nilo-Saharan ›
Eastern
Sudanic
Latin Africa
Agni Sanvi ›
阿格尼桑维语
N/A N/A any Niger–Congo ›
Atlantic–Congo
Latin Africa
Ashéninka ›
阿舍宁卡语
N/A N/A cpb Arawakan Latin South America
Teso ›
特索语
N/A N/A teo Nilo-Saharan ›
Eastern
Sudanic
Latin Africa
Bari ›
巴里语
N/A N/A bfa Nilo-Saharan ›
Eastern
Sudanic
Arabic
Latin
Africa
Chakma ›
查克玛语
N/A N/A ccp Indo-European ›
Indo-Iranian
Bengali
Chakma
Latin
Asia
Bualkhaw Chin ›
布阿尔考钦语
N/A N/A cbl Sino-Tibetan ›
Tibeto-Burman
Latin Asia
Falam Chin ›
法兰钦语
N/A N/A cfm Sino-Tibetan ›
Tibeto-Burman
Bengali
Latin
Asia
Chiru ›
茨鲁语
N/A N/A cdf Sino-Tibetan ›
Tibeto-Burman
Bengali
Latin
Asia
Frafra ›
法拉法拉语
N/A N/A gur Niger–Congo ›
Atlantic–Congo
Latin Africa
Northern Grebo ›
北部格雷博语
N/A grb gbo Niger–Congo ›
Atlantic–Congo
Latin Africa
San Mateo del Mar Huave ›
圣马特奥德马尔-瓦维语
N/A N/A huv Language
isolate
Latin North America
Kakwa ›
卡库瓦语
N/A N/A keo Puinave-Maku ›
Northwestern
Puinave-Maku
Latin Africa
Kaqchikel ›
喀克其奎语
N/A myn cki Mayan ›
Quichean–Mamean
Latin North America
Kaulong ›
卡乌龙语
N/A N/A pss Austronesian ›
Malayo-Polynesian
Latin Oceania
Western Kayah ›
西部克耶语
N/A N/A kyu Sino-Tibetan ›
Karen
Kayah Li
Latin
Myanmar
Asia
Kisiha ›
斯哈语
N/A N/A jmc Niger–Congo ›
Atlantic–Congo
Latin Africa
Nyakyusa ›
尼亚库萨语
N/A N/A nyy Niger–Congo ›
Atlantic–Congo
Latin Africa
Vunjo ›
文约语
N/A N/A vun Niger–Congo ›
Atlantic–Congo
Latin Africa
Kulung ›
库隆语
N/A N/A kle Sino-Tibetan ›
Mahakiranti
Devanagari Asia
Yi language ›
彝语
ii iii iii Sino-Tibetan ›
Lolo–Burmese
Yi Asia
Mongolian ›
蒙古语
mn mon mon Mongolic Traditional Mongolian Asia
Zhuang language ›
壮语
za zha zha Kra–Dai ›
Tai
Zhuang
Old Zhuang
Sawndip
Sawgoek
Asia
Esperanto ›
世界语
eo epo epo Constructed
language
Latin
Esperanto Braille
N/A
Abkhaz ›
阿布哈兹语
ab abk abk Northwest
Caucasian ›
Abkhaz–Abaza
Cyrillic Asia
Aragonese ›
阿拉贡语
an arg arg Indo-European ›
Italic
Latin Europe
Algerian Arabic ›
阿尔及利亚阿拉伯语
N/A N/A arq Afro-Asiatic ›
Semitic
Arabic Africa
Assamese ›
阿萨姆语
as asm asm Indo-European ›
Indo-Iranian
Eastern Nagari
Ahom
Assamese Braille
Latin
Asia
Asturian ›
阿斯图里亚斯语
N/A ast ast Indo-European ›
Italic
Latin Europe
Cornish ›
康沃尔语
kw cor cor Indo-European ›
Celtic
Latin Europe
Malay trade and creole ›
马来语克里奥尔语
N/A crp N/A Creole Latin Asia
Oceania
Kashubian ›
卡舒比语
N/A csb csb Indo-European ›
Balto-Slavic
Latin Europe
Lower Sorbian ›
下索布语
N/A dsb dsb Indo-European ›
Balto-Slavic
Latin Europe
Canadian French ›
加拿大法语
N/A N/A N/A Indo-European ›
Italic
Latin North America
Middle French ›
中古法语
N/A frm frm Indo-European ›
Italic
Latin Europe
Franco-Provençal ›
法兰克-普罗旺斯语
N/A N/A frp Indo-European ›
Italic
Latin Europe
Friulian ›
弗留利语
N/A fur fur Indo-European ›
Italic
Latin Europe
Guarani ›
瓜拉尼语
gn grn grn Tupian ›
Tupi–Guarani
Guarani
Latin
South America
Chhattisgarhi ›
恰蒂斯加尔语
N/A N/A hne Indo-European ›
Indo-Iranian
Devanagari
Odia
Asia
Upper Sorbian ›
上索布语
N/A hsb hsb Indo-European ›
Balto-Slavic
Latin
Sorbian
Europe
Hupa ›
胡帕语
N/A hup hup Dené–Yeniseian Latin North America
Interlingua ›
因特语
ia ina ina Constructed
language
Latin N/A
Interlingue ›
西方国际语
ie ile ile Constructed
language
Latin N/A
Ido ›
伊多语
io ido ido Constructed
language
Latin N/A
Jakun ›
贾昆语
N/A N/A jak Austronesian ›
Malayo-Polynesian
Latin Asia
Lojban ›
逻辑语
N/A jbo jbo Constructed
language
Latin N/A
Greenlandic ›
格陵兰语
kl kal kal Eskimo–Aleut Latin
Scandinavian Braille
North America
Kanuri ›
卡努里语
kr kau kau Nilo-Saharan ›
Saharan
Latin Africa
Kashmiri ›
克什米尔语
ks kas kas Indo-European ›
Indo-Iranian
Perso-Arabic
Devanagari
Sharada
Asia
Lingua Franca Nova ›
新通用语
N/A N/A lfn Constructed
language
Latin
Cyrillic
N/A
Limburgs ›
林堡语
li lim lim Indo-European ›
Germanic
Latin Europe
Maithili ›
迈蒂利语
N/A mai mai Indo-European ›
Indo-Iranian
Tirhuta
Kaithi
Devanagari
Asia
Mirandese ›
米兰达语
N/A mwl mwl Indo-European ›
Italic
Latin Europe
Bokmål ›
书面挪威语
nb nob nob Indo-European ›
Germanic
Latin Europe
Low German ›
低地德语
N/A nds nds Indo-European ›
Germanic
Latin Europe
Nynorsk ›
新挪威语
nn nno nno Indo-European ›
Germanic
Latin Europe
Southern Ndebele ›
南恩德贝莱语
nr nbl nbl Niger–Congo ›
Atlantic–Congo
Latin
Ndebele Braille
Africa
Northern Sotho ›
北索托语
N/A nso nso Niger–Congo ›
Atlantic–Congo
Latin
Sotho Braille
Africa
Occitan ›
奥克语
oc oci oci Indo-European ›
Italic
Latin Europe
Pampanga ›
邦板牙语
N/A pam pam Austronesian ›
Malayo-Polynesian
Latin
Kulitan
Asia
Iranian Persian ›
伊朗波斯语
N/A N/A pes Indo-European ›
Indo-Iranian
Perso-Arabic Asia
Plateau Malagasy ›
高原马达加斯加语
N/A N/A plt Austronesian ›
Malayo-Polynesian
Latin
Malagasy Braille
Africa
Brazilian Portuguese ›
巴西葡萄牙语
N/A N/A N/A Indo-European ›
Italic
Latin
Portuguese Braille
South America
Romansh ›
罗曼什语
rm roh roh Indo-European ›
Italic
Latin Europe
Sanskrit ›
梵语
sa san san Indo-European ›
Indo-Iranian
Devanagari
Brahmic
Asia
Sardinian ›
撒丁语
sc srd srd Indo-European ›
Italic
Latin Europe
Northern Sami ›
北萨米语
se sme sme Uralic ›
Finno-Ugric
Latin
Northern Sami Braille
Europe
Serbo-Croatian ›
塞尔维亚-克罗地亚语
sh N/A hbs Indo-European ›
Balto-Slavic
Latin
Cyrillic
Yugoslav Braille
Europe
Shan ›
掸语
N/A shn shn Kra–Dai ›
Kam–Tai
Burmese Asia
Swazi ›
斯威士语
ss ssw ssw Niger–Congo ›
Atlantic–Congo
Latin
Swazi Braille
Africa
Klingon ›
克林贡语
N/A tlh tlh Constructed
language
Latin
Klingon
N/A
Toki Pona ›
道本语
N/A N/A N/A Constructed
language
N/A N/A
Walon ›
瓦隆语
wa wln wln Indo-European ›
Italic
Latin Europe

Notes

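  1. ISO 639 is a standardized nomenclature used to classify languages. Each language is assigned a two-letter code (639-1) and three-letter codes (639-2 and 639-3), all lowercase, which have been amended in later versions of the nomenclature[2]. Where ISO 639-2 defines two codes for one language, (B) marks the bibliographic code and (T) the terminology code. N/A means that no code of that type is listed.
  2. For several minority languages without an official Chinese name, we consulted other reliable sources[3] or used a transliteration of the name.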
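
As a minimal sketch of how the code columns of the table above could be held in a program, the following Python snippet stores a few rows copied from the list and looks a language up by any of its ISO 639 codes. The `LanguageEntry` class, its field names, and the `build_index` and `lookup` helpers are hypothetical names chosen for illustration; they are not part of this repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LanguageEntry:
    """One row of the language list (hypothetical structure, for illustration)."""
    name: str
    iso639_1: Optional[str]    # two-letter code, None where the table says N/A
    iso639_2: Tuple[str, ...]  # a single code, or the (B) and (T) codes; empty for N/A
    iso639_3: Optional[str]    # three-letter code, None where the table says N/A
    family: str


# A few rows copied from the table above.
ENTRIES: List[LanguageEntry] = [
    LanguageEntry("Albanian", "sq", ("alb", "sqi"), "sqi", "Indo-European"),
    LanguageEntry("German", "de", ("ger", "deu"), "deu", "Indo-European"),
    LanguageEntry("Tok Pisin", None, ("tpi",), "tpi", "English Creole"),
    LanguageEntry("Cantonese", None, (), "yue", "Sino-Tibetan"),
]


def build_index(entries: List[LanguageEntry]) -> Dict[str, LanguageEntry]:
    """Map every known code (639-1, 639-2 B/T, 639-3) to its entry."""
    index: Dict[str, LanguageEntry] = {}
    for entry in entries:
        codes = [entry.iso639_1, *entry.iso639_2, entry.iso639_3]
        for code in codes:
            if code is not None:
                index[code.lower()] = entry
    return index


INDEX = build_index(ENTRIES)


def lookup(code: str) -> Optional[LanguageEntry]:
    """Return the entry for a code, ignoring case and surrounding spaces."""
    return INDEX.get(code.strip().lower())
```

Note that some codes appear in more than one row of the table (for example `zh` for both Chinese variants and `mn` for both Mongolian rows), so a one-to-one index like this keeps only the last row seen for such a code; mapping each code to a list of entries would be one way around that.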

Corpora

  • Multi-lingual Data Projects:

Corpora | Type | Language | Detail | Domain
DGT | Multilingual Parallel | bg cs da de el en es et fi fr ga hr hu it lt, etc. | 25 languages, 299 bitexts, 113.52M sents. | Law
CCAligned | English at core | af ak am ar as ay az be bg bm bn br bs ca cb, etc. | 113 languages, 112 bitexts, 2.25G sents. | Web Document
News-Commentary | Multilingual Parallel | ar cs de en es fr hi id it ja kk nl pt ru zh | 15 languages, 109 bitexts, 2.97M sents. | News Commentaries
Europarl | Multilingual Parallel | bg cs da de el en es et fi fr hu it lt lv nl, etc. | 21 languages, 211 bitexts, 30.32M sents. | Parliament Proceedings
wikimedia | Multilingual Parallel | ab ace ady af ak am an ang ar arc ary arz as ast atj, etc. | 306 languages, 2,575 bitexts, 31.62M sents. | Mixed
EuroPat | Multilingual Parallel | de en es fr hr no pl | 7 languages, 21 bitexts, 143.74M sents. | Patent
WikiMatrix | Multilingual Parallel | an ar arz as az azb ba bar be bg bn br bs ca ceb, etc. | 86 languages, 1,620 bitexts, 300.27M sents. | Wikipedia Article
UNPC | Multilingual Parallel | ar en es fr ru zh | 6 languages, 15 bitexts, 172.04M sents. | Parliamentary Records
MultiParaCrawl | Multilingual Parallel | bg ca cs da de el es et eu fi fr ga gl ha hr, etc. | 40 languages, 669 bitexts, 505.48M sents. | Mixed
TildeMODEL | Multilingual Parallel | bg cs da de el en es et fi fr hr hu is it lt, etc. | 30 languages, 274 bitexts, 62.44M sents. | Mixed
Tatoeba | English at core | ab acm ady af afb afh aii ain ajp akl aln alt am an ang, etc. | 366 languages, 3,632 bitexts, 9.52M sents. | Oral
SETIMES | Multilingual Parallel | bg bs el en hr mk ro sq sr tr | 10 languages, 45 bitexts, 17.60M sents. | Official Documents
Wikititles | Multilingual Parallel | ar bg cs da de el en es fa fi fr hu it ja ko, etc. | 23 languages, 506 bitexts, 24.25M sents. | Title
OpenSubtitles | Multilingual Parallel | af ar bg bn br bs ca cs da de el en eo es et, etc. | 62 languages, 1,782 bitexts, 3.35G sents. | Subtitles
XNLI | Multilingual Parallel | fr es de el bg ru tr ar vi th zh hi sw ur en | 15 languages, 105 bitexts, 1.5M sents. | Mixed
stanford | English at core | cs de vi | 3 languages, 3 bitexts, 20.3M sents. | Mixed
Um-Corpus | Bilingual Parallel | en zh | 2.0M sents. | Mixed
ASPEC | Bilingual Parallel | en ja | 3.0M sents. | Paper Abstract
EVB | Bilingual Parallel | en vi | 10.0M sents. | Book
IIT | Bilingual Parallel | en hi | 1.6M sents. | Mixed
  • Multi-lingual Data Shared by MT Conference/Workshop:

    CCMT The China Conference on Machine Translation (CCMT), formerly known as the China Workshop on Machine Translation (CWMT), is a flagship machine translation conference in China. Its evaluations focus mainly on Chinese, English and domestic minority languages (Mongolian, Tibetan, Uyghur, etc.) in domains such as news, spoken language and governmental documents. In addition, CCMT publishes all evaluation-related data online.

    WMT WMT has been hosted annually by the Special Interest Group for Machine Translation (SIGMT) since 2006. WMT evaluation campaigns focus on translation between English and more than ten other languages, such as German, Finnish, Czech, Romanian, Polish and Russian, in domains including news, information technology and biomedicine. WMT publishes the evaluation resources specific to each evaluation task; they can be found at Shared Task-Provided Data.

    NIST The NIST machine translation evaluation started in 2001 as part of the DARPA TIDES program. The evaluations are driven and coordinated by NIST as NIST OpenMT. In the early days, NIST evaluations mainly assessed translation from languages such as Arabic and Chinese into English. Since 2016, NIST has also evaluated low-resource language technologies. Results of past MT evaluations, as well as resources specific to each evaluation, can be accessed via the year-specific links.

    IWSLT The International Conference on Spoken Language Translation (IWSLT), held annually since 2004, is an evaluation campaign dedicated to spoken language translation. The test data includes multilingual subtitles of TED talks and QED lectures. The languages covered include English, French, German, Czech, Chinese, Arabic and many others. Evaluation-related resources can be accessed at Shared Tasks-Training and Development Data.

    WAT The Workshop on Asian Translation (WAT) is a new open evaluation campaign focusing on Asian languages. The eight workshops held so far have been jointly organized by the Japan Science and Technology Agency (JST), the National Institute of Information and Communications Technology (NICT) and other institutions. WAT focuses on translation from mainstream Asian languages (Chinese, Korean, Hindi, etc.) and English into Japanese, across domains such as academic papers, patents, news and recipes. Datasets for evaluation can be accessed at Translation Task-Dataset via the year-specific links.
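
The Detail column of the corpora table above gives sizes in a compact form such as `113.52M sents.` or `2.25G sents.`. Below is a minimal sketch of turning those strings into plain sentence counts so that corpora can be compared or sorted by size. It assumes that M means 10^6 and G means 10^9, which the table itself does not define; `parse_sentence_count` is a hypothetical helper name.

```python
import re

# Assumed meaning of the suffixes; the table itself does not define them.
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "G": 10**9}

# A number, an optional K/M/G suffix, then the word "sents" with an optional period.
_SIZE_RE = re.compile(r"^\s*([\d.,]+)\s*([KMG])?\s*sents\.?\s*$", re.IGNORECASE)


def parse_sentence_count(text: str) -> int:
    """Convert a size string such as '2.25G sents.' to an integer count."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size string: {text!r}")
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    # round() absorbs the small floating-point error from the multiplication.
    return round(number * _MULTIPLIERS.get(suffix, 1))


# Example: order three corpora from the table by their stated size.
sizes = {
    "CCAligned": "2.25G sents.",
    "DGT": "113.52M sents.",
    "XNLI": "1.5 M sents.",
}
ranked = sorted(sizes, key=lambda name: parse_sentence_count(sizes[name]), reverse=True)
```

The counts are only as precise as the rounded figures in the table, so they are better suited to rough comparisons than to exact bookkeeping.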

References
