Endangered Languages

There is no centralised list of open-source code that would be useful for documenting, conserving, developing, preserving, or working with endangered languages. According to some estimates, half of the roughly 7,000 languages currently spoken are expected to become extinct this century (Wikipedia). However, academics, independent scholars, organizations, communities, and individuals are doing a lot of work to stop or slow this trend. This list is intended to provide a central location to document those efforts.

Contribute

To edit this list, simply click here. If you would like to discuss anything related to it, please open an issue. If you know of a resource that is not on this list, please add it, either using the link above or by submitting a pull request.

For further examples, see the example template file.

In general, please link directly to the resource or to the page describing the resource. The blurb after the link should be something short - the GitHub description generally works well, although the blurb may have to be written manually for non-GitHub links or for GitHub links which lack descriptions. Please make sure each link is on one line, to help with automatic alphabetization.
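As an illustration of why one-line entries help, here is a minimal sketch of how a section could be alphabetized automatically. It is not the tooling this list actually uses; the entry format (`- [Name](url) blurb`) and the function names `sort_key` and `alphabetize` are assumptions made for the example.

```python
import re

# Assumed (hypothetical) entry format: "[Name](url) blurb",
# optionally prefixed with a "- " or "* " bullet.
_LINK_TEXT = re.compile(r"^\s*(?:[-*]\s+)?\[([^\]]+)\]")


def sort_key(entry: str) -> str:
    """Return a case-insensitive key: the link text if present, else the whole line."""
    match = _LINK_TEXT.match(entry)
    name = match.group(1) if match else entry.strip()
    return name.casefold()


def alphabetize(entries):
    """Sort one-line entries by name, dropping blank lines."""
    return sorted((e for e in entries if e.strip()), key=sort_key)


if __name__ == "__main__":
    sample = [
        "- [clld](https://example.org/clld) Toolkit for cross-linguistic databases",
        "- [Anki](https://example.org/anki) Flashcard program",
        "- [bayesline](https://example.org/bayesline) Language identification",
    ]
    for line in alphabetize(sample):
        print(line)
```

Because each entry occupies exactly one line, sorting lines is enough; an entry wrapped across several lines would be split apart by a line-based sort like this one.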

Definitions

Endangered languages are human languages that are in danger of extinction. This list also encompasses minority languages, which are spoken by a stable but small population (for example, Maltese or Hawaiian), and low- or under-resourced languages, which are spoken by a significant population but are under-represented on the web (for instance, Quechua). These languages share certain characteristics; the most pertinent are sparse data and a lack of resources, ranging from spell-checkers to grammars to machine translation corpora. Other under-resourced languages that do not fall under this list include constructed languages (for instance, Klingon or Na'vi), computer languages (for instance, JavaScript or Lua), and extinct languages whose data is so sparse as to be computationally irrelevant for most purposes (for instance, Tocharian).

Open Source "promotes a universal access via a free license to a product's design or blueprint, and universal redistribution of that design or blueprint, including subsequent improvements to it by anyone." (Wiki). This is important because money and resources allocated to a language or project that is not open source are spent at the expense of possible extensibility elsewhere.

Looking for resources for code languages? Take a look at the awesome collection of other awesome lists.

Generic Repositories
i18n-related Repositories
Audio automation
Text automation
Experimentation
Natural language generation
Computing systems
Android Applications
Chrome Extensions
FieldDB
- FieldDB Webservices/Components/Plugins
Academic Research Paper-Specific Repositories
Example Repositories
Language & Code Interfaces
Organisations
- On GitHub
- Other OSS Organisations
Language Specific Projects
- Amharic
- Arabic
- Bengali
- Chichewa
- Georgian
- Guarani
- Hindi
- Høgnorsk
- Inuktitut
- Irish
- Japanese
- Kinyarwanda
- Korean
- Lingala
- Malay
- Malagasy
- Migmaq
- Minderico
- Nishnaabe
- Oromo
- Quechua
- Sami
- Scottish Gaelic
- Secwepemctsín
- Somali
- Tigrinya
- Zulu
Closed Source Resources

Generic Repositories

##Massive Dictionary and Lexicography projects

ABVD Austronesian Basic Vocabulary Database
CBOLD Comparative Bantu OnLine Dictionary
IE Indo-European comparative lexical resource
REFLEX a comparative dictionary project for Africa based out of CNRS in France.
Southeast Asian lexicography Several Southeast Asian lexicons hosted.
STEDT Tibeto-Burman focused project where dictionaries from several languages are comparable.
Tibeto-Burman lexicography

##Single language lexicography projects and utilities

###Utilities

DictionaryChromeExtension Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries)
Project for Free Electronic Dictionaries A project for a Java MIDlet for mobile phones, for indigenous language dictionaries.
Webonary Site which hosts digital dictionaries for single languages.
WeSay Allows language communities to build their own dictionaries. http://wesay.org (by SIL International)

###Interactions and presentations of data

Dict.cc An exemplary model of a successful bilingual (German-English) dictionary, as it has grown from a hobby to a business employing 22 people.
Koasati Digital Dictionary The Coushatta Tribe of Louisiana
Ojibwe People's Dictionary
Talking dictionary of Khinina-ang Bontok: The language spoken in Guina-ang, Bontoc, Mountain Province, the Philippines. Note that this dictionary is best viewed with Firefox 3.0 on Windows XP, which raises the question of how long the works we create will last and how to build a sustainable infrastructure for them. This has been the bane of the digital age, and many academics have not been able to overcome the challenge.
[Template for Multilayered Language Learning Resources](https://github.com/eddersko/web-template) This is a web-based template that may be used to present language learning resources to aid language revitalization efforts. It includes a talking dictionary and a phrasicon containing sentences and phrases.
The Yurok Language Project
Yami Dictionary

##Software

accentuate.us a.k.a. "charlifter". Statistical Unicodification of plain text for many languages
AGTK AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs.
Anki Anki is a program to make and share flashcard decks (including audio) for any language or writing system. http://ankisrs.net/
ANNIS Search and Visualization in Multilayer Linguistic Corpora
Apertium Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs.
ark-tweet-nlp CMU ARK Twitter Part-of-Speech Tagger (Fork)
bayesline A Multinomial Bayesian Classification for Language Identification
BloomDesktop Bloom Desktop is a hybrid c#/javascript/html/css Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia… http://bloomlibrary.org/
brain Neural networks in JavaScript
cdec Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms http://cdec-decoder.org/
charlint Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model.
clam Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
clld The clld python package is a toolkit to build cross-linguistic databases. A list of databases built with it is available at http://clld.org/datasets.html
Cog Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. http://sillsdev.github.io/cog/
CorpusTools Phonological CorpusTools http://phonologicalcorpustools.github.io/CorpusTools/
CTK Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible.
CuPED CuPED ('Customizable Presentation of ELAN Documents') is a tool for transforming time-aligned transcripts, such as those produced by ELAN, into a variety of presentation formats.
DataTags A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed. (Fork)
dataverse A data repository framework to share and publish research data.
dative A single-page application that interacts with multiple linguistic fieldwork web service databases.
DeepLearnToolbox Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started.
Desmeme Database and tools for exploring linguistic templates
dictdb dictionary database for language translation
discoursegraphs Python-based tool to convert and merge multilayer annotated linguistic data
DLTK Deutsch Language Tool Kit http://goo.gl/wdnz1W
ELAN ELAN is a professional tool for the creation of complex annotations on video and audio resources.
ELDER: Endangered Language Data Electronic Repository Endangered Language Data Electronic Repository: A web-based ontologically-compliant collaborative linguistic data cataloguing tool.
EMMA A Novel Evaluation Metric for Morphological Analysis
eopas ETHNOER Online Presentation and Annotation System
FieldWorks FieldWorks is a suite of software tools for language and cultural data, with support for complex scripts. http://fieldworks.sil.org/
FLEx / FieldWorks FieldWorks is a popular suite of software tools for language and cultural data, with support for complex scripts. http://fieldworks.sil.org/ FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you: elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, study morphology.
Gaia Gaia is an HTML5-based Phone UI for the Boot 2 Gecko Project. NOTE: For details of what branches are used for what releases, see https://wiki.mozilla.org/B2G. If you're interested in setting up a keyboard in a new language, see this.
Glottolog data Glottolog provides comprehensive reference information for the world's languages. The data published in Glottolog is curated in https://github.com/clld/glottolog-data
graf-python The library graf-python is an open source Python implementation to parse and write GrAF/XML files as described in ISO 24612. The parser of the library creates an annotation graph from the files. The user may then query the annotation graph via the API of graf-python.
Gramadóir Grammar checking engine that is designed for the rapid development of grammar checkers for minority languages and other languages with limited computational resources.
https://github.com/hyphenliu/cnminlangwebcollect Chinese minorities website languages detection and websites collection
https://github.com/leebock/languages Application files for the Smithsonian endangered languages story map.
iLanguageCloud An HTML5/Android word cloud generation codebase
itweets-geodata Geodata from Indigenous Tweets
koreksyon Tools for developing and implementing spell-checking and grammar-checking capabilities in low-resource languages
l20n.js L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural languages. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n. http://l20n.org
langtech A host of resources provided in SVN by the University of Tromsø. Details are here and in English here.
LDC Word Aligner LDC Word Aligner is a software tool used for manual annotation of word alignment developed to support Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to generate over 1,000,000 tokens of annotated word alignment data from a variety of genres including broadcast, newswire and web-based sources.
LEGO Unified Concepticon Material relating to the LEGO Unified Concepticon
Lex4All pronunciation LEXicons for Any Low-resource Language http://lex4all.github.io/lex4all/
LinGO Grammar Matrix The LinGO Grammar Matrix is a framework for the development of broad-coverage, precision, implemented grammars for diverse languages.
Lingpy LingPy: Python library for quantitative tasks in historical linguistics http://lingpy.org
Linguistica Linguistica is a program designed to explore the unsupervised learning of natural language, with primary focus on morphology (word-structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed.
lrl For work concerning low resource languages.
Machine Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages (used by FLEx)
Make-extensions Scripts for generating hunspell spellchecking extensions
Minority Translate Minority Translate is a simple program for helping content generation on smaller sized Wikipedias (actually any sized) by giving pointers to existing articles in other language Wikipedias, so that the user can easily translate or adapt existing texts and thus increase the size and usability of their Wikipedia editions.
morfessor Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
moz-l10n-tiers Creates a pseudo-locale to evaluate string prioritization for l10n
Natural Javascript general natural language facilities for node
NIST 2008 Open Machine Translation Evaluation
NLTK Python Natural Language Tool Kit. NLTK Source http://nltk.github.com/
node-panlex node.js client for PanLex
norma A tool for automatic spelling normalization
octothorpe CouchDB-powered wiki thing
ogoki iPhone & iPad template source code (as a .zip) for language learning. The app built and open sourced by the ogokilearning.com company who also offers developer training for first nations apps.
old-webapp Online Linguistic Database: software for creating web applications to collaboratively document languages. http://www.onlinelinguisticdatabase.org
OpenDataKit Open Data Kit (ODK) is an open-source suite of tools that helps organizations author, field, and manage mobile data collection solutions
panlex-tools This package contains scripts to transform lexical resources into a format suitable for importing into PanLex. Documentation may be found at http://dev.panlex.org/tools/
pepper Pepper is a pluggable, Java-based, open source converter framework for linguistic data.
poio-analyzer Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor allows linguists to add morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt.
poio-api Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F… http://media.cidles.eu/poio/poio-api
poio-corpus The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others.
poio-doc Documentation of the Poio project. http://www.poio.eu
poio-site The website of the Poio project - http://www.poio.eu
pressagio Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string.
pyannotation PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files.
PyAnnotationGraph Implement the formal model for linguistic annotations described in Bird and Liberman (2001) using Python and SQL
pyDelphin Python libraries for DELPH-IN (Friendly Fork)
Rosetta Pangloss The Rosetta Project's Pangloss system
Salt A graph-based model to store and manipulate linguistic data.
Secwepemc-Facebook Translate Facebook into unsupported languages
SeedLing Building and Using A Seed Corpus for the Human Language Project
Skype in your language Translate Skype into unsupported languages
SPHERE Conversion Tools Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats.
Stanford CoreNLP Python Python wrapper for Stanford CoreNLP tools
sugali Legacy repository of the language identification project for many (many) languages, from the software project course on NLP projects for low-resource languages.
SuGarLike Language Identification for Low Resource Languages (by Susanne, Guy and Liling)
TeraDict Translate English words into hundreds of languages!
TexNLP TexNLP: Texas Natural Language Processing tools
Toney Tone Classification Software
Toolbox Scripts for ELAN Mirror of Alexander Koenig's Toolbox Scripts https://tla.mpi.nl/tools/tla-tools/elan/thirdparty/
ToolsForFieldLinguistics A collection of scripts and recipes for linguistics
translitit-engine A transliteration engine written in JavaScript
Tsammalex data Tsammalex is a multilingual lexical database on plants and animals. The data published on the Tsammalex website is curated collaboratively at https://github.com/clld/tsammalex-data
tweet2learn An app to make it easier to use your native language on Twitter
Unicodify Unicodify is a suite of programs for converting text in a variety of 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII.
UniversalDependencies docs Universal Dependencies online documentation http://universaldependencies.github.io/docs/
UniversalDependencies tools Various utilities for processing the data.
wavesurfer.js Navigable waveform built on Web Audio and Canvas http://www.wavesurfer.fm (Also has an ELAN plugin)
WeSay Allows language communities to build their own dictionaries. http://wesay.org (by SIL International)
WordBoundary An experiment in the detection and segmentation of word boundaries
wordbyword WordByWord is a free, open source, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe at CIDLeS with the support of the Foundation for Endangered Languages.
WSI4URLang Word Sense Induction (WSI) for Under-resourced Languages (URLang) http://www.mohammadnasiruddin.eu/under-resourced-language-urlang.html
XDXF_Makedict XDXF dictionary format and "makedict" dictionary converting software (official repository)
XTrans XTrans is a next generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics.

i18n-related Repositories

Polyglot.js Give your JavaScript the ability to speak many languages.
Transifex - System providing a user-friendly, project-oriented approach to translating .po files. Great for non-technical users, free for open-source projects, decent for minority languages; however, it can take a while to get a new language added to the Transifex system, because the ticketing system Transifex uses sometimes loses tickets. Provides translation memory, the ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared.

Audio automation

arctic-prompts Generate prompts PDF for CMU ARCTIC dataset
AudioWebService a simple nodejs server which accepts upload of audio and runs it through praat
AuToBI Automatic prosodic annotation tool written in Java.
BashScriptsForPhonetics (Fork of a dormant project)
esv-text-audio-aligner ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio
et-pocketsphinx-tutorial Tutorial of Estonian speech recognition using PocketSphinx
html5-audio-read-along HTML5 Audio Read-Along
ipa-chart International Phonetic Alphabet (IPA) Unicode Chart and Character Picker
lex4all pronunciation LEXicons for Any Low-resource Language (Fork of a student project)
opensauce GNU Octave-compatible version of VoiceSauce
pocketsphinx.js Speech recognition in JavaScript
praat-py From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. (Fork of a dormant project)
Praat-Scripts Mietta's Scripts
PraatTextGridJS A small library which can parse TextGrid into json and json into TextGrid
prosodicParsing different kinds of HMMs to use for incorporating prosody into basic parsing
Prosodylab-Aligner Python interface for forced audio alignment using HTK and SoX
prosodylab.alignertools
Recordmp3js Record MP3 files directly from the browser using JS and HTML

Text automation

clld Cross Linguistic Linked Data python library
LaTeX2HTML5 LaTeX web components
MultilingualCorporaExtractor Node io Spider for extracting multilingual corpora (Fork of a student project)
SeedLing Building and Using A Seed Corpus for the Human Language Project (Fork of a student project)

Experimentation

experigen A framework for creating linguistic experiments
GamifyPsycholinguisticsExperiments A simple node server to gamify linguistics experiments; runs offline on a laptop for small scale experiments and online on a server for large scale experiments. Data is sent to a Google spreadsheet. (Fork of a dormant project)
OpenSesame Graphical experiment builder for the social sciences
OPrime Open Source Experimentation Libraries - Online and Offline for Android and HTML5
psychopyMegProsody Runs MegProsody using PsychoPy.
PsychScript An HTML5/JavaScript library for running behavioural experiments online.

Natural language generation

hailo A conversation bot using Markov chains

Computing systems

Common Language Resources and Technology Infrastructure Norway / Clarino - One of their projects (not clearly listed here) is about providing an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like Yahoo! Pipes, but for language processing. Uses the ABEL cluster.

Android Applications

Aikuma Android software for recording and translation http://lp20.org/aikuma
AndroidFieldDB An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers.
AndroidFieldDBElicitationRecorder A general purpose video recording tool
AndroidLanguageLearningClientForFieldDB An Android language learning app which plugs into a FieldDB corpus to create language learning apps.
AndroidProductionExperiment Android App to run perception experiments
Bevara Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages
ojoVoz A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to http://sautiyawakulima.net/ojovoz
[Template for Word-Learning App](https://github.com/eddersko/android-template) This is a template of an Android word-learning app that may be used as a way to introduce a language. It includes a quiz. For the documentation, go to http://eddersko.github.io/android-template/

Chrome Extensions

babelfrog Chrome extension to help learn languages as you browse.
DictionaryChromeExtension Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries) use
KartuliChromeExtension Codebase to turn any webpage from any alphabet into another alphabet; the default is to turn Latin letters into Kartuli. use "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes."

FieldDB

FieldDB is actively worked on by the [https://github.com/OpenSourceFieldlinguistics] group. These repos explicitly work with it but could be repurposed for other projects.

FieldDB An offline/online field database which adapts to its user's terminology and I-Language, has plugins for various data automation routines along the process of primary data collection to cleaning to publication and archival. use

FieldDB Webservices/Components/Plugins

AndroidLanguageLearningClientForFieldDB-sikuli Sikuli tests for AndroidLanguageLearningClientForFieldDB
AuthenticationWebService A node.js web service which manages users and corpora creation and authentication use
bower-fielddb-angular A bower repository which hosts fielddb-angular components, bower install fielddb-angular --save
bower-fielddb A bower repository which hosts fielddb core components, bower install fielddb --save
dative A single-page application that interacts with multiple linguistic fieldwork web service databases. use
fielddb-spreadsheet-sikuli sikuli tests for the spreadsheet module use
FieldDBActivityFeed A fielddb activity feed widget which can be embedded in other codebases, websites etc use
FieldDBGlosser A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. bower install fielddb-glosser --save
FieldDBLexicon A lexicon browser/editor web widget for FieldDB databases use
FieldDBWebServer Web server which can display FieldDB public corpora/user's share pages use
LanguageClassDashboard App which provides a view of FieldDB corpora for language teachers use
LexiconWebService A node.js ElasticSearch wrapper for indexing/training lexicons from corpora use
LexiconWebServiceSample A node.js web server which implements the fieldlinguist's lexicon API for the FieldDB project. use

Academic Research Paper-Specific Repositories

Gargantua Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010.
ldc-kiy Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation, How to study a tone language.
Learning to map into a Universal POS tagset Yuan Zhang, Roi Reichart, Regina Barzilay and Amir Globerson
low-resource-pos-tagging-2013 and low-resource-pos-tagging-2013 Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation. Dan Garrette and Jason Baldridge. In Proceedings of NAACL 2013. And in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. Dan Garrette, Jason Mielens, and Jason Baldridge. In Proceedings of ACL 2013.
orthotree Linguistic family tree based on orthographic distance
type-supervised-tagging-2012emnlp This repository contains the code, scripts, and instructions needed to reproduce the results in the paper: Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. Dan Garrette and Jason Baldridge. In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit nlp
visualizing-language For visualizations of WALS and other typological databases
WALS-APiCS Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics

Example Repositories

These are repositories that are generally only interesting for training purposes or seeing how something is done.

CorpusWebService über-simple node.js-Proxy to enable CORS request for couchdb
CorporaForFieldLinguistics
startR
lucenerevolution-2013 Demo examples for linguistics in Lucene and Solr
berlin-buzzwords-2013 Demo examples for Lucene, Solr, ElasticSearch and OpenNLP from Berlin Buzzwords 2013 talk

Language & Code Interfaces

قلب ‬ is a simple, Scheme-like programming language that you code entirely in Arabic. It is an exploration of the impact of human culture on computer science, the role of tradition in software engineering, and the connection between natural and computer languages.

Organisations

On GitHub

batumi Speech recognition and natural language processing for low-resource languages
lex4all
longnow
NLTK http://nltk.github.com/
OpenSourceFieldLinguistics
PhonologicalCorpusTools
Projet de recherche sur l'écriture crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics)
SIL International SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization providing software and tools tailored for use by field linguists and lexicographers working on endangered languages. A little-known fact is that much of its code is open sourced on GitHub, and SIL is happy to receive open source contributions and collaborate on open source projects.
SIL NRSI SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development.

Other OSS Organisations

The Language Archive Part of the MPI

Language Specific Projects

###Amharic

amh :: አማርኛ

HornMorpho - morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

###Arabic

ara :: العربية

Buckwalter A small python script that transliterates Arabic text using the Buckwalter Transliteration Scheme. It allows for multiple decisions to be made around whether or not to include all types of diacritics and characters or ignore them. Useful for NLP experiments where you may want to normalize text.
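
To give a sense of what such a script does, here is a minimal sketch of Buckwalter transliteration. It is not the code from the repository above; the function name `transliterate` and the `keep_diacritics` flag are hypothetical, and the table covers only the core letters and short-vowel diacritics of the standard Buckwalter scheme.

```python
# Letters U+0621..U+063A (hamza through ghain) in code-point order,
# paired with their standard Buckwalter symbols.
_LETTERS_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"
# Letters U+0641..U+064A (feh through yeh).
_LETTERS_2 = "fqklmnhwYy"
# Diacritics U+064B..U+0652 (fathatan through sukun).
_DIACRITICS = "FNKaui~o"

LETTERS = {chr(0x0621 + i): c for i, c in enumerate(_LETTERS_1)}
LETTERS.update({chr(0x0641 + i): c for i, c in enumerate(_LETTERS_2)})
DIACRITICS = {chr(0x064B + i): c for i, c in enumerate(_DIACRITICS)}


def transliterate(text: str, keep_diacritics: bool = True) -> str:
    """Map Arabic characters to Buckwalter symbols; pass other characters through.

    keep_diacritics is a hypothetical option for this sketch: when False,
    short vowels and related marks are dropped, which normalizes
    vocalized and unvocalized spellings to the same string.
    """
    out = []
    for ch in text:
        if ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            if keep_diacritics:
                out.append(DIACRITICS[ch])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    kitab = "\u0643\u062a\u0627\u0628"  # kaf, teh, alef, beh
    print(transliterate(kitab))
    kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf, teh, beh, each with fatha
    print(transliterate(kataba))
    print(transliterate(kataba, keep_diacritics=False))
```

Dropping diacritics is a lossy choice, so a flag like this is best left off when the transliteration needs to be reversible.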

###Bengali

ben :: বাংলা

Bangla-অঙ্কুর for Mac This project aims to develop a phonetic based Bangla typing system for Macintosh computer which can be developed into a transliteration technique in the future.
Bengali Writer 'Bengali Writer' is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more.
Ekushey Bangla Computing and Localization Project for the Bangla speaking people.
Lekho A collection of tools and resources for using bangla on computers

###Chichewa

nya :: chicheŵa

Chichewa NLP resources for Chichewa

###Georgian

kat :: ქართული

translitit-latin-to-mkhedruli-georgian A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript

###Guarani

grn :: Guarani

ParaMorfo - morphological analysis and generation of Spanish and Guarani verbs, nouns, and adjectives. Used to be here.

###Hindi

hin :: हिन्दी

hindi-morph An open source morphological analyzer for Hindi

###Høgnorsk

nno :: Høgnorsk

hunspell-hn_NO An early-stage spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpora.

###Inuktitut

iku :: Inuktitut

InuktitutComputing Inuktitut Morphological Analyser, transcoder, transliterator, corpus tools, and lexical lists for working with Inuktitut. Usable online at http://inuktitutcomputing.ca/index.php

###Irish

gle :: Gaeilge

aimsigh Source for the now-defunct aimsigh.com Irish search engine
caighdean Code for standardizing Irish language text
fleiscin Irish hyphenation patterns for TeX http://borel.slu.edu/fleiscin/
GaelSpell Sources for an Irish language spell checker
morphological analyzer & syntactic disambiguator Elaine Uí Dhonnchadha has produced a morphology in XFST/FOMA, which now seems to be hosted by Giellatekno. Includes syntax written in VISL Constraint Grammar.
tesseract-gle-uncial OCR for old Irish fonts

###Japanese

jpn :: 日本語

kuromoji Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
kuromoji-server Kuromoji server and demo that shows Japanese morphological analyzer capabilities

###Kinyarwanda

kin :: Ikinyarwanda

kin-morph-fst Kinyarwanda morphological analyzer
TurboTagger & TurboParser for Kinyarwanda (download) TurboTagger & TurboParser for Kinyarwanda

###Korean

kor :: 한국어

komoran Korean morphological analyzer

###Lingala

lin :: Lingála

Lingala NLP NLP tools and resources for Lingala

###Malay

MorfoMalayu morphological analysis of Malay words

###Malagasy

mlg :: Malagasy

Global Voices Malagasy Project This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. This corpus was collected and aligned at the sentence level by Victor Chahuneau.

###Migmaq

mic :: Mi'kmaq

migmaqLessons

###Minderico

fredericajordarzambarino A web-based game for mobile devices in Minderico, based on the "Who Wants to Be a Millionaire" TV show.
mindericobot

###Nishnaabe

oji :: Ojibwe, Oddawa, Chippewa, Anishinaabemowin, ᐊᓂᔑᓈᐯᒧᐎᓐ

Ojibway-iphone-app An iPhone app with audio and images for learning the Ojibway language.
OjibwayMap An iPhone app with audio and images for learning Ojibway language and culture.
nishanimate A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text.

###Oromo

orm :: Oromo

HornMorpho morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

###Quechua

que :: Runa Simi

AntiMorfo - morphological analysis and generation of Quechua nouns, adjectives, and verbs and Spanish verbs
Morphology, spellchecker - XFST and FOMA, plus OpenOffice plugin.

###Sami

sma :: Sámi/Saami

Divvun Sámi proofing tools. This links to the documentation page, which explains how to access the svn repository.
Giellatekno A host of Sámi tools.

Mobile keyboards (iOS and Android), learning apps, dictionaries, morphologies, syntax disambiguators, some project collaboration with Apertium on shallow translation between Saami languages, and the following:
Oahpa! - A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, and morphological and syntactic exercises generated from the morphological and syntactic tools.
Gïelese - A media learning application for South Saami, including images, sound and animation for learning basic phrases and core vocabulary. JavaScript application, playable on the web or via PhoneGap apps in Android or iOS.
Neahttadigisánit - A morphologically sensitive dictionary, with modes for 'social media input', which allow users to type a 'relaxed' version of the orthography (acdnstz will also be recognized as áčđŋšŧž). It also includes a JavaScript bookmarklet that offers click-to-read dictionary lookup. Also available for other Uralic and non-Uralic languages.
Giellatekno does a lot for other minority Uralic languages. The following keywords are included for CTRL+F friendliness:

Saami languages: North Saami, Lule Saami, South Saami // Inari Saami, Kildin Saami, Pite Saami, Skolt Saami.
Other Uralic languages: Erzya, Finnish, Hill Mari, Ingrian, Khanty, Kven, Komi, Livonian, Meadow Mari, Moksha, Nenets, Nganasan, Olonetsian, Udmurt, Veps.
Other languages: Buriat, Cornish, Faroese, Greenlandic, Iñupiaq, Northern Haida, Ojibwe, Plains Cree, Russian.
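
The 'relaxed' orthography input described for Neahttadigisánit can be sketched roughly as follows. This is not Neahttadigisánit's implementation: `FOLD`, `fold`, `lookup`, and the three-word `SAMPLE_LEXICON` are hypothetical, and the sketch assumes the North Saami letters á, č, đ, ŋ, š, ŧ, ž simply fold to a, c, d, n, s, t, z.

```python
import unicodedata

# Illustrative sketch only; names and the tiny lexicon are assumptions.
FOLD = str.maketrans({
    "\u00e1": "a",  # á
    "\u010d": "c",  # č
    "\u0111": "d",  # đ
    "\u014b": "n",  # ŋ
    "\u0161": "s",  # š
    "\u0167": "t",  # ŧ
    "\u017e": "z",  # ž
})

def fold(word):
    """Reduce a word to its relaxed, ASCII-only spelling."""
    return unicodedata.normalize("NFC", word).lower().translate(FOLD)

def lookup(query, lexicon):
    """Return lexicon entries whose relaxed spelling matches the query's."""
    key = fold(query)
    return [unicodedata.normalize("NFC", w) for w in lexicon if fold(w) == key]

# Hypothetical sample entries, for demonstration only.
SAMPLE_LEXICON = ["čáhci", "giella", "sámegiella"]

print(lookup("cahci", SAMPLE_LEXICON))
```

Folding both the query and the lexicon entries means a user can type with or without the special letters and still reach the same entry. A real dictionary would feed the candidates into morphological analysis rather than match whole word forms.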

###Scottish Gaelic

gla :: Gàidhlig

hunspell-gd Files for building Scottish Gaelic spell checkers

###Secwepemctsín

secwepemctsnem A project to help people learn Secwepemctsín.

###Somali

som :: Soomaaliga

somorph Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. An up-to-date version is checked in to Giellatekno's repository.
qaamuus.so A morphologically aware dictionary based on lexical resources found online and the Somali morphology.

###Tigrinya

tir :: ትግርኛ

HornMorpho morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

###Zulu

zul :: isiZulu

Ukwabelana An open-source morphological Zulu corpus

Closed Source Resources

Noto Fonts Noto is Google’s free font family that aims to support all the world’s scripts. Its design goal is to achieve visual harmonization across languages. Noto fonts are under Apache License 2.0.
