Increased coverage of language codes not in iso639 lib #132

tpimentelms · 2020-03-21T12:07:57Z

This PR adds new language codes to the dictionary in wikipron/languagecodes.py to increase language coverage.

kylebgorman · 2020-03-21T14:54:36Z

Hi! @jacksonllee tells me that there are some issues with this he noticed; for instance we already scrape Galician as glg; glc is Bon Gula. I'm going to assign him and @lfashby as reviewers since I don't have a chance to look more closely right now.

jacksonllee

Hi @tpimentelms , thank you for the contribution! wikipron/languagecodes.py is primarily designed to handle languages from Wiktionary that has at least a certain number of entries to be worth scraping (this threshold is set to be 100 at the moment). More concretely, if the iso639 package can't handle a language code/name but the language doesn't have enough data on Wiktionary to pass our threshold for scraping, we don't add it to wikipron/languagecodes.py (yet -- Wiktionary is constantly expanding and we add missing language codes/names when they make our threshold).

If you'd like to contribute by adding missing language codes/names that WikiPron should currently handle but doesn't yet, WikiPron has this test that throws us warnings for which languages on Wiktionary now have enough data but WikiPron can't handle. The Circle CI builds from your PR (example) show that there are three that we could potentially add at this point:

tests/test_languagecodes.py::test_language_coverage
  /root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Bouyei".
    warnings.warn(f'WikiPron cannot handle "{language}".')

tests/test_languagecodes.py::test_language_coverage
  /root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Laboya".
    warnings.warn(f'WikiPron cannot handle "{language}".')

tests/test_languagecodes.py::test_language_coverage
  /root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Moroccan Arabic".
    warnings.warn(f'WikiPron cannot handle "{language}".')

Would you be interested in adding just these three to wikipron/languagecodes.py?

For the changes in this PR thus far, I didn't have time to check all the dozens of new key-value pairs one by one, but did some quick spot checking. Most of them don't make our threshold of 100, it looks like?

wikipron/languagecodes.py

lfashby · 2020-03-21T16:01:27Z

Regarding Jackson's comments, I'd like to add that Wikipron can currently scrape the languages that fail test_languagecodes.py but we still receive warnings about them because we did not add the lowercase Wiktionary language name in addition to the iso639 code. Something like:

    # Bouyei. ISO 639-3 only.
    "pcc": "Bouyei",
    "bouyei": "Bouyei",

would prevent the warning for these languages. Additionally, "lmy": "Lamboya", should be changed to "lmy": "Laboya", as "Lamboya" is the iso639 language name.

One question that the proposed changes brings up is whether we should expand languagecodes.py to handle someone using a code like ekk for Estonian. est apparently refers to the Estonian macrolanguage while ekk refers to Standard Estonian.

tpimentelms · 2020-03-21T17:49:21Z

But couldn't these language codes be there, but not be automatically scraped by wikipron?
I added these ones because I needed them for a personal project.

About the ekk vs est discussion I don't have an opinion. I can remove it if you think that is better.

lfashby · 2020-03-21T18:45:56Z

I don't think there is any harm in having these in languagecodes.py. Looks to me like all the language codes are correct and running a Wikipron scrape with them doesn't throw any errors. I'd like to hear what Jackson has to say about it though, perhaps if we include these it would make sense to eventually remove the min language size requirement in test_languagecodes.py?

Similar to the ekk and est thing there is also per versus pes for Persian - per is the iso639-2/B code for the Persian macro-language while pes is the iso639-3 code for Iranian Persian. Not sure which we prefer and I'd defer to Jackson on that question.

Just out of curiosity - did you use Wikipron to scrape for these languages and get usable data for all of them?

wikipron/languagecodes.py

tpimentelms · 2020-03-21T21:24:14Z

I focused on getting only phonemic data (and not phonetic) and languages 'kat', 'ket', 'olo', 'kca', 'lbe', 'nio', 'hye' didn't have any entries.
The others I fetched with wikipron. Didn't check the resulting files for the other languages yet, though.

jacksonllee

LGTM! Sorry for the delay in reviewing this. I've checked all the language codes against the ISO 639-3 list as well as their corresponding language names on Wiktionary. Thanks for the contribution, @tpimentelms!

Re: adding language codes for languages with fewer than 100 entries on Wiktionary, I agree that there's no harm including them now.

I'll defer to @kylebgorman for merging and/or for another look if need be.

wikipron/languagecodes.py

tpimentelms added 9 commits March 17, 2020 15:50

Fix Abkhaz language

50507b4

Fix for Avar language

4d6933e

Fix for Buryat language

545faa0

Fix for Chukchi language

e639211

Brushaski language code

7f9b565

Galician language code

bf25233

Increasing number of languages

5127d3d

Added missing language codes

13e00b7

Added new language codes

699e1c5

kylebgorman requested review from lfashby and jacksonllee March 21, 2020 14:54

jacksonllee requested changes Mar 21, 2020

View reviewed changes

wikipron/languagecodes.py Outdated Show resolved Hide resolved

tpimentelms added 2 commits March 21, 2020 17:42

Removed wrong langauge code and fixed lmy

3bc6ecb

Added lower case version of language name to languagecodes

e34294e

lfashby reviewed Mar 21, 2020

View reviewed changes

wikipron/languagecodes.py Outdated Show resolved Hide resolved

Removing duplicated line for ket language code

db1430c

jacksonllee approved these changes Apr 16, 2020

View reviewed changes

wikipron/languagecodes.py Show resolved Hide resolved

wikipron/languagecodes.py Show resolved Hide resolved

kylebgorman merged commit ff8964a into CUNY-CL:master Apr 18, 2020

jacksonllee mentioned this pull request Apr 18, 2020

Remove Persian from language code map #148

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increased coverage of language codes not in iso639 lib #132

Increased coverage of language codes not in iso639 lib #132

tpimentelms commented Mar 21, 2020

kylebgorman commented Mar 21, 2020

jacksonllee left a comment

lfashby commented Mar 21, 2020 •

edited

Loading

tpimentelms commented Mar 21, 2020

lfashby commented Mar 21, 2020

tpimentelms commented Mar 21, 2020

jacksonllee left a comment

Increased coverage of language codes not in iso639 lib #132

Increased coverage of language codes not in iso639 lib #132

Conversation

tpimentelms commented Mar 21, 2020

kylebgorman commented Mar 21, 2020

jacksonllee left a comment

Choose a reason for hiding this comment

lfashby commented Mar 21, 2020 • edited Loading

tpimentelms commented Mar 21, 2020

lfashby commented Mar 21, 2020

tpimentelms commented Mar 21, 2020

jacksonllee left a comment

Choose a reason for hiding this comment

lfashby commented Mar 21, 2020 •

edited

Loading