Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increased coverage of language codes not in iso639 lib #132

Merged
merged 12 commits into from
Apr 18, 2020

Conversation

tpimentelms
Copy link
Contributor

This PR adds new language codes to the dictionary in wikipron/languagecodes.py to increase language coverage.

@kylebgorman
Copy link
Collaborator

Hi! @jacksonllee tells me that there are some issues with this he noticed; for instance we already scrape Galician as glg; glc is Bon Gula. I'm going to assign him and @lfashby as reviewers since I don't have a chance to look more closely right now.

Copy link
Collaborator

@jacksonllee jacksonllee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tpimentelms , thank you for the contribution! wikipron/languagecodes.py is primarily designed to handle languages from Wiktionary that has at least a certain number of entries to be worth scraping (this threshold is set to be 100 at the moment). More concretely, if the iso639 package can't handle a language code/name but the language doesn't have enough data on Wiktionary to pass our threshold for scraping, we don't add it to wikipron/languagecodes.py (yet -- Wiktionary is constantly expanding and we add missing language codes/names when they make our threshold).

If you'd like to contribute by adding missing language codes/names that WikiPron should currently handle but doesn't yet, WikiPron has this test that throws us warnings for which languages on Wiktionary now have enough data but WikiPron can't handle. The Circle CI builds from your PR (example) show that there are three that we could potentially add at this point:

tests/test_languagecodes.py::test_language_coverage
  /root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Bouyei".
    warnings.warn(f'WikiPron cannot handle "{language}".')

tests/test_languagecodes.py::test_language_coverage
  /root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Laboya".
    warnings.warn(f'WikiPron cannot handle "{language}".')

tests/test_languagecodes.py::test_language_coverage
  /root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Moroccan Arabic".
    warnings.warn(f'WikiPron cannot handle "{language}".')

Would you be interested in adding just these three to wikipron/languagecodes.py?

For the changes in this PR thus far, I didn't have time to check all the dozens of new key-value pairs one by one, but did some quick spot checking. Most of them don't make our threshold of 100, it looks like?

wikipron/languagecodes.py Outdated Show resolved Hide resolved
@lfashby
Copy link
Collaborator

lfashby commented Mar 21, 2020

Regarding Jackson's comments, I'd like to add that Wikipron can currently scrape the languages that fail test_languagecodes.py but we still receive warnings about them because we did not add the lowercase Wiktionary language name in addition to the iso639 code. Something like:

    # Bouyei. ISO 639-3 only.
    "pcc": "Bouyei",
    "bouyei": "Bouyei",

would prevent the warning for these languages. Additionally, "lmy": "Lamboya", should be changed to "lmy": "Laboya", as "Lamboya" is the iso639 language name.

One question that the proposed changes brings up is whether we should expand languagecodes.py to handle someone using a code like ekk for Estonian. est apparently refers to the Estonian macrolanguage while ekk refers to Standard Estonian.

@tpimentelms
Copy link
Contributor Author

But couldn't these language codes be there, but not be automatically scraped by wikipron?
I added these ones because I needed them for a personal project.

About the ekk vs est discussion I don't have an opinion. I can remove it if you think that is better.

@lfashby
Copy link
Collaborator

lfashby commented Mar 21, 2020

I don't think there is any harm in having these in languagecodes.py. Looks to me like all the language codes are correct and running a Wikipron scrape with them doesn't throw any errors. I'd like to hear what Jackson has to say about it though, perhaps if we include these it would make sense to eventually remove the min language size requirement in test_languagecodes.py?

Similar to the ekk and est thing there is also per versus pes for Persian - per is the iso639-2/B code for the Persian macro-language while pes is the iso639-3 code for Iranian Persian. Not sure which we prefer and I'd defer to Jackson on that question.

Just out of curiosity - did you use Wikipron to scrape for these languages and get usable data for all of them?

wikipron/languagecodes.py Outdated Show resolved Hide resolved
@tpimentelms
Copy link
Contributor Author

I focused on getting only phonemic data (and not phonetic) and languages 'kat', 'ket', 'olo', 'kca', 'lbe', 'nio', 'hye' didn't have any entries.
The others I fetched with wikipron. Didn't check the resulting files for the other languages yet, though.

Copy link
Collaborator

@jacksonllee jacksonllee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Sorry for the delay in reviewing this. I've checked all the language codes against the ISO 639-3 list as well as their corresponding language names on Wiktionary. Thanks for the contribution, @tpimentelms!

Re: adding language codes for languages with fewer than 100 entries on Wiktionary, I agree that there's no harm including them now.

I'll defer to @kylebgorman for merging and/or for another look if need be.

wikipron/languagecodes.py Show resolved Hide resolved
wikipron/languagecodes.py Show resolved Hide resolved
@kylebgorman kylebgorman merged commit ff8964a into CUNY-CL:master Apr 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants