Increased coverage of language codes not in iso639 lib #132
Hi! @jacksonllee tells me that there are some issues with this he noticed; for instance we already scrape Galician as |
Hi @tpimentelms, thank you for the contribution! wikipron/languagecodes.py is primarily designed to handle languages from Wiktionary that have at least a certain number of entries to be worth scraping (this threshold is currently set to 100). More concretely, if the iso639 package can't handle a language code/name but the language doesn't have enough data on Wiktionary to pass our threshold for scraping, we don't add it to wikipron/languagecodes.py (yet; Wiktionary is constantly expanding, and we add missing language codes/names when they meet our threshold).
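To make the policy above concrete, here is a minimal sketch of the lookup order and the threshold rule. It is not WikiPron's actual code: `STANDARD_CODES` is a hypothetical stand-in for whatever the iso639 package resolves, `EXTRA_CODES` is a hypothetical stand-in for the supplementary dictionary in wikipron/languagecodes.py, and the function names are invented for illustration.

```python
from typing import Optional

# Hypothetical stand-in for names a general-purpose ISO 639 library resolves.
STANDARD_CODES = {"English": "eng", "Galician": "glg"}

# Hypothetical stand-in for the supplementary dictionary of names
# that the standard lookup misses.
EXTRA_CODES = {"Moroccan Arabic": "ary"}

# Minimum number of Wiktionary entries, per the threshold described above.
MIN_ENTRIES = 100


def resolve_code(language: str) -> Optional[str]:
    """Return a code for `language`, trying the standard lookup first."""
    code = STANDARD_CODES.get(language)
    if code is None:
        # Fall back to the supplementary dictionary.
        code = EXTRA_CODES.get(language)
    return code


def worth_adding(language: str, entry_count: int) -> bool:
    """True if the language is still unresolved and meets the threshold."""
    return resolve_code(language) is None and entry_count >= MIN_ENTRIES
```

Under this sketch, a language is a candidate for the supplementary dictionary only when both conditions hold: neither lookup resolves it, and it has at least 100 entries.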
If you'd like to contribute by adding missing language codes/names that WikiPron should currently handle but doesn't yet, WikiPron has this test that emits warnings for languages that now have enough data on Wiktionary but that WikiPron can't handle. The Circle CI builds from your PR (example) show three that we could potentially add at this point:
tests/test_languagecodes.py::test_language_coverage
/root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Bouyei".
warnings.warn(f'WikiPron cannot handle "{language}".')
tests/test_languagecodes.py::test_language_coverage
/root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Laboya".
warnings.warn(f'WikiPron cannot handle "{language}".')
tests/test_languagecodes.py::test_language_coverage
/root/project/tests/test_languagecodes.py:97: UserWarning: WikiPron cannot handle "Moroccan Arabic".
warnings.warn(f'WikiPron cannot handle "{language}".')
Would you be interested in adding just these three to wikipron/languagecodes.py?
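For readers unfamiliar with this kind of check, the following is a rough sketch of how a coverage test could produce warnings like the ones quoted above. It is an assumption about the general shape, not the contents of tests/test_languagecodes.py: the function name `warn_uncovered`, its parameters, and the sample data are hypothetical. Only the warning message format is taken from the log.

```python
import warnings


def warn_uncovered(entry_counts, handled, threshold=100):
    """Warn about languages with enough entries that are not yet handled.

    entry_counts: mapping of language name to number of entries (made-up data).
    handled: set of language names that can already be resolved.
    Returns the list of uncovered language names, sorted alphabetically.
    """
    uncovered = []
    for language, count in sorted(entry_counts.items()):
        if count >= threshold and language not in handled:
            # Same message format as in the CI log quoted above.
            warnings.warn(f'WikiPron cannot handle "{language}".')
            uncovered.append(language)
    return uncovered


if __name__ == "__main__":
    # Illustrative numbers only; these are not real Wiktionary counts.
    sample_counts = {"Language A": 120, "Language B": 30, "Language C": 5000}
    print(warn_uncovered(sample_counts, handled={"Language C"}))
```

Because the check emits warnings rather than failing, the build stays green while still surfacing languages that have crossed the threshold.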
For the changes in this PR thus far, I didn't have time to check all the dozens of new key-value pairs one by one, but did some quick spot checking. Most of them don't make our threshold of 100, it looks like?
Regarding Jackson's comments, I'd like to add that Wikipron can currently scrape the languages that fail
would prevent the warning for these languages. Additionally, one question that the proposed changes bring up is whether we should expand |
But couldn't these language codes be there, but not be automatically scraped by wikipron? About the |
I don't think there is any harm in having these in Similar to the Just out of curiosity - did you use Wikipron to scrape for these languages and get usable data for all of them? |
I focused on getting only phonemic data (and not phonetic) and languages |
LGTM! Sorry for the delay in reviewing this. I've checked all the language codes against the ISO 639-3 list as well as their corresponding language names on Wiktionary. Thanks for the contribution, @tpimentelms!
Re: adding language codes for languages with fewer than 100 entries on Wiktionary, I agree that there's no harm in including them now.
I'll defer to @kylebgorman for merging and/or for another look if need be.
This PR adds new language codes to the dictionary in wikipron/languagecodes.py to increase language coverage.