Fix Spanish dialect selection #553
One open question I have is whether to add country selectors for Latin America (e.g. Colombia, Chile, etc.) to the scrape's config, but I don't know how prevalent these are on Wiktionary or whether they're wanted in the wider Latin America dialect file.
LGTM, and I appreciate the new test, which is much needed.
Okay, so I think the right way to test this is to run the big scrape on Spanish and incorporate the resulting changes into the PR. For that you'd install (from this PR), navigate to data/scrape, issue ./scrape --restriction spa && ./postprocess, wait (about 12-24 hours), then stage and commit the changed files. Is this feasible on your end?
We don't have any discoverability for dialect strings. Ideally there'd be some way to get a list of them in descending frequency order and then you could manually cluster and write back into
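A rough sketch of what that frequency listing could look like, assuming the raw dialect label strings have already been collected into a plain list somehow. The function name rank_dialect_labels and the sample labels are hypothetical and not part of the scraper; how the labels would actually be harvested from Wiktionary is left open here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_dialect_labels(labels: Iterable[str]) -> List[Tuple[str, int]]:
    """Count raw dialect label strings and return them, most frequent first.

    Hypothetical helper: it only assumes the labels arrive as plain strings.
    """
    # Trim surrounding whitespace and skip empty labels before counting.
    counts = Counter(label.strip() for label in labels if label.strip())
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample input; real labels would come from the scraped pages.
sample = ["Spain", "Latin America", "Spain", " Colombia ", "Latin America", "Spain", ""]

for label, count in rank_dialect_labels(sample):
    print(f"{count}\t{label}")
```

The ranked list could then be reviewed by hand to decide which labels belong in which dialect cluster.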
Sure, I can run that in the next couple of days. I did do some work on getting a single parse for multiple dialects working, but it's a bit hacky, so I'll rerun the scrape using just this current branch.
Excited to see this; it will be a huge improvement we've wanted forever.
LGTM, big thanks to you for this.
Unreleased in CHANGELOG.md to reflect the changes in code or data.