Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Spanish dialect selection #553

Merged
merged 4 commits into from
Jul 27, 2024

Conversation

mmcauliffe
Copy link
Contributor

  • [ x ] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

@mmcauliffe
Copy link
Contributor Author

One open question I have is whether to include additional country selectors for Latin America (i.e. Columbia, Chile, etc) to the scrape's config, but I don't know how prevalent these are wiktionary or if they're wanted in the wider Latin America dialect file.

Copy link
Collaborator

@kylebgorman kylebgorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and I appreciate the new test, which is very needed.

Okay, so I think the right way to test this though is to run the big scrape on Spanish and incorporate changes into the PR. For that you'd install (from this PR), navigate to data/scrape and issue ./scrape --restriction spa && ./postprocess and wait (about 12-24 hours), then stage and commit the changed files. Is this feasible on your end?

@kylebgorman
Copy link
Collaborator

One open question I have is whether to include additional country selectors for Latin America (i.e. Columbia, Chile, etc) to the scrape's config, but I don't know how prevalent these are wiktionary or if they're wanted in the wider Latin America dialect file.

We don't have any discoverability for dialect strings. Ideally there'd be some way to get a list of them in descending frequency order and then you could manually cluster and write back into languages.json. Nice feature request though.

@mmcauliffe
Copy link
Contributor Author

Sure, I can run that in the next couple of days. I did do some work for getting single parse for multiple dialects working, but it's a bit hacky and I'll rerun the scape using just this current branch.

@kylebgorman
Copy link
Collaborator

Sure, I can run that in the next couple of days. I did do some work for getting single parse for multiple dialects working, but it's a bit hacky and I'll rerun the scape using just this current branch.

Excited to see this, this will be a huge improvement we've wanted forever.

Copy link
Collaborator

@kylebgorman kylebgorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, big thanks to you for this.

@kylebgorman kylebgorman merged commit 130a0d7 into CUNY-CL:master Jul 27, 2024
10 checks passed
kylebgorman added a commit to kylebgorman/wikipron that referenced this pull request Jul 27, 2024
kylebgorman added a commit that referenced this pull request Jul 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants