Undoing casefolding? #469

Closed
jhdeov opened this issue Nov 3, 2022 · 6 comments

Comments

@jhdeov
Contributor

jhdeov commented Nov 3, 2022

The command line lets the user choose whether to apply casefolding, so an entry like English can either be kept as English or changed to english. But for the scraped data in the repo, it seems you apply casefolding by default. Would it be more useful if the online data weren't casefolded? That way:

  • If the user wants the original data (with the original case), they can just use the scraped data online instead of running WikiPron in the terminal.
  • If the user wants casefolded data, they can take the un-casefolded data from the repo and apply casefolding on their own machine (a simple, fast Excel function).

Right now, if the user wants the original case, they have to run the scrape themselves from the terminal (which takes a while).
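
For readers who would rather script the local casefolding step than use a spreadsheet, here is a minimal sketch in Python. It assumes the scraped data is a tab-separated file with the word in the first column and the pronunciation after it; the function name `casefold_entries` and the sample rows are made up for illustration and are not part of WikiPron.

```python
def casefold_entries(lines):
    """Casefold the first tab-separated column of each line.

    Everything after the first tab (assumed to be the pronunciation)
    is passed through unchanged. Blank lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, sep, rest = line.partition("\t")
        yield word.casefold() + sep + rest


if __name__ == "__main__":
    # Illustrative rows only, not real scraped data.
    sample = ["English\tI N g l I S\n", "Paris\tp a r i\n"]
    for row in casefold_entries(sample):
        print(row)
```

Note that `str.casefold()` is more aggressive than `str.lower()` (for example, it maps "ß" to "ss"), so which one matches the repo's existing behavior would need to be checked.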

@kylebgorman
Collaborator

I'm not opposed. Would you send a PR? You'll just remove casefold: true from languages.json and run the scrape.
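
As a rough illustration of the edit described here, the sketch below removes every `casefold` key from a parsed JSON structure. The layout of the sample is an assumption made for the example (the real languages.json may be organized differently), and `drop_casefold` is a hypothetical helper, not something shipped with WikiPron.

```python
import json


def drop_casefold(node):
    """Return a copy of a parsed JSON value with every "casefold" key removed."""
    if isinstance(node, dict):
        return {k: drop_casefold(v) for k, v in node.items() if k != "casefold"}
    if isinstance(node, list):
        return [drop_casefold(v) for v in node]
    return node


if __name__ == "__main__":
    # Hypothetical excerpt; field names other than "casefold" are invented.
    sample = json.loads('{"eng": {"name": "English", "casefold": true}}')
    print(json.dumps(drop_casefold(sample), ensure_ascii=False, indent=4))
```

Working recursively means the helper does not depend on where in the file the key appears; the scrape itself would still need to be rerun afterwards.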

@kylebgorman kylebgorman added enhancement New feature or request good first issue Good for newcomers labels Nov 4, 2022
@jhdeov
Contributor Author

jhdeov commented Nov 4, 2022

Just to confirm, you mean delete casefold: true and not simply change it to casefold: false?
Sadly, I don't think I have a good enough computer/internet to rescrape everything :(

@kylebgorman
Collaborator

kylebgorman commented Nov 4, 2022 via email

@jhdeov
Contributor Author

jhdeov commented Nov 7, 2022

Did a PR.
I wonder if the various cleanup processes (casefolding, syllable removal, stress removal, etc.) could be turned into a single script, so that the WikiPron scrape keeps the raw form of everything; then, if the user is interested, they could run a cleanup script to apply all the default casefolding, etc.?

@kylebgorman
Collaborator

> Did a PR. I wonder if the various cleanup processes (casefolding, syllable removal, stress removal, etc.) could be turned into a single script, so that the WikiPron scrape keeps the raw form of everything; then, if the user is interested, they could run a cleanup script to apply all the default casefolding, etc.?

We have a hint of this in our notion of "filtered" vs. "unfiltered"; this could just be an additional layer.
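
One way to picture the "additional layer" idea is as a list of small cleanup steps applied in order to raw (word, pronunciation) pairs. The following is only a sketch under stated assumptions: it is not how WikiPron implements its cleanup, all function names are hypothetical, stress is assumed to be marked with the IPA symbols U+02C8 and U+02CC, and syllable boundaries are assumed to be marked with ".".

```python
STRESS_MARKS = "\u02c8\u02cc"  # IPA primary and secondary stress marks


def casefold_word(entry):
    """Casefold the word; leave the pronunciation alone."""
    word, pron = entry
    return word.casefold(), pron


def remove_stress(entry):
    """Delete stress marks, then normalize whitespace."""
    word, pron = entry
    stripped = pron.translate(str.maketrans("", "", STRESS_MARKS))
    return word, " ".join(stripped.split())


def remove_syllable_boundaries(entry):
    """Delete "." boundary markers, then normalize whitespace."""
    word, pron = entry
    return word, " ".join(pron.replace(".", "").split())


def apply_steps(entries, steps):
    """Run every step, in order, over every (word, pronunciation) pair."""
    for entry in entries:
        for step in steps:
            entry = step(entry)
        yield entry


if __name__ == "__main__":
    # Illustrative input only, not real scraped data.
    raw = [("English", "\u02c8a . b")]
    steps = [casefold_word, remove_stress, remove_syllable_boundaries]
    print(list(apply_steps(raw, steps)))
```

Because each step is an independent function, a user could pick any subset (for example, casefolding only) without the stored data having to commit to one choice.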

@sonofthomp
Contributor

I was working on this and trying to run step 1 of "the big scrape", but I ran into a weird error with some languages not being recognized; details here.
