Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix English word list retrieval in qmk generate-autocorrect-data #20915

Merged
merged 2 commits into from
May 20, 2023

Conversation

fretep
Copy link
Contributor

@fretep fretep commented May 13, 2023

Description

The english_words package has breaking changes in v2.0 which results in the failure to load the English dictionary and an error indicating the english_words package should be installed for the functionality to work. This PR implements support for either v1 or v2 of the english_words package.

Types of Changes

  • Core
  • Bugfix
  • New feature
  • Enhancement/optimization
  • Keyboard (addition or update)
  • Keymap/layout/userspace (addition or update)
  • Documentation

Issues Fixed or Closed by This PR

Checklist

  • My code follows the code style of this project: C, Python
  • I have read the PR Checklist document and have made the appropriate changes.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • I have tested the changes and verified that they work and don't break anything (as well as I can manage).

- Update to address breaking changes in the english_words package
- english_words_lower_alpha_set now replaced by get_english_words_set
@github-actions github-actions bot added cli qmk cli command python labels May 13, 2023
Copy link
Member

@drashna drashna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can confirm that this works.

Also, looks like the new english_words package causes it to catch a LOT more potential issues.

@drashna drashna requested a review from a team May 13, 2023 03:42
@drashna drashna added the bug label May 13, 2023
@fretep
Copy link
Contributor Author

fretep commented May 13, 2023

Yes, the word lists seem quite a bit bigger (almost 10x). There are actually two lists, and I selected web2 which is the larger list. There is also gcide, or both could be used.

Version 1 list:

> len(english_words.english_words_lower_alpha_set)
25463

Version 2 lists:

> len(english_words.get_english_words_set(['web2'], lower=True, alpha=True))
235970
> len(english_words.get_english_words_set(['gcide'], lower=True, alpha=True))
114753
> len(english_words.get_english_words_set(['web2','gcide'], lower=True, alpha=True))
340264

@fretep
Copy link
Contributor Author

fretep commented May 13, 2023

Package documentation including the source for each of the lists - https://pypi.org/project/english-words/

@tzarc
Copy link
Member

tzarc commented May 13, 2023

Perhaps add a cli argument to select the different word lists?

@fretep
Copy link
Contributor Author

fretep commented May 13, 2023

I feel an argument might be over-engineering.

A bigger word list will show more potential issues, but it doesn't block the generation of the file, or return an exit code so should not be a breaking change, unless someone is parsing the output. If someone is parsing the output for an automated process, then I feel like they care enough to at least look at the clashes and probably should have a means to acknowledge/ignore the clash.

Both word lists are quite a bit bigger than the version 1 list. If the old behavior (smaller list) is desired the user can pin english_words to verson 1.1.0, I believe installation of the package is a manual process the user needs to do if they want the check to occur, qmk_install doesn't install the package.

python3 -m pip install --upgrade --user english-words==1.1.0

If going down the argument path, then should it allow custom lists? I feel like it probably should if an argument is being added, as there isn't a significant difference between the two lists and if someone is unhappy with the list they are more likely to want their own list than either of the ones provided by the package.

I chose the web2 list over gcide as gcide is based on a 1918 dictionary and a few of the word clashes it threw up felt a bit dated to me, but I didn't research how much effort the project has put into modernizing the list.

I feel like this feature is intended as more of a heads up that you might have some conflicts with correctly spelt words. A bigger list will generate more matches, but then that is probably desirable and it's not hard to scan the list for words you actually use.

Thoughts?

@tzarc
Copy link
Member

tzarc commented May 13, 2023

I personally don't use autocorrect; was more that you had a few options for selecting a wordlist -- figured it could be parameterised.

Will defer to @drashna here -- he's an avid user.

@filterpaper
Copy link
Contributor

Would it make more sense to just implement support for v2?

@drashna
Copy link
Member

drashna commented May 13, 2023

Would it make more sense to just implement support for v2?

Since the new version doesn't get automatically updated... worst case, we'd want to check for the old version and issue a notice.

That said, given that the new version seems to be a lot more robust, I'd rather we push for that version.

I chose the web2 list over gcide as gcide is based on a 1918 dictionary and a few of the word clashes it threw up felt a bit dated to me, but I didn't research how much effort the project has put into modernizing the list.

Looks like it's from a 1913 Websters dictionary, specifically. So yeah, the web2 library would likely be the best option, as while outdated still, it's from 2017, a full century more recently.

And as for alpha/lower setting, that does look to be the same as from the 1.1 version.

@filterpaper
Copy link
Contributor

Since the new version doesn't get automatically updated... worst case, we'd want to check for the old version and issue a notice.

It can probably be checked with english_words.__version__ to suggest v2. Would there be less maintenance if only v2 was supported?

@fretep
Copy link
Contributor Author

fretep commented May 13, 2023

Would there be less maintenance if only v2 was supported?

It's only a couple of lines to handle the old version. I don't think it's a significant maintenance issue, and could be removed when people have had time to update if it does become a problem. I'll make it output a message to suggest updating. Thanks for the __version__ tip, it's better than what I'm currently doing, I'll use that.

@fretep
Copy link
Contributor Author

fretep commented May 14, 2023

The english_words module doesn't expose __version__. I've simplified the logic, fixed the CI failure and output a message if the english_words module is outdated.

Copy link
Contributor

@filterpaper filterpaper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me and tested on my end.

@drashna
Copy link
Member

drashna commented May 14, 2023

Sorry, first time contributing here. I didn't run the tests locally, I expected the tests to run on the PR creation. I'll get that sorted. I was trying to provide a message there rather than bomb the process, but really, it's such a low probability code path it can probably be removed.

No worries!

And qmk pytest will test any changes to the python library, and can be ran locally.

It can probably be checked with english_words.__version__ to suggest v2. Would there be less maintenance if only v2 was supported?

Ideally, should support both for now, push people to the v2 module, and then later (next cycle or so), remove support for the 1.x version.

@drashna drashna requested a review from a team May 14, 2023 04:40
@tzarc tzarc merged commit 7b31c18 into qmk:develop May 20, 2023
coquizen pushed a commit to coquizen/qmk_firmware that referenced this pull request Jun 22, 2023
autoferrit pushed a commit to SpaceRockMedia/bastardkb-qmk that referenced this pull request Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug cli qmk cli command python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants