Add a basic spellchecking functionality to information page #4

Merged
merged 1 commit into from
Apr 26, 2022


Member

@nonprofittechy nonprofittechy commented Apr 4, 2022

Fix #1

This feature is pretty basic but it works similarly to the other functionality for now.

A major limitation is that spelling errors are displayed without any context, but the same is true of the other warnings and errors. I assume we will eventually want to display errors and warnings on a per-block basis so that the author gets more context.

We may want to consider adding our own list of terms that would otherwise be incorrectly flagged, since the pyspellchecker dataset is drawn from closed captioning and may not cover some relatively common legal language.
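One way to picture the "own list of terms" idea is a small post-filter over whatever the spellchecker flags. This is only a sketch using the standard library: `drop_known_terms` and `LEGAL_TERMS` are hypothetical names, and the sample words are made up for illustration. pyspellchecker also exposes `word_frequency.load_words(...)` for adding words to a loaded dictionary, which may turn out to be the cleaner route.

```python
def drop_known_terms(flagged, known_terms):
    """Return flagged words that are NOT in the project's own term list.

    The comparison is case-insensitive; the result is de-duplicated and sorted.
    """
    known = {term.lower() for term in known_terms}
    return sorted(word for word in set(flagged) if word.lower() not in known)


# Hypothetical legal vocabulary that a subtitle-derived dictionary might miss.
LEGAL_TERMS = {"subpoena", "affiant", "estoppel", "garnishee"}

# Pretend these came back from the spellchecker as "unknown" words.
flagged = ["affiant", "recieve", "Estoppel", "garnishee"]

# Only the genuine typo should survive the filter.
remaining = drop_known_terms(flagged, LEGAL_TERMS)
print(remaining)
```

Keeping the list separate from the dictionary means it would still apply if the underlying spellchecking library changes.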

@BryceStevenWilley
Collaborator

I tried it out, and while I agree with the feature, I'm not a big fan of pyspellchecker. I thought things were broken because it wasn't catching when I changed "will" to "willl", but turns out it just thinks that "willl" is a valid word. The data comes from OpenSubtitles, and at a quick glance contains words like "willl", "willservicethisgreatnation" (one word), "hoje" (a Portuguese word), and "hi'm". Others have run into this issue as well. While we could use our own dictionary, it might be worth it to just use something else like symspellpy instead.
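A quick way to check any candidate dictionary for this kind of noise is to look up a handful of known-bad strings in its word list. The sketch below assumes the dictionary is available as a plain word-to-count mapping; `find_suspect_entries` is a hypothetical helper and the counts are invented for the example.

```python
def find_suspect_entries(word_freq, known_bad):
    """Return the known-bad strings that the dictionary treats as valid words."""
    return sorted(word for word in known_bad if word.lower() in word_freq)


# Invented sample of a word -> count mapping; not real dictionary data.
sample_dictionary = {"will": 1000, "willl": 3, "hoje": 12, "court": 500}

# Strings that should never be accepted as correctly spelled English.
known_bad = ["willl", "hi'm", "hoje", "recieve"]

suspects = find_suspect_entries(sample_dictionary, known_bad)
print(suspects)
```

A non-empty result suggests the source corpus is too noisy to trust without pruning.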

@nonprofittechy
Member Author

> I tried it out, and while I agree with the feature, I'm not a big fan of pyspellchecker. I thought things were broken because it wasn't catching when I changed "will" to "willl", but turns out it just thinks that "willl" is a valid word. The data comes from OpenSubtitles, and at a quick glance contains words like "willl", "willservicethisgreatnation" (one word), "hoje" (a Portuguese word), and "hi'm". Others have run into this issue as well. While we could use our own dictionary, it might be worth it to just use something else like symspellpy instead.

I hadn't run into that library in my initial search, but it looks fine. Happy to switch to a different dictionary.

@BryceStevenWilley
Collaborator

Did a little work on this. While symspellpy is really powerful, it too has weird quirks (it doesn't seem to like capitalization very much, and the options to preserve it don't seem to work at all). The API is different, but idk if I'd call it better or not. Given all of this, the best solution is probably to make our own dictionary from a much stricter source corpus. The author of pyspellchecker describes how to do that in a GitHub discussion.

Since the API itself isn't going to be changing, just the dictionary, this is probably fine to merge now. We can add the better dictionary later, so I'm going to approve.
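For the "make our own dictionary" follow-up, a rough sketch of the idea is to count words in a trusted corpus and drop anything that appears too rarely to be reliable. Everything here is an assumption for illustration: `build_word_frequency`, the `min_count` threshold, and the tiny corpus are hypothetical. My understanding is that pyspellchecker can load a JSON word-to-count mapping as a custom dictionary, but that should be verified against its documentation before relying on it.

```python
import json
import re
from collections import Counter

# Lowercase words, optionally with one internal apostrophe (e.g. "tenant's").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def build_word_frequency(texts, min_count=2):
    """Count words across the corpus and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return {word: count for word, count in counts.items() if count >= min_count}


# Tiny made-up corpus; a real one would be vetted legal and plain-language text.
corpus = [
    "The tenant will file the answer. The court will review the answer.",
    "The tenant willl appear.",
]

word_freq = build_word_frequency(corpus, min_count=2)

# Serialize as JSON so the mapping can be saved and loaded later.
dictionary_json = json.dumps(word_freq, sort_keys=True)
print(dictionary_json)
```

The frequency threshold is what keeps one-off typos such as "willl" out of the dictionary; the right cutoff depends on the size and quality of the corpus.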

@nonprofittechy nonprofittechy merged commit da9a1eb into main Apr 26, 2022
@nonprofittechy nonprofittechy deleted the spellchecking branch April 26, 2022 11:19
Successfully merging this pull request may close these issues.

Identify spelling errors