fix(dict): Remove only corrections if a space could be inserted as well

The typo dictionary words.csv previously contained
a number of problematic entries such as:

    abouta,about
    algorithmi,algorithm
    attachen,attach
    shouldbe,should
    anumber,number

These entries resulted in wrong automatic corrections when the following
spaces (indicated by ␣) were accidentally omitted:

    about␣a
    algorithm␣i developed
    attach␣en masse
    should␣be
    a␣number

Many of these entries were introduced by taking entries from the
codespell dictionary and removing corrections containing spaces (since
typos currently doesn't support them), e.g. the codespell dictionary contains:

    abouta->about a, about,
    shouldbe->should, should be,

This commit updates `tests/verify.rs` to automatically remove
corrections in the form of `{correction}{common_word},{correction}`
or `{common_word}{correction},{correction}`, where `{common_word}` is
one of the 1000 most frequent English words (except if `{correction}`
also ends/starts with `{common_word}`, since we still want to correct e.g.
"extrememe" to "extreme").
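
The rule above can be sketched roughly as follows. This is a minimal
illustration, not the actual code in `tests/verify.rs`: the function names
`is_missed_space` and `retain_real_typos` are hypothetical, and the common
words are assumed to be already loaded into a `HashSet`.

```rust
use std::collections::HashSet;

/// Returns true if the entry `typo,correction` looks like a missed space
/// rather than a real typo (hypothetical helper, not from the repository).
fn is_missed_space(typo: &str, correction: &str, common_words: &HashSet<&str>) -> bool {
    // Case `{correction}{common_word}`, e.g. "abouta" = "about" + "a".
    if let Some(suffix) = typo.strip_prefix(correction) {
        // Exception: keep the entry if the correction itself ends in the
        // common word, e.g. "extrememe" -> "extreme" (ends in "me").
        if common_words.contains(suffix) && !correction.ends_with(suffix) {
            return true;
        }
    }
    // Case `{common_word}{correction}`, e.g. "anumber" = "a" + "number".
    if let Some(prefix) = typo.strip_suffix(correction) {
        // Same exception, mirrored for the start of the correction.
        if common_words.contains(prefix) && !correction.starts_with(prefix) {
            return true;
        }
    }
    false
}

/// Drops every `(typo, correction)` pair that matches the rule above.
fn retain_real_typos<'a>(
    entries: Vec<(&'a str, &'a str)>,
    common_words: &HashSet<&str>,
) -> Vec<(&'a str, &'a str)> {
    entries
        .into_iter()
        .filter(|&(typo, correction)| !is_missed_space(typo, correction, common_words))
        .collect()
}
```

With a common-word set containing "a" and "me", this sketch would drop
`abouta,about` but keep `extrememe,extreme`.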

The top-1000-most-frequent-words.csv file was generated by running:

    curl https://norvig.com/ngrams/count_1w.txt \
      | head -n1024 \
      | awk '{print $1;}' \
      | grep -vE '^([^ia]|al|re)$' \
      > top-1000-most-frequent-words.csv
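
For readers less familiar with the shell pipeline, here is a rough Rust
equivalent of the `head`, `awk` and `grep` steps. It is only a sketch: the
function names `keep_word` and `top_words` are hypothetical, and the input is
assumed to be the already downloaded text with one whitespace-separated
`word count` pair per line.

```rust
/// Mirrors `grep -vE '^([^ia]|al|re)$'`: drops every one-character word
/// except "i" and "a", and also drops "al" and "re".
fn keep_word(word: &str) -> bool {
    let mut chars = word.chars();
    // True if the word is exactly one character that is neither 'i' nor 'a'.
    let rejected_single_char = matches!(
        (chars.next(), chars.next()),
        (Some(c), None) if c != 'i' && c != 'a'
    );
    !(rejected_single_char || word == "al" || word == "re")
}

/// Takes the first 1024 lines, keeps the first column, and applies the filter.
fn top_words(count_1w: &str) -> Vec<&str> {
    count_1w
        .lines()
        .take(1024) // head -n1024
        .filter_map(|line| line.split_whitespace().next()) // awk '{print $1;}'
        .filter(|word| keep_word(word)) // grep -vE ...
        .collect()
}
```

The filter removes stray single letters and fragments that are not useful as
standalone words, while keeping the genuine one-letter words "i" and "a".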
not-my-profile committed Aug 7, 2023
1 parent d4258b1 commit 68cce1a
Showing 4 changed files with 1,229 additions and 162 deletions.