Release v3.0: The "handle numbers better" release · rspeer/wordfreq

Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token. That vastly overestimated the frequency of any individual number.

Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).
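To illustrate the idea of spreading one shape's frequency across individual numbers, here is a minimal sketch. It is not wordfreq's actual implementation: the function name `estimate_number_frequency` is hypothetical, the leading digit is weighted by Benford's law, the remaining digits are assumed uniform, numbers with a leading zero are simply given zero weight, and the special handling of year-like numbers is left out.

```python
import math
import re

# Two or more ASCII digits, matching the "length 2 or more" rule above.
_MULTI_DIGIT = re.compile(r"[0-9]{2,}")


def estimate_number_frequency(number: str, shape_freq: float) -> float:
    """Hypothetical helper: estimate the share of a shape's total
    frequency (shape_freq) that belongs to one specific number."""
    if not _MULTI_DIGIT.fullmatch(number):
        raise ValueError("expected a string of two or more ASCII digits")

    first = int(number[0])
    if first == 0:
        # Simplifying assumption: ignore numbers written with a leading zero.
        return 0.0

    # Benford's law: P(leading digit = d) = log10(1 + 1/d) for d in 1..9.
    p_first = math.log10(1 + 1 / first)

    # Assumption: every remaining digit is equally likely (1/10 each).
    p_rest = 10.0 ** -(len(number) - 1)

    return shape_freq * p_first * p_rest


if __name__ == "__main__":
    shape_freq = 1e-4  # made-up frequency for the two-digit shape "00"
    for n in ("10", "50", "90"):
        print(n, estimate_number_frequency(n, shape_freq))
```

Because the Benford probabilities for the digits 1 through 9 sum to 1, the estimates for all numbers of a given shape add back up to the shape's original frequency, so nothing is lost or double-counted in this sketch.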

More changes related to digits:

- Functions such as iter_wordlist and top_n_list no longer return
  multi-digit numbers (they used to return them in their "smashed" form, such
  as "0000").
- lossy_tokenize no longer replaces digit sequences with 0s. That happens
  instead in a place that's internal to the word_frequency function, so we can
  look at the values of the digits before they're replaced.
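The ordering described above (inspect the digits first, then replace them) can be sketched as follows. This is only an illustration under stated assumptions: `smash_digits_sketch` and `inspect_then_smash` are hypothetical names, not functions from wordfreq, and the sketch assumes that "smashing" means replacing each run of two or more ASCII digits with the same number of zeros.

```python
import re
from typing import List, Tuple

# Runs of two or more ASCII digits; single digits are left alone.
_DIGIT_RUN = re.compile(r"[0-9]{2,}")


def smash_digits_sketch(token: str) -> str:
    """Hypothetical helper: replace each multi-digit run with zeros
    of the same length, e.g. "2010" becomes "0000"."""
    return _DIGIT_RUN.sub(lambda m: "0" * len(m.group()), token)


def inspect_then_smash(token: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: record the original digit runs before
    smashing, so their values remain available for later estimates."""
    digit_runs = _DIGIT_RUN.findall(token)
    shape = smash_digits_sketch(token)
    return shape, digit_runs


if __name__ == "__main__":
    print(inspect_then_smash("2010"))
    print(inspect_then_smash("24/7"))
```

Keeping the original digit runs next to the smashed shape is what makes a per-number estimate possible: the shape identifies which stored frequency to look up, and the digits decide how much of it to assign.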

Other changes:

- wordfreq is now developed using poetry as its package manager, and with
  pyproject.toml as the source of configuration instead of setup.py.
- The minimum version of Python supported is 3.7.
- Type information is exported using py.typed.
