Disable smoothing #432

Open
XenonMolecule opened this issue Jun 9, 2023 · 1 comment

Comments

@XenonMolecule

Hi! I am using KenLM on massive text corpora (such as Common Crawl and Wikipedia) to explore the properties of those datasets.

I am not trying to use KenLM to generate new text; I want to explore how often specific phrases and n-grams occur in the training corpus. The log probability of a sequence is fine; I don't necessarily need exact counts. As such, I want to disable smoothing so I can be sure that one phrase is more probable than another because its n-grams appear more frequently, not because of smoothing applied to out-of-vocabulary or rare tokens.

Can I disable smoothing altogether with KenLM, and if so, how? Or is this not the right tool for my use case? Thanks!

@kpu
Owner

kpu commented Jun 10, 2023

KenLM can query an unsmoothed model if you can provide it as an ARPA file. lmplz is hard-coded to modified Kneser-Ney smoothing, though you can override the discounts, so if you can work out discounts that reduce to what you want, that will work. Otherwise you'll need some other tool to build the ARPA file.
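
For the "something else" route, here is a minimal sketch of building an unsmoothed ARPA file directly from raw counts. It is not part of KenLM: the helper names (`count_ngrams`, `mle_log10_probs`, `to_arpa`) are hypothetical, and several choices are assumptions made for illustration, namely whitespace tokenization, `<s>`/`</s>` padding, plain maximum-likelihood estimates of count(history + word) / count(history), and a backoff weight of 0.0 (log10 of 1) on every lower-order entry. With that backoff weight, an unseen n-gram falls back to the shorter n-gram at no penalty, so the result is not a normalized distribution for unseen events.

```python
import math
from collections import Counter


def count_ngrams(sentences, order):
    """Count every n-gram of length 1..order, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def mle_log10_probs(counts):
    """Unsmoothed estimate: log10 of count(history + word) / count(history)."""
    # Unigrams are normalized by the total token count (including <s> and </s>).
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    probs = {}
    for gram, c in counts.items():
        denom = total_unigrams if len(gram) == 1 else counts[gram[:-1]]
        probs[gram] = math.log10(c / denom)
    return probs


def to_arpa(probs, order):
    """Render the estimates as ARPA text: log10 prob, n-gram, optional backoff."""
    by_order = {n: sorted(g for g in probs if len(g) == n) for n in range(1, order + 1)}
    lines = ["\\data\\"]
    for n in range(1, order + 1):
        lines.append(f"ngram {n}={len(by_order[n])}")
    for n in range(1, order + 1):
        lines.append("")
        lines.append(f"\\{n}-grams:")
        for gram in by_order[n]:
            fields = [f"{probs[gram]:.6f}", " ".join(gram)]
            if n < order:
                # Assumption: backoff weight log10(1) = 0, i.e. no backoff penalty.
                fields.append("0.000000")
            lines.append("\t".join(fields))
    lines.append("")
    lines.append("\\end\\")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    corpus = ["a b", "a c"]  # toy corpus, one sentence per string
    probs = mle_log10_probs(count_ngrams(corpus, order=2))
    print(to_arpa(probs, order=2))
```

The resulting text can be written to a file and passed to KenLM's query tools. Whether KenLM accepts it unchanged has not been verified here; for example, the file has no `<unk>` entry, and `<s>` is given an ordinary unigram probability rather than any special treatment.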
