Disable smoothing #432

XenonMolecule · 2023-06-09T17:47:20Z

Hi! I am using KenLM on massive corpora of text to explore the properties of those datasets (e.g., Common Crawl, Wikipedia, etc.).

I am not trying to use KenLM to generate new text; I want to explore the occurrences of specific phrases and the raw counts of n-grams in the training corpus (the log probability of a sequence is fine; I don't necessarily need exact counts). As such, I want to disable smoothing so I can be sure that one phrase is more probable than another because its n-grams appear more frequently, not because of smoothing applied to out-of-vocabulary or rare tokens.

Can I disable smoothing altogether with KenLM, or is this not the right tool for my use case? If so, how? Thanks!

kpu · 2023-06-10T14:44:14Z

You can query an unsmoothed model if you can make an ARPA file for it. lmplz is hard-coded to modified Kneser-Ney smoothing, though you can override the discounts. So if you can work out discounts that reduce to what you want, fine. Otherwise, you'll need something else to build the ARPA file.
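
For the "something else" route, one option is to compute unsmoothed maximum-likelihood estimates straight from n-gram counts and write them out in ARPA layout. The following is only a minimal sketch under several assumptions: whitespace tokenization, a corpus small enough to count in memory, and hypothetical helper names (`count_ngrams`, `mle_log10`, `to_arpa`) that are not part of KenLM. The `-99` value for `<s>` and the `0.0` backoff weights are placeholder conventions chosen for illustration. With a backoff weight of 0.0 (log10 of 1), an unseen n-gram falls back to the lower-order estimate instead of getting zero probability, so the result is not a normalized model. Whether KenLM loads the file without complaint (for example, about a missing `<unk>` entry) has not been checked here.

```python
import math
from collections import Counter


def count_ngrams(sentences, order):
    """Return one Counter per n-gram length (1..order) over padded sentences."""
    counts = [Counter() for _ in range(order)]
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n - 1][tuple(tokens[i:i + n])] += 1
    return counts


def mle_log10(counts):
    """Unsmoothed log10 P(w | context) = log10(count(context + w) / count(context))."""
    # <s> is never predicted, so it is left out of the unigram denominator.
    total = sum(c for (w,), c in counts[0].items() if w != "<s>")
    logprobs = []
    for n, table in enumerate(counts, start=1):
        level = {}
        for gram, c in table.items():
            if gram == ("<s>",):
                level[gram] = -99.0  # placeholder convention (assumption)
                continue
            denom = total if n == 1 else counts[n - 2][gram[:-1]]
            level[gram] = math.log10(c / denom)
        logprobs.append(level)
    return logprobs


def to_arpa(logprobs):
    """Render the estimates as ARPA text; backoff weights are a 0.0 placeholder."""
    top = len(logprobs)
    lines = ["\\data\\"]
    for n, level in enumerate(logprobs, start=1):
        lines.append(f"ngram {n}={len(level)}")
    for n, level in enumerate(logprobs, start=1):
        lines.append("")
        lines.append(f"\\{n}-grams:")
        for gram, lp in sorted(level.items()):
            entry = f"{lp:.6f}\t{' '.join(gram)}"
            if n < top:
                entry += "\t0.000000"  # placeholder backoff (assumption)
            lines.append(entry)
    lines += ["", "\\end\\", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    corpus = ["a b", "a c"]  # toy corpus for illustration only
    print(to_arpa(mle_log10(count_ngrams(corpus, order=2))))
```

If the resulting file loads, it could presumably be queried like any other ARPA model. Note that the raw counts themselves are already available from `count_ngrams`, which may be enough if only frequencies are needed.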
