Hi! I am using KenLM on massive text corpora (e.g., Common Crawl, Wikipedia) to explore the properties of those datasets.
I am not trying to use KenLM to generate new text; I want to explore how often specific phrases occur and get the raw n-gram counts from the training corpus (the log probability of a sequence is fine; I don't necessarily need exact counts). As such, I want to disable smoothing so I can be sure that one phrase is more probable than another because its n-grams appear more frequently, not because smoothing redistributed probability mass to out-of-vocabulary or rare tokens.
Can I disable smoothing altogether with KenLM, or is this not the right tool for my use case? If so, how? Thanks!
KenLM can query an unsmoothed model if you can produce it as an ARPA file. `lmplz` is hard-coded to modified Kneser-Ney smoothing, though you can override the discounts. So if you can work out discounts that reduce to what you want, fine. Otherwise you'll need something else to build the ARPA file.