I think there is a bug in the calculation of max_score in unigram_model.cc:

sentencepiece/src/unigram_model.cc, lines 657 to 664 in 6225e08:

  if (sp.type() == ModelProto::SentencePiece::NORMAL) {
    min_score_ = std::min(min_score_, sp.score());
    max_score_ = std::max(max_score_, sp.score());
  }
}
As FLT_MIN is a very small positive number (on my system it is 1.17549435e-38) and token scores are negative, a max_score initialized with FLT_MIN is always greater than any token score and is never updated. If the intent is to compute the maximum over all token scores, max_score should be initialized with -FLT_MAX instead.
However, I think that if you fix this, then the following will break:
sentencepiece/src/unigram_model.cc, lines 978 to 989 in 6225e08:
A correctly calculated negative max_score multiplied by length can be even more negative, and subtracting 0.1 makes it more negative still (while the idea is that the larger score wins). So setting the user-defined token score to

length * max_score_ - 0.1

will make it less likely to be chosen, not more likely. Why not simply set the score to 0 for user-defined tokens?