
Wrong calculation of max_score in unigram_model.cc #1024

Open
fairydreaming opened this issue Jun 12, 2024 · 0 comments
I think there is a bug in the calculation of max_score in unigram_model.cc:

```cpp
min_score_ = FLT_MAX;
max_score_ = FLT_MIN;
for (const auto &sp : model_proto_->pieces()) {
  if (sp.type() == ModelProto::SentencePiece::NORMAL) {
    min_score_ = std::min(min_score_, sp.score());
    max_score_ = std::max(max_score_, sp.score());
  }
}
```

Since FLT_MIN is a very small *positive* number (1.17549435e-38 on my system) and token scores are negative, a max_score initialized to FLT_MIN will always be greater than every token score and will never be updated. If the intent was to compute the maximum over all token scores, max_score should be initialized to -FLT_MAX (or std::numeric_limits<float>::lowest()) instead.

However, I think that if you fix this, then the following will break:

```cpp
// User defined symbol receives extra bonus to always be selected.
const auto score = IsUserDefinedInlined(ret)
                       ? (length * max_score_ - 0.1)
                       : GetScoreInlined(ret);
const auto candidate_best_path_score =
    score + best_path_score_till_here;
if (target_node.starts_at == -1 ||
    candidate_best_path_score > target_node.best_path_score) {
  target_node.best_path_score = candidate_best_path_score;
  target_node.starts_at = starts_at;
  target_node.id = ret;
}
```

With max_score_ correctly computed as a negative value, multiplying it by length makes it even more negative, and subtracting 0.1 makes it more negative still (while the larger score wins in the Viterbi search). So setting the user-defined token score to length * max_score_ - 0.1 will make it *less* likely to be chosen, not more likely. Why not simply set the score to 0 for user-defined tokens?
