-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug in BPE algorithm #318
Comments
Thank you for the report. Do you think we can fix this issue just by removing the condition in the if statement? |
@xbelonogov , I agree that after the first merge, the count for "aa" should be 1. Indeed, if we note "X=aa", the corpus becomes "bcXXaa". Then both "bc" and "aa" have a count of 1, in fact, all the bigrams occur only once. So why would you prefer "aa" to be merged instead of "bc" as they should have the same counts? Thanks. |
I see that the v0.1.96 output is still "bc", not "aa". ▁a put into the final_pieces_, and the freq of "aa" reset to 0. the second iteration: sentencepiece put "bc" into the final_pieces. Is there a problem with this process? |
Sorry for the late response. This bug is going to be fixed in the next release. |
I think the BPE algorithm is not working properly. This code snippet reproduces the bug.
The input for BPE is
"bc a aaa"
I got the following output.
The first merged pair should be
_ + a = _a
and it's correct. (i: 3 piece: _a
)After that the second pair should be
a + a = aa
but algorithm produced pairbc
. (i: 4 piece: bc
)I used the debug output and found out that at the second iteration here
symbol->freq
foraa
is 0 andsymbol->freq
forbc
is 1.It happened because logic in this
if
statement is incorrect. You can't simply remove this positions.The text was updated successfully, but these errors were encountered: