Tokenisation output and tokenizer.explain is inconsistent #9136
Comments
Confirmed the output is the same in 3.1.2. Thanks for the report! See also #7694.
Yes, thanks for the report! Here I think that the So far we'd only had the case where Since this will potentially change the tokenizer output for the same stored settings, we'll aim to fix it in v3.2.0.
I may even be getting myself mixed up here about the regex behavior, but it looks like the problem is what happens right after a prefix is recognized and that's where the main algorithm and What happens currently in the main tokenizer:
What happens in
I think that the
Definitely agree that the My work is very sensitive to tokenisation too, so definitely hope that this is nothing major.
@adrianeboyd I have updated the test to make sure that I communicated clearly.
What you said about the overlap between The use of both
You can't see the difference in the resulting tokens, but the internal difference in the final case is that the tokenizer thinks that
Got it, thanks for attending to this and explaining it to me! :)
Not sure why, but GitHub is sometimes a bit flaky about closing linked issues from pull requests. This has been fixed by #9155 and will be in v3.2.0.
How to reproduce the behaviour
Output from print
Output from explanation is different from actual tokenised result.
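The original reproduction snippet did not survive in this copy of the thread, so here is a minimal sketch of how one might check for this kind of mismatch. It is not the reporter's code. The helper names `first_mismatch` and `check_with_spacy` are hypothetical; the only spaCy calls assumed are `spacy.blank`, calling `nlp.tokenizer(text)`, and `nlp.tokenizer.explain(text)`, which returns a list of `(rule, token_text)` pairs.

```python
from typing import List, Optional, Tuple


def first_mismatch(
    tokens: List[str],
    explained: List[Tuple[str, str]],
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, token_text, explained_text) for the first position
    where the two tokenizations differ, or None if they agree.

    `explained` has the shape returned by `Tokenizer.explain`:
    a list of (rule_name, token_text) pairs.
    """
    explained_texts = [text for _rule, text in explained]
    for i in range(max(len(tokens), len(explained_texts))):
        # A missing entry on either side is reported as None.
        actual = tokens[i] if i < len(tokens) else None
        debug = explained_texts[i] if i < len(explained_texts) else None
        if actual != debug:
            return i, actual, debug
    return None


def check_with_spacy(text: str, lang: str = "en"):
    """Hypothetical helper: compare the real tokenizer output with
    `tokenizer.explain` for a blank pipeline. Requires spaCy."""
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.blank(lang)
    tokens = [token.text for token in nlp.tokenizer(text)]
    explained = nlp.tokenizer.explain(text)
    return first_mismatch(tokens, explained)
```

Calling `check_with_spacy(text)` with the string from the report should return `None` once the two code paths agree, and otherwise the index and the two differing token texts. If the tokenizer has been customised, the same comparison can be run against that `nlp.tokenizer` instead of a blank pipeline.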
Your Environment