token.idx doesn't always match original text index #886
Comments
Thanks for the report! It looks like this is related to #859. Just added a regression test and it works for me using the version on master (which already includes the fix for that issue). We're just finishing off the last fixes for v1.7 and training the models – the new update will be available very soon.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
When a text contains newline characters, the `idx` of a token in a spaCy doc sometimes isn't the correct index into the original text.
Here's an example:
It seems the newline character is counted twice: once as the last char of the first token, and then again as a token of its own.
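For anyone wanting to check this on their own text, here is a minimal sketch of the kind of check involved. It is not the original example from this report. The helper name `find_offset_mismatches` and the sample offsets are made up for illustration; the "shifted" offsets only mimic the double-counted newline described above and are not actual spaCy output.

```python
def find_offset_mismatches(text, tokens):
    """Return (token_text, idx, found) for every token whose idx does not point at it.

    `tokens` is a list of (token_text, idx) pairs. With spaCy these could be
    built as [(t.text, t.idx) for t in doc].
    """
    mismatches = []
    for token_text, idx in tokens:
        # Slice the original text at the reported offset and compare.
        found = text[idx:idx + len(token_text)]
        if found != token_text:
            mismatches.append((token_text, idx, found))
    return mismatches


text = "Hello\nworld"

# Offsets that point at the right place in the original text.
correct = [("Hello", 0), ("\n", 5), ("world", 6)]

# Hypothetical offsets where the newline has been counted twice,
# so everything from the newline token onwards is shifted by one.
shifted = [("Hello", 0), ("\n", 6), ("world", 7)]

print(find_offset_mismatches(text, correct))
print(find_offset_mismatches(text, shifted))
```

An empty list means every `idx` points at its token in the original text; any entries show the token, its reported offset, and what the text actually contains at that offset.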
Your Environment