token.idx doesn't always match original text index #886

Closed
gartentrio opened this issue Mar 13, 2017 · 2 comments
Labels
bug Bugs and behaviour differing from documentation


When there are newline characters in a text, the idx of a token in a spacy doc sometimes doesn't return the correct index into the original text.
Here's an example:

import spacy
nlp = spacy.load('de')
input_text = "Datum:2014-06-02\nDokument:76467"
doc = nlp(input_text)
for token in doc:
    print("%s (text len: %d, text_with_ws len: %d)" % (repr(token.text), len(token.text), len(token.text_with_ws)))
    assert(input_text[token.idx] == token.text[0]), ('expected: ', input_text[token.idx], 'actual: ', token.text[0])
'Datum:2014-06-02' (text len: 16, text_with_ws len: 17)
'\n' (text len: 1, text_with_ws len: 1)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-115-083d5fc58a69> in <module>()
      4 for token in doc:
      5     print("%s (text len: %d, text_with_ws len: %d)" % (repr(token.text), len(token.text), len(token.text_with_ws)))
----> 6     assert(input_text[token.idx] == token.text[0]), ('expected: ', input_text[token.idx], 'actual: ', token.text[0])

AssertionError: ('expected: ', 'D', 'actual: ', '\n')

It seems the newline character is counted twice: once as the last character of the first token, and then again as a token of its own.
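The invariant being violated is that `token.idx` should point at the start of `token.text` in the original string. A minimal sketch of a stricter check follows; it compares the whole token text rather than only the first character. `FakeToken` and `find_misaligned` are hypothetical names, not part of spaCy, and the offsets in `reported` are an assumption reconstructed from the output above (the newline token pointing at index 17, where the text has 'D').

```python
from collections import namedtuple

# Hypothetical stand-in for a token: only the two attributes the check needs.
FakeToken = namedtuple("FakeToken", ["text", "idx"])


def find_misaligned(text, tokens):
    """Return (position, token_text, idx) for each token whose idx does not
    point at the start of its own text in the original string."""
    bad = []
    for position, tok in enumerate(tokens):
        if text[tok.idx:tok.idx + len(tok.text)] != tok.text:
            bad.append((position, tok.text, tok.idx))
    return bad


input_text = "Datum:2014-06-02\nDokument:76467"

# Assumed offsets mimicking the report: the newline token is one position late.
reported = [FakeToken("Datum:2014-06-02", 0), FakeToken("\n", 17)]
# The same two tokens with the offset the original text implies.
corrected = [FakeToken("Datum:2014-06-02", 0), FakeToken("\n", 16)]

print(find_misaligned(input_text, reported))
print(find_misaligned(input_text, corrected))
```

Since the helper only reads `.text` and `.idx`, it should also accept a real spaCy `doc`, e.g. `find_misaligned(input_text, doc)`.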

Your Environment

  • Operating System: Linux
  • Python Version Used: 3.5
  • spaCy Version Used: 1.6.0
  • Environment Information:
ines added a commit that referenced this issue Mar 13, 2017
@ines ines added the bug Bugs and behaviour differing from documentation label Mar 13, 2017
ines (Member) commented Mar 13, 2017

Thanks for the report! It looks like this is related to #859. Just added a regression test and it works for me using the version on master (which already includes the fix for that issue).

We're just finishing off the last fixes for v1.7 and training the models – the new update will be available very soon.

@ines ines closed this as completed Mar 13, 2017
ines added a commit that referenced this issue Mar 13, 2017

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018