token.idx doesn't always match original text index #886

gartentrio · 2017-03-13T10:06:16Z

When a text contains newline characters, token.idx in a spaCy doc sometimes does not give the token's correct character offset in the original text.
Here's an example:

import spacy

nlp = spacy.load('de')
input_text = "Datum:2014-06-02\nDokument:76467"
doc = nlp(input_text)
for token in doc:
    print("%s (text len: %d, text_with_ws len: %d)" % (repr(token.text), len(token.text), len(token.text_with_ws)))
    # token.idx should point at the first character of the token in input_text
    assert(input_text[token.idx] == token.text[0]), ('expected: ', input_text[token.idx], 'actual: ', token.text[0])

'Datum:2014-06-02' (text len: 16, text_with_ws len: 17)
'\n' (text len: 1, text_with_ws len: 1)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-115-083d5fc58a69> in <module>()
      4 for token in doc:
      5     print("%s (text len: %d, text_with_ws len: %d)" % (repr(token.text), len(token.text), len(token.text_with_ws)))
----> 6     assert(input_text[token.idx] == token.text[0]), ('expected: ', input_text[token.idx], 'actual: ', token.text[0])

AssertionError: ('expected: ', 'D', 'actual: ', '\n')

It seems the newline character is counted twice: once as the last character of the first token, and again as a token of its own.
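
The check can also be written without spaCy, which makes the off-by-one easier to see. This is only a minimal sketch: `find_offset_mismatches` is a hypothetical helper, not part of spaCy, and the offsets in the two lists are illustrative values chosen to be consistent with the output above (the '\n' token pointing at 'D'), not values read from a real doc. With spaCy the pairs could be built as `[(t.text, t.idx) for t in doc]`.

```python
def find_offset_mismatches(text, tokens):
    """Return (token_text, idx, found) for each token whose idx does not point at it.

    `tokens` is an iterable of (token_text, idx) pairs. Hypothetical helper,
    not a spaCy API.
    """
    mismatches = []
    for token_text, idx in tokens:
        # Slice the original text at the claimed offset and compare.
        found = text[idx:idx + len(token_text)]
        if found != token_text:
            mismatches.append((token_text, idx, found))
    return mismatches


input_text = "Datum:2014-06-02\nDokument:76467"

# Assumed offsets matching the reported symptom: the newline token is
# shifted by one because the first token's text_with_ws already covers it.
shifted = [("Datum:2014-06-02", 0), ("\n", 17)]
print(find_offset_mismatches(input_text, shifted))

# Assumed offsets for a correctly aligned tokenization of the same text.
aligned = [("Datum:2014-06-02", 0), ("\n", 16), ("Dokument:76467", 17)]
print(find_offset_mismatches(input_text, aligned))
```

Comparing the whole slice rather than only the first character also catches tokens that start on the right character but drift later.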

Your Environment

Operating System: Linux
Python Version Used: 3.5
spaCy Version Used: 1.6.0
Environment Information:


ines · 2017-03-13T10:49:09Z

Thanks for the report! It looks like this is related to #859. Just added a regression test and it works for me using the version on master (which already includes the fix for that issue).

We're just finishing off the last fixes for v1.7 and training the models – the new update will be available very soon.

lock · 2018-05-09T02:38:29Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added a commit that referenced this issue Mar 13, 2017

Add regression test for #886

51ba3ef

ines added the bug Bugs and behaviour differing from documentation label Mar 13, 2017

ines closed this as completed Mar 13, 2017

ines added a commit that referenced this issue Mar 13, 2017

Update docstring in #886 regression test

d70386e

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
