Spanish tokenizer mutates the supplied text, but appears to keep no record of changes made #1452

aoodham · 2017-10-23T17:53:08Z

Hi,

I'm using the Spanish es_core_web_md model and am finding that it mutates the supplied text such that the word 'al' gets transformed into 'ael'. I'm relatively new to the world of spaCy, but there seems to be no record made of this transformation. This is a feature request to record the substitutions made, so that it's possible to map back from the spaCy token.idx attribute to an index in the raw text supplied to the pipeline. The cumulative effect of these extra 'e's means that tokens towards the end of large documents can have their index in spaCy shifted quite dramatically from their position in the original text.
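To illustrate the kind of mapping being asked for, here is a minimal sketch of a stopgap that works outside spaCy. It assumes you have both the original string and the mutated string, and it uses the standard library's difflib to align them. The function name `build_offset_map` and the sample sentence are hypothetical, and this is not something spaCy provides.

```python
from difflib import SequenceMatcher


def build_offset_map(original, mutated):
    """Map each character index in `mutated` to an index in `original`.

    Hypothetical helper: aligns the two strings with difflib and walks the
    resulting edit operations. Characters that exist only in `mutated`
    (such as the extra 'e' in 'ael') point at the place in `original`
    where they were inserted.
    """
    mapping = [0] * (len(mutated) + 1)
    matcher = SequenceMatcher(None, original, mutated, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for j in range(j1, j2):
            if tag == "equal":
                # Unchanged run: offsets shift by a constant amount.
                mapping[j] = i1 + (j - j1)
            else:
                # Inserted or replaced run: clamp into the original span.
                mapping[j] = min(i1 + (j - j1), max(i2 - 1, i1))
    # The end-of-text position maps to the end of the original.
    mapping[len(mutated)] = len(original)
    return mapping


original = "Voy al cine y al mercado"
mutated = "Voy ael cine y ael mercado"  # what the reported behaviour produces
offset_map = build_offset_map(original, mutated)

# Look up where 'mercado' starts in the raw text, given its mutated offset.
shifted = mutated.index("mercado")
recovered = offset_map[shifted]
print(shifted, recovered, original[recovered:recovered + len("mercado")])
```

With a real pipeline, `mutated` would be `nlp(text).text` and the lookup would be `offset_map[token.idx]`. Alignment by diffing is a heuristic, so it is worth spot-checking on documents with many repeated words.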

Currently we can bypass this issue by supplying a special_case to the tokenizer that maps 'al' to 'al'. This, however, ignores the fact that 'al' normally gets tokenized into two tokens, 'a' and 'el'.

Your Environment

Info about spaCy

spaCy version: 1.8.2
Platform: Linux-4.10.0-37-generic-x86_64-with-debian-stretch-sid
Python version: 3.6.1
Installed models:

honnibal · 2017-10-23T19:19:47Z

If I'm understanding correctly, that's definitely a bug. The following should be true for any unicode string:

    text == nlp(text).text

Any case that breaks this invariant is a bug.
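As a rough sketch of how this invariant could be checked over a batch of texts: the helper below accepts any callable that returns an object with a `.text` attribute, which is how a loaded spaCy pipeline behaves. The names `find_roundtrip_failures`, `FakeDoc` and `mutating_nlp` are hypothetical; the stand-in pipeline only imitates the 'al' to 'ael' behaviour reported above so the example runs without a model installed.

```python
from collections import namedtuple


def find_roundtrip_failures(nlp, texts):
    """Return (text, reconstructed) pairs where nlp(text).text != text."""
    failures = []
    for text in texts:
        reconstructed = nlp(text).text
        if reconstructed != text:
            failures.append((text, reconstructed))
    return failures


# Hypothetical stand-in for a pipeline, used only for illustration.
FakeDoc = namedtuple("FakeDoc", ["text"])


def mutating_nlp(text):
    # Imitates the reported bug: the word 'al' comes back as 'ael'.
    words = ["ael" if word == "al" else word for word in text.split(" ")]
    return FakeDoc(" ".join(words))


failures = find_roundtrip_failures(mutating_nlp, ["Voy al cine", "Hola"])
for text, reconstructed in failures:
    print(repr(text), "->", repr(reconstructed))
```

To run the same check against a real pipeline, pass the loaded `nlp` object in place of `mutating_nlp`.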

On the current v2 the problem seems to be solved:

    >>> import spacy
    >>> nlp = spacy.blank('es')
    >>> nlp(u'al')
    al

The v2 model also performs quite a bit better on Spanish than the v1 model. You can get it with pip install spacy-nightly. Docs are available at https://alpha.spacy.io

aoodham · 2017-10-24T08:21:28Z

Thanks very much, honnibal. I had suspected that text == nlp(text).text should hold as an axiom of spaCy.

I'll look into upgrading our system.

lock · 2018-05-08T13:27:27Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the bug Bugs and behaviour differing from documentation label Oct 23, 2017

ines added the lang / es Spanish language data and models label Oct 23, 2017

honnibal closed this as completed Oct 24, 2017

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
