Spanish tokenizer mutates the supplied text, but appears to keep no record of changes made #1452

Closed
aoodham opened this issue Oct 23, 2017 · 3 comments
Labels: bug (Bugs and behaviour differing from documentation), lang / es (Spanish language data and models)

Comments


aoodham commented Oct 23, 2017

Hi,

I'm using the Spanish es_core_web_md model and am finding that it mutates the supplied text such that the word 'al' gets transformed into 'ael'. I'm relatively new to the world of spaCy, but there seems to be no record made of this transformation. This is a feature request to record the substitutions made, so that it's possible to map back from the spaCy token.idx attribute to an index in the raw text supplied to the spaCy pipeline. The cumulative effect of these extra 'e's means that tokens towards the end of large documents can have their spaCy index shifted quite dramatically from their position in the original text.

Currently we can bypass this issue by supplying a special_case to the tokenizer that maps 'al' to 'al'. This, however, ignores the fact that 'al' normally gets tokenized into two tokens, 'a' and 'el'.

Your Environment

Info about spaCy

  • spaCy version: 1.8.2
  • Platform: Linux-4.10.0-37-generic-x86_64-with-debian-stretch-sid
  • Python version: 3.6.1
  • Installed models:

honnibal added the bug label Oct 23, 2017

Member

honnibal commented Oct 23, 2017

If I'm understanding correctly, that's definitely a bug. The following should be true for any unicode string:

    text == nlp(text).text

Any case that breaks this invariant is a bug.

On the current v2 the problem seems to be solved:

    >>> import spacy
    >>> nlp = spacy.blank('es')
    >>> nlp(u'al')
    al

The v2 model also performs quite a bit better on Spanish than the v1 model. You can get it with pip install spacy-nightly. Docs are available at https://alpha.spacy.io

ines added the lang / es label Oct 23, 2017

Author

aoodham commented Oct 24, 2017

Thanks very much honnibal. I had suspected that text == nlp(text).text should hold as an axiom of spaCy.

I'll look into upgrading our system.


lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018