Spanish tokenizer mutates the supplied text, but appears to keep no record of changes made #1452
Comments
If I'm understanding correctly, that's definitely a bug. The following should be true for any unicode string:
text == nlp(text).text
Any case that breaks this invariant is a bug. On the current v2 the problem seems to be solved:
>>> import spacy
>>> nlp = spacy.blank('es')
>>> nlp(u'al')
al
The v2 model also performs quite a bit better on Spanish than the v1 model. You can get it with
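For anyone wanting to test their own texts against that invariant, here is a minimal sketch of a checker. The function name `check_roundtrip` is made up for illustration and is not part of spaCy; it only assumes that `nlp` is a callable returning a doc with a `.text` attribute whose tokens expose `.text` and `.idx`, as spaCy docs do.

```python
def check_roundtrip(nlp, text):
    """Return a list of problems found when comparing a parsed doc to its input.

    `nlp` is assumed to be any callable returning a spaCy-like doc: an
    iterable of tokens with `.text` and `.idx`, plus a `.text` on the doc.
    """
    doc = nlp(text)
    problems = []

    # Invariant from the comment above: the doc must reproduce the input.
    if doc.text != text:
        problems.append(
            "doc.text differs from the input: %r != %r" % (doc.text, text)
        )

    # Each token's character offset should point at that token in the input.
    for token in doc:
        start = token.idx
        end = start + len(token.text)
        if text[start:end] != token.text:
            problems.append(
                "token %r at idx %d does not match input slice %r"
                % (token.text, start, text[start:end])
            )
    return problems
```

With spaCy installed, something like `check_roundtrip(spacy.blank('es'), u'al')` should return an empty list if the pipeline leaves the text untouched.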
Thanks very much, honnibal. I had suspected that the I'll look into upgrading our system.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi,
I'm using the Spanish es_core_web_md model and am finding that it mutates the supplied text such that the word 'al' gets transformed into 'ael'. I'm relatively new to the world of spaCy, but there seems to be no record made of this transformation. This is a feature request to record these substitutions so that it's possible to map back from the spaCy token.idx attribute to an index in the raw text supplied to the spaCy pipeline. The cumulative effect of these extra 'e's means that tokens towards the end of large documents can have their index in spaCy shifted quite dramatically from the original one supplied.
Currently we can bypass this issue by supplying a special_case to the tokenizer to transform 'al' to 'al'. This however ignores the fact that normally this gets tokenized into two tokens, 'a' and 'el'.
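To illustrate the kind of mapping being requested, here is a rough sketch that recovers it after the fact by aligning the mutated text with the raw text using Python's standard `difflib`. This is only a heuristic workaround, not something spaCy provides; the function name `build_offset_map` and the sample sentence are made up for illustration, and the alignment may be ambiguous on texts with many repeated substrings.

```python
from difflib import SequenceMatcher


def build_offset_map(mutated, raw):
    """Map each character index in `mutated` to an index in `raw`.

    Returns a list `m` of length len(mutated) + 1, so that both start and
    end offsets of a token can be translated.
    """
    mapping = [0] * (len(mutated) + 1)
    matcher = SequenceMatcher(None, mutated, raw, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            if tag == "equal":
                # Unchanged characters keep their relative position.
                mapping[i] = j1 + (i - i1)
            else:
                # Characters with no counterpart in the raw text (such as
                # the extra 'e') are pinned to the start of the raw span.
                mapping[i] = j1
    mapping[len(mutated)] = len(raw)
    return mapping


raw = "Voy al cine"
mutated = "Voy ael cine"  # what the pipeline is reported to produce
offsets = build_offset_map(mutated, raw)

# 'cine' starts at index 8 in the mutated text.
start, end = offsets[8], offsets[12]
print(raw[start:end])
```

In practice `mutated` would be the text reported by the parsed doc and the indices would come from token.idx, so each shifted offset can be translated back before slicing the raw input.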
Your Environment
Info about spaCy