to_bytes and from_bytes changes the token lemma #636

Closed
rajhans opened this issue Nov 18, 2016 · 7 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


rajhans commented Nov 18, 2016

>>> import spacy
>>> nlp = spacy.load('en')
>>> x1 = nlp('I cant do this.')
>>> [t.lemma_ for t in x1]
['i', 'can', 'not', 'do', 'this', '.']
>>> g = x1.to_bytes()
>>> d = spacy.tokens.doc.Doc(nlp.vocab)
>>> d.from_bytes(g)
I cant do this.
>>> [t.lemma_ for t in d]
['i', 'ca', 'nt', 'do', 'this', '.']

Author

rajhans commented Nov 18, 2016

FYI, I am using spaCy 1.1.2.

honnibal added the "bug" label Nov 18, 2016
@honnibal
Member

Thanks. I think the bug here is that the serializer tries to get away with not saving the lemmas, because it assumes it can recalculate them from the POS tags. That turns out to be untrue in this case, because the lemma comes from a special-case rule. Hmm.
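
Until the serializer stores lemmas itself, one possible stopgap is to carry them alongside the serialized bytes. The following is a minimal sketch under that assumption: `pack_with_lemmas` and `unpack_with_lemmas` are hypothetical helpers, not part of spaCy, and the byte string stands in for the output of `to_bytes()`.

```python
import json
import struct

def pack_with_lemmas(doc_bytes, lemmas):
    """Prefix the serialized doc with a length-tagged JSON list of lemmas."""
    header = json.dumps(lemmas).encode("utf-8")
    # 4-byte big-endian length, then the JSON header, then the original bytes.
    return struct.pack(">I", len(header)) + header + doc_bytes

def unpack_with_lemmas(blob):
    """Split a packed blob back into (doc_bytes, lemmas)."""
    (header_len,) = struct.unpack(">I", blob[:4])
    lemmas = json.loads(blob[4:4 + header_len].decode("utf-8"))
    return blob[4 + header_len:], lemmas

# Stand-in for x1.to_bytes(); any bytes object works for the sketch.
doc_bytes = b"\x00\x01serialized-doc"
lemmas = ["i", "can", "not", "do", "this", "."]

blob = pack_with_lemmas(doc_bytes, lemmas)
restored_bytes, restored_lemmas = unpack_with_lemmas(blob)
```

After unpacking, `restored_bytes` would go to `from_bytes()` and `restored_lemmas` would be read in place of `token.lemma_` for the reloaded document.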

Author

rajhans commented Nov 21, 2016

I see. As a stopgap for my project, is it possible to know which words (like "cant") this problem can arise for? If it is a finite, knowable set of words, I can just hackishly fix those.

@honnibal
Member

Yes, the special-case rules are listed in spacy/en/language_data.py. You want to check whether the "F" value matches the token.text. If it does, you want to assign the value keyed by "L".
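
The check described above could be sketched roughly as follows. The `SPECIAL_CASES` table is a hypothetical excerpt that only mirrors the "F" and "L" keys mentioned here; the real entries and their exact layout in spacy/en/language_data.py may differ. The helper names are also assumptions, and the sketch works on plain lists of strings rather than on spaCy tokens.

```python
# Hypothetical excerpt: each special case maps a surface string to its pieces.
# "F" is the piece's text form, "L" (when present) is its lemma.
SPECIAL_CASES = {
    "cant": [{"F": "ca", "L": "can"}, {"F": "nt", "L": "not"}],
    "dont": [{"F": "do"}, {"F": "nt", "L": "not"}],
}

def build_lemma_overrides(special_cases):
    """Collect an F -> L lookup from every piece that declares a lemma."""
    overrides = {}
    for pieces in special_cases.values():
        for piece in pieces:
            if "L" in piece:
                overrides[piece["F"]] = piece["L"]
    return overrides

def fix_lemmas(texts, lemmas, overrides):
    """Replace a lemma whenever the token text matches a special-case form."""
    return [overrides.get(text, lemma) for text, lemma in zip(texts, lemmas)]

# Token texts and lemmas as they came back from from_bytes() in the report.
texts = ["I", "ca", "nt", "do", "this", "."]
lemmas = ["i", "ca", "nt", "do", "this", "."]

fixed = fix_lemmas(texts, lemmas, build_lemma_overrides(SPECIAL_CASES))
print(fixed)
```

Matching on the form alone ignores context, so a form that appears in several rules with different lemmas would need extra disambiguation.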

Author

rajhans commented Nov 22, 2016

thanks!

Member

ines commented May 7, 2017

Closing this and making #1045 the master issue. Work in progress for spaCy v2.0!

ines closed this as completed May 7, 2017

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018