Unicode apostrophe confuses tokenizer #685

Closed
pokey opened this issue Dec 14, 2016 · 2 comments

Comments

Contributor

pokey commented Dec 14, 2016

When the tokenizer sees the Unicode apostrophe (U+2019, right single quotation mark), it doesn't split the contraction the way it does with the ASCII apostrophe. For example:

import spacy

nlp = spacy.load('en', parser=False)

print(list(nlp.tokenizer("I'm hungry")))
print(list(nlp.tokenizer("I\u2019m hungry")))

outputs

[I, 'm, hungry]
[I’m, hungry]

My Environment

  • OS X 10.11.6
  • Python 3.5.2
  • spacy 1.2.0
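
A possible workaround on affected versions is to normalize the text before it reaches the tokenizer. The sketch below is only an illustration: `normalize_apostrophes` is a hypothetical helper, not part of spaCy, and it assumes that treating every U+2019 as an apostrophe is acceptable for the input (it would also rewrite closing single quotes).

```python
# Hypothetical pre-processing helper; not part of spaCy.
# Maps U+2019 (RIGHT SINGLE QUOTATION MARK) to the ASCII apostrophe U+0027.
APOSTROPHE_MAP = str.maketrans({"\u2019": "'"})


def normalize_apostrophes(text):
    """Return text with every U+2019 replaced by an ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)


print(normalize_apostrophes("I\u2019m hungry"))  # I'm hungry
```

The replacement is one character for one character, so character offsets into the original string stay valid after normalization.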

ines added this to the Reorganise language data milestone Dec 15, 2016
Member

ines commented Dec 15, 2016

Thanks, good catch! This should probably be added to the tokenizer exceptions.

I'm currently in the process of reorganising the language data and we're gonna merge the changes pretty soon, so there's little point in fixing this in the old, messy format now. I'll do it on the new organize-language-data branch instead, so it will definitely be fixed in the next release.
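
To illustrate the idea of covering both apostrophes through tokenizer exceptions, here is a toy sketch. It is not spaCy's exception format or tokenizer: `BASE_EXCEPTIONS`, `expand_exceptions` and `tokenize` are made-up names, and the tokenizer is a plain whitespace splitter. It only shows how one exception table written with the ASCII apostrophe could be expanded to also match U+2019.

```python
# Toy illustration only: NOT spaCy's exception format or tokenizer.
APOSTROPHES = ("'", "\u2019")

# Exceptions written once, using the ASCII apostrophe.
BASE_EXCEPTIONS = {
    "I'm": ["I", "'m"],
    "don't": ["do", "n't"],
}


def expand_exceptions(base):
    """Duplicate each exception for every apostrophe variant."""
    expanded = {}
    for text, pieces in base.items():
        for apostrophe in APOSTROPHES:
            key = text.replace("'", apostrophe)
            expanded[key] = [p.replace("'", apostrophe) for p in pieces]
    return expanded


EXCEPTIONS = expand_exceptions(BASE_EXCEPTIONS)


def tokenize(text):
    """Split on whitespace, then split chunks that match an exception."""
    tokens = []
    for chunk in text.split():
        tokens.extend(EXCEPTIONS.get(chunk, [chunk]))
    return tokens


print(tokenize("I'm hungry"))
print(tokenize("I\u2019m hungry"))
```

With the expanded table, both inputs split into three tokens, and the second token keeps whichever apostrophe the input used.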

ines added the 🌙 nightly label Dec 15, 2016
ines closed this as completed in 5445074 Dec 18, 2016
ines removed the 🌙 nightly label Dec 18, 2016

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this as resolved and limited conversation to collaborators May 9, 2018