Lemmatization errors when text contains contracted forms of 'be' #674

Closed
gppatt opened this issue Dec 9, 2016 · 2 comments

Comments


gppatt commented Dec 9, 2016

I've noticed some inconsistent behavior here:

nlp = spacy_nlp(u"I'm hungry. You're hungry. He's hungry. It's hungry. We're hungry. They're hungry.")
for tok in nlp:
    print tok, tok.lemma_

Gives output:

I i
'm be
hungry hungry
. .
You you
're 're
hungry hungry
. .
He he
's '
hungry hungry
. .
It it
's '
hungry hungry
. .
We we
're 're
hungry hungry
. .
They they
're 're
hungry hungry
. .

A related error occurs with "won't" (and with the much rarer "shan't"):

nlp = spacy_nlp(u"They won't move.")
for tok in nlp:
    print tok, tok.lemma_

They they
wo wo
n't not
move move
. .

I think I once even saw a similar lemmatization error for "can't", but I am not able to recreate this error.
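
The mismatches reported above can be summarised with a small, self-contained check. This is only a sketch: the observed pairs are copied from the output in this report, while the `EXPECTED` table and the `find_mismatches` helper are hypothetical names introduced here, and the expected lemmas are an assumption based on ordinary English grammar rather than on spaCy's language data. "'s" is given more than one acceptable lemma because it is ambiguous.

```python
# Observed (token, lemma) pairs copied from the output in this report (spaCy 1.2.0).
observed = [
    ("'m", "be"),
    ("'re", "'re"),
    ("'s", "'"),
    ("wo", "wo"),
    ("n't", "not"),
]

# Assumption: acceptable lemmas per contracted token, based on English grammar.
# "'s" can stand for "is" or "has", so either lemma is accepted here.
EXPECTED = {
    "'m": {"be"},
    "'re": {"be"},
    "'s": {"be", "have"},
    "wo": {"will"},
    "n't": {"not"},
}


def find_mismatches(pairs, expected):
    """Return the (token, lemma) pairs whose lemma is not an acceptable one."""
    return [
        (tok, lemma)
        for tok, lemma in pairs
        if tok in expected and lemma not in expected[tok]
    ]


mismatches = find_mismatches(observed, EXPECTED)
for tok, lemma in mismatches:
    print("%s -> %s (expected one of: %s)" % (tok, lemma, ", ".join(sorted(EXPECTED[tok]))))
```

Under these assumptions, only "'m" and "n't" come out with a correct lemma; "'re", "'s" and "wo" are flagged.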

Your Environment

OSX 10.11.6
Spyder 3.0.0
spaCy 1.2.0

ines added the performance and 🌙 nightly labels Dec 10, 2016
ines added this to the Reorganise language data milestone Dec 10, 2016
ines (Member) commented Dec 10, 2016

Thanks for the report! Some of these should probably be handled in the morphological analyser (like "He's", where the lemma is ambiguous). But the others are definitely cases for the TOKENIZER_EXCEPTIONS.

I'm currently in the process of reorganising the language data (see organize-language-data branch). I'll add the missing exceptions, so this should all be fixed in the v2.0 release.
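
As a rough illustration of the exception-table idea mentioned above, here is a minimal plain-Python sketch. It does not use spaCy and is not the actual TOKENIZER_EXCEPTIONS data: the `EXCEPTIONS` table, its `"orth"`/`"lemma"` keys and the `analyse` helper are all hypothetical. It assumes whitespace-separated input without punctuation, and uses lower-casing as a stand-in lemma for words that have no exception entry.

```python
# Hypothetical exception table: each contracted form maps to the tokens it is
# split into, with an explicit lemma per token. The key names are illustrative.
EXCEPTIONS = {
    "you're": [{"orth": "you", "lemma": "you"}, {"orth": "'re", "lemma": "be"}],
    "we're": [{"orth": "we", "lemma": "we"}, {"orth": "'re", "lemma": "be"}],
    "they're": [{"orth": "they", "lemma": "they"}, {"orth": "'re", "lemma": "be"}],
    "won't": [{"orth": "wo", "lemma": "will"}, {"orth": "n't", "lemma": "not"}],
    "shan't": [{"orth": "sha", "lemma": "shall"}, {"orth": "n't", "lemma": "not"}],
}


def analyse(text):
    """Return (token, lemma) pairs, splitting words found in EXCEPTIONS."""
    result = []
    for word in text.split():
        entry = EXCEPTIONS.get(word.lower())
        if entry is None:
            # Stand-in for a real lemmatizer: just lower-case the word.
            result.append((word, word.lower()))
            continue
        # Slice the original word so the surface form keeps its casing.
        start = 0
        for tok in entry:
            end = start + len(tok["orth"])
            result.append((word[start:end], tok["lemma"]))
            start = end
    return result


print(analyse("They won't move"))
print(analyse("You're hungry"))
```

Because the lemma is stored alongside each split token, "'re" and "wo" no longer fall through with their surface form as the lemma. An ambiguous form such as "'s" is deliberately left out of the table, since choosing its lemma needs context.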

ines closed this as completed in a223221 Dec 18, 2016
ines removed the 🌙 nightly label Dec 18, 2016

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018