💫 Lemmatizer should apply rules on OOV words #781

Closed
honnibal opened this issue Jan 28, 2017 · 7 comments

Comments

@honnibal
Member

@juanmirocks points out in #327 that the lemmatizer fails on OOV words:

>>> nlp.vocab.morphology.lemmatizer(u'endosomes', 'noun', morphology={'number': 'plur'})
set([u'endosomes'])
>>> nlp.vocab.morphology.lemmatizer(u'chromosomes', 'noun', morphology={'number': 'plur'})
set([u'chromosome'])

Suggested patch to lemmatizer.py

    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
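
To see the effect of the patch in isolation, here is a minimal, self-contained sketch. The function signature, the exceptions lookup, the final fallback to the input string, and the toy `NOUN_INDEX` / `NOUN_RULES` data are all assumptions made for illustration; they are not copied from spaCy's lemmatizer.py. Only the loop body mirrors the suggested patch.

```python
def lemmatize(string, index, exceptions, rules):
    """Toy rule-based lemmatizer with an OOV fallback (illustrative only)."""
    string = string.lower()
    forms = list(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if form in index or not form.isalpha():
                # Candidate is a known lemma (or non-alphabetic): accept it
                forms.append(form)
            else:
                # Candidate is out of vocabulary: keep it as a fallback
                oov_forms.append(form)
    if not forms:
        # No in-vocabulary candidate was found, so use the rule output anyway
        forms.extend(oov_forms)
    if not forms:
        # No rule applied at all: return the input unchanged (assumed behaviour)
        forms.append(string)
    return set(forms)


# Hypothetical toy data: "chromosome" is in vocabulary, "endosome" is not
NOUN_INDEX = {"chromosome"}
NOUN_RULES = [("s", ""), ("ses", "s"), ("ies", "y")]

print(lemmatize("chromosomes", NOUN_INDEX, {}, NOUN_RULES))
print(lemmatize("endosomes", NOUN_INDEX, {}, NOUN_RULES))
```

With this toy data, both calls strip the plural "s". Without the `oov_forms` fallback, the second call would return the input unchanged, because "endosome" is not in the index.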

@honnibal added the bug and performance labels and removed the bug label on Jan 28, 2017
@ines changed the title from "Lemmatizer should apply rules on OOV words" to "💫 Lemmatizer should apply rules on OOV words" on Jan 29, 2017
@juanmirocks
Contributor

@honnibal thanks for looking into this. Now I understand the problem better.

Looking forward to a fix as I think it will greatly improve the performance of my methods.

Kindly let me know if I can help with the patch.

@ines
Member

ines commented Jan 30, 2017

@juanmirocks If you have time to add the patch to lemmatizer.py and make a pull request, that would be great 👍

The only thing that's important is to make sure it's properly tested. In this case, I think it'd be fine to just add a regression test for this issue, using the above examples.
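
A regression test along these lines could be table-driven. The sketch below is only an illustration: `find_regressions`, `ISSUE_781_CASES`, and the `old_behaviour` stand-in are hypothetical names, and the expected lemma for "endosomes" is assumed to be "endosome". In a real test the stand-in would be replaced by a call into the actual lemmatizer.

```python
def find_regressions(lemmatize, cases):
    """Return (word, expected, actual) for every case whose lemma set is wrong."""
    failures = []
    for word, expected in cases:
        actual = lemmatize(word)
        if actual != expected:
            failures.append((word, expected, actual))
    return failures


# The two nouns from the issue description, with the lemma sets we expect
ISSUE_781_CASES = [
    ("endosomes", {"endosome"}),
    ("chromosomes", {"chromosome"}),
]


def old_behaviour(word):
    # Hypothetical stand-in for the pre-patch lemmatizer:
    # a rule result is only accepted when it is in the vocabulary.
    stem = word[:-1] if word.endswith("s") else word
    return {stem} if stem in {"chromosome"} else {word}


# Reports the "endosomes" case as failing under the old behaviour
print(find_regressions(old_behaviour, ISSUE_781_CASES))
```

An empty list from `find_regressions` means every example lemmatizes as expected.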

@juanmirocks
Contributor

juanmirocks commented Jan 31, 2017 via email

@juanmirocks
Contributor

Other examples that I will use for debugging and testing:

  • "colocalizes" is lemmatized to itself ("colocalizes"), as in:
    • PP2A colocalizes with shugoshin at centromeres and is required for centromeric protection .

@juanmirocks
Contributor

Pull request created

@honnibal
Member Author

Merged!

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018