💫 Lemmatizer should apply rules on OOV words #781

honnibal · 2017-01-28T13:11:55Z

@juanmirocks points out in #327 that the lemmatizer fails on OOV words:

>>> nlp.vocab.morphology.lemmatizer(u'endosomes', 'noun', morphology={'number': 'plur'})
set([u'endosomes'])
>>> nlp.vocab.morphology.lemmatizer(u'chromosomes', 'noun', morphology={'number': 'plur'})
set([u'chromosome'])

Suggested patch to lemmatizer.py

    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
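
To try the patched loop in isolation, here is a minimal sketch that wraps it in a stand-alone function. The function name `lemmatize`, its signature, and the toy data (`NOUN_INDEX`, `NOUN_EXCEPTIONS`, `NOUN_RULES`) are assumptions made up for illustration; they are not spaCy's actual lemmatizer.py or its data files. Only the loop in the middle comes from the patch above.

```python
def lemmatize(string, index, exceptions, rules):
    """Hypothetical stand-alone wrapper around the suggested patch.

    index:      set of known lemmas (toy stand-in for the real index)
    exceptions: dict mapping irregular forms to lists of lemmas
    rules:      list of (old_suffix, new_suffix) pairs
    """
    string = string.lower()
    # Assumption: irregular forms are looked up first.
    forms = list(exceptions.get(string, []))

    # --- loop from the suggested patch ---
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if form in index or not form.isalpha():
                forms.append(form)
            else:
                # Candidate is not in the index: keep it as a fallback.
                oov_forms.append(form)
    if not forms:
        # No in-vocabulary candidate was found, so use the OOV candidates.
        forms.extend(oov_forms)
    # --- end of patch ---

    # Assumption: fall back to the input when no rule applied at all.
    if not forms:
        forms.append(string)
    return set(forms)


# Toy data, invented for this example only.
NOUN_INDEX = {"chromosome"}
NOUN_EXCEPTIONS = {}
NOUN_RULES = [("ies", "y"), ("s", "")]

print(lemmatize("chromosomes", NOUN_INDEX, NOUN_EXCEPTIONS, NOUN_RULES))
print(lemmatize("endosomes", NOUN_INDEX, NOUN_EXCEPTIONS, NOUN_RULES))
```

With this toy data, "chromosomes" resolves through the index as before, while "endosomes" (not in the index) now picks up the rule-derived candidate instead of being returned unchanged.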

juanmirocks · 2017-01-30T12:15:34Z

@honnibal thanks for looking into this. Now I understand the problem better.

Looking forward to a fix as I think it will greatly improve the performance of my methods.

Kindly let me know if I can help with the patch.

ines · 2017-01-30T13:48:05Z

@juanmirocks If you have time to add the patch to lemmatizer.py and make a pull request, that would be great 👍

The only thing that's important is to make sure it's properly tested. In this case, I think it'd be fine to just add a regression test for this issue, using the above examples.

juanmirocks · 2017-01-31T13:33:55Z

Thanks Ines. I will gladly help and try. Is that patch code all that is necessary, as decided by honnibal? I will dive in.

juanmirocks · 2017-02-03T08:36:11Z

Other examples that I will use for debugging and testing:

"colocalizes" is lemmatized to itself ("colocalizes"):
- PP2A colocalizes with shugoshin at centromeres and is required for centromeric protection .

juanmirocks · 2017-03-01T20:53:18Z

Pull request created

honnibal · 2017-03-16T22:57:54Z

Merged!

lock · 2018-05-09T02:38:19Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the bug and performance labels and removed the bug label Jan 28, 2017

ines changed the title ~~Lemmatizer should apply rules on OOV words~~ 💫 Lemmatizer should apply rules on OOV words Jan 29, 2017

This was referenced Feb 3, 2017

‼️Crush the Baseline Rostlab/LocText#11

Closed

Explore Different Sentece Models and achieve F1 > 80 Rostlab/LocText#34

Open

ines added this to the Update lemmatizer and morphology milestone Feb 18, 2017

juanmirocks mentioned this issue Mar 1, 2017

#781 Apply suggested patch -- passing regression test added #866

Merged

8 tasks

honnibal closed this as completed Mar 16, 2017

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
