Dependency parser/tagger misidentifies a verb as a noun #1021

Closed

anna-hope opened this issue Apr 26, 2017 · 4 comments
Labels: lang / en (English language data and models), models (Issues related to the statistical models)

Comments

anna-hope commented Apr 26, 2017

Here is an example input:

import spacy

nlp = spacy.load('en_depent_web_md')
doc = nlp("Does this phone work?")

for token in doc:
    print(token, token.pos_, token.tag_, token.dep_, token.head)
    print()

Here is the output:

Does VERB VBZ ROOT Does

this DET DT det work

phone NOUN NN compound work

work NOUN NN dobj Does

? PUNCT . punct Does

As you can see, spaCy incorrectly classifies work as a noun, which (I assume) leads to the dependency parser failing to label it as the root, and thus misidentifying the root as Does.

You can play with variations of the above input, such as "Will this phone work?" or "Would this phone work?" In all of the above cases, spaCy fails to pull out "work" as the root.
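A minimal sketch to compare these variations (assuming the same nlp object loaded above); it just prints whichever token the parser marks as ROOT:

for text in ("Does this phone work?", "Will this phone work?", "Would this phone work?"):
    doc = nlp(text)
    root = [token for token in doc if token.dep_ == 'ROOT'][0]
    print(text, '->', root.text, root.pos_, root.tag_)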

This would be only a minor annoyance, except that I rely on the dependency parse for a lot of my downstream tasks, and the "{does/will/would} this work" pattern is common in my data. (I can only think of one kind of example where labelling "work" as a noun would be correct, such as "I did some work on this phone", but that strikes me as rarer than the case I have encountered, and it doesn't explain the sentences starting with {would/will}.)

I don't know whether the problem lies with the part-of-speech tagger, which assigns an erroneous tag to "work" and thereby throws off the dependency parser, or whether it's something else.
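One way to narrow this down is to run only the tagger on a pre-tokenized Doc; a minimal sketch, assuming the spaCy 1.x behaviour where pipeline components such as nlp.tagger can be applied to a Doc directly:

doc = nlp.tokenizer("Does this phone work?")
nlp.tagger(doc)  # apply only the part-of-speech tagger, skip the parser
print([(token.text, token.tag_) for token in doc])

If "work" already comes out as NN here, the tagger alone is mis-tagging it, independent of the parser.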

Would you have any idea about what is causing this? If so, is there a way to fix it without re-training the whole model?

Thanks!

Info about spaCy

  • spaCy version: 1.8.2
  • Platform: Linux-4.4.0-43-Microsoft-x86_64-with-Ubuntu-16.04-xenial
  • Python version: 3.6.1
  • Installed models: cache, en, en-1.1.0, en_core_web_md, en_default, en_depent_web_md
anna-hope changed the title from "Dependency parser/tagger misidentifies a verb as part of a compound noun" to "Dependency parser/tagger misidentifies a verb as a noun" on Apr 26, 2017
anna-hope (Author) commented:

Could be related to #1015.

How many examples would one need to correctly update the pre-trained model?

anna-hope commented Apr 27, 2017

I tried the following code, based on the one from #1015, but even after 100,000 iterations I had no luck making it recognise work as a verb:

import spacy
import spacy.gold

# assumes nlp = spacy.load('en_depent_web_md') as above
training_data = [
    ('Will this phone work?', 'MD DT NN VB .'),
    ('Would this phone work?', 'MD DT NN VB .'),
    ('Does this car work?', 'VBZ DT NN VB .'),
    ('This does work', 'DT VBZ VB'),
    ('Can this work?', 'MD DT VB .'),
    ('work', 'VB')
]

def update_tagger(tagger, example):
    orth_text, label_text = example
    doc = nlp.tokenizer(orth_text)
    tags = label_text.split()
    assert len(doc) == len(tags), 'Tokenisation does not match tags for {}'.format(orth_text)
    gold = spacy.gold.GoldParse(doc, tags=tags)
    tagger.update(doc, gold)

def train_tagger(tagger, examples):
    for i in range(100000):
        for example in examples:
            update_tagger(tagger, example)
    tagger.model.end_training()

train_tagger(nlp.tagger, training_data)
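A quick sanity check after updating (assuming the same nlp object) is to re-run the full pipeline on the problem sentence:

doc = nlp("Does this phone work?")
print([(token.text, token.tag_, token.dep_) for token in doc])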

At this point, I would prefer it to err on the side of "work" always being a verb rather than a noun (I understand that such behaviour might not be desirable in the general case, but it would work for my data).

In the meantime, I've found that if I replace "work" with some other verb that I'm not likely to see in my data set, like "hasten", I get the correct dependency parse. But that feels like a very silly workaround.
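A rough sketch of that workaround (parse_with_substitute is just an illustrative helper, not part of spaCy; it swaps the problem verb for one the model tags correctly, parses, then maps the placeholder back):

def parse_with_substitute(text, verb='work', placeholder='hasten'):
    doc = nlp(text.replace(verb, placeholder))
    # report the original verb in place of the placeholder in the output
    return [(verb if token.text == placeholder else token.text, token.tag_, token.dep_)
            for token in doc]

print(parse_with_substitute('Does this phone work?'))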

ines added the models (Issues related to the statistical models) and lang / en (English language data and models) labels and removed the models label on May 13, 2017
ines (Member) commented May 13, 2017

Closing this and making #1057 the master issue – work in progress for spaCy v2.0!

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on May 8, 2018