Inappropriate CCOMP/NSUBJ dependency labels #905

Closed
thedataist opened this issue Mar 22, 2017 · 4 comments
Labels
lang / en (English language data and models) · models (Issues related to the statistical models)

Comments

@thedataist

thedataist commented Mar 22, 2017

I've just finished an extensive test of the dependency parser for a social media/messaging use case (1-2 sentence expressions) and have noticed one case where spaCy consistently performs poorly in comparison to other parsers...

For expressions like: "I want a different thing can i have the red one instead", the parser will return:
(grouping verb phrases for "want" and "have" here for brevity)

"I want" -ccomp-> "a different thing can I have the red one instead" (e.g. "a different thing" is the nsubj for "have" instead of "want")

The correct parse should be:
"I want a different thing" -ccomp-> "can I have the red one instead"

Placing a comma after "thing" (i.e. "I want a different thing, can i have the red one instead") elicits the correct parse by clearing up the ambiguity.
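
For anyone reproducing this, a minimal sketch of how I'm inspecting the parse (assumes the English model from the version info below is installed; the print formatting is just mine):

import spacy

nlp = spacy.load('en')
doc = nlp(u"I want a different thing can i have the red one instead")
for token in doc:
    # print each token with its dependency label and head, which shows
    # "thing" ending up under "have" rather than attaching to "want"
    print('%s -%s-> %s' % (token.text, token.dep_, token.head.text))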

I assume this is a flaw in the model, but realise this is a slightly ambiguous parse, as you could get expressions like "I hurt bob can you get me to the doctor", where the parse could go either way ("i hurt" or "i hurt bob" could be valid as the first verb phrase). In cases where the first verb phrase contains "want", "need", "can i get", "can i have", etc., though, the correct parse should almost always attach the nsubj as a child of the root verb.

That said, in almost all other cases parser performance was very good and I've found the library to be very well thought out. I'd really like to put spaCy into production, but this may be a showstopper for me. Is there a chance the model will be corrected soon, or should I consider building my own?

macOS Sierra
Python 2.7
spaCy 1.7.2
model en_core_web_md-1.2.1

@honnibal
Member

Hi,

Thanks, interesting analysis. To give you some insight on this, the current parsing model is "greedy", in that it only considers a single analysis. However, it does have a "repair" mechanism that allows it to alter the partial analysis it's building in light of new information.

I can think of a few ways you might correct this. First, you might want to use the step_through() method of the parser to investigate these sentences, and print out the sequence of actions the parser is taking. This would be helpful in determining whether there's actually some error in the way the parser is evaluating these things.
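
Roughly like this (a rough, untested sketch — is_final, predict(), transition(), stack and queue are assumed names on the state object step_through() returns, so double-check them against your version):

doc = nlp.tokenizer(u"I want a different thing can i have the red one instead")
nlp.tagger(doc)
with nlp.parser.step_through(doc) as state:
    while not state.is_final:
        action = state.predict()  # the transition the model wants to take next
        print('%s | stack: %s | queue: %s' % (action, state.stack, state.queue))
        state.transition(action)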

Assuming all is correct, I would say the best solution would be to do some additional training on the cases you're interested in. Given you've already set up a couple of other parsers, I would recommend "tri training" as a strategy: have your other parsers analyse a bunch of text, and take the sentences on which they agree. Then designate these as gold-standard sentences, and use them to train spaCy. Specifically, something like this:

import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')

for text in examples:
    # parser1/parser2 stand in for your other two parsers; match() is
    # whatever agreement check you like (e.g. identical heads and labels)
    heads1, deps1 = parser1(text)
    heads2, deps2 = parser2(text)
    if match(heads1, deps1, heads2, deps2):
        doc = nlp.tokenizer(text)
        # depending on your version the keyword may be deps= rather than labels=
        gold = GoldParse(doc, heads=heads1, labels=deps1)

        nlp.tagger(doc)
        nlp.parser.update(doc, gold)

It may take some fiddling to get this right, but the basic concept of this is pretty well established in the literature.

Best,
Matt

@thedataist
Author

Thanks Matt, will give that a go. This is really the only issue preventing me from selecting spaCy as our parser. Can't beat the speed with anything else we've tried, so will see if I can train it out for this case.

FYI, the only other issue we noticed was the tendency for the parser to (not unreasonably) mangle locations (addresses, place names, etc.) that should be compound nouns, but we solved it with a custom NER for locations.
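
(To give a flavour of the idea, a simplified gazetteer-style sketch — our real component is more involved, and the location list and helper name here are placeholders:)

# post-process spaCy's output with a simple gazetteer lookup
LOCATIONS = set(['new york', 'baker street', 'san francisco'])  # placeholder list

def find_locations(doc, max_len=3):
    # return (start, end) token spans whose text matches the gazetteer
    spans = []
    for start in range(len(doc)):
        for end in range(start + 1, min(start + 1 + max_len, len(doc) + 1)):
            if doc[start:end].text.lower() in LOCATIONS:
                spans.append((start, end))
    return spans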

Best,
Peter

@ines added the lang / en and models labels and removed the models label on May 13, 2017
@ines
Member

ines commented May 13, 2017

Closing this and making #1057 the master issue – work in progress for spaCy v2.0!

@ines ines closed this as completed May 13, 2017
@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018