Wrong span for compounds? #264
Hey @ardeego, thanks for reporting! I'll fix this. Just to be sure I get this right, though, could you provide a few example sentences containing both 2- and 3+ word compound nouns? (I'm having a hard time coming up with good examples to test...)
Another reason for more examples is that I don't think I'm seeing the issue you mentioned in the first example I came up with:
You can see that both "cherry" and "blossom" are in the `lefts` of "trees".
Hi, I think for the most part it gets it right, but I've seen it here:

```python
doc = nlp(u"The microsoft browser program window shadow isn't looking very good on the screen")
for tok in doc:
    print(tok, "=>", [(l, l.dep_) for l in tok.lefts])
```

I used this to better visualize the issue:

```python
from spacy import displacy
displacy.render(doc, style="dep")
```

Most consistently it happens if a proper noun is part of the compound. Does this parse the same way for you? Otherwise this might be an issue with spaCy?
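To make the cascading-attachment problem concrete without depending on a particular spaCy model (parses can differ between models, as the thread notes), here is a minimal sketch using mock objects in place of spaCy tokens (only the `.i`, `.text`, `.dep_`, and `.lefts` attributes are mimicked); the structure mirrors the "pool table stick" example from the original report:

```python
from types import SimpleNamespace

def tok(i, text, dep, lefts=()):
    # Minimal stand-in for a spaCy Token (only .i, .text, .dep_, .lefts).
    return SimpleNamespace(i=i, text=text, dep_=dep, lefts=list(lefts))

# Cascading attachment: "pool" modifies "table", which modifies "stick",
# so the head's direct lefts contain only the immediate modifier.
pool = tok(0, "pool", "compound")
table = tok(1, "table", "compound", [pool])
stick = tok(2, "stick", "ROOT", [table])

direct_lefts = [t.text for t in stick.lefts]
print(direct_lefts)  # ['table'] -- "pool" is not a direct left of "stick"
```

This is why iterating only over a head noun's direct lefts can miss the inner members of a 3+ word compound.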
The code I posted earlier is not a fix either, by the way, because it misses the cases where your code skipped lefts correctly.
Mine actually parses differently. (I'm using the small English 2.1 model; you?)
I'm honestly not sure how to proceed. If the parser were 100% correct, would we expect all nouns in a single compound noun to be in the lefts of the head?
I was using the large English model:
I'm not sure how to proceed either, nor am I sure what 100% correct would mean for a compound. I thought going over the lefts of a noun chunk would be sufficient, until I found several examples where it was not. I have seen a whole lot of combinations of compound dependency arrangements. I guess the full compound would be the longest "compound dependency path" in the parse tree.

I am sitting a bit on the fence on this, as I wonder if a longer compound noun is necessarily better than a partial one. If the compound gets too long, it becomes very specific and very rare over the corpus, so I wonder how much downstream value such a compound would serve. Maybe @honnibal can advise on the parsing correctness?
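The "longest compound dependency path" idea can be sketched as a recursive walk (a hypothetical helper, not part of either implementation in this thread; mock objects again stand in for spaCy tokens):

```python
from types import SimpleNamespace

def tok(i, dep, lefts=()):
    # Minimal stand-in for a spaCy Token (only .i, .dep_, .lefts).
    return SimpleNamespace(i=i, dep_=dep, lefts=list(lefts))

def compound_start(token):
    """Leftmost document index reachable by following "compound"
    dependencies down the tree, i.e. the start of the longest
    compound dependency path ending at `token`."""
    starts = [compound_start(c) for c in token.lefts if c.dep_ == "compound"]
    return min(starts, default=token.i)

# Cascading case: "pool" <- "table" <- "stick"
pool = tok(0, "compound")
table = tok(1, "compound", [pool])
stick = tok(2, "ROOT", [table])
print((compound_start(stick), stick.i))  # (0, 2)
```

Because the recursion follows every compound child, it handles both flat attachments (all modifiers on the head) and cascading ones.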
This is the non-recursive solution I ended up settling on (it handles all parsing variations of compound nouns I came across). Not as slick as your itertools solution, but performance should be pretty good, since no generators have to be expanded into lists.

```python
def get_span_for_compound_noun(noun):
    """Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        token = None
        for token in lefts:
            if token.dep_ == "compound":
                start = token.i
                lefts = token.lefts
                break
        if token:
            if start != token.i:
                break
        else:
            break
    return (start, end)
```
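As a runnable sanity check of the function above (the mock setup is hypothetical; only the `.i`, `.dep_`, and `.lefts` attributes of spaCy tokens are mimicked, and the function is repeated verbatim so the block is self-contained):

```python
from types import SimpleNamespace

def get_span_for_compound_noun(noun):
    """Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        token = None
        for token in lefts:
            if token.dep_ == "compound":
                start = token.i
                lefts = token.lefts
                break
        if token:
            if start != token.i:
                break
        else:
            break
    return (start, end)

def tok(i, dep, lefts=()):
    # Minimal stand-in for a spaCy Token.
    return SimpleNamespace(i=i, dep_=dep, lefts=list(lefts))

# Cascading parse: "pool table stick" -- each modifier attaches to the
# next noun ("pool" <- "table" <- "stick").
pool = tok(0, "compound")
table = tok(1, "compound", [pool])
stick = tok(2, "ROOT", [table])
print(get_span_for_compound_noun(stick))  # (0, 2)

# Flat parse: "cherry blossom trees" -- both modifiers attach directly
# to the head noun "trees".
cherry = tok(0, "compound")
blossom = tok(1, "compound")
trees = tok(2, "ROOT", [cherry, blossom])
print(get_span_for_compound_noun(trees))  # (0, 2)
```

Both attachment shapes yield the full span, which matches the claim that the loop handles the parse variations encountered.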
I'm still a bit torn about this, but if it works, I'm happy to accept the empirical evidence. :) Could you provide me with a handful of example sentences and their expected outputs? I'll add a test to make sure everything works as expected. Thanks for keeping on this!
Will make some time for it. Thanks.
Hey there,
I was wondering if `get_span_for_compound_noun(noun)` is correctly implemented? Iterating over the `lefts` is not going to work for triple (plus) compounds, as they won't be rooted at the same noun but in a cascading fashion (this would not get the compound noun "pool table stick", for example). This is what I ended up doing:

Technically, for it to be a proper span, it probably should have returned `end_i + 1`?