label keyword argument in Span.merge has no effect #862

Closed
acowlikeobject opened this issue Feb 27, 2017 · 6 comments
Labels
bug Bugs and behaviour differing from documentation help wanted (easy) Contributions welcome! (also suited for spaCy beginners)

Comments

@acowlikeobject

I am trying to add custom entities via add_entity API. I'd like to be able to specify their entity types. It does not appear that the label keyword in add_entity is doing anything.

(The following description looks long, but is the same code block repeated with minor changes.)

If I use the snippet provided in issue #523:

def merge_phrase(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)

import spacy

nlp = spacy.load('en')
nlp.matcher.add_entity('MorganStanley', on_match=merge_phrase)
nlp.matcher.add_pattern('MorganStanley', [{'orth': 'Morgan'}, {'orth': 'Stanley'}], label='ORG')
nlp.pipeline = [nlp.tagger, nlp.entity, nlp.matcher, nlp.parser]

It looks promising:

# Okay, now we've got our pipeline set up...
doc = nlp(u'Morgan Stanley fires Vice President')
for word in doc:
    print(word.text, word.tag_, word.dep_, word.head.text, word.ent_type_)

Morgan Stanley NNP amod fires ORG
fires NNS ROOT fires 
Vice NNP compound President 
President NNP appos fires 

However, I'm not sure the label='ORG' is actually doing anything. If I remove it, I get the same output.

nlp = spacy.load('en')
nlp.matcher.add_entity('MorganStanley', on_match=merge_phrase)
nlp.matcher.add_pattern('MorganStanley', [{'orth': 'Morgan'}, {'orth': 'Stanley'}])
nlp.pipeline = [nlp.tagger, nlp.entity, nlp.matcher, nlp.parser]

doc = nlp(u'Morgan Stanley fires Vice President')
for word in doc:
    print(word.text, word.tag_, word.dep_, word.head.text, word.ent_type_)

Morgan Stanley NNP amod fires ORG
fires NNS ROOT fires 
Vice NNP compound President 
President NNP appos fires

In fact, anything of the form ABC [Brothers/Limited/Company/Bank] gets labeled as an 'ORG'. And I can't get other patterns to be labeled 'ORG'. E.g.:

nlp = spacy.load('en')
nlp.matcher.add_entity('MorganStanley', on_match=merge_phrase)
nlp.matcher.add_pattern('MorganStanley', [{'orth': 'State'}, {'orth': 'Street'}], label='ORG')
nlp.pipeline = [nlp.tagger, nlp.entity, nlp.matcher, nlp.parser]

doc = nlp(u'State Street fires Vice President')
for word in doc:
    print(word.text, word.tag_, word.dep_, word.head.text, word.ent_type_)

State Street NNP compound fires 
fires NNS ROOT fires 
Vice NNP compound President 
President NNP appos fires 

The .label_ property of doc.ents is always ''.

How do I set the entity label? And what is the difference between doc.ents[0].label_ and doc[0].ent_type_?

  • Operating System: Debian 8
  • Python Version Used: 3.5
  • spaCy Version Used: 1.6
@honnibal
Member

Hi,

First --- I don't have utmost confidence that everything works the way it should here. I'll answer what should happen, but this part of the code isn't very well tested yet, after undergoing some changes through v1.0.

Associating the label to the entity in the matcher will only control what gets passed over to the on_match callback. Inside that callback, you decide to act on the pattern, and actually write to the document. In the example you quoted, the entities are being retokenized and merged, with the label being used to set the token's ent_type attribute.

If you omit the label= argument in the matcher, it won't be passed in to the on_match callback, and the label= attribute won't be passed into the .merge() method. However, the .merge() method will still try to inherit attributes for the new token from the subtokens, including the entity type. So you might still get an entity type on the new token.
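The fallback described above can be modelled in a few lines of plain Python. This is only a sketch of the intended behaviour, not spaCy's code: `merged_ent_type` is a hypothetical helper, and taking the first non-empty subtoken type is an assumed inheritance rule chosen for illustration.

```python
def merged_ent_type(subtoken_types, label=None):
    """Pick the entity type for a merged token (hypothetical helper).

    An explicit label wins. Otherwise the merged token inherits a type
    from its subtokens; using the first non-empty one is an assumption.
    """
    if label:
        return label
    for ent_type in subtoken_types:
        if ent_type:
            return ent_type
    return ""


# The entity recognizer already tagged both subtokens, so the merged
# token has a type even though no label was supplied.
inherited = merged_ent_type(["ORG", "ORG"])

# The subtokens carry no entity type, so only an explicit label can set one.
unlabelled = merged_ent_type(["", ""])
labelled = merged_ent_type(["", ""], label="ORG")
```

Under these assumptions, the 'Morgan Stanley' case corresponds to `inherited`, and the 'State Street' case corresponds to `unlabelled` whenever the label never reaches the merge.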

Finally, label vs ent_type. The Span object is a labelled slice. It's a view, so you can have multiple overlapping spans, spans with different labels, etc. The underlying data is an array of structs, held by the Doc. All annotations are specified on the TokenC struct. One of these annotations is the entity type. When you do doc.ents, you're creating Span objects using the .ent_iob and .ent_type attributes on the tokens. The ent_type attribute is passed to the Span as the label.
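As a rough illustration of that relationship, the sketch below derives labelled spans from per-token annotations. It is plain Python, not spaCy's implementation: the `Token` class, the string IOB codes and `spans_from_tokens` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    ent_iob: str   # "B" (begins an entity), "I" (inside one) or "O" (outside)
    ent_type: str  # entity type, "" when the token is not in an entity


def spans_from_tokens(tokens):
    """Build (start, end, label) triples from token-level annotations."""
    spans = []
    start = None
    label = ""
    for i, tok in enumerate(tokens):
        continues = tok.ent_iob == "I" and start is not None
        # Close the open span when the current token does not continue it.
        if start is not None and not continues:
            spans.append((start, i, label))
            start = None
        # Open a new span; its label is copied from the token's ent_type.
        if tok.ent_iob == "B":
            start, label = i, tok.ent_type
    if start is not None:
        spans.append((start, len(tokens), label))
    return spans


tokens = [
    Token("Morgan", "B", "ORG"),
    Token("Stanley", "I", "ORG"),
    Token("fires", "O", ""),
    Token("Vice", "O", ""),
    Token("President", "O", ""),
]
ents = spans_from_tokens(tokens)
```

In this model the span label has no storage of its own. It is read off the tokens each time the spans are built, which is why the two values should agree whenever the token annotations are set.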

@dlmiyamoto

It doesn't seem that the label kwarg in the span.merge of the callback does anything. span.merge passes **attributes to doc.merge, but doc.merge doesn't seem to do anything with a label kwarg. Is this a result of outdated documentation?

@dlmiyamoto

Using *args instead of **attributes in the callback correctly labels the entity.

def merge_phrase(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge(span.text, "NNP" if label else span.root.tag_, nlp.vocab.strings[label])

@honnibal
Member

honnibal commented Feb 28, 2017

@dlmiyamoto Thanks for the analysis. Will comment here when I get to this. In the meantime, I think this might be an easy patch if someone else wants to have a look at it. It's just a matter of wiring up the keyword argument so it does what the positional arguments already do.
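A minimal sketch of what that wiring could look like, in plain Python rather than spaCy's actual source. `resolve_merge_attrs` is a hypothetical helper, and the positional order (tag, lemma, ent_type) is an assumption for illustration. Treating `label` as an alias for `ent_type` follows the wording of the commit that later referenced this issue.

```python
def resolve_merge_attrs(*args, **attributes):
    """Normalise merge arguments into one dict (hypothetical helper).

    Accepts either three positional values, assumed here to be
    (tag, lemma, ent_type), or keyword arguments only.
    """
    if len(args) == 3:
        tag, lemma, ent_type = args
        return {"tag": tag, "lemma": lemma, "ent_type": ent_type}
    if args:
        raise TypeError("expected 3 positional arguments or keywords only")

    resolved = {key: attributes[key] for key in ("tag", "lemma") if key in attributes}
    # Accept `label` as an alias so the keyword form matches the positional one.
    if "label" in attributes:
        resolved["ent_type"] = attributes["label"]
    elif "ent_type" in attributes:
        resolved["ent_type"] = attributes["ent_type"]
    return resolved


positional = resolve_merge_attrs("NNP", "Morgan Stanley", "ORG")
keyword = resolve_merge_attrs(label="ORG", tag="NNP")
```

With both call styles reduced to the same dictionary, the code that writes to the merged token only has to handle one shape of input.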

honnibal added the "bug" label (Bugs and behaviour differing from documentation) Feb 28, 2017
honnibal changed the title from "Adding entity labels via add_entity does not appear to be working?" to "label keyword argument in Span.merge has no effect" Feb 28, 2017
honnibal added the "help wanted (easy)" label (Contributions welcome! (also suited for spaCy beginners)) Feb 28, 2017
honnibal added a commit that referenced this issue Mar 30, 2017
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)
@honnibal
Member

Fixed by #935. Thanks Eric! 🎉

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018