Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matcher does not assign entity IDs to relevant spans. #615

Closed
aikramer2 opened this issue Nov 8, 2016 · 7 comments
Closed

Matcher does not assign entity IDs to relevant spans. #615

aikramer2 opened this issue Nov 8, 2016 · 7 comments
Labels
bug Bugs and behaviour differing from documentation usage General spaCy usage

Comments

@aikramer2
Copy link
Contributor

aikramer2 commented Nov 8, 2016

Related to #605 and #577

Passing a document through a custom matcher, the matcher indeed tags the relevant span as a entity, but the span comprising the entity itself does not have entity metadata associated with it (assertion 1 passes, assertion 2 fails). Not sure whether this is a bug exactly, as to my understanding the matcher is meant to have minimal side effects unless the user does something explicit. However, I'm not sure how to assign entity metadata to the appropriate span myself. I've tried adding a callback to the matcher to assign the entity manually but the set methods for the span don't seem to be allowed.

import spacy 
from spacy.attrs import ORTH
def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)

nlp = spacy.en.English()
text = u"""The golf club is broken"""
doc = nlp(text)

golf_pattern =     [ 
        { ORTH: "golf"},
        { ORTH: "club"}
    ]

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add_entity('Sport_Equipment'
                   ,on_match = merge_phrases
                  )
matcher.add_pattern("Sport_Equipment", golf_pattern, label = 'Sport_Equipment')

match = matcher(doc)
entities = list(doc.ents)

assert entities != [] #assertion 1
assert entities[0].label != 0 #assertion 2

Your Environment

  • Ubuntu
  • Python 2.7
  • spaCy 1.2.0
@aikramer2 aikramer2 changed the title Matcher does not assign entity IDs to relevant tokens. Matcher does not assign entity IDs to relevant spans. Nov 8, 2016
@aikramer2
Copy link
Contributor Author

As a related side note, I tested the merge phrases callback below, the label assignment does not seem to work:

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]

    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)
        assert span.label == label #assertion fails

@ines ines added the bug Bugs and behaviour differing from documentation label Nov 9, 2016
@honnibal honnibal added the usage General spaCy usage label Nov 10, 2016
@honnibal
Copy link
Member

Hmm.

The entity ID feature isn't yet 100% well thought out. There are a few complications.

One implementation detail to keep in mind is that spans don't "really" exist. They're views of slices of the document. We can have labels on the spans, and I guess we can extend this to having entity IDs on the spans as well. I'm worried that if we start writing stuff to the span object, things will quickly get confusing. Consider:

for span in doc.ents:
    span.ent_id = lookup_ent_id(span.text)

This isn't going to work. The .ents property is an iterator that generates the entities on the fly. We're not holding a list of spans here — Span object we just wrote to is going to go out-of-scope, and we'll be left wondering where our entity ID just went...

Do you think it's too limiting for entity IDs to stay a per-token property? If the entity ID is per token, we don't have this problem. You can always merge a span into a single token anyway.

@aikramer2
Copy link
Contributor Author

@honnibal In my view, there should be a story establishing entity id for the token resultant from merged span. The issue is that on the merge statement, there needs to be a set attribute method defined for taking the entity metadata passed to matcher and setting that to the merged token. Perhaps something like this should be able to set the entity_id data to the merged token:

span.merge(label=label, ent_id = ent_id)

in which case if i check out the resulting token, all of the entity attributes should be set
(ent_id, ent_id_, ent_type_, etc...)

What do you think?

@honnibal
Copy link
Member

I thought that already worked! It definitely should.

Actually merge should support all of the token attributes.

@aikramer2
Copy link
Contributor Author

aikramer2 commented Nov 10, 2016

If you run this on merge function within the matcher above:


def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        assert ent_id != 0L #assertion 1
        span.merge(label=label, tag='NNP' if label else span.root.tag_, ent_id = ent_id)
        assert span.root.ent_id != 0L #assertion 2

Assertion 1 passes, but assertion 2 does not, so it seems like merge isnt setting the attributes. To be sure, i also tried checked the ent_id of the resulting token after the matcher runs, and get the same result.

honnibal added a commit that referenced this issue Nov 23, 2016
@honnibal
Copy link
Member

Not sure precisely when this was fixed, but the test for it is there and it's been green for some time --- so, closing this.

@lock
Copy link

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation usage General spaCy usage
Projects
None yet
Development

No branches or pull requests

3 participants