Matcher does not assign entity IDs to relevant spans. #615

aikramer2 · 2016-11-08T20:31:07Z

Related to #605 and #577

Passing a document through a custom matcher, the matcher indeed tags the relevant span as a entity, but the span comprising the entity itself does not have entity metadata associated with it (assertion 1 passes, assertion 2 fails). Not sure whether this is a bug exactly, as to my understanding the matcher is meant to have minimal side effects unless the user does something explicit. However, I'm not sure how to assign entity metadata to the appropriate span myself. I've tried adding a callback to the matcher to assign the entity manually but the set methods for the span don't seem to be allowed.

import spacy 
from spacy.attrs import ORTH
def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)

nlp = spacy.en.English()
text = u"""The golf club is broken"""
doc = nlp(text)

golf_pattern =     [ 
        { ORTH: "golf"},
        { ORTH: "club"}
    ]

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add_entity('Sport_Equipment'
                   ,on_match = merge_phrases
                  )
matcher.add_pattern("Sport_Equipment", golf_pattern, label = 'Sport_Equipment')

match = matcher(doc)
entities = list(doc.ents)

assert entities != [] #assertion 1
assert entities[0].label != 0 #assertion 2

Your Environment

Ubuntu
Python 2.7
spaCy 1.2.0

The text was updated successfully, but these errors were encountered:

aikramer2 · 2016-11-08T21:08:04Z

As a related side note, I tested the merge phrases callback below, the label assignment does not seem to work:

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]

    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)
        assert span.label == label #assertion fails

honnibal · 2016-11-10T22:02:27Z

Hmm.

The entity ID feature isn't yet 100% well thought out. There are a few complications.

One implementation detail to keep in mind is that spans don't "really" exist. They're views of slices of the document. We can have labels on the spans, and I guess we can extend this to having entity IDs on the spans as well. I'm worried that if we start writing stuff to the span object, things will quickly get confusing. Consider:

for span in doc.ents:
    span.ent_id = lookup_ent_id(span.text)

This isn't going to work. The .ents property is an iterator that generates the entities on the fly. We're not holding a list of spans here — Span object we just wrote to is going to go out-of-scope, and we'll be left wondering where our entity ID just went...

Do you think it's too limiting for entity IDs to stay a per-token property? If the entity ID is per token, we don't have this problem. You can always merge a span into a single token anyway.

aikramer2 · 2016-11-10T23:01:01Z

@honnibal In my view, there should be a story establishing entity id for the token resultant from merged span. The issue is that on the merge statement, there needs to be a set attribute method defined for taking the entity metadata passed to matcher and setting that to the merged token. Perhaps something like this should be able to set the entity_id data to the merged token:

span.merge(label=label, ent_id = ent_id)

in which case if i check out the resulting token, all of the entity attributes should be set
(ent_id, ent_id_, ent_type_, etc...)

What do you think?

honnibal · 2016-11-10T23:04:49Z

I thought that already worked! It definitely should.

Actually merge should support all of the token attributes.

aikramer2 · 2016-11-10T23:20:54Z

If you run this on merge function within the matcher above:


def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        assert ent_id != 0L #assertion 1
        span.merge(label=label, tag='NNP' if label else span.root.tag_, ent_id = ent_id)
        assert span.root.ent_id != 0L #assertion 2

Assertion 1 passes, but assertion 2 does not, so it seems like merge isnt setting the attributes. To be sure, i also tried checked the ent_id of the resulting token after the matcher runs, and get the same result.

honnibal · 2017-03-31T11:10:21Z

Not sure precisely when this was fixed, but the test for it is there and it's been green for some time --- so, closing this.

lock · 2018-05-09T00:39:07Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

aikramer2 changed the title ~~Matcher does not assign entity IDs to relevant tokens.~~ Matcher does not assign entity IDs to relevant spans. Nov 8, 2016

ines added the bug Bugs and behaviour differing from documentation label Nov 9, 2016

honnibal added the usage General spaCy usage label Nov 10, 2016

honnibal added a commit that referenced this issue Nov 23, 2016

Work on test for #615

e01c187

ines added a commit that referenced this issue Jan 12, 2017

Fix regression test for #615 and remove unnecessary imports

51ef75f

honnibal closed this as completed Mar 31, 2017

lock bot locked as resolved and limited conversation to collaborators May 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matcher does not assign entity IDs to relevant spans. #615

Matcher does not assign entity IDs to relevant spans. #615

aikramer2 commented Nov 8, 2016 •

edited

Loading

aikramer2 commented Nov 8, 2016

honnibal commented Nov 10, 2016

aikramer2 commented Nov 10, 2016

honnibal commented Nov 10, 2016

aikramer2 commented Nov 10, 2016 •

edited

Loading

honnibal commented Mar 31, 2017

lock bot commented May 9, 2018

Matcher does not assign entity IDs to relevant spans. #615

Matcher does not assign entity IDs to relevant spans. #615

Comments

aikramer2 commented Nov 8, 2016 • edited Loading

Your Environment

aikramer2 commented Nov 8, 2016

honnibal commented Nov 10, 2016

aikramer2 commented Nov 10, 2016

honnibal commented Nov 10, 2016

aikramer2 commented Nov 10, 2016 • edited Loading

honnibal commented Mar 31, 2017

lock bot commented May 9, 2018

aikramer2 commented Nov 8, 2016 •

edited

Loading

aikramer2 commented Nov 10, 2016 •

edited

Loading