Possible index off by one in matches by the ZERO_PLUS operator #766

savkov · 2017-01-22T13:34:09Z

Hi,

@mbatchkarov and I found a bug in the matcher when using the ZERO_PLUS operator. There may also be an inconsistency in which matches are returned, though we are not sure about that. Take a look at the following code example:

from spacy.en import English
from spacy.matcher import Matcher
from spacy.attrs import ORTH

nlp = English()

matcher = Matcher(nlp.vocab)
matcher.add_pattern('KleenePhilippe', [{ORTH: 'Philippe', 'OP': '+'}])

doc = nlp('Philippe Philippe of Philippe.')

m = matcher(doc)

def print_matcher_output(doc, matches):
    # Each match is an (ent_id, label, start, end) tuple of token indices.
    for ent_id, label, start, end in matches:
        print(str(doc[start:end]))

print_matcher_output(doc, m)

Output:

>>> Philippe Philippe of
>>> Philippe of
>>> Philippe.

The obvious bug is in the indices returned with each match: every span extends one token past the text that actually matches. We are not sure whether the matcher reports a wrong index for a correct match or whether the match itself is wrong, but since each span simply includes whatever token follows the real match, a bad index seems more likely.
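To make the off-by-one reading concrete, here is a minimal sketch in plain Python (no spaCy needed). The token list and the `observed` spans are assumptions reconstructed from the printed output above, not values taken from the matcher itself, and `render` is a hypothetical helper.

```python
# Tokens as the example sentence is presumably split (assumed tokenization).
tokens = ['Philippe', 'Philippe', 'of', 'Philippe', '.']

# (start, end) spans inferred from the printed output, with an exclusive end.
observed = [(0, 3), (1, 3), (3, 5)]

def render(tokens, spans):
    """Join the tokens of each span with spaces (so '.' is shown detached)."""
    return [' '.join(tokens[start:end]) for start, end in spans]

# Hypothesis: the reported end index is one token too large.
shifted = [(start, end - 1) for start, end in observed]

print(render(tokens, observed))  # ['Philippe Philippe of', 'Philippe of', 'Philippe .']
print(render(tokens, shifted))   # ['Philippe Philippe', 'Philippe', 'Philippe']
```

If the hypothesis holds, pulling each end index back by one leaves only spans made of 'Philippe' tokens.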

Apart from the index, it is not quite clear what the behaviour of the ZERO_PLUS operator should be. In the case above we see two interpretations:

['Philippe Philippe', 'Philippe'] to match a greedy matching behaviour (like re.findall('(P+)', 'PP of P')),
['Philippe', 'Philippe Philippe', 'Philippe', 'Philippe'] to produce all possible matches consistent with how matches from different rules behave.

It is not clear what the logic of the current output is, so maybe it's just the manifestation of another bug.
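The two interpretations can be written down in plain Python to show exactly which spans each one would return. This is only a sketch of the expected behaviours, not of the matcher; `greedy_spans`, `all_spans` and `texts` are hypothetical helper names.

```python
import re

tokens = ['Philippe', 'Philippe', 'of', 'Philippe']
TARGET = 'Philippe'

def greedy_spans(tokens, target):
    """Maximal runs of `target`, like a greedy regex with non-overlapping matches."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i] != target:
            i += 1
            continue
        j = i
        while j < len(tokens) and tokens[j] == target:
            j += 1
        spans.append((i, j))
        i = j
    return spans

def all_spans(tokens, target):
    """Every contiguous run of one or more `target` tokens, overlaps included."""
    spans = []
    for start in range(len(tokens)):
        end = start
        while end < len(tokens) and tokens[end] == target:
            end += 1
            spans.append((start, end))
    return spans

def texts(tokens, spans):
    return [' '.join(tokens[start:end]) for start, end in spans]

print(texts(tokens, greedy_spans(tokens, TARGET)))
# ['Philippe Philippe', 'Philippe']
print(texts(tokens, all_spans(tokens, TARGET)))
# ['Philippe', 'Philippe Philippe', 'Philippe', 'Philippe']
print(re.findall('(P+)', 'PP of P'))
# ['PP', 'P']
```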

Here is another test case, which returns no matches at all:

matcher = Matcher(nlp.vocab)
matcher.add_pattern('KleenePhilippe', [{ORTH: 'Philippe', 'OP':'+'}], label=321)

doc = nlp('Philippe Philippe')

m = matcher(doc)

print(m)

Output:

[]


honnibal · 2017-02-24T13:32:52Z

Hi,

Sorry for the delay getting to this. Two issues here:

There was a bug in the matcher that meant that patterns ending with "optional" items that could be filled at the end of the string failed to match. I've fixed this (although the fix is a little under-tested, which makes me nervous).
The '+' is implemented as a sequence of operators: ONE, ZERO_PLUS. The ZERO_PLUS operator isn't greedy, so you'd get a length-2 match. I agree this isn't great. I've exposed the ONE operator with the op string '1', to give better control of these things. It'd be nice to have a more satisfying system here.
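As a rough illustration of the two points above, here is a toy model in plain Python. It is not spaCy's matcher code: `expand`, `closure` and `find_matches` are hypothetical names, patterns are simplified to (word, op) pairs, and the sketch lists every accepted span instead of reproducing the non-greedy selection. It does show '+' being expanded to ONE followed by ZERO_PLUS, and a pattern whose trailing optional state is still open at the end of the document being accepted.

```python
ONE, ZERO_PLUS = 'ONE', 'ZERO_PLUS'

def expand(pattern):
    """Turn op strings into primitive states; '+' becomes ONE then ZERO_PLUS."""
    states = []
    for word, op in pattern:
        if op == '1':
            states.append((word, ONE))
        elif op == '*':
            states.append((word, ZERO_PLUS))
        elif op == '+':
            states.extend([(word, ONE), (word, ZERO_PLUS)])
        else:
            raise ValueError('unsupported op: %r' % op)
    return states

def closure(states, active):
    """Add the states reachable by skipping optional (ZERO_PLUS) states."""
    result, stack = set(active), list(active)
    while stack:
        i = stack.pop()
        if i < len(states) and states[i][1] == ZERO_PLUS and i + 1 not in result:
            result.add(i + 1)
            stack.append(i + 1)
    return result

def find_matches(tokens, pattern):
    """Return every (start, end) span the pattern accepts, end exclusive."""
    states = expand(pattern)
    final = len(states)
    spans = []
    for start in range(len(tokens)):
        active = closure(states, {0})
        for end in range(start + 1, len(tokens) + 1):
            token = tokens[end - 1]
            stepped = set()
            for i in active:
                if i < final and states[i][0] == token:
                    # ZERO_PLUS may consume again; ONE moves to the next state.
                    stepped.add(i if states[i][1] == ZERO_PLUS else i + 1)
            active = closure(states, stepped)
            if not active:
                break
            if final in active:
                spans.append((start, end))
    return spans

print(find_matches(['Philippe', 'Philippe'], [('Philippe', '+')]))
# [(0, 1), (0, 2), (1, 2)]
```

In this model the span (0, 2) runs to the end of the document while the ZERO_PLUS state is still open, which corresponds to the 'Philippe Philippe' case that returned an empty list in the report.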

Matt

lock · 2018-05-09T02:38:55Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "bug" label (Bugs and behaviour differing from documentation) Jan 22, 2017

honnibal mentioned this issue Feb 24, 2017

matcher doesn't match with '*' operator and Boolean flag #850

Closed

honnibal added a commit that referenced this issue Feb 24, 2017

Add 1 operator to matcher, and make sure open patterns are closed at end of document. Closes Issue #766 (8f94897)

honnibal closed this as completed Feb 24, 2017

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
