Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible index off by one in matches by the ZERO_PLUS operator #766

Closed
savkov opened this issue Jan 22, 2017 · 2 comments
Closed

Possible index off by one in matches by the ZERO_PLUS operator #766

savkov opened this issue Jan 22, 2017 · 2 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@savkov
Copy link
Contributor

savkov commented Jan 22, 2017

Hi,

@mbatchkarov and I found a bug in the matcher when using a ZERO_PLUS operator. There is also a possible inconsistency in the matches, which may or may not be true. Take a look at the following code example:

from spacy.en import English
from spacy.matcher import Matcher
from spacy.attrs import ORTH

nlp = English()

matcher = Matcher(nlp.vocab)
matcher.add_pattern('KleenePhilippe', [{ORTH: 'Philippe', 'OP': '+'}])

doc = nlp('Philippe Philippe of Philippe.')

m = matcher(doc)

def print_matcher_output(m):
    for ent_id, label, start, end in m:
        print(str(doc[start:end]))

print_matcher_output(m)

Output:

>>> Philippe Philippe of
>>> Philippe of
>>> Philippe.

The obvious bug is related to the index that is passed to the list of matches. We are not sure if this is due to a faulty index passed by the matcher or by a faulty match. The fact that it matches any token after what is the match means it is probably a bad index.

Apart from the index, it is not quite clear what the behaviour of the ZERO_PLUS operator should be. In the case above we see two interpretations:

  1. ['Philippe Philippe', 'Philippe'] to match a greedy matching behaviour (like re.findall('(P+)', 'PP of P')),
  2. ['Philippe', 'Philippe Philippe', 'Philippe', 'Philippe'] to produce all possible matches consistent with how matches from different rules behave.

It is not clear what the logic of the current output is, so maybe it's just the manifestation of another bug.

Here is another test case that doesn't work at all:

matcher = Matcher(nlp.vocab)
matcher.add_pattern('KleenePhilippe', [{ORTH: 'Philippe', 'OP':'+'}], label=321)

doc = nlp('Philippe Philippe')

m = matcher(doc)

print(m)

Output:

[]
@ines ines added the bug Bugs and behaviour differing from documentation label Jan 22, 2017
honnibal added a commit that referenced this issue Feb 24, 2017
@honnibal
Copy link
Member

Hi,

Sorry for the delay getting to this. Two issues here:

  1. There was a bug in the matcher that meant that patterns ending with "optional" items that could be filled at the end of the string failed to match. I've fixed this (although the fix is a little under-tested, which makes me nervous)

  2. The '+' is implemented as a sequence of operators: ONE, ZERO_PLUS. The ZERO_PLUS operator isn't greedy, so you'd get a length-2 match. I agree this isn't great. I've exposed the ONE operator with the op string '1', to give better control of these things. It'd be nice to have a more satisfying system here.

Matt

@lock
Copy link

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation
Projects
None yet
Development

No branches or pull requests

3 participants