
Tokenisation output and tokenizer.explain is inconsistent #9136

Closed
delzac opened this issue Sep 4, 2021 · 10 comments · Fixed by #9155
Labels
bug (Bugs and behaviour differing from documentation) · feat / tokenizer (Feature: Tokenizer)

Comments

@delzac
Contributor

delzac commented Sep 4, 2021

How to reproduce the behaviour

import en_core_web_sm
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = en_core_web_sm.load()

prefixes = ['a(?=.)']
# clash between 2 valid regex causes inconsistent tokenisation from nlp() and nlp.tokenizer.explain()
suffixes = [r'(?<=\w)\.$', r'(?<=a)\d+\.']
# suffixes = [r'(?<=\w)\.$']

prefixes_re = compile_prefix_regex(prefixes)
suffixes_re = compile_suffix_regex(suffixes)

nlp.tokenizer.prefix_search = prefixes_re.search
nlp.tokenizer.suffix_search = suffixes_re.search

a = 'a10.'
tokens = [t.text for t in nlp(a)]
tokens_w_explanation = nlp.tokenizer.explain(a)

print(tokens)                   # ['a', '10.']
print(tokens_w_explanation)     # [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]

assert len(tokens) == 2
assert len(tokens_w_explanation) == 3
assert len(tokens) != len(tokens_w_explanation)

Output from print

>>> ['a', '10.']
>>> [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]

Output from explanation is different from actual tokenised result.

Your Environment

  • Operating System: Windows 10
  • Python Version Used: Python 3.6
  • spaCy Version Used: spacy 3.0.6
  • Environment Information: nil
@polm added the bug (Bugs and behaviour differing from documentation) and feat / tokenizer (Feature: Tokenizer) labels on Sep 4, 2021
@polm
Contributor

polm commented Sep 4, 2021

Confirmed the output is the same in 3.1.2. Thanks for the report!

See also #7694.

@adrianeboyd
Contributor

Yes, thanks for the report! Here I think that the explain output is the intended output and there's a bug in the main tokenizer algorithm related to a suffix that overlaps with a prefix that should have already been split off.

So far we'd only had the case where explain was incorrect (since it's a completely separate implementation of the same algorithm), so this is unexpected!

Since this will potentially change the tokenizer output for the same stored settings, we'll aim to fix it in v3.2.0.

@delzac
Contributor Author

delzac commented Sep 6, 2021

@adrianeboyd To clarify, based on my testing, if either one of the two suffix regexes is removed, then nlp() and explain() produce the same output.

So my conclusion was that clashes between two valid suffixes trigger this behaviour.

@adrianeboyd
Contributor

I may even be getting myself mixed up here about the regex behavior, but it looks like the problem is what happens right after a prefix is recognized, and that's where the main algorithm and explain differ.

What happens currently in the main tokenizer:

  • a matches as a prefix (and the token string is not modified yet)
  • 10. matches as a suffix in the token string that still contains a
  • a and 10. are processed as a valid prefix and a valid suffix

What happens in explain:

  • a matches as a prefix and is lopped off
  • . matches as a suffix since there's no a for the second suffix pattern to match, and then it's also lopped off
  • the rest 10 is a token

I think that the explain behavior is actually what we want, since the idea is that the prefix is removed before it moves on to the suffix patterns. I'll double-check within the spaCy team to be sure (we want to be extremely careful with tokenizer modifications because they affect so many users), but if so, it's a one-line change in the end.
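
For illustration, here is a minimal single-pass sketch in plain Python of the two orderings described above. It is a simplification for this one example (not spaCy's actual implementation), using the patterns from the report.

import re

# Simplified stand-ins for the compiled prefix/suffix regexes from the report
# (prefix anchored at the start, each suffix piece anchored at the end).
prefix_re = re.compile(r'^a(?=.)')
suffix_re = re.compile(r'(?<=\w)\.$|(?<=a)\d+\.$')

def split_like_tokenizer(text):
    # Buggy order: the suffix search runs on the string that still contains the prefix.
    prefix_m = prefix_re.search(text)
    prefix = prefix_m.group() if prefix_m else ''
    suffix_m = suffix_re.search(text)   # still sees the leading 'a', so '10.' matches
    suffix = suffix_m.group() if suffix_m else ''
    middle = text[len(prefix):len(text) - len(suffix)]
    return [piece for piece in (prefix, middle, suffix) if piece]

def split_like_explain(text):
    # Intended order: the prefix is lopped off before the suffix patterns are applied.
    prefix_m = prefix_re.search(text)
    prefix = prefix_m.group() if prefix_m else ''
    rest = text[len(prefix):]
    suffix_m = suffix_re.search(rest)   # no leading 'a' left, so only '.' matches
    suffix = suffix_m.group() if suffix_m else ''
    middle = rest[:len(rest) - len(suffix)]
    return [piece for piece in (prefix, middle, suffix) if piece]

print(split_like_tokenizer('a10.'))   # ['a', '10.']
print(split_like_explain('a10.'))     # ['a', '10', '.']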

@delzac
Contributor Author

delzac commented Sep 6, 2021

Definitely agree that the explain behaviour is what we want. Thanks for taking the time to explain what's happening under the hood! :)

My work is very sensitive to tokenisation too, so I definitely hope this is nothing major.

@delzac
Contributor Author

delzac commented Sep 6, 2021

@adrianeboyd I have updated the test to make sure that I communicated clearly.

import en_core_web_sm
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = en_core_web_sm.load()

text = 'a10.'
suffix1 = r'(?<=\w)\.$'
suffix2 = r'(?<=a)\d+\.'
prefixes = ['a(?=.)']

prefixes_re = compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefixes_re.search

suffixes_re = compile_suffix_regex([suffix1, suffix2])
nlp.tokenizer.suffix_search = suffixes_re.search

# both suffix1 and suffix2 present
assert len(nlp(text)) == 2                      # ['a', '10.']
assert len(nlp.tokenizer.explain(text)) == 3    # [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]

suffixes_re = compile_suffix_regex([suffix1])
nlp.tokenizer.suffix_search = suffixes_re.search

# suffix1 is present but NOT suffix2
assert len(nlp(text)) == 3                      # ['a', '10', '.']
assert len(nlp.tokenizer.explain(text)) == 3    # [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]


suffixes_re = compile_suffix_regex([suffix2])
nlp.tokenizer.suffix_search = suffixes_re.search

# suffix1 is NOT present but suffix2 is present
assert len(nlp(text)) == 2                      # ['a', '10.']
assert len(nlp.tokenizer.explain(text)) == 2    # [('PREFIX', 'a'), ('TOKEN', '10.')]

What you said about the overlap between suffix2 and the prefix is right: it does trigger a wrong tokenisation (but nlp() and explain() are consistent in this case).

Using suffix1 together with suffix2 is what causes the inconsistency between nlp() and explain().

@adrianeboyd
Contributor

You can't see the difference in the resulting tokens, but the internal difference in the final case is that the tokenizer thinks that 10. is a suffix while explain thinks it's a token. If another suffix pattern is added, explain will keep looking for more suffixes but the tokenizer won't.
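
To make that internal difference visible, here is a similarly simplified labelled sketch for the suffix2-only case (again my own illustration under the same assumptions, not spaCy internals).

import re

# suffix2 only, as in the final case above; assumes the prefix always matches.
prefix_re = re.compile(r'^a(?=.)')
suffix2_re = re.compile(r'(?<=a)\d+\.$')

def label_like_tokenizer(text):
    # Suffix matched against the untrimmed string, so '10.' is recorded as a SUFFIX.
    prefix = prefix_re.search(text).group()
    suffix_m = suffix2_re.search(text)
    suffix = suffix_m.group() if suffix_m else ''
    middle = text[len(prefix):len(text) - len(suffix)]
    pieces = [('PREFIX', prefix)]
    if middle:
        pieces.append(('TOKEN', middle))
    if suffix:
        pieces.append(('SUFFIX', suffix))
    return pieces

def label_like_explain(text):
    # Prefix removed first, so suffix2 no longer matches and '10.' stays a TOKEN.
    prefix = prefix_re.search(text).group()
    rest = text[len(prefix):]
    suffix_m = suffix2_re.search(rest)
    suffix = suffix_m.group() if suffix_m else ''
    middle = rest[:len(rest) - len(suffix)]
    pieces = [('PREFIX', prefix)]
    if middle:
        pieces.append(('TOKEN', middle))
    if suffix:
        pieces.append(('SUFFIX', suffix))
    return pieces

print(label_like_tokenizer('a10.'))   # [('PREFIX', 'a'), ('SUFFIX', '10.')]
print(label_like_explain('a10.'))     # [('PREFIX', 'a'), ('TOKEN', '10.')]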

adrianeboyd linked a pull request on Sep 6, 2021 that will close this issue
@delzac
Contributor Author

delzac commented Sep 6, 2021

Got it, thanks for attending to this and explaining it to me! :)

@adrianeboyd
Contributor

Not sure why, but GitHub is sometimes a bit flaky about closing linked issues from pull requests. This has been fixed by #9155 and will be in v3.2.0.

@github-actions
Contributor

github-actions bot commented Dec 3, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators on Dec 3, 2021