
Tokenisation output and tokenizer.explain is inconsistent #9136

Closed
delzac opened this issue Sep 4, 2021 · 10 comments · Fixed by #9155
Labels
bug (Bugs and behaviour differing from documentation) · feat / tokenizer (Feature: Tokenizer)

Comments

@delzac
Contributor

delzac commented Sep 4, 2021

How to reproduce the behaviour

import en_core_web_sm
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = en_core_web_sm.load()

prefixes = ['a(?=.)']
# clash between 2 valid regex causes inconsistent tokenisation from nlp() and nlp.tokenizer.explain()
suffixes = [r'(?<=\w)\.$', r'(?<=a)\d+\.']
# suffixes = [r'(?<=\w)\.$']

prefixes_re = compile_prefix_regex(prefixes)
suffixes_re = compile_suffix_regex(suffixes)

nlp.tokenizer.prefix_search = prefixes_re.search
nlp.tokenizer.suffix_search = suffixes_re.search

a = 'a10.'
tokens = [t.text for t in nlp(a)]
tokens_w_explanation = nlp.tokenizer.explain(a)

print(tokens)                   # ['a', '10.']
print(tokens_w_explanation)     # [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]

assert len(tokens) == 2
assert len(tokens_w_explanation) == 3
assert len(tokens) != len(tokens_w_explanation)

Output from print

>>> ['a', '10.']
>>> [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]

Output from explanation is different from actual tokenised result.

Your Environment

  • Operating System: Windows 10
  • Python Version Used: Python 3.6
  • spaCy Version Used: spacy 3.0.6
  • Environment Information: nil
@polm added the bug (Bugs and behaviour differing from documentation) and feat / tokenizer (Feature: Tokenizer) labels on Sep 4, 2021
@polm
Contributor

polm commented Sep 4, 2021

Confirmed the output is the same in 3.1.2. Thanks for the report!

See also #7694.

@adrianeboyd
Contributor

Yes, thanks for the report! Here I think that the explain output is the intended output and there's a bug in the main tokenizer algorithm related to a suffix that overlaps with a prefix that should have already been split off.

So far we'd only had the case where explain was incorrect (since it's a completely separate implementation of the same algorithm), so this is unexpected!

Since this will potentially change the tokenizer output for the same stored settings, we'll aim to fix it in v3.2.0.

@delzac
Contributor Author

delzac commented Sep 6, 2021

@adrianeboyd To clarify, based on my testing, if either one of the two suffix regexes is removed, then nlp() and explain() produce the same output.

So my conclusion was that clashes between two valid suffixes trigger this behaviour.

@adrianeboyd
Contributor

I may even be getting myself mixed up here about the regex behavior, but it looks like the problem is what happens right after a prefix is recognized, and that's where the main algorithm and explain differ.

What happens currently in the main tokenizer:

  • a matches as a prefix (and the token string is not modified yet)
  • 10. matches as a suffix in the token string that still contains a
  • a and 10. are processed as a valid prefix and a valid suffix

What happens in explain:

  • a matches as a prefix and is lopped off
  • . matches as a suffix since there's no a for the second suffix pattern to match, and then it's also lopped off
  • the rest 10 is a token

I think that the explain behavior is actually what we want, since the idea is that the prefix is removed before it moves on to the suffix patterns. I'll double-check within the spaCy team to be sure (we want to be extremely careful with tokenizer modifications because they affect so many users), but if so, it's a one-line change in the end.
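
For illustration, here is a minimal single-pass sketch in plain Python of the two orderings described above. It is a simplification for this one example (not spaCy's actual implementation), using the patterns from the report.

import re

# Simplified stand-ins for the compiled prefix/suffix regexes from the report
# (prefix anchored at the start, each suffix piece anchored at the end).
prefix_re = re.compile(r'^a(?=.)')
suffix_re = re.compile(r'(?<=\w)\.$|(?<=a)\d+\.$')

def split_like_tokenizer(text):
    # Buggy order: the suffix search runs on the string that still contains the prefix.
    prefix_m = prefix_re.search(text)
    prefix = prefix_m.group() if prefix_m else ''
    suffix_m = suffix_re.search(text)   # still sees the leading 'a', so '10.' matches
    suffix = suffix_m.group() if suffix_m else ''
    middle = text[len(prefix):len(text) - len(suffix)]
    return [piece for piece in (prefix, middle, suffix) if piece]

def split_like_explain(text):
    # Intended order: the prefix is lopped off before the suffix patterns are applied.
    prefix_m = prefix_re.search(text)
    prefix = prefix_m.group() if prefix_m else ''
    rest = text[len(prefix):]
    suffix_m = suffix_re.search(rest)   # no leading 'a' left, so only '.' matches
    suffix = suffix_m.group() if suffix_m else ''
    middle = rest[:len(rest) - len(suffix)]
    return [piece for piece in (prefix, middle, suffix) if piece]

print(split_like_tokenizer('a10.'))   # ['a', '10.']
print(split_like_explain('a10.'))     # ['a', '10', '.']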

@delzac
Contributor Author

delzac commented Sep 6, 2021

Definitely agree that the explain behaviour is what we want. Thanks for taking the time to explain what's happening under the hood! :)

My work is very sensitive to tokenisation too, so I definitely hope this is nothing major.

@delzac
Contributor Author

delzac commented Sep 6, 2021

@adrianeboyd I have updated the test to make sure that I communicated clearly.

import en_core_web_sm
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = en_core_web_sm.load()

text = 'a10.'
suffix1 = r'(?<=\w)\.$'
suffix2 = r'(?<=a)\d+\.'
prefixes = ['a(?=.)']

prefixes_re = compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefixes_re.search

suffixes_re = compile_suffix_regex([suffix1, suffix2])
nlp.tokenizer.suffix_search = suffixes_re.search

# both suffix1 and suffix2 present
assert len(nlp(text)) == 2                      # ['a', '10.']
assert len(nlp.tokenizer.explain(text)) == 3    # [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]

suffixes_re = compile_suffix_regex([suffix1])
nlp.tokenizer.suffix_search = suffixes_re.search

# suffix1 is present but NOT suffix2
assert len(nlp(text)) == 3                      # ['a', '10', '.']
assert len(nlp.tokenizer.explain(text)) == 3    # [('PREFIX', 'a'), ('TOKEN', '10'), ('SUFFIX', '.')]


suffixes_re = compile_suffix_regex([suffix2])
nlp.tokenizer.suffix_search = suffixes_re.search

# suffix1 is NOT present but suffix2 is present
assert len(nlp(text)) == 2                      # ['a', '10.']
assert len(nlp.tokenizer.explain(text)) == 2    # [('PREFIX', 'a'), ('TOKEN', '10.')]

What you said about the overlap between suffix2 and the prefix is right: it does trigger a wrong tokenisation (but nlp() and explain() are consistent in this case).

Using suffix1 together with suffix2 is what causes the inconsistency between nlp() and explain().

@adrianeboyd
Contributor

You can't see the difference in the resulting tokens, but the internal difference in the final case is that the tokenizer thinks that 10. is a suffix while explain thinks it's a token. If another suffix pattern is added, explain will keep looking for more suffixes but the tokenizer won't.
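
To make that internal difference visible, here is a similarly simplified labelled sketch for the suffix2-only case (again my own illustration under the same assumptions, not spaCy internals).

import re

# suffix2 only, as in the final case above; assumes the prefix always matches.
prefix_re = re.compile(r'^a(?=.)')
suffix2_re = re.compile(r'(?<=a)\d+\.$')

def label_like_tokenizer(text):
    # Suffix matched against the untrimmed string, so '10.' is recorded as a SUFFIX.
    prefix = prefix_re.search(text).group()
    suffix_m = suffix2_re.search(text)
    suffix = suffix_m.group() if suffix_m else ''
    middle = text[len(prefix):len(text) - len(suffix)]
    pieces = [('PREFIX', prefix)]
    if middle:
        pieces.append(('TOKEN', middle))
    if suffix:
        pieces.append(('SUFFIX', suffix))
    return pieces

def label_like_explain(text):
    # Prefix removed first, so suffix2 no longer matches and '10.' stays a TOKEN.
    prefix = prefix_re.search(text).group()
    rest = text[len(prefix):]
    suffix_m = suffix2_re.search(rest)
    suffix = suffix_m.group() if suffix_m else ''
    middle = rest[:len(rest) - len(suffix)]
    pieces = [('PREFIX', prefix)]
    if middle:
        pieces.append(('TOKEN', middle))
    if suffix:
        pieces.append(('SUFFIX', suffix))
    return pieces

print(label_like_tokenizer('a10.'))   # [('PREFIX', 'a'), ('SUFFIX', '10.')]
print(label_like_explain('a10.'))     # [('PREFIX', 'a'), ('TOKEN', '10.')]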

adrianeboyd linked a pull request on Sep 6, 2021 that will close this issue
@delzac
Contributor Author

delzac commented Sep 6, 2021

Got it, thanks for attending to this and explaining it to me! :)

@adrianeboyd
Contributor

Not sure why, but GitHub is sometimes a bit flaky about closing linked issues from pull requests. This has been fixed by #9155 and will be in v3.2.0.

@github-actions
Contributor

github-actions bot commented Dec 3, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators on Dec 3, 2021