Infixes Update Not Applying Properly to Tokenizer #13779

Rayan-Allali · 2025-03-26T08:58:49Z

Rayan-Allali
Mar 26, 2025

Title: Infixes Update Not Applying Properly to Tokenizer

Description

I tried updating the infix patterns in spaCy, but the changes are not applying correctly to the tokenizer. Specifically, I'm trying to modify how apostrophes and other symbols ( ') are handled. However, even after setting a new regex, the tokenizer does not reflect these changes.

Steps to Reproduce

Here are the two approaches I tried:

1️⃣ Removing apostrophe-related rules from infixes and recompiling:

default_infixes = [pattern for pattern in nlp.Defaults.infixes if "'" not in pattern]
infix_re = compile_infix_regex(default_infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

Issue: Even after modifying the infix rules, contractions like "can't" still split incorrectly.

2️⃣ Manually adding new infix rules (including hyphens, plus signs, and dollar signs):

infixes = nlp.Defaults.infixes + [r"'",]  
infixe_regex = spacy.util.compile_infix_regex(infixes)  
nlp.tokenizer.infix_finditer = infixe_regex.finditer

Expected Behavior

The tokenizer should correctly apply the new infix rules.

Actual Behavior

Changes to nlp.tokenizer.infix_finditer do not seem to take effect.

Question

Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?

Thanks for your help!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Infixes Update Not Applying Properly to Tokenizer #13779

{{title}}

Replies: 0 comments

Select a reply

Infixes Update Not Applying Properly to Tokenizer #13779

Rayan-Allali Mar 26, 2025

Title: Infixes Update Not Applying Properly to Tokenizer

Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Question

Replies: 0 comments

Rayan-Allali
Mar 26, 2025