Infixes Update Not Applying Properly to Tokenizer #13779
Unanswered · Rayan-Allali asked this question in Help: Coding & Implementations
Description
I tried updating the infix patterns in spaCy, but the changes are not applying correctly to the tokenizer. Specifically, I'm trying to modify how apostrophes and other symbols (`'`) are handled. However, even after setting a new regex, the tokenizer does not reflect these changes.

Steps to Reproduce
Here are the two approaches I tried:
1️⃣ Removing apostrophe-related rules from `infixes` and recompiling (see the first sketch below). Issue: even after modifying the infix rules, contractions like `"can't"` still split incorrectly.
2️⃣ Manually adding new infix rules (including hyphens, plus signs, and dollar signs), as in the second sketch below.
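Likewise for approach 2, a sketch assuming a single character class covering hyphen, plus, and dollar (the actual patterns were not preserved):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")  # assumed model

# Extend the defaults with one extra pattern matching -, +, or $
# (the exact regex is an assumption based on the description above).
custom_infixes = list(nlp.Defaults.infixes) + [r"[-+$]"]

infix_re = compile_infix_regex(custom_infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

print([t.text for t in nlp("a cost-effective $5 deal")])
```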
Expected Behavior
The tokenizer should pick up the modified infix patterns, e.g. `"can't"` should no longer be split at the apostrophe, and the newly added symbols should be treated as infixes.
Actual Behavior
Changes assigned to `nlp.tokenizer.infix_finditer` do not seem to take effect; the tokenizer output is unchanged.

Question
Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
Thanks for your help!