Improve basic tokenizer #320

jenstroeger · 2024-06-02T18:34:23Z

Environment

Python 3.10.14 (main, Mar 21 2024, 01:58:23) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin

Steps to Reproduce

>>> list(enchant.tokenize.basic_tokenize("He’d said, “Hello”"))
[('He’d', 0), ('said', 5), ('“Hello”', 11)]

Expected behavior

>>> list(enchant.tokenize.basic_tokenize("He’d said, “Hello”"))
[('He’d', 0), ('said', 5), ('Hello', 12)]

Additional context

The tokenizer uses these characters when l/rstripping a word:

pyenchant/enchant/tokenize/__init__.py

Lines 283 to 285 in 1b7b059

    # Chars to remove from start/end of words
    strip_from_start = '"' + "'`(["
    strip_from_end = '"' + "'`]).!,?;:"

I think the typographic (curly) quotes should be stripped as well: “ and ‘ from the start of a word, and ” and ’ from the end.
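To illustrate the idea, here is a minimal standalone sketch. It does not use pyenchant's actual code: `strip_token` and `tokenize_sketch` are hypothetical names made up for this example, the whitespace splitting is a simplification, and only the two ASCII strip sets are copied from the snippet quoted above. The curly-quote additions are the proposed change, not existing behavior.

```python
import re

# ASCII strip sets, copied from the snippet quoted in this issue.
STRIP_FROM_START = '"' + "'`(["
STRIP_FROM_END = '"' + "'`]).!,?;:"

# Proposed additions (assumption): typographic opening and closing quotes.
OPENING_QUOTES = "\u201c\u2018"  # “ ‘
CLOSING_QUOTES = "\u201d\u2019"  # ” ’


def strip_token(word, pos):
    """Hypothetical helper: strip punctuation from both ends of a word.

    Returns the stripped word and its offset, adjusted for any
    characters removed from the start.
    """
    stripped = word.lstrip(STRIP_FROM_START + OPENING_QUOTES)
    pos += len(word) - len(stripped)
    stripped = stripped.rstrip(STRIP_FROM_END + CLOSING_QUOTES)
    return stripped, pos


def tokenize_sketch(text):
    """Simplified stand-in for a basic tokenizer: split on whitespace, then strip."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        word, pos = strip_token(match.group(), match.start())
        if word:  # skip chunks that were punctuation only
            tokens.append((word, pos))
    return tokens


print(tokenize_sketch("He\u2019d said, \u201cHello\u201d"))
```

With these sets the apostrophe inside He’d is kept, because only leading and trailing characters are stripped. One trade-off to be aware of: a trailing ’ (as in a plural possessive) would be removed, just as a trailing ASCII ' already is.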

Happy to provide a PR.
