
documentation of the pattern parameter in pre_tokenizers.Split is incorrect #1565

Closed
craigschmidt opened this issue Jul 10, 2024 · 1 comment · Fixed by #1591
Labels
documentation Improvements or additions to documentation

Comments


craigschmidt commented Jul 10, 2024

The documentation for pre_tokenizers.Split states:

pattern (str or Regex) — A pattern used to split the string. Usually a string or a a regex built with tokenizers.Regex

However, this is incorrect: a str does not work with tokenizers 0.19.1. The following example demonstrates the problem:

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split
from tokenizers import Regex

# similar issue to https://github.com/huggingface/tokenizers/pull/1264
# but the documentation is still incorrect as a string does not work

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

tokenizer_bad = Tokenizer.from_pretrained("bert-base-uncased")
# the documentation says pattern can be a string or `tokenizers.Regex` object
# but the string doesn't work
# have similar issues for other `behavior` values
tokenizer_bad.pre_tokenizer = Split(pattern=GPT2_SPLIT_PATTERN, behavior="isolated")  

pretokenized_output_bad = tokenizer_bad.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("incorrect:")
print(pretokenized_output_bad)
print()

# incorrect:
# [('This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script', (0, 63))]

# Regex seems to be undocumented; use a descriptive name rather than `re`
# to avoid shadowing the stdlib re module
split_regex = Regex(GPT2_SPLIT_PATTERN)

tokenizer_good = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer_good.pre_tokenizer = Split(pattern=split_regex, behavior="isolated")

pretokenized_output_good = tokenizer_good.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("correct:")
print(pretokenized_output_good)

# correct:
# [('This', (0, 4)), (' is', (4, 7)), (' a', (7, 9)), (' --', (9, 12)), (' -', (12, 14)), (' 0', (14, 16)), (' 00', (16, 19)), (' 0', (19, 21)), (' 0', (21, 23)), (' 13', (23, 26)), ('##$', (26, 29)), ('2', (29, 30)), ('#', (30, 31)), ('6', (31, 32)), ('klwt', (32, 36)), (' gtek', (36, 41)), (' jhrthr', (41, 48)), (' testing', (48, 56)), (' script', (56, 63))]

Can you please update the documentation to state that the pattern must be a tokenizers.Regex object, and give an example, as above, of how that can be done? It would also be good to document tokenizers.Regex somewhere. Another alternative would be to update the code so that it works with a str value.
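
For context, the str form does not appear to be simply ignored: a plain str seems to be matched as a literal delimiter rather than compiled as a regex, which would explain why the GPT-2 pattern above produces no splits at all (its literal text never occurs in the input). A minimal sketch, assuming this literal interpretation holds:

from tokenizers.pre_tokenizers import Split

# A plain str splits on its exact literal text, so a single space works
# as a delimiter even though the regex pattern string above matched nothing.
literal = Split(pattern=" ", behavior="removed")
print(literal.pre_tokenize_str("This is a test"))
# [('This', (0, 4)), ('is', (5, 7)), ('a', (8, 9)), ('test', (10, 14))]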

Note that PR #1264 improved the documentation, but it is still incorrect and confusing.

Also, while you're at it, the documentation for behavior could use the example from the Rust docs here.
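
For reference, that Rust example splits "the-final--countdown" on the literal delimiter "-" under each behavior. A rough Python rendering (assuming the binding accepts the lowercase behavior names, as "isolated" above suggests):

from tokenizers.pre_tokenizers import Split

# Reproduce the SplitDelimiterBehavior example from the Rust docs.
for behavior in ["removed", "isolated", "merged_with_previous",
                 "merged_with_next", "contiguous"]:
    splitter = Split(pattern="-", behavior=behavior)
    pieces = [piece for piece, _ in splitter.pre_tokenize_str("the-final--countdown")]
    print(behavior, pieces)

# removed ['the', 'final', 'countdown']
# isolated ['the', '-', 'final', '-', '-', 'countdown']
# merged_with_previous ['the-', 'final-', '-', 'countdown']
# merged_with_next ['the', '-final', '-', '-countdown']
# contiguous ['the', '-', 'final', '--', 'countdown']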

ArthurZucker (Collaborator) commented

Hey! Regarding the behavior documentation, feel free to open a PR and I'll review it!
For the Split, IMO we should rather just make it work with a simple regex (that should be better for users, no?). I can also open a PR for this fix unless you want to tackle it? 🤗
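
A minimal sketch of what that fix could look like from the Python side (a hypothetical wrapper, not the actual change, which would presumably land in the Rust binding): coerce a plain str into a tokenizers.Regex before handing it to Split.

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

def regex_split(pattern, behavior, invert=False):
    # Hypothetical helper: wrap plain strings in tokenizers.Regex so they
    # are compiled as regexes instead of being matched as literal delimiters.
    if isinstance(pattern, str):
        pattern = Regex(pattern)
    return Split(pattern=pattern, behavior=behavior, invert=invert)

# With this wrapper, the string-based call from the issue would work:
# regex_split(GPT2_SPLIT_PATTERN, behavior="isolated")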
