
documentation of the pattern parameter in pre_tokenizers.Split is incorrect #1565

Closed
craigschmidt opened this issue Jul 10, 2024 · 1 comment · Fixed by #1591
Labels
documentation Improvements or additions to documentation

Comments


craigschmidt commented Jul 10, 2024

The documentation for pre_tokenizers.Split states:

pattern (str or Regex) — A pattern used to split the string. Usually a string or a a regex built with tokenizers.Regex

However, this is incorrect: a str does not work with tokenizers 0.19.1. The following example demonstrates the problem:

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split
from tokenizers import Regex

# similar issue to https://github.com/huggingface/tokenizers/pull/1264
# but the documentation is still incorrect as a string does not work

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

tokenizer_bad = Tokenizer.from_pretrained("bert-base-uncased")
# the documentation says pattern can be a string or `tokenizers.Regex` object
# but the string doesn't work
# have similar issues for other `behavior` values
tokenizer_bad.pre_tokenizer = Split(pattern=GPT2_SPLIT_PATTERN, behavior="isolated")  

pretokenized_output_bad = tokenizer_bad.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("incorrect:")
print(pretokenized_output_bad)
print()

# incorrect:
# [('This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script', (0, 63))]

# Regex seems to be undocumented; use a descriptive name rather than `re`
# to avoid shadowing the stdlib re module
split_regex = Regex(GPT2_SPLIT_PATTERN)

tokenizer_good = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer_good.pre_tokenizer = Split(pattern=split_regex, behavior="isolated")

pretokenized_output_good = tokenizer_good.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("correct:")
print(pretokenized_output_good)

# correct:
# [('This', (0, 4)), (' is', (4, 7)), (' a', (7, 9)), (' --', (9, 12)), (' -', (12, 14)), (' 0', (14, 16)), (' 00', (16, 19)), (' 0', (19, 21)), (' 0', (21, 23)), (' 13', (23, 26)), ('##$', (26, 29)), ('2', (29, 30)), ('#', (30, 31)), ('6', (31, 32)), ('klwt', (32, 36)), (' gtek', (36, 41)), (' jhrthr', (41, 48)), (' testing', (48, 56)), (' script', (56, 63))]

Can you please update the documentation to state that the pattern must be a tokenizers.Regex object, and give an example, as above, of how that can be done? It would also be good to document tokenizers.Regex somewhere. Another alternative would be to update the code so that it works with a str value.
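
For context, the str form does not appear to be simply ignored: a plain str seems to be matched as a literal delimiter rather than compiled as a regex, which would explain why the GPT-2 pattern above produces no splits at all (its literal text never occurs in the input). A minimal sketch, assuming this literal interpretation holds:

from tokenizers.pre_tokenizers import Split

# A plain str splits on its exact literal text, so a single space works
# as a delimiter even though the regex pattern string above matched nothing.
literal = Split(pattern=" ", behavior="removed")
print(literal.pre_tokenize_str("This is a test"))
# [('This', (0, 4)), ('is', (5, 7)), ('a', (8, 9)), ('test', (10, 14))]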

Note that PR #1264 improved the documentation, but it is still incorrect and confusing.

Also, while you're at it, the documentation for behavior could use the example from the Rust docs here.
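
For reference, that Rust example splits "the-final--countdown" on the literal delimiter "-" under each behavior. A rough Python rendering (assuming the binding accepts the lowercase behavior names, as "isolated" above suggests):

from tokenizers.pre_tokenizers import Split

# Reproduce the SplitDelimiterBehavior example from the Rust docs.
for behavior in ["removed", "isolated", "merged_with_previous",
                 "merged_with_next", "contiguous"]:
    splitter = Split(pattern="-", behavior=behavior)
    pieces = [piece for piece, _ in splitter.pre_tokenize_str("the-final--countdown")]
    print(behavior, pieces)

# removed ['the', 'final', 'countdown']
# isolated ['the', '-', 'final', '-', '-', 'countdown']
# merged_with_previous ['the-', 'final-', '-', 'countdown']
# merged_with_next ['the', '-final', '-', '-countdown']
# contiguous ['the', '-', 'final', '--', 'countdown']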

ArthurZucker (Collaborator) commented

Hey! Regarding the behavior documentation, feel free to open a PR and I'll review it!
For the Split, IMO we should rather just make it work with a simple regex (that should be better for users, no?). I can also open a PR for this fix unless you want to tackle it? 🤗
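
A minimal sketch of what that fix could look like from the Python side (a hypothetical wrapper, not the actual change, which would presumably land in the Rust binding): coerce a plain str into a tokenizers.Regex before handing it to Split.

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

def regex_split(pattern, behavior, invert=False):
    # Hypothetical helper: wrap plain strings in tokenizers.Regex so they
    # are compiled as regexes instead of being matched as literal delimiters.
    if isinstance(pattern, str):
        pattern = Regex(pattern)
    return Split(pattern=pattern, behavior=behavior, invert=invert)

# With this wrapper, the string-based call from the issue would work:
# regex_split(GPT2_SPLIT_PATTERN, behavior="isolated")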
