🚨 Support updating template processors #1652

Merged
merged 22 commits into from
Jan 28, 2025

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Oct 14, 2024

Goal:

from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

# Load the tokenizer and inspect its current post-processor.
tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
tokenizer.post_processor

# Goal: replace the processor at index 1 in place with a new template.
tokenizer.post_processor[1] = TemplateProcessing(
    single="[CLS] $0 [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 0)],
)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@McPatate McPatate force-pushed the sequential-post-processor branch from 11533c5 to 4bb595b January 14, 2025 02:37
@McPatate McPatate marked this pull request as ready for review January 16, 2025 02:33
Member

@McPatate McPatate left a comment

Self-approving here as it is your PR, @ArthurZucker. Waiting for your review before merging.

Collaborator Author

@ArthurZucker ArthurZucker left a comment

LGTM! Would just add Python tests! 😉

Let's check set_item, and also that what get_item returns is mutable.
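
To make the two requested checks concrete, here is a minimal sketch of what such tests could assert. It deliberately uses a hypothetical pure-Python stand-in (`FakeProcessor`, `FakeSequence`) rather than the real tokenizers bindings, so every class and attribute name below is an assumption for illustration only. The point is the shape of the assertions: item assignment must replace exactly one slot, and a mutation made through the object returned by indexing must be visible on the next lookup (which fails if indexing hands back a copy).

```python
from dataclasses import dataclass


@dataclass
class FakeProcessor:
    """Hypothetical stand-in for a post-processor; not a tokenizers class."""
    single: str


class FakeSequence:
    """Hypothetical stand-in for a sequence of processors."""

    def __init__(self, processors):
        self._processors = list(processors)

    def __getitem__(self, index):
        # Return the stored object itself, not a copy, so that changes
        # made through the returned handle are visible to the sequence.
        return self._processors[index]

    def __setitem__(self, index, processor):
        # Replace the processor at `index` in place.
        self._processors[index] = processor

    def __len__(self):
        return len(self._processors)


def check_set_item(seq):
    # Assigning to an index must replace that slot and leave the rest alone.
    first_before = seq[0].single
    length_before = len(seq)
    seq[1] = FakeProcessor(single="[CLS] $0 [SEP]")
    assert seq[1].single == "[CLS] $0 [SEP]"
    assert seq[0].single == first_before
    assert len(seq) == length_before


def check_get_item_is_mutable(seq):
    # A mutation made via seq[i] must still be there on the next lookup.
    seq[1].single = "$0 [SEP]"
    assert seq[1].single == "$0 [SEP]"


seq = FakeSequence([FakeProcessor(single="$0"), FakeProcessor(single="$0 [EOS]")])
check_set_item(seq)
check_get_item_is_mutable(seq)
```

Real tests would run the same two checks against the actual sequence object from the Python bindings instead of the stand-in.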

bindings/python/Cargo.toml (resolved)
bindings/python/pyproject.toml (outdated, resolved)
bindings/python/src/normalizers.rs (outdated, resolved)
bindings/python/src/normalizers.rs (outdated, resolved)
bindings/python/src/normalizers.rs (resolved)
bindings/python/src/normalizers.rs (outdated, resolved)
bindings/python/src/normalizers.rs (resolved)
bindings/python/src/normalizers.rs (outdated, resolved)
tokenizers/src/processors/template.rs (outdated, resolved)
bindings/python/tests/bindings/test_normalizers.py (outdated, resolved)
@McPatate McPatate force-pushed the sequential-post-processor branch from d37229f to ff80e9f January 27, 2025 23:03
@McPatate McPatate changed the title Support updating template processors 🚨 Support updating template processors Jan 28, 2025
@McPatate McPatate merged commit c45aebd into main Jan 28, 2025
30 checks passed
@McPatate McPatate deleted the sequential-post-processor branch January 28, 2025 13:58
3 participants