
[WIP] HuggingFaceModelTokenizer #2723

Open
wants to merge 15 commits into base: main

Conversation

krammnic
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

Basically, this is a first pass (I'm still thinking about how to add masking), but the Jinja template rendering works surprisingly well.
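For a sense of how the rendering works end to end, here is a rough sketch of the approach (file paths and the handling of bos_token/eos_token are illustrative assumptions, not the PR's code):

import json

from jinja2 import Environment
from tokenizers import Tokenizer

# Files downloaded alongside an HF model checkpoint (paths are illustrative).
base_tokenizer = Tokenizer.from_file("tokenizer.json")
with open("tokenizer_config.json") as f:
    config = json.load(f)

# The chat template ships as a Jinja string in tokenizer_config.json.
template = Environment().from_string(config["chat_template"])

rendered = template.render(
    messages=[{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
    # Some configs store these as plain strings, others as dicts with a
    # "content" field, so real code needs to normalize them.
    bos_token=config.get("bos_token", ""),
    eos_token=config.get("eos_token", ""),
)

# Encode the rendered string with the model's own tokenizer; the template has
# already inserted the special tokens, so don't let encode() add them again.
token_ids = base_tokenizer.encode(rendered, add_special_tokens=False).ids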


pytorch-bot bot commented May 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2723

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on May 12, 2025. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
Contributor

@ebsmothers left a comment

Thanks @krammnic for taking this one on! This will be huge for lowering the barrier to onboard new models. Let's definitely make sure to add unit tests for this one. (You can likely create some dummy tokenizer_config.json files and check them directly into the repo, since they should be pretty small.)
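For example, a unit test along these lines could exercise the template handling against a tiny hand-written config checked into the repo (the config contents and test name below are made up for illustration):

import json

from jinja2 import Environment

# A minimal hand-written config in the style of HF's tokenizer_config.json.
DUMMY_CONFIG = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|end_of_text|>",
    "chat_template": (
        "{{ bos_token }}"
        "{% for message in messages %}"
        "<|{{ message['role'] }}|>{{ message['content'] }}{{ eos_token }}"
        "{% endfor %}"
    ),
}


def test_dummy_chat_template_renders(tmp_path):
    config_path = tmp_path / "tokenizer_config.json"
    config_path.write_text(json.dumps(DUMMY_CONFIG))

    config = json.loads(config_path.read_text())
    template = Environment().from_string(config["chat_template"])
    rendered = template.render(
        messages=[{"role": "user", "content": "hi"}],
        bos_token=config["bos_token"],
        eos_token=config["eos_token"],
    )
    assert rendered == "<|begin_of_text|><|user|>hi<|end_of_text|>"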

special_tokens_mapping = {}
for token in self.special_tokens:
    special_tokens_mapping[token] = self.base_tokenizer.encode(token)
rendered_template = self.template.render(
Contributor

Wow this actually wound up being quite easy lol

Contributor Author

Unfortunately, tool calling will still be quite tricky


@krammnic Other than the lack of tool calls in the tt Message class, are there any other reasons why tool calling will be tricky?

Contributor Author

Probably not.

if content := token_info.get("content"):
    special_tokens.add(content)

# We sort lexicographically in order to get real tokens after all <|dummy_x|>
Contributor

Sorry I don't fully understand this comment. I assume this is referring to reserved special tokens? If so, why is string sort the thing to use here?

Contributor Author

We can probably drop it; it might just simplify debugging in case we face problems with new configs.

Comment on lines +203 to +207
self.base_tokenizer = HuggingFaceBaseTokenizer(
    tokenizer_json_path=tokenizer_json_path,
    tokenizer_config_json_path=tokenizer_config_json_path,
    generation_config_path=generation_config_path,
)
Contributor

I know @joecummings had some thoughts on whether we should use a generic base_tokenizer instead of constraining to HuggingFaceBaseTokenizer. I suspect the latter is better for making sure everything works together, but I know at least Qwen2Tokenizer still relies on the merges + vocab files instead of the tokenizer.json file (I alluded to this at the very bottom of #2706), so we should figure out whether this will work for that case.

{"role": m.role, "content": m.content[0]["content"]} for m in messages
],
add_generation_prompt=add_eos,
**special_tokens_mapping, # We assume that the naming is consitent
Contributor

Yeah I think this should be a reasonable assumption (as long as we are also getting the special_tokens from the same place as the template)

Contributor

@ebsmothers left a comment

Thanks for your patience! Left a handful of comments. Personally I would just add a unit test now; it'll make it easier to reason about things and help validate that this is giving the expected results.

self.truncation_type = truncation_type

def _raise_helper(self, msg):
    raise Exception(msg)
Contributor

Any reason to use this instead of the more specific Jinja TemplateError used by HF here?
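For reference, something like the following would mirror HF's helper; where it lives in this class is just a sketch:

from jinja2 import Environment
from jinja2.exceptions import TemplateError


def _raise_helper(msg: str) -> None:
    # Raising TemplateError (as HF transformers does) keeps failures coming
    # from the chat template distinguishable from generic runtime errors.
    raise TemplateError(msg)


env = Environment()
# Chat templates conventionally call raise_exception(...), so expose it there.
env.globals["raise_exception"] = _raise_helper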

def _raise_helper(self, msg):
    raise Exception(msg)

def _get_token_from_config(self, config: Dict[str, Any], key: str) -> str:
Contributor

I'm confused by this method. Based on the docstring and implementation it seems like it is being used to get special tokens. But then you are only using it for chat_template, which iiuc is always a string.

    messages=current_messages,
    add_generation_prompt=add_eos if i == len(messages) - 1 else False,
    **special_tokens_mapping,  # We assume that the naming is consistent
    **self.top_level_variables,
Contributor

Noob q: are top-level variables always sufficient? (I.e. is there ever a case where an HF template keys off some nested field?)

Contributor Author

Yes. In some configs a special token might be bos_id, for instance, but the bos is used in the chat_template, where it is defined as a top-level variable. In general we know it is possible to pass some extra variables, so it is better to prevent errors here with this trick.
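As an illustration (a made-up template, not from any real config), bos_token here only exists at the top level of the template, so it has to be passed straight to render():

from jinja2 import Environment

# bos_token is referenced outside the messages loop, i.e. as a top-level
# template variable, so it must be supplied directly to render().
template_str = "{{ bos_token }}{% for m in messages %}{{ m['content'] }}{% endfor %}"
rendered = Environment().from_string(template_str).render(
    messages=[{"role": "user", "content": "hi"}],
    bos_token="<|begin_of_text|>",
)
assert rendered == "<|begin_of_text|>hi"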


rendered = self.template.render(
    messages=current_messages,
    add_generation_prompt=add_eos if i == len(messages) - 1 else False,
Contributor

What are the implications of this? E.g. during finetuning do we actually want to add a generation prompt at the end of all the messages? (I would assume no)

Contributor Author

Good point!

Contributor Author

No, bad point: we need this argument, otherwise we will not be able to render.

Contributor

Ah, I think you're right, it is a bad point. I think I may have misread this line... my point was that we should only add the generation prompt during inference, but I see that add_eos is basically a proxy for inference.

Comment on lines 257 to 260
# This part is extremely hacky, but we need to handle case where we have variable access with jinja
special_tokens_mapping = {}
for token in self.special_tokens:
    special_tokens_mapping[token] = self.base_tokenizer.encode(token)
Contributor

I don't think I fully understand this comment. I also don't understand why we need to rebuild the special tokens mapping on every invocation of tokenize_messages. (Is that what the comment is referring to?)

Contributor Author

This comment became stale during the changes, good catch.
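One way to avoid the per-call rebuild, assuming special_tokens and base_tokenizer are already available in __init__, would be to cache the mapping once at construction time (a sketch, not the PR's code):

# Built once in __init__ instead of on every tokenize_messages call
# (assuming special_tokens and base_tokenizer are available there).
self.special_tokens_mapping = {
    token: self.base_tokenizer.encode(token) for token in self.special_tokens
}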

Comment on lines 284 to 289
if message.masked:
    tokenized_messages.extend([True] * len(delta))
else:
    tokenized_messages.extend(delta)

mask.extend([message.masked] * len(delta))
Contributor

This doesn't seem right to me

Contributor Author

Oops

tokenized_messages = truncate(
    tokens=tokenized_messages,
    max_seq_len=max_seq_len,
    eos_id=None,
Contributor

Should we also be adding eos here?

Contributor Author

We should
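Something along these lines, presumably (the eos_id attribute name here is a guess, not the PR's code):

tokenized_messages = truncate(
    tokens=tokenized_messages,
    max_seq_len=max_seq_len,
    eos_id=self.eos_id if add_eos else None,  # attribute name assumed
)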

Comment on lines +267 to +270
current_messages = [
{"role": m.role, "content": m.content[0]["content"]}
for m in messages[: i + 1]
]
Contributor

This seems a bit strange to me.. in my mind we should either be able to (a) render/tokenize all messages in one shot, or (b) loop over messages one at a time, render, tokenize, and concat. Why do we need to do this "cumulative tokenization"? (Is it because of the difficulties you mentioned with masking? If so I wonder whether there is an alternative)
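For discussion, this is roughly what the cumulative approach buys (an illustrative sketch with assumed helpers render and encode, not the PR's code): rendering growing prefixes of the conversation and diffing consecutive encodings attributes each token span to a single message, which is what lets a per-message mask be built even though the template is applied to the whole conversation at once.

def tokenize_with_mask(messages, render, encode):
    # messages: list of dicts with "role", "content", and a bool "masked" flag.
    # render: callable mapping a list of messages to the templated string.
    # encode: callable mapping a string to a list of token ids.
    tokens, mask = [], []
    prev_len = 0
    for i, message in enumerate(messages):
        encoded = encode(render(messages[: i + 1]))
        # Tokens added by including this message are the delta over the
        # previous prefix; attribute them (and their mask value) to it.
        delta = encoded[prev_len:]
        tokens.extend(delta)
        mask.extend([message["masked"]] * len(delta))
        prev_len = len(encoded)
    return tokens, mask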

@krammnic
Contributor Author

Let me push a unit test and we will iterate on this one more time.

@krammnic
Contributor Author

@ebsmothers Let's iterate

This was referenced May 28, 2025
Labels
CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
5 participants