
Honor trust_remote_code for custom tokenizers #28854

Conversation

@rl337 (Contributor) commented Feb 3, 2024

When trying to use AutoTokenizer.from_pretrained for a tokenizer that is not recognized in the existing tokenizer maps, you're required to respond to the trust_remote_code prompt even if you specify trust_remote_code=True.

The cause is that, in tokenization_auto.py, we kwargs.pop("trust_remote_code"...) but then don't explicitly pass it along when we call _from_pretrained(). There are two obvious fixes: either use kwargs.get() instead of kwargs.pop(), or explicitly pass it along to _from_pretrained(). This PR does the latter, because we don't necessarily want to keep trust_remote_code in the kwargs when we pass them down into other functions.
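
For illustration, a minimal self-contained sketch of the failure mode and the fix described above (hypothetical helper names and simplified signatures, not the real transformers code): the outer loader pops trust_remote_code from kwargs, and unless it forwards the value explicitly, the inner call never sees the user's decision and prompts anyway.

def _from_pretrained(name, trust_remote_code=None, **kwargs):
    # Hypothetical inner loader: it prompts only if no decision reached it.
    if trust_remote_code is None:
        print(f"{name}: would prompt about remote code")
    else:
        print(f"{name}: trust_remote_code={trust_remote_code}, no prompt")

def from_pretrained_buggy(name, **kwargs):
    kwargs.pop("trust_remote_code", None)   # popped, then silently dropped
    _from_pretrained(name, **kwargs)        # prompts even though the user opted in

def from_pretrained_fixed(name, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    _from_pretrained(name, trust_remote_code=trust_remote_code, **kwargs)

from_pretrained_buggy("my-custom-tokenizer", trust_remote_code=True)
from_pretrained_fixed("my-custom-tokenizer", trust_remote_code=True)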

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker (Collaborator) left a comment

LGTM, would you mind adding a small test? 🤗
also cc @Rocketknight1 as you had a recent PR to fix something a bit similar

@Rocketknight1 (Member)

Yep - I think I might have accidentally caused this with that PR! The original problem I was fixing was that the prompt was not displaying when trust_remote_code=None. Let me take a look at this fix.

@Rocketknight1 (Member)

Hi @rl337, can you give me some code to reproduce the issue? I picked Qwen/Qwen-VL because it's a recent release with a custom tokenizer, but I didn't get any prompt when I ran AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True).

@rl337 (Contributor, Author) commented Feb 6, 2024

Hi @rl337, can you give me some code to reproduce the issue? I picked Qwen/Qwen-VL because it's a recent release with a custom tokenizer, but I didn't get any prompt when I ran AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True).

Sure. I started working on a test, but I couldn't get the test suite to run properly, so I pulled the test out into a standalone script. Rename verify_tokenizer_standalone.txt to a .py; its only dependency is transformers, so it should be quick to create a venv and try it out.

I had to put an os.chdir() to the temp directory because I couldn't seem to get the subfolder param to work either. Likely it is broken too.

If I drop the meat of this test into tests/models/auto/test_tokenization_auto.py, will that work as a test?

verify_tokenizer_standalone.txt

@rl337 (Contributor, Author) commented Feb 7, 2024

@Rocketknight1 @ArthurZucker Okay, I took the body of that script and created a test in test_tokenization_auto.py. See the latest commits.

@ArthurZucker (Collaborator) left a comment

Thanks for the clean fix. Can you make sure the CIs are green (rebasing on main should suffice)?

A collaborator left a comment

LGTM. It's actually not bad to not push to the Hub; that way we see exactly what code was used 😉

@rl337 force-pushed the pass_trust_remote_code_for_custom_tokenizer branch from 4e89355 to 0ed6c09 on February 8, 2024 15:14
@rl337 (Contributor, Author) commented Feb 8, 2024

Okay, so here's the deal: os.chdir() causes failures in other tests, so I'm going to look up the current directory before the chdir and restore it after the test is run.
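
For reference, a minimal sketch (not the actual test code, and tmp_dir is a hypothetical argument) of the save/restore approach described above:

import os

def run_in_directory(tmp_dir):
    original_cwd = os.getcwd()
    try:
        os.chdir(tmp_dir)
        # ... load the tokenizer from a relative path here ...
    finally:
        os.chdir(original_cwd)  # restore the previous directory even if the test fails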

I traced the subfolder= issue down to what I think is the root cause; I'll open a new PR against it once I confirm that. Basically, get_class_from_dynamic_module() takes subfolder= via kwargs but doesn't pass it to the subsequent call to get_cached_module_file(), partly because get_cached_module_file() doesn't take either a subfolder or kwargs. Eventually we call cached_file(), which does take a subfolder=, but it ends up being None.
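
Purely as an illustration of that pattern (hypothetical stand-in functions, not the real transformers signatures): an argument accepted via **kwargs at the top of a call chain is lost if an intermediate function neither accepts nor forwards it.

def get_class_stand_in(repo, **kwargs):
    # subfolder arrives here inside kwargs...
    get_cached_module_stand_in(repo)            # ...but is not handed to the next call

def get_cached_module_stand_in(repo):
    # no subfolder parameter and no **kwargs, so the value cannot pass through
    cached_file_stand_in(repo, subfolder=None)  # always None by the time we get here

def cached_file_stand_in(repo, subfolder=None):
    print(f"looking up {repo!r} with subfolder={subfolder!r}")

get_class_stand_in("my-local-repo", subfolder="tokenizer")  # subfolder is silently dropped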

@rl337 (Contributor, Author) commented Feb 8, 2024

OK, we've got a clean build.

@rl337 (Contributor, Author) commented Feb 9, 2024

@ArthurZucker I don't have write access. Who should do the actual merge?

@rl337 (Contributor, Author) commented Feb 9, 2024

Can someone please merge this? I don't have write access.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@rl337 (Contributor, Author) commented Feb 9, 2024

Verified the change in my workflow.

@amyeroberts (Collaborator)

Verified the change in my workflow.

@rl337 Could you expand a bit on what you mean? A few minutes ago there was a comment saying that you were still getting prompted. Do the tests here reflect what you ran to verify your workflow?

@rl337 (Contributor, Author) commented Feb 9, 2024

@amyeroberts I deleted that comment. I realized that I was working from an out-of-date checkout of my fork. Sorry for the confusion.

@Rocketknight1 (Member) commented Feb 9, 2024

Hi @rl337, I'm still a bit confused about this! Like I mentioned above, I tried loading models like Qwen/Qwen-VL and didn't see the prompt issue described here, even though they have custom code in their tokenizer. Before we merge this PR, I'd like to know what conditions actually trigger this bug - for example, does it only occur with local repos, rather than repos on the Hub, or is there some specific config value that triggers the issue in your code that isn't an issue for Qwen?

@rl337 (Contributor, Author) commented Feb 9, 2024

@Rocketknight1 The code in verify_tokenizer_standalone.txt, as well as what I attached in the unit test, both exercise the bug. I'm not sure what the key difference is between what's in the test and what's exercised by Qwen/Qwen-VL. I can look into it when I get a chance.

@rl337 (Contributor, Author) commented Feb 11, 2024

@Rocketknight1 Okay, I have an answer for you. The difference between the configs in the test case and Qwen/Qwen-VL comes down to the contents of tokenizer_config.json. In Qwen/Qwen-VL, there's an explicitly defined tokenizer_class entry, which circumvents the need to check whether a tokenizer class is defined in config.json via AutoConfig, which is where my patch adds trust_remote_code.

I don't 100% understand this behavior of looking up a tokenizer_class, given that to get to this code we're already calling the tokenizer class's from_pretrained(), which means that cls is the tokenizer class. Is this to allow tokenizer_config.json to override the auto map's tokenizer definitions? But there you have it: that's why Qwen/Qwen-VL works but my test case does not.
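
Purely for illustration (hypothetical file contents, not copied from either repo), the two situations look roughly like this: with an explicit "tokenizer_class" the loader takes the short path and never consults AutoConfig, while without it the loader falls back to AutoConfig, which is the code path where the unwanted prompt was appearing.

import json

with_tokenizer_class = {
    "auto_map": {"AutoTokenizer": ["custom_tokenizer.CustomTokenizer", None]},
    "tokenizer_class": "CustomTokenizer",  # short path: no AutoConfig lookup
}

without_tokenizer_class = {
    "auto_map": {"AutoTokenizer": ["custom_tokenizer.CustomTokenizer", None]},
    # no "tokenizer_class" -> falls back to AutoConfig, which used to prompt
}

print(json.dumps(with_tokenizer_class, indent=2))
print(json.dumps(without_tokenizer_class, indent=2))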

@rl337 (Contributor, Author) commented Feb 13, 2024

@ArthurZucker @Rocketknight1 @amyeroberts Is there anything else I need to verify or do to get this merged?

@Rocketknight1 (Member) commented Feb 13, 2024

Hi @rl337, the delay is internal - we're trying to figure out if this is the right approach to the problem, since this is a fairly complex issue that touches some of the core transformers code. We want to fix it, but also avoid patches that will create further problems in future. You don't need to do anything else for now, but give us a little bit of time to investigate!

@Rocketknight1 (Member) commented Feb 13, 2024

Posting my understanding of the issue:

  • When you load a tokenizer, the tokenizer tries to figure out the tokenizer class of the repo
  • To do this, the tokenizer reads tokenizer_config.json and looks for a tokenizer_class key to determine the tokenizer class
  • If the key isn't present, the tokenizer tries to initialize a config from the repo with AutoConfig
  • Before this PR, trust_remote_code was not propagated correctly to the AutoConfig call. Therefore, an unwanted confirmation prompt is created if:
    • You load a custom code tokenizer with AutoTokenizer.from_pretrained()
    • You set trust_remote_code=True
    • tokenizer_class is not defined in tokenizer_config.json
    • The model config also requires custom code
  • The solution in this PR is to add a trust_remote_code argument to tokenizer.from_pretrained(). This argument does nothing in the function itself, but is passed to the AutoConfig call.
  • This PR also updates AutoTokenizer.from_pretrained() to pass its trust_remote_code value to tokenizer.from_pretrained() (see the sketch after this list).
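
A simplified sketch of that chain (hypothetical stand-in functions, not the real transformers signatures), showing the user's choice being handed down explicitly until it reaches the config lookup:

def auto_config_from_pretrained(repo, trust_remote_code=None):
    # Stand-in for the AutoConfig call: prompts only if no decision reached it.
    return "prompt" if trust_remote_code is None else "no prompt"

def tokenizer_from_pretrained(repo, trust_remote_code=None, tokenizer_class=None):
    # Stand-in for tokenizer.from_pretrained: the new argument is unused here
    # except to be forwarded to the config lookup when tokenizer_class is missing.
    if tokenizer_class is None:
        return auto_config_from_pretrained(repo, trust_remote_code=trust_remote_code)
    return "no prompt"  # explicit tokenizer_class: no config lookup needed

def auto_tokenizer_from_pretrained(repo, trust_remote_code=None):
    # Stand-in for AutoTokenizer.from_pretrained: forwards the flag explicitly.
    return tokenizer_from_pretrained(repo, trust_remote_code=trust_remote_code)

print(auto_tokenizer_from_pretrained("local-custom-tokenizer", trust_remote_code=True))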

After investigating, I think this is a good change, and doesn't introduce other security issues, or conflict with other areas of the library, so I'm willing to approve it.

However, one thing worth noting is that the tokenizer doesn't actually use the tokenizer class string in any of the loading code! The only purpose of the code block that causes this issue is just to check the tokenizer class name against the repo tokenizer name and raise a warning for users loading a tokenizer with a different tokenizer class. Since most people load tokenizers with AutoTokenizer now, we could consider just removing that block instead, as I don't think we need that big code block to raise a mostly unused warning anymore.

cc @ArthurZucker

@rl337 (Contributor, Author) commented Feb 13, 2024

@Rocketknight1 I see. Yeah, your summary matches my understanding of the situation.

I think that removing the path that leads to a warning, and instead failing fast with an appropriate error message, would be awesome.

Another thing I was thinking of was to add some kind of LoaderPolicy object which can be used to aggregate options for loading model classes and the like, instead of relying on kwargs to propagate these options. It would future-proof the API, because adding members to the policy object wouldn't change all of the signatures down the stack, but would still allow deep code to access options from the original calling function. One could also have a visible JSON file which describes explicitly what the loader's policy is, and which could then be shared across different code.

So, concretely, something like this:

from dataclasses import dataclass
import json
import os

@dataclass
class LoaderPolicy:
    trust_remote_code: bool = False
    local_files_only: bool = False
    cache_directory: str = ""
    # ...

    @classmethod
    def from_json(cls, filename: str, policy_dir: str = ".") -> "LoaderPolicy":
        with open(os.path.join(policy_dir, filename)) as fp:
            policy_json = json.load(fp)
        return cls(**policy_json)  # fill in members from the JSON file

# Illustrative usage (load_policy= is the proposed argument, not part of the current API):
policy = LoaderPolicy.from_json("loader_policy.json", policy_dir="some_dir/policies")

AutoModel.from_pretrained(model_id, load_policy=policy)

@Rocketknight1 (Member)

Hi @rl337 - it's a cool idea, but I'd worry about users having to create a Policy object. Although that might simplify the internal infrastructure, it would complicate the UX for people who often just want to make one single call to AutoModel or AutoTokenizer.

Anyway, for now I think we should just merge this PR and consider removal of that entire code block in a future PR (@rl337 if you want to open an issue or PR for that after this, I'd support it, but no pressure - it's mostly code cleanup rather than an essential feature!). You could also open an issue suggesting your Policy class if you want, but there might be some pushback!

Anyways, since we have core maintainer approval already, @rl337 are you okay with me merging now?

@rl337 (Contributor, Author) commented Feb 15, 2024

@Rocketknight1 Sure, go ahead and merge it. I'll see if I have time to write a PR with the policy idea. I don't think it would have to make things more complicated for end users, and it would allow flexibility for people who want a more formal way of specifying how to load models depending on the environment.

I'll tag you when I get to it.

@Rocketknight1 merged commit be42c24 into huggingface:main on Feb 16, 2024
21 checks passed
@Rocketknight1 (Member)

Got it, and thanks for the fix! Even if we just leave the code block as-is, it's still a really nice usability improvement for transformers.

@ArthurZucker (Collaborator)

Down to remove the block that adds a warning as well

@Rocketknight1 (Member)

@rl337 want to make a follow-up PR to remove the entire warning block, in that case?

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Feb 19, 2024
* pass through trust_remote_code for dynamically loading unregistered tokenizers specified by config
add test

* change directories back to previous directory after test

* fix ruff check

* Add a note to that block for future in case we want to remove it later

---------

Co-authored-by: Matt <rocketknight1@gmail.com>
@rl337 (Contributor, Author) commented Feb 22, 2024

@Rocketknight1 I'll try when I have the time.

Is there a place I can chat with you folks about these kinds of changes? For example, for the policy PR I want to submit, it'd be good to understand the concerns before I put the time into a PR. I had started a thread on Discord regarding another PR I submitted, but I got no responses.

@Rocketknight1 (Member)

@rl337 Probably the easiest way is just to open an issue to coordinate it and ping me and any other maintainers you need, and we can discuss it there before you spend any time on the PR?

itazap pushed a commit that referenced this pull request May 14, 2024
* pass through trust_remote_code for dynamically loading unregistered tokenizers specified by config
add test

* change directories back to previous directory after test

* fix ruff check

* Add a note to that block for future in case we want to remove it later

---------

Co-authored-by: Matt <rocketknight1@gmail.com>