Honor trust_remote_code for custom tokenizers #28854
Conversation
LGTM, would you mind adding a small test? 🤗
also cc @Rocketknight1 as you had a recent PR to fix something a bit similar
Yep - I think I might have accidentally caused this with that PR! The original problem I was fixing was that the prompt was not displaying when…
Hi @rl337, can you give me some code to reproduce the issue? I picked…
Sure. I started working on a test, but I couldn't get the test suite to run properly, so I pulled the test out into a standalone script. Rename verify_tokenizer_standalone.txt to a .py; its only dependency is transformers, so it should be quick to create a venv and try it out. I had to put an os.chdir() to the temp directory because I couldn't seem to get the subdir param to work either; likely it is broken too. If I drop the meat of this test into tests/models/auto/test_tokenization_auto.py, will that work as a test?
@Rocketknight1 @ArthurZucker Okay, I took the body of that script and created a test in test_tokenization_auto.py. See latest commits.
Thanks for the clean fix. Can you make sure the CIs are green? (Rebasing on main should suffice.)
LGTM. It's not that bad to avoid pushing to the Hub; that way we see exactly what code was used 😉
Force-pushed "…okenizers specified by config add test" from 4e89355 to 0ed6c09.
Okay, so here's the deal: os.chdir causes failures in other tests, so I'm going to hack this to look up the current directory before the chdir and restore it after the test is run (sketched below).

I also traced the subfolder= issue down to what I think is the root cause; I'll open a new PR against it once I confirm. Basically, get_class_from_dynamic_module() takes subfolder= via kwargs but doesn't pass it to the subsequent call to get_cached_module_file(), partly because get_cached_module_file() takes neither a subfolder nor kwargs. Eventually we call cached_file(), which does take a subfolder=, but it ends up being None.
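A minimal sketch of that cwd save/restore pattern, assuming a standard unittest-style test (names and structure here are illustrative, not the exact test code):

```python
import os
import tempfile
import unittest


class AutoTokenizerLocalDirTest(unittest.TestCase):
    def test_load_from_temp_dir(self):
        previous_cwd = os.getcwd()  # remember where we started
        with tempfile.TemporaryDirectory() as tmp_dir:
            try:
                # chdir is a workaround: subfolder= is not propagated down
                # to cached_file(), so we load relative to the cwd instead.
                os.chdir(tmp_dir)
                ...  # write tokenizer files, call AutoTokenizer.from_pretrained
            finally:
                # restore the cwd so other tests are unaffected
                os.chdir(previous_cwd)
```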
Okay, we've got a clean build.
@ArthurZucker I don't have write access. Who should do the actual merge?
Can someone please merge this? I don't have write access.
Verified the change in my workflow.
@rl337 Could you expand a bit on what you mean? A few minutes ago, there was a comment saying that you were still getting prompted. Do the tests here reflect the tests you ran to verify in your workflow?
@amyeroberts I deleted that comment. I realized that I was working from an out-of-date checkout of my fork. Sorry for the confusion.
Hi @rl337, I'm still a bit confused about this! Like I mentioned above, I tried loading models like…
@Rocketknight1 The code in verify_tokenizer_standalone.txt, as well as what I attached in the unit test, both exercise the bug. I'm not sure what the key difference is between what's in the test and what's exercised by Qwen/Qwen-VL. I can look into it when I get a chance.
@Rocketknight1 Okay, I have an answer for you. The difference between the configs from the test case and Qwen/Qwen-VL comes down to the contents of the tokenizer_config.json (illustrated below).

I don't 100% understand this behavior of looking up a tokenizer_class, given that to get to this code we're already calling the tokenizer class's from_pretrained(), which means cls is the tokenizer class. Is this to allow the tokenizer_config.json to override the AutoMap's tokenizer definitions? But there you have it: that's why Qwen/Qwen-VL works but my test case does not.
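For illustration, a hedged sketch (as Python dicts) of the two tokenizer_config.json shapes in play: one that names a tokenizer_class and one that relies only on auto_map. Keys follow the real tokenizer_config.json format; the class and module names are examples, not the exact Qwen/Qwen-VL values.

```python
# A config that explicitly names a tokenizer class; this triggers the
# class-lookup code path discussed in this thread.
config_with_class = {
    "tokenizer_class": "QWenTokenizer",
    "auto_map": {"AutoTokenizer": ["tokenization_qwen.QWenTokenizer", None]},
}

# A config with no "tokenizer_class" entry; only auto_map points at the
# remote code, so the lookup/warning block behaves differently.
config_without_class = {
    "auto_map": {"AutoTokenizer": ["tokenization_custom.CustomTokenizer", None]},
}
```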
@ArthurZucker @Rocketknight1 @amyeroberts Is there anything else I need to verify or do to get this merged?
Hi @rl337, the delay is internal - we're trying to figure out if this is the right approach to the problem, since this is a fairly complex issue that touches some of the core…
Posting my understanding of the issue:
After investigating, I think this is a good change; it doesn't introduce other security issues or conflict with other areas of the library, so I'm willing to approve it.

However, one thing worth noting is that the tokenizer doesn't actually use the tokenizer class string in any of the loading code! The only purpose of the code block that causes this issue is to check the tokenizer class name against the repo tokenizer name and raise a warning for users loading a tokenizer with a different tokenizer class (sketched below). Since most people load tokenizers with…
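A hedged sketch of the kind of check being described, assuming a mismatch simply logs a warning. This is illustrative, not the exact transformers source:

```python
import logging

logger = logging.getLogger(__name__)


def warn_on_class_mismatch(loading_cls_name, config_tokenizer_class):
    """Warn if the class doing the loading differs from the class named in
    the repo's tokenizer_config.json; the string is not otherwise used."""
    if config_tokenizer_class is not None and loading_cls_name != config_tokenizer_class:
        logger.warning(
            "The tokenizer class in this checkpoint (%s) does not match the "
            "class you are loading it with (%s).",
            config_tokenizer_class,
            loading_cls_name,
        )
```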
@Rocketknight1 I see. Yeah, your summary matches my understanding of the situation. I think that removing the path that leads to a warning, and instead failing fast with an appropriate error message, would be awesome.

Another thing I was thinking of was to add some kind of LoadingPolicy object that aggregates options for loading model classes et al., instead of relying on kwargs to propagate these options. It would future-proof the API, because adding members to the policy object won't change all of the signatures down the stack, yet it still lets deep code access values from the original calling function. One could also have a visible JSON document that describes explicitly what the loader's policy is, which could then be shared across different code. So, concretely, something like this:
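A minimal sketch of such a policy object, assuming a plain dataclass with a JSON form; the fields and names here are illustrative, not a proposed final API:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LoadingPolicy:
    """Aggregates loading options instead of threading them through **kwargs."""
    trust_remote_code: bool = False
    local_files_only: bool = False
    subfolder: Optional[str] = None

    def to_json(self) -> str:
        # A visible JSON description of exactly what the loader will allow.
        return json.dumps(asdict(self), indent=2)


# Hypothetical usage; from_pretrained does not take a policy= argument today:
# policy = LoadingPolicy(trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("some/custom-repo", policy=policy)
```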
Hi @rl337 - it's a cool idea, but I'd worry about users having to create a LoadingPolicy object. Anyway, for now I think we should just merge this PR and consider removal of that entire code block in a future PR (@rl337, if you want to open an issue or PR for that after this, I'd support it, but no pressure - it's mostly code cleanup rather than an essential feature!). You could also open an issue suggesting your LoadingPolicy idea.

Anyways, since we have core maintainer approval already, @rl337, are you okay with me merging now?
@Rocketknight1 Sure, go ahead and merge it. I'll see if I have time to write a PR with the policy idea. I don't think it would have to make things more complicated for end users, and it would allow flexibility for people who want a more formal way of specifying how to load models depending on environment. I'll tag you when I get to it.
Got it, and thanks for the fix! Even if we just leave the code block as-is, it's still a really nice usability improvement for…
Down to remove the block that adds a warning as well |
@rl337 want to make a follow-up PR to remove the entire warning block, in that case? |
* pass through trust_remote_code for dynamically loading unregistered tokenizers specified by config; add test
* change directories back to previous directory after test
* fix ruff check
* Add a note to that block for future in case we want to remove it later

Co-authored-by: Matt <rocketknight1@gmail.com>
@Rocketknight1 I'll try when I have the time. Is there a place I can chat with you folks about these kinds of changes? For the policy PR I want to submit, it'd be good to understand the concerns before I put the time into a PR. I had started a thread on Discord regarding another PR I submitted, but I got no responses.
@rl337 Probably the easiest way is just to open an issue to co-ordinate it and ping me and any other maintainers you need, and we can discuss it there before you spend any time on the PR? |
When trying to use AutoTokenizer.from_pretrained for a tokenizer that is not recognized in the existing tokenizer maps, you're required to respond to the trust_remote_code prompt even if you specify trust_remote_code=True.
The cause is that, in tokenization_auto.py, we kwargs.pop("trust_remote_code", ...) but then don't explicitly pass it when we call _from_pretrained(). There are two obvious fixes: either kwargs.get() instead of kwargs.pop(), or explicitly pass it along to _from_pretrained(). This PR does the latter, because we don't necessarily want to keep trust_remote_code in the kwargs when we pass them down into other functions. A sketch of the pattern is below.
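A minimal, self-contained illustration of the bug pattern and the fix; the function names are hypothetical stand-ins, not the actual transformers source:

```python
def _from_pretrained(name, trust_remote_code=None, **kwargs):
    # Downstream loader: without an explicit value, it falls back to
    # prompting the user (simulated here by printing the value).
    print(f"{name}: trust_remote_code={trust_remote_code}")


def from_pretrained_buggy(name, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    # Bug: the popped value is never forwarded, so it arrives as None.
    _from_pretrained(name, **kwargs)


def from_pretrained_fixed(name, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    # Fix (this PR): forward it explicitly while keeping it out of **kwargs.
    _from_pretrained(name, trust_remote_code=trust_remote_code, **kwargs)


from_pretrained_buggy("repo", trust_remote_code=True)  # trust_remote_code=None
from_pretrained_fixed("repo", trust_remote_code=True)  # trust_remote_code=True
```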
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.