
Honor trust_remote_code for custom tokenizers #28854

Conversation

@rl337 (Contributor) commented Feb 3, 2024

When trying to use AutoTokenizer.from_pretrained for a tokenizer that is not recognized in the existing tokenizer maps, you're required to respond to the trust_remote_code prompt even if you specify trust_remote_code=True.

The cause is that, in tokenization_auto.py, we kwargs.pop("trust_remote_code"...) but then don't explicitly pass it along when we call _from_pretrained(). There are two obvious fixes: either use kwargs.get() instead of kwargs.pop(), or explicitly pass it along to _from_pretrained(). This PR does the latter, because we don't necessarily want to keep trust_remote_code in the kwargs when we pass them down into other functions.
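
For illustration, a minimal self-contained sketch of the failure mode and the fix described above (hypothetical helper names and simplified signatures, not the real transformers code): the outer loader pops trust_remote_code from kwargs, and unless it forwards the value explicitly, the inner call never sees the user's decision and prompts anyway.

def _from_pretrained(name, trust_remote_code=None, **kwargs):
    # Hypothetical inner loader: it prompts only if no decision reached it.
    if trust_remote_code is None:
        print(f"{name}: would prompt about remote code")
    else:
        print(f"{name}: trust_remote_code={trust_remote_code}, no prompt")

def from_pretrained_buggy(name, **kwargs):
    kwargs.pop("trust_remote_code", None)   # popped, then silently dropped
    _from_pretrained(name, **kwargs)        # prompts even though the user opted in

def from_pretrained_fixed(name, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    _from_pretrained(name, trust_remote_code=trust_remote_code, **kwargs)

from_pretrained_buggy("my-custom-tokenizer", trust_remote_code=True)
from_pretrained_fixed("my-custom-tokenizer", trust_remote_code=True)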

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker (Collaborator) left a comment

LGTM, would you mind adding a small test? 🤗
also cc @Rocketknight1 as you had a recent PR to fix something a bit similar

@Rocketknight1 (Member)

Yep - I think I might have accidentally caused this with that PR! The original problem I was fixing was that the prompt was not displaying when trust_remote_code=None. Let me take a look at this fix.

@Rocketknight1 (Member)

Hi @rl337, can you give me some code to reproduce the issue? I picked Qwen/Qwen-VL because it's a recent release with a custom tokenizer, but I didn't get any prompt when I ran AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True).

@rl337 (Contributor, Author) commented Feb 6, 2024

Hi @rl337, can you give me some code to reproduce the issue? I picked Qwen/Qwen-VL because it's a recent release with a custom tokenizer, but I didn't get any prompt when I ran AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True).

Sure. I started working on a test, but I couldn't get the test suite to run properly, so I pulled the test out into a standalone script. Rename verify_tokenizer_standalone.txt to a .py; its only dependency is transformers, so it should be quick to create a venv and try it out.

I had to put an os.chdir() to the temp directory because I couldn't seem to get the subfolder param to work either. Likely it is broken too.

If I drop the meat of this test into tests/models/auto/test_tokenization_auto.py, will that work as a test?

verify_tokenizer_standalone.txt

@rl337 (Contributor, Author) commented Feb 7, 2024

@Rocketknight1 @ArthurZucker Okay, I took the body of that script and created a test in test_tokenization_auto.py. See the latest commits.

@ArthurZucker (Collaborator) left a comment

Thanks for the clean fix. Can you make sure the CIs are green (rebasing on main should suffice)?

A collaborator left a comment

LGTM. It's actually not bad to not push to the Hub; that way we see exactly what code was used 😉

@rl337 force-pushed the pass_trust_remote_code_for_custom_tokenizer branch from 4e89355 to 0ed6c09 on February 8, 2024 15:14
@rl337 (Contributor, Author) commented Feb 8, 2024

Okay, so here's the deal: os.chdir() causes failures in other tests, so I'm going to look up the current directory before the chdir and restore it after the test is run.
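
For reference, a minimal sketch (not the actual test code, and tmp_dir is a hypothetical argument) of the save/restore approach described above:

import os

def run_in_directory(tmp_dir):
    original_cwd = os.getcwd()
    try:
        os.chdir(tmp_dir)
        # ... load the tokenizer from a relative path here ...
    finally:
        os.chdir(original_cwd)  # restore the previous directory even if the test fails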

I traced the subfolder= issue down to what I think is the root cause; I'll open a new PR against it once I confirm that. Basically, get_class_from_dynamic_module() takes subfolder= via kwargs but doesn't pass it to the subsequent call to get_cached_module_file(), partly because get_cached_module_file() doesn't take either a subfolder or kwargs. Eventually we call cached_file(), which does take a subfolder=, but it ends up being None.
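
Purely as an illustration of that pattern (hypothetical stand-in functions, not the real transformers signatures): an argument accepted via **kwargs at the top of a call chain is lost if an intermediate function neither accepts nor forwards it.

def get_class_stand_in(repo, **kwargs):
    # subfolder arrives here inside kwargs...
    get_cached_module_stand_in(repo)            # ...but is not handed to the next call

def get_cached_module_stand_in(repo):
    # no subfolder parameter and no **kwargs, so the value cannot pass through
    cached_file_stand_in(repo, subfolder=None)  # always None by the time we get here

def cached_file_stand_in(repo, subfolder=None):
    print(f"looking up {repo!r} with subfolder={subfolder!r}")

get_class_stand_in("my-local-repo", subfolder="tokenizer")  # subfolder is silently dropped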

@rl337 (Contributor, Author) commented Feb 8, 2024

OK, we've got a clean build.

@rl337 (Contributor, Author) commented Feb 9, 2024

@ArthurZucker I don't have write access. Who should do the actual merge?

@rl337 (Contributor, Author) commented Feb 9, 2024

Can someone please merge this? I don't have write access.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@rl337 (Contributor, Author) commented Feb 9, 2024

Verified the change in my workflow.

@amyeroberts (Collaborator)

Verified the change in my workflow.

@rl337 Could you expand a bit on what you mean? A few minutes ago there was a comment saying that you were still getting prompted. Do the tests here reflect what you ran to verify your workflow?

@rl337 (Contributor, Author) commented Feb 9, 2024

@amyeroberts I deleted that comment. I realized that I was working from an out-of-date checkout of my fork. Sorry for the confusion.

@Rocketknight1 (Member) commented Feb 9, 2024

Hi @rl337, I'm still a bit confused about this! Like I mentioned above, I tried loading models like Qwen/Qwen-VL and didn't see the prompt issue described here, even though they have custom code in their tokenizer. Before we merge this PR, I'd like to know what conditions actually trigger this bug - for example, does it only occur with local repos, rather than repos on the Hub, or is there some specific config value that triggers the issue in your code that isn't an issue for Qwen?

@rl337 (Contributor, Author) commented Feb 9, 2024

@Rocketknight1 The code in verify_tokenizer_standalone.txt, as well as what I attached in the unit test, both exercise the bug. I'm not sure what the key difference is between what's in the test and what's exercised by Qwen/Qwen-VL. I can look into it when I get a chance.

@rl337 (Contributor, Author) commented Feb 11, 2024

@Rocketknight1 Okay, I have an answer for you. The difference between the configs in the test case and Qwen/Qwen-VL comes down to the contents of tokenizer_config.json. In Qwen/Qwen-VL, there's an explicitly defined tokenizer_class entry, which circumvents the need to check whether a tokenizer class is defined in config.json via AutoConfig, which is where my patch adds trust_remote_code.

I don't 100% understand this behavior of looking up a tokenizer_class, given that to get to this code we're already calling the tokenizer class's from_pretrained(), which means that cls is the tokenizer class. Is this to allow tokenizer_config.json to override the auto map's tokenizer definitions? But there you have it: that's why Qwen/Qwen-VL works but my test case does not.
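
Purely for illustration (hypothetical file contents, not copied from either repo), the two situations look roughly like this: with an explicit "tokenizer_class" the loader takes the short path and never consults AutoConfig, while without it the loader falls back to AutoConfig, which is the code path where the unwanted prompt was appearing.

import json

with_tokenizer_class = {
    "auto_map": {"AutoTokenizer": ["custom_tokenizer.CustomTokenizer", None]},
    "tokenizer_class": "CustomTokenizer",  # short path: no AutoConfig lookup
}

without_tokenizer_class = {
    "auto_map": {"AutoTokenizer": ["custom_tokenizer.CustomTokenizer", None]},
    # no "tokenizer_class" -> falls back to AutoConfig, which used to prompt
}

print(json.dumps(with_tokenizer_class, indent=2))
print(json.dumps(without_tokenizer_class, indent=2))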

@rl337 (Contributor, Author) commented Feb 13, 2024

@ArthurZucker @Rocketknight1 @amyeroberts Is there anything else I need to verify or do to get this merged?

@Rocketknight1 (Member) commented Feb 13, 2024

Hi @rl337, the delay is internal - we're trying to figure out if this is the right approach to the problem, since this is a fairly complex issue that touches some of the core transformers code. We want to fix it, but also avoid patches that will create further problems in future. You don't need to do anything else for now, but give us a little bit of time to investigate!

@Rocketknight1 (Member) commented Feb 13, 2024

Posting my understanding of the issue:

  • When you load a tokenizer, the tokenizer tries to figure out the tokenizer class of the repo
  • To do this, the tokenizer reads tokenizer_config.json and looks for a tokenizer_class key to determine the tokenizer class
  • If the key isn't present, the tokenizer tries to initialize a config from the repo with AutoConfig
  • Before this PR, trust_remote_code was not propagated correctly to the AutoConfig call. Therefore, an unwanted confirmation prompt is created if:
    • You load a custom code tokenizer with AutoTokenizer.from_pretrained()
    • You set trust_remote_code=True
    • tokenizer_class is not defined in tokenizer_config.json
    • The model config also requires custom code
  • The solution in this PR is to add a trust_remote_code argument to tokenizer.from_pretrained(). This argument does nothing in the function itself, but is passed to the AutoConfig call.
  • This PR also updates AutoTokenizer.from_pretrained() to pass its trust_remote_code value to tokenizer.from_pretrained() (see the sketch after this list).
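
A simplified sketch of that chain (hypothetical stand-in functions, not the real transformers signatures), showing the user's choice being handed down explicitly until it reaches the config lookup:

def auto_config_from_pretrained(repo, trust_remote_code=None):
    # Stand-in for the AutoConfig call: prompts only if no decision reached it.
    return "prompt" if trust_remote_code is None else "no prompt"

def tokenizer_from_pretrained(repo, trust_remote_code=None, tokenizer_class=None):
    # Stand-in for tokenizer.from_pretrained: the new argument is unused here
    # except to be forwarded to the config lookup when tokenizer_class is missing.
    if tokenizer_class is None:
        return auto_config_from_pretrained(repo, trust_remote_code=trust_remote_code)
    return "no prompt"  # explicit tokenizer_class: no config lookup needed

def auto_tokenizer_from_pretrained(repo, trust_remote_code=None):
    # Stand-in for AutoTokenizer.from_pretrained: forwards the flag explicitly.
    return tokenizer_from_pretrained(repo, trust_remote_code=trust_remote_code)

print(auto_tokenizer_from_pretrained("local-custom-tokenizer", trust_remote_code=True))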

After investigating, I think this is a good change, and doesn't introduce other security issues, or conflict with other areas of the library, so I'm willing to approve it.

However, one thing worth noting is that the tokenizer doesn't actually use the tokenizer class string in any of the loading code! The only purpose of the code block that causes this issue is just to check the tokenizer class name against the repo tokenizer name and raise a warning for users loading a tokenizer with a different tokenizer class. Since most people load tokenizers with AutoTokenizer now, we could consider just removing that block instead, as I don't think we need that big code block to raise a mostly unused warning anymore.

cc @ArthurZucker

@rl337 (Contributor, Author) commented Feb 13, 2024

@Rocketknight1 I see. Yeah, your summary matches my understanding of the situation.

I think that removing the path that leads to a warning, and instead failing fast with an appropriate error message, would be awesome.

Another thing I was thinking of was to add some kind of LoaderPolicy object which can be used to aggregate options for loading model classes and the like, instead of relying on kwargs to propagate these options. It would future-proof the API, because adding members to the policy object wouldn't change all of the signatures down the stack, but would still allow deep code to access options from the original calling function. One could also have a visible JSON file which describes explicitly what the loader's policy is, and which could then be shared across different code.

So, concretely, something like this:

from dataclasses import dataclass
import json
import os

@dataclass
class LoaderPolicy:
    trust_remote_code: bool = False
    local_files_only: bool = False
    cache_directory: str = ""
    # ...

    @classmethod
    def from_json(cls, filename: str, policy_dir: str = ".") -> "LoaderPolicy":
        with open(os.path.join(policy_dir, filename)) as fp:
            policy_json = json.load(fp)
        return cls(**policy_json)  # fill in members from the JSON file

# Illustrative usage (load_policy= is the proposed argument, not part of the current API):
policy = LoaderPolicy.from_json("loader_policy.json", policy_dir="some_dir/policies")

AutoModel.from_pretrained(model_id, load_policy=policy)

@Rocketknight1 (Member)

Hi @rl337 - it's a cool idea, but I'd worry about users having to create a Policy object. Although that might simplify the internal infrastructure, it would complicate the UX for people who often just want to make one single call to AutoModel or AutoTokenizer.

Anyway, for now I think we should just merge this PR and consider removal of that entire code block in a future PR (@rl337 if you want to open an issue or PR for that after this, I'd support it, but no pressure - it's mostly code cleanup rather than an essential feature!). You could also open an issue suggesting your Policy class if you want, but there might be some pushback!

Anyways, since we have core maintainer approval already, @rl337 are you okay with me merging now?

@rl337 (Contributor, Author) commented Feb 15, 2024

@Rocketknight1 Sure, go ahead and merge it. I'll see if I have time to write a PR with the policy idea. I don't think it would have to make things more complicated for end users, and it would allow flexibility for people who want a more formal way of specifying how to load models depending on the environment.

I'll tag you when I get to it.

@Rocketknight1 merged commit be42c24 into huggingface:main on Feb 16, 2024
21 checks passed
@Rocketknight1 (Member)

Got it, and thanks for the fix! Even if we just leave the code block as-is, it's still a really nice usability improvement for transformers.

@ArthurZucker (Collaborator)

Down to remove the block that adds a warning as well

@Rocketknight1 (Member)

@rl337 want to make a follow-up PR to remove the entire warning block, in that case?

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Feb 19, 2024
* pass through trust_remote_code for dynamically loading unregistered tokenizers specified by config
add test

* change directories back to previous directory after test

* fix ruff check

* Add a note to that block for future in case we want to remove it later

---------

Co-authored-by: Matt <rocketknight1@gmail.com>
@rl337 (Contributor, Author) commented Feb 22, 2024

@Rocketknight1 I'll try when I have the time.

Is there a place I can chat with you folks about these kinds of changes? For example, for the policy PR I want to submit, it'd be good to understand the concerns before I put the time into a PR. I had started a thread on Discord regarding another PR I submitted, but I got no responses.

@Rocketknight1 (Member)

@rl337 Probably the easiest way is just to open an issue to coordinate it and ping me and any other maintainers you need, and we can discuss it there before you spend any time on the PR?

itazap pushed a commit that referenced this pull request May 14, 2024
* pass through trust_remote_code for dynamically loading unregistered tokenizers specified by config
add test

* change directories back to previous directory after test

* fix ruff check

* Add a note to that block for future in case we want to remove it later

---------

Co-authored-by: Matt <rocketknight1@gmail.com>