Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow pickle.load unless TRUST_REMOTE_CODE=True #27776

Merged
merged 5 commits into from
Dec 4, 2023

Conversation

ydshieh
Copy link
Collaborator

@ydshieh ydshieh commented Nov 30, 2023

What does this PR do?

For security reason, require the users to explicitly allow access to the places where pickle.load is used.

import os
from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

os.environ["ALLOW_ACCESS_TO_POTENTIAL_INSECURE_CODE"] = "True"

checkpoint = 'transfo-xl-wt103'
revision = '40a186da79458c9f9de846edfaea79c412137f97'

tokenizer = TransfoXLTokenizer.from_pretrained(checkpoint, revision=revision)
model = TransfoXLLMHeadModel.from_pretrained(checkpoint, revision=revision)

Before merge

There are a few pickle.load in examples/research_projects/, some conversion files and test files. These are considered less severe, but we can still apply the change to all of them to get rid of this unlovely pickle stuff.

@@ -288,33 +256,6 @@ def test_custom_hf_index_retriever_save_and_from_pretrained_from_disk(self):
out = retriever.retrieve(hidden_states, n_docs=1)
self.assertTrue(out is not None)

def test_legacy_index_retriever_retrieve(self):
n_docs = 1
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to test the "legacy" case anymore after (not merged yet) #27748

@ydshieh ydshieh requested a review from LysandreJik November 30, 2023 15:51
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use TRUST_REMOTE_CODE as the var name, will extend to the other usecase and paves the way for a feature that was asked a lot

@ydshieh
Copy link
Collaborator Author

ydshieh commented Dec 1, 2023

Not really. TRUST_REMOTE_CODE is for remote code, here is the code in transformers. And the pickle file could be a local file downloaded or a link to a file (not code).

@ydshieh
Copy link
Collaborator Author

ydshieh commented Dec 1, 2023

Happy the give another name if we come up with a better one though!

@LysandreJik
Copy link
Member

Thanks for the implementation @ydshieh!

An environment variable for remote code has also been asked in the past, and given the proximity to this I think it would make sense to have the same variable name for the two.

TRUST_REMOTE_CODE looks good to me as we're indeed taking the risk of executing code from the Hub when downloading pickle files from there and running them; and trust_remote_code=True has the same approach as these files might also be locally cached and still require the flag to be set.

If you're strongly against TRUST_REMOTE_CODE for this I'd be happy to hear your proposal, but I think it would be nice to unify the two and to be in-line with the trust_remote_code argument that we have used elsewhere.

Would that be ok for you?

@ydshieh
Copy link
Collaborator Author

ydshieh commented Dec 1, 2023

I am fine with the name, but I think we might adjust the documentation a bit for TRUST_REMOTE_CODE. The current one for trust_remote_code is

                "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option"
                "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
                "execute code present on the Hub on your local machine."

And this sounds like if an user doesn't run custom models defined on the Hub, they are safe.

I will update the PR.

@ydshieh
Copy link
Collaborator Author

ydshieh commented Dec 1, 2023

Updated. It turns out that I don't need to update the message - it's clear enough 😅

@ydshieh ydshieh requested a review from ArthurZucker December 1, 2023 10:56
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 🫡

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ydshieh!

cc @julien-c for visibility

@julien-c
Copy link
Member

julien-c commented Dec 4, 2023

Thanks! and i agree that using the same env var is a good idea / makes sense 👍

@ydshieh
Copy link
Collaborator Author

ydshieh commented Dec 4, 2023

Sorry about that, but there is a tiny issue with the condition if not os.getenv("TRUST_REMOTE_CODE", False):. I will fix this then merge.

Thank you for all the review.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ydshieh ydshieh changed the title Disallow pickle.load by default Disallow pickle.load unless TRUST_REMOTE_CODE=True Dec 4, 2023
@ydshieh ydshieh merged commit 1d63b0e into main Dec 4, 2023
19 checks passed
@ydshieh ydshieh deleted the difficult_to_access_pickle branch December 4, 2023 15:48
@LysandreJik
Copy link
Member

This seems to have now triggered a dependabot update in downstream repos: https://github.com/huggingface/api-inference-community/security/dependabot/78

The concerned userbase is extremely limited and all we did was protect from pickle loading so really unsure about the "critical" report, but good to have done it nonetheless. Thanks Yih-Dar!

@ydshieh
Copy link
Collaborator Author

ydshieh commented Dec 21, 2023

I show scan dependabot in all the dependency GitHub repositories to show my impact 😆

thschmitt pushed a commit to thschmitt/transformers that referenced this pull request May 23, 2024
…7776)

* fix

* fix

* Use TRUST_REMOTE_CODE

* fix doc

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants