
Privacy&Security: Network Contact Every Model Load #15927

Closed
4 tasks done
xloem opened this issue Mar 3, 2022 · 5 comments
Comments

@xloem
Contributor

xloem commented Mar 3, 2022

I decided to add this after commenting on deepset-ai/haystack#2118

Who can help

@sgugger

Information

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use a tool such as wireshark to observe network activity, or set up an SSL-supporting proxy with a local certificate.
  2. Load a cached model from huggingface.co in a loop (a minimal loop is sketched after these steps).
  3. Observe that the code queries the network to check for a model update on every load, even though the model is already cached.
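For illustration, a minimal reproduction sketch. The model ID gpt2 is just an example and is assumed to already be in the local cache:

```python
from transformers import AutoModel, AutoTokenizer

# "gpt2" is an example model ID assumed to already be cached locally.
# With a packet capture running (wireshark, or an intercepting proxy with a
# locally trusted certificate), every iteration still triggers requests to
# huggingface.co even though nothing new is downloaded.
for _ in range(10):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")
```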

Expected behavior

  • When offline, network accesses for cached material should fail gracefully by default, easing and encouraging users to work in safer offline environments.
  • Cached material need only be refreshed daily at most, not every load. Otherwise the user's load-to-load behavior is broadcast to their network.
  • Access to the network for huggingface.co material should happen only when users request it. Otherwise, frequent automatic model updating can be leveraged by an adversary able to make a fake ssl certificate, to arbitrarily and frequently mutate the model the user is running.
  • The cache should store models as true git repositories, so that their integrity can be manually verified using the normal git tools. Ideally it would automatically perform this integrity verification and detect when server material has changed.

This is a multifaceted situation, and each part is valuable in its own right.

I have memory and accuracy issues and may have already opened a similar issue to this, or stated something slightly false above. If so, I apologise. I wanted to take responsibility for my comments in the project linked at top.

@sgugger
Collaborator

sgugger commented Mar 3, 2022

cc @LysandreJik @Narsil This relates to things we talked about internally.

I agree with the first point; the second and third points deserve their own debate, so thank you for writing this issue!

For the fourth point, the situation is a bit more complicated, as every user would need to download the whole repo when using a model, even if they don't need all the files (for instance, most downloaded models have weights in PyTorch, TF, and Flax).

@Narsil
Contributor

Narsil commented Mar 4, 2022

Thanks @xloem, and thank you for writing this!
As @sgugger said, we had some internal debate about this and are considering our options.

Here are my initial thoughts (just trying to offer food for thought):

When offline, network accesses for cached material should fail gracefully by default, easing and encouraging users to work in safer offline environments.

Yes, probably with a warning so that users are aware they might not be using the latest version. (There's a flag TRANSFORMERS_OFFLINE=1 by the way to remove network access globally across the code base afaik).
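For illustration, a minimal sketch of using that flag; the assumption here is that the environment variable is set before transformers is imported, so the library picks it up:

```python
import os

# TRANSFORMERS_OFFLINE=1 tells transformers to rely on the local cache only;
# set it before the library is imported so it is read at startup.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoModel

# Loads from the local cache without contacting huggingface.co;
# raises an error if the model has never been downloaded.
model = AutoModel.from_pretrained("gpt2")
```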

Cached material need only be refreshed daily at most, not every load. Otherwise the user's load-to-load behavior is broadcast to their network.

We should still be careful: some users might actually be testing updates to their model that they push to the hub. IMO a global form of caching that bypasses the network should be opt-in, not opt-out. And indeed it would be very nice to have.

Access to the network for huggingface.co material should happen only when users request it. Otherwise, frequent automatic model updating can be leveraged by an adversary able to make a fake ssl certificate, to arbitrarily and frequently mutate the model the user is running.

Arguable, but it's also a nice thing, when I upload a model while training, to see the changes happening live.
AutoModel.from_pretrained("gpt2") in my personal view is already asking for network access (whereas AutoModel.from_pretrained("./gpt2") is not).
An attacker triggering many updates on a model is an issue that goes beyond SSL, IMO. Defending against such things would have to be explicitly enabled rather than on by default.

The cache should store models as true git repositories, so that their integrity can be manually verified using the normal git tools. Ideally it would automatically perform this integrity verification, and detect when server material has changed.

Agree with @sgugger: as long as multiple weight files are on the hub, doing a full git clone seems hard.
There might be ways to use GIT_LFS_SMUDGE=0 and download only specific files.
We also have to keep in mind that the repo itself might contain data totally unrelated to the model that users might want to ignore.
Another thing to consider is that any change to the caching mechanism of transformers is going to trigger a full re-download for every user out there. It's doable, but not to be done lightly.

One element in favor of "repo"-based caching would be reducing the number of HEAD calls during from_pretrained; currently there are a few too many even when things are in the cache, and on a flaky network they add up pretty fast.
There could be ways to emulate that "repo"-based caching without doing an actual git clone (which would enable finer-grained control over what happens there, for the reasons mentioned above).
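As a rough sketch of the single-call freshness check being alluded to (the resolve URL pattern and the ETag header are what the Hub exposes for files today; treat the details as illustrative, not as the library's implementation):

```python
import requests

# One HEAD request is enough to learn whether a cached file is stale: the
# ETag returned by the Hub changes when the file changes, so it can be
# compared against the ETag recorded at download time.
url = "https://huggingface.co/gpt2/resolve/main/config.json"
response = requests.head(url, allow_redirects=True, timeout=10)
remote_etag = response.headers.get("ETag")
print(remote_etag)  # compare with the ETag stored alongside the cached file
```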

@xloem
Contributor Author

xloem commented Mar 5, 2022

When you say an attacker triggering many updates on a model is an issue that goes beyond SSL, could you elaborate on what you mean and why you might generally require users to enable a more secure setup, rather than having it as the default? With transformers getting larger and larger, it seems the centralisation and frequent default network downloads of models could become a significant danger to nations. As someone who has worked with wireshark and implemented man-in-the-middle attacks, I see network access as an opportunity for a network peer to mutate the data received, on each request. It isn't a complex, impossible thing: the protocols are all public and people study them.

This can be mitigated by making it non-default, or even just emitting output when HEAD requests are made. I see that changing the default behavior could make things harder for people with existing setups that rely on the network behavior, but the change could still be planned for a future major release.

As someone who has worked with git a lot, I don't fully understand the concerns around the git clone. git-lfs is a separate system from git and supports downloading individual files, and git also has partial (filter) cloning now that can prevent download of unneeded git objects, although it isn't well documented in my experience (a rough sketch follows below). I agree that driving all of this from Python is an engineering challenge. The value of using git repositories is that they expose the backend to the user and their administrators, so they can perform their own audits and review the changes carried by model updates.
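For what it's worth, a rough sketch of the kind of selective clone being described, using standard git and git-lfs features. Note that git-lfs exposes the skip flag as GIT_LFS_SKIP_SMUDGE (rather than GIT_LFS_SMUDGE=0 as written above), the gpt2 repo URL is just an example, and whether the Hub's git server honors blob filters is a separate question:

```python
import os
import subprocess

# Partial clone: fetch commits and trees but defer blob downloads, and leave
# git-lfs pointer files in place instead of pulling the large weight files.
env = {**os.environ, "GIT_LFS_SKIP_SMUDGE": "1"}
subprocess.run(
    ["git", "clone", "--filter=blob:none", "https://huggingface.co/gpt2", "gpt2"],
    env=env, check=True,
)

# Pull only the single weight file that is actually needed.
subprocess.run(
    ["git", "lfs", "pull", "--include=pytorch_model.bin"],
    cwd="gpt2", check=True,
)

# The local copy can then be audited with normal git tooling.
subprocess.run(["git", "fsck"], cwd="gpt2", check=True)
subprocess.run(["git", "log", "--oneline", "-5"], cwd="gpt2", check=True)
```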

Just thoughts. Thanks for keeping this issue open.

@julien-c
Member

julien-c commented Mar 7, 2022

We had an internal (?) discussion about changing the cached file layout to better map to the git workflow, i.e. to be able to know whether we have the latest version of a model with a single HTTP call.

Does anyone remember where this discussion was?

@github-actions

github-actions bot commented Apr 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
