Reverting Deta cloning mechanism. #22656
Conversation
The documentation is not available anymore as the PR was closed or merged.
I confirmed that the example, as well as the 2 currently failing DETA tests, works now with this revert. It's indeed better to have a working version while we still need more time to dive into the root of the issue.
Thank you @Narsil !
There is however …
The new code should fix everything. @sgugger for a new review, since the change has evolved quite a bit and is not a simple revert anymore.
```diff
@@ -1768,7 +1768,8 @@ def save_pretrained(
             # We're going to remove aliases before saving
             ptrs = collections.defaultdict(list)
             for name, tensor in state_dict.items():
-                ptrs[tensor.data_ptr()].append(name)
+                ident = (tensor.data_ptr(), tensor.device, tensor.shape, tensor.stride())
```
This change exists because of multi-GPU setups and potentially peculiar sharing of tensors.
Tensors are considered shared, and droppable, if and only if they are the exact same tensor: same ptr, same device, same shape, same stride.
We don't need to handle the meta device here, I think, since trying to save a model on the meta device should already be a bug.
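As a minimal sketch (not the actual `save_pretrained` code), grouping by this richer identity tuple collects true aliases under one key; the tensor names below are made up for illustration:

```python
import collections

import torch

state_dict = {"a.weight": torch.zeros(4, 4)}
state_dict["a_alias.weight"] = state_dict["a.weight"]  # a true alias: same tensor object

ptrs = collections.defaultdict(list)
for name, tensor in state_dict.items():
    ident = (tensor.data_ptr(), tensor.device, tensor.shape, tensor.stride())
    ptrs[ident].append(name)

shared = {names[0]: names[1:] for names in ptrs.values() if len(names) > 1}
print(shared)  # {'a.weight': ['a_alias.weight']}
```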
src/transformers/modeling_utils.py (outdated)
```python
# This makes sure even if the pattern covers all names
# that we keep at least 1 copy of the name.
for name in sorted(del_names)[: len(names) - 1]:
    del state_dict[name]
```
This is the fix for DETA.
DETA's `_keys_to_ignore_on_load` regexps are a bit too generous and cover ALL duplicates for some layers.
This code ensures that we keep at least one key in the dict in that case.
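To make the slicing concrete, here is a tiny illustration with made-up layer names: even when the ignore pattern matches every duplicate, the slice deletes all but one of them, and the last name in sorted order is the one that survives:

```python
names = ["decoder.layers.0.weight", "decoder.layers.1.weight", "decoder.layers.2.weight"]
del_names = set(names)  # the regexp happened to match all of the duplicates

to_delete = sorted(del_names)[: len(names) - 1]
kept = sorted(del_names)[len(names) - 1 :]

print(to_delete)  # ['decoder.layers.0.weight', 'decoder.layers.1.weight']
print(kept)       # ['decoder.layers.2.weight'] -> at least one copy stays in the state_dict
```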
If we did, the `can_use_safetensors` test should crash (because safetensors flat-out refuses shared tensors).
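For context, a hedged illustration of that behaviour: handing safetensors a state dict where two names point at the same tensor makes it error out (the exact exception type and message may differ across safetensors versions, hence the broad catch):

```python
import os
import tempfile

import torch
from safetensors.torch import save_file

shared = torch.zeros(2, 2)
state_dict = {"a": shared, "b": shared}  # two names, one underlying tensor

path = os.path.join(tempfile.mkdtemp(), "model.safetensors")
try:
    save_file(state_dict, path)
except Exception as err:  # safetensors refuses to serialize shared tensors
    print("refused:", err)
```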
In terms of logic, we can delete at most n-1 names, and if we delete fewer, it would mean that the names are the same (because of the use of `set`). I could stick to lists if you'd prefer.
In terms of logic, I think we should keep the first one and not the last? Usually tensor sharing is written as `tensor_2 = tensor_1`, not the opposite.
(I guess you somehow saw my deleted message?) I think the current change is fine, so I deleted my previous question.
Confirmed again that the new changes work for the relevant tests. LGTM with the explanations.
(Except the 2 changes in `_load_pretrained_model`, but that's because I am not familiar with the codebase here. I think @sgugger would know much better than me.)
So we tried it your way and it doesn't work. Can we try to use Accelerate to detect the tied weights instead as suggested initially?
src/transformers/modeling_utils.py (outdated)
```python
# This makes sure even if the pattern covers all names
# that we keep at least 1 copy of the name.
for name in sorted(del_names)[: len(names) - 1]:
    del state_dict[name]
```
In terms of logic, I think we should keep the first one and not the last? Usually tensor sharing is written as `tensor_2 = tensor_1`, not the opposite.
Because … We could definitely use … For instance, I wonder what happens for buffers.
Why? It seems you're using the hash (via …).
So actually … This exhibits the difference between find_tied_weights and the state_dict: here the tensors from the state_dict don't share the hash, while the parameters do on the model, yet the tensors in the state_dict do share memory.
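A quick self-contained check of that observation (toy module, not from the PR): tied parameters are one object on the model, but `state_dict()` returns detached tensors, so the dict entries hash differently while still sharing memory:

```python
import torch.nn as nn


class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.head = nn.Linear(4, 10, bias=False)
        self.head.weight = self.emb.weight  # tie by sharing the same Parameter


model = Tiny()
sd = model.state_dict()

assert model.emb.weight is model.head.weight                        # same Parameter on the model
assert hash(sd["emb.weight"]) != hash(sd["head.weight"])            # different objects in the state_dict
assert sd["emb.weight"].data_ptr() == sd["head.weight"].data_ptr()  # but the memory is shared
```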
In both situations, you have access to the model, and …
If this situation (the opposite) does not appear in Transformers, let's just use … I also would like to drive the point home that …
Why are we even caring about …?
In order to help with ease of use of …, which sort of mimics what is done here. However, I still think this PR and the mechanism in transformers should be kept, since …
Thanks for considering shared weights in safetensors directly. I agree it would still be cleaner to have the same kind of mechanism in Transformers. Could you please explain to me once again why the hash check does not work for the first change in the PR (dropping weights in the checkpoint before passing it to safetensors)? I don't think we ever tie weights in Transformers other than by just setting the same tensors.
Apart from that, just rebasing on main should be all that's necessary here.
Note that I will rework the constants in future work to have one distinct key for the tied weights (as sometimes they are not tied and we are currently not warning the user if they are missing), but that's orthogonal to this PR.
Mostly this:

```python
state_dict = kwargs.pop("state_dict", None)
```

Users can send a `state_dict` that is not linked to the model at all. Then there are even further edge cases:

```python
import torch
import torch.nn as nn


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(100, 100)
        self.b = self.a


model = Model()
assert model.a is model.b  # OK!

A = torch.zeros((1000, 100))
a = A[:100]
model.a.weight = nn.Parameter(a)
model.b.weight = model.a.weight
assert model.a is model.b  # Well, indeed it's the same parameter, but both are shared with respect to a larger tensor


class NoSharedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(100, 100)
        self.b = torch.nn.Linear(100, 100)


model = NoSharedModel()
A = torch.zeros((100, 100))
model.a.weight = nn.Parameter(A)
model.b.weight = nn.Parameter(A[:10])
assert model.a.weight is not model.b.weight  # A is not B in parameters, however, the underlying tensors are indeed shared
```

I haven't looked at it that deeply when finetuning occurs to see if the autograd starts to copy the tensors. If you want I could take a look at … But the biggest reason, really, is the optional `state_dict` argument.
Great!
Seeing the rebase:

```python
import torch

A = torch.zeros((10, 10))
B = A[1]
A.untyped_storage().data_ptr() == B.untyped_storage().data_ptr()  # True: B is a view of A, same storage
hash(A) != hash(B)                                                # True: different tensor objects
```
Thanks for the explanation. Good for me for `save_pretrained`, but in `from_pretrained` I think it's better to rely on the hash and catch all the situations that happen in Transformers (we do not use slices), while having something that works when the model is on the meta device (which will ultimately become the default), instead of relying on the data pointers and not doing anything when the model is on the meta device.
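A tiny sketch of that argument (my example, not Transformers code): when tying is done by assigning the same `Parameter` object, the identity-based `hash()` still detects the tie even on the meta device, where data pointers are not usable:

```python
import torch.nn as nn

emb = nn.Embedding(10, 4, device="meta")
head = nn.Linear(4, 10, bias=False, device="meta")
head.weight = emb.weight  # tie the weights by sharing the Parameter object

assert emb.weight is head.weight
assert hash(emb.weight) == hash(head.weight)  # identity-based hash finds the tie on meta
```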
src/transformers/modeling_utils.py (outdated)
```python
def _tensor_hash(tensor):
    # This is better than `tensor.data_ptr()`
    # Since A = torch.zeros((10, 10))
    # B = A[2, :]
    # Then A.data_ptr() != B.data_ptr()
    # But actually the storage is still shared
    try:
        ptr = tensor.untyped_storage().data_ptr()
    except AttributeError:
        # Fallback for torch==1.10
        try:
            ptr = tensor.storage().data_ptr()
        except NotImplementedError:
            # Fallback for meta storage like in 2.0
            ptr = 0
    return (ptr, tensor.device)


existing_ptrs = {
    _tensor_hash(model_state_dict[k])
    for k in loaded_keys
    if k in model_state_dict and model_state_dict[k].device != torch.device("meta")
}
```
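If I read the intent correctly (a hedged sketch; the key names below are made up), `existing_ptrs` lets the loading code tell apart keys that are genuinely missing from keys that merely alias an already-loaded tensor:

```python
import torch

shared = torch.zeros(4, 4)
model_state_dict = {
    "encoder.embed.weight": shared,
    "lm_head.weight": shared,  # tied to the embedding
}
loaded_keys = ["encoder.embed.weight"]  # only one of the two names is in the checkpoint


def _tensor_hash(tensor):
    # simplified version of the helper above (no meta/old-torch fallbacks)
    return (tensor.untyped_storage().data_ptr(), tensor.device)


existing_ptrs = {_tensor_hash(model_state_dict[k]) for k in loaded_keys}
missing = [k for k in model_state_dict if k not in loaded_keys]
really_missing = [k for k in missing if _tensor_hash(model_state_dict[k]) not in existing_ptrs]
print(really_missing)  # [] -> 'lm_head.weight' aliases a loaded tensor, so no warning is needed
```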
Here the check is not right, as it does not find the tied weights when the model is on the meta device (which is ultimately going to be the default, to load without using RAM). The goal is to detect tied parameters in the model in any case, so we can rely on the hash of the model weights (there are no shared slices in Transformers models) for this test.
I swapped for accelerate here.
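For reference, a minimal sketch of what `accelerate.utils.modeling.find_tied_parameters` is doing here: it walks the module tree and finds parameters that are the same object, so it works regardless of device, including meta (the toy model below is mine, not from the PR; the exact return type has changed across accelerate versions):

```python
import torch.nn as nn
from accelerate.utils.modeling import find_tied_parameters

# Tiny model built directly on the meta device, with two weights tied by
# assigning the same Parameter object.
model = nn.ModuleDict(
    {
        "emb": nn.Embedding(10, 4, device="meta"),
        "head": nn.Linear(4, 10, bias=False, device="meta"),
    }
)
model["head"].weight = model["emb"].weight

print(find_tied_parameters(model))
# Expected: one group containing 'emb.weight' and 'head.weight'
```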
Hurray!!!
Thanks for bearing with me!
src/transformers/modeling_utils.py (outdated)
```diff
@@ -28,6 +28,7 @@
 from typing import Any, Callable, Dict, List, Optional, Tuple, Union

 import torch
+from accelerate.utils.modeling import find_tied_parameters
```
Can we just protect this with an `is_accelerate_available` check? If users installed Transformers and PyTorch separately, they won't have it (they'd need to do `pip install transformers["torch"]`), and in this case we would just skip the check for missing tied parameters (so there might be an extra warning in this case).
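A rough sketch of the guarded import being asked for (not the exact code that was merged in the PR; `tied_parameter_groups` is a hypothetical helper):

```python
from transformers.utils import is_accelerate_available

if is_accelerate_available():
    from accelerate.utils.modeling import find_tied_parameters
else:
    find_tied_parameters = None


def tied_parameter_groups(model):
    # Hypothetical helper: without accelerate we simply skip the detection,
    # at worst producing an extra "missing keys" warning for tied weights.
    if find_tied_parameters is None:
        return []
    return find_tied_parameters(model)
```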
Thanks for the explanation!
Failing tests seem to be linked to the newly released huggingface_hub==0.14.0. @sgugger, merge if you think it's OK; I'm not going to merge it myself given this PR affects core modeling.
Just one last nit (always explicit tests for bool values and no Python conversion magic as usual ;-) ) and it should be good to merge.
This is more correct there, since it handles the meta device seamlessly and we don't need to handle "non-duplicate" tensors (slices of each other).
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Fixed the revert by making sure that even the regexp can cover all duplicates.
* Code simplification using hash.
* Fixing the `ident`.
* Fixing ignoring patterned duplicate names.
* Using `accelerate@find_tied_parameters` for from_pretrained. This is more correct there, since it handles the meta device seamlessly and we don't need to handle "non-duplicate" tensors (slices of each other).
* Protecting accelerate.
* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
This one is quite odd.
With the revert the slow test will work (I guess what we care most about):
However if I incorporate this:
Then, the output is garbage again (this isn't using safetensors and is not linked to the original change).
I even tried to revert the PR that introduced the bug.
The change of output is due to safetensors. I need to thoroughly check this.
This revert will fix the slow test anyway.
I think something is not properly set up in this model, because the uploaded model seems to have those layers NOT linked (hence the copy.deepcopy), but the rest of the configuration seems to assume they are; hence the issue, maybe?
Fixes #22437 (comment)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.