support HRA #1864
Conversation
Thanks a lot for this PR to add HRA to PEFT. This method looks very promising and I'm happy to add it to PEFT.
I took a look at the paper (which I have to admit was hard for me to understand) and the code, which already looks quite good.
I didn't start the CI yet, because there is currently an issue with PyTorch and numpy 2.0 which will hopefully be resolved very soon. However, I ran some tests locally and found a few of them failing. Could you please check if you see the same when you run `pytest tests/ -k hra` locally?
I haven't checked all tests, but it appears that at least some of them are related to the initialization. By default, you have `init_weights=True`, which leads to Kaiming uniform initialization. As a result, with an untrained HRA adapter, calling forward is not an identity transform. Is that understanding correct?
To get an identity transform, presumably we have to set `init_weights=False`. Note that this is the opposite of what we have for the other adapters like LoRA: there, if we set `init_weights=True` or `init_lora_weights=True`, it is the identity transform, i.e. at the start of training, the adapter has no effect on the output.
There can be three solutions to this:
- You change the default to `init_weights=False`, but this can be confusing to users, as the default for the other methods is `True`.
- You leave the default as is, but for the failing tests you have to set `init_weights=False`.
- You change the meaning such that `init_weights=True` leads to zero-initialization. This is more in line with what LoRA etc. do but probably does not conform to the paper.
Note that I tested setting `init_weights=False` to see if this leads to the identity transform, but instead `forward` returned `nan`. So this needs to be checked too.
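For reference, a minimal sketch of the kind of identity-transform check involved here (the model id and target modules are placeholders, and it assumes the config class added in this PR is named `HRAConfig`):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import HRAConfig, get_peft_model

torch.manual_seed(0)
# Placeholder tiny model; any causal LM with q_proj/v_proj modules would do.
base = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-OPTForCausalLM")
input_ids = torch.arange(10).view(1, -1)

with torch.no_grad():
    out_base = base(input_ids).logits

# Wrap the base model with an untrained HRA adapter and compare outputs.
config = HRAConfig(init_weights=True, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(base, config)
with torch.no_grad():
    out_hra = peft_model(input_ids).logits

# If init_weights=True is meant to be the identity transform, this should print True.
print(torch.allclose(out_base, out_hra, atol=1e-6))
```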
Before I do a full in-depth review of the code, could you please also do the following:
- Add some documentation. It would be especially helpful to write a short, easy-to-digest description of the method and how it compares to other methods, especially LoRA/OFT/BOFT. Also, let's try to give a good description of the hyperparameters so that users know how to tweak them without having to read and understand the whole paper.
- It would be really nice if you could add an example to the `examples/` folder, maybe something based on the original repo that we can use to check that the PEFT implementation is faithful.
- Speaking of the original repo, if some of the code in this PR is a 1:1 copy of https://github.com/DaShenZi721/HRA, it can be helpful to mark this with code comments and perma-links.
- Please run `make style` on the PR.
Hello, thank you for your suggestions! Here are the updates on the PR:
Thank you for your review! If there are any issues, please let me know.
Thanks @DaShenZi721 for the updates. Could you please run `make style`?
OK, I just updated the ruff version from 0.2.2 to 0.4.10, and ran `make style`.
Thanks a lot for the additions to the PR. It is much more complete now.
While reviewing, I found a few smaller issues. There is also a potentially bigger one, namely in the forward methods of the layers. Please check my comments, where I describe the details.
Also, thanks for adding the examples. I haven't reviewed them yet, as I think we should first address the other issues. I'll get to that once the rest is sorted out.
src/peft/tuners/hra/__init__.py
Outdated
@@ -0,0 +1,20 @@
# Copyright 2023-present the HuggingFace Inc. team.
Please change all years to 2024.
Done.
src/peft/tuners/hra/config.py
Outdated
    Whether to apply Gram-Schmidt orthogonalization.
    """

    r: int = field(default=8, metadata={"help": "HRA rank"})
IIUC, `r` should generally be an even number, is that right? If yes, could you please document that here and in the docstring?
Done.
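For illustration, a hedged sketch of how a user would set this in the config (it assumes the config class is named `HRAConfig` and follows the usual PEFT field names; the target modules are placeholders):

```python
from peft import HRAConfig

config = HRAConfig(
    r=8,  # number of Householder reflections per layer; per the discussion above, keep this even
    init_weights=True,
    target_modules=["q_proj", "v_proj"],  # placeholder module names
)
```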
src/peft/tuners/hra/config.py
Outdated
"help": "The layer pattern name, used only if `layers_to_transform` is different to None and if the layer pattern is not in the common layers pattern." | ||
}, | ||
) | ||
fan_in_fan_out: bool = field( |
If I'm not missing something, this is never really used. If so, let's remove it.
Done.
src/peft/tuners/hra/layer.py
Outdated
if reverse:
    for i in range(rank - 1, -1, -1):
        unit_v = opt_u[:, i].view(-1, 1)
        weight = weight @ (
            torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
        )
else:
    for i in range(rank):
        unit_v = opt_u[:, i].view(-1, 1)
        weight = weight @ (
            torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
        )
WDYT? Suggested change (compute the index order once and share the loop body instead of duplicating it in both branches):

if reverse:
    indices = range(rank - 1, -1, -1)
else:
    indices = range(rank)

for i in indices:
    unit_v = opt_u[:, i].view(-1, 1)
    weight = weight @ (
        torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
    )
Done.
src/peft/tuners/hra/layer.py
Outdated
x = x.to(self.get_base_layer().weight.data.dtype)
orig_weight = self.get_base_layer().weight.data
delta_weight = self.get_delta_weight(active_adapter)
new_weight = torch.mm(orig_weight, delta_weight)

result = F.linear(input=x, weight=new_weight, bias=self.base_layer.bias)
I think this code is not quite correct. Above, we calculate the output from the base layer: `result = self.base_layer(x, *args, **kwargs)`. Here, we just completely override `result`. So not only did we calculate `self.base_layer(x, *args, **kwargs)` for nothing (in the case of a single active adapter), we also have an issue when there are multiple active adapters, as only the result from the last adapter would be returned, right?
Ideally, inside the loop, we should incrementally update `result` with the adapter activations, something like `result = result + hra_result`. Not sure if that works here though. Alternatively, inside the loop we would have to accumulate the delta weights for each adapter and then calculate the whole output outside the loop.
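A rough sketch of the second option, accumulating the delta weights of all active adapters and doing a single linear pass at the end; the method names mirror the snippet above and the `active_adapters` bookkeeping follows the usual PEFT layer convention, so treat this as an illustration rather than the final implementation:

```python
import torch
import torch.nn.functional as F

def hra_linear_forward_sketch(layer, x):
    """Fold the Householder deltas of every active adapter into one effective
    weight, then run a single F.linear call (sketch only)."""
    new_weight = layer.get_base_layer().weight.data
    for active_adapter in layer.active_adapters:
        delta_weight = layer.get_delta_weight(active_adapter)  # orthogonal product for this adapter
        new_weight = torch.mm(new_weight, delta_weight)        # accumulate into the effective weight
    x = x.to(new_weight.dtype)
    return F.linear(input=x, weight=new_weight, bias=layer.get_base_layer().bias)
```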
I was surprised that this was not caught by the tests, but for that to happen we need to add HRA to the multi-adapter tests in `tests/test_custom_models.py` (line 356 at commit 184beaf: `MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES = [`).
I added HRA to these tests and indeed some of them are failing. Could you please do that and then fix the error? Please do the same adjustments for conv2d.
Sorry, we didn't account for multiple active adapters. This bug has now been fixed, and we have added and successfully passed the test cases you mentioned.
README.md
Outdated
@@ -1,5 +1,5 @@
<!---
- Copyright 2023 The HuggingFace Team. All rights reserved.
+ Copyright 2024 The HuggingFace Team. All rights reserved.
It seems that you have replaced all the years in all files. This should not be done. Please revert this change. The year only needs to be updated in the files that you created for this PR.
I'm sorry...
Hi! Did you find any other problems?
Thanks for the updates, good work. Overall, this looks pretty good and I only found a few minor issues that need to be addressed before we can merge -- please check my comments. Apart from those, could you please run `make style`?
if __name__ == "__main__":
    args = parse_args()
    main(args)
I followed the instructions and started training with `./train_dreambooth.sh 0 0`. It ran fine for a while but then I got an error. Any idea what could be the reason?
Steps: 6%|██████████▍ | 98/1705 [00:42<09:11, 2.92it/s, loss=0.35, lr=0.001]GPU Memory before entering the train : 4796
GPU Memory consumed at the end of the train (end-begin): 21
GPU Peak Memory consumed during the train (max-begin): 7640
GPU Total Peak Memory consumed during the train (max): 12436
CPU Memory before entering the train : 3384
CPU Memory consumed at the end of the train (end-begin): 202
CPU Peak Memory consumed during the train (max-begin): 517
CPU Total Peak Memory consumed during the train (max): 3901
/opt/conda/conda-bld/pytorch_1716905969073/work/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [276,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[...]
/opt/conda/conda-bld/pytorch_1716905969073/work/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [96,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 925, in main
[rank1]: encoder_hidden_states = text_encoder(batch["input_ids"])[0]
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank1]: else self._run_ddp_forward(*inputs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank1]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
[rank1]: return model_forward(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
[rank1]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/work/forks/peft/src/peft/peft_model.py", line 692, in forward
[rank1]: return self.get_base_model()(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 806, in forward
[rank1]: return self.text_model(
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 710, in forward
[rank1]: encoder_outputs = self.encoder(
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 637, in forward
[rank1]: layer_outputs = encoder_layer(
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 384, in forward
[rank1]: hidden_states = self.mlp(hidden_states)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 339, in forward
[rank1]: hidden_states = self.fc1(hidden_states)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 116, in forward
[rank1]: return F.linear(input, self.weight, self.bias)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank1]: During handling of the above exception, another exception occurred:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 1083, in <module>
[rank1]: main(args)
[rank1]: File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 896, in main
[rank1]: with TorchTracemalloc() if not args.no_tracemalloc else nullcontext() as tracemalloc:
[rank1]: File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 443, in __exit__
[rank1]: torch.cuda.empty_cache()
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/cuda/memory.py", line 162, in empty_cache
[rank1]: torch._C._cuda_emptyCache()
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank1]:[W CUDAGuardImpl.h:118] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905969073/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7a9a6ca93897 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7a9a6ca43b25 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7a9a6cb6d718 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d8d6 (0x7a9a6cb388d6 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7a9a6cb3a5e3 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7a9a6cb3a922 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a7380 (0x7a9a635a7380 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7a9a6ca7836f in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7a9a6ca711cb in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7a9a6ca71379 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #10: c10d::Reducer::~Reducer() + 0x5c4 (0x7a9a5bdec3b4 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7a9a63cf2402 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7a9a63471268 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xcf5eb1 (0x7a9a63cf5eb1 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x47c8e3 (0x7a9a6347c8e3 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x47d861 (0x7a9a6347d861 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #16: /home/name/anaconda3/envs/peft/bin/python() [0x502cba]
frame #17: /home/name/anaconda3/envs/peft/bin/python() [0x55ed0e]
frame #18: /home/name/anaconda3/envs/peft/bin/python() [0x540ea5]
frame #19: /home/name/anaconda3/envs/peft/bin/python() [0x53f8d8]
frame #20: /home/name/anaconda3/envs/peft/bin/python() [0x53f914]
frame #21: /home/name/anaconda3/envs/peft/bin/python() [0x4f72cb]
frame #22: PyDict_SetItemString + 0xab (0x4fad1b in /home/name/anaconda3/envs/peft/bin/python)
frame #23: /home/name/anaconda3/envs/peft/bin/python() [0x5fc824]
frame #24: Py_FinalizeEx + 0x143 (0x5eb603 in /home/name/anaconda3/envs/peft/bin/python)
frame #25: Py_RunMain + 0x193 (0x5f6da3 in /home/name/anaconda3/envs/peft/bin/python)
frame #26: Py_BytesMain + 0x39 (0x5bbc79 in /home/name/anaconda3/envs/peft/bin/python)
frame #27: <unknown function> + 0x2a1ca (0x7a9a7ee2a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x8b (0x7a9a7ee2a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: /home/name/anaconda3/envs/peft/bin/python() [0x5bbac3]
Steps: 6%|██████████▌ | 99/1705 [00:42<12:33, 2.13it/s, loss=0.35, lr=0.001]W0701 15:36:00.773000 123747233916416 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 96374 closing signal SIGTERM
E0701 15:36:00.887000 123747233916416 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 96375) of binary: /home/name/anaconda3/envs/peft/bin/python
Traceback (most recent call last):
File "/home/name/anaconda3/envs/peft/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
multi_gpu_launcher(args)
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
distrib_run.run(args)
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train_dreambooth.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-01_15:36:00
host : name-machine
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 96375)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 96375
======================================================
I've encountered this issue as well, but it disappeared after retraining. You could try retraining to see if it resolves the problem.
Indeed, I had to delete the `logs` directory and then, after retraining, it works. It would be good to add a comment about this for users who encounter the same problem -- or, even better, to fix the problem :)
This PR looks great, we're almost there. Just some very small comments from my side, please take a look.
tests/test_stablediffusion.py
Outdated
@@ -1,4 +1,4 @@
- # Copyright 2023-present the HuggingFace Inc. team.
+ # Copyright 2024-present the HuggingFace Inc. team.
Please undo.
Done.
tests/test_common_gpu.py
Outdated
from datasets import load_dataset
from transformers import AutoImageProcessor, AutoModelForImageClassification
These imports can live at the module level; it's not necessary to make them local.
Done.
echo "BASE_MODEL_NAME = $MODEL_NAME" | ||
echo "ADAPTER_MODEL_PATH = $OUTPUT_DIR" | ||
echo "PROMPT = $instance_prompt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion, this way users can copy&paste into the notebook without having to edit the code.
echo "BASE_MODEL_NAME = $MODEL_NAME" | |
echo "ADAPTER_MODEL_PATH = $OUTPUT_DIR" | |
echo "PROMPT = $instance_prompt" | |
echo "BASE_MODEL_NAME = '$MODEL_NAME'" | |
echo "ADAPTER_MODEL_PATH = '$OUTPUT_DIR'" | |
echo "PROMPT = '$instance_prompt'" |
Done.
def __getattr__(self, name: str):
    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
Just a small thing: we just merged a fix for a small bug in `__getattr__` (#1892). Could you please apply the fix here too?

if name == "model":  # see #1892: prevent infinite recursion if class is not initialized
    raise
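For context, here is roughly how that guard slots into the method quoted above (a sketch that assumes the wrapped module is stored as `self.model`, following the usual PEFT tuner pattern):

```python
def __getattr__(self, name: str):
    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
    except AttributeError:
        if name == "model":  # see #1892: prevent infinite recursion if class is not initialized
            raise
        return getattr(self.model, name)
```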
There is also a merge conflict in `test_common_gpu.py`, which should hopefully be easy to resolve.
We have applied the fix from #1892 and resolved the conflicts!
@DaShenZi721 Thanks for the updates. We just merged another PR, which resulted in many merge conflicts, but they should be very straightforward to resolve. Could you please take a look? I'll soon be leaving for the EuroPython conference, so unfortunately I won't be able to review the PR this week.
We have resolved the conflicts.
The code and the recent changes all look good. I had a fresh look at the training script and actually found a couple of issues when trying not to use wandb for logging. Maybe these issues are already in the original code, not sure, but please take a look. It would be nice if users could easily switch the logger or set no logger at all. Besides this, the PR is good to go.
```
hra_dreambooth
├── data
```
There is no `data` directory, right?
The `data` directory should be created automatically when running `train_dreambooth.sh`.
Thanks for fixing the training script, it now works well for me without wandb. And also big thanks for adding HRA to PEFT. I'll merge the PR once CI is green.
Thank you so much for your careful and responsible guidance!
Hi @DaShenZi721, I noticed that one of our unit tests is particularly slow and it involves HRA. I wonder if you have some ideas what the exact reason could be and if we could speed up HRA, which would also help with adoption. Otherwise, I'll probably just remove this test.
Sorry, I missed this message... I will now look into how to address this issue.
I wasn't able to reproduce the issue. Has it already been resolved?
I switched the underlying model, which is probably why you can't reproduce it now. From that point of view, this is no longer urgent. I think it could still be worth investigating why it was so slow, though.
OK, I'll check it out.
Hello! I'm the author of "Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation". This PR integrates our method, HRA (Householder Reflection Adaptation).
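For readers skimming the thread, here is a minimal sketch of the core idea as it shows up in the layer code reviewed above: the frozen weight is multiplied by a chain of Householder reflections built from trainable vectors (names, shapes, and the normalization here are illustrative, not the exact PEFT implementation):

```python
import torch

def apply_householder_chain(weight: torch.Tensor, hra_u: torch.Tensor) -> torch.Tensor:
    """Multiply `weight` by a product of Householder reflections (I - 2 u u^T).

    weight: (out_features, in_features) frozen base weight
    hra_u:  (in_features, r) trainable vectors, one reflection per column
    """
    in_features, r = hra_u.shape
    eye = torch.eye(in_features, device=hra_u.device, dtype=hra_u.dtype)
    for i in range(r):
        u = hra_u[:, i].view(-1, 1)
        u = u / (u.norm() + 1e-8)  # unit vector, so (I - 2 u u^T) is an orthogonal reflection
        weight = weight @ (eye - 2 * u @ u.t())
    return weight

# Toy usage: adapt a 16x16 weight with r=4 reflections.
w_adapted = apply_householder_chain(torch.randn(16, 16), torch.randn(16, 4))
print(w_adapted.shape)  # torch.Size([16, 16])
```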