support HRA #1864

Merged — 23 commits merged into huggingface:main on Jul 16, 2024
Conversation

@DaShenZi721 (Contributor) opened this pull request.

@BenjaminBossan (Member) left a review:

Thanks a lot for this PR to add HRA to PEFT. This method looks very promising and I'm happy to add it to PEFT.

I took a look at the paper (which I have to admit was hard for me to understand) and the code, which already looks quite good.

I didn't start the CI yet, because there is currently an issue with PyTorch and numpy 2.0 which is hopefully resolved very soon. However, I ran some tests locally and found a few of them failing. Could you please check if you find the same when you run pytest tests/ -k hra locally?

I haven't checked all tests but it appears that at least some of them are related to the initialization. By default, you have init_weights=True, which leads to Kaiming uniform initialization. As a result, with an untrained HRA adapter, calling forward is not an identity transform. Is that understanding correct?

To get an identity transform, presumably we have to set init_weights=False. Note that this is the opposite of what we have for the other adapters like LoRA: There, if we set init_weights=True or init_lora_weights=True, it is the identity transform, i.e. at the start of the training, the adapter has no effect on the output.

There are three possible solutions:

  1. You change the default to init_weights=False, but this can be confusing to users, as the default for the other methods is True.
  2. You leave the default as is but for the failing tests, you have to set init_weights=False.
  3. You change the meaning such that init_weights=True leads to zero-initialization. This is more in line with what LoRA etc. do but probably does not conform to the paper.

Note that I tested setting init_weights=False to see if this leads to the identity transform but instead forward returned nan. So this needs to be checked too.
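To make the expectation concrete, here is a minimal sketch of the identity check I have in mind (i.e. what I would expect if `init_weights=True` meant identity/zero-initialization, as in option 3). The `HRAConfig` field names follow this PR, and the toy model and module names are made up for illustration:

```python
import torch
from torch import nn

from peft import HRAConfig, get_peft_model

torch.manual_seed(0)
base = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10))
x = torch.randn(4, 10)
out_base = base(x)

# With identity initialization, the untrained adapter should not change the output.
config = HRAConfig(r=8, target_modules=["0", "1"], init_weights=True)
model = get_peft_model(base, config)
out_peft = model(x)

print(torch.allclose(out_base, out_peft, atol=1e-6))  # expected: True
```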

Before I do a full in-depth review of the code, could you please also do the following:

  1. Add some documentation. It would be especially helpful to write a short, easy-to-digest description of the method and how it compares to other methods, especially LoRA/OFT/BOFT. Also, let's try to give a good description of the hyperparameters so that users know how to tweak them without having to read and understand the whole paper.
  2. It would be really nice if you could add an example to the examples/ folder. Maybe something based on the original repo that we can use to check that the PEFT implementation is faithful.
  3. Speaking of the original repo, if some of the code in this PR is a 1:1 copy of https://github.com/DaShenZi721/HRA, it can be helpful to mark this with code comments and perma-links.
  4. Please run make style on the PR.


@DaShenZi721 (Contributor, Author):

Hello, thank you for your suggestions! Here are the updates on the PR:

  1. Modified the initialization. Currently, if we set init_weights=True, HRA will be the identity transform, i.e. at the start of the training, the adapter has no effect on the output.
  2. Fixed some bugs, and now all test cases pass!
  3. Added an example: hra_dreambooth, following the oft and boft examples. We also included a simple README; you can try the example by following the instructions there!
  4. Ran make style.

Thank you for your review! If there are any issues, please let me know.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member):

Thanks @DaShenZi721 for the updates. Could you please run make style on the PR? Note that we updated the ruff version, so in your env, you may have to run python -m pip install -U ruff first.

@DaShenZi721 (Contributor, Author):

OK, I just updated ruff from 0.2.2 to 0.4.10 and ran make style, which modified two files.

@BenjaminBossan (Member) left a review:

Thanks a lot for the additions to the PR. It is much more complete now.

While reviewing, I found a few smaller issues. There is also a potentially bigger one, namely in the forward methods of the layers. Please check my comments, where I describe the details.

Also thanks for adding the examples. I haven't reviewed them yet, as I think we must first address the other issues. I'll get to that once the rest is sorted out.

@@ -0,0 +1,20 @@
# Copyright 2023-present the HuggingFace Inc. team.
BenjaminBossan (Member):

Please change all years to 2024.

DaShenZi721 (Contributor, Author):

Done.

src/peft/tuners/hra/config.py (resolved thread)
Whether to apply Gram-Schmidt orthogonalization.
"""

r: int = field(default=8, metadata={"help": "HRA rank"})
BenjaminBossan (Member):

IIUC, r should generally be an even number, is that right? If yes, could you please document that here and in the docstring?

DaShenZi721 (Contributor, Author):

Done.

"help": "The layer pattern name, used only if `layers_to_transform` is different to None and if the layer pattern is not in the common layers pattern."
},
)
fan_in_fan_out: bool = field(
BenjaminBossan (Member):

If I'm not missing something, this is never really used. If so, let's remove it.

DaShenZi721 (Contributor, Author):

Done.

tests/test_custom_models.py (resolved thread)
Comment on lines 218 to 229
if reverse:
    for i in range(rank - 1, -1, -1):
        unit_v = opt_u[:, i].view(-1, 1)
        weight = weight @ (
            torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
        )
else:
    for i in range(rank):
        unit_v = opt_u[:, i].view(-1, 1)
        weight = weight @ (
            torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
        )
BenjaminBossan (Member):

WDYT?

Suggested change:

if reverse:
    indices = range(rank - 1, -1, -1)
else:
    indices = range(rank)

for i in indices:
    unit_v = opt_u[:, i].view(-1, 1)
    weight = weight @ (
        torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
    )

DaShenZi721 (Contributor, Author):

Done.

Comment on lines 248 to 253
x = x.to(self.get_base_layer().weight.data.dtype)
orig_weight = self.get_base_layer().weight.data
delta_weight = self.get_delta_weight(active_adapter)
new_weight = torch.mm(orig_weight, delta_weight)

result = F.linear(input=x, weight=new_weight, bias=self.base_layer.bias)
BenjaminBossan (Member):

I think this code is not quite correct. Above, we calculate the output from the base layer: result = self.base_layer(x, *args, **kwargs). Here, we just completely override result. So not only did we calculate self.base_layer(x, *args, **kwargs) for nothing (in the case of a single active adapter), we also have an issue when there are multiple active adapters, as only the result from the last adapter would be returned, right?

Ideally, inside the loop, we should incrementally update result with the adapter activations, something like result = result + hra_result. Not sure if that works here though. Alternatively, inside the loop we would have to accumulate the delta weights for each adapter and then outside the loop calculate the whole output.

I was surprised that this was not caught by the tests but for that to happen, we need to add HRA also to the multi adapter tests here:

MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES = [

I added HRA to these tests and indeed some of them are failing. Could you please do that and then fix the error? Please do the same adjustments for conv2d.
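To illustrate, a rough sketch of the accumulation idea for the Linear forward (attribute names such as `hra_u` are assumptions based on this PR, not the exact final code):

```python
# Compose the delta weights of all active adapters, then do a single linear call,
# instead of overwriting `result` once per adapter.
new_weight = self.get_base_layer().weight.data
for active_adapter in self.active_adapters:
    if active_adapter not in self.hra_u.keys():
        continue
    delta_weight = self.get_delta_weight(active_adapter)
    new_weight = torch.mm(new_weight, delta_weight)  # chain the Householder reflections

result = F.linear(input=x, weight=new_weight, bias=self.base_layer.bias)
```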

DaShenZi721 (Contributor, Author):

Sorry, we didn't account for multiple active adapters. This bug has now been fixed, and we have added and passed the test cases you mentioned.

README.md (outdated diff)
@@ -1,5 +1,5 @@
 <!---
-Copyright 2023 The HuggingFace Team. All rights reserved.
+Copyright 2024 The HuggingFace Team. All rights reserved.
BenjaminBossan (Member):

It seems that you have replaced all the years in all files. This should not be done. Please revert this change. The year only needs to be updated in the files that you created for this PR.

DaShenZi721 (Contributor, Author):

I'm sorry...

@DaShenZi721 (Contributor, Author):

Hi! Did you find any other problems?

@BenjaminBossan (Member) left a review:

Thanks for the updates, good work. Overall, this looks pretty good and I only found a few minor issues that need to be addressed before we can merge -- please check my comments. Apart from those, could you please run make style?

examples/hra_dreambooth/README.md (several resolved review threads)

if __name__ == "__main__":
    args = parse_args()
    main(args)
BenjaminBossan (Member):

I followed the instructions and started training with ./train_dreambooth.sh 0 0. It ran fine for a while but then I got an error. Any idea what could be the reason?

Steps:   6%|██████████▍                                                                                                                                                                          | 98/1705 [00:42<09:11,  2.92it/s, loss=0.35, lr=0.001]GPU Memory before entering the train : 4796
GPU Memory consumed at the end of the train (end-begin): 21
GPU Peak Memory consumed during the train (max-begin): 7640
GPU Total Peak Memory consumed during the train (max): 12436
CPU Memory before entering the train : 3384
CPU Memory consumed at the end of the train (end-begin): 202
CPU Peak Memory consumed during the train (max-begin): 517
CPU Total Peak Memory consumed during the train (max): 3901
/opt/conda/conda-bld/pytorch_1716905969073/work/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [276,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[...]
/opt/conda/conda-bld/pytorch_1716905969073/work/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [96,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 925, in main
[rank1]:     encoder_hidden_states = text_encoder(batch["input_ids"])[0]
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank1]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank1]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank1]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/forks/peft/src/peft/peft_model.py", line 692, in forward
[rank1]:     return self.get_base_model()(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 806, in forward
[rank1]:     return self.text_model(
[rank1]:            ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 710, in forward
[rank1]:     encoder_outputs = self.encoder(
[rank1]:                       ^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 637, in forward
[rank1]:     layer_outputs = encoder_layer(
[rank1]:                     ^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 384, in forward
[rank1]:     hidden_states = self.mlp(hidden_states)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 339, in forward
[rank1]:     hidden_states = self.fc1(hidden_states)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 116, in forward
[rank1]:     return F.linear(input, self.weight, self.bias)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank1]: During handling of the above exception, another exception occurred:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 1083, in <module>
[rank1]:     main(args)
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 896, in main
[rank1]:     with TorchTracemalloc() if not args.no_tracemalloc else nullcontext() as tracemalloc:
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 443, in __exit__
[rank1]:     torch.cuda.empty_cache()
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/cuda/memory.py", line 162, in empty_cache
[rank1]:     torch._C._cuda_emptyCache()
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank1]:[W CUDAGuardImpl.h:118] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905969073/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7a9a6ca93897 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7a9a6ca43b25 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7a9a6cb6d718 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d8d6 (0x7a9a6cb388d6 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7a9a6cb3a5e3 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7a9a6cb3a922 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a7380 (0x7a9a635a7380 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7a9a6ca7836f in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7a9a6ca711cb in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7a9a6ca71379 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #10: c10d::Reducer::~Reducer() + 0x5c4 (0x7a9a5bdec3b4 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7a9a63cf2402 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7a9a63471268 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xcf5eb1 (0x7a9a63cf5eb1 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x47c8e3 (0x7a9a6347c8e3 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x47d861 (0x7a9a6347d861 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #16: /home/name/anaconda3/envs/peft/bin/python() [0x502cba]
frame #17: /home/name/anaconda3/envs/peft/bin/python() [0x55ed0e]
frame #18: /home/name/anaconda3/envs/peft/bin/python() [0x540ea5]
frame #19: /home/name/anaconda3/envs/peft/bin/python() [0x53f8d8]
frame #20: /home/name/anaconda3/envs/peft/bin/python() [0x53f914]
frame #21: /home/name/anaconda3/envs/peft/bin/python() [0x4f72cb]
frame #22: PyDict_SetItemString + 0xab (0x4fad1b in /home/name/anaconda3/envs/peft/bin/python)
frame #23: /home/name/anaconda3/envs/peft/bin/python() [0x5fc824]
frame #24: Py_FinalizeEx + 0x143 (0x5eb603 in /home/name/anaconda3/envs/peft/bin/python)
frame #25: Py_RunMain + 0x193 (0x5f6da3 in /home/name/anaconda3/envs/peft/bin/python)
frame #26: Py_BytesMain + 0x39 (0x5bbc79 in /home/name/anaconda3/envs/peft/bin/python)
frame #27: <unknown function> + 0x2a1ca (0x7a9a7ee2a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x8b (0x7a9a7ee2a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: /home/name/anaconda3/envs/peft/bin/python() [0x5bbac3]

Steps:   6%|██████████▌                                                                                                                                                                          | 99/1705 [00:42<12:33,  2.13it/s, loss=0.35, lr=0.001]W0701 15:36:00.773000 123747233916416 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 96374 closing signal SIGTERM
E0701 15:36:00.887000 123747233916416 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 96375) of binary: /home/name/anaconda3/envs/peft/bin/python
Traceback (most recent call last):
  File "/home/name/anaconda3/envs/peft/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
train_dreambooth.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-01_15:36:00
  host      : name-machine
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 96375)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 96375
======================================================

DaShenZi721 (Contributor, Author):

I've encountered this issue as well, but it disappeared after retraining. You could try retraining to see if it resolves the problem.

BenjaminBossan (Member):

Indeed, I had to delete the logs directory and then, after retraining, it works. It would be good to add a comment about this for users who encounter the same problem -- or, even better, to fix the problem :)

src/peft/tuners/hra/layer.py (resolved thread)
tests/test_torch_compile.py (resolved thread)
tests/test_common_gpu.py (resolved thread)
src/peft/tuners/hra/model.py (resolved thread)
@BenjaminBossan (Member) left a review:

This PR looks great, we're almost there. Just some very small comments from my side, please take a look.

@@ -1,4 +1,4 @@
-# Copyright 2023-present the HuggingFace Inc. team.
+# Copyright 2024-present the HuggingFace Inc. team.
BenjaminBossan (Member):

Please undo.

DaShenZi721 (Contributor, Author):

Done.

Comment on lines 1119 to 1120
from datasets import load_dataset
from transformers import AutoImageProcessor, AutoModelForImageClassification
BenjaminBossan (Member):

These imports can be at module level; there is no need to make them local.

DaShenZi721 (Contributor, Author):

Done.



Comment on lines 188 to 190
echo "BASE_MODEL_NAME = $MODEL_NAME"
echo "ADAPTER_MODEL_PATH = $OUTPUT_DIR"
echo "PROMPT = $instance_prompt"
BenjaminBossan (Member):

Just a suggestion, this way users can copy&paste into the notebook without having to edit the code.

Suggested change:

echo "BASE_MODEL_NAME = '$MODEL_NAME'"
echo "ADAPTER_MODEL_PATH = '$OUTPUT_DIR'"
echo "PROMPT = '$instance_prompt'"

DaShenZi721 (Contributor, Author):

Done.

def __getattr__(self, name: str):
    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
@BenjaminBossan (Member) commented on Jul 3, 2024:

Just a small thing: we just merged a small bug fix for __getattr__: #1892. Could you please apply the fix here too?

            if name == "model":  # see #1892: prevent infinite recursion if class is not initialized
                raise
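
For context, the method with the fix applied would look roughly like this (the `except` branch mirrors the pattern used by the other PEFT tuners; treat it as a sketch rather than a verbatim copy of #1892):

```python
def __getattr__(self, name: str):
    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
    except AttributeError:
        if name == "model":  # see #1892: prevent infinite recursion if class is not initialized
            raise
        return getattr(self.model, name)
```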

BenjaminBossan (Member):

There is also a merge conflict in test_common_gpu.py, which should hopefully be easy to resolve.

DaShenZi721 and others added 4 commits July 8, 2024 21:47

@DaShenZi721 (Contributor, Author):

We have applied the fix from #1892 and resolved the conflicts!
We also updated the hra_dreambooth example to generate higher-quality images!

@BenjaminBossan (Member):

@DaShenZi721 Thanks for the updates. We just merged another PR, which resulted in many merge conflicts, but they should be very straightforward to resolve. Could you please take a look?

I'll depart soon to the EuroPython conference, so unfortunately, I'll not be able to review the PR this week.

@DaShenZi721 (Contributor, Author):

We have resolved the conflicts.

@BenjaminBossan (Member) left a review:

The code and the recent changes all look good. I had a fresh look at the training script and actually found a couple of issues when trying not to use wandb for logging. Maybe these issues are already in the original code, not sure, but please take a look. It would be nice if users could easily switch the logger or set no logger at all. Besides this, the PR is good to go.


```
hra_dreambooth
├── data
```
BenjaminBossan (Member):

There is no data directory, right?

DaShenZi721 (Contributor, Author):

The data directory is created automatically when running train_dreambooth.sh.

Several resolved review threads on examples/hra_dreambooth/train_dreambooth.py and examples/hra_dreambooth/utils/args_loader.py.
@BenjaminBossan (Member) left a review:

Thanks for fixing the training script, it now works well for me without wandb. And also big thanks for adding HRA to PEFT. I'll merge the PR once CI is green.

@BenjaminBossan merged commit 5268495 into huggingface:main on Jul 16, 2024
14 checks passed
@DaShenZi721 (Contributor, Author):

Thank you so much for your careful and responsible guidance!

@BenjaminBossan (Member):

Hi @DaShenZi721, I noticed that one of our unit tests is particularly slow and it involves HRA. The test is test_model_with_batchnorm_reproducibility in tests/test_vision_models.py. On CI, this test takes > 500 sec. Even locally on GPU, it takes > 70 sec for me. I did some quick tests and it's the forward and backward passes that take most of the time. The same test with LoRA takes just 7 sec locally.

I wonder if you have some ideas what the exact reason could be and if we could speed up HRA, which would also help with adoption. Otherwise, I'll probably just remove this test.
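
If it helps, here is a rough micro-benchmark sketch for narrowing this down (the target_modules pattern, batch size, and step count are illustrative assumptions, not the exact unit test):

```python
import time

import torch
from transformers import AutoModelForImageClassification

from peft import HRAConfig, LoraConfig, get_peft_model


def time_forward_backward(config, steps=10):
    model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-18")
    model = get_peft_model(model, config)
    x = torch.randn(4, 3, 224, 224)
    start = time.perf_counter()
    for _ in range(steps):
        out = model(pixel_values=x).logits
        out.sum().backward()
    return time.perf_counter() - start


# "convolution" is assumed to match the conv layers of the transformers ResNet implementation.
target_modules = ["convolution"]
print("HRA :", time_forward_backward(HRAConfig(r=8, target_modules=target_modules)))
print("LoRA:", time_forward_backward(LoraConfig(r=8, target_modules=target_modules)))
```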

@DaShenZi721 (Contributor, Author):

Sorry, I missed this message... I will now look into how to address this issue.

@DaShenZi721 (Contributor, Author):

I wasn't able to reproduce the issue. Has it already been resolved?

@BenjaminBossan (Member):

> I wasn't able to reproduce the issue. Has it already been resolved?

I switched the underlying model which is probably why you can't reproduce now. From that point of view, this is no longer urgent. I think it could still be worth investigating why microsoft/resnet-18 was so slow.

@DaShenZi721 (Contributor, Author):

OK, I'll check it out.
