support HRA #1864

Merged — 23 commits merged into huggingface:main on Jul 16, 2024
Conversation

@DaShenZi721 (Contributor) opened this pull request.

@BenjaminBossan (Member) left a review:

Thanks a lot for this PR to add HRA to PEFT. This method looks very promising and I'm happy to add it to PEFT.

I took a look at the paper (which I have to admit was hard for me to understand) and the code, which already looks quite good.

I didn't start the CI yet, because there is currently an issue with PyTorch and numpy 2.0 which is hopefully resolved very soon. However, I ran some tests locally and found a few of them failing. Could you please check if you find the same when you run pytest tests/ -k hra locally?

I haven't checked all tests but it appears that at least some of them are related to the initialization. By default, you have init_weights=True, which leads to Kaiming uniform initialization. As a result, with an untrained HRA adapter, calling forward is not an identity transform. Is that understanding correct?

To get an identity transform, presumably we have to set init_weights=False. Note that this is the opposite of what we have for the other adapters like LoRA: There, if we set init_weights=True or init_lora_weights=True, it is the identity transform, i.e. at the start of the training, the adapter has no effect on the output.

There are three possible solutions:

  1. You change the default to init_weights=False, but this can be confusing to users, as the default for the other methods is True.
  2. You leave the default as is but for the failing tests, you have to set init_weights=False.
  3. You change the meaning such that init_weights=True leads to zero-initialization. This is more in line with what LoRA etc. do but probably does not conform to the paper.

Note that I tested setting init_weights=False to see if this leads to the identity transform but instead forward returned nan. So this needs to be checked too.
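To make the expectation concrete, here is a minimal sketch of the identity check I have in mind (i.e. what I would expect if `init_weights=True` meant identity/zero-initialization, as in option 3). The `HRAConfig` field names follow this PR, and the toy model and module names are made up for illustration:

```python
import torch
from torch import nn

from peft import HRAConfig, get_peft_model

torch.manual_seed(0)
base = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10))
x = torch.randn(4, 10)
out_base = base(x)

# With identity initialization, the untrained adapter should not change the output.
config = HRAConfig(r=8, target_modules=["0", "1"], init_weights=True)
model = get_peft_model(base, config)
out_peft = model(x)

print(torch.allclose(out_base, out_peft, atol=1e-6))  # expected: True
```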

Before I do a full in-depth review of the code, could you please also do the following:

  1. Add some documentation. It would be especially helpful to write a short, easy-to-digest description of the method and how it compares to other methods, especially LoRA/OFT/BOFT. Also, let's try to give a good description of the hyperparameters so that users know how to tweak them without having to read and understand the whole paper.
  2. It would be really nice if you could add an example to the examples/ folder. Maybe something based on the original repo that we can use to check that the PEFT implementation is faithful.
  3. Speaking of the original repo, if some of the code in this PR is a 1:1 copy of https://github.com/DaShenZi721/HRA, it can be helpful to mark this with code comments and perma-links.
  4. Please run make style on the PR.


@DaShenZi721 (Contributor, Author):

Hello, thank you for your suggestions! Here are the updates on the PR:

  1. Modified the initialization. Currently, if we set init_weights=True, HRA will be the identity transform, i.e. at the start of the training, the adapter has no effect on the output.
  2. Fixed some bugs, and now all test cases pass!
  3. Added an example: hra_dreambooth, following the oft and boft examples. We also included a simple README; you can try the example by following the instructions there!
  4. Ran make style.

Thank you for your review! If there are any issues, please let me know.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member):

Thanks @DaShenZi721 for the updates. Could you please run make style on the PR? Note that we updated the ruff version, so in your env, you may have to run python -m pip install -U ruff first.

@DaShenZi721 (Contributor, Author):

OK, I just updated ruff from 0.2.2 to 0.4.10 and ran make style, which modified two files.

@BenjaminBossan (Member) left a review:

Thanks a lot for the additions to the PR. It is much more complete now.

While reviewing, I found a few smaller issues. There is also a potentially bigger one, namely in the forward methods of the layers. Please check my comments, where I describe the details.

Also thanks for adding the examples. I haven't reviewed them yet, as I think we must first address the other issues. I'll get to that once the rest is sorted out.

@@ -0,0 +1,20 @@
# Copyright 2023-present the HuggingFace Inc. team.
BenjaminBossan (Member):

Please change all years to 2024.

DaShenZi721 (Contributor, Author):

Done.

src/peft/tuners/hra/config.py (resolved thread)
Whether to apply Gram-Schmidt orthogonalization.
"""

r: int = field(default=8, metadata={"help": "HRA rank"})
BenjaminBossan (Member):

IIUC, r should generally be an even number, is that right? If yes, could you please document that here and in the docstring?

DaShenZi721 (Contributor, Author):

Done.

"help": "The layer pattern name, used only if `layers_to_transform` is different to None and if the layer pattern is not in the common layers pattern."
},
)
fan_in_fan_out: bool = field(
BenjaminBossan (Member):

If I'm not missing something, this is never really used. If so, let's remove it.

DaShenZi721 (Contributor, Author):

Done.

tests/test_custom_models.py (resolved thread)
Comment on lines 218 to 229
if reverse:
    for i in range(rank - 1, -1, -1):
        unit_v = opt_u[:, i].view(-1, 1)
        weight = weight @ (
            torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
        )
else:
    for i in range(rank):
        unit_v = opt_u[:, i].view(-1, 1)
        weight = weight @ (
            torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
        )
BenjaminBossan (Member):

WDYT?

Suggested change:

if reverse:
    indices = range(rank - 1, -1, -1)
else:
    indices = range(rank)

for i in indices:
    unit_v = opt_u[:, i].view(-1, 1)
    weight = weight @ (
        torch.eye(shape[0], device=opt_u.device, dtype=opt_u.dtype) - 2 * unit_v @ unit_v.t()
    )

DaShenZi721 (Contributor, Author):

Done.

Comment on lines 248 to 253
x = x.to(self.get_base_layer().weight.data.dtype)
orig_weight = self.get_base_layer().weight.data
delta_weight = self.get_delta_weight(active_adapter)
new_weight = torch.mm(orig_weight, delta_weight)

result = F.linear(input=x, weight=new_weight, bias=self.base_layer.bias)
BenjaminBossan (Member):

I think this code is not quite correct. Above, we calculate the output from the base layer: result = self.base_layer(x, *args, **kwargs). Here, we just completely override result. So not only did we calculate self.base_layer(x, *args, **kwargs) for nothing (in the case of a single active adapter), we also have an issue when there are multiple active adapters, as only the result from the last adapter would be returned, right?

Ideally, inside the loop, we should incrementally update result with the adapter activations, something like result = result + hra_result. Not sure if that works here though. Alternatively, inside the loop we would have to accumulate the delta weights for each adapter and then outside the loop calculate the whole output.

I was surprised that this was not caught by the tests but for that to happen, we need to add HRA also to the multi adapter tests here:

MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES = [

I added HRA to these tests and indeed some of them are failing. Could you please do that and then fix the error? Please do the same adjustments for conv2d.
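To illustrate, a rough sketch of the accumulation idea for the Linear forward (attribute names such as `hra_u` are assumptions based on this PR, not the exact final code):

```python
# Compose the delta weights of all active adapters, then do a single linear call,
# instead of overwriting `result` once per adapter.
new_weight = self.get_base_layer().weight.data
for active_adapter in self.active_adapters:
    if active_adapter not in self.hra_u.keys():
        continue
    delta_weight = self.get_delta_weight(active_adapter)
    new_weight = torch.mm(new_weight, delta_weight)  # chain the Householder reflections

result = F.linear(input=x, weight=new_weight, bias=self.base_layer.bias)
```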

DaShenZi721 (Contributor, Author):

Sorry, we didn't account for multiple active adapters. This bug has now been fixed, and we have added and passed the test cases you mentioned.

README.md (outdated diff)
@@ -1,5 +1,5 @@
 <!---
-Copyright 2023 The HuggingFace Team. All rights reserved.
+Copyright 2024 The HuggingFace Team. All rights reserved.
BenjaminBossan (Member):

It seems that you have replaced all the years in all files. This should not be done. Please revert this change. The year only needs to be updated in the files that you created for this PR.

DaShenZi721 (Contributor, Author):

I'm sorry...

@DaShenZi721 (Contributor, Author):

Hi! Did you find any other problems?

@BenjaminBossan (Member) left a review:

Thanks for the updates, good work. Overall, this looks pretty good and I only found a few minor issues that need to be addressed before we can merge -- please check my comments. Apart from those, could you please run make style?

examples/hra_dreambooth/README.md (several resolved review threads)

if __name__ == "__main__":
    args = parse_args()
    main(args)
BenjaminBossan (Member):

I followed the instructions and started training with ./train_dreambooth.sh 0 0. It ran fine for a while but then I got an error. Any idea what could be the reason?

Steps:   6%|██████████▍                                                                                                                                                                          | 98/1705 [00:42<09:11,  2.92it/s, loss=0.35, lr=0.001]GPU Memory before entering the train : 4796
GPU Memory consumed at the end of the train (end-begin): 21
GPU Peak Memory consumed during the train (max-begin): 7640
GPU Total Peak Memory consumed during the train (max): 12436
CPU Memory before entering the train : 3384
CPU Memory consumed at the end of the train (end-begin): 202
CPU Peak Memory consumed during the train (max-begin): 517
CPU Total Peak Memory consumed during the train (max): 3901
/opt/conda/conda-bld/pytorch_1716905969073/work/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [276,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[...]
/opt/conda/conda-bld/pytorch_1716905969073/work/aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [96,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 925, in main
[rank1]:     encoder_hidden_states = text_encoder(batch["input_ids"])[0]
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank1]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank1]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank1]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/forks/peft/src/peft/peft_model.py", line 692, in forward
[rank1]:     return self.get_base_model()(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 806, in forward
[rank1]:     return self.text_model(
[rank1]:            ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 710, in forward
[rank1]:     encoder_outputs = self.encoder(
[rank1]:                       ^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 637, in forward
[rank1]:     layer_outputs = encoder_layer(
[rank1]:                     ^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 384, in forward
[rank1]:     hidden_states = self.mlp(hidden_states)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 339, in forward
[rank1]:     hidden_states = self.fc1(hidden_states)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 116, in forward
[rank1]:     return F.linear(input, self.weight, self.bias)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank1]: During handling of the above exception, another exception occurred:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 1083, in <module>
[rank1]:     main(args)
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 896, in main
[rank1]:     with TorchTracemalloc() if not args.no_tracemalloc else nullcontext() as tracemalloc:
[rank1]:   File "/home/name/work/forks/peft/examples/hra_dreambooth/train_dreambooth.py", line 443, in __exit__
[rank1]:     torch.cuda.empty_cache()
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/cuda/memory.py", line 162, in empty_cache
[rank1]:     torch._C._cuda_emptyCache()
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank1]:[W CUDAGuardImpl.h:118] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905969073/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7a9a6ca93897 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7a9a6ca43b25 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7a9a6cb6d718 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d8d6 (0x7a9a6cb388d6 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7a9a6cb3a5e3 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7a9a6cb3a922 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a7380 (0x7a9a635a7380 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7a9a6ca7836f in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7a9a6ca711cb in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7a9a6ca71379 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #10: c10d::Reducer::~Reducer() + 0x5c4 (0x7a9a5bdec3b4 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7a9a63cf2402 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7a9a63471268 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xcf5eb1 (0x7a9a63cf5eb1 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x47c8e3 (0x7a9a6347c8e3 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x47d861 (0x7a9a6347d861 in /home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #16: /home/name/anaconda3/envs/peft/bin/python() [0x502cba]
frame #17: /home/name/anaconda3/envs/peft/bin/python() [0x55ed0e]
frame #18: /home/name/anaconda3/envs/peft/bin/python() [0x540ea5]
frame #19: /home/name/anaconda3/envs/peft/bin/python() [0x53f8d8]
frame #20: /home/name/anaconda3/envs/peft/bin/python() [0x53f914]
frame #21: /home/name/anaconda3/envs/peft/bin/python() [0x4f72cb]
frame #22: PyDict_SetItemString + 0xab (0x4fad1b in /home/name/anaconda3/envs/peft/bin/python)
frame #23: /home/name/anaconda3/envs/peft/bin/python() [0x5fc824]
frame #24: Py_FinalizeEx + 0x143 (0x5eb603 in /home/name/anaconda3/envs/peft/bin/python)
frame #25: Py_RunMain + 0x193 (0x5f6da3 in /home/name/anaconda3/envs/peft/bin/python)
frame #26: Py_BytesMain + 0x39 (0x5bbc79 in /home/name/anaconda3/envs/peft/bin/python)
frame #27: <unknown function> + 0x2a1ca (0x7a9a7ee2a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x8b (0x7a9a7ee2a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: /home/name/anaconda3/envs/peft/bin/python() [0x5bbac3]

Steps:   6%|██████████▌                                                                                                                                                                          | 99/1705 [00:42<12:33,  2.13it/s, loss=0.35, lr=0.001]W0701 15:36:00.773000 123747233916416 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 96374 closing signal SIGTERM
E0701 15:36:00.887000 123747233916416 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 96375) of binary: /home/name/anaconda3/envs/peft/bin/python
Traceback (most recent call last):
  File "/home/name/anaconda3/envs/peft/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
train_dreambooth.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-01_15:36:00
  host      : name-machine
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 96375)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 96375
======================================================

DaShenZi721 (Contributor, Author):

I've encountered this issue as well, but it disappeared after retraining. You could try retraining to see if it resolves the problem.

BenjaminBossan (Member):

Indeed, I had to delete the logs directory and then, after retraining, it works. It would be good to add a comment about this for users who encounter the same problem -- or, even better, to fix the problem :)

src/peft/tuners/hra/layer.py (resolved thread)
tests/test_torch_compile.py (resolved thread)
tests/test_common_gpu.py (resolved thread)
src/peft/tuners/hra/model.py (resolved thread)
@BenjaminBossan (Member) left a review:

This PR looks great, we're almost there. Just some very small comments from my side, please take a look.

@@ -1,4 +1,4 @@
-# Copyright 2023-present the HuggingFace Inc. team.
+# Copyright 2024-present the HuggingFace Inc. team.
BenjaminBossan (Member):

Please undo.

DaShenZi721 (Contributor, Author):

Done.

Comment on lines 1119 to 1120
from datasets import load_dataset
from transformers import AutoImageProcessor, AutoModelForImageClassification
BenjaminBossan (Member):

These imports can be at module level; there is no need to make them local.

DaShenZi721 (Contributor, Author):

Done.



Comment on lines 188 to 190
echo "BASE_MODEL_NAME = $MODEL_NAME"
echo "ADAPTER_MODEL_PATH = $OUTPUT_DIR"
echo "PROMPT = $instance_prompt"
BenjaminBossan (Member):

Just a suggestion, this way users can copy&paste into the notebook without having to edit the code.

Suggested change:

echo "BASE_MODEL_NAME = '$MODEL_NAME'"
echo "ADAPTER_MODEL_PATH = '$OUTPUT_DIR'"
echo "PROMPT = '$instance_prompt'"

DaShenZi721 (Contributor, Author):

Done.

def __getattr__(self, name: str):
    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
@BenjaminBossan (Member) commented on Jul 3, 2024:

Just a small thing: we just merged a small bug fix for __getattr__: #1892. Could you please apply the fix here too?

            if name == "model":  # see #1892: prevent infinite recursion if class is not initialized
                raise
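
For context, the method with the fix applied would look roughly like this (the `except` branch mirrors the pattern used by the other PEFT tuners; treat it as a sketch rather than a verbatim copy of #1892):

```python
def __getattr__(self, name: str):
    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
    except AttributeError:
        if name == "model":  # see #1892: prevent infinite recursion if class is not initialized
            raise
        return getattr(self.model, name)
```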

BenjaminBossan (Member):

There is also a merge conflict in test_common_gpu.py, which should hopefully be easy to resolve.

DaShenZi721 and others added 4 commits July 8, 2024 21:47

@DaShenZi721 (Contributor, Author):

We have applied the fix from #1892 and resolved the conflicts!
We also updated the hra_dreambooth example to generate higher-quality images!

@BenjaminBossan (Member):

@DaShenZi721 Thanks for the updates. We just merged another PR, which resulted in many merge conflicts, but they should be very straightforward to resolve. Could you please take a look?

I'll depart soon to the EuroPython conference, so unfortunately, I'll not be able to review the PR this week.

@DaShenZi721 (Contributor, Author):

We have resolved the conflicts.

@BenjaminBossan (Member) left a review:

The code and the recent changes all look good. I had a fresh look at the training script and actually found a couple of issues when trying not to use wandb for logging. Maybe these issues are already in the original code, not sure, but please take a look. It would be nice if users could easily switch the logger or set no logger at all. Besides this, the PR is good to go.


```
hra_dreambooth
├── data
```
BenjaminBossan (Member):

There is no data directory, right?

DaShenZi721 (Contributor, Author):

The data directory is created automatically when running train_dreambooth.sh.

Several resolved review threads on examples/hra_dreambooth/train_dreambooth.py and examples/hra_dreambooth/utils/args_loader.py.
@BenjaminBossan (Member) left a review:

Thanks for fixing the training script, it now works well for me without wandb. And also big thanks for adding HRA to PEFT. I'll merge the PR once CI is green.

@BenjaminBossan merged commit 5268495 into huggingface:main on Jul 16, 2024
14 checks passed
@DaShenZi721 (Contributor, Author):

Thank you so much for your careful and responsible guidance!

@BenjaminBossan (Member):

Hi @DaShenZi721, I noticed that one of our unit tests is particularly slow and it involves HRA. The test is test_model_with_batchnorm_reproducibility in tests/test_vision_models.py. On CI, this test takes > 500 sec. Even locally on GPU, it takes > 70 sec for me. I did some quick tests and it's the forward and backward passes that take most of the time. The same test with LoRA takes just 7 sec locally.

I wonder if you have some ideas what the exact reason could be and if we could speed up HRA, which would also help with adoption. Otherwise, I'll probably just remove this test.
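
If it helps, here is a rough micro-benchmark sketch for narrowing this down (the target_modules pattern, batch size, and step count are illustrative assumptions, not the exact unit test):

```python
import time

import torch
from transformers import AutoModelForImageClassification

from peft import HRAConfig, LoraConfig, get_peft_model


def time_forward_backward(config, steps=10):
    model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-18")
    model = get_peft_model(model, config)
    x = torch.randn(4, 3, 224, 224)
    start = time.perf_counter()
    for _ in range(steps):
        out = model(pixel_values=x).logits
        out.sum().backward()
    return time.perf_counter() - start


# "convolution" is assumed to match the conv layers of the transformers ResNet implementation.
target_modules = ["convolution"]
print("HRA :", time_forward_backward(HRAConfig(r=8, target_modules=target_modules)))
print("LoRA:", time_forward_backward(LoraConfig(r=8, target_modules=target_modules)))
```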

@DaShenZi721 (Contributor, Author):

Sorry, I missed this message... I will now look into how to address this issue.

@DaShenZi721 (Contributor, Author):

I wasn't able to reproduce the issue. Has it already been resolved?

@BenjaminBossan (Member):

> I wasn't able to reproduce the issue. Has it already been resolved?

I switched the underlying model which is probably why you can't reproduce now. From that point of view, this is no longer urgent. I think it could still be worth investigating why microsoft/resnet-18 was so slow.

@DaShenZi721 (Contributor, Author):

OK, I'll check it out.
