FIX: Issue with params that have to be ignored for DDP #900
Conversation
Resolves huggingface#899

Description

When using PyTorch DistributedDataParallel, there is an error if DDP wraps parameters that do not participate in the calculation of the loss. This is annoying, as we often have those types of parameters, e.g.:

- ModulesToSaveWrapper has a copy of the original weights
- When using multiple adapters, only one of them is active

This PR aims to fix this issue by exposing an attribute (actually: a property) called `_ddp_params_and_buffers_to_ignore`. DDP uses this attribute to ignore the given parameters, making the error disappear.

Implementation

Please let me know if there is a better way to implement this. The current solution has a few drawbacks:

- Besides having to do this in the first place, it is also annoying that we have to implement it for all tuner layers, as they can differ in which parameters need to be ignored (we don't have a consistent naming scheme).
- Our unit tests cannot cover this directly, as DDP requires a multi-GPU setup. Instead, the test only checks the existence, and correct value, of the `_ddp_params_and_buffers_to_ignore` attribute. Writing a proper DDP test would require more effort but should be feasible.
- We also rely on a private attribute here, so there is a danger of this breaking with future PyTorch releases.

DONE/TODO:

- [x] ModulesToSaveWrapper
- [x] LoRA
- [ ] IA³
- [ ] AdaLora
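For context, here is a minimal, self-contained sketch (not PEFT code) of the mechanism this PR relies on: DDP reads a list of fully qualified parameter/buffer names from the wrapped module's `_ddp_params_and_buffers_to_ignore` attribute and excludes them from gradient synchronization. The toy module and the parameter names below are made up purely for illustration.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    """Stand-in for a PEFT model: `unused` mimics an inactive adapter or the
    original-weight copy kept by ModulesToSaveWrapper."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(10, 10)
        self.unused = nn.Linear(10, 10)  # requires_grad=True but never used in forward

    def forward(self, x):
        return self.base(x)


def main():
    # single-process "distributed" setup, just to make the example runnable
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = ToyModel()
    # fully qualified names of the params/buffers DDP should skip; without this,
    # DDP complains during training that some trainable parameters never
    # received gradients
    model._ddp_params_and_buffers_to_ignore = ["unused.weight", "unused.bias"]

    ddp_model = DDP(model)
    loss = ddp_model(torch.randn(4, 10)).sum()
    loss.backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```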
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
This is an alternative to huggingface#900, resolves huggingface#899.

Description

Currently, we don't handle setting requires_grad on adapter layers really well. The main issue is that it can be set to True on adapter parameters that are not being used, e.g. the original_module in ModulesToSaveWrapper or inactive adapters in LoRA. Normally, this is not a big issue, except maybe if we want to correctly count the number of trainable parameters. However, when training with DistributedDataParallel, this results in errors, as PyTorch thinks that all parameters with requires_grad=True should participate in the loss computation, but those mentioned parameters don't. For that reason, training with DDP currently fails when using modules_to_save or multiple adapters.

Implementation

This turned out to be more complicated than I initially thought. The logic for setting requires_grad is all over the place, and it was hard to encapsulate; I only succeeded partially. As is, this PR is more complex than the one it tries to supersede, huggingface#900, but it is also "more correct".

Tests were added to check whether requires_grad is set correctly. There are (so far) no tests for whether DDP indeed works; they could be added with multi-GPU. I did, however, test an early stage of this PR with DDP, and setting requires_grad correctly will indeed fix the DDP error.

DONE/TODO

- [x] ModulesToSaveWrapper
- [x] LoRA
- [ ] IA³
- [ ] AdaLora

Since some tuners are not implemented yet, tests are expected to fail. Check the new tests at the bottom of test_custom.py; those should pass.
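The core idea of this alternative can be illustrated with a small, hypothetical helper (this is not the actual PEFT implementation; the name-matching heuristic is an assumption based on how PEFT typically embeds the adapter name in parameter names, e.g. `lora_A.default.weight`):

```python
import torch.nn as nn


def enable_grads_for_active_adapter(model: nn.Module, adapter_names: list[str], active_adapter: str) -> None:
    """Hypothetical helper: only the active adapter's parameters stay trainable."""
    for name, param in model.named_parameters():
        parts = name.split(".")
        if not any(adapter in parts for adapter in adapter_names):
            continue  # not an adapter parameter, leave requires_grad untouched
        # inactive adapters get requires_grad=False, so DDP never expects
        # gradients for them
        param.requires_grad = active_adapter in parts
```

With an invariant like this in place, DDP never sees trainable parameters that cannot receive gradients, which is exactly what the error complains about.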
Thanks a lot @BenjaminBossan for this very elegant way of solving the issue you have stated.

IMO we can push that further and make it more future-proof if we don't use `getattr`. I think `getattr` with `[]` passed as the last argument can silently lead to always getting empty lists; what about adding an abstract property inside `BaseTunerLayer`, so that we could safely just use `if isinstance(module, BaseTunerLayer)` as the condition to retrieve the params and buffers to ignore for that module. What do you think?
```python
# avoid infinite recursion
continue

module_params_and_buffers_to_ignore = getattr(module, "_ddp_params_and_buffers_to_ignore", [])
```
```diff
- module_params_and_buffers_to_ignore = getattr(module, "_ddp_params_and_buffers_to_ignore", [])
+ if isinstance(module, BaseTunerLayer):
+     module_params_and_buffers_to_ignore = module._ddp_params_and_buffers_to_ignore
```
Then, inside the if block, perform the for loop.
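A rough sketch of what that proposal could look like; the simplified classes and the parameter-name scheme below are illustrative assumptions, not PEFT's actual `BaseTunerLayer`:

```python
from abc import ABC, abstractmethod


class BaseTunerLayer(ABC):
    """Simplified stand-in for PEFT's tuner-layer mixin."""

    @property
    @abstractmethod
    def _ddp_params_and_buffers_to_ignore(self) -> list[str]:
        """Names of this layer's params/buffers that DDP should skip."""


class FakeLoraLayer(BaseTunerLayer):
    """Hypothetical LoRA-style layer: everything belonging to inactive adapters is ignored."""

    def __init__(self, adapter_names: list[str], active_adapter: str):
        self.adapter_names = adapter_names
        self.active_adapter = active_adapter

    @property
    def _ddp_params_and_buffers_to_ignore(self) -> list[str]:
        ignored = []
        for name in self.adapter_names:
            if name == self.active_adapter:
                continue
            ignored.extend([f"lora_A.{name}.weight", f"lora_B.{name}.weight"])
        return ignored


# With the abstract property in place, the model-level loop can rely on the
# isinstance check instead of getattr:
#     if isinstance(module, BaseTunerLayer):
#         for name in module._ddp_params_and_buffers_to_ignore:
#             ...
```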
* [WIP] Fix setting requires_grad on adapter layers: an alternative to #900, resolves #899 (the full description is quoted above)
* Refactor: move more requires_grad machinery to ABC
* [skip ci] [WIP] Add requires_grad logic to IA³
* Add AdaLora
* Fix some minor issues
* Make style
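The superseding PR mentions requires_grad tests at the bottom of test_custom.py. As a rough illustration of the kind of assertion such a test might make (not the actual test code; the attribute names `original_module` and `modules_to_save` are assumptions based on PEFT's ModulesToSaveWrapper):

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin0 = nn.Linear(10, 20)
        self.lin1 = nn.Linear(20, 2)

    def forward(self, x):
        return self.lin1(self.lin0(x))


def test_requires_grad_modules_to_save():
    config = LoraConfig(target_modules=["lin0"], modules_to_save=["lin1"])
    model = get_peft_model(MLP(), config)

    for name, param in model.named_parameters():
        if "original_module" in name:
            # the frozen copy kept by ModulesToSaveWrapper must not require grads
            assert not param.requires_grad, name
        elif "modules_to_save" in name:
            # the trainable copy must require grads
            assert param.requires_grad, name
```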
This PR is superseded by #905; I forgot to close it.