FIX: setting requires_grad on adapter layers #905
Conversation
This is an alternative to #900, resolves #899.

Thanks @passaglia for figuring out the underlying issue.

Description

Currently, we don't handle setting `requires_grad` on adapter layers very well. The main issue is that it can be set to `True` on adapter parameters that are not being used, e.g. the `original_module` in `ModulesToSaveWrapper` or inactive adapters in LoRA.

Normally, this is not a big issue, except maybe if we want to correctly count the number of trainable parameters. However, when training with `DistributedDataParallel`, this results in errors, as PyTorch expects all parameters with `requires_grad=True` to participate in the loss computation, but the parameters mentioned above don't. For that reason, training with DDP currently fails when using `modules_to_save` or multiple adapters (see the sketch after this description).

Implementation

This turned out to be more complicated than I initially thought. The logic for setting `requires_grad` is spread all over the place; it was hard to encapsulate and I only partially succeeded. As is, this PR is more complex than the one it tries to supersede, #900, but it is also "more correct".

Tests were added to check that `requires_grad` is set correctly. There are (so far) no tests for whether DDP itself works; they could be added in a multi-GPU setup. I did, however, test an early stage of this PR with DDP, and setting `requires_grad` correctly does fix the DDP error.

DONE/TODO

- [x] ModulesToSaveWrapper
- [x] LoRA
- [ ] IA³
- [ ] AdaLora

Since some tuners are not implemented yet, tests are expected to fail. Check the new tests at the bottom of test_custom.py; those should pass.
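For illustration, here is a minimal, hypothetical sketch of the failure mode described above; it is not part of the PR. It only assumes the public `peft` API (`LoraConfig`, `get_peft_model`); the toy model and the layer names `lin0`/`lin1` are made up, and the printed values reflect the intended behaviour after this fix.

```python
# Illustrative sketch only: a toy model with one LoRA target and one module_to_save.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

base_model = nn.Sequential()
base_model.add_module("lin0", nn.Linear(10, 10))  # gets a LoRA adapter
base_model.add_module("lin1", nn.Linear(10, 2))   # wrapped in ModulesToSaveWrapper

config = LoraConfig(target_modules=["lin0"], modules_to_save=["lin1"])
model = get_peft_model(base_model, config)

# Before this PR, the frozen copy inside ModulesToSaveWrapper ("original_module")
# could keep requires_grad=True even though it is not used while an adapter is
# active. DistributedDataParallel expects every parameter with requires_grad=True
# to contribute to the loss, so such unused-but-trainable parameters make DDP
# error during backward. After this PR, only the active copy stays trainable:
for name, param in model.named_parameters():
    if "lin1" in name:
        print(name, param.requires_grad)

# Expected (roughly):
#   ...lin1.original_module.weight           False
#   ...lin1.modules_to_save.default.weight   True
```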
The documentation is not available anymore as the PR was closed or merged.
Thank you @BenjaminBossan for fixing this major bug when using DDP/Multiple Adapters with PEFT. LGTM! 🤗
Thanks a mile @BenjaminBossan !