Add support for bitsandbytes #15622

manuelciosici · 2022-02-11T13:01:13Z

What does this PR do?

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@stas00 @sgugger @TimDettmers

Status

Need to instrument CI to install bnb (binary package so a bit trickier than normal dependency)
I have implemented a CLI parameter to support bitsandbytes
I did not write any documentation yet
I followed @TimDettmers 's suggestion to override the embedding layers. However, I am unsure about a couple of things:
- Does the override need to happen before the model is loaded onto the GPU as the official documentation describes for other overrides?
- Are there any pitfalls to my current approach to identifying Embedding layers? It seems to work fine for RoBERTa and for GPT-2.
So far, I've used run_mlm.py and run_clm.py from the examples directory to check that the code runs. Using RTX A6000 GPUs, I see

Model	visible devices	optimizer	per device batch size	GPU memory
gpt2-large	0	adamw_torch	2	48638MiB / 49140MiB
gpt2-large	0	adamw_bnb	2	42412MiB / 49140MiB
gpt2-large	0,1	adamw_torch	1	30724MiB / 49140MiB 21040MiB / 49140MiB
gpt2-large	0,1	adamw_torch	2	OOM
gpt2-large	0,1	adamw_bnb	1	26820MiB / 49140MiB 21042MiB / 49140MiB
gpt2-large	0,1	adamw_bnb	2	44458MiB / 49140MiB 36906MiB / 49140MiB

HuggingFaceDocBuilder · 2022-02-11T13:01:36Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

sgugger

Thanks a lot for working on this! I've left a couple of comments.

src/transformers/trainer.py

src/transformers/training_args.py

tests/extended/test_trainer_ext.py

src/transformers/trainer.py

stas00

Great work, @manuelciosici!

Let's add an actual test to it before merging this.

src/transformers/trainer.py

tests/extended/test_trainer_ext.py

stas00 · 2022-02-11T20:06:03Z

I followed @TimDettmers 's #14819 (comment) to override the embedding layers. However, I am unsure about a couple of things:

Does the override need to happen before the model is loaded onto the GPU as the official documentation describes for other overrides?

This requirement would be a problem in some use-cases e.g. with Deepspeed ZeRO-3 which pre-loads the model directly on GPU during from_pretrained. But given that Deepspeed already shards the optim states over multiple-gpus it doesn't really need BNB. So we might need to instrument a counter-indication of BNB X DS-ZeRO3.

I'm not sure about other cases where a model ends up on GPU - if I'm not mistaken DS is the only one, @sgugger ?

Are there any pitfalls to my current approach to identifying Embedding layers? It seems to work fine for RoBERTa and for GPT-2.

Commented here:
#15622 (comment)

sgugger · 2022-02-11T20:19:21Z

Normally, when creating the optimizer, the model has been moved to the proper device already, except in the following cases:

model parallelism (it has been split across devices already)
deepseed and fairscale
evaluation-only in fp16/bf16 full eval.

stas00 · 2022-02-11T20:37:36Z

Normally, when creating the optimizer, the model has been moved to the proper device already, except in the following cases:
* model parallelism (it has been split across devices already)

so this is not an exception, as it's not on cpu, we just don't do it in the Trainer, but the modeling code does.

* deepseed and fairscale

yes, except deepspeed zero3 where it's already moved to gpu - we just don't do it in trainer.

* evaluation-only in fp16/bf16 full eval.

Check - but it's irrelevant to the optimizer.

So to summarize Sylvain's list of exceptions - in the general case the model should be already on GPU.

So we need to wait for Tim to let us know if that's a problem or whether it has a work around.

stas00 · 2022-03-13T18:04:29Z

@TimDettmers, if you get a chance could you please address some of the questions to you so that this PR can be unblocked and BNB integration added to the HF Trainer? Thank you!

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

TimDettmers · 2022-04-04T15:49:43Z

I followed @TimDettmers 's #14819 (comment) to override the embedding layers. However, I am unsure about a couple of things:
Does the override need to happen before the model is loaded onto the GPU as the official documentation describes for other overrides?

This requirement would be a problem in some use-cases e.g. with Deepspeed ZeRO-3 which pre-loads the model directly on GPU during from_pretrained. But given that Deepspeed already shards the optim states over multiple-gpus it doesn't really need BNB. So we might need to instrument a counter-indication of BNB X DS-ZeRO3.

I'm not sure about other cases where a model ends up on GPU - if I'm not mistaken DS is the only one, @sgugger ?

Are there any pitfalls to my current approach to identifying Embedding layers? It seems to work fine for RoBERTa and for GPT-2.

Commented here: #15622 (comment)

The new implementation of the override no longer depends on when the model is transferred to the GPU or when the override is registered. It takes the following signature:

GlobalOptimManager.get_instance().register_module_override(module, 'weight', {'optim_bits': 32})

where weight is the parameter name to override. In this case, one can use this on the embedding layer (skipping the positional embedding).

TimDettmers · 2022-04-04T15:50:48Z

Normally, when creating the optimizer, the model has been moved to the proper device already, except in the following cases:

model parallelism (it has been split across devices already)

deepseed and fairscale

evaluation-only in fp16/bf16 full eval.

These issues should be resolved with the new parameter override which is independent of when the parameters are transferred to the device.

tests/extended/test_trainer_ext.py

HuggingFaceDocBuilderDev · 2022-04-15T02:57:48Z

The documentation is not available anymore as the PR was closed or merged.

stas00

This looks great now.

Thank you for working on this, @manuelciosici!

And thank you @TimDettmers for supporting the sorting out process!

Let's ask @sgugger to have another look before we merge this.

sgugger

Thanks for all the work on this. It's almost ready to be merged, I just have a small request to replace everywhere is_bnb_available by is_bitsandbytes_available. Since we have a lot of thos eis_xxx_available and not all contributors might know this library, it will make it clearer to everyone what this is :-)

src/transformers/utils/import_utils.py

tests/extended/test_trainer_ext.py

manuelciosici · 2022-04-18T15:16:44Z

@stas00 I caught up with your work on testing and with @sgugger's subsequent requests. Is there anything else I should do on this PR for it to be ready to merge?

stas00 · 2022-04-19T17:49:40Z

Just waiting for @sgugger to have one last look after moving the require_* decorators to testing_utils.py and I think this is good to be merged.

sgugger

All good, thanks again for all the work on this!

* Add initial BNB integration * fixup! Add initial BNB integration * Add bnb test decorator * Update Adamw8bit option name * Use the full bnb package name * Overide bnb for all embedding layers * Fix package name * Formatting * Remove unnecessary import * Update src/transformers/trainer.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Rename AdamwBNB optimizer option * Add training test checking that bnb memory utilization is lower * fix merge * fix merge; fix + extend new test * cleanup * expand bnb * move all require_* candidates to testing_utils.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Stas Bekman <stas@stason.org>

younesbelkada · 2022-08-11T15:46:38Z

Hi there!
Jumping up on this PR after it has been merged
It appears that 2 tests of this PR are not passing:


     def test_bnb_adam8bit_no_bnb(self):
          args = TrainingArguments(optim=OptimizerNames.ADAMW_BNB, output_dir="None")
           # Pretend that bnb does not exist, even if installed. By setting bnb to None, importing
           # bnb will fail even if bnb is installed.
           with patch.dict("sys.modules", {"bnb.optim": None}):
              with self.assertRaises(ValueError):
                  Trainer.get_optimizer_cls_and_kwargs(args)

This test should be fixed in #18584 because of a very small typo, for the second test

test_run_seq2seq_bnb

I suspected it has been never run on our side since it is the only test that requires bitsandbytes on transformers have been always skipped because we used to not install bitsandbytes (check for eg this page tests/extended/test_trainer_ext.py::TestTrainerExt::test_run_seq2seq_bnb SKIPPED [ 21%] on our Docker image until the PR #17901
But I did not managed to reproduce the failing test (it is passing on my testing VM with the latest bitsandbytes + the one we use on the Docker image (bitsandbytes==0.31.5).

cc @TimDettmers @manuelciosici @stas00

stas00 · 2022-08-11T18:48:52Z

Thank you, @younesbelkada

@ydshieh, do you think it'd be OK to add bitsandbytes to the nightly tests workflow so that bnb tests run?

the installation is just cuda-version specific:

pip install bitsandbytes-cudaXXX

https://github.com/facebookresearch/bitsandbytes#requirements--installation

ydshieh · 2022-08-11T18:54:55Z

@stas00 I could add it and see how things go.

But @younesbelkada added it to the scheduled CI (which means to run on GPU) with

RUN python3 -m pip install -i https://test.pypi.org/simple/ bitsandbytes==0.31.5

I am a bit confused by why there was no cuda there.

younesbelkada · 2022-08-11T18:57:40Z

Thanks @stas00 and @ydshieh !
Adding the cuda version is not needed anymore and the facebook repo is not the one we have to refer to from now according to @TimDettmers

pip install bitsandbytes should be sufficient for now (I have to update the Dockerfile though)

younesbelkada · 2022-08-11T18:59:11Z

Here is the repo we have to refer to: https://github.com/TimDettmers/bitsandbytes

stas00 · 2022-08-11T19:03:42Z

oh, ok, I missed that you already added it, nothing to do then.

@TimDettmers, would it be possible to archive the original repo and post a link to the new repo on top of its README, since otherwise users will have no idea to use the new repo instead. thank you!

stas00 · 2022-08-11T19:05:30Z

Also note that we are linking to the old repo:

examples/research_projects/robust-speech-event/README.md:[bitsandbytes](https://github.com/facebookresearch/bitsandbytes) to replace the
docs/source/en/perf_train_gpu_one.mdx:- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/facebookresearch/bitsandbytes)
docs/source/en/perf_train_gpu_one.mdx:On the other hand [8bit BNB optimizer](https://github.com/facebookresearch/bitsandbytes) can save 3/4 of memory normally used by a typical AdamW optimizer if it is configured to quantize all optimizer states, but in some situations only some optimizer states are quintized and then more memory is used. XXX: update once  https://github.com/huggingface/transformers/pull/15622 is merged.
docs/source/en/perf_train_gpu_one.mdx:In contrast to the previous approaches is this one not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the Github [repo](https://github.com/facebookresearch/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.

@TimDettmers, should we fix those to point to the new repo instead?

ydshieh · 2022-08-18T08:21:50Z

But I did not managed to reproduce the failing test (it is passing on my testing VM with the latest bitsandbytes + the one we use on the Docker image (bitsandbytes==0.31.5).

Hi @younesbelkada Are you running inside docker container on a VM similar to CI runners (Nivida T4)?
Could you try to get the values gpu_peak_mem_orig and gpu_peak_mem_bnb?

younesbelkada · 2022-08-18T08:50:25Z

Hi @ydshieh ! I am running on a VM similar to CI runners, let me re try to reproduce as you suggested

younesbelkada · 2022-08-18T09:21:09Z

The test is passing on my VM.. for the VM I get:

gpu_peak_mem_orig
243490816
gpu_peak_mem_bnb
8053248

But I re-ran the test with CUDA_VISIBLE_DEVICES=0 (running the test on a single GPU) and the test failed. Maybe the test was designed on a multi-GPU setup. Do you think that we should just add the decorator @require_torch_multigpu for this tests?

EDIT: I saw that even on multi-gpu the test was failing on the docker container

younesbelkada · 2022-08-18T09:23:55Z

In a single GPU setup:

gpu_peak_mem_bnb
510833664
gpu_peak_mem_orig
509707264

manuelciosici added 8 commits February 3, 2022 06:43

Add initial BNB integration

3696f64

fixup! Add initial BNB integration

ba8790c

Add bnb test decorator

16df6c8

Update Adamw8bit option name

97bd33b

Use the full bnb package name

22abb9c

Overide bnb for all embedding layers

226b3dd

Fix package name

1d03d49

Formatting

22d7112

sgugger reviewed Feb 11, 2022

View reviewed changes

src/transformers/trainer.py Outdated Show resolved Hide resolved

src/transformers/trainer.py Outdated Show resolved Hide resolved

src/transformers/training_args.py Outdated Show resolved Hide resolved

tests/extended/test_trainer_ext.py Outdated Show resolved Hide resolved

Remove unnecessary import

b048a34

stas00 reviewed Feb 11, 2022

View reviewed changes

src/transformers/trainer.py Show resolved Hide resolved

stas00 suggested changes Feb 11, 2022

View reviewed changes

src/transformers/trainer.py Outdated Show resolved Hide resolved

tests/extended/test_trainer_ext.py Outdated Show resolved Hide resolved

tests/extended/test_trainer_ext.py Outdated Show resolved Hide resolved

stas00 added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label Mar 13, 2022

huggingface deleted a comment from github-actions bot Mar 13, 2022

manuelciosici and others added 2 commits March 30, 2022 13:53

Update src/transformers/trainer.py

2c04486

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

Rename AdamwBNB optimizer option

b7a1c0c

Add training test checking that bnb memory utilization is lower

8cb259b

ManuchehrBoudesh mentioned this pull request Apr 8, 2022

add Bigbird ONNX config #16427

Merged

stas00 reviewed Apr 8, 2022

View reviewed changes

tests/extended/test_trainer_ext.py Outdated Show resolved Hide resolved

stas00 reviewed Apr 9, 2022

View reviewed changes

tests/extended/test_trainer_ext.py Outdated Show resolved Hide resolved

stas00 added 2 commits April 14, 2022 19:32

Merge remote-tracking branch 'origin/main' into 14819-integrate-bnb

7a1c7b7

fix merge

e8bf8d0

stas00 added 2 commits April 14, 2022 20:25

fix merge; fix + extend new test

37044b5

cleanup

2e36019

stas00 approved these changes Apr 15, 2022

View reviewed changes

stas00 requested a review from sgugger April 15, 2022 03:52

stas00 marked this pull request as ready for review April 15, 2022 03:55

sgugger approved these changes Apr 15, 2022

View reviewed changes

src/transformers/utils/import_utils.py Outdated Show resolved Hide resolved

tests/extended/test_trainer_ext.py Outdated Show resolved Hide resolved

tests/extended/test_trainer_ext.py Show resolved Hide resolved

expand bnb

dbdaf25

move all require_* candidates to testing_utils.py

ce2c550

sgugger approved these changes Apr 19, 2022

View reviewed changes

sgugger merged commit 3104036 into huggingface:main Apr 19, 2022

younesbelkada mentioned this pull request Aug 11, 2022

[bnb] Fix non passing trainer tests #18584

Merged

stas00 mentioned this pull request Mar 11, 2023

Finetune T5 11B and the process is killed . exits with return code = -9[BUG] microsoft/DeepSpeed#2946

Closed

veezbo mentioned this pull request Aug 29, 2023

Modify efficient GPU training doc with now-available adamw_bnb_8bit optimizer #25807

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for bitsandbytes #15622

Add support for bitsandbytes #15622

manuelciosici commented Feb 11, 2022 •

edited by stas00

Loading

HuggingFaceDocBuilder commented Feb 11, 2022

sgugger left a comment

stas00 left a comment

stas00 commented Feb 11, 2022 •

edited

Loading

sgugger commented Feb 11, 2022

stas00 commented Feb 11, 2022 •

edited

Loading

stas00 commented Mar 13, 2022

TimDettmers commented Apr 4, 2022

TimDettmers commented Apr 4, 2022

HuggingFaceDocBuilderDev commented Apr 15, 2022 •

edited

Loading

stas00 left a comment •

edited

Loading

sgugger left a comment

manuelciosici commented Apr 18, 2022

stas00 commented Apr 19, 2022

sgugger left a comment

younesbelkada commented Aug 11, 2022 •

edited

Loading

stas00 commented Aug 11, 2022 •

edited

Loading

ydshieh commented Aug 11, 2022

younesbelkada commented Aug 11, 2022

younesbelkada commented Aug 11, 2022

stas00 commented Aug 11, 2022 •

edited

Loading

stas00 commented Aug 11, 2022

ydshieh commented Aug 18, 2022 •

edited

Loading

younesbelkada commented Aug 18, 2022 •

edited

Loading

younesbelkada commented Aug 18, 2022 •

edited

Loading

younesbelkada commented Aug 18, 2022

Add support for bitsandbytes #15622

Add support for bitsandbytes #15622

Conversation

manuelciosici commented Feb 11, 2022 • edited by stas00 Loading

What does this PR do?

Before submitting

Who can review?

Status

HuggingFaceDocBuilder commented Feb 11, 2022

sgugger left a comment

Choose a reason for hiding this comment

stas00 left a comment

Choose a reason for hiding this comment

stas00 commented Feb 11, 2022 • edited Loading

sgugger commented Feb 11, 2022

stas00 commented Feb 11, 2022 • edited Loading

stas00 commented Mar 13, 2022

TimDettmers commented Apr 4, 2022

TimDettmers commented Apr 4, 2022

HuggingFaceDocBuilderDev commented Apr 15, 2022 • edited Loading

stas00 left a comment • edited Loading

Choose a reason for hiding this comment

sgugger left a comment

Choose a reason for hiding this comment

manuelciosici commented Apr 18, 2022

stas00 commented Apr 19, 2022

sgugger left a comment

Choose a reason for hiding this comment

younesbelkada commented Aug 11, 2022 • edited Loading

stas00 commented Aug 11, 2022 • edited Loading

ydshieh commented Aug 11, 2022

younesbelkada commented Aug 11, 2022

younesbelkada commented Aug 11, 2022

stas00 commented Aug 11, 2022 • edited Loading

stas00 commented Aug 11, 2022

ydshieh commented Aug 18, 2022 • edited Loading

younesbelkada commented Aug 18, 2022 • edited Loading

younesbelkada commented Aug 18, 2022 • edited Loading

younesbelkada commented Aug 18, 2022

manuelciosici commented Feb 11, 2022 •

edited by stas00

Loading

stas00 commented Feb 11, 2022 •

edited

Loading

stas00 commented Feb 11, 2022 •

edited

Loading

HuggingFaceDocBuilderDev commented Apr 15, 2022 •

edited

Loading

stas00 left a comment •

edited

Loading

younesbelkada commented Aug 11, 2022 •

edited

Loading

stas00 commented Aug 11, 2022 •

edited

Loading

stas00 commented Aug 11, 2022 •

edited

Loading

ydshieh commented Aug 18, 2022 •

edited

Loading

younesbelkada commented Aug 18, 2022 •

edited

Loading

younesbelkada commented Aug 18, 2022 •

edited

Loading