
QLoRA / bnb.nf4 quantization causes issues in recent PyTorch Lightning/Fabric versions #1604

Closed
rasbt opened this issue Jul 19, 2024 · 13 comments · Fixed by #1640
Labels
bug Something isn't working

Comments


rasbt commented Jul 19, 2024

Bug description

Either I'm doing something wrong, or QLoRA is broken. I tried it with different models:

LoRA (fine)

gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/lora.yaml       
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.03847,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7fae9a9a2140>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.1,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/lora-gemma-2b'),
 'precision': 'bf16-true',
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=800,
                    log_interval=1,
                    global_batch_size=6,
                    micro_batch_size=2,
                    lr_warmup_steps=200,
                    lr_warmup_fraction=None,
                    epochs=2,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=512,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 11,870,208
Number of non-trainable parameters: 3,030,460,416
The longest sequence length in the train data is 512, the model's maximum sequence length is 512 and context length is 4096
Verifying settings ...
Missing logger folder: /teamspace/studios/this_studio/out/finetune/lora-gemma-2b/logs/csv
Epoch 1 | iter 1 step 0 | loss train: 115.482, val: n/a | iter time: 753.85 ms
Epoch 1 | iter 2 step 0 | loss train: 106.427, val: n/a | iter time: 381.31 ms
Epoch 1 | iter 3 step 1 | loss train: 101.139, val: n/a | iter time: 351.09 ms (step)
Epoch 1 | iter 4 step 1 | loss train: 95.109, val: n/a | iter time: 167.29 ms
Epoch 1 | iter 5 step 1 | loss train: 98.440, val: n/a | iter time: 121.49 ms
Epoch 1 | iter 6 step 2 | loss train: 104.927, val: n/a | iter time: 182.25 ms (step)

QLoRA from config file (not fine)

gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/qlora.yaml 
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.03847,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7f4ae444efb0>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.1,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 16,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/qlora-gemma-2b'),
 'precision': 'bf16-true',
 'quantize': 'bnb.nf4',
 'seed': 1337,
 'train': TrainArgs(save_interval=800,
                    log_interval=1,
                    global_batch_size=6,
                    micro_batch_size=2,
                    lr_warmup_steps=200,
                    lr_warmup_fraction=None,
                    epochs=2,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=512,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 23,740,416
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
    load_checkpoint(fabric, model, checkpoint_path, strict=False)
  File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
    model.load_state_dict(state_dict, strict=strict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
    return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
    hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
    return self.hook(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
    quantize_fn(weight)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
    if weight.data.dtype == torch.uint8:
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'
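The failure mode in the traceback above can be reproduced with a minimal, self-contained sketch (no torch required): Fabric's lazy-loading proxy forwards only a subset of tensor attributes, and `.data` is apparently not among them in the affected versions, so the bitsandbytes load hook raises before the tensor is ever materialized. All names below are simplified stand-ins, not the real Lightning classes.

```python
class NotYetLoadedTensor:
    """Simplified stand-in for Fabric's `_NotYetLoadedTensor` lazy proxy."""

    def __getattr__(self, name):
        # The real proxy raises for attributes it does not forward;
        # accessing `.data` hits exactly this path.
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")


def quantize_on_load(weight):
    # Mirrors the shape of the check in the bitsandbytes precision plugin's
    # load hook: it inspects `weight.data.dtype`, which the lazy proxy
    # cannot answer (dtype is a string here only to avoid a torch dependency).
    if weight.data.dtype == "uint8":
        return "already quantized"
    return "quantize now"


try:
    quantize_on_load(NotYetLoadedTensor())
except AttributeError as err:
    # Reproduces the same class of error as the traceback above.
    print(err)
```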

QLoRA without config file

gemma_2 ~/litgpt litgpt finetune_lora checkpoints/google/gemma-2b  --devices 1 --quantize bnb.nf4 --precision bf16-true
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': None,
 'devices': 1,
 'eval': EvalArgs(interval=100,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': 'bf16-true',
 'quantize': 'bnb.nf4',
 'seed': 1337,
 'train': TrainArgs(save_interval=1000,
                    log_interval=1,
                    global_batch_size=16,
                    micro_batch_size=1,
                    lr_warmup_steps=100,
                    lr_warmup_fraction=None,
                    epochs=5,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=None,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 921,600
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
    load_checkpoint(fabric, model, checkpoint_path, strict=False)
  File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
    model.load_state_dict(state_dict, strict=strict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
    return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
    hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
    return self.hook(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
    quantize_fn(weight)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
    if weight.data.dtype == torch.uint8:
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'

What operating system are you using?

Unknown

LitGPT Version

litgpt 0.4.5 (Gemma 2 branch)
rasbt added the bug label Jul 19, 2024

rasbt commented Jul 19, 2024

Not related to the Gemma 2 branch, also occurs in main.


rasbt commented Jul 19, 2024

Doesn't seem to be related to the bitsandbytes and lightning/fabric versions (the issue also occurs with bnb 0.41.3 and lightning 0.2.2). Maybe something in LitGPT has changed.


Andrei-Aksionov commented Jul 20, 2024

Not only QLoRA.
I tried to simply generate/chat in a new studio, fresh venv, code from master, pythia-1b model.
The same error occurs when quantization is applied.


rasbt commented Jul 20, 2024

I am not sure what's changed that could be causing this; we have bitsandbytes and lightning/fabric pinned.

Andrei-Aksionov (Collaborator) commented:

It's caused by PyTorch-Lightning.
Try:

pip install lightning==2.3.0.dev20240428 

which is the version the repo used previously.

Andrei-Aksionov (Collaborator) commented:

This kind of issue needs to be caught by tests.


rasbt commented Jul 20, 2024

Ohhh, so basically #1579. We can revert to an older version, but the question is whether there's something that needs to be updated in PyTorch-Lightning (in case this was an accidental change) or LitGPT (so that we can support newer PTL versions moving forward).
Would appreciate your thoughts here @awaelchli


rasbt commented Jul 20, 2024

Added a quick PR (#1605) to add a test and revert the lightning version until we have more time to investigate.

rasbt closed this as completed Jul 23, 2024
rasbt changed the title from "QLoRA seems to be broken" to "QLoRA / bnb.nf4 quantization causes issues in recent PyTorch Lightning/Fabric versions" Jul 23, 2024
rasbt reopened this Jul 23, 2024
awaelchli (Contributor) commented:

It's not really fixed. Downgrading the version avoids the problem, but isn't it conceivable that at some point LitGPT will want to support newer versions of Lightning? What happens then?

I think in such situations we should at least open a ticket on the library in question (Lightning in this case). The stack trace also hints at bitsandbytes being involved, so we would need to collect the bnb version used as well. These are all essential steps that would help us resolve such issues efficiently.
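One way to gather the version information a report like this needs is a small helper along these lines (a sketch, not part of litgpt) that tolerates packages missing from the current environment:

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return 'pkg==x.y.z' lines, or a note when a package is absent."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            # Report absence instead of crashing, so the output is
            # always complete enough to paste into an issue.
            lines.append(f"{pkg} (not installed)")
    return lines


for line in report_versions(["lightning", "bitsandbytes", "torch", "litgpt"]):
    print(line)
```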


rasbt commented Jul 23, 2024

Yes, I just realized this too and reopened a few seconds before you posted. Let me prepare an issue for the PyTorch Lightning issue tracker.


rasbt commented Jul 23, 2024

See issue: Lightning-AI/pytorch-lightning#20119

awaelchli (Contributor) commented:

With the fix Lightning-AI/pytorch-lightning#20121 you can try updating the lightning package to the nightly build produced next Sunday, or wait until the next regular release is done.


rasbt commented Jul 24, 2024

Sounds great, thanks. I will make a reminder to test this on Sunday/Monday!
