
QLoRA / bnb.nf4 quantization causes issues in recent PyTorch Lightning/Fabric versions #1604

Closed
rasbt opened this issue Jul 19, 2024 · 13 comments · Fixed by #1640
Labels
bug Something isn't working

Comments


rasbt commented Jul 19, 2024

Bug description

Either I'm doing something wrong, or QLoRA is broken. I tried it with different models:

LoRA (fine)

gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/lora.yaml       
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.03847,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7fae9a9a2140>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.1,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/lora-gemma-2b'),
 'precision': 'bf16-true',
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=800,
                    log_interval=1,
                    global_batch_size=6,
                    micro_batch_size=2,
                    lr_warmup_steps=200,
                    lr_warmup_fraction=None,
                    epochs=2,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=512,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 11,870,208
Number of non-trainable parameters: 3,030,460,416
The longest sequence length in the train data is 512, the model's maximum sequence length is 512 and context length is 4096
Verifying settings ...
Missing logger folder: /teamspace/studios/this_studio/out/finetune/lora-gemma-2b/logs/csv
Epoch 1 | iter 1 step 0 | loss train: 115.482, val: n/a | iter time: 753.85 ms
Epoch 1 | iter 2 step 0 | loss train: 106.427, val: n/a | iter time: 381.31 ms
Epoch 1 | iter 3 step 1 | loss train: 101.139, val: n/a | iter time: 351.09 ms (step)
Epoch 1 | iter 4 step 1 | loss train: 95.109, val: n/a | iter time: 167.29 ms
Epoch 1 | iter 5 step 1 | loss train: 98.440, val: n/a | iter time: 121.49 ms
Epoch 1 | iter 6 step 2 | loss train: 104.927, val: n/a | iter time: 182.25 ms (step)

QLoRA from config file (not fine)

gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/qlora.yaml 
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.03847,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7f4ae444efb0>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.1,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 16,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/qlora-gemma-2b'),
 'precision': 'bf16-true',
 'quantize': 'bnb.nf4',
 'seed': 1337,
 'train': TrainArgs(save_interval=800,
                    log_interval=1,
                    global_batch_size=6,
                    micro_batch_size=2,
                    lr_warmup_steps=200,
                    lr_warmup_fraction=None,
                    epochs=2,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=512,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 23,740,416
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
    load_checkpoint(fabric, model, checkpoint_path, strict=False)
  File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
    model.load_state_dict(state_dict, strict=strict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
    return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
    hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
    return self.hook(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
    quantize_fn(weight)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
    if weight.data.dtype == torch.uint8:
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'
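The failure mode in the traceback above can be reproduced with a minimal, self-contained sketch (no torch required): Fabric's lazy-loading proxy forwards only a subset of tensor attributes, and `.data` is apparently not among them in the affected versions, so the bitsandbytes load hook raises before the tensor is ever materialized. All names below are simplified stand-ins, not the real Lightning classes.

```python
class NotYetLoadedTensor:
    """Simplified stand-in for Fabric's `_NotYetLoadedTensor` lazy proxy."""

    def __getattr__(self, name):
        # The real proxy raises for attributes it does not forward;
        # accessing `.data` hits exactly this path.
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")


def quantize_on_load(weight):
    # Mirrors the shape of the check in the bitsandbytes precision plugin's
    # load hook: it inspects `weight.data.dtype`, which the lazy proxy
    # cannot answer (dtype is a string here only to avoid a torch dependency).
    if weight.data.dtype == "uint8":
        return "already quantized"
    return "quantize now"


try:
    quantize_on_load(NotYetLoadedTensor())
except AttributeError as err:
    # Reproduces the same class of error as the traceback above.
    print(err)
```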

QLoRA without config file

gemma_2 ~/litgpt litgpt finetune_lora checkpoints/google/gemma-2b  --devices 1 --quantize bnb.nf4 --precision bf16-true
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': None,
 'devices': 1,
 'eval': EvalArgs(interval=100,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': 'bf16-true',
 'quantize': 'bnb.nf4',
 'seed': 1337,
 'train': TrainArgs(save_interval=1000,
                    log_interval=1,
                    global_batch_size=16,
                    micro_batch_size=1,
                    lr_warmup_steps=100,
                    lr_warmup_fraction=None,
                    epochs=5,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=None,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 921,600
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
    load_checkpoint(fabric, model, checkpoint_path, strict=False)
  File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
    model.load_state_dict(state_dict, strict=strict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
    return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
    hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
    return self.hook(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
    quantize_fn(weight)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
    if weight.data.dtype == torch.uint8:
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'

What operating system are you using?

Unknown

LitGPT Version

litgpt 0.4.5 (Gemma 2 branch)
rasbt added the bug label Jul 19, 2024

rasbt commented Jul 19, 2024

Not related to the Gemma 2 branch, also occurs in main.


rasbt commented Jul 19, 2024

Doesn't seem to be related to the bitsandbytes and lightning/fabric versions (the issue also occurs with bnb 0.41.3 and lightning 0.2.2). Maybe something in LitGPT has changed.


Andrei-Aksionov commented Jul 20, 2024

Not only QLoRA.
I tried to simply generate/chat in a new studio, fresh venv, code from master, pythia-1b model.
The same error occurs when quantization is applied.


rasbt commented Jul 20, 2024

I am not sure what's changed that could be causing this; we have bitsandbytes and lightning/fabric pinned.

Andrei-Aksionov (Collaborator) commented:

It's caused by PyTorch-Lightning.
Try:

pip install lightning==2.3.0.dev20240428 

which is the version the repo used previously.

Andrei-Aksionov (Collaborator) commented:

This kind of issue needs to be caught by tests.


rasbt commented Jul 20, 2024

Ohhh, so basically #1579. We can revert to an older version, but the question is whether there's something that needs to be updated in PyTorch-Lightning (in case this was an accidental change) or LitGPT (so that we can support newer PTL versions moving forward).
Would appreciate your thoughts here @awaelchli


rasbt commented Jul 20, 2024

Added a quick PR (#1605) to add a test and revert the lightning version until we have more time to investigate.

rasbt closed this as completed Jul 23, 2024
rasbt changed the title from "QLoRA seems to be broken" to "QLoRA / bnb.nf4 quantization causes issues in recent PyTorch Lightning/Fabric versions" Jul 23, 2024
rasbt reopened this Jul 23, 2024
awaelchli (Contributor) commented:

It's not really fixed. Downgrading the version avoids the problem, but isn't it conceivable that at some point LitGPT will want to support newer versions of Lightning? What happens then?

I think in such situations we should at least open a ticket on the library in question (Lightning in this case). The stack trace also hints at bitsandbytes being involved, so we would need to collect the bnb version used as well. These are all essential steps that would help us resolve such issues efficiently.
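One way to gather the version information a report like this needs is a small helper along these lines (a sketch, not part of litgpt) that tolerates packages missing from the current environment:

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return 'pkg==x.y.z' lines, or a note when a package is absent."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            # Report absence instead of crashing, so the output is
            # always complete enough to paste into an issue.
            lines.append(f"{pkg} (not installed)")
    return lines


for line in report_versions(["lightning", "bitsandbytes", "torch", "litgpt"]):
    print(line)
```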


rasbt commented Jul 23, 2024

Yes, I just realized this too and reopened a few seconds before you posted. Let me prepare an issue for the PyTorch Lightning issue tracker.


rasbt commented Jul 23, 2024

See issue: Lightning-AI/pytorch-lightning#20119

awaelchli (Contributor) commented:

With the fix Lightning-AI/pytorch-lightning#20121 you can try updating the lightning package to the nightly build produced next Sunday, or wait until the next regular release is done.


rasbt commented Jul 24, 2024

Sounds great, thanks. I will make a reminder to test this on Sunday/Monday!
