In this example, we'll see how to use PEFT to perform supervised fine-tuning (SFT) on various distributed setups.
QLoRA uses 4-bit quantization of the base model to drastically reduce the GPU memory consumed by the base model while using LoRA for parameter-efficient fine-tuning. The command to use QLoRA is present at run_peft.sh.
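To make the memory saving concrete, here is a rough back-of-the-envelope sketch rather than a measurement. It counts only the bytes needed to store the base model weights and ignores activations, optimizer state, the LoRA adapter weights, and the small overhead of quantization constants. The 7-billion parameter count is an illustrative assumption, not a value taken from run_peft.sh.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Hypothetical model size, chosen only for illustration.
num_params = 7_000_000_000

fp16_gib = weight_memory_gib(num_params, 16)  # 16-bit (fp16/bf16) weights
int4_gib = weight_memory_gib(num_params, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
print(f"reduction factor: {fp16_gib / int4_gib:.0f}x")
```

Real usage will be higher than the 4-bit figure, since training also needs memory for activations, gradients of the LoRA parameters, and optimizer state.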
Note:
- At present, use_reentrant needs to be True when using gradient checkpointing with QLoRA, else QLoRA leads to high GPU memory consumption.
Unsloth enables finetuning Mistral/Llama 2-5x faster with 70% less memory. It achieves this by reducing data upcasting, using Flash Attention 2, using custom Triton kernels for RoPE embeddings, RMS Layernorm and Cross Entropy Loss, and using manually optimized autograd computation to reduce the FLOPs during QLoRA finetuning. Below is the list of the optimizations from the Unsloth blogpost mistral-benchmark. The command to use QLoRA with Unsloth is present at run_unsloth_peft.sh.
Figure: Optimization in Unsloth to speed up QLoRA finetuning while reducing GPU memory usage
To speed up QLoRA finetuning when you have access to multiple GPUs, look at the launch command at run_peft_multigpu.sh. This example performs DDP on 8 GPUs.
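With DDP, every GPU processes its own mini-batch on each step, so the number of samples contributing to one optimizer update grows with the number of GPUs. The sketch below shows that arithmetic. Only the GPU count of 8 comes from this example; the per-device batch size and the gradient accumulation steps are hypothetical values, not the ones used in run_peft_multigpu.sh.

```python
def effective_batch_size(per_device_batch_size: int,
                         num_gpus: int,
                         grad_accum_steps: int = 1) -> int:
    """Samples contributing to a single optimizer update under DDP."""
    return per_device_batch_size * num_gpus * grad_accum_steps


num_gpus = 8               # matches the 8-GPU DDP example
per_device_batch_size = 4  # hypothetical value
grad_accum_steps = 2       # hypothetical value

global_batch = effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps)
print(f"effective batch size: {global_batch}")
```

When moving a single-GPU configuration to multiple GPUs, it may be worth revisiting the learning rate or the accumulation steps, since the effective batch size changes.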
Note:
- At present, use_reentrant needs to be False when using gradient checkpointing with Multi-GPU QLoRA, else it will lead to errors. However, this leads to huge GPU memory consumption.
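The two notes on use_reentrant can be summarised in a small helper. This is only a sketch: the function name and the multi_gpu flag are hypothetical, and the returned keys are assumed to be forwarded to transformers.TrainingArguments, which accepts gradient_checkpointing and gradient_checkpointing_kwargs. How the launch scripts themselves expose this setting is not covered here.

```python
def gradient_checkpointing_settings(multi_gpu: bool) -> dict:
    """Return gradient checkpointing settings that follow the notes above."""
    # Single-GPU QLoRA: use_reentrant=True avoids high GPU memory consumption.
    # Multi-GPU (DDP) QLoRA: use_reentrant=False is needed to avoid errors,
    # at the cost of much higher GPU memory consumption.
    return {
        "gradient_checkpointing": True,
        "gradient_checkpointing_kwargs": {"use_reentrant": not multi_gpu},
    }


single = gradient_checkpointing_settings(multi_gpu=False)
multi = gradient_checkpointing_settings(multi_gpu=True)
print("single GPU:", single)
print("multi GPU: ", multi)
```

Keeping the choice in one place makes it harder to launch a multi-GPU run with the single-GPU setting by accident.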
When you have access to multiple GPUs, it would be better to use normal LoRA with DeepSpeed/FSDP. To use LoRA with DeepSpeed, refer to the docs at PEFT with DeepSpeed. To use LoRA with FSDP, refer to the docs at PEFT with FSDP.
Generally try to upgrade to the latest package versions for best results, especially when it comes to bitsandbytes, accelerate, transformers, trl, and peft.
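Before upgrading, it can help to see which versions are currently installed. The following sketch uses only the Python standard library (importlib.metadata); the helper name installed_versions is hypothetical. It reports a package as missing instead of raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["bitsandbytes", "accelerate", "transformers", "trl", "peft"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Including this output when reporting an issue makes version mismatches easier to spot.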