Skip to content

Releases: unslothai/unsloth

Llama 3.3 + Dynamic 4bit Quants

04 Dec 13:59
9dc399a
Compare
Choose a tag to compare

We provide dynamic 4bit quants which uses a bit more memory, but vastly improves accuracy for finetuning and inference. Unsloth will now default to these versions! See https://unsloth.ai/blog/dynamic-4bit for more details.

Llama 3.3 is out now! Read our blog: https://unsloth.ai/blog/llama3-3

  • You can now fine-tune Llama 3.3 (70B) up to 90,000 context lengths with Unsloth, which is 13x longer than what Hugging Face + FA2 supports at 6,900 on a 80GB GPU.
  • For Llama 3.1 (8B), Unsloth can now do a whopping 342,000 context length, which exceeds the 128K context lengths Llama 3.1 natively supported. HF + FA2 can only do 28,000 on a 80GB GPU, so Unsloth supports 12x context lengths.
  • 70B models can now fit on 41GB of VRAM - nearly 40GB!

All notebooks now use these dynamic quants:

Experiments

Train
Quantizing Qwen2-VL-2B Instruct down to 4 bits breaks the model entirely.

Qwen2-VL-2B-Instruct Description Size Result
16bit The image shows a train traveling on tracks. 4.11GB
Default 4bit all layers The image depicts a vibrant and colorful scene of a coastal area. 1.36GB
Unsloth quant The image shows a train traveling on tracks. 1.81GB

Merging to 16bit now works as expected.

Fixed a major bug which caused merges to not function correctly for vision models.

Llama.cpp GGUF saving now uses cmake.

All saving modules are also updated inside of Unsloth!

Apple Cut Cross Entropy

We worked with Apple to add Cut Cross Entropy into Unsloth which reduces VRAM use and increase context length further.

QwQ 4bit quants and GGUFs

Try a O1 test time compute LLM out! See https://huggingface.co/unsloth

What's Changed

Full Changelog: November-2024...December-2024

Vision finetuning

21 Nov 17:55
Compare
Choose a tag to compare
  • We support Llama 3.2 Vision 11B, 90B; Pixtral; Qwen2VL 2B, 7B, 72B; and any Llava variants like Llava NeXT!
  • We support 16bit LoRA or 4bit QLoRA. Both are accelerated and use much less memory!
  • Llama 3.2 Vision finetuning - Radiography use case. Free Colab Kaggle Notebook
  • Qwen 2 VL Vision finetuning - Maths OCR to LaTeX. Free Colab Kaggle Notebook
  • Pixtral 12B Vision finetuning - General QA datasets. Free Colab
  • Please run pip install --upgrade --no-cache-dir unsloth unsloth_zoo
from unsloth import FastVisionModel # NEW instead of FastLanguageModel
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True, # Use 4bit quantization to reduce memory usage. Can be False.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision part
    finetune_language_layers   = True, # False if not finetuning language part
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

from datasets import load_dataset
dataset = load_dataset("unsloth/llava-instruct-mix-vsft-mini", split = "train")
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1, # Reduce to 1 to make Pixtral fit!
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,
        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # For Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)
trainer_stats = trainer.train()

After finetuning, you can also do inference:

FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["images"][0]
instruction = "Is there something interesting about this image?"

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

We also support merging QLoRA / LoRA directly into 16bit weights for serving:

# Select ONLY 1 to save! (Both not needed!)

# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", tokenizer,)

# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", tokenizer, token = "PUT_HERE")

What's Changed

New Contributors

Full Changelog: September-2024...November-2024

Gradient Accumulation Fix

15 Oct 16:48
38663b0
Compare
Choose a tag to compare

We fixed a gradient accumulation bug which was actually discovered since 2021 here, and rediscovered here. Read more in our blog post: https://unsloth.ai/blog/gradient

We have a Colab Notebook for Llama 3.2 using the fixed trainer and a Kaggle Notebook as well.

Essentially theoretically bsz * ga should be equivalent to full batch training with no gradient accumulation, but weirdly the training losses do no match up:

We fixed it in Unsloth!

To use Unsloth's fixed trainer with gradient accumulation, use:

from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy if using gradient accumulation
trainer_stats = unsloth_train(trainer) # << Fixed gradient accumulation

Please update Unsloth on local machines (no need for Colab / Kaggle) via:

pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Read our blog post: https://unsloth.ai/blog/gradient for more details!

What's Changed

New Contributors

Full Changelog: September-2024...October-2024

Qwen 2.5 Support

23 Sep 21:32
Compare
Choose a tag to compare

Qwen 2.5 Support is here!

There are some issues with Qwen 2.5 models which Unsloth has fixed!

EOS token issues

Qwen 2.5 Base models (0.5b all the way until 72b) - EOS token should be <|endoftext|> not <|im_end|>. The base models <|im_end|> is actually untrained, so it'll cause NaN gradients if you use it. You should re-pull the tokenizer from source, or you can download fixed base models from https://huggingface.co/unsloth if that helps.

Chat template issues

  • Qwen 2.5 Base models should NOT have a chat_template, this will actually cause errors especially in Unsloth's finetuning notebooks, since I check if untrained tokens exist in the chat template to counteract NaN gradients.
  • Do NOT use Qwen 2.5's chat template for the base models. This will cause NaN gradients!

4bit uploaded models

Qwen 2.5 0.5b 4bit 0.5b Instruct 0.5b 4bit Instruct 0.5b
Qwen 2.5 1.5b 4bit 1.5b Instruct 1.5b 4bit Instruct 1.5b
Qwen 2.5 3b 4bit 3b Instruct 3b 4bit Instruct 3b
Qwen 2.5 7b 4bit 7b Instruct 7b 4bit Instruct 7b
Qwen 2.5 14b 4bit 14b Instruct 14b 4bit Instruct 14b
Qwen 2.5 32b 4bit 32b Instruct 32b 4bit Instruct 32b
Qwen 2.5 72b 4bit 72b Instruct 72b 4bit Instruct 72b

What's Changed

New Contributors

Full Changelog: August-2024...September-2024

Phi 3.5

21 Aug 01:08
Compare
Choose a tag to compare

Phi 3.5 is here!

Try it out here: https://colab.research.google.com/drive/1lN6hPQveB_mHSnTOYifygFcrO8C1bxq4?usp=sharing

What's Changed

New Contributors

Full Changelog: July-Mistral-2024...August-2024

Llama 3.1 Support

23 Jul 20:42
Compare
Choose a tag to compare

Llama 3.1 Support

Excited to announce Unsloth makes finetuning Llama 3.1 2.1x faster and use 60% less VRAM! Read up on our release here: https://unsloth.ai/blog/llama3-1
image

We uploaded a Google Colab notebook to finetune Llama 3.1 (8B) on a free Tesla T4: Llama 3.1 (8B) Notebook. We also have a new UI on Google Colab for chatting with your Llama 3.1 Instruct models which uses our own 2x faster inference engine.

Run UI Preview

unsloth_chat_ui_cHN5s0tryafdUM6nzXjgf
We created a new chat UI using Gradio where users can upload and chat with their Llama 3.1 Instruct models online for free on Google Colab.

We uploaded 4bit bitsandbytes quants here: https://huggingface.co/unsloth
To finetune Llama 3.1, please update Unsloth:

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

July-Mistral-2024

19 Jul 15:37
Compare
Choose a tag to compare

Mistral NeMo, Ollama & CSV support

See https://unsloth.ai/blog/mistral-nemo for more details. 4 bit pre-quantized weights at https://huggingface.co/unsloth

2x faster 60% less VRAM Colab finetuning notebook here and also our Kaggle notebook is here

image

Export to Ollama & CSV Support

To use, create and customize your chat template with a dataset and Unsloth will automatically export the finetune to Ollama with automatic Modelfile creation. We also created a 'Step-by-Step Tutorial on How to Finetune Llama-3 and Deploy to Ollama'. Check out our Ollama Llama-3 Alpaca and CSV/Excel Ollama Guide notebooks.

Unlike regular chat templates that use 3 columns, Ollama simplifies the process with just 2 columns: instruction and output. And with Ollama, you can save, run, and deploy your finetuned models locally on your own device.
image
image

Train on Completions / Inputs

We now support training only on the output tokens and not the inputs, which can increase accuracy. Try it with:

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    ...
    args = TrainingArguments(
        ...
    ),
)
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(trainer)

RoPE Scaling for all models

We now allow you to finetune Gemma 2, Mistral, Mistral NeMo, Qwen2 and more models with “unlimited” context lengths through RoPE linear scaling through Unsloth. Coupled with our 4x longer context support, Unsloth can do extremely long context support!

New Docs!

Introducing our new Documentation site which has all the most important info about Unsloth in one place. If you'd like to contribute, please contact us! Docs: https://docs.unsloth.ai/
image

Update instructions

Please update Unsloth in local machines (Colab and Kaggle just refresh and reload notebooks) via:

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

2x faster Gemma 2

03 Jul 22:02
5ab565f
Compare
Choose a tag to compare

Gemma 2 support

We now support Gemma 2! It's 2x faster and uses 63% less VRAM than HF+FA2!
image
We have a Gemma 2 9b notebook here: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing

To use Gemma 2, please update Unsloth:

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

Head over to our blog post: https://unsloth.ai/blog/gemma2 for more details.

We uploaded 4bit quants for 4x faster downloading to:

https://huggingface.co/unsloth/gemma-2-9b-bnb-4bit

https://huggingface.co/unsloth/gemma-2-27b-bnb-4bit

https://huggingface.co/unsloth/gemma-2-9b-it-bnb-4bit

https://huggingface.co/unsloth/gemma-2-27b-it-bnb-4bit

Continued pretraining

You can now do continued pretraining with Unsloth. See https://unsloth.ai/blog/contpretraining for more details!

Continued pretraining is 2x faster and uses 50% less VRAM than HF + FA2 QLoRA. We offload embed_tokens and lm_head to disk to save VRAM!

You can now simply use both in the target modules like below:

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

We also allow 2 learning rates - one for the embedding matrices and another for the LoRA adapters:

from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    args = UnslothTrainingArguments(
        ....
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,
    ),
)

We also share a free Colab to finetune Mistral v3 to learn Korean (you can select any language you like) using Wikipedia and the Aya Dataset: https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing

And we're sharing our free Colab notebook for continued pretraining for text completion: https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing

What's Changed

New Contributors

Full Changelog: https://github.com/unslothai/unsloth/commits/June-2024