
llama3.2 fine tuning generates repeated pattern towards the end of one epoch #735

ruian1 opened this issue Oct 18, 2024 · 9 comments

@ruian1

ruian1 commented Oct 18, 2024

System Info

PyTorch version: 2.4.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
GPU Type and number: A100 80GB x 1

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

GPUS=1 PER_DEVICE_BATCH_SIZE=2 nohup sh src/llama_lora_finetune.sh

Error logs

I was fine-tuning meta-llama/Llama-3.2-11B-Vision-Instruct on MIMIC-II (https://archive.physionet.org/mimic2/) with 170k image-text pairs. The checkpoints up to 0.7 of one epoch generate output text as expected, but from 0.8 of the epoch onward the checkpoints generate a repeated pattern like the one below:

sp leading jack jack jack coach Port jack jack jackzens jack jack pit jack jackrap jack jack Port jackansk jack jack jackrex jackeman jack jack jack jack jack ピ jackleading sp jackrex jack jack jack jack jack jack jack jack jack jack jack jack jack jack jack jackrex jack jack jackeman pit pit jack jack jack jackleading jack jack pig jack jack pit jack jack event jack jack jack pit jackstorybook jackeman jack jack leading jackchl jack jack jack jack jackjack sp leading jack jack jackleading jack jack jack pigleading jack ピ jack pit pit jack jack ピ jack jack jackrexindow jack jack jack jack jack jack jack jack jackzens jack pitansk jackrap jack jack jack leadingsid pit jack jack jack jack jack jack jack pit jack pit jack jack jack jackeman jack pit pit jack jack jack jack jack jack jack jack jackjack jackjack jack jack jack pit jack pit jack jack jack jack jack event jack jack jack pit jack jack697storybookrex jack jack jack jack jack leading pit pit jack jack jack jack jackzens jack jack jack pit jack jack jack jack jack pit jack jack jack jack jack jack pit697 jackleading jack jack jack pit pit jack jack jack jack jack jack jack jackrexrap jack jack jackjack jack jack jack jack jack pitrapeman jack jack event coach jack jack jack jack jack jack Pose jack jackrap jack jack Pose jack jack jack jack jackjack pit jack jack event pit pit jack jack jack coach jack jack jack jack pit Pose jack pig jack jackzens_ENUMstorybook jack jack jackrapsid pit jack pit jack jack jack jackjack jack jack jack jack jack jackrexindow jack jack jack jack jack coach jack jack jack jack jack jackeman pit jack pit jack pit pitrap jack jackleading jack jack jack jack jack jack jack jackrap jack jack jack jack coach pit jack jack jack coach jackansk jack jack jack pit pig jack jack jack jack jack jack jack jackrap pit jack jackzensansk pit jacksid jack jack jack coach jack jack jack jack jack jack jack jack jackansk ピ jackrap jack jack jack jack jack jack jackzens jack_ENUM jack pit jack jack jack jack jack jack jack jackjack pig ピ pit coach jack jack pit jack jack jack jack jackchl jack coach jack jack jack jack jack jack jack jack jack jack jack pit pitjack jackjack jack jackrex jack jack jackstorybook jackeman pit jack jack jack jack Pose jack jack jack jack leading jack jack jack Pose jack jack jack jack pig jack pit event jack jack jack coach jack jack jack pitrex302 jack jack jack jack jack jack jack jack pit jack jack pigzens jack jackrap

Expected behavior

The model should keep generating normal output at 0.8 of the epoch of training and beyond.

@NicoZenith

I see a similar behavior when fine-tuning on a custom dataset. Did you plot your training loss with wandb?

I wonder whether this is a learning-rate issue. Also, is there a way to schedule the learning-rate decay?

@ruian1
Author

ruian1 commented Oct 21, 2024

I see a similar behavior when fine-tuning on a custom dataset. Did you plot your training loss with wandb?

I wonder whether this is a learning-rate issue. Also, is there a way to schedule the learning-rate decay?

I logged to TensorBoard; take a look at my loss below. I had to smooth it with a factor of 1.0 so you can see how it drops. I applied a cosine LR schedule with a 0.03 warmup ratio and 0.01 weight decay. What I don't understand is why the model got corrupted somewhere between 0.7 and 0.8 of the epoch.

[Screenshots: TensorBoard training loss curves]

@NicoZenith

Your loss seems to be fine; maybe train longer or increase the learning rate? Repetitive answers usually mean that the model is still adapting to the new domain.
Btw, how do you set up the warmup ratio and cosine schedule? They are not available arguments in the fine-tuning script, as far as I know.

@ruian1
Author

ruian1 commented Oct 21, 2024

Your loss seems to be fine; maybe train longer or increase the learning rate? Repetitive answers usually mean that the model is still adapting to the new domain.

Yup, those are probably the way out; it's just that A100 rates are high, and I'd like to check whether anyone has seen and solved a similar problem.

Btw, how do you set up the warmup ratio and cosine schedule? They are not available arguments in the fine-tuning script, as far as I know.

I added them manually. This is another thing that's weird about this repo: almost every other fine-tuning repo has enabled a cosine learning rate (Idefics, Intern, Qwen, Aria, etc.), but not this one. It makes me worried that the fine-tuning script has not been well tested in this repo.

@NicoZenith

Yeah, I agree; my fine-tuned model performs worse than a smaller fine-tuned LLaVA-OneVision model on my custom dataset. The loss doesn't manage to go down as much. Let's see if there are any significant updates in the coming weeks.
How do you add the learning-rate scheduler manually?

@ruian1
Author

ruian1 commented Nov 1, 2024

Yeah, I agree; my fine-tuned model performs worse than a smaller fine-tuned LLaVA-OneVision model on my custom dataset. The loss doesn't manage to go down as much. Let's see if there are any significant updates in the coming weeks. How do you add the learning-rate scheduler manually?

Add these between lines 118 and 120 of
https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/utils/train_utils.py#L118
(you will also need to import LambdaLR, CosineAnnealingLR, and SequentialLR from torch.optim.lr_scheduler at the top of the file):

    epoch_times = []
    checkpoint_times = []
    results = {}
    best_val_loss = float("inf")
    total_train_steps = 0
    max_steps_reached = False  # Flag to indicate max training steps reached
    # Start the training loop

    update_steps = 0  # Counter for steps where the model parameters are updated

    total_length = len(train_dataloader) // gradient_accumulation_steps
    total_steps = train_config.num_epochs * total_length
    warmup_steps = int(train_config.warmup_ratio * total_steps)

    print(f"total_length: {total_length}, total_steps: {total_steps}, warmup_steps: {warmup_steps}")

    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return float(current_step) / float(max(1, warmup_steps))  # Linear warmup
        else:
            return 1.0

    warmup_scheduler = LambdaLR(optimizer, lr_lambda)
    cosine_scheduler = CosineAnnealingLR(
        optimizer, T_max=total_steps - warmup_steps, eta_min=0.0
    )
    lr_scheduler = SequentialLR(
        optimizer,
        schedulers=[warmup_scheduler, cosine_scheduler],
        milestones=[warmup_steps],
    )

and move lr_scheduler.step() to after optimizer.zero_grad()
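
For reference, here is a minimal sketch of the resulting step ordering inside the training loop. The variable names are the ones used above; the actual loop in train_utils.py also handles gradient clipping, mixed precision, and FSDP, which are omitted here:

    # minimal sketch of the step ordering; clipping, AMP, and FSDP handling omitted
    for step, batch in enumerate(train_dataloader):
        loss = model(**batch).loss
        loss = loss / gradient_accumulation_steps
        loss.backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            lr_scheduler.step()  # advance the warmup/cosine schedule once per optimizer update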

@xuhang-2

Hi there. I also have this problem when I SFT the 11B model. The loss is below:
[Image: training loss curve]
The output is a repeating pattern.
Even when I check the instruct model without SFT, the output is still a repeating pattern.

@wukaixingxp
Contributor

Hi! We added a freeze_LLM_only option for mllama fine-tuning, and that feature may help with this problem. Please give it a try and let me know if it helps.

@NicoZenith

Hi! We added a freeze_LLM_only option for mllama fine-tuning, and that feature may help with this problem. Please give it a try and let me know if it helps.

Thanks a lot! If I understand correctly, this option only trains the vision encoder and adapter weights and keeps the LLM frozen. Is that how Llama 3.2 Vision was trained? How can this improve performance, given that we reduce the number of trainable parameters?
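
For intuition, freezing the LLM while keeping the vision parts trainable boils down to something like the sketch below. This is a conceptual illustration only, with assumed Hugging Face mllama module names; it is not the actual llama-recipes implementation of freeze_LLM_only:

    # conceptual sketch: keep the language-model blocks frozen and leave the vision
    # encoder plus the cross-attention (adapter) layers trainable.
    # the module-name prefixes below are assumptions based on the HF mllama layout.
    for name, param in model.named_parameters():
        if name.startswith("language_model.") and "cross_attn" not in name:
            param.requires_grad = False   # frozen LLM weights
        else:
            param.requires_grad = True    # vision encoder + adapter weights keep training

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,}")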
