Very High Loss (~15) and Instability with Previously-working Config From A While Ago #2224
Expected Behavior
Training should proceed roughly as it did many versions ago, without a catastrophic loss graph.
Current behaviour
Wanting to track down whether recent poor SFT performance is my fault or is caused by underlying software changes, I recently tried to use a very old config for finetuning a continued-pretrain Mistral 7B.
The original loss was normal back in the day, and looked like this (recorded on 2024-09-25):
However, using that config (with slight changes to keep it from erroring; both versions provided below) with the latest Docker image, training starts at a loss of 15.3 and is spiky as all hell.
Notes:
8x A40 Runpod instance, both times
Command used to run: `accelerate launch --use_deepspeed -m axolotl.cli.train ./pathtoyamlfile.yaml`
Some changes had to be made to the original config so it would not error on the newest axolotl version. Specifically: deepspeed had to be changed from zero2 to zero1 due to #2191, and the datasets had to be changed from `type: sharegpt` to being specified manually (see the sketch below).
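For context, a minimal sketch of what those two changes look like in an axolotl YAML. This is not the actual config: the dataset path, the ChatML template, and the ShareGPT-style field names (conversations/from/value) are assumptions, and the exact option names for `type: chat_template` can vary between axolotl versions.

```yaml
# OLD (worked on 2024-09-25):
# deepspeed: deepspeed_configs/zero2.json
# datasets:
#   - path: ./data/sft_sharegpt.jsonl   # placeholder path
#     type: sharegpt
#     conversation: chatml

# NEW (edits needed to run on the latest image):
deepspeed: deepspeed_configs/zero1.json  # zero2 -> zero1 because of #2191
chat_template: chatml                    # assumed template; the original relied on the sharegpt conversation setting
datasets:
  - path: ./data/sft_sharegpt.jsonl      # placeholder path
    type: chat_template                  # manual specification replacing type: sharegpt
    field_messages: conversations        # assumed ShareGPT-style field names
    message_field_role: from
    message_field_content: value
```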
Steps to reproduce
Rent an 8x A40 instance on Runpod
Train Mistral 7B base with the given deepspeed config and hyperparams on a ShareGPT-format dataset
Observe spiky loss
Config yaml
Config 2 (NEW/BROKEN):
Areas that were changed have been indicated with comments.