Not so much an issue as an FYI. I was getting OOM errors yesterday when trying to train a LoRA on my 3090 (24 GB VRAM) with 64 GB of DDR5.
One issue was that the models which convert the dataset to weights (I believe there are two) were set to load local files only, and that needed to be changed (set to False). I also had to comment out a line near the beginning that was trying to set the local rank to 0.
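For anyone hitting the same startup error, here's a pure-Python sketch of why a local-only load fails on a fresh machine. The `load_pretrained` function and cache-dir layout are hypothetical stand-ins for how Hugging Face's `from_pretrained(..., local_files_only=...)` behaves, not the repo's actual code:

```python
import os
import tempfile

def load_pretrained(model_id: str, cache_dir: str, local_files_only: bool = False):
    # Loader checks the local cache first, mimicking from_pretrained behavior.
    local_path = os.path.join(cache_dir, model_id.replace("/", "--"))
    if os.path.isdir(local_path):
        return f"loaded {model_id} from cache"
    if local_files_only:
        # The failure mode: nothing cached yet and downloads are disallowed.
        raise FileNotFoundError(
            f"{model_id} is not in the local cache and local_files_only=True"
        )
    return f"downloaded {model_id}"

# With local_files_only=False, the loader falls back to downloading:
empty_cache = tempfile.mkdtemp()
result = load_pretrained("hypothetical/text-encoder", empty_cache)
```

So flipping the flag to False just lets the first run fetch the weights instead of erroring out.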
So then the training would try to run, but I'd run out of VRAM. I did some digging and realized those two models are only used to preprocess the dataset, so they don't need to be loaded during training.
My solution was to write my own preprocessing script, modify the ACE dataset class to load the data from the cache, and modify the main training script to skip those two extra models. It took a while to get everything right, but training is working great now!
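The cache-then-train split looks roughly like this. This is a minimal sketch, assuming features can be precomputed once offline; the class name, file layout, and the `"tokens"` feature are hypothetical placeholders, not the actual ACE dataset code:

```python
import os
import pickle
import tempfile

def preprocess_to_cache(songs, cache_dir):
    """Run the heavy preprocessing models once and save their outputs to disk."""
    os.makedirs(cache_dir, exist_ok=True)
    for i, song in enumerate(songs):
        # Stand-in for the real model outputs (encoded audio/text features).
        features = {"tokens": song.upper()}
        with open(os.path.join(cache_dir, f"{i:06d}.pkl"), "wb") as f:
            pickle.dump(features, f)

class CachedDataset:
    """Serves precomputed features, so no preprocessing models are needed at train time."""
    def __init__(self, cache_dir):
        self.paths = sorted(
            os.path.join(cache_dir, p) for p in os.listdir(cache_dir)
        )
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)

cache_dir = tempfile.mkdtemp()
preprocess_to_cache(["song_a", "song_b"], cache_dir)
dataset = CachedDataset(cache_dir)
```

The point is that the training process only ever opens the cache files, so the two preprocessing models never touch VRAM during training.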
In hindsight, I probably could have just added better memory management to the main training script to make sure the preprocessing models are unloaded from memory before training begins.
I ran it overnight and I'm at about step 40k on a 70-song dataset, and it's learning it really well with the default LoRA settings. Not sure how it generalizes yet, but I'll test it out when it's done cooking.
Oh, and for some reason I had to run with precision = "32-full".
Thanks for all the work y'all have done, this is awesome and it's exactly what I was looking for at exactly the right time!