Upgraded to Colab+, changed GPU to A100, now training won't continue #614
Hey all, I used the regular Colab link and then copied the notebook to my Drive. I've got plenty of space (200 GB), so I'm good there as well. I always run all the cells in order:

- checking the GPU prints the usual nvidia-smi table confirming the A100 (output omitted here)
- after that I mount Google Drive
- then I install the dependencies (which requires me to restart the runtime afterwards) and restart the runtime
- then I make the dataset directory (I'm not sure if running all of these cells is necessary every time when resuming, but I've been doing it regardless to keep things consistent)
- then I copy my dataset
- then I start the pre-processing
- then I copy the config file and select Crepe for the f0_method
- then I start the train process

When I originally started encountering this issue, TensorBoard would load and training would stop shortly afterwards. I typically clear the output when TensorBoard loads so that I can see the epoch status actively running, since loading TensorBoard in Colab seems to hide the rest of the output from the train process.

However, now when I run the train process, I get an error page from Google: "403. That's an error. That's all we know." Clearing the output shows the loaded-checkpoint message, and then the training completes. I know there's no way it's actually finished, because I was last at around 4300 epochs out of 9999. Now when I try to run train, it gives me the following output:

```
2023-05-11 13:59:57.204472: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[14:00:37] INFO [14:00:37] Created a temporary directory
[14:01:05] INFO [14:01:05] Using strategy: auto
[14:01:37] INFO [14:01:37] Loaded checkpoint
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
```

...and then the train process completes. What am I doing wrong here? Any help appreciated. Thanks all!

P.S. The timestamps are out of order because, to capture the output pasted here, I have to clear it each time something new shows up in order to see the next line, so I've had to run the train process a few times to get the whole thing. If there is a way to disable or change this behavior in Colab, let me know, because it's annoying as hell.
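P.P.S. For reference, the cells I run boil down to roughly the following. This is a rough sketch from memory of the so-vits-svc-fork notebook; the exact paths, speaker-folder name, and flags in my copy may differ:

```python
# Rough sketch of the Colab cells described above (so-vits-svc-fork assumed).
# Folder names and paths are illustrative, not copied from the actual notebook.

!nvidia-smi                                   # check the assigned GPU (A100)

from google.colab import drive
drive.mount('/content/drive')                 # mount Google Drive

!pip install -U so-vits-svc-fork              # install dependencies, then restart the runtime

!mkdir -p dataset_raw/my_speaker              # make the dataset directory ("my_speaker" is a placeholder)
!cp /content/drive/MyDrive/dataset/*.wav dataset_raw/my_speaker/   # copy the dataset from Drive

!svc pre-resample                             # pre-processing: resample/split the raw audio
!svc pre-config                               # generate/copy the config (Crepe selected as the f0_method per the notebook)
!svc pre-hubert                               # extract features
!svc train -t                                 # start (or resume from the latest checkpoint) and launch TensorBoard
```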
Replies: 1 comment
In testing a few different things, I noticed training will not continue for one of two typical reasons (at least in my case):
One - there are audio files in the dataset folder that are longer than the recommended 5-10 seconds (e.g. if you split up a long file and then leave the full-length originals in the same folder)
Two - there are non-audio files mixed in with your audio files in the Dataset folder
In my situation, removing the culprit files fixed it. In the rare instance it didn't, I added `-n 2` to the end of the pre-hubert command and it worked like a charm.
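If it helps, here's a rough sketch of a sanity check you could run over the dataset folder before pre-processing. It assumes the `soundfile` package is available (the training dependencies normally pull it in), and `dataset_raw` is just an example path, so point it at wherever your clips actually live:

```python
# Flag dataset files that commonly break pre-processing/training:
# clips much longer than ~10 seconds, and anything that isn't readable audio.
from pathlib import Path

import soundfile as sf  # usually installed with the training dependencies

DATASET_DIR = Path("dataset_raw")  # example path - change to your dataset folder
MAX_SECONDS = 10.0

for path in sorted(p for p in DATASET_DIR.rglob("*") if p.is_file()):
    try:
        duration = sf.info(str(path)).duration  # length in seconds, read from the file header
    except RuntimeError:
        print(f"NOT AUDIO / unreadable: {path}")
        continue
    if duration > MAX_SECONDS:
        print(f"TOO LONG ({duration:.1f}s): {path}")
```

And the `-n 2` tweak is literally just appended to the pre-hubert cell, e.g. `!svc pre-hubert -n 2` (assuming the standard `svc pre-hubert` command; double-check the flag with `svc pre-hubert --help`).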