Upgraded to Colab+, changed GPU to A100, now training won't continue #614
Hey all, I used the regular Colab link and then copied the notebook to my Drive. I've got plenty of space (200 GB), so I'm good there as well. I always run all the cells in order:

- checking the GPU prints the usual nvidia-smi table confirming the A100 (output omitted here)
- after that I mount Google Drive
- then I install the dependencies (which requires me to restart the runtime afterwards) and restart the runtime
- then I make the dataset directory (I'm not sure if running all of these cells is necessary every time when resuming, but I've been doing it regardless to keep things consistent)
- then I copy my dataset
- then I start the pre-processing
- then I copy the config file and select Crepe for the f0_method
- then I start the train process

When I originally started encountering this issue, TensorBoard would load and training would stop shortly afterwards. I typically clear the output when TensorBoard loads so that I can see the epoch status actively running, since loading TensorBoard in Colab seems to hide the rest of the output from the train process.

However, now when I run the train process, I get an error page from Google: "403. That's an error. That's all we know." Clearing the output shows the loaded-checkpoint message, and then the training completes. I know there's no way it's actually finished, because I was last at around 4300 epochs out of 9999. Now when I try to run train, it gives me the following output:

```
2023-05-11 13:59:57.204472: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[14:00:37] INFO [14:00:37] Created a temporary directory
[14:01:05] INFO [14:01:05] Using strategy: auto
[14:01:37] INFO [14:01:37] Loaded checkpoint
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
```

...and then the train process completes. What am I doing wrong here? Any help appreciated. Thanks all!

P.S. The timestamps are out of order because, to capture the output pasted here, I have to clear it each time something new shows up in order to see the next line, so I've had to run the train process a few times to get the whole thing. If there is a way to disable or change this behavior in Colab, let me know, because it's annoying as hell.
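P.P.S. For reference, the cells I run boil down to roughly the following. This is a rough sketch from memory of the so-vits-svc-fork notebook; the exact paths, speaker-folder name, and flags in my copy may differ:

```python
# Rough sketch of the Colab cells described above (so-vits-svc-fork assumed).
# Folder names and paths are illustrative, not copied from the actual notebook.

!nvidia-smi                                   # check the assigned GPU (A100)

from google.colab import drive
drive.mount('/content/drive')                 # mount Google Drive

!pip install -U so-vits-svc-fork              # install dependencies, then restart the runtime

!mkdir -p dataset_raw/my_speaker              # make the dataset directory ("my_speaker" is a placeholder)
!cp /content/drive/MyDrive/dataset/*.wav dataset_raw/my_speaker/   # copy the dataset from Drive

!svc pre-resample                             # pre-processing: resample/split the raw audio
!svc pre-config                               # generate/copy the config (Crepe selected as the f0_method per the notebook)
!svc pre-hubert                               # extract features
!svc train -t                                 # start (or resume from the latest checkpoint) and launch TensorBoard
```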
Replies: 1 comment
In testing a few different things, I noticed training will not continue for one of two typical reasons (at least in my case):
One - there are audio files in the dataset folder that are longer than the recommended 5-10 seconds (e.g. if you split up a long file and then leave the full-length originals in the same folder)
Two - there are non-audio files mixed in with your audio files in the Dataset folder
In my situation, removing the culprit files fixed it. In the rare instance it didn't, I added `-n 2` to the end of the pre-hubert command and it worked like a charm.
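If it helps, here's a rough sketch of a sanity check you could run over the dataset folder before pre-processing. It assumes the `soundfile` package is available (the training dependencies normally pull it in), and `dataset_raw` is just an example path, so point it at wherever your clips actually live:

```python
# Flag dataset files that commonly break pre-processing/training:
# clips much longer than ~10 seconds, and anything that isn't readable audio.
from pathlib import Path

import soundfile as sf  # usually installed with the training dependencies

DATASET_DIR = Path("dataset_raw")  # example path - change to your dataset folder
MAX_SECONDS = 10.0

for path in sorted(p for p in DATASET_DIR.rglob("*") if p.is_file()):
    try:
        duration = sf.info(str(path)).duration  # length in seconds, read from the file header
    except RuntimeError:
        print(f"NOT AUDIO / unreadable: {path}")
        continue
    if duration > MAX_SECONDS:
        print(f"TOO LONG ({duration:.1f}s): {path}")
```

And the `-n 2` tweak is literally just appended to the pre-hubert cell, e.g. `!svc pre-hubert -n 2` (assuming the standard `svc pre-hubert` command; double-check the flag with `svc pre-hubert --help`).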