
Check point #144

Open
ntromas opened this issue Mar 2, 2023 · 7 comments

Comments

@ntromas

ntromas commented Mar 2, 2023

Hi Vamb team,

Thanks for such a nice tool! My run has been going for several days (>5) and has not finished yet. Is there a way to accelerate the process (we are limited in terms of run time on our server), or is there some kind of checkpoint (a bit like MEGAHIT, for example) to restart a run from where it left off?

Cheers,
Nico

@simonrasmu
Collaborator

Hi Nico,

How large is the data set and how far is it getting in the log? Are you using a GPU?

If things are slow in training the VAE you can try to:

  • Reduce the number of epochs (-e option, default is 500; you can try 250). As far as I remember, it should not have a very big impact on the clusters.
  • Increase the starting batch size (-t option, default is 256; you can try 512). This will approximately halve the time it takes to train the model, and as long as you don't make it too large, it should not have a very big impact either.
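For reference, a minimal sketch of a command line with both suggestions applied (the output directory and input file names are placeholders, not from this thread):

```shell
# Sketch only: paths are placeholders.
# -e 250 halves the number of epochs (default 500);
# -t 512 doubles the starting batch size (default 256).
vamb --outdir vamb_out \
     --fasta contigs.fna.gz \
     --bamfiles sample*.bam \
     -e 250 -t 512
```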

pinging @jakobnissen as he is more up to date with it

Best,

Simon

@ntromas
Author

ntromas commented Mar 3, 2023

Hi Simon,

Thanks for the quick answer! Here is the log file:
log.txt
I am not using a GPU (I think it is doable on our server, but I'm not sure how) and used 250 GB of system RAM (I can increase that, but I'm not sure it will accelerate the process).

Cheers!

Nico

@jakobnissen
Member

Dear @ntromas

You have 13.3M contigs in 138 samples, which is quite a lot to train without a GPU. I see two reasons it could be slow:

  1. You've run out of RAM. That will cause thrashing, massively slowing down Vamb. That seems unlikely with 250 GB RAM though. Can you check how much RAM is being consumed by the process, or alternatively, how much is available on your computer right now?
  2. Maybe it really is that slow for 13M contigs on CPU only. In that case, you could run Vamb on fewer samples by splitting them into batches of, say, 20 or 25. For reference, with a new-ish GPU this should take a couple of hours (can't remember exactly, but definitely significantly less than one working day).

Vamb will train for 500 epochs, but after epoch 300, the batch size will double again so it will be faster. Nonetheless, it'll probably take a few more days...
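The effect of the batch-size schedule can be sketched with a back-of-envelope estimate. Both the inverse scaling of epoch time with batch size and the 25/75/150/300 batch steps are assumptions for illustration here, not measured Vamb behaviour:

```python
# Back-of-envelope estimate (assumption: time per epoch scales roughly
# inversely with batch size; batch steps assumed to be 25/75/150/300).

def estimated_training_time(epochs=500, start_batch=256,
                            batchsteps=(25, 75, 150, 300),
                            time_per_epoch_at_256=1.0):
    """Relative total training time in units of one epoch at batch size 256."""
    total = 0.0
    batch = start_batch
    for epoch in range(epochs):
        if epoch in batchsteps:
            batch *= 2  # batch size doubles at each batch step
        total += time_per_epoch_at_256 * 256 / batch
    return total

# Default start (-t 256) vs. doubled starting batch size (-t 512):
print(round(estimated_training_time(start_batch=256), 1))  # 100.0
print(round(estimated_training_time(start_batch=512), 1))  # 50.0
```

Under this simple model, starting at batch size 512 halves the total training time, which matches Simon's estimate above.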

To answer your question directly: there is no command-line interface to resume from a previous run, although maybe we should add that. However, Vamb does save all the relevant files from each step, so it should be possible to resume a run. The training itself is a single "step", though, so if that is interrupted, there is unfortunately no way to resume it. You can look at the tutorial (in the doc directory of the repository) to see how Vamb works under the hood, or look at the vamb/__main__.py file.
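As a hypothetical illustration of the "resume from saved files" idea: if the training step completed but a later step was interrupted, the saved latent encoding could be reloaded instead of retraining. The file name and array key below are placeholders, not guaranteed Vamb output names:

```python
# Hypothetical sketch: reload a saved latent encoding rather than retraining.
import numpy as np

# Stand-in for a saved latent encoding (n_contigs x n_latent dimensions);
# in a real run this file would come from Vamb's output directory.
np.savez_compressed("latent.npz",
                    latent=np.random.rand(1000, 32).astype(np.float32))

latent = np.load("latent.npz")["latent"]
print(latent.shape)
# Clustering could now be re-run on `latent` without repeating the VAE
# training; see vamb/__main__.py for how the individual steps connect.
```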

@ntromas
Author

ntromas commented Mar 8, 2023

Hi! I can increase the RAM to 1 TB if necessary. I can also use a GPU, but I'm not sure how, or whether Vamb will automatically recognize it. For now I followed Simon's advice and played a bit with -e and -t. But I will run Vamb with many more metagenomes in the coming weeks, and I will have the same time limits, so it would be great to optimize the use of CPUs, RAM, and GPUs... If I batch samples by 20 or 30, what would be the impact on binning compared to one batch of 138 samples? (Not sure I understood you correctly here, sorry...)

Thanks for your help!

Cheers,

Nico

@simonrasmu
Collaborator

Hi Nico,

I would highly recommend using a GPU; it can reduce the running time a lot. It should be easy to get Vamb running on a GPU: it is basically just a matter of adding the --cuda option on the command line. However, you need to have the correct Nvidia libraries loaded, and make sure you install Vamb using pip and not conda. Whether the correct libraries are loaded depends on the system you are using, but if you can run nvidia-smi on the command line, it should work.
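A rough sketch of those checks (the nvidia-smi and PyTorch commands are standard; the vamb invocation and its paths are placeholders):

```shell
# Check that the Nvidia driver and libraries are visible:
nvidia-smi
# Check that PyTorch (which Vamb uses) can see the GPU:
python -c "import torch; print(torch.cuda.is_available())"

# If both work, add --cuda to the usual command (paths are placeholders):
vamb --outdir vamb_out --fasta contigs.fna.gz --bamfiles sample*.bam --cuda
```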

Best,

Simon

@ntromas
Author

ntromas commented Mar 9, 2023

Hi Simon,

Thanks! It worked perfectly and was super quick... Impressive!

Cheers!

Nico

@jakobnissen
Member

Great that it got solved. I'll keep this issue open, because adding checkpoints is definitely something we could consider for the future.
