
Check point #144

Open
ntromas opened this issue Mar 2, 2023 · 7 comments

Comments

@ntromas

ntromas commented Mar 2, 2023

Hi Vamb team,

Thanks for such a nice tool! My run has been going for several days (>5) and has not finished yet. Is there a way to accelerate the process (we are limited in terms of run time on our server), or is there some kind of checkpoint (a bit like MEGAHIT, for example) to restart a run from where it left off?

Cheers,
Nico

@simonrasmu
Collaborator

Hi Nico,

How large is the data set and how far is it getting in the log? Are you using a GPU?

If things are slow in training the VAE you can try to:

  • Reduce the number of epochs (-e option, default is 500; you can try 250). As far as I remember, it should not have a very big impact on the clusters.
  • Increase the starting batch size (-t option, default is 256; you can try 512). This will approximately halve the time it takes to train the model, and as long as you don't make it too large, it should not have a very big impact either.
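For reference, a minimal sketch of a command line with both suggestions applied (the output directory and input file names are placeholders, not from this thread):

```shell
# Sketch only: paths are placeholders.
# -e 250 halves the number of epochs (default 500);
# -t 512 doubles the starting batch size (default 256).
vamb --outdir vamb_out \
     --fasta contigs.fna.gz \
     --bamfiles sample*.bam \
     -e 250 -t 512
```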

pinging @jakobnissen as he is more up to date with it

Best,

Simon

@ntromas
Author

ntromas commented Mar 3, 2023

Hi Simon,

Thanks for the quick answer! Here is the log file:
log.txt
I am not using a GPU (I think it is doable on our server, but I'm not sure how) and used 250 GB of system RAM (I can increase that, but I'm not sure it will accelerate the process).

Cheers!

Nico

@jakobnissen
Member

Dear @ntromas

You have 13.3M contigs in 138 samples, which is quite a lot to train without a GPU. I see two reasons it could be slow:

  1. You've run out of RAM. That will cause thrashing, massively slowing down Vamb. That seems unlikely with 250 GB RAM though. Can you check how much RAM is being consumed by the process, or alternatively, how much is available on your computer right now?
  2. Maybe it really is that slow for 13M contigs on CPU only. In that case, you could run Vamb on fewer samples by splitting them into batches of, say, 20 or 25. For reference, with a new-ish GPU this should take a couple of hours (can't remember exactly, but definitely significantly less than one working day).

Vamb will train for 500 epochs, but after epoch 300, the batch size will double again so it will be faster. Nonetheless, it'll probably take a few more days...
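The effect of the batch-size schedule can be sketched with a back-of-envelope estimate. Both the inverse scaling of epoch time with batch size and the 25/75/150/300 batch steps are assumptions for illustration here, not measured Vamb behaviour:

```python
# Back-of-envelope estimate (assumption: time per epoch scales roughly
# inversely with batch size; batch steps assumed to be 25/75/150/300).

def estimated_training_time(epochs=500, start_batch=256,
                            batchsteps=(25, 75, 150, 300),
                            time_per_epoch_at_256=1.0):
    """Relative total training time in units of one epoch at batch size 256."""
    total = 0.0
    batch = start_batch
    for epoch in range(epochs):
        if epoch in batchsteps:
            batch *= 2  # batch size doubles at each batch step
        total += time_per_epoch_at_256 * 256 / batch
    return total

# Default start (-t 256) vs. doubled starting batch size (-t 512):
print(round(estimated_training_time(start_batch=256), 1))  # 100.0
print(round(estimated_training_time(start_batch=512), 1))  # 50.0
```

Under this simple model, starting at batch size 512 halves the total training time, which matches Simon's estimate above.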

To answer your question directly: there is no command-line interface to resume from a previous run, although maybe we should add that. However, Vamb does save all the relevant files from each step, so it should be possible to resume a run. The training itself is a single "step", though, so if that is interrupted, there is unfortunately no way to resume it. You can look at the tutorial (in the doc directory of the repository) to see how Vamb works under the hood, or look at the vamb/__main__.py file.
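As a hypothetical illustration of the "resume from saved files" idea: if the training step completed but a later step was interrupted, the saved latent encoding could be reloaded instead of retraining. The file name and array key below are placeholders, not guaranteed Vamb output names:

```python
# Hypothetical sketch: reload a saved latent encoding rather than retraining.
import numpy as np

# Stand-in for a saved latent encoding (n_contigs x n_latent dimensions);
# in a real run this file would come from Vamb's output directory.
np.savez_compressed("latent.npz",
                    latent=np.random.rand(1000, 32).astype(np.float32))

latent = np.load("latent.npz")["latent"]
print(latent.shape)
# Clustering could now be re-run on `latent` without repeating the VAE
# training; see vamb/__main__.py for how the individual steps connect.
```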

@ntromas
Author

ntromas commented Mar 8, 2023

Hi! I can increase the RAM to 1 TB if necessary. I can also use a GPU, but I'm not sure how, or whether Vamb will automatically recognize it. For now I followed Simon's advice and played a bit with -e and -t. But I will run Vamb with many more metagenomes in the coming weeks, and I will have the same time limits, so it would be great to optimize the use of CPUs, RAM, and GPUs... If I batch samples by 20 or 30, what would be the impact on binning compared to one batch of 138 samples? (Not sure I understood you correctly here, sorry...)

Thanks for your help!

Cheers,

Nico

@simonrasmu
Collaborator

Hi Nico,

I would highly recommend using a GPU; it can reduce the running time a lot. It should be easy to get Vamb running on a GPU: it is basically just a matter of adding the --cuda option on the command line. However, you need to have the correct Nvidia libraries loaded, and make sure you install Vamb using pip and not conda. Whether the correct libraries are loaded depends on the system you are using, but if you can run nvidia-smi on the command line, it should work.
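A rough sketch of those checks (the nvidia-smi and PyTorch commands are standard; the vamb invocation and its paths are placeholders):

```shell
# Check that the Nvidia driver and libraries are visible:
nvidia-smi
# Check that PyTorch (which Vamb uses) can see the GPU:
python -c "import torch; print(torch.cuda.is_available())"

# If both work, add --cuda to the usual command (paths are placeholders):
vamb --outdir vamb_out --fasta contigs.fna.gz --bamfiles sample*.bam --cuda
```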

Best,

Simon

@ntromas
Author

ntromas commented Mar 9, 2023

Hi Simon,

Thanks! It worked perfectly and was super quick... Impressive!

Cheers!

Nico

@jakobnissen
Member

Great that it got solved. I'll keep this issue open, because adding checkpoints is definitely something we could consider for the future.
