Training process #28

Open
maksimallist opened this issue Jul 10, 2023 · 23 comments

Comments

@maksimallist

Hello. Can you share the details of neural model training? Did you train it yourself? Did you collect data for training from basenji dataset files? I am unable to reproduce the claimed results during training.

@fransilvionGenomica

Has anyone looked into this yet? I am also interested in this, since training Enformer from scratch using your implementation doesn't reproduce the same Pearson correlation values (the max I am getting is ~0.4).
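
For readers comparing the numbers quoted in this thread: as far as I understand, the Pearson correlation reported for Enformer is computed per output track and then averaged over tracks. Below is a minimal plain-Python sketch of that calculation. The helper names `pearson_r` and `mean_pearson_over_tracks` are hypothetical, and the exact aggregation used in the paper or in this repository may differ.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson_over_tracks(pred, target):
    """Average the per-track correlation.

    pred and target are lists of positions; each position is a list with
    one value per track (assumed layout, chosen for illustration).
    """
    n_tracks = len(pred[0])
    per_track = [
        pearson_r([row[t] for row in pred], [row[t] for row in target])
        for t in range(n_tracks)
    ]
    return sum(per_track) / n_tracks


# Toy data: track 0 is perfectly correlated, track 1 perfectly anti-correlated.
pred = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
target = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
score = mean_pearson_over_tracks(pred, target)
```

Averaging per-track values means one poorly predicted track can pull the headline number down, so it may be worth checking the per-track distribution before concluding that a run has failed.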

@lucidrains
Owner

lucidrains commented Nov 1, 2023

@fransilvionGenomica @maksimallist i tried a while ago using TPUs (didn't have access to a large cluster of GPUs at the time) and didn't hit the mark (got around 0.5-0.6). this was before Ziga officially released their model over at deepmind

the training script i used is all open sourced here. the original reason for making the repo was for a contracting project for a local startup

@lucidrains
Owner

lucidrains commented Nov 1, 2023

@fransilvionGenomica are you planning on training it on proprietary data with your own GPU cluster?

@fransilvionGenomica

@lucidrains I am training your pytorch implementation using a single A100 GPU node with the original basenji dataset and gradient accumulation. I was using the following deepmind notebook as the reference: https://github.com/google-deepmind/deepmind-research/blob/master/enformer/enformer-training.ipynb. I do believe that it is possible to train the model on GPUs, since in the recent Borzoi paper from Enformer co-authors they did not use TPUs (https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1). Unfortunately, they don't provide any training script (https://github.com/calico/borzoi).
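
A minimal sketch of the gradient accumulation pattern mentioned above, for anyone trying the same single-GPU setup. It uses a toy `nn.Linear` model, random data, and `nn.MSELoss` purely as stand-ins; the function name `train_with_accumulation` and all hyperparameters are hypothetical and not taken from this repository or the DeepMind notebook.

```python
import torch
from torch import nn


def train_with_accumulation(model, batches, optimizer, loss_fn, accum_steps):
    """Run one pass over `batches`, stepping the optimizer every `accum_steps` micro-batches."""
    model.train()
    optimizer.zero_grad()
    n_updates = 0
    for step, (x, y) in enumerate(batches, start=1):
        # Scale the loss so the summed gradients match the mean over the
        # effective batch (micro-batch size * accum_steps).
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()  # gradients accumulate in .grad across micro-batches
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            n_updates += 1
    return n_updates


torch.manual_seed(0)
model = nn.Linear(4, 2)  # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# 8 micro-batches of size 2; with accum_steps=4 the effective batch size is 8.
batches = [(torch.randn(2, 4), torch.randn(2, 2)) for _ in range(8)]
updates = train_with_accumulation(model, batches, optimizer, nn.MSELoss(), accum_steps=4)
```

One caveat: accumulation enlarges the batch seen by the optimizer, but BatchNorm layers still compute their statistics on each micro-batch separately, so a run with micro-batches of 1 or 2 is not equivalent to a true large-batch run.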

@lucidrains
Owner

lucidrains commented Nov 1, 2023

@fransilvionGenomica ahh, i have not checked out Borzoi yet, although someone else told me it is the successor to Enformer

why are you still using this repository if Borzoi is the new SOTA? without reading the paper, did Borzoi set a new SOTA?

@lucidrains
Owner

@fransilvionGenomica where do you work btw?

@fransilvionGenomica

Oh I see. It makes sense. Even Borzoi mentioned it took them ~25 days on 2 GPUs, and I am training on a single GPU. I guess I will just have to wait then. Thanks!

@lucidrains
Owner

@fransilvionGenomica that is strange they waited that long. i thought calico had google level resources

@lucidrains
Owner

lucidrains commented Nov 1, 2023

@fransilvionGenomica i'll revisit genomics maybe end of the month and read the Borzoi paper in detail. knee deep in other projects at the moment.

@lucidrains
Owner

ahh ok, was told that Borzoi is nothing more than Enformer applied to RNA-seq data. ok then using this repository is fine in that case

@fransilvionGenomica

Yes, architecture wise they are very similar. Borzoi is actually less complex.

@lucidrains
Owner

@fransilvionGenomica ok, i'll just copy / paste the existing code and remove that complexity for Borzoi later this month after i read the paper. hopefully they got rid of the annoying gamma positions

@fransilvionGenomica

Just curious, have you noticed anything about the batch size while training enformer from scratch? Like, does it have to be relatively big (like at least 32) or can you train decently even if batch size is 1 or 2?

@lucidrains
Owner

lucidrains commented Nov 1, 2023

@fransilvionGenomica it has to be big (32 or 64). managing the data and long sequences was also a huge pain

@lucidrains
Owner

lucidrains commented Nov 1, 2023

@fransilvionGenomica the code in this repository isn't even set up for distributed training. i didn't set up synchronized batchnorm, which is required for it to train well.

@lucidrains
Owner

@fransilvionGenomica actually let me just throw that in there for now

@fransilvionGenomica

Have you tried to run your enformer implementation with pytorch lightning?

@lucidrains
Owner

@fransilvionGenomica no i haven't, as i said above, my training was done in tensorflow sonnet with TPUs, as i had access to a large cluster of TPUs in collaboration with EleutherAI back then

@lucidrains
Owner

@fransilvionGenomica if you ever wire up a working training script, always welcome a pull request, in the spirit of open source science.

@minjaf

minjaf commented Nov 2, 2023

> @fransilvionGenomica ahh, i have not checked out Borzoi yet, although someone else told me it is the successor to Enformer
>
> why are you still using this repository if Borzoi is the new SOTA? without reading the paper, did Borzoi set a new SOTA?

What the paper says:
"Performance is difficult to compare directly to Enformer due to differences in data processing. Nevertheless, test accuracies on the overlapping datasets are broadly similar, indicating competitive model training"
(https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1.full)

Let's wait until the reviewers ask this question =)

@fransilvionGenomica

@lucidrains do you have training/validation loss trends left by any chance? for your tensorflow training code I mean.

@lucidrains
Owner

@fransilvionGenomica hey yes, actually still have it lying around (thanks wandb) https://api.wandb.ai/links/lucidrains/9ac4x106

@ZhuJiwei111

> @fransilvionGenomica the code in this repository isn't even set up for distributed training. i didn't set up synchronized batchnorm, which is required for it to train well.

Hello, may I ask how to fix this? My training time is several times higher when I train with DDP than with a single GPU (with the same batch_size).
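
For context on the "synchronized batchnorm" mentioned in the quoted comment: in PyTorch this usually means replacing BatchNorm layers with `torch.nn.SyncBatchNorm` before wrapping the model in `DistributedDataParallel`. The sketch below shows only the conversion step on a toy model. It assumes the real model uses standard `nn.BatchNorm1d` layers, and it is not a claim about how this repository wires it up, nor a fix for the DDP slowdown described above.

```python
import torch
from torch import nn

# Toy stand-in for a convolutional block with batch normalization.
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=3, padding=1),
    nn.BatchNorm1d(8),
    nn.GELU(),
)

# Recursively replace every BatchNorm layer with SyncBatchNorm, which
# aggregates batch statistics across all processes during training.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# In a real distributed run (one process per GPU, after
# torch.distributed.init_process_group has been called), the converted
# model would then be wrapped, for example:
#   ddp_model = nn.parallel.DistributedDataParallel(
#       sync_model.to(local_rank), device_ids=[local_rank]
#   )
# where `local_rank` is this process's GPU index (hypothetical name here).
```

The conversion itself can be done on CPU; the synchronization only takes effect in training mode inside an initialized process group. It also adds communication at every BatchNorm layer, so by itself it would not be expected to make DDP training faster.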

5 participants