Issues with Training the C.Origami Model Using Only Sequence Data and Integrating Multi-Species Data #48

Open
hanshandong2024 opened this issue Jul 22, 2024 · 4 comments

@hanshandong2024

Dear Author,

Thank you for developing C.Origami. It is great work, but I have run into some difficulties.

First, I have DNA sequence data from other species, but no corresponding ATAC-seq or CTCF ChIP-seq data. I would like to train the model and make predictions using only the sequence data. Could you please advise me on how to modify the code to retrain the model?

Second, the sequences of my target species are relatively short, possibly orders of magnitude shorter than those of human and mouse. This might lead to poor training results due to insufficient training data. I would like to expand the training data by using sequences from multiple species, each paired with its own three-dimensional structure data. I noticed that the training data is input by chromosome. Is it possible to input sequence information from multiple species corresponding to multiple three-dimensional structures?

Thank you for your assistance.

@tanjimin
Owner

Hi @hanshandong2024, these are all doable and make sense.

  1. Training using only sequence: You can remove the ATAC and CTCF inputs and their corresponding encoder, leaving just the sequence encoder. Then edit the output dimension of the sequence encoder so that it matches the transformer input size, and connect it directly to the transformer.

  2. Multi-species training: I haven't tried this, but you could theoretically treat each species as a "chromosome" to increase the training data size. One pitfall is that these species might have different principles of genome organization, so your model could end up learning an average of these rules, resulting in blurry predictions.
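
The two suggestions above can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the actual C.Origami code: the class `SeqOnlyModel`, the helper `species_as_chromosomes`, the layer sizes, the 5-channel one-hot input, and the simple pairwise output head are all assumptions made for the example. The point it shows is that the encoder's output channels (`hidden`) equal the transformer's `d_model`, so the two can be connected directly without the ATAC/CTCF branch.

```python
import torch
import torch.nn as nn


class SeqOnlyModel(nn.Module):
    """Hypothetical sequence-only model: sequence encoder -> transformer -> 2D head."""

    def __init__(self, seq_channels=5, hidden=64, n_downsample=4):
        super().__init__()
        layers, in_ch = [], seq_channels
        for _ in range(n_downsample):
            # Each strided convolution halves the sequence length.
            layers += [
                nn.Conv1d(in_ch, hidden, kernel_size=5, stride=2, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
            ]
            in_ch = hidden
        self.encoder = nn.Sequential(*layers)
        # d_model equals the encoder output channels, so no extra projection is needed.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, dim_feedforward=4 * hidden, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Simplified stand-in for a contact-map decoder (an assumption of this sketch).
        self.head = nn.Conv2d(2 * hidden, 1, kernel_size=1)

    def forward(self, seq):
        # seq: (batch, length, channels), one-hot encoded DNA
        x = self.encoder(seq.transpose(1, 2))      # (batch, hidden, bins)
        x = self.transformer(x.transpose(1, 2))    # (batch, bins, hidden)
        x = x.transpose(1, 2)                      # (batch, hidden, bins)
        bins = x.shape[-1]
        rows = x.unsqueeze(3).expand(-1, -1, -1, bins)
        cols = x.unsqueeze(2).expand(-1, -1, bins, -1)
        pair = torch.cat([rows, cols], dim=1)      # (batch, 2*hidden, bins, bins)
        return self.head(pair).squeeze(1)          # (batch, bins, bins)


def species_as_chromosomes(genomes):
    """Flatten {species: {chrom: length}} into {"species:chrom": length}.

    Each entry can then be listed like one more "chromosome" in the training set.
    """
    return {
        f"{species}:{chrom}": length
        for species, chroms in genomes.items()
        for chrom, length in chroms.items()
    }


# Tiny demo on a short dummy window (real input windows are far longer).
model = SeqOnlyModel()
model.eval()
with torch.no_grad():
    out = model(torch.zeros(2, 1024, 5))

keys = species_as_chromosomes(
    {"speciesA": {"chr1": 100}, "speciesB": {"chr1": 50, "chr2": 70}}
)
```

With four stride-2 stages, a 1024 bp dummy window is reduced to 64 bins, so `out` is a 64 x 64 map per sample. In the real code base the equivalent change would be made to the existing encoder and model classes rather than by writing a new model.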

@hanshandong2024
Author

Thank you for your quick reply. When using the model I trained with my data for other tasks such as Prediction and Editing/Perturbation, do I only need to input sequence data?

Below is the help documentation for the Prediction task.

Usage:
corigami-predict [options] 

Options:
-h --help       Show this screen.
--out           Output path for storing results
--celltype      Sample cell type for prediction, used for output separation
--chr           Chromosome for prediction
--start         Starting point for prediction (width defaults to 2097152 bp which is the input window size)
--model         Path to the model checkpoint
--seq           Path to the folder where the sequence .fa.gz files are stored
--ctcf          Path to the folder where the CTCF ChIP-seq .bw files are stored
--atac          Path to the folder where the ATAC-seq .bw files are stored

@tanjimin
Owner

tanjimin commented Jul 25, 2024

Yes, you only need the sequence data. Also, since you are changing a lot of things, I suggest editing and running the prediction file directly instead of using the CLI.
The file is:

https://github.com/tanjimin/C.Origami/blob/main/src/corigami/inference/prediction.py
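
When running prediction directly with a sequence-only model, the input preparation reduces to loading one window of sequence and encoding it. The sketch below uses only the Python standard library and does not call any corigami function. The helper names (`read_fasta_gz`, `sequence_window`, `one_hot`), the `ACGTN` channel order, and the one-record-per-file layout are assumptions for illustration and should be checked against the repository's own data loading code. The window size is the default stated in the `corigami-predict` help text.

```python
import gzip
import os
import tempfile

WINDOW = 2_097_152  # default input window size from the corigami-predict help text
ALPHABET = "ACGTN"  # assumed channel order; verify against the repository


def read_fasta_gz(path):
    """Return the sequence of a single-record .fa.gz file as one uppercase string."""
    with gzip.open(path, "rt") as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def sequence_window(seq, start, width=WINDOW):
    """Slice [start, start + width) and fail loudly if it runs off the sequence."""
    if start < 0 or start + width > len(seq):
        raise ValueError(f"window {start}+{width} exceeds sequence length {len(seq)}")
    return seq[start:start + width]


def one_hot(seq):
    """One-hot encode a DNA string; unknown symbols fall into the N channel."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq:
        row = [0] * len(ALPHABET)
        row[index.get(base, index["N"])] = 1
        rows.append(row)
    return rows


# Tiny demo with a temporary file and a 4 bp window instead of the full window size.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chr_demo.fa.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(">chr_demo\nACGTN\nacgtx\n")
    seq = read_fasta_gz(path)

window = sequence_window(seq, start=2, width=4)
encoded = one_hot(window)
```

For short genomes, the explicit bounds check matters: a chromosome or scaffold shorter than the window size cannot supply a full input window, so the window size (and the model's downsampling depth) may need to be reduced together.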

@hanshandong2024
Author

Thank you very much for your guidance.
