training data #3

Closed
BenxiaHu opened this issue Jan 14, 2022 · 5 comments

@BenxiaHu

BenxiaHu commented Jan 14, 2022

Hello,
Could you tell me where I can download the training and test data to train the model?
Best,

@michauhl
Collaborator

Hello Benxia,
The training data link is in the paper. I updated the README, so you can find the paper and dataset links in the Citation section. The content.txt file on Zenodo gives more details on the included datasets. Feel free to ask if anything else is unclear.
Best,
Michael

@BenxiaHu
Author

BenxiaHu commented Jan 18, 2022

I am a little confused. Do I just download this file, RNAProt_supplementary_data.zip, which contains many positive and negative FASTA sequences? Which positive and negative sequences should I use?
My query sequences are based on hg19, not hg38. How do I prepare the training and test datasets?

michauhl pinned this issue Feb 18, 2022
michauhl self-assigned this Feb 18, 2022
@michauhl
Collaborator

Hi Benxia, sorry, somehow I did not get a notification that you replied. Were you able to solve your issue? I think it's all well described in the README and the content.txt. The folders in RNAProt_supplementary_data.zip are the input folders (--in FOLDER) for rnaprot train. If you want to use the sequences for training with other tools, just use the positives.fa / negatives.fa files in the folders.
Best,
Michael
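
For anyone using the sequences with other tools, a minimal sketch of loading the two FASTA files into labeled examples might look like the following. It assumes a dataset folder contains positives.fa and negatives.fa as described above; the function names read_fasta and load_labeled_set are hypothetical helpers for illustration and are not part of RNAProt.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a dict mapping sequence ID to sequence."""
    chunks = {}
    seq_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the ID.
                parts = line[1:].split()
                seq_id = parts[0] if parts else ""
                chunks[seq_id] = []
            elif seq_id is not None:
                # Sequences may span several lines; collect the pieces.
                chunks[seq_id].append(line)
    return {sid: "".join(pieces) for sid, pieces in chunks.items()}


def load_labeled_set(folder):
    """Return (id, sequence, label) tuples: label 1 = positive, 0 = negative.

    Assumes `folder` holds positives.fa and negatives.fa (hypothetical helper).
    """
    folder = Path(folder)
    positives = read_fasta(folder / "positives.fa")
    negatives = read_fasta(folder / "negatives.fa")
    examples = [(sid, seq, 1) for sid, seq in positives.items()]
    examples += [(sid, seq, 0) for sid, seq in negatives.items()]
    return examples
```

Calling load_labeled_set on one dataset folder gives a flat list that can be shuffled and split into training and test portions by whatever tool is used downstream.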

@BenxiaHu
Author

Thanks.
My question is that there are many folders in RNAProt_supplementary_data.zip.
[image]
Which folder should I use?

@michauhl
Collaborator

The individual folder contents are described in the "content.txt" file (also on https://zenodo.org/record/5083311). In the paper there were two sets of CLIP datasets. For example, the raw FASTA and BED files for set 1 are in "set1_hg38_bed_fasta" (if you want to use them for other tools), while the RNAProt input folders for the set 1 datasets are in "set1_add_feat_rnaprot_train_in".
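
To get an overview of which folders in the extracted archive hold a positives.fa / negatives.fa pair, a small helper can walk the directory tree. This is only a sketch: it assumes the zip has been extracted to a local directory, and find_dataset_folders is a hypothetical name, not an RNAProt function. It makes no assumption about how deeply the dataset folders are nested.

```python
from pathlib import Path


def find_dataset_folders(root):
    """Return sorted folders under `root` containing both FASTA files.

    Hypothetical helper: a folder qualifies if it holds positives.fa
    and negatives.fa side by side.
    """
    root = Path(root)
    hits = []
    for pos in root.rglob("positives.fa"):
        # Only keep folders where the matching negatives file is present.
        if (pos.parent / "negatives.fa").is_file():
            hits.append(pos.parent)
    return sorted(hits)
```

Passing the extraction directory (for example, a local folder named after the zip) returns the candidate folders, which can then be checked against the descriptions in content.txt.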
