between pops with vcf? #34
Comments
Greetings, Karen! Thanks for your interest in using SNPGenie. I regret that I have not yet programmed a version of SNPGenie that uses the 'pooled' (SNP report/VCF-based) approach for estimating the between-group values—and will likely not get to this for some time. However, I did recently write a Python script (generate_seqs_from_VCF.py) that randomly generates a user-specified number of sequences in a FASTA file, randomly distributing variants specified in a VCF at the appropriate frequencies. Two or more sets of sequences thus generated can then be used as input for the current SNPGenie between-group script. The script,
For the VCF file, the script simply uses the "AF" entry, and it should work as long as AF is present; for example, the following format:
For example, for a reference sequence in a file named reference.fasta, a VCF-format SNP report in a file named variants.vcf, and to create 1,000 sequences, one would run the following:
The script and full documentation have been added to the Evolutionary Bioinformatics Toolkit. Please let me know whether this addresses your request, and if you encounter any problems, as I have not tested it extensively! Best,
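To make the approach described in this reply concrete, here is a minimal Python sketch of distributing VCF variants across generated sequences at their AF frequencies. It is not the actual generate_seqs_from_VCF.py: the function names `parse_af` and `simulate_sequences` are hypothetical, and the sketch assumes a single reference sequence, SNPs only, an `AF=` entry in the INFO column, and that each site is assigned to sequences independently of every other site (so linkage between variants is not preserved).

```python
import random


def parse_af(info):
    """Return the AF values from a VCF INFO string as a list of floats."""
    for field in info.split(";"):
        if field.startswith("AF="):
            return [float(x) for x in field[3:].split(",")]
    raise ValueError("no AF entry in INFO field")


def simulate_sequences(reference, vcf_lines, n_seqs, seed=None):
    """Build n_seqs copies of the reference and place each SNP in a
    random subset of them whose size matches the variant's AF."""
    rng = random.Random(seed)
    seqs = [list(reference) for _ in range(n_seqs)]
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        pos, ref, alts, info = int(cols[1]), cols[3], cols[4].split(","), cols[7]
        if len(ref) != 1 or any(len(a) != 1 for a in alts):
            continue  # assumption: indels are ignored in this sketch
        # Number of sequences that should carry each alternate allele.
        counts = [round(af * n_seqs) for af in parse_af(info)]
        # Pick distinct carrier sequences at random, then hand them out
        # to the alternate alleles in order.
        carriers = rng.sample(range(n_seqs), min(sum(counts), n_seqs))
        start = 0
        for alt, count in zip(alts, counts):
            for i in carriers[start:start + count]:
                seqs[i][pos - 1] = alt  # VCF positions are 1-based
            start += count
    return ["".join(s) for s in seqs]


def to_fasta(seqs, prefix="seq"):
    """Format the sequences as FASTA text with hypothetical record names."""
    return "".join(f">{prefix}_{i + 1}\n{s}\n" for i, s in enumerate(seqs))
```

Running `simulate_sequences` once per population VCF and writing each result with `to_fasta` would give one FASTA file per group, which is the kind of input the between-group script expects.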
Hi Chase, Thanks for the quick reply. Your toolkit has been super useful. I don't actually have pooled sequences; I have separate variant data for each individual, currently all in one VCF, but I can split it by population or individual. I do have the AF information in my VCFs, but would you still recommend this for non-pooled data? My VCF files look like this:
Thanks,
Thanks for these additional details! I am not certain, but I believe you're simply saying that your VCF file summarizes multiple individual sequences, and that the AF value in the INFO column represents a true allele frequency (I think, the allele frequency among all the samples in the columns that come after INFO). If that is true, and the AF represents the allele frequency in a group/sample/population of your organism of interest, then that is all I mean by a "pooled" approach, i.e., the VCF represents multiple individuals. In other words, it doesn't matter whether the sequence data were generated in a pooled fashion or summarized "after the fact" for multiple individual sequences. In summary, if the AF value represents the frequency you want represented in the FASTA, it should work! It's entirely possible I misunderstood, so please let me know! Chase
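One way to check that an AF value really is the allele frequency among the sample columns is to recompute it from the genotypes. The sketch below is only an illustration under assumptions: `alt_allele_frequency` is a hypothetical helper, and it assumes a standard tab-separated VCF record with a FORMAT column containing `GT`, a biallelic site, and missing alleles written as `.`.

```python
def alt_allele_frequency(vcf_line):
    """Fraction of called alleles across all samples that are non-reference."""
    cols = vcf_line.rstrip("\n").split("\t")
    gt_index = cols[8].split(":").index("GT")  # position of GT within FORMAT
    alt_count = 0
    total = 0
    for sample in cols[9:]:  # sample columns follow FORMAT
        gt = sample.split(":")[gt_index]
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # skip missing calls
            total += 1
            if allele != "0":
                alt_count += 1
    return alt_count / total if total else None
```

For six diploid individuals, a fully called site contributes 12 alleles, so the recomputed value can be compared directly with the AF entry in INFO for each population.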
Hi Chase,
Is it possible to run snpgenie_between_group.pl with VCF files rather than FASTA files? I have two populations with variant data for 6 diploids per population and would like to estimate dN/dS between populations. If it cannot be run with VCFs, do you have a tool you recommend for converting diploid VCF data to FASTA format?
Thanks,
Karen