Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

between pops with vcf? #34

Closed
karenbobier opened this issue Jul 13, 2020 · 3 comments
Closed

between pops with vcf? #34

karenbobier opened this issue Jul 13, 2020 · 3 comments

Comments

@karenbobier
Copy link

Hi Chase,

Is it possible to run the snpgenie_between_group.pl with vcf files rather than fasta files? I have two populations with variant data for 6 diploids per population and would like to estimate dn/ds between populations. If if cannot be run with vcfs do you have a tool you recommend for converting diploid vcf data to fasta format?

Thanks,
Karen

@singing-scientist
Copy link
Contributor

Greetings, Karen! Thanks for your interest in using SNPGenie.

I regret that I have not yet programmed a version of SNPGenie that uses the 'pooled' (SNP report/VCF-based) approach for estimating the between-group values—and will likely not get to this for some time.

However, I did recently write a Python script (generate_seqs_from_VCF.py) that randomly generates a user-specified number of sequences in a FASTA file, randomly distributing variants specified in a VCF at the appropriate frequencies. Two or more sets of sequences thus generated can then be used as input for the current SNPGenie between-group script.

The script, generate_seqs_from_VCF.py, requires the Python libraries SeqIO, os, random, re, and sys. It can be run like this:

generate_seqs_from_VCF.py [sequence file].fasta [SNP report].vcf [num sequences]

For the VCF file, the script simply uses the "AF" entry, and it should work as long as AF is present; for example, the following format:

##METADATA ROWS
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
NC_045512.2	643	.	C	A	148	PASS	DP=6102;AF=0.003114;SB=0;DP4=55,6026,0,20
NC_045512.2	689	.	T	C	170	PASS	DP=6604;AF=0.003028;SB=7;DP4=66,6509,1,19
NC_045512.2	716	.	A	G	173	PASS	DP=7833;AF=0.002936;SB=7;DP4=72,7736,1,22
NC_045512.2	769	.	T	C	203	PASS	DP=8319;AF=0.003005;SB=5;DP4=101,8192,1,24
NC_045512.2	778	.	T	C	167	PASS	DP=8091;AF=0.002719;SB=0;DP4=100,7964,0,22
...

For example, for a reference sequence in a file named reference.fasta, a VCF-format SNP report in a file named variants.vcf, and to create 1,000 sequences, one would run the following:

generate_seqs_from_VCF.py reference.fasta variants.vcf 1000

The script and full documentation have been added to Evolutionary Bioinformatics Toolkit. Please let me know if this addresses your request, and if you encounter any problems, as I have not tested this too extensively!

Best,
Chase

@karenbobier
Copy link
Author

Hi Chase,

Thanks for the quick reply. Your toolkit has been super useful.

I don't actually have pooled sequences, I have seperate variant data for each individual, currently all in one vcf but I can split it by population or individual. I do have the AF information in my vcfs but would you still recommend this for non-pooled data?

my vcf files look like this:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ARG10	ARG18	ARG19	ARG21	ARG25	ARG27	ARG29	ARI10	ARI102	ARI11	ARI12	ARI4	ARI6
##contig=<ID=J9347376|m.109621,length=300>
J9347376|m.109621	72	.	G	T	307.47	.	AC=4;AF=0.222;AN=18;BaseQRankSum=0.871;ClippingRankSum=0.00;DP=66;ExcessHet=0.8432;FS=6.239;MLEAC=4;MLEAF=0.222;MQ=58.81;MQRankSum=0.967;QD=20.50;ReadPosRankSum=2.41;SOR=1.304	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,42	0/0:5,0:5:15:0,15,159	0/1:4,8:12:40:213,0,40	1/1:0,3:3:9:128,9,0	0/0:14,0:14:24:0,24,360	0/0:2,0:2:6:0,6,76	0/1:9,1:10:12:12,0,283	./.:1,0:1:.:0,0,0	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:2,0:2:6:0,6,49	0/0:1,0:1:3:0,3,25
J9347376|m.109621	84	.	G	A	43.39	.	AC=1;AF=0.056;AN=18;BaseQRankSum=-9.670e-01;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.056;MQ=60.00;MQRankSum=0.00;QD=14.46;ReadPosRankSum=0.967;SOR=0.223	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,42	0/0:3,0:3:9:0,9,115	0/0:8,0:8:24:0,24,342	0/1:1,2:3:36:75,0,36	0/0:9,0:9:27:0,27,362	0/0:2,0:2:6:0,6,76	0/0:9,0:9:15:0,15,225	0/0:1,0:1:3:0,3,38	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,25	./.:0,0:0:.:0,0,0
J9347376|m.109621	129	.	C	A	230.12	.	AC=2;AF=0.125;AN=16;DP=31;ExcessHet=0.1703;FS=0.000;MLEAC=2;MLEAF=0.125;MQ=60.00;QD=27.00;SOR=1.329	GT:AD:DP:GQ:PGT:PID:PL	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,87	0/0:2,0:2:3:.:.:0,3,45	1/1:0,6:6:18:1|1:129_C_A:270,18,0	0/0:4,0:4:0:.:.:0,0,90	0/0:7,0:7:15:.:.:0,15,225	0/0:2,0:2:6:.:.:0,6,76	0/0:6,0:6:15:.:.:0,15,225	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,72	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621	135	.	C	G	1377.62	.	AC=16;AF=1.00;AN=16;DP=32;ExcessHet=3.0103;FS=0.000;MLEAC=16;MLEAF=1.00;MQ=60.00;QD=30.82;SOR=0.693	GT:AD:DP:GQ:PGT:PID:PL	./.:0,0:0:.:.:.:0,0,0	1/1:0,2:2:6:.:.:87,6,0	1/1:0,2:2:6:.:.:80,6,0	1/1:0,6:6:18:1|1:129_C_A:270,18,0	1/1:0,6:6:21:.:.:295,21,0	1/1:0,7:7:21:.:.:287,21,0	1/1:0,1:1:3:.:.:42,3,0	1/1:0,6:6:18:.:.:253,18,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	1/1:0,2:2:6:.:.:87,6,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621	144	.	G	C	121.76	.	AC=1;AF=0.071;AN=14;BaseQRankSum=-9.210e-01;ClippingRankSum=0.00;DP=30;ExcessHet=3.3579;FS=0.000;MLEAC=1;MLEAF=0.071;MQ=53.67;MQRankSum=0.00;QD=15.22;ReadPosRankSum=0.732;SOR=0.799	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:2,0:2:6:0,6,74	0/0:1,0:1:3:0,3,42	0/0:3,0:3:0:0,0,39	0/1:3,5:8:99:150,0,101	0/0:5,0:5:15:0,15,202	./.:2,0:2:.:0,0,0	0/0:7,0:7:0:0,0,211	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:2,0:2:6:0,6,76	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0
J9347376|m.109621	210	.	G	A	3468.25	.	AC=9;AF=0.500;AN=18;BaseQRankSum=1.22;ClippingRankSum=0.00;DP=199;ExcessHet=7.7512;FS=5.817;MLEAC=11;MLEAF=0.611;MQ=52.40;MQRankSum=-1.360e+00;QD=18.16;ReadPosRankSum=0.203;SOR=0.415	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:3,0:3:0:0,0,28	1/1:1,8:9:17:224,17,0	1/1:3,54:57:99:1469,111,0	0/1:2,11:13:9:330,0,9	0/1:2,7:9:22:221,0,22	./.:1,0:1:.:0,0,0	0/1:30,33:63:99:850,0,787	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:3,0:3:0:0,0,41	0/1:4,4:8:46:46,0,46	0/1:12,20:32:99:373,0,149
J9347376|m.109621	265	.	A	G	41.82	.	AC=1;AF=0.050;AN=20;BaseQRankSum=0.126;ClippingRankSum=0.00;DP=23;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1793;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=8.36;ReadPosRankSum=-5.240e-01;SOR=0.223	GT:AD:DP:GQ:PL	0/0:1,0:1:3:0,3,27	0/0:1,0:1:3:0,3,38	0/0:3,0:3:9:0,9,117	0/0:2,0:2:6:0,6,80	0/1:2,3:5:37:73,0,37	0/0:3,0:3:6:0,6,90	0/0:1,0:1:3:0,3,32	0/0:3,0:3:9:0,9,117	0/0:3,0:3:9:0,9,58	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,27	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0
J9347376|m.109621	279	.	G	C	49.24	.	AC=1;AF=0.050;AN=20;BaseQRankSum=0.967;ClippingRankSum=0.00;DP=24;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1578;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=16.41;ReadPosRankSum=0.967;SOR=1.179	GT:AD:DP:GQ:PGT:PID:PL	0/0:1,0:1:3:.:.:0,3,27	0/0:1,0:1:3:.:.:0,3,38	0/0:3,0:3:9:.:.:0,9,117	0/0:2,0:2:6:.:.:0,6,80	0/0:4,0:4:12:.:.:0,12,158	0/1:1,2:3:75:0|1:279_G_C:81,0,75	0/0:1,0:1:3:.:.:0,3,32	0/0:3,0:3:9:.:.:0,9,117	0/0:4,0:4:12:.:.:0,12,93	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,61	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621	282	.	C	A	49.24	.	AC=1;AF=0.050;AN=20;BaseQRankSum=0.431;ClippingRankSum=0.00;DP=24;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1578;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=16.41;ReadPosRankSum=0.967;SOR=1.179	GT:AD:DP:GQ:PGT:PID:PL	0/0:1,0:1:3:.:.:0,3,27	0/0:1,0:1:3:.:.:0,3,38	0/0:3,0:3:9:.:.:0,9,117	0/0:2,0:2:6:.:.:0,6,80	0/0:4,0:4:12:.:.:0,12,158	0/1:1,2:3:75:0|1:279_G_C:81,0,75	0/0:1,0:1:3:.:.:0,3,32	0/0:3,0:3:9:.:.:0,9,117	0/0:4,0:4:12:.:.:0,12,93	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,61	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0

Thanks,
Karen

@singing-scientist
Copy link
Contributor

Thanks for these additional details! I am not certain, but I believe you're simply saying that your VCF file is summarizing multiple individual sequences, and that the AF value here in the INFO columns represents a true allele frequency (I think, the allele frequency among all the samples in the columns that come after INFO). If that is true, and the AF represents the allele frequency in a group/sample/population of your organism of interest, then that is all I mean by a "pooled" approach, i.e., the VCF represents multiple individuals. In other words, it doesn't matter whether the sequence data were generated in a pooled fashion or summarized "after the fact" for multiple individual sequences. In summary, if the AF value represents the frequency you want represented in the FASTA, it should work!

It's entirely possible I misunderstood — please let me know!

Chase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants