between pops with vcf? #34

karenbobier · 2020-07-13T15:13:55Z

Hi Chase,

Is it possible to run the snpgenie_between_group.pl with vcf files rather than fasta files? I have two populations with variant data for 6 diploids per population and would like to estimate dn/ds between populations. If if cannot be run with vcfs do you have a tool you recommend for converting diploid vcf data to fasta format?

Thanks,
Karen

singing-scientist · 2020-07-14T06:49:50Z

Greetings, Karen! Thanks for your interest in using SNPGenie.

I regret that I have not yet programmed a version of SNPGenie that uses the 'pooled' (SNP report/VCF-based) approach for estimating the between-group values—and will likely not get to this for some time.

However, I did recently write a Python script (generate_seqs_from_VCF.py) that randomly generates a user-specified number of sequences in a FASTA file, randomly distributing variants specified in a VCF at the appropriate frequencies. Two or more sets of sequences thus generated can then be used as input for the current SNPGenie between-group script.

The script, generate_seqs_from_VCF.py, requires the Python libraries SeqIO, os, random, re, and sys. It can be run like this:

generate_seqs_from_VCF.py [sequence file].fasta [SNP report].vcf [num sequences]

For the VCF file, the script simply uses the "AF" entry, and it should work as long as AF is present; for example, the following format:

##METADATA ROWS
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
NC_045512.2	643	.	C	A	148	PASS	DP=6102;AF=0.003114;SB=0;DP4=55,6026,0,20
NC_045512.2	689	.	T	C	170	PASS	DP=6604;AF=0.003028;SB=7;DP4=66,6509,1,19
NC_045512.2	716	.	A	G	173	PASS	DP=7833;AF=0.002936;SB=7;DP4=72,7736,1,22
NC_045512.2	769	.	T	C	203	PASS	DP=8319;AF=0.003005;SB=5;DP4=101,8192,1,24
NC_045512.2	778	.	T	C	167	PASS	DP=8091;AF=0.002719;SB=0;DP4=100,7964,0,22
...

For example, for a reference sequence in a file named reference.fasta, a VCF-format SNP report in a file named variants.vcf, and to create 1,000 sequences, one would run the following:

generate_seqs_from_VCF.py reference.fasta variants.vcf 1000

The script and full documentation have been added to Evolutionary Bioinformatics Toolkit. Please let me know if this addresses your request, and if you encounter any problems, as I have not tested this too extensively!

Best,
Chase

karenbobier · 2020-07-14T15:20:29Z

Hi Chase,

Thanks for the quick reply. Your toolkit has been super useful.

I don't actually have pooled sequences, I have seperate variant data for each individual, currently all in one vcf but I can split it by population or individual. I do have the AF information in my vcfs but would you still recommend this for non-pooled data?

my vcf files look like this:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ARG10	ARG18	ARG19	ARG21	ARG25	ARG27	ARG29	ARI10	ARI102	ARI11	ARI12	ARI4	ARI6
##contig=<ID=J9347376|m.109621,length=300>
J9347376|m.109621	72	.	G	T	307.47	.	AC=4;AF=0.222;AN=18;BaseQRankSum=0.871;ClippingRankSum=0.00;DP=66;ExcessHet=0.8432;FS=6.239;MLEAC=4;MLEAF=0.222;MQ=58.81;MQRankSum=0.967;QD=20.50;ReadPosRankSum=2.41;SOR=1.304	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,42	0/0:5,0:5:15:0,15,159	0/1:4,8:12:40:213,0,40	1/1:0,3:3:9:128,9,0	0/0:14,0:14:24:0,24,360	0/0:2,0:2:6:0,6,76	0/1:9,1:10:12:12,0,283	./.:1,0:1:.:0,0,0	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:2,0:2:6:0,6,49	0/0:1,0:1:3:0,3,25
J9347376|m.109621	84	.	G	A	43.39	.	AC=1;AF=0.056;AN=18;BaseQRankSum=-9.670e-01;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.056;MQ=60.00;MQRankSum=0.00;QD=14.46;ReadPosRankSum=0.967;SOR=0.223	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,42	0/0:3,0:3:9:0,9,115	0/0:8,0:8:24:0,24,342	0/1:1,2:3:36:75,0,36	0/0:9,0:9:27:0,27,362	0/0:2,0:2:6:0,6,76	0/0:9,0:9:15:0,15,225	0/0:1,0:1:3:0,3,38	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,25	./.:0,0:0:.:0,0,0
J9347376|m.109621	129	.	C	A	230.12	.	AC=2;AF=0.125;AN=16;DP=31;ExcessHet=0.1703;FS=0.000;MLEAC=2;MLEAF=0.125;MQ=60.00;QD=27.00;SOR=1.329	GT:AD:DP:GQ:PGT:PID:PL	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,87	0/0:2,0:2:3:.:.:0,3,45	1/1:0,6:6:18:1|1:129_C_A:270,18,0	0/0:4,0:4:0:.:.:0,0,90	0/0:7,0:7:15:.:.:0,15,225	0/0:2,0:2:6:.:.:0,6,76	0/0:6,0:6:15:.:.:0,15,225	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,72	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621	135	.	C	G	1377.62	.	AC=16;AF=1.00;AN=16;DP=32;ExcessHet=3.0103;FS=0.000;MLEAC=16;MLEAF=1.00;MQ=60.00;QD=30.82;SOR=0.693	GT:AD:DP:GQ:PGT:PID:PL	./.:0,0:0:.:.:.:0,0,0	1/1:0,2:2:6:.:.:87,6,0	1/1:0,2:2:6:.:.:80,6,0	1/1:0,6:6:18:1|1:129_C_A:270,18,0	1/1:0,6:6:21:.:.:295,21,0	1/1:0,7:7:21:.:.:287,21,0	1/1:0,1:1:3:.:.:42,3,0	1/1:0,6:6:18:.:.:253,18,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	1/1:0,2:2:6:.:.:87,6,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621	144	.	G	C	121.76	.	AC=1;AF=0.071;AN=14;BaseQRankSum=-9.210e-01;ClippingRankSum=0.00;DP=30;ExcessHet=3.3579;FS=0.000;MLEAC=1;MLEAF=0.071;MQ=53.67;MQRankSum=0.00;QD=15.22;ReadPosRankSum=0.732;SOR=0.799	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:2,0:2:6:0,6,74	0/0:1,0:1:3:0,3,42	0/0:3,0:3:0:0,0,39	0/1:3,5:8:99:150,0,101	0/0:5,0:5:15:0,15,202	./.:2,0:2:.:0,0,0	0/0:7,0:7:0:0,0,211	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:2,0:2:6:0,6,76	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0
J9347376|m.109621	210	.	G	A	3468.25	.	AC=9;AF=0.500;AN=18;BaseQRankSum=1.22;ClippingRankSum=0.00;DP=199;ExcessHet=7.7512;FS=5.817;MLEAC=11;MLEAF=0.611;MQ=52.40;MQRankSum=-1.360e+00;QD=18.16;ReadPosRankSum=0.203;SOR=0.415	GT:AD:DP:GQ:PL	./.:0,0:0:.:0,0,0	0/0:3,0:3:0:0,0,28	1/1:1,8:9:17:224,17,0	1/1:3,54:57:99:1469,111,0	0/1:2,11:13:9:330,0,9	0/1:2,7:9:22:221,0,22	./.:1,0:1:.:0,0,0	0/1:30,33:63:99:850,0,787	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0	0/0:3,0:3:0:0,0,41	0/1:4,4:8:46:46,0,46	0/1:12,20:32:99:373,0,149
J9347376|m.109621	265	.	A	G	41.82	.	AC=1;AF=0.050;AN=20;BaseQRankSum=0.126;ClippingRankSum=0.00;DP=23;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1793;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=8.36;ReadPosRankSum=-5.240e-01;SOR=0.223	GT:AD:DP:GQ:PL	0/0:1,0:1:3:0,3,27	0/0:1,0:1:3:0,3,38	0/0:3,0:3:9:0,9,117	0/0:2,0:2:6:0,6,80	0/1:2,3:5:37:73,0,37	0/0:3,0:3:6:0,6,90	0/0:1,0:1:3:0,3,32	0/0:3,0:3:9:0,9,117	0/0:3,0:3:9:0,9,58	./.:0,0:0:.:0,0,0	0/0:1,0:1:3:0,3,27	./.:0,0:0:.:0,0,0	./.:0,0:0:.:0,0,0
J9347376|m.109621	279	.	G	C	49.24	.	AC=1;AF=0.050;AN=20;BaseQRankSum=0.967;ClippingRankSum=0.00;DP=24;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1578;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=16.41;ReadPosRankSum=0.967;SOR=1.179	GT:AD:DP:GQ:PGT:PID:PL	0/0:1,0:1:3:.:.:0,3,27	0/0:1,0:1:3:.:.:0,3,38	0/0:3,0:3:9:.:.:0,9,117	0/0:2,0:2:6:.:.:0,6,80	0/0:4,0:4:12:.:.:0,12,158	0/1:1,2:3:75:0|1:279_G_C:81,0,75	0/0:1,0:1:3:.:.:0,3,32	0/0:3,0:3:9:.:.:0,9,117	0/0:4,0:4:12:.:.:0,12,93	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,61	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621	282	.	C	A	49.24	.	AC=1;AF=0.050;AN=20;BaseQRankSum=0.431;ClippingRankSum=0.00;DP=24;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1578;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=16.41;ReadPosRankSum=0.967;SOR=1.179	GT:AD:DP:GQ:PGT:PID:PL	0/0:1,0:1:3:.:.:0,3,27	0/0:1,0:1:3:.:.:0,3,38	0/0:3,0:3:9:.:.:0,9,117	0/0:2,0:2:6:.:.:0,6,80	0/0:4,0:4:12:.:.:0,12,158	0/1:1,2:3:75:0|1:279_G_C:81,0,75	0/0:1,0:1:3:.:.:0,3,32	0/0:3,0:3:9:.:.:0,9,117	0/0:4,0:4:12:.:.:0,12,93	./.:0,0:0:.:.:.:0,0,0	0/0:2,0:2:6:.:.:0,6,61	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0

Thanks,
Karen

singing-scientist · 2020-07-15T06:06:01Z

Thanks for these additional details! I am not certain, but I believe you're simply saying that your VCF file is summarizing multiple individual sequences, and that the AF value here in the INFO columns represents a true allele frequency (I think, the allele frequency among all the samples in the columns that come after INFO). If that is true, and the AF represents the allele frequency in a group/sample/population of your organism of interest, then that is all I mean by a "pooled" approach, i.e., the VCF represents multiple individuals. In other words, it doesn't matter whether the sequence data were generated in a pooled fashion or summarized "after the fact" for multiple individual sequences. In summary, if the AF value represents the frequency you want represented in the FASTA, it should work!

It's entirely possible I misunderstood — please let me know!

Chase

singing-scientist closed this as completed Aug 7, 2020

andreaswallberg mentioned this issue Aug 16, 2020

Advice on multi-sample, multi-species, multi-contig analyses for both πN/πS and dN/dS #39

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

between pops with vcf? #34

between pops with vcf? #34

karenbobier commented Jul 13, 2020

singing-scientist commented Jul 14, 2020

karenbobier commented Jul 14, 2020

singing-scientist commented Jul 15, 2020

between pops with vcf? #34

between pops with vcf? #34

Comments

karenbobier commented Jul 13, 2020

singing-scientist commented Jul 14, 2020

karenbobier commented Jul 14, 2020

singing-scientist commented Jul 15, 2020