Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to properly format VCF for snpgenie.pl (when using VCFs from multiple isolates of a population) #35

Closed
maxgmarin opened this issue Jul 15, 2020 · 7 comments

Comments

@maxgmarin
Copy link

maxgmarin commented Jul 15, 2020

Hello!

First I want to say thank you for maintaining this awesome software :)

I have a question regarding the input VCF format when using snpgenie.pl.

For context, I am interested in analyzing gene diversity across a set of whole genome bacterial assemblies I have generated with PacBio (1 contig and circular).

I have aligned each assembly to the reference genome and inferred SNPs using Minimap2. This results in a single VCF for each sample. I have 31 individual VCFs, which I have also merged using bcftools. I am interested in estimating πN/πS, dN/dS, and gene diversity across this population of 31 assemblies.

My first attempt at using snpgenie.pl was to simply merge all 31 VCFs to a multi-sample VCF.

The multisample VCF is formatted as followed:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	M0011368_9	M0014888_3	
NC_000962.3	309	.	G	C	60	.	QNAME=contig_1;QSTART=309;QSTRAND=+	GT:AD:DP	1/1:0,1:1	0/0:1,0:1
NC_000962.3	490	.	G	A	60	.	QNAME=contig_1;QSTART=490;QSTRAND=+	GT:AD:DP	0/0:1,0:1	0/0:1,0:1
NC_000962.3	558	.	G	T	60	.	QNAME=contig_1;QSTART=558;QSTRAND=+	GT:AD:DP	0/0:1,0:1	0/0:1,0:1
NC_000962.3	1088	.	G	A	60	.	QNAME=contig_1;QSTART=1088;QSTRAND=+	GT:AD:DP	0/0:1,0:1	0/0:1,0:1

NOTE 1: For the purpose of this example I left only 2 samples within the VCF
NOTE 2: I encoded each sample as having an AD that has one count either towards the reference or alternative allele(s). Thus the depth (DP) of each sample is also 1.

I ran snpgenie.pl with the following command

snpgenie.pl --fastafile=./GCF_000195955.2_ASM19595v2_genomic.fasta --gtffile=./GCF_000195955.2_ASM19595v2_genomic.gtf --snpreport=./PMP_31CI_TestVCF_pyvcf.vcf --vcfformat=4 --outdir=./OutDir

As SNPgenie ran (and when it finished) I noticed that it evaluating each sample in the merged VCF individually. Producing files such as
temp_vcf4_M0011368_9.vcf and temp_vcf4_M0014888_3.vcf

In the population_summary.txt you can see that the stats were calculated for each sample individually.

file	sites	sites_coding	sites_noncoding	pi	pi_coding	pi_noncoding	N_sites	S_sites	piN	piS	mean_dN_vs_ref	mean_dS_vs_ref	mean_gdiv_polymorphic	mean_N_gdiv	mean_S_gdiv	mean_gdiv	sites_polymorphic	mean_gdiv_coding_poly	sites_coding_poly	mean_gdiv_noncoding_poly	sites_noncoding_poly
temp_vcf4_M0011368_9.vcf	4411532	3947191	464341	0	0	0	1412442.16666667	516764.833333334	0	0	2.54877692337374e-05	4.45076725744689e-05	0	*	*	0	138	0	114	0	24
temp_vcf4_M0014888_3.vcf	4411532	3947191	464341	0	0	0	1412444.83333333	516762.166666667	0	0	2.97356746322482e-05	5.22484069879986e-05	0	*	*	0	173	0	146	0	27

I now realize that I was misunderstanding the needed format of the input VCF. The pi is 0 for these two samples, which is unsurprising if each sample is evaluated individually.

So my question is, what is the most appropriate way to evaluate gene diversity across my "population" of 27 isolates? (When I have individual VCFs for each isolate)

Should I format a VCF which has a single "sample" that represents the population of 27 isolates?

Quick example of this idea below:

  • DP always is 27
  • AD corresponds to the number of REF vs ALT alleles across 27 samples.
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	MyPopulationSample
NC_000962.3	309	.	G	C	60	.	QNAME=contig_1;QSTART=309;QSTRAND=+	GT:AD:DP	1/1:3,24:27	
NC_000962.3	490	.	G	A	60	.	QNAME=contig_1;QSTART=490;QSTRAND=+	GT:AD:DP	1/1:2,25:27
NC_000962.3	558	.	G	T	60	.	QNAME=contig_1;QSTART=558;QSTRAND=+	GT:AD:DP	1/1:22,5:27	

Also, is SNPgenie even the appropriate tool for this? Is there something inappropriate in how I am thinking about this analysis?

Thank you so much!
Max

@singing-scientist
Copy link
Contributor

Greetings, Max! Thanks a lot for using SNPGenie. Thanks too for the clear description — I think I see what is going on. A major point to make is that any allele frequency is interpreted as an estimated allele frequency within a population being sampled. Thus, by using SNPGenie's VCF format 4, you are analyzing the data as if each of the 31 samples (columns after INFO) is a separate pooled (e.g., deeply sequenced) sample from a separate population, when in fact I think each is just a sequenced individual from the same population. Indeed, I now see you seem to say this later on.

What is needed is a VCF that in some way provides an estimate of allele frequencies in the sample/population of interest. Perhaps the easiest is summarizing your 31 (or 27?) individuals in a single VCF file, analyzing using say SNPGenie's VCF format #2, where DP would be 31 (or 27) and the AF would be the proportion of isolates with the ALT allele. Does this make sense?

Let me know!
Chase

@maxgmarin
Copy link
Author

Hello Chase,

Thank you so much for the clear and quick response! I was able to reformat the VCF and successfully run SNPgenie. Also, to clarify there are a total of 31 samples, sorry for the confusion.

For the future reference of others, I took my multisample VCF and used the pyVCF library to reformat each record such that there was a single sample which represented the entire population of samples.

An example of this format can be seen below.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	EntirePopulation_N31
NC_000962.3	309	.	G	C	60	.	QNAME=contig_1	AD:DP	30,1:31
NC_000962.3	490	.	G	A	60	.	QNAME=contig_1	AD:DP	30,1:31
NC_000962.3	558	.	G	T	60	.	QNAME=contig_1	AD:DP	30,1:31
NC_000962.3	1849	.	C	A	60	.	QNAME=contig_1	AD:DP	25,6:31
NC_000962.3	1977	.	A	G	60	.	QNAME=contig_1	AD:DP	0,31:31
NC_000962.3	2532	.	T	C	60	.	QNAME=contig_1	AD:DP	23,8:31

NOTE: AD represents the # of occurrences of the REF and ALT allele(s). DP is simply the total samples present and the sum of all AD values.

Thank you,
Max

@singing-scientist
Copy link
Contributor

This is great news! Looks exactly right. Thanks too for sharing your approach for dealing with this VCF conversion — I am sure this will prove valuable to myself and others! Don't hesitate to let me know if anything else comes up.

With gratitude,
Chase

@GuidoGallo
Copy link

Hi Chase and Max,
first of all thanks in advance for the attention and thanks Chase for developing such an useful tool. I’m really sorry for commenting on this closed issue but I think I have a similar issue as the one Max was having and I wanted to kindly ask for a clarification. I am working on 7 vcf files (VCFv4.2, obtained with longranger pipeline, relying on gatk) from 7 bird samples from two different populations, that are characterized by different environmental conditions. We want to evaluate differences in dn/ds ratios along the genome comparing the two different groups. So if I understand correctly the way to go would be to reformat the vcf files and merge them within the two groups, ending up with two vcf files each one representing all individuals of one population, and then run the within-pool analysis for each of the two vcf files. Am I correct? If so, I wanted to ask Max if he could provide more details on how to do that (vcf merging to generate vcf representing the entire population) using pyVCF library.
Many thanks in advance to both of you for your help :)

@singing-scientist
Copy link
Contributor

Hi @GuidoGallo! Thanks for asking about this.

As far as I can tell, the answer really depends on what hypothesis you're addressing. First, are you interested in dN/dS ratios between different samples (i.e. between-host diversity), or within a sample (i.e. within-host diversity)? If between hosts, then you would need to decide whether to use a consensus-level approach (i.e. dN/dS ratios between consensi) or a group-based all-vs-all approach (i.e. mean dN/dS for all pairs of within-sample alleles between the two samples — seldom done, but more accurate). If within hosts, you probably want to keep the deep sequencing data for each sample and do the mean pairwise dN/dS within each sample, i.e. piN/piS. So I'd say the first thing to do is define your hypothesis very explicitly, decide which approach you think will address it best, and try that first.

Prima facie, I'm not sure whether merging makes sense here. For example, suppose 3 of your samples come from environment A. Would you want to merge the reads/variants of those three samples? If so, why, and what would be the biological interpretation of the results? Again, I think clarifying whether you're asking a between- or within-host question needs to be decided first.

Please feel free to provide more specifics, and happy to provide more thoughts!

Chase

@GuidoGallo
Copy link

Dear Chase, thank you very much for your time and for your clarifications.
Thinking about it, I think you're right, it wouldn't make a lot of sense here to merge vcf files.
We have two groups of bird samples (exposed to different environmental conditions) and we would like to see if there might be
a difference in the accumulation of nonsynonymous mutations in one group with respect to the other one (and possibly also which are the genomic regions in which this happens). So also according to what you said I think the mean dN/dS for all pairs between the two groups should be the way to go. Or do you recommend another approach? If I understood correctly from the documentation, the between-group analysis with multi-sample vcf files is not supported so I guess we should have to run the between-group script using all possible pairwise combinations between the two groups (is this doable with vcf files?)
Many thanks again for your time and precious advices

Guido

@singing-scientist
Copy link
Contributor

Greetings! It seems like you might be asking whether nonsynonymous variation accumulates more in one environment than another (controlling for synonymous variation)? One approach would be to treat every sample/codon as an observation. Each codon in each sample has a piN and piS value associated with it. You can take the mean across all samples for each codon. Thus, for each environment, each codon has a mean piN and piS. I'd probably actually do mean N_diffs, N_sites, S_diffs, and S_sites across the samples, then calculate piN and piS from those. It's not super straightforward and a lot can go wrong, but it's hard to use a sample-based approach given so few samples, i.e., you probably need a codon-based approach where the codon is treated as the observational unit, and statistical significance is assessed by bootstrapping codons. See e.g. the methods in this paper https://academic.oup.com/ve/article/7/1/veab041/6248115

Chase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants