VCF: Clarify GL Genotype Field Ordering #58

Closed
adamnovak opened this issue Dec 10, 2014 · 2 comments

@adamnovak

The description of the GL field says that:

If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.

It is unclear to me how one is supposed to interpret this equation as defining an ordering of anything. What do j and k represent? Why are we dividing one by the other? What does the value returned by the F function represent?

I think what is happening is this: j is the index of the first allele of a diploid pair (with 0 for the REF allele), k is the index of the second, the "/" denotes that the argument of the function is the unphased genotype composed of those two alleles, and the result of the function is the index in the GL array at which the likelihood of that genotype is found. If that is the case, this should be described more clearly in the spec. If that is not the case, this should definitely be described more clearly in the spec.
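
A minimal sketch of that reading, assuming j and k are 0-based allele indices (0 for REF) and that the genotype is unphased, so the smaller index is always treated as j. The function name `gl_index` is hypothetical and is not taken from the spec or from any library.

```python
def gl_index(j: int, k: int) -> int:
    """Position in the GL array of the unphased diploid genotype j/k.

    Assumes 0-based allele indices with 0 = REF, per the reading above.
    """
    # The genotype is unphased, so 1/0 and 0/1 are the same genotype:
    # normalise so that j <= k before applying the formula.
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


# Enumerate every genotype of a triallelic site and sort by GL index.
alleles = "ABC"  # A = REF, B and C = ALT alleles, as in the spec's wording
pairs = [(j, k) for k in range(len(alleles)) for j in range(k + 1)]
pairs.sort(key=lambda g: gl_index(*g))
labels = [alleles[j] + alleles[k] for j, k in pairs]
print(",".join(labels))
```

Under these assumptions the indices for a triallelic site run from 0 to 5 with no gaps, which is what one would expect if F really is an array position.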

Furthermore, for triploid or higher sites, the spec merely says genotype likelihoods should appear in "the canonical order". What order is that, exactly?

@pd3
Member

pd3 commented Dec 11, 2014

You are probably right; it should be described more clearly. The idea is actually quite simple, and the examples that follow immediately after the sentence should clear up any doubts for most readers: "In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

The generalization of the likelihood ordering can be described by the following nested loops, where P is the ploidy and N is the number of alleles:

for a1=0..N
    for a2=0..a1
        ...
            for aP=0..a(P-1)
                print a1/a2/../aP
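
The nested loops can be sketched in runnable form with recursion standing in for the P levels of nesting. This is an illustration under two assumptions that the comment does not state: N counts all alleles including REF, so allele indices run from 0 to N-1, and each genotype is written with its alleles in ascending order (the reverse of the loop nesting), which is the same unphased genotype. The function name `genotype_order` is hypothetical.

```python
def genotype_order(ploidy: int, n_alleles: int) -> list[tuple[int, ...]]:
    """List unphased genotypes in the order produced by the nested loops.

    Assumes allele indices 0..n_alleles-1, with 0 = REF.
    """
    genotypes: list[tuple[int, ...]] = []

    def recurse(remaining: int, max_allele: int, suffix: tuple[int, ...]) -> None:
        # One recursion level corresponds to one "for" loop: the outermost
        # loop picks the largest allele, and each inner loop is bounded by
        # the value chosen by the loop enclosing it.
        if remaining == 0:
            genotypes.append(suffix)
            return
        for a in range(max_allele + 1):
            # Prepend so that the finished tuple is in ascending order.
            recurse(remaining - 1, a, (a,) + suffix)

    recurse(ploidy, n_alleles - 1, ())
    return genotypes


diploid = genotype_order(2, 3)
triploid = genotype_order(3, 2)
print(diploid)
print(triploid)
```

For ploidy 2 the position of each genotype (j, k) in the returned list equals k*(k+1)/2 + j, so under these assumptions the sketch agrees with the diploid formula quoted from the spec.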

@atks

atks commented Feb 5, 2015

The wiki page below explains the ordering of genotypes for a given ploidy and number of alleles in the general case.

http://genome.sph.umich.edu/wiki/Relationship_between_Ploidy,_Alleles_and_Genotypes
