Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genotype ordering (issue #152) #83

Closed
wants to merge 1 commit into from
Closed

Conversation

DSLituiev
Copy link

a fix of the specification

@pd3
Copy link
Member

pd3 commented May 19, 2015

Hi,
thank you for this and sorry for the delay in responding.

There was a change in similar spirit some time ago (66d45d5). If that already does not explain the ordering sufficiently, can you please make the pull request against that? However, I do hope that the version on the VCFv4.3 is clear enough already.

For reference, this has been discussed also here #58.

@atks
Copy link

atks commented May 19, 2015

I think the idea here is to be able to compute the location of a genotype likelihood in an array, so that one may randomly access it. It's not just about enumerating the genotypes. This is useful in the case when one is interested in extracting a particular genotype likelihood of interest for example when a multiallelic variant becomes a biallelic variant in a subset of the sample or when one is interested in decomposing the records to biallelic variants. We may simply enumerate the genotypes, but then we will also need to create a hash to move from genotype to index. Having a closed form from genotype to index helps here.

@pd3
Copy link
Member

pd3 commented May 19, 2015

OK, I am all for it. Can it be made more compact though and the pull request made against the existing text? It is unnecessary to introduce Pochhammer symbol for example and list 0-28 examples...?

@DSLituiev
Copy link
Author

Hi,
This amendment has been earlier discussed with Eric Garrison here:
freebayes/freebayes#152

To repeat: apart from general description, the fix you mention provides mapping from genotype index to allele set, mine vice versa. Also I believe the order of loops should be reverse to the mentioned fix; +some legibility:

\begin{lstlisting}
    for $a_P = 0\ldots N$
        for $a_{P-1} = 0\ldots a_P$
            $\ldots$
                for $a_2 = 0\ldots a_{3}$
                    println $a_1 a_2  \ldots  a_P$ ,
\end{lstlisting}

Yes, you definitely can reduce the number of examples. The Pochhammer symbol does help, as the implementation much more efficient (or only feasible) through the upper factorial, not through division of two factorials.

@atks
Copy link

atks commented May 19, 2015

@DSLituiev

Just to double check, I think the formula should be:

\text{Ordering}\Big(k_1/k_2/.../k_n\Big) & = \sum_{m=1}^{n} \binom{m+k_m - 1}{m}

instead of

\text{Ordering}\Big(k_1/k_2/.../k_n\Big) & = \sum_{m=1}^{n} \binom{k_m - 1}{m}

The Pochhammer symbol is helpful as an implementation detail but in terms of understanding, I think the literals should remain as binomial coefficients as they are more well known.

@pd3
Copy link
Member

pd3 commented May 21, 2015

OK, let me summarize the current proposal. There should be: 1) a pseudocode in some form to explain the order of genotypes, 2) a formula to explain how to calculate the index, 3) some examples.

I've listed the options below, also including a variant from Bob Handsaker who communicated with me offline.

Pseudocode proposals

  1. variable loops already present in the VCFv4.3 draft
  2. variable loops already present but modified so that the ordering of genotypes is 00,01,11 rather than 00,10,11
  3. recursive pseudocode
     GLOrdering (P, N, suffix=""):
           for a in 0..N:
                if (P == 1) println str(a) + suffix
                if (P > 1) GLOrdering(P-1, a, str(a) + suffix)

Index calculation

 Ordering(k_1/k_2/.../k_n) = \sum_{m=1}^{n} \binom{m+k_m - 1}{m}

Examples

 for P=2 and N=1, the ordering is 00,01,11
 for P=2 and N=2, the ordering is 00,01,11,02,12,22
 for P=3 and N=2, the ordering is 000,001,011,111,002,012,112,022,122,222

I guess the only remaining question is whether to go with the pseudocode 1, 2, 3, or a combination of 1+3 or 2+3.

@atks
Copy link

atks commented May 21, 2015

+1 for pseudocode 2+3, the original variable loops description is very easy to understand, the recursive pseudocode shows how it is implemented. The recursive pseudocode takes in P and N where P is the ploidy and N is the number of alleles minus 1. It might be more natural to use the number of alleles directly.

+1 for index calculation, there should be an emphasis that it works only if k_1<=k_2<=..<=k_n. Notation between pseudocode for enumeration and index computation should be consistent, for example k_1/.../k_n can be k_1/...k_P instead. Also, it might be a good idea to add the formula for the number of genotypes expected given ploidy and number of alleles.

+1 for examples, but can you add the index formulae for those simple cases too? In particular for P=1 and P=2 and general N.

@DSLituiev
Copy link
Author

@atks, you are right about the binomial formula / index calculation.

I agree: it is good to add sort function in the pseudocode to emphasize that k_1<=k_2<=..<=k_n.

A Python code for input (N, P) instead of (N-1,P) is following:

def recursive_GL_order(number_of_alleles, ploidy, suffix = ''):
    for a_p in range(0, number_of_alleles):    
        if ploidy > 1:
            recursive_GL_order(a_p + 1, ploidy - 1, '%u ' % a_p + suffix)
        else:
            print(a_p, suffix)

Pseudocode:

 GLOrdering (P, N, suffix=""):
       for a in 0..N:
            if (P == 1) println str(a) + suffix
            if (P > 1) GLOrdering(P-1, a + 1, str(a) + suffix)

@pd3
Copy link
Member

pd3 commented Feb 7, 2017

This has been addressed in commits referenced above, the pull request can be now closed

@pd3 pd3 closed this Feb 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants