genotype ordering (issue #152) #83

DSLituiev · 2015-04-28T16:25:32Z

a fix of the specification

pd3 · 2015-05-19T12:34:06Z

Hi,
thank you for this and sorry for the delay in responding.

There was a change in similar spirit some time ago (66d45d5). If that already does not explain the ordering sufficiently, can you please make the pull request against that? However, I do hope that the version on the VCFv4.3 is clear enough already.

For reference, this has been discussed also here #58.

atks · 2015-05-19T14:16:19Z

I think the idea here is to be able to compute the location of a genotype likelihood in an array, so that one may randomly access it. It's not just about enumerating the genotypes. This is useful in the case when one is interested in extracting a particular genotype likelihood of interest for example when a multiallelic variant becomes a biallelic variant in a subset of the sample or when one is interested in decomposing the records to biallelic variants. We may simply enumerate the genotypes, but then we will also need to create a hash to move from genotype to index. Having a closed form from genotype to index helps here.

pd3 · 2015-05-19T14:24:49Z

OK, I am all for it. Can it be made more compact though and the pull request made against the existing text? It is unnecessary to introduce Pochhammer symbol for example and list 0-28 examples...?

DSLituiev · 2015-05-19T20:36:51Z

Hi,
This amendment has been earlier discussed with Eric Garrison here:
freebayes/freebayes#152

To repeat: apart from general description, the fix you mention provides mapping from genotype index to allele set, mine vice versa. Also I believe the order of loops should be reverse to the mentioned fix; +some legibility:

\begin{lstlisting}
    for $a_P = 0\ldots N$
        for $a_{P-1} = 0\ldots a_P$
            $\ldots$
                for $a_2 = 0\ldots a_{3}$
                    println $a_1 a_2  \ldots  a_P$ ,
\end{lstlisting}

Yes, you definitely can reduce the number of examples. The Pochhammer symbol does help, as the implementation much more efficient (or only feasible) through the upper factorial, not through division of two factorials.

atks · 2015-05-19T20:48:14Z

@DSLituiev

Just to double check, I think the formula should be:

\text{Ordering}\Big(k_1/k_2/.../k_n\Big) & = \sum_{m=1}^{n} \binom{m+k_m - 1}{m}

instead of

\text{Ordering}\Big(k_1/k_2/.../k_n\Big) & = \sum_{m=1}^{n} \binom{k_m - 1}{m}

The Pochhammer symbol is helpful as an implementation detail but in terms of understanding, I think the literals should remain as binomial coefficients as they are more well known.

pd3 · 2015-05-21T08:45:59Z

OK, let me summarize the current proposal. There should be: 1) a pseudocode in some form to explain the order of genotypes, 2) a formula to explain how to calculate the index, 3) some examples.

I've listed the options below, also including a variant from Bob Handsaker who communicated with me offline.

Pseudocode proposals

variable loops already present in the VCFv4.3 draft
variable loops already present but modified so that the ordering of genotypes is 00,01,11 rather than 00,10,11
recursive pseudocode

     GLOrdering (P, N, suffix=""):
           for a in 0..N:
                if (P == 1) println str(a) + suffix
                if (P > 1) GLOrdering(P-1, a, str(a) + suffix)

Index calculation

 Ordering(k_1/k_2/.../k_n) = \sum_{m=1}^{n} \binom{m+k_m - 1}{m}

Examples

 for P=2 and N=1, the ordering is 00,01,11
 for P=2 and N=2, the ordering is 00,01,11,02,12,22
 for P=3 and N=2, the ordering is 000,001,011,111,002,012,112,022,122,222

I guess the only remaining question is whether to go with the pseudocode 1, 2, 3, or a combination of 1+3 or 2+3.

atks · 2015-05-21T12:48:08Z

+1 for pseudocode 2+3, the original variable loops description is very easy to understand, the recursive pseudocode shows how it is implemented. The recursive pseudocode takes in P and N where P is the ploidy and N is the number of alleles minus 1. It might be more natural to use the number of alleles directly.

+1 for index calculation, there should be an emphasis that it works only if k_1<=k_2<=..<=k_n. Notation between pseudocode for enumeration and index computation should be consistent, for example k_1/.../k_n can be k_1/...k_P instead. Also, it might be a good idea to add the formula for the number of genotypes expected given ploidy and number of alleles.

+1 for examples, but can you add the index formulae for those simple cases too? In particular for P=1 and P=2 and general N.

DSLituiev · 2015-06-02T10:59:47Z

@atks, you are right about the binomial formula / index calculation.

I agree: it is good to add sort function in the pseudocode to emphasize that k_1<=k_2<=..<=k_n.

A Python code for input (N, P) instead of (N-1,P) is following:

def recursive_GL_order(number_of_alleles, ploidy, suffix = ''):
    for a_p in range(0, number_of_alleles):    
        if ploidy > 1:
            recursive_GL_order(a_p + 1, ploidy - 1, '%u ' % a_p + suffix)
        else:
            print(a_p, suffix)

Pseudocode:

 GLOrdering (P, N, suffix=""):
       for a in 0..N:
            if (P == 1) println str(a) + suffix
            if (P > 1) GLOrdering(P-1, a + 1, str(a) + suffix)

pd3 · 2017-02-07T07:41:48Z

This has been addressed in commits referenced above, the pull request can be now closed

genotype ordering

dbb02fd

DSLituiev mentioned this pull request May 8, 2015

GL output for polyploids freebayes/freebayes#152

Closed

atks referenced this pull request Jun 2, 2015

Clarified GL ordering. Resolves #58

66d45d5

pd3 added a commit that referenced this pull request Jun 3, 2015

GL ordering made clearer as discussed in #83

d6b46df

pd3 mentioned this pull request Jun 3, 2015

VCFv4.3 - first batch of changes #88

Merged

pd3 closed this Feb 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genotype ordering (issue #152) #83

genotype ordering (issue #152) #83

DSLituiev commented Apr 28, 2015

pd3 commented May 19, 2015

atks commented May 19, 2015

pd3 commented May 19, 2015

DSLituiev commented May 19, 2015

atks commented May 19, 2015

pd3 commented May 21, 2015

atks commented May 21, 2015

DSLituiev commented Jun 2, 2015

pd3 commented Feb 7, 2017

genotype ordering (issue #152) #83

genotype ordering (issue #152) #83

Conversation

DSLituiev commented Apr 28, 2015

pd3 commented May 19, 2015

atks commented May 19, 2015

pd3 commented May 19, 2015

DSLituiev commented May 19, 2015

atks commented May 19, 2015

pd3 commented May 21, 2015

atks commented May 21, 2015

DSLituiev commented Jun 2, 2015

pd3 commented Feb 7, 2017