-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
genotype ordering (issue #152) #83
Conversation
Hi, There was a change in similar spirit some time ago (66d45d5). If that already does not explain the ordering sufficiently, can you please make the pull request against that? However, I do hope that the version on the VCFv4.3 is clear enough already. For reference, this has been discussed also here #58. |
I think the idea here is to be able to compute the location of a genotype likelihood in an array, so that one may randomly access it. It's not just about enumerating the genotypes. This is useful in the case when one is interested in extracting a particular genotype likelihood of interest for example when a multiallelic variant becomes a biallelic variant in a subset of the sample or when one is interested in decomposing the records to biallelic variants. We may simply enumerate the genotypes, but then we will also need to create a hash to move from genotype to index. Having a closed form from genotype to index helps here. |
OK, I am all for it. Can it be made more compact though and the pull request made against the existing text? It is unnecessary to introduce Pochhammer symbol for example and list 0-28 examples...? |
Hi, To repeat: apart from general description, the fix you mention provides mapping from genotype index to allele set, mine vice versa. Also I believe the order of loops should be reverse to the mentioned fix; +some legibility:
Yes, you definitely can reduce the number of examples. The Pochhammer symbol does help, as the implementation much more efficient (or only feasible) through the upper factorial, not through division of two factorials. |
Just to double check, I think the formula should be: \text{Ordering}\Big(k_1/k_2/.../k_n\Big) & = \sum_{m=1}^{n} \binom{m+k_m - 1}{m} instead of \text{Ordering}\Big(k_1/k_2/.../k_n\Big) & = \sum_{m=1}^{n} \binom{k_m - 1}{m} The Pochhammer symbol is helpful as an implementation detail but in terms of understanding, I think the literals should remain as binomial coefficients as they are more well known. |
OK, let me summarize the current proposal. There should be: 1) a pseudocode in some form to explain the order of genotypes, 2) a formula to explain how to calculate the index, 3) some examples. I've listed the options below, also including a variant from Bob Handsaker who communicated with me offline. Pseudocode proposals
Index calculation
Examples
I guess the only remaining question is whether to go with the pseudocode 1, 2, 3, or a combination of 1+3 or 2+3. |
+1 for pseudocode 2+3, the original variable loops description is very easy to understand, the recursive pseudocode shows how it is implemented. The recursive pseudocode takes in P and N where P is the ploidy and N is the number of alleles minus 1. It might be more natural to use the number of alleles directly. +1 for index calculation, there should be an emphasis that it works only if k_1<=k_2<=..<=k_n. Notation between pseudocode for enumeration and index computation should be consistent, for example k_1/.../k_n can be k_1/...k_P instead. Also, it might be a good idea to add the formula for the number of genotypes expected given ploidy and number of alleles. +1 for examples, but can you add the index formulae for those simple cases too? In particular for P=1 and P=2 and general N. |
@atks, you are right about the binomial formula / index calculation. I agree: it is good to add A Python code for input
Pseudocode:
|
This has been addressed in commits referenced above, the pull request can be now closed |
a fix of the specification