You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The probabilistic sampling calculation currently takes the number of sequences available per group as input.
This information doesn't seem to be necessary. The output is based on the number of groups rather than the size of each group. Example:
fromaugur.filter.subsampleimport_calculate_fractional_sequences_per_group_calculate_fractional_sequences_per_group(3, [1,2,3,4])
_calculate_fractional_sequences_per_group(3, [1,1,1,1])
_calculate_fractional_sequences_per_group(3, [100,100,100,100])
_calculate_fractional_sequences_per_group(3, [1,1,1,1000])
# All of the above return 0.726570078125
I think the reason is that one can't just pick desired_total/groups because some groups might have less than that number. So we wouldn't end up with the right desired total.
There's a reason the original calculation is the way it is. I don't think that's a bug or unnecessary.
Imagine there are 90 groups with 1 sequence and 10 groups with 1000. We want to sample 1000 sequences.
The original calculation would pick around 91 sequences per group.
Yours now picks 10 per group.
The original resulted in around 1000 sampled sequences.
Your calculation now in only 190.
That scenario does not result in probabilistic sampling and will not fall through this code path. It will only come here if the number of groups exceeds the desired total sequences.
Your example has 100 groups so this would only happen when sampling less than 100 sequences. At that point the number of sequences available in each group does not matter and either 0 or 1 sequences should be picked per group.
Context
The probabilistic sampling calculation currently takes the number of sequences available per group as input.
This information doesn't seem to be necessary. The output is based on the number of groups rather than the size of each group. Example:
Taking a closer look at implementation:
augur/augur/filter/subsample.py
Line 474 in d8faf01
This seems to be an overly complex way of approximating the fraction$\frac{\text{max sequences}}{\text{number of groups}}$ . When it is changed to
, the return value of the commands above is$0.7499999898397923 \approx \frac{3}{4}$ .
Impact
The deviation can lead to assertion errors here
augur/augur/filter/subsample.py
Line 285 in da1c89d
when
--subsample-max-sequences
is slightly lower than the number of groups such as #1598:Proposal
Replace
_calculate_fractional_sequences_per_group()
with the exact fractiontarget_max_value / len(group_sizes)
.The text was updated successfully, but these errors were encountered: