Ideas
Generate a list of integers (read as binary) that represent the presence of every call (the union) across all VCF files.
For example, if we have 3 VCF files and a call is present only in the first VCF file, the integer representing this call is 0b100. A 1 at each bit indicates the presence of that call in the corresponding file. If the call is in all 3 VCFs, the number would be 0b111. These numbers are also the "group"s in the breakdown of the VCFs' consistency.
With this, it's very easy to get:
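A minimal sketch of the encoding described above, not the PR's actual code. It assumes each VCF has already been reduced to a set of hashable call keys. The names `presence_masks`, `vcf_calls`, and the `(chrom, pos, ref, alt)` key layout are hypothetical and chosen only for illustration.

```python
from collections import Counter


def presence_masks(call_sets):
    """Map each call to an integer whose bits mark the files containing it.

    The first file maps to the most significant bit, so with 3 files a call
    found only in the first file gets 0b100.
    """
    n = len(call_sets)
    masks = {}
    for i, calls in enumerate(call_sets):
        bit = 1 << (n - 1 - i)  # bit position for the i-th file
        for call in calls:
            masks[call] = masks.get(call, 0) | bit
    return masks


# Hypothetical call keys (chrom, pos, ref, alt) for 3 VCF files.
vcf_calls = [
    {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")},
    {("chr1", 200, "G", "C")},
    {("chr1", 200, "G", "C"), ("chr2", 50, "T", "G")},
]

masks = presence_masks(vcf_calls)

# Each distinct mask value is one "group" in the consistency breakdown.
groups = Counter(masks.values())
for mask, count in sorted(groups.items(), reverse=True):
    print(f"{mask:03b}: {count} call(s)")
```

Because every call is reduced to a single integer, grouping becomes a plain count over integers rather than repeated set comparisons between files.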
Test
Tested with some VCFs: all numbers are the same before and after the PR. Only the order of some "group"s in the breakdown is slightly different.
Resource consumption
With ~200 calls in 18 VCFs:
With the version prior to this PR:
With this PR:
The PR reduces the time spent from 4.7767 sec to 0.1703 sec. Also notice the memory increment at line 101: this PR adds almost no additional memory usage, whereas the increment was 10.3 MiB previously.