Finer control of --regions vs --targets overlap #1327
This addresses a long-standing design flaw in the handling of regions and targets. BCFtools (and HTSlib) recognize two sets of options for restricting VCF/BCF files by region: one for streaming (`-t/-T`) and one for index-jumping (`-r/-R`). They behave differently: the first includes only records whose POS coordinate lies within the regions, while the other also includes records that overlap the regions.

This change makes the behavior configurable and provides three options:
- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the difference between `TC>T-` and `C>-`

Most importantly, this makes it possible for regions and targets to behave the same way. Note that the default behavior remains unchanged.

After samtools/htslib#1327 is merged, this commit resolves #1420 and #1421
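The three options can be illustrated with a small standalone sketch. It does not use HTSlib; the enum and function names are hypothetical, and the treatment of pure insertions in the third mode is an assumption made for this example rather than something the PR states.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mode names for this sketch; they mirror the three options
   in the description and are not taken from the HTSlib headers. */
enum overlap_mode {
    OVERLAP_POS = 0,      /* POS itself must lie inside the region */
    OVERLAP_RECORD = 1,   /* any base of the REF span may overlap */
    OVERLAP_VARIANT = 2   /* only the bases that actually change may overlap */
};

/* Returns 1 if the record (1-based pos, REF, ALT) is selected by the
   1-based inclusive region [beg, end] under the given mode, else 0. */
int record_selected(enum overlap_mode mode, long pos,
                    const char *ref, const char *alt, long beg, long end)
{
    assert(ref != NULL && alt != NULL);

    long rec_beg = pos;
    long rec_end = pos + (long)strlen(ref) - 1;

    if (mode == OVERLAP_POS)
        return beg <= pos && pos <= end;

    if (mode == OVERLAP_VARIANT) {
        /* Skip the leading bases shared by REF and ALT, e.g. the T in TC>T. */
        size_t k = 0;
        while (ref[k] && alt[k] && ref[k] == alt[k]) k++;
        rec_beg = pos + (long)k;
        /* Assumption of this sketch: a pure insertion (REF fully shared
           with ALT) is anchored at the last REF base. */
        if (rec_beg > rec_end) rec_beg = rec_end;
    }
    return rec_beg <= end && beg <= rec_end;
}
```

For a deletion `TC>T` at POS 100, a region covering only position 101 is missed by the POS-only mode but matched by the other two, while a region covering only position 100 is matched by the first two modes and not by the true-variation mode.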
8eeaadf to a43d170
This addresses a long-standing design flaw in the handling of regions and targets, as described in these BCFtools issues: samtools/bcftools#1420 samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of options for restricting VCF/BCF files by region: one for streaming (`-t/-T`) and one for index-jumping (`-r/-R`). They behave differently: the first includes only records whose POS coordinate lies within the regions, while the other also includes records that overlap the regions.

This change makes the behavior configurable and provides three options:
- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the difference between `TC>T-` and `C>-`

Most importantly, this makes it possible for regions and targets to behave the same way. Note that the default behavior remains unchanged.
@@ -110,7 +112,8 @@ typedef struct bcf_sr_regions_t
- int is_bin; // is open in binary mode (tabix access)
+ int is_bin:30, // is open in binary mode (tabix access)
+ overlap:2; // see BCF_SR_REGIONS_OVERLAP/BCF_SR_TARGETS_OVERLAP
This breaks the ABI on big-endian platforms. On all platforms, whether it's ABI breakage depends on implementation-defined behaviour; on x86_64 you probably get away with it (i.e., |
We did briefly discuss this in our meeting too (or rather I asked the question what the impact on ABI was and the conclusion was it'd break, but we didn't get a plan of action). It's possible we could detect big/little endian and switch the order at compile time ( There's also the issue that practically there's not much else we can do, unless we define it in code ( |
From a practical perspective, the |
I know there are a lot of things in exposed structs which we label as internal simply because in C there's no distinction and we don't have the However I see it used twice within bcftools, so it's not an internal-to-htslib thing. |
It is true that with Those two instances of |
@jkbonfield It's true it's used in bcftools, but only once. And it should not be, @pd3 is a bad bad boy. |
There was some uncertainty about how samtools#1327 would behave with programs and htslib on different endian platforms when the library and the program are compiled using different compilers. Adding the new field as an integer at the end of the structure was deemed safer.
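A minimal sketch of why appending a plain integer is the more conservative choice: every pre-existing field keeps its offset, so code compiled against the old layout still reads and writes the same bytes. The struct names and the `prev_start` field are hypothetical, cut-down stand-ins and not the real `bcf_sr_regions_t`.

```c
#include <stddef.h>

/* Hypothetical stand-in for the layout seen by already-compiled callers. */
typedef struct {
    int is_bin;      /* full-width int, as before the change */
    int prev_start;  /* placeholder for the remaining existing fields */
} regions_v1_t;

/* Hypothetical stand-in for the layout with the new field appended. */
typedef struct {
    int is_bin;
    int prev_start;
    int overlap;     /* new field added at the end of the structure */
} regions_v2_t;

/* Returns 1 if every pre-existing field keeps its offset in the new layout. */
int layout_is_compatible(void)
{
    return offsetof(regions_v1_t, is_bin) == offsetof(regions_v2_t, is_bin)
        && offsetof(regions_v1_t, prev_start) == offsetof(regions_v2_t, prev_start)
        && sizeof(regions_v2_t) >= sizeof(regions_v1_t);
}
```

The structure does grow, so this approach relies on the library, not the caller, allocating it. By contrast, splitting `is_bin` into bit-fields leaves the bit placement to the compiler and platform, which is the source of the uncertainty discussed above.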