Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finer control of --regions vs --targets overlap #1327

Merged
merged 1 commit into from
Sep 9, 2021

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Sep 7, 2021

This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
samtools/bcftools#1420
samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (-t/-T) and one for index-gumping (-r/-R). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

  • Include only records with POS starting in the regions/targets
  • Include VCF records that overlap regions/targets, even if POS itself is outside the regions
  • Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
    difference between TC>T- and C>-

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.

pd3 added a commit to samtools/bcftools that referenced this pull request Sep 7, 2021
This is to address a long-standing design flaw in handling regions and targets. BCFtools (and HTSlib)
recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one is for streaming
(`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes only records with
POS coordinate within the regions, the other includes overlapping regions. This allows to modify the default
behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.

After samtools/htslib#1327 is merged, this commit resolves #1420 and #1421
@pd3 pd3 force-pushed the bt-1421 branch 2 times, most recently from 8eeaadf to a43d170 Compare September 8, 2021 07:37
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
@jmarshall
Copy link
Member

@@ -110,7 +112,8 @@ typedef struct bcf_sr_regions_t
-    int is_bin;             // is open in binary mode (tabix access)
+    int is_bin:30,          // is open in binary mode (tabix access)
+        overlap:2;          // see BCF_SR_REGIONS_OVERLAP/BCF_SR_TARGETS_OVERLAP

This breaks the ABI on big-endian platforms. On all platforms, whether it's ABI breakage depends on implementation-defined behaviour; on x86_64 you probably get away with it (i.e., x.is_bin = 1 sets the same bits before and after this commit).

@jkbonfield
Copy link
Contributor

We did briefly discuss this in our meeting too (or rather I asked the question what the impact on ABI was and the conclusion was it'd break, but we didn't get a plan of action).

It's possible we could detect big/little endian and switch the order at compile time (#ifdef HTS_BIG_ENDIAN), which helps, but as you say it's not guaranteed. I'm not so worried though about guarantees if we can at least validate on a whole bunch of different compilers and have assurances that practically speaking it's consistent.

There's also the issue that practically there's not much else we can do, unless we define it in code (is_bin becomes is_bin & BIN_MASK) which punts it from an ABI break to an API one. That's not really palatable either.

@pd3
Copy link
Member Author

pd3 commented Sep 10, 2021

From a practical perspective, the is_bin is (should be) never used by user applications, so it is safe. Again, practically speaking. Undoubtedly one can think of an evil example.

@pd3 pd3 deleted the bt-1421 branch September 10, 2021 10:36
@jkbonfield
Copy link
Contributor

I know there are a lot of things in exposed structs which we label as internal simply because in C there's no distinction and we don't have the private component as we would we C++ classes. That's really a poor documentation thing though.

However I see it used twice within bcftools, so it's not an internal-to-htslib thing.

@jmarshall
Copy link
Member

It is true that with bcf_sr_regions_init()/bcf_sr_regions_destroy() etc, bcf_sr_regions_t is probably mostly “effectively opaque for application code”, so this is mostly effectively safe enough. (It would be good if some of these considerations could be mentioned publicly on the PR or commit message.)

Those two instances of is_bin in bcftools are peeking at the similarly named htsFile field, which is explicitly forbidden in htslb/hts.h and invalidates the comment there that says only a really old version of bcftools looked directly at this field. I suppose these two instances were added more recently…

@pd3
Copy link
Member Author

pd3 commented Sep 10, 2021

@jkbonfield It's true it's used in bcftools, but only once. And it should not be, @pd3 is a bad bad boy.

pd3 added a commit to pd3/htslib that referenced this pull request Oct 5, 2021
There was some uncertainty how samtools#1327 would behave with programs
and htslib on different endian platforms when the library
and the program is compiled using different compilers. Adding
the new field as an integer at the end of the structure was deemed
safer.
whitwham pushed a commit that referenced this pull request Oct 7, 2021
There was some uncertainty how #1327 would behave with programs
and htslib on different endian platforms when the library
and the program is compiled using different compilers. Adding
the new field as an integer at the end of the structure was deemed
safer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants