Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and-expression bug, and other quirks #1463

Merged
merged 6 commits into from
Jul 14, 2022

Conversation

daviesrob
Copy link
Member

  • Fix bug in and_expr() where is_true was being carried over, leading to some records incorrectly passing filter expressions. Fixes samtools view expression filter gives wrong results depending on the order of expressions. samtools#1670.
  • Fix incorrect result from double unary-not on NULL strings.
  • Ensure res->is_true is updated in mul_expr() and add_expr() so that "5 - 5 && 1" and "+5 - 5 && 1" both give the same answer.
  • Ensure is_true == true doesn't get overridden for strings in hts_filter_eval() so that "null-but-true" and "null-but-true && 1" both give the same result.
  • Add new tests
  • Add a note that hts_expr_val_free() always needs to be called after hts_filter_eval() even if the same hts_expr_val_t is going to be reused in a later hts_filter_eval() call. This is because the structure is cleared on entry in hts_filter_eval(), which would leak memory if the hts_expr_val_t::s was still in use.

@jmarshall
Copy link
Member

jmarshall commented Jul 4, 2022

  • Ensure is_true == true doesn't get overridden for strings in hts_filter_eval() so that "null-but-true" and "null-but-true && 1" both give the same result.
  • Add a note that hts_expr_val_free() always needs to be called after hts_filter_eval() even if the same hts_expr_val_t is going to be reused in a later hts_filter_eval() call. This is because the structure is cleared on entry in hts_filter_eval(), which would leak memory if the hts_expr_val_t::s was still in use.

Yes, I noticed these wrinkles too. I was going to raise an issue, but I'll keep the conversation here instead.

Someone reading the current documentation might write the following reasonable program using this API:

bcf1_t *rec1, *rec2, *rec3;
…
hts_filter_t *filter = hts_filter_init(…);
…
hts_expr_val_t v = HTS_EXPR_VAL_INIT;
if (hts_filter_eval(filter, rec1, vcf_sym_func, &v) >= 0 && v.is_true) foo(1, rec1);
if (hts_filter_eval(filter, rec2, vcf_sym_func, &v) >= 0 && v.is_true) foo(2, rec2);
if (hts_filter_eval(filter, rec3, vcf_sym_func, &v) >= 0 && v.is_true) foo(3, rec3);
hts_expr_val_free(&v);

This will leak memory due to the memset at the start of hts_filter_eval().

(The current users of hts_filter_eval() — in sam.c and test/test_expr.c — don't have = HTS_EXPR_VAL_INIT, but I would call this an oversight enabled by the memset.)

Both this and the and_expr() issue stem from a lack of appropriate clearing of the hts_expr_val_t and the difficulty of combining “it's a string but k.s == NULL so it was never set” with the usual kstring_t buffer reuse.

iMHO the following would likely be a better fix for all this:

  • Use hts_expr_val_t::s in the usual reuse the kstring buffer repeatedly way;
  • Replace the memset with res->is_str=res->is_true=0; ks_clear(&res->s); res->d=0.0;
  • Make is_str tristate to represent the “string but never been set” thing: true, false, true-but-unset.

@daviesrob
Copy link
Member Author

@jmarshall I did consider changing hts_filter_eval() so that it would work in the way you suggest. Unfortunately I fear it's too late to do it safely though. Thanks to the memset, it's been possible to call the function with an uninitialised hts_expr_val_t struct ever since it first appeared in 999d181. And as hts_expr_val_t::s.s is always zeroed, it's not possible to call it at the moment with a valid pointer without getting a leak. Trying at alter that now would be a fairly major change in behaviour. It seemed safer just to document the need for the caller to free the pointer as a quirk of the API.

The documentation in hts_expr.h says that is_true forces the value to true "even if zero" on the data structure. The hts_filter_eval() documentation also says that "It [the returned result] can also be explicitly defined to be true even for a null value.", suggesting that is_true works even for a NULL string or a zero numeric value. Given that callback writers who only read the header would have that expectation, I've tried to make is_true == true consistently override any other definition of truthiness.

@jmarshall
Copy link
Member

The only usages of hts_filter_eval() in htslib/samtools/bcftools are in HTSlib itself, in sam.c and test/test_expr.c. Do you have evidence of other third-party code having started using it? I searched all of GitHub for hts_filter_eval and found zero instances apart from these instances in (copies of) HTSlib.

I certainly agree that, whatever the internals do with the various is_xyz state fields, inside hts_filter_eval() after the expression() call is a good place to canonicalise is_true to reflect the intended truthiness.

@jkbonfield
Copy link
Contributor

I haven't yet taken the time to review all this code again, but my vague recollection regarding truth was that I needed an explicit way of setting it that worked irrespective of the integer or string values being returned, so that obtaining a tag (whose value is zero or a blank string) would be true and distinguishable from looking for a tag and not finding it. The easiest way of handling this is separating truth for conditional checks from values.

@daviesrob
Copy link
Member Author

I suspect hts_filter_eval() probably hasn't been used outside of HTSlib, but it's impossible to tell for certain. The really safe way to make the change would be to make a new function and deprecate the old one. It would be a bigger change than here though, as to really allow memory reuse it would have to count zero-length kstrings as false, as well as ones with NULL pointers. It wouldn't be a bad idea really as I think it's more common to count a zero-length string as false, but it's probably something to consider in the future rather than right now.

@jkbonfield
Copy link
Contributor

The choice of true vs false and memory leakage are two orthogonal issues and we don't need to be changing one to fix the other.

Note the choice of empty strings being true was deliberate, in order to match the 0 value also being true. See the samtools.1 documentation, which is explicit about 0 but doesn't (and perhaps should) mention the analogue of empty strings in aux tags.

  If no comparison is used with an auxiliary tag it is taken simply to be
  a test for the existence of that tag.  So "[NM]" will return any record
  containing an NM tag, even if that tag is zero (NM:i:0).

  If you need to check specifically for a non-zero value then use [NM] &&
  [NM]!=0.

Sambamba's approach was an explicit "null" keyword (https://github.com/lomereiter/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax), in e.g. [RG] != null. My alternative was simply a boolean of [RG] being true if the tag exists, irrespective of value. Hence the separation of truth from numeric / string contents.

(Also note the expression language was designed with more than just SAM in mind, so potentially other data sources may be plugged in and can make their own decisions on when true / false gets set.)

As for the memory leak potential, I'd say it is an oversight that this is possible. Clearly I added the HTS_EXPR_VAL_INIT macro for static initialisations so there's little reason for the function to be capable of working on completely uninitialised inputs. We could change it to assume an initialised (either HTS_EXPR_VAL_INIT or previously used) value and free before zeroing. That's the easy solution anyway, which then doesn't have any impact on the dowstream nul vs empty checks.

@jkbonfield
Copy link
Contributor

The changes look OK, but I think we should do the additional change to fix the potential memory leak in naive usage of hts_filter_eval too.

@daviesrob
Copy link
Member Author

I've added a test to detect the potential memory leak situation, and bail out if it's going to happen. Unfortunately I don't think there's much else that can be done within the limits of the existing API, especially if we can't be too sure what version is being coded and linked against.

If we do want to go further, then a possible way forward is in this commit which makes rather more radical changes to the way empty strings work. In particular:

  • NULL and allocated but zero-length strings are treated the same (as empty strings).
  • Empty strings are always false. This is the main change from previous behaviour, where allocated but zero-length strings were treated as true.
  • Two empty strings compare as equal.
  • Empty strings compare less than non-empty strings.
  • hts_filter_eval() is deprecated, and replaced with hts_filter_eval2(), which does not zero out the string pointer.

The last change ensures that the new API is in use, so we know it's safe to use the passed-in kstring. Making NULL and zero-length strings both false is necessary so that we don't have to free any pointers. The present-but-zero (or empty) AUX tags still work, because hts_expr_val_t::is_true will be set, and overrides the empty string.

hts_expr.c Outdated
// possible to know is res was initialised correctly, so in
// either case we fail ungracefully.
fprintf(stderr, "hts_expr_val_t not initialised (with HTS_EXPR_VAL_INIT) or cleared (by hts_expr_val_free) correctly. Not safe to continue, sorry.\n");
abort();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't think break our rule that the library should never just call abort?

It ought to return -1 instead (if indeed we wish to even go down this road).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't call abort() for conditions that could be triggered by dodgy inputs, but we do in places where it would indicate a bug in HTSlib. In this case the bug would be in the code that calls HTSlib, but it would still be something that needs the attention of the code developer rather than the end user. I can change it to hts_log_error and return -1 if you'd like though.

@jkbonfield
Copy link
Contributor

You're confusing too different things here:

  1. Memory leaks.
  2. The definition of true/false for empty strings.

Point 1 does not mean we have to do point 2.

Also I think adding an abort() call is a really bad idea. I know it's an abort due to coding error rather than data error, but it still doesn't sit right. Personally I think it would simply be better to make it free the old data if present, reinitialise the data itself (eg memset) and then continue as before. It avoids changing the language semantics which would be a really bad idea now, while fixing the potential memory leak.

We obviously also need to update the documentation too, to state that hts_expr_t must be initialised with HTS_EXPR_VAL_INIT or similar (I'm not sure if we have an allocate-and-return-pointer function, but if so then that's the obvious alternative).

@jkbonfield
Copy link
Contributor

Eg something like this (in addition to the initialisation changes you made):

diff --git a/hts_expr.c b/hts_expr.c
index 599d7a5..7de4297 100644
--- a/hts_expr.c
+++ b/hts_expr.c
@@ -682,6 +682,9 @@ int hts_filter_eval(hts_filter_t *filt,
                     hts_expr_val_t *res) {
     char *end = NULL;
 
+    // Free any old data.  Harmless if not yet used
+    hts_expr_val_free(res);
+
     memset(res, 0, sizeof(*res));
 
     filt->curr_regex = 0;

I think this is sufficient. It doesn't have any worse impact that your existing change. Let's think about that.

  1. Our code is using an initialised hts_expr_val_t already. This change does nothing, and your change would also pass without doing anything.

  2. Our code is using an uninitialised hts_expr_val_t. Both my change and your change are likely to fail. Yours would call abort, while mine would be undefined, but very likely just crash and burn also.

  3. We're manually freeing our hts_expr_val_t between successive calls, but not zeroing the struct. This is perfectly correct code when used in conjunction with the previous API. Both my code and your code will act the same as point 2 above. Yours will hard abort, while my change would probably detect a double free and also die.

  4. We're reusing hts_expr_val_t and not freeing, which previously lead to a memory leak. You implementation would abort, while my change above would free the memory.

We've already asserted that the correct usage in point 3 above isn't likely to be out there in the wild (at least we can't detect it if so, so it'll be private code only). Given this we're basically already saying we don't think the difference between our fixes for point 3 matters, and by the same logic the same can be said for point 2.

The main difference is a guarantee of abort vs a highly likely crash, vs the ability to fix the API and automatically free. While I'd prefer it if it was a guaranteed failure of doing something wrong, I don't think that slight difference is sufficient justified to simply fix the API given the comments above.

@jmarshall
Copy link
Member

jmarshall commented Jul 12, 2022

I suspect the memory issues (due to the unusual handling of a “null” kstring_t) are related to the desired handling of true/false. But you're right that they may be best thought about and addressed separately.

You can't just free the old data if present, because if the hts_expr_val_t is uninitialised then a non-NULL pointer value may be just garbage rather than a valid pointer value. (I think this — in third-party code — is the failure mode Rob is trying extra hard to avoid.) [I just re-read your point (2) above more carefully, and I think we're basically in agreement.]

@jkbonfield
Copy link
Contributor

You can't just free the old data if present, because if the hts_expr_val_t is uninitialised then a non-NULL pointer value may be just garbage rather than a valid pointer value.

Yes, but Rob's fix turns that non-NULL garbage into an abort() as it's simply impossible to distinguish between uninitialised memory and, well, anything else. So I'm saying if we're willing to accept a change of behaviour here, then let's do it properly instead.

Basically the decision I see is do we change the API by requiring an initialised hts_expr_val_t. I think we can probably justify doing that given it's currently (apparently) unused externally. If we don't wish to make that decision, then both Rob's fix and my fix are incorrect and the correct solution is to change the documentation.

Infact either way there needs to be a documentation change.

@daviesrob
Copy link
Member Author

Yes, the free must be done by the caller, unless we explicitly show that the interface has changed by making a new function and deprecating the old one. Otherwise apparently-correct code would either work or leak memory, depending on which version of HTSlib it happens to be linked against.

By forcing the caller to clean up, we get code that works on all versions.

By renaming the function, we ensure that code broken in the earlier versions won't link.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 12, 2022

I don't see a function rename in your changes, but that is a good point.

We could simply rename and fix it however we wish (or if we're really cautious duplicate/deprecate rather than rename and just document the required preconditions in the old API - no doubt that would stop complaints from automated ABI testing tools used by linux vendors, even if apparently unused). The function is tiny so a duplication wouldn't harm, but obviously it's trivial to free and then call the old. Eg:

int hts_filter_eval2(hts_filter_t *filt,                                         
                     void *data, hts_expr_sym_func *fn,                          
                     hts_expr_val_t *res) {                                      
    hts_expr_val_free(res);  // Free any old data.  Harmless if not yet used           
    return hts_filter_eval(filt, data, fn, res);
}

If we were doing that then we could also decide what to do about the language semantics and NULL strings, but that's a more complex issue to work through and requires a lot more work. If this is something we felt worthwhile doing, then this PR should be punted to the next release so won't don't have a third change in API while giving ourselves time to discuss the topic.

@jmarshall
Copy link
Member

jmarshall commented Jul 12, 2022

IMHO this is an extremely conservative response to a minor HTSlib bug.

  1. Anyone using sam_passes_filter() has no problem (now or with any proposed fix), as the hts_expr_val_t is managed internally.
  2. The vanishing few who may be using hts_filter_eval() directly have no problem if they use a hts_expr_val_t in one call to hts_filter_eval(). (If they initialise it with HTS_EXPR_VAL_INIT, no problem now or with any proposed fix. If they don't initialise it, no problem now but they'll get an abort with Rob's proposal and would segfault with the natural reuse-the-kstring fix.)
  3. But for the truly minuscule vanishing few who try to reuse the object for several hts_filter_eval() calls, there are problems:
    • Right now, they have a small memory leak.
    • With Rob's proposal, they'd get an abort and need to change their code to clear the object each time.
    • With a reuse-the-kstring fix inside hts_filter_eval(), they have no problem and their code is correct.

Anyway, enough motivation, I'll just jump to the proposal:

  1. If it were not for a theoretical small group of people using hts_filter_eval() directly without initialising the hts_expr_val_t, we would just fix the existing hts_filter_eval() function to reuse the kstring_t in the usual way. (I say “theoretical group of people” because GitHub searches find zero instances of such in public code on GitHub.)
  2. The documentation in 1.15.1's hts_expr.h is a bit brief, but IMHO programmers who think this through would see that hts_expr_val_t contains a kstring_t — which always needs initialising —, so they would naturally use the provided HTS_EXPR_VAL_INIT to initialise it.
  3. Unlucky programmers who followed the example code in sam.c might not have initialised it. (If hts_filter_eval() were fixed in the natural way, they would then get a segfault. That would be trivially fixed by adding the initialiser in their code, which would work with no problems with old HTSlib too.)
  4. So IMHO HTSlib should fix this in two stages:
    • Right now, add the initialisers in sam.c and test/test_expr.c, clarify the hts_expr.h documentation, and add a NEWS note.
    • In a later HTSlib release, change hts_filter_eval() to use ks_clear() etc to reuse the kstring's buffer instead of the current memset, thus fixing the memory leak for the truly minuscule vanishing few. By then, all example code in HTSlib will have been initialising the hts_expr_val_t for quite some time, so anyone not doing so has little excuse — and has a very simple and compatible fix available.

@daviesrob
Copy link
Member Author

ks_clear() can't be used unless the handling of "null" kstrings is fixed.

I've been trying to come up with a solution for this for a while now. My conclusion is that the only way out is to ask the caller to do all the initialisation, and clean-up between calls, document this and enforce it by adding a test to the code. That's what this PR currently does.

As noted, no-one seems to be using hts_filter_eval() directly at the moment. As I've fixed sam_passes_filter() already, the extra test is unlikely to break anything. Having it there will ensure that new code works correctly, including when linked against earlier versions of HTSlib.

I don't see any point in changing the function name for this document & test solution, as code written to pass the test will fully work with past versions. The same isn't true of any solution that lets you use the hts_expr_val_t structure without calling hts_expr_val_free() first as they would leak memory on earlier versions (and for any expressions involving sequence, the leak could be non-trivial).

@jmarshall
Copy link
Member

ks_clear() can't be used unless the handling of "null" kstrings is fixed.

Indeed, hence the tristate is_str suggestion back in my first comment #1463 (comment).

The same isn't true of any solution that lets you use the hts_expr_val_t structure without calling hts_expr_val_free() first as they would leak memory on earlier versions (and for any expressions involving sequence, the leak could be non-trivial).

Bluntly — Who cares? An HTSlib bug was fixed (will have been fixed in future, if you go for the two-phase approach); earlier versions leaked memory in some circumstances, but this has been fixed in later HTSlib versions.

Tools for which the leak is non-trivial have the option of putting their hts_expr_val_t allocing and freeing inside the loop so that it is not leaked, at the cost of not reusing the allocated kstring buffer. And then they won't leak even on older HTSlib versions.

My conclusion is that the only way out is to ask the caller to do all the initialisation, and clean-up between calls, document this and enforce it by adding a test to the code.

Alternatively, do nothing. Fix the initialisations, perhaps wait a while for putative other initialisations to get fixed, then fix hts_filter_eval()'s memory use.

It seems unfortunate to forbid forever the reuse-the-buffer–style code (which is the natural way to use kstrings and is found all over HTSlib) for the sake of a memory leak exhibited by possible current third-party code that we are pretty sure does not exist — and when a workaround is available to any such third-party code that in fact does exist.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 12, 2022

To be honest, trying to reuse the kstring buffer is likely to be a can of worms. For sure it could work, but it may need a lot more code changes as right now there is an assumption that the value is either a string or a number, and it has never previously been one thing and become something different. I'm not saying it won't work, but I'd not want to naively reuse things without first going through the rest of the code with a fine tooth comb to check if there are any weird corner cases.

Given all we're saving by reuse is a free/malloc, and that this is likely a tiny portion of the overall overhead of the expression parsing implementation, I think the approach of freeing memory and zeroing the struct is the easy and safe fix (which can be done explicitly by the calling code if it wishes to be memory-leak free with old releases, as John rightly points out). We can if we wish later on amend it by permitting reuse without changing API/ABI.

I'd also point out if you're (Rob) worried about being conservative on functionality changes, then upgrading a tiny memory leak to a hard abort() isn't a fair deal! I know you view both as "wrong", but to equate them would be a definite minority view. Maybe the author of the calling code doesn't care about a few bytes leaked, but they sure will care about a crash. It's also naive to think the person experiencing this will be the tool author with the knowlegde of how to fix it. Chances are it'll be some unlucky researcher who simply installed a bunch of packages via conda only to get given a different library version number than the original tool author used.

However this is almost certainly all a moot point, given as far as we know the code is externally unused currently, and with updated API documentation it isn't going to be a serious issue for future code using this API that links against old implementations either. (By documenting that the old code didn't automatically free, then authors if they care can explicitly add hts_expr_val_free calls and doing so won't be incompatible with the new code as it ultimately calls ks_initialize which prevents any double free possibilities.)

@daviesrob
Copy link
Member Author

I've pushed an updated version, which hopefully matches what we were aiming for.

@jkbonfield
Copy link
Contributor

I'll look over this again to consider merging (so shout quickly if any strong disagreements), but for @jmarshall's sake, also see samtools/samtools#1677 discussion on future work planned here.

Specifically I've added an exists() function to handle the tri-state case and turn undefined values into a strict 0 or 1 so we can then use them consistently in booleans. Hence !exists([X1]) is now easy, instead of previously needing the hideous !([X1] || [X1]==0).

Additionally I want to follow standard SQL null-comparison rules and IEEE NaN comparison rules of always failing. This means -null, +null, !null, etc always fail. This means the [X0] != 0 test does what we expect rather than actually meaning !exists([X0]) || (exists[X0] && [X0] != 0). That would be a language semantics change, to document, but worth it IMO.

@jmarshall
Copy link
Member

jmarshall commented Jul 14, 2022

Since you've invoked me explicitly (thanks for the pointer to the other issue, which I have indeed seen), I'll repeat my strong disagreement FWIW: if you're going to go so far as to add another supported hts_filter_eval2() entry point, you should do it with sensible memory semantics.1

Footnotes

  1. As should be clear from my previous comments, by “sensible” I mean the semantics of reusing the kstring_t buffer in the usual HTSlib way.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 14, 2022

How do you define "sensible"? I'm concerned currently with API and not necessarily a perfect implementation.

For simplicity and speed of getting something fixed, it's trivial to free and reinitialise the kstring, and unless the user in internally grubbing around in the kstring and doing sneaky things they'll be none the wiser. Actually reusing the kstring is a much more complex issue, but maybe we can revisit that later on when other changes have been made. For this there are 3 workable outcomes.

  1. We drop the memory leak thing completely and fix the original bug here, postponing how to handle the memory leak to later.
  2. We fix the memory leak in a simple fashion, so it goes into this PR and hence this release.
  3. We fix the memory leak in a rigorous fashion, but the original bug fix slips to the next release.

I don't like option 3, so IMO we have a choice of 1 or 2.

I don't particularly like creating a new function either, but it was my concession to the previous request of making the previous case just blow up spectacularly. As I said before, my preference would simply be to fix the bug and ignore the (somewhat of a moot point) potential API issue of a compatibility with new code linking against old releases, but that's more or less vetoed, leaving only option 2 available from what I can see. I don't view it as a non-sensible solution.

Prevent old is_true values from being carried over, which could
cause incorrect results from '&&' expressions.
Toggling hts_expr_val::is_true on strings could get it out of
phase with hts_expr_val::d on null strings (which are false),
which caused double-unary-not to give the wrong value.

Instead, make unary not always return false if is_true is true,
so empty-but-true works; and for strings return true for null
ones, and false for non-null.  Numbers are handled as before.
Ensures that "5 - 5 && 1" and "+5 - 5 && 1" give the same answer.
The latter sets is_true in the unary +, so it has to be reset
after the subtraction.
So "null-but-true" and "null-but-true && 1" return the same value.
Due to hts_filter_eval() calling memset() on its res parameter,
it's not possible to pass in an allocted kstring_t in res->s
without leaking memory.  Historically it was also possible
to get away with passing in an uninitialised structure, so not
many assumptions can be made about the contents of res on entry.
In particular, it is not guaranteed that free(res->s.s) would
work.

To ensure the function is being used safely, check that the
string part of *res is NULL on entry and fail if not.
Also added a documentation note about calling hts_expr_val_free()
after hts_filter_eval().

Add hts_filter_eval2() and deprecate hts_filter_eval().  The new
function clears its `res` parameter properly, allowing it to
be called repeatedly a bit more easily than the original.
@jkbonfield jkbonfield merged commit 1fba06c into samtools:develop Jul 14, 2022
@daviesrob daviesrob deleted the fix_expr branch July 14, 2022 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

samtools view expression filter gives wrong results depending on the order of expressions.
3 participants