Fix and-expression bug, and other quirks #1463
Yes, I noticed these wrinkles too. I was going to raise an issue, but I'll keep the conversation here instead. Someone reading the current documentation might write the following reasonable program using this API:
bcf1_t *rec1, *rec2, *rec3;
…
hts_filter_t *filter = hts_filter_init(…);
…
hts_expr_val_t v = HTS_EXPR_VAL_INIT;
if (hts_filter_eval(filter, rec1, vcf_sym_func, &v) >= 0 && v.is_true) foo(1, rec1);
if (hts_filter_eval(filter, rec2, vcf_sym_func, &v) >= 0 && v.is_true) foo(2, rec2);
if (hts_filter_eval(filter, rec3, vcf_sym_func, &v) >= 0 && v.is_true) foo(3, rec3);
hts_expr_val_free(&v);
This will leak memory due to the (The current users of Both this and the IMHO the following would likely be a better fix for all this:
@jmarshall I did consider changing The documentation in |
The only usages of I certainly agree that, whatever the internals do with the various |
I haven't yet taken the time to review all this code again, but my vague recollection regarding truth was that I needed an explicit way of setting it that worked irrespective of the integer or string values being returned, so that obtaining a tag (whose value is zero or a blank string) would be true and distinguishable from looking for a tag and not finding it. The easiest way of handling this is separating truth for conditional checks from values. |
I suspect |
The choice of true vs false and memory leakage are two orthogonal issues and we don't need to be changing one to fix the other. Note the choice of empty strings being true was deliberate, in order to match the 0 value also being true. See the samtools.1 documentation, which is explicit about 0 but doesn't (and perhaps should) mention the analogue of empty strings in aux tags.
Sambamba's approach was an explicit "null" keyword (https://github.com/lomereiter/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax), in e.g. (Also note the expression language was designed with more than just SAM in mind, so potentially other data sources may be plugged in and can make their own decisions on when true / false gets set.) As for the memory leak potential, I'd say it is an oversight that this is possible. Clearly I added the |
The changes look OK, but I think we should do the additional change to fix the potential memory leak in naive usage of |
I've added a test to detect the potential memory leak situation, and bail out if it's going to happen. Unfortunately I don't think there's much else that can be done within the limits of the existing API, especially if we can't be too sure what version is being coded and linked against. If we do want to go further, then a possible way forward is in this commit which makes rather more radical changes to the way empty strings work. In particular:
The last change ensures that the new API is in use, so we know it's safe to use the passed-in kstring. Making NULL and zero-length strings both false is necessary so that we don't have to free any pointers. The present-but-zero (or empty) AUX tags still work, because |
hts_expr.c
Outdated
// possible to know is res was initialised correctly, so in
// either case we fail ungracefully.
fprintf(stderr, "hts_expr_val_t not initialised (with HTS_EXPR_VAL_INIT) or cleared (by hts_expr_val_free) correctly. Not safe to continue, sorry.\n");
abort();
Doesn't this break our rule that the library should never just call abort?
It ought to return -1 instead (if indeed we wish to even go down this road).
We don't call abort() for conditions that could be triggered by dodgy inputs, but we do in places where it would indicate a bug in HTSlib. In this case the bug would be in the code that calls HTSlib, but it would still be something that needs the attention of the code developer rather than the end user. I can change it to hts_log_error and return -1 if you'd like though.
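The alternative offered in this reply (report the problem and return -1 instead of calling abort()) could look roughly like the sketch below. It is a self-contained mock, not HTSlib code: `mock_val_t` and `mock_eval_checked()` are hypothetical names, and the `fprintf()` call only stands in for the library's error logging.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the result structure: a truth flag plus an
 * owned string pointer. Not the real hts_expr_val_t. */
typedef struct {
    int is_true;
    char *s;
} mock_val_t;

/* Refuses to run if the caller passes a result that still holds a string,
 * because wiping it would lose the pointer. Returns 0 on success, -1 on
 * misuse, and leaves *res untouched in the failure case. */
static int mock_eval_checked(mock_val_t *res) {
    if (res == NULL)
        return -1;
    if (res->s != NULL) {
        /* Stand-in for an error-level log message. */
        fprintf(stderr, "result not initialised or not freed before reuse\n");
        return -1;
    }
    memset(res, 0, sizeof(*res));
    res->is_true = 1;   /* pretend the expression evaluated to true */
    return 0;
}
```

With this shape the calling program sees an ordinary error return and can decide for itself whether to stop.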
You're confusing two different things here:
Point 1 does not mean we have to do point 2. Also I think adding an We obviously also need to update the documentation too, to state that |
Eg something like this (in addition to the initialisation changes you made):
I think this is sufficient. It doesn't have any worse impact than your existing change. Let's think about that.
We've already asserted that the correct usage in point 3 above isn't likely to be out there in the wild (at least we can't detect it if so, so it'll be private code only). Given this we're basically already saying we don't think the difference between our fixes for point 3 matters, and by the same logic the same can be said for point 2. The main difference is a guarantee of abort vs a highly likely crash, vs the ability to fix the API and automatically free. While I'd prefer it if it was a guaranteed failure of doing something wrong, I don't think that slight difference is sufficient justified to simply fix the API given the comments above. |
I suspect the memory issues (due to the unusual handling of a “null”
Yes, but Rob's fix turns that non-NULL garbage into an Basically the decision I see is do we change the API by requiring an initialised In fact either way there needs to be a documentation change.
Yes, the free must be done by the caller, unless we explicitly show that the interface has changed by making a new function and deprecating the old one. Otherwise apparently-correct code would either work or leak memory, depending on which version of HTSlib it happens to be linked against. By forcing the caller to clean up, we get code that works on all versions. By renaming the function, we ensure that code broken in the earlier versions won't link. |
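The caller-side clean-up described in this comment can be illustrated with a self-contained mock. Everything below is hypothetical (`mock_val_t`, `mock_eval()`, the tracking table) and merely stands in for hts_expr_val_t, hts_filter_eval() and hts_expr_val_free(); it models only the "result is wiped on entry" behaviour discussed in this thread, not HTSlib's real code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Small table of live allocations so the demonstration can count
 * (and reclaim) anything the caller lost track of. */
#define MAX_TRACKED 16
static char *tracked[MAX_TRACKED];

static char *tracked_strdup(const char *text) {
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == NULL) {
            size_t len = strlen(text) + 1;
            char *copy = malloc(len);
            if (copy != NULL) {
                memcpy(copy, text, len);
                tracked[i] = copy;
            }
            return copy;
        }
    }
    return NULL;   /* table full */
}

/* Hypothetical result type: truth flag plus an owned string. */
typedef struct {
    int is_true;
    char *s;
} mock_val_t;

/* Wipes *res on entry without freeing, as the old function is described. */
static int mock_eval(const char *text, mock_val_t *res) {
    assert(res != NULL && text != NULL);
    memset(res, 0, sizeof(*res));
    res->s = tracked_strdup(text);
    if (res->s == NULL)
        return -1;
    res->is_true = 1;
    return 0;
}

static void mock_val_free(mock_val_t *v) {
    if (v->s != NULL) {
        for (int i = 0; i < MAX_TRACKED; i++)
            if (tracked[i] == v->s)
                tracked[i] = NULL;
        free(v->s);
    }
    v->s = NULL;
    v->is_true = 0;
}

/* Evaluates n inputs with one shared result. Returns how many strings were
 * lost; they are reclaimed here so the demonstration itself stays clean. */
static int count_leaks(const char **inputs, int n, int free_each_time) {
    mock_val_t v = { 0, NULL };
    for (int i = 0; i < n; i++) {
        if (mock_eval(inputs[i], &v) < 0)
            break;
        if (free_each_time)
            mock_val_free(&v);   /* clean up between calls */
    }
    mock_val_free(&v);           /* the single free at the end */
    int lost = 0;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] != NULL) {
            free(tracked[i]);
            tracked[i] = NULL;
            lost++;
        }
    }
    return lost;
}
```

In this model, three evaluations followed by a single free lose two strings, while freeing after every call loses none, whichever way the evaluation function treats its argument on entry.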
I don't see a function rename in your changes, but that is a good point. We could simply rename and fix it however we wish (or if we're really cautious duplicate/deprecate rather than rename and just document the required preconditions in the old API - no doubt that would stop complaints from automated ABI testing tools used by linux vendors, even if apparently unused). The function is tiny so a duplication wouldn't harm, but obviously it's trivial to free and then call the old. Eg:
If we were doing that then we could also decide what to do about the language semantics and NULL strings, but that's a more complex issue to work through and requires a lot more work. If this is something we felt worthwhile doing, then this PR should be punted to the next release so we don't have a third change in API while giving ourselves time to discuss the topic.
IMHO this is an extremely conservative response to a minor HTSlib bug.
Anyway, enough motivation, I'll just jump to the proposal:
I've been trying to come up with a solution for this for a while now. My conclusion is that the only way out is to ask the caller to do all the initialisation, and clean-up between calls, document this and enforce it by adding a test to the code. That's what this PR currently does. As noted, no-one seems to be using I don't see any point in changing the function name for this document & test solution, as code written to pass the test will fully work with past versions. The same isn't true of any solution that lets you use the |
Indeed, hence the tristate
Bluntly — Who cares? An HTSlib bug was fixed (will have been fixed in future, if you go for the two-phase approach); earlier versions leaked memory in some circumstances, but this has been fixed in later HTSlib versions. Tools for which the leak is non-trivial have the option of putting their
Alternatively, do nothing. Fix the initialisations, perhaps wait a while for putative other initialisations to get fixed, then fix It seems unfortunate to forbid forever the reuse-the-buffer–style code (which is the natural way to use kstrings and is found all over HTSlib) for the sake of a memory leak exhibited by possible current third-party code that we are pretty sure does not exist — and when a workaround is available to any such third-party code that in fact does exist. |
To be honest, trying to reuse the kstring buffer is likely to be a can of worms. For sure it could work, but it may need a lot more code changes as right now there is an assumption that the value is either a string or a number, and it has never previously been one thing and become something different. I'm not saying it won't work, but I'd not want to naively reuse things without first going through the rest of the code with a fine tooth comb to check if there are any weird corner cases. Given all we're saving by reuse is a free/malloc, and that this is likely a tiny portion of the overall overhead of the expression parsing implementation, I think the approach of freeing memory and zeroing the struct is the easy and safe fix (which can be done explicitly by the calling code if it wishes to be memory-leak free with old releases, as John rightly points out). We can if we wish later on amend it by permitting reuse without changing API/ABI. I'd also point out if you're (Rob) worried about being conservative on functionality changes, then upgrading a tiny memory leak to a hard However this is almost certainly all a moot point, given as far as we know the code is externally unused currently, and with updated API documentation it isn't going to be a serious issue for future code using this API that links against old implementations either. (By documenting that the old code didn't automatically free, then authors if they care can explicitly add |
I've pushed an updated version, which hopefully matches what we were aiming for. |
I'll look over this again to consider merging (so shout quickly if any strong disagreements), but for @jmarshall's sake, also see samtools/samtools#1677 discussion on future work planned here. Specifically I've added an Additionally I want to follow standard SQL null-comparison rules and IEEE NaN comparison rules of always failing. This means -null, +null, !null, etc always fail. This means the |
Since you've invoked me explicitly (thanks for the pointer to the other issue, which I have indeed seen), I'll repeat my strong disagreement FWIW: if you're going to go so far as to add another supported Footnotes
How do you define "sensible"? I'm concerned currently with API and not necessarily a perfect implementation. For simplicity and speed of getting something fixed, it's trivial to free and reinitialise the kstring, and unless the user is internally grubbing around in the kstring and doing sneaky things they'll be none the wiser. Actually reusing the kstring is a much more complex issue, but maybe we can revisit that later on when other changes have been made. For this there are 3 workable outcomes.
I don't like option 3, so IMO we have a choice of 1 or 2. I don't particularly like creating a new function either, but it was my concession to the previous request of making the previous case just blow up spectacularly. As I said before, my preference would simply be to fix the bug and ignore the (somewhat of a moot point) potential API issue of a compatibility with new code linking against old releases, but that's more or less vetoed, leaving only option 2 available from what I can see. I don't view it as a non-sensible solution. |
Prevent old is_true values from being carried over, which could cause incorrect results from '&&' expressions.
Toggling hts_expr_val::is_true on strings could get it out of phase with hts_expr_val::d on null strings (which are false), which caused double-unary-not to give the wrong value. Instead, make unary not always return false if is_true is true, so empty-but-true works; and for strings return true for null ones, and false for non-null. Numbers are handled as before.
Ensures that "5 - 5 && 1" and "+5 - 5 && 1" give the same answer. The latter sets is_true in the unary +, so it has to be reset after the subtraction.
So "null-but-true" and "null-but-true && 1" return the same value.
Due to hts_filter_eval() calling memset() on its res parameter, it's not possible to pass in an allocated kstring_t in res->s without leaking memory. Historically it was also possible to get away with passing in an uninitialised structure, so not many assumptions can be made about the contents of res on entry. In particular, it is not guaranteed that free(res->s.s) would work. To ensure the function is being used safely, check that the string part of *res is NULL on entry and fail if not. Also added a documentation note about calling hts_expr_val_free() after hts_filter_eval().
Add hts_filter_eval2() and deprecate hts_filter_eval(). The new function clears its `res` parameter properly, allowing it to be called repeatedly a bit more easily than the original.
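The "free, then call the old function" idea behind the new entry point can be sketched with a mock. The names below (`mock_val_t`, `MOCK_VAL_INIT`, `mock_eval_old()`, `mock_eval2()`) are hypothetical stand-ins, and the sketch assumes, as the discussion does, that the caller always passes in a properly initialised result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type and initialiser, standing in for the real ones. */
typedef struct {
    int is_true;
    char *s;
} mock_val_t;
#define MOCK_VAL_INIT { 0, NULL }

static int outstanding = 0;   /* strings allocated and not yet freed */

static void mock_val_free(mock_val_t *v) {
    if (v->s != NULL) {
        free(v->s);
        outstanding--;
    }
    v->s = NULL;
    v->is_true = 0;
}

/* Old-style evaluation: wipes *res on entry, so a string still held in it
 * would be lost. */
static int mock_eval_old(const char *text, mock_val_t *res) {
    memset(res, 0, sizeof(*res));
    size_t len = strlen(text) + 1;
    res->s = malloc(len);
    if (res->s == NULL)
        return -1;
    memcpy(res->s, text, len);
    outstanding++;
    res->is_true = 1;
    return 0;
}

/* New-style wrapper: releases whatever *res holds, then delegates.
 * Only safe because *res is required to be initialised. */
static int mock_eval2(const char *text, mock_val_t *res) {
    assert(res != NULL && text != NULL);
    mock_val_free(res);
    return mock_eval_old(text, res);
}
```

Calling `mock_eval2()` repeatedly on one result and freeing it once at the end leaves nothing outstanding in this model, which is the usage the naive program earlier in the thread assumed.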
res->is_true is updated in mul_expr() and add_expr() so that "5 - 5 && 1" and "+5 - 5 && 1" both give the same answer.
is_true == true doesn't get overridden for strings in hts_filter_eval() so that "null-but-true" and "null-but-true && 1" both give the same result.
hts_expr_val_free() always needs to be called after hts_filter_eval() even if the same hts_expr_val_t is going to be reused in a later hts_filter_eval() call. This is because the structure is cleared on entry in hts_filter_eval(), which would leak memory if the hts_expr_val_t::s was still in use.
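The first point in this summary (keeping the truth flag in step with the value after arithmetic) can be modelled in a few lines. This is a heavily simplified, hypothetical model: `mock_num_t` and the helper functions are invented for illustration, and the rule that an operand counts as true when its flag is set or its number is non-zero is an assumption of the sketch, not a statement of how HTSlib evaluates expressions.

```c
#include <stdbool.h>

/* Hypothetical numeric value with a separately stored truth flag. */
typedef struct {
    double d;
    bool is_true;
} mock_num_t;

static mock_num_t num(double d) {
    mock_num_t v = { d, false };
    return v;
}

/* Unary plus records the truth of its operand explicitly. */
static mock_num_t unary_plus(mock_num_t v) {
    v.is_true = (v.d != 0);
    return v;
}

/* Subtraction that updates the number but keeps the left operand's flag. */
static mock_num_t sub_stale(mock_num_t a, mock_num_t b) {
    a.d -= b.d;
    return a;
}

/* Subtraction that recomputes the flag from the new number. */
static mock_num_t sub_reset(mock_num_t a, mock_num_t b) {
    a.d -= b.d;
    a.is_true = (a.d != 0);
    return a;
}

/* Assumed truth rule for this sketch: flag set, or number non-zero. */
static bool truth(mock_num_t v) {
    return v.is_true || v.d != 0;
}

static bool logical_and(mock_num_t a, mock_num_t b) {
    return truth(a) && truth(b);
}
```

In this model, `sub_stale()` makes "+5 - 5 && 1" true while "5 - 5 && 1" is false, because the flag set by the unary plus survives the subtraction; with `sub_reset()` both are false.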