-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Embed ref2 unsorted #1558
Embed ref2 unsorted #1558
Conversation
This seems OK in the case where you have no references at all, but I managed to break it by allowing it to find a subset of the references. This may be a different, but related, bug. To reproduce, make a name-sorted file that has no The problem is that getting into multi_seq mode also allocates It also looks like there are possible race conditions on |
I'm trying to repeat this and cannot get it to fail. Do you have test data please? Edit: well I can with threads, but if I recall you had a case of failure without threading? I'm working on fixing the trivial thread issues, caused by the new ability to change fd->no_ref from within a thread (previously impossible). Ah sorry, I had a typo in my ref hash! duh. My guessed at fix cures it sure enough. I still need to tweak some threading cases though. |
03886a4
to
e8b30ba
Compare
I think I've fixed the new (or quite possibly old) problem you found, as well as fixed the threading issues caused by changing no_ref from within cram threads. That is particularly insideous as there's so many places it's looked at. In the end I took the same approach we did for some other fields, which is migrate them from fd into the container at the time of creation, and make everything underneath that (eg slices, records, aux fields, etc) all relate to the container variable which doesn't change from that point onwards. It may not be necessary everywhere, but it's safe (and failed attempts to fix this showed at least some of it was necessary). Also fixed the CRAM decoder to be more robust so it doesn't access uninitialised md5sum buffers when we have no reference loaded. Such CRAM files are malformed, and only temporarily created by the earlier commit (now fixed), but improving the robustness of the decoder is good. |
cram/cram_encode.c
Outdated
if (fd->embed_ref) { | ||
hts_log_info("Changing from embed_ref to no_ref mode"); | ||
c->embed_ref = 0;//fd->embed_ref = 0; | ||
c->no_ref = 1;//fd->no_ref = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be more sticky. Currently c->no_ref
gets reset back to fd->no_ref
when the next record is added. It looks like there's also a problem with it dropping out of multiref mode, which on my test file leads to a small container being flushed out and, as it's dropped out of no-ref, fails the MD5 validation.
I'll send you the location of my test file.
5c72a31
to
3c70aec
Compare
cram/cram_encode.c
Outdated
@@ -87,6 +87,10 @@ cram_block *cram_encode_compression_header(cram_fd *fd, cram_container *c, | |||
cram_block *map = cram_new_block(COMPRESSION_HEADER, 0); | |||
int i, mc, r = 0; | |||
|
|||
pthread_mutex_lock(&fd->metrics_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to lock this (as you're accessing c->no_ref
rather that fd->no_ref
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha indeed not! I expect this was originally fd->no_ref and I blithly converted it to c->
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also need to check other places where it was changed, as there's at least one other case where you grab a lock before accessing c->no_ref
.
f913215
to
9a92483
Compare
@@ -1978,12 +2015,14 @@ int cram_encode_container(cram_fd *fd, cram_container *c) { | |||
|
|||
|
|||
/* Compute MD5s */ | |||
no_ref = c->no_ref; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this line is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function has grown over the years and is overly long, so it really needs a refactor. However no_ref
is used a few lines lower down and above it there are several places that modify c->no_ref
. It's possible the logic is such that they never have any effect on the cached no_ref
, but I wouldn't like to say with 100% certainty and for clarity alone it's worth keeping this line IMO.
cram/cram_encode.c
Outdated
if (!no_ref && c->refs_used) { | ||
for (i = 0; i < nref; i++) { | ||
if (c->refs_used[i]) { | ||
cram_get_ref(fd, i, 1, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was going well until I enabled multi-threading:
REF_PATH=/tmp/cc/%s ./test/test_view -C -@ 8 -p /tmp/t8.cram /tmp/9305_4#1n_nour.bam
[W::cram_get_ref] Failed to populate reference for id 5
[W::cram_encode_container] Failed to load reference #5
[W::cram_encode_container] Enabling embed_ref=2 mode to auto-generate reference
[W::cram_encode_container] NOTE: the CRAM file will be bigger than using an external reference
[W::cram_get_ref] Failed to populate reference for id 16
[W::cram_encode_container] Failed to load reference #16
[W::cram_encode_container] Switching to non-ref mode
[W::cram_get_ref] Failed to populate reference for id 1
[E::validate_md5] SQ header M5 tag discrepancy for reference '10'
[E::validate_md5] Please use the correct reference, or consider using embed_ref=2
[E::cram_flush_thread] Call to cram_encode_container failed
Error writing output.
Note that you might have to run it a few times to get this to happen. I think it depends on the main thread managing to queue up enough containers before one of the workers switches it to no_ref
mode.
I suspect the easiest solution might be to check for reference loading failures here, and switch the container to no_ref
mode if it happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you might have to run it a few times to get this to happen. I think it depends on the main thread managing to queue up enough containers before one of the workers switches it to
no_ref
mode.I suspect the easiest solution might be to check for reference loading failures here, and switch the container to
no_ref
mode if it happens.
More than a few times for me - it was failing about 1 in 100 on my smaller test set (with approx 10 containers worth and 3 threads). Anyway, it's now doing as suggested and I can't break it any more (several thousand runs done).
This looks better, albeit a bit chatty when running multi-threaded and it discovers it needs no-ref mode. I also see that the stray Could you squash this into its final form please? |
Specifically, this is a multi-threading bug where the cram_encode_container fails leading to access of freed memory. This bug always existed, but we now have a guaranteed way of making the container encode fail. Embed_ref=2 computes a consensus and uses that as the reference. By design, it cannot work on non-position sorted data, hence a guaranteed failure for encoding. (Step 2 will be to do something more sensible than just fail.)
If we're attempting to use embed_ref=2 because we have alignments in BAM and no known reference (no M5 tags), and the data is also name sorted so we can't auto-generate a reference from the consensus and/or MD:Z tags, then we now just switch to no_ref mode instead of bailing out. This isn't ideal, but there is little else that can be done unless the user modifies their command line or reheaders the input file. Extra complexities exist as the setting of embed_ref occurs while encoding a container, which also means it can be adjusted within threads and synchronously with other containers being encoded. Hence we now cache more variables from cram_fd into cram_container.
This is far from complete, but it's a guide on the basic function hierarchy and which bits are used from a single thread per cram_fd and which can be concurrent.
31be549
to
80558d9
Compare
Chatty is deliberate, as it has an impact on the CRAM size. If someone accidentally forgets to specify the reference I'd rather they get a few warning messages, possibly more than once if threaded, than a silent degregadation in compression performance. I don't think it's worth putting lots of extra thread locks and a shared variable to track how many times we've reported the warnings, so I think the current warnings represent a good middle ground unless you can think of a simple fix. I rebased and squashed it. |
Fix crash when
cram_encode_container
during multi-threaded CRAM encoding.Prevent embed_ref=2 on name sorted data from being a hard failure for cram encoding. It now switches to non-ref mode instead, so we always encode something, even if it's suboptimal.