Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embed ref2 unsorted #1558

Merged
merged 3 commits into from
Feb 8, 2023
Merged

Conversation

jkbonfield
Copy link
Contributor

  • Fix crash when cram_encode_container during multi-threaded CRAM encoding.

  • Prevent embed_ref=2 on name sorted data from being a hard failure for cram encoding. It now switches to non-ref mode instead, so we always encode something, even if it's suboptimal.

@daviesrob
Copy link
Member

This seems OK in the case where you have no references at all, but I managed to break it by allowing it to find a subset of the references. This may be a different, but related, bug.

To reproduce, make a name-sorted file that has no @SQ UR: tags but does have @SQ M5:. Then find the reference for the first read it would try to load, and make an MD5 reference directory with just that one in it. Then try to convert the file to CRAM.

The problem is that getting into multi_seq mode also allocates c->refs_used[], which triggers the MD5 validation code here. The validation code fails when it can't load one of the used references.

It also looks like there are possible race conditions on fd->no_ref as after this PR it can change state in another thread.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jan 31, 2023

I'm trying to repeat this and cannot get it to fail. Do you have test data please?

Edit: well I can with threads, but if I recall you had a case of failure without threading? I'm working on fixing the trivial thread issues, caused by the new ability to change fd->no_ref from within a thread (previously impossible).

Ah sorry, I had a typo in my ref hash! duh. My guessed at fix cures it sure enough. I still need to tweak some threading cases though.

@jkbonfield
Copy link
Contributor Author

I think I've fixed the new (or quite possibly old) problem you found, as well as fixed the threading issues caused by changing no_ref from within cram threads. That is particularly insideous as there's so many places it's looked at. In the end I took the same approach we did for some other fields, which is migrate them from fd into the container at the time of creation, and make everything underneath that (eg slices, records, aux fields, etc) all relate to the container variable which doesn't change from that point onwards. It may not be necessary everywhere, but it's safe (and failed attempts to fix this showed at least some of it was necessary).

Also fixed the CRAM decoder to be more robust so it doesn't access uninitialised md5sum buffers when we have no reference loaded. Such CRAM files are malformed, and only temporarily created by the earlier commit (now fixed), but improving the robustness of the decoder is good.

if (fd->embed_ref) {
hts_log_info("Changing from embed_ref to no_ref mode");
c->embed_ref = 0;//fd->embed_ref = 0;
c->no_ref = 1;//fd->no_ref = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be more sticky. Currently c->no_ref gets reset back to fd->no_ref when the next record is added. It looks like there's also a problem with it dropping out of multiref mode, which on my test file leads to a small container being flushed out and, as it's dropped out of no-ref, fails the MD5 validation.

I'll send you the location of my test file.

@@ -87,6 +87,10 @@ cram_block *cram_encode_compression_header(cram_fd *fd, cram_container *c,
cram_block *map = cram_new_block(COMPRESSION_HEADER, 0);
int i, mc, r = 0;

pthread_mutex_lock(&fd->metrics_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to lock this (as you're accessing c->no_ref rather that fd->no_ref)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha indeed not! I expect this was originally fd->no_ref and I blithly converted it to c-> instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to check other places where it was changed, as there's at least one other case where you grab a lock before accessing c->no_ref.

@@ -1978,12 +2015,14 @@ int cram_encode_container(cram_fd *fd, cram_container *c) {


/* Compute MD5s */
no_ref = c->no_ref;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this line is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function has grown over the years and is overly long, so it really needs a refactor. However no_ref is used a few lines lower down and above it there are several places that modify c->no_ref. It's possible the logic is such that they never have any effect on the cached no_ref, but I wouldn't like to say with 100% certainty and for clarity alone it's worth keeping this line IMO.

if (!no_ref && c->refs_used) {
for (i = 0; i < nref; i++) {
if (c->refs_used[i]) {
cram_get_ref(fd, i, 1, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was going well until I enabled multi-threading:

REF_PATH=/tmp/cc/%s ./test/test_view -C -@ 8 -p /tmp/t8.cram /tmp/9305_4#1n_nour.bam
[W::cram_get_ref] Failed to populate reference for id 5
[W::cram_encode_container] Failed to load reference #5
[W::cram_encode_container] Enabling embed_ref=2 mode to auto-generate reference
[W::cram_encode_container] NOTE: the CRAM file will be bigger than using an external reference
[W::cram_get_ref] Failed to populate reference for id 16
[W::cram_encode_container] Failed to load reference #16
[W::cram_encode_container] Switching to non-ref mode
[W::cram_get_ref] Failed to populate reference for id 1
[E::validate_md5] SQ header M5 tag discrepancy for reference '10'
[E::validate_md5] Please use the correct reference, or consider using embed_ref=2
[E::cram_flush_thread] Call to cram_encode_container failed
Error writing output.

Note that you might have to run it a few times to get this to happen. I think it depends on the main thread managing to queue up enough containers before one of the workers switches it to no_ref mode.

I suspect the easiest solution might be to check for reference loading failures here, and switch the container to no_ref mode if it happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you might have to run it a few times to get this to happen. I think it depends on the main thread managing to queue up enough containers before one of the workers switches it to no_ref mode.

I suspect the easiest solution might be to check for reference loading failures here, and switch the container to no_ref mode if it happens.

More than a few times for me - it was failing about 1 in 100 on my smaller test set (with approx 10 containers worth and 3 threads). Anyway, it's now doing as suggested and I can't break it any more (several thousand runs done).

@daviesrob
Copy link
Member

This looks better, albeit a bit chatty when running multi-threaded and it discovers it needs no-ref mode. I also see that the stray no_ref = c->no_ref; mentioned above is still there.

Could you squash this into its final form please?

Specifically, this is a multi-threading bug where the
cram_encode_container fails leading to access of freed memory.  This
bug always existed, but we now have a guaranteed way of making the
container encode fail.

Embed_ref=2 computes a consensus and uses that as the reference.  By
design, it cannot work on non-position sorted data, hence a guaranteed
failure for encoding.

(Step 2 will be to do something more sensible than just fail.)
If we're attempting to use embed_ref=2 because we have alignments in
BAM and no known reference (no M5 tags), and the data is also name
sorted so we can't auto-generate a reference from the consensus and/or
MD:Z tags, then we now just switch to no_ref mode instead of bailing
out.

This isn't ideal, but there is little else that can be done unless the
user modifies their command line or reheaders the input file.

Extra complexities exist as the setting of embed_ref occurs while
encoding a container, which also means it can be adjusted within
threads and synchronously with other containers being encoded.  Hence
we now cache more variables from cram_fd into cram_container.
This is far from complete, but it's a guide on the basic function
hierarchy and which bits are used from a single thread per cram_fd and
which can be concurrent.
@jkbonfield
Copy link
Contributor Author

Chatty is deliberate, as it has an impact on the CRAM size. If someone accidentally forgets to specify the reference I'd rather they get a few warning messages, possibly more than once if threaded, than a silent degregadation in compression performance. I don't think it's worth putting lots of extra thread locks and a shared variable to track how many times we've reported the warnings, so I think the current warnings represent a good middle ground unless you can think of a simple fix.

I rebased and squashed it.

@daviesrob daviesrob merged commit 80558d9 into samtools:develop Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants