Embed ref2 unsorted #1558

jkbonfield · 2023-01-27T16:40:41Z

Fix crash when cram_encode_container during multi-threaded CRAM encoding.
Prevent embed_ref=2 on name sorted data from being a hard failure for cram encoding. It now switches to non-ref mode instead, so we always encode something, even if it's suboptimal.

daviesrob · 2023-01-30T15:27:48Z

This seems OK in the case where you have no references at all, but I managed to break it by allowing it to find a subset of the references. This may be a different, but related, bug.

To reproduce, make a name-sorted file that has no @SQ UR: tags but does have @SQ M5:. Then find the reference for the first read it would try to load, and make an MD5 reference directory with just that one in it. Then try to convert the file to CRAM.

The problem is that getting into multi_seq mode also allocates c->refs_used[], which triggers the MD5 validation code here. The validation code fails when it can't load one of the used references.

It also looks like there are possible race conditions on fd->no_ref as after this PR it can change state in another thread.

jkbonfield · 2023-01-31T12:15:50Z

I'm trying to repeat this and cannot get it to fail. Do you have test data please?

Edit: well I can with threads, but if I recall you had a case of failure without threading? I'm working on fixing the trivial thread issues, caused by the new ability to change fd->no_ref from within a thread (previously impossible).

Ah sorry, I had a typo in my ref hash! duh. My guessed at fix cures it sure enough. I still need to tweak some threading cases though.

jkbonfield · 2023-01-31T17:28:31Z

I think I've fixed the new (or quite possibly old) problem you found, as well as fixed the threading issues caused by changing no_ref from within cram threads. That is particularly insideous as there's so many places it's looked at. In the end I took the same approach we did for some other fields, which is migrate them from fd into the container at the time of creation, and make everything underneath that (eg slices, records, aux fields, etc) all relate to the container variable which doesn't change from that point onwards. It may not be necessary everywhere, but it's safe (and failed attempts to fix this showed at least some of it was necessary).

Also fixed the CRAM decoder to be more robust so it doesn't access uninitialised md5sum buffers when we have no reference loaded. Such CRAM files are malformed, and only temporarily created by the earlier commit (now fixed), but improving the robustness of the decoder is good.

daviesrob · 2023-02-01T10:26:02Z

cram/cram_encode.c

+            if (fd->embed_ref) {
+                hts_log_info("Changing from embed_ref to no_ref mode");
+                c->embed_ref = 0;//fd->embed_ref = 0;
+                c->no_ref = 1;//fd->no_ref = 1;


This needs to be more sticky. Currently c->no_ref gets reset back to fd->no_ref when the next record is added. It looks like there's also a problem with it dropping out of multiref mode, which on my test file leads to a small container being flushed out and, as it's dropped out of no-ref, fails the MD5 validation.

I'll send you the location of my test file.

daviesrob · 2023-02-06T17:23:30Z

cram/cram_encode.c

@@ -87,6 +87,10 @@ cram_block *cram_encode_compression_header(cram_fd *fd, cram_container *c,
    cram_block *map = cram_new_block(COMPRESSION_HEADER, 0);
    int i, mc, r = 0;

+    pthread_mutex_lock(&fd->metrics_lock);


Do you need to lock this (as you're accessing c->no_ref rather that fd->no_ref)?

Haha indeed not! I expect this was originally fd->no_ref and I blithly converted it to c-> instead.

You'll also need to check other places where it was changed, as there's at least one other case where you grab a lock before accessing c->no_ref.

cram/cram_encode.c

daviesrob · 2023-02-07T12:23:17Z

cram/cram_encode.c

@@ -1978,12 +2015,14 @@ int cram_encode_container(cram_fd *fd, cram_container *c) {


    /* Compute MD5s */
+    no_ref = c->no_ref;


I don't think this line is needed.

The function has grown over the years and is overly long, so it really needs a refactor. However no_ref is used a few lines lower down and above it there are several places that modify c->no_ref. It's possible the logic is such that they never have any effect on the cached no_ref, but I wouldn't like to say with 100% certainty and for clarity alone it's worth keeping this line IMO.

daviesrob · 2023-02-07T12:29:12Z

cram/cram_encode.c

+    if (!no_ref && c->refs_used) {
+        for (i = 0; i < nref; i++) {
+            if (c->refs_used[i]) {
+                cram_get_ref(fd, i, 1, 0);


It was going well until I enabled multi-threading:

REF_PATH=/tmp/cc/%s ./test/test_view -C -@ 8 -p /tmp/t8.cram /tmp/9305_4#1n_nour.bam [W::cram_get_ref] Failed to populate reference for id 5 [W::cram_encode_container] Failed to load reference #5 [W::cram_encode_container] Enabling embed_ref=2 mode to auto-generate reference [W::cram_encode_container] NOTE: the CRAM file will be bigger than using an external reference [W::cram_get_ref] Failed to populate reference for id 16 [W::cram_encode_container] Failed to load reference #16 [W::cram_encode_container] Switching to non-ref mode [W::cram_get_ref] Failed to populate reference for id 1 [E::validate_md5] SQ header M5 tag discrepancy for reference '10' [E::validate_md5] Please use the correct reference, or consider using embed_ref=2 [E::cram_flush_thread] Call to cram_encode_container failed Error writing output.

Note that you might have to run it a few times to get this to happen. I think it depends on the main thread managing to queue up enough containers before one of the workers switches it to no_ref mode.

I suspect the easiest solution might be to check for reference loading failures here, and switch the container to no_ref mode if it happens.

Note that you might have to run it a few times to get this to happen. I think it depends on the main thread managing to queue up enough containers before one of the workers switches it to no_ref mode.

I suspect the easiest solution might be to check for reference loading failures here, and switch the container to no_ref mode if it happens.

More than a few times for me - it was failing about 1 in 100 on my smaller test set (with approx 10 containers worth and 3 threads). Anyway, it's now doing as suggested and I can't break it any more (several thousand runs done).

daviesrob · 2023-02-08T10:33:43Z

This looks better, albeit a bit chatty when running multi-threaded and it discovers it needs no-ref mode. I also see that the stray no_ref = c->no_ref; mentioned above is still there.

Could you squash this into its final form please?

Specifically, this is a multi-threading bug where the cram_encode_container fails leading to access of freed memory. This bug always existed, but we now have a guaranteed way of making the container encode fail. Embed_ref=2 computes a consensus and uses that as the reference. By design, it cannot work on non-position sorted data, hence a guaranteed failure for encoding. (Step 2 will be to do something more sensible than just fail.)

If we're attempting to use embed_ref=2 because we have alignments in BAM and no known reference (no M5 tags), and the data is also name sorted so we can't auto-generate a reference from the consensus and/or MD:Z tags, then we now just switch to no_ref mode instead of bailing out. This isn't ideal, but there is little else that can be done unless the user modifies their command line or reheaders the input file. Extra complexities exist as the setting of embed_ref occurs while encoding a container, which also means it can be adjusted within threads and synchronously with other containers being encoded. Hence we now cache more variables from cram_fd into cram_container.

This is far from complete, but it's a guide on the basic function hierarchy and which bits are used from a single thread per cram_fd and which can be concurrent.

jkbonfield · 2023-02-08T13:52:17Z

Chatty is deliberate, as it has an impact on the CRAM size. If someone accidentally forgets to specify the reference I'd rather they get a few warning messages, possibly more than once if threaded, than a silent degregadation in compression performance. I don't think it's worth putting lots of extra thread locks and a shared variable to track how many times we've reported the warnings, so I think the current warnings represent a good middle ground unless you can think of a simple fix.

I rebased and squashed it.

jkbonfield force-pushed the embed_ref2_unsorted branch from 03886a4 to e8b30ba Compare January 31, 2023 17:27

daviesrob reviewed Feb 1, 2023

View reviewed changes

jkbonfield force-pushed the embed_ref2_unsorted branch from 5c72a31 to 3c70aec Compare February 1, 2023 17:23

daviesrob reviewed Feb 6, 2023

View reviewed changes

cram/cram_encode.c Show resolved Hide resolved

jkbonfield force-pushed the embed_ref2_unsorted branch from f913215 to 9a92483 Compare February 7, 2023 10:25

daviesrob reviewed Feb 7, 2023

View reviewed changes

jkbonfield added 3 commits February 8, 2023 13:46

Add some documentation on cram encoder code structure

80558d9

This is far from complete, but it's a guide on the basic function hierarchy and which bits are used from a single thread per cram_fd and which can be concurrent.

jkbonfield force-pushed the embed_ref2_unsorted branch from 31be549 to 80558d9 Compare February 8, 2023 13:51

daviesrob merged commit 80558d9 into samtools:develop Feb 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Embed ref2 unsorted #1558

Embed ref2 unsorted #1558

jkbonfield commented Jan 27, 2023

daviesrob commented Jan 30, 2023

jkbonfield commented Jan 31, 2023 •

edited

Loading

jkbonfield commented Jan 31, 2023

daviesrob Feb 1, 2023

daviesrob Feb 6, 2023

jkbonfield Feb 7, 2023

daviesrob Feb 7, 2023

daviesrob Feb 7, 2023

jkbonfield Feb 8, 2023

daviesrob Feb 7, 2023

jkbonfield Feb 7, 2023

daviesrob commented Feb 8, 2023

jkbonfield commented Feb 8, 2023

		@@ -1978,12 +2015,14 @@ int cram_encode_container(cram_fd fd, cram_container c) {


		/* Compute MD5s */
		no_ref = c->no_ref;

Embed ref2 unsorted #1558

Embed ref2 unsorted #1558

Conversation

jkbonfield commented Jan 27, 2023

daviesrob commented Jan 30, 2023

jkbonfield commented Jan 31, 2023 • edited Loading

jkbonfield commented Jan 31, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daviesrob commented Feb 8, 2023

jkbonfield commented Feb 8, 2023

jkbonfield commented Jan 31, 2023 •

edited

Loading