Demo / sample code update #1666

vasudeva8 · 2023-08-21T16:53:58Z

Samples updated with fasta/fastq index using, tabix index and thread pool queueing of tasks.

jkbonfield · 2023-09-25T09:40:01Z

samples/write_fast.c

+    //close and index it
+    sam_close(outfile); outfile = NULL;
+    if (fai_build3(outname, NULL, NULL) == -1) {
+        printf("Indexing failed with %d\n", errno);
+        goto end;
+    }


I'm not convinced this makes a good demonstration.

Fasta indices are good for references as they allow us to extract small portions of a large reference. However they are wholely inappropriate for NGS data, which is typically what we have in a SAM/BAM file.

Basically fasta/fastq are split personally and used for two different types of data. Messy.

We may be better having a tiny demo of a separate "samtools faidx" index builder instead.

This writes reference data in fasta/fastq format and showcases creation of index for later use, while reading.
It is not using NGS data, and uses sam_open/write/close and bam1_t for this.

Re-reading DEMO.md I see it explains the indexing is suitable for reference data.

As this is for education purposes, maybe put a comment just before the fai_build3 call: // NB: this is only suitable for fasta/fastq files of reference sequences, and not sequencing reads

Separated as a different samples and mentioned as indexing of reference data.

samples/read_with_tabix.c

samples/DEMO.md

samples/read_with_tabix.c

samples/qtask_thread.c

jkbonfield · 2023-11-07T12:41:43Z

I ran a spell checker over the DEMO and README files. Highlighted in bold. Some were there before, hence not easy to add comments to lines here, but it's trivial to search and fix.

Lines in README.md:
"length equal or greather to given input"
"This application shocases the use of mulitple queues,"

DEMO.md:
"Read_multireg - This application showcases the usage of mulitple regionn "
"Qtask_thread - This application shocases the use of mulitple queues, scheduling of different tasks and getting results in orderered and unordered fashion."
"Pileup shows the transposed view of the SAM alignment data, i.e. it shows the the reference positions"

vasudeva8 · 2024-03-26T11:17:37Z

Spell check and other corrections made.

jkbonfield

Comments on qtask_ordered.c only.

I'm tempted to say to drop this for this release if we want the rest of the changes merging, as it needs major work to fix race conditions.

samples/qtask_ordered.c

samples/qtask_unordered.c

jkbonfield · 2024-04-09T14:12:24Z

I have a rebased version here that we may want to use before doing further changes. It fixes that bizarre github induced merge commit.

I haven't squashed anything yet, so all commits are the same content, but we can do that when we're ready to merge.

vasudeva8 · 2024-05-01T13:32:12Z

thanks James, have used this rebased version and pushed it.
also the thread pool is shared with input and output files as well, to make it a better sample as mentioned.

jkbonfield · 2024-05-02T16:29:51Z

I'm still finding qtask_ordered to be considerably slower than it needs to be.

I reimplemented the qtask_ordered problem using my own logic, mostly copied from the way htslib does multi-threading of SAM by generating blocks of BAM records to operate on and separate reading and writing threads to keep things running.

My code for this is here: https://github.com/jkbonfield/htslib/blob/PR_1666d/tv2.c

$ for i in `seq 1 5`;do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\
t%e elapsed\t%P %%CPU' ./qtask_ordered ~/scratch/data/novaseq.10m.bam 16 /tmp;done
35.51 user	7.23 system	10.69 elapsed	399% %CPU
37.74 user	7.68 system	10.43 elapsed	435% %CPU
36.11 user	7.90 system	11.44 elapsed	384% %CPU
38.03 user	7.74 system	10.60 elapsed	431% %CPU
38.71 user	8.28 system	12.48 elapsed	376% %CPU

So the original is typically 10-11s and maybe around 400% CPU utilisation.

$ for i in `seq 1 5`;do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ../tv2 -@16 ~/scratch/data/novaseq.10m.bam /tmp/_.sam > /dev/null;done
18.33 user	3.06 system	2.33 elapsed	916% %CPU
19.42 user	3.43 system	2.67 elapsed	855% %CPU
19.27 user	3.56 system	2.64 elapsed	862% %CPU
18.40 user	3.05 system	2.40 elapsed	890% %CPU
19.43 user	3.54 system	2.70 elapsed	848% %CPU

This is 850-900% CPU utilisation and ~2.5s. So around 4x quicker. It's clear there is also a considerable difference in CPU usage, which I can't explain yet, but possibly it's the optimisations in the core counting algorithm. (I should probably remove those as it's not good at demonstrating the benefits of threading when the task we are performing is so quick.)

The output of out.sam and _.sam are identical, although I had to change my code a bit because we counted differently. I think the GC content in this PR is subtly wrong in that it is numbers of G and C over the sequence length, not over the numbers of A, C, G and T. It only differs when Ns are present (or ambiguity codes, but they didn't occur). We don't know what the N base is so it shouldn't be counted in the maths.

Anyway I digress. If we want a demonstration of threading then it needs to be performant.

I also have a different version of this (tv1) which is purely counting the frequency of each base type and reporting summary statistics. That doesn't write anything back. However it means the order in which we do our sums is irrelevant as A + B + C is the same as A + C + B. Hence this becomes a trivial change to unordered threads instead. With hindsight I think that's a simpler example, but the one here may still be relevant too and has practical purpose.

jkbonfield · 2024-05-02T16:59:22Z

I tried changing qtask_ordered to use "wb" to write to "out.bam" instead. It obviously then uses more CPU up as compression of the output becomes significant, but it's still behind tv2's BAM output.

#qtask_ordered
100.16 user	4.93 system	10.18 elapsed	1031% %CPU
100.79 user	5.13 system	10.42 elapsed	1016% %CPU
100.58 user	4.86 system	10.18 elapsed	1035% %CPU
100.91 user	5.17 system	10.43 elapsed	1016% %CPU
100.39 user	5.15 system	10.44 elapsed	1010% %CPU

#tv2
84.84 user	2.06 system	6.31 elapsed	1375% %CPU
84.44 user	1.72 system	5.93 elapsed	1452% %CPU
84.55 user	1.98 system	5.93 elapsed	1458% %CPU
84.77 user	1.73 system	5.94 elapsed	1454% %CPU
84.38 user	1.81 system	6.02 elapsed	1431% %CPU

It's interesting we still see a signficant "system" time difference, which could be something to do with malloc/free and zeroing pages, or it could be something else such as pthread interfaces. I haven't profiled it to find out.

On a larger file this seems to be more pronounced too:

#qtask_ordered
1226.54 user	290.74 system	316.69 elapsed	479% %CPU

#tv2
710.26 user	17.12 system	49.46 elapsed	1470% %CPU
715.45 user	15.22 system	50.37 elapsed	1450% %CPU

System time here is vastly bigger, which is a big hint. Note it's possible to profile the main thread and main thread only, to see where it's spending the CPU time. That in turn may explain why it can't parallelise as well as it ought to. Eg:

./qtask_ordered /local/scratch01/jkb/SRR6882909.name-.bam 16 /tmp & perf record -t $!
[ Wait and then control-C ]

perf report then shows:
  50.93%  qtask_ordered  [kernel]            [k] 0xffffffff99800190                                   ▒
  15.16%  qtask_ordered  libc-2.27.so        [.] _int_malloc                                          ▒
   9.10%  qtask_ordered  libc-2.27.so        [.] calloc                                               ▒
   6.76%  qtask_ordered  libc-2.27.so        [.] __libc_malloc                                        ▒
   4.54%  qtask_ordered  qtask_ordered       [.] bam_read1                                            ▒
   2.64%  qtask_ordered  libc-2.27.so        [.] __memmove_sse2_unaligned_erms                        ▒
   1.75%  qtask_ordered  qtask_ordered       [.] bgzf_read                                            ▒
   1.67%  qtask_ordered  libc-2.27.so        [.] __lll_lock_wait_private                              ▒
   1.11%  qtask_ordered  qtask_ordered       [.] sam_realloc_bam_data                                 ▒
   1.07%  qtask_ordered  qtask_ordered       [.] sam_read1                                            ▒
   0.82%  qtask_ordered  libc-2.27.so        [.] memcpy@GLIBC_2.2.5                                   ▒
   0.74%  qtask_ordered  qtask_ordered       [.] main                                                 ▒
   0.63%  qtask_ordered  qtask_ordered       [.] bam_tag2cigar                                        ▒

It's somewhere in the kernel, but that's probably memory page clearing given the other high CPU functions. Looking at it, it seems to be spending most of its time allocating bam objects. This is one way my implementation majorly differs. I don't use bam_init1 and I don't free the bams between successive calls. I have a queue of bam arrays that I push onto and pull from, so I can reuse my bam-array from a previous job. Don't give the memory back to the system until program exit and it'll be MUCH faster!

vasudeva8 · 2024-05-03T11:59:49Z

Thanks for the sample James.

The aim of the sample was just to demonstrate the usage of queues and hts thread pool, specifically the 2 different ways of its usage, made as basic as possible for that purpose. It had no intention to show the user how best to use threading in general to get best performance.

Based on earlier reviews, I have incorporated the input and output file thread pool sharing as it gives some direction on hts thread pool usage. And taking it to next level with separate read/write threads may make the sample complex and shifts the objective of it.

But if we need one such sample, may be we should add as a separate one? Or just refer the user to a samtools functionality?

update with correction fixed issues

Use hts_tpool_next_result_wait() in threadfn_orderedwrite() so it blocks properly when it has nothing to do. Remove lock and conditional that are no longer needed. Use a sentinel job to tell threadfn_orderedwrite() that there is no more data and it can terminate. Improve error handling. Add a data::result field so that threadfn_orderedwrite() can report back to the main thread if it encountered a problem. Ensure everything is cleaned up correctly if either the main thread or threadfn_orderedwrite() fail, and in the right order. Set the return value to EXIT_FAILURE if either sam_close() or threadfn_orderedwrite() report failure. Ensure error messages are written to stderr.

vasudeva8 · 2024-07-02T15:36:40Z

Thanks James and Rob.
I have merged and updated the samples. Now they use alignments in chunks and reuses memory. Also the removed lookup / decoding of base from loop to outside the loop.
The unordered processing now just dumps the count of bases and gc ratio.

jkbonfield · 2024-07-04T16:12:02Z

I've not checked the code yet, but I can verify the speed is now similar to my own implementations.

I think there are some minor niceties with this directory to fix too.

The default build is -O0 when it perhaps should be -O2. People can always override it with "CFLAGS=-g" on the make command line if they wish.
There are no dependencies, so if source is modified then it doesn't rebuild. It's just a matter of adding the source name in the build rules, e.g flags: flags.c

jkbonfield · 2024-07-17T08:59:44Z

samples/DEMO.md

+            if (!(data = faidx_fetch_seq64(idx, faidx_iseq(idx, tid), beg, end, &len))) {
+    ...
+                printf("Data: %"PRIhts_pos" %s\n", len, data);
+                free((void*)data);
+                data = NULL;
+                //get quality data for fastq
+                if (fmt == FAI_FASTQ) {
+                    if (!(data = faidx_fetch_qual64(idx, faidx_iseq(idx, tid), beg, end, &len))) {


I appreciate this is just code from the demo programs that's been cutdown with "..." to get to the essence of things, but it's losing the if-else structure which is highly confusing. We could add elses clauses, but I think really the correct solution is negating the if statements.

Ie:

if ((data = faidx_fetch_seq64(idx, faidx_iseq(idx, tid), beg, end, &len)) != NULL) { printf("Data: %"PRIhts_pos" %s\n", len, data); ...

We don't need the explicit error checking here as DEMO.md isn't designed as a guide for how to write bullet proof code and instead is trying to describe the most important features. However keeping an if statement with an explicit != NULL is a hint that it can fail and it's up to the programmer to handle that (and is demonstrated in the actual sample code).

It's probably easier for me to just go through this and make a new commit on to this PR than to document each case I find here.

The "..." to indicate clipped out code is useful as it removes a lot of the unnecessary boiler plate, especially error checking, and allows us to drill down to the essense of what we're trying to teach. However without indentation it's not clear what bits are being removed. Also there are times when the code removed is an error handler plus a subsequent else clause, which gives the false impression with the indentation that the if-clause has been negated. Eg: ... //select required field alone, this is useful for CRAM alone if (hts_set_opt(infile, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG) < 0) { ... //read header in_samhdr = sam_hdr_read(infile); ... Clearly the `sam_hdr_read` is not within the if-clause of `hts_set_opt` returning < 0, but it's how it reads because "..." hid too much. (In the above case just indenting the ... and un-indenting the hdr read call reverses the mental image.) My approach here is to replace all `...` error handlers with `... // error` and to indent them accordingly. Also where we only have a one line thing, removing open curly braces if we've cut the close curly brace gives a more sensible view. None of this is really changing the code we're showing, but presenting it in a more obvious manner to those that don't understand what is and isn't an error case.

samples/read_fast.c

jkbonfield · 2024-07-18T12:00:50Z

Squashed, rebased and merged as f8016c0.

I didn't force push back over this as you may wish to keep the separate commit histories as there was a lot of back and forth, so I'll leave you to decide whether to delete the branch.

daviesrob assigned jkbonfield Aug 22, 2023

jkbonfield reviewed Sep 25, 2023

View reviewed changes

vasudeva8 force-pushed the demo3 branch from bcdc7c8 to cadb6a2 Compare October 25, 2023 19:04

vasudeva8 force-pushed the demo3 branch from 5fc57aa to 498aac9 Compare March 26, 2024 11:28

jkbonfield reviewed Apr 9, 2024

View reviewed changes

samples/qtask_ordered.c Outdated Show resolved Hide resolved

samples/qtask_ordered.c Outdated Show resolved Hide resolved

jkbonfield reviewed Apr 9, 2024

View reviewed changes

samples/qtask_unordered.c Outdated Show resolved Hide resolved

vasudeva8 force-pushed the demo3 branch 3 times, most recently from 3422cdf to 20aaa40 Compare May 1, 2024 13:18

vasudeva8 and others added 5 commits July 2, 2024 16:07

a few more functionalities

66b0e14

merge

a238a75

update based on reviews

a73aaea

update with correction fixed issues

update to make faster

9ce3917

vasudeva8 force-pushed the demo3 branch from 20aaa40 to 9ce3917 Compare July 2, 2024 15:08

vasudeva8 and others added 2 commits July 10, 2024 17:35

update

2492496

Minor samples/*.md doc updates

d1c9efa

jkbonfield reviewed Jul 17, 2024

View reviewed changes

samples/read_fast.c Show resolved Hide resolved

update based on reviews

7d0c19e

jkbonfield closed this Jul 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Demo / sample code update #1666

Demo / sample code update #1666

vasudeva8 commented Aug 21, 2023

jkbonfield Sep 25, 2023

vasudeva8 Oct 24, 2023

jkbonfield Nov 7, 2023

vasudeva8 Mar 26, 2024

jkbonfield commented Nov 7, 2023

vasudeva8 commented Mar 26, 2024

jkbonfield left a comment

jkbonfield commented Apr 9, 2024

vasudeva8 commented May 1, 2024

jkbonfield commented May 2, 2024

jkbonfield commented May 2, 2024

vasudeva8 commented May 3, 2024

vasudeva8 commented Jul 2, 2024

jkbonfield commented Jul 4, 2024

jkbonfield Jul 17, 2024

jkbonfield commented Jul 18, 2024

Demo / sample code update #1666

Demo / sample code update #1666

Conversation

vasudeva8 commented Aug 21, 2023

jkbonfield Sep 25, 2023

Choose a reason for hiding this comment

vasudeva8 Oct 24, 2023

Choose a reason for hiding this comment

jkbonfield Nov 7, 2023

Choose a reason for hiding this comment

vasudeva8 Mar 26, 2024

Choose a reason for hiding this comment

jkbonfield commented Nov 7, 2023

vasudeva8 commented Mar 26, 2024

jkbonfield left a comment

Choose a reason for hiding this comment

jkbonfield commented Apr 9, 2024

vasudeva8 commented May 1, 2024

jkbonfield commented May 2, 2024

jkbonfield commented May 2, 2024

vasudeva8 commented May 3, 2024

vasudeva8 commented Jul 2, 2024

jkbonfield commented Jul 4, 2024

jkbonfield Jul 17, 2024

Choose a reason for hiding this comment

jkbonfield commented Jul 18, 2024