-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Demo / sample code update #1666
Conversation
samples/write_fast.c
Outdated
//close and index it | ||
sam_close(outfile); outfile = NULL; | ||
if (fai_build3(outname, NULL, NULL) == -1) { | ||
printf("Indexing failed with %d\n", errno); | ||
goto end; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced this makes a good demonstration.
Fasta indices are good for references as they allow us to extract small portions of a large reference. However they are wholely inappropriate for NGS data, which is typically what we have in a SAM/BAM file.
Basically fasta/fastq are split personally and used for two different types of data. Messy.
We may be better having a tiny demo of a separate "samtools faidx" index builder instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This writes reference data in fasta/fastq format and showcases creation of index for later use, while reading.
It is not using NGS data, and uses sam_open/write/close and bam1_t for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-reading DEMO.md I see it explains the indexing is suitable for reference data.
As this is for education purposes, maybe put a comment just before the fai_build3
call: // NB: this is only suitable for fasta/fastq files of reference sequences, and not sequencing reads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separated as a different samples and mentioned as indexing of reference data.
I ran a spell checker over the DEMO and README files. Highlighted in bold. Some were there before, hence not easy to add comments to lines here, but it's trivial to search and fix. Lines in README.md: DEMO.md: |
Spell check and other corrections made. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments on qtask_ordered.c only.
I'm tempted to say to drop this for this release if we want the rest of the changes merging, as it needs major work to fix race conditions.
I have a rebased version here that we may want to use before doing further changes. It fixes that bizarre github induced merge commit. I haven't squashed anything yet, so all commits are the same content, but we can do that when we're ready to merge. |
3422cdf
to
20aaa40
Compare
thanks James, have used this rebased version and pushed it. |
I'm still finding qtask_ordered to be considerably slower than it needs to be. I reimplemented the qtask_ordered problem using my own logic, mostly copied from the way htslib does multi-threading of SAM by generating blocks of BAM records to operate on and separate reading and writing threads to keep things running. My code for this is here: https://github.com/jkbonfield/htslib/blob/PR_1666d/tv2.c
So the original is typically 10-11s and maybe around 400% CPU utilisation.
This is 850-900% CPU utilisation and ~2.5s. So around 4x quicker. It's clear there is also a considerable difference in CPU usage, which I can't explain yet, but possibly it's the optimisations in the core counting algorithm. (I should probably remove those as it's not good at demonstrating the benefits of threading when the task we are performing is so quick.) The output of out.sam and _.sam are identical, although I had to change my code a bit because we counted differently. I think the GC content in this PR is subtly wrong in that it is numbers of G and C over the sequence length, not over the numbers of A, C, G and T. It only differs when Ns are present (or ambiguity codes, but they didn't occur). We don't know what the N base is so it shouldn't be counted in the maths. Anyway I digress. If we want a demonstration of threading then it needs to be performant. I also have a different version of this (tv1) which is purely counting the frequency of each base type and reporting summary statistics. That doesn't write anything back. However it means the order in which we do our sums is irrelevant as A + B + C is the same as A + C + B. Hence this becomes a trivial change to unordered threads instead. With hindsight I think that's a simpler example, but the one here may still be relevant too and has practical purpose. |
I tried changing qtask_ordered to use "wb" to write to "out.bam" instead. It obviously then uses more CPU up as compression of the output becomes significant, but it's still behind tv2's BAM output.
It's interesting we still see a signficant "system" time difference, which could be something to do with malloc/free and zeroing pages, or it could be something else such as pthread interfaces. I haven't profiled it to find out. On a larger file this seems to be more pronounced too:
System time here is vastly bigger, which is a big hint. Note it's possible to profile the main thread and main thread only, to see where it's spending the CPU time. That in turn may explain why it can't parallelise as well as it ought to. Eg:
It's somewhere in the kernel, but that's probably memory page clearing given the other high CPU functions. Looking at it, it seems to be spending most of its time allocating bam objects. This is one way my implementation majorly differs. I don't use |
Thanks for the sample James. The aim of the sample was just to demonstrate the usage of queues and hts thread pool, specifically the 2 different ways of its usage, made as basic as possible for that purpose. It had no intention to show the user how best to use threading in general to get best performance. Based on earlier reviews, I have incorporated the input and output file thread pool sharing as it gives some direction on hts thread pool usage. And taking it to next level with separate read/write threads may make the sample complex and shifts the objective of it. But if we need one such sample, may be we should add as a separate one? Or just refer the user to a samtools functionality? |
update with correction fixed issues
Use hts_tpool_next_result_wait() in threadfn_orderedwrite() so it blocks properly when it has nothing to do. Remove lock and conditional that are no longer needed. Use a sentinel job to tell threadfn_orderedwrite() that there is no more data and it can terminate. Improve error handling. Add a data::result field so that threadfn_orderedwrite() can report back to the main thread if it encountered a problem. Ensure everything is cleaned up correctly if either the main thread or threadfn_orderedwrite() fail, and in the right order. Set the return value to EXIT_FAILURE if either sam_close() or threadfn_orderedwrite() report failure. Ensure error messages are written to stderr.
Thanks James and Rob. |
I've not checked the code yet, but I can verify the speed is now similar to my own implementations. I think there are some minor niceties with this directory to fix too.
|
samples/DEMO.md
Outdated
if (!(data = faidx_fetch_seq64(idx, faidx_iseq(idx, tid), beg, end, &len))) { | ||
... | ||
printf("Data: %"PRIhts_pos" %s\n", len, data); | ||
free((void*)data); | ||
data = NULL; | ||
//get quality data for fastq | ||
if (fmt == FAI_FASTQ) { | ||
if (!(data = faidx_fetch_qual64(idx, faidx_iseq(idx, tid), beg, end, &len))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate this is just code from the demo programs that's been cutdown with "..." to get to the essence of things, but it's losing the if-else structure which is highly confusing. We could add elses clauses, but I think really the correct solution is negating the if statements.
Ie:
if ((data = faidx_fetch_seq64(idx, faidx_iseq(idx, tid), beg, end, &len)) != NULL) {
printf("Data: %"PRIhts_pos" %s\n", len, data);
...
We don't need the explicit error checking here as DEMO.md isn't designed as a guide for how to write bullet proof code and instead is trying to describe the most important features. However keeping an if statement with an explicit != NULL
is a hint that it can fail and it's up to the programmer to handle that (and is demonstrated in the actual sample code).
It's probably easier for me to just go through this and make a new commit on to this PR than to document each case I find here.
The "..." to indicate clipped out code is useful as it removes a lot of the unnecessary boiler plate, especially error checking, and allows us to drill down to the essense of what we're trying to teach. However without indentation it's not clear what bits are being removed. Also there are times when the code removed is an error handler plus a subsequent else clause, which gives the false impression with the indentation that the if-clause has been negated. Eg: ... //select required field alone, this is useful for CRAM alone if (hts_set_opt(infile, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG) < 0) { ... //read header in_samhdr = sam_hdr_read(infile); ... Clearly the `sam_hdr_read` is not within the if-clause of `hts_set_opt` returning < 0, but it's how it reads because "..." hid too much. (In the above case just indenting the ... and un-indenting the hdr read call reverses the mental image.) My approach here is to replace all `...` error handlers with `... // error` and to indent them accordingly. Also where we only have a one line thing, removing open curly braces if we've cut the close curly brace gives a more sensible view. None of this is really changing the code we're showing, but presenting it in a more obvious manner to those that don't understand what is and isn't an error case.
Squashed, rebased and merged as f8016c0. I didn't force push back over this as you may wish to keep the separate commit histories as there was a lot of back and forth, so I'll leave you to decide whether to delete the branch. |
Samples updated with fasta/fastq index using, tabix index and thread pool queueing of tasks.