forked from samtools/bcftools
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mpileup speedup #2
Open
jkbonfield
wants to merge
242
commits into
pd3:develop
Choose a base branch
from
jkbonfield:mpileup_speedup
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- rewritten for greater experimental flexibility - the AA peak can be included in the fit (the -i option) - the minimum fraction of aberrant cells as a command line parameter (-m) - control the minimum cn4 bump size (-b)
…lute; hidden --force-cn option for debugging
This was hidden as it was/is experimental, but with the licensing additions, users are explicitly requesting the command by compiling with USE_GPL=1, so we might as well display in the help message. Closes samtools#280
updates to polysomy command for publication
updates to roh command for publication
…ings: <snps|indels|both|all|none>
bcf_hdr_combine deprecated and replaced by bcf_hdr_merge in htslib (2bb9370f5a24938d8a2dc56f404e584661bf413f). Fixes samtools#208
…erying FMT vectors, such as PL{0}
- CNx state removed - automatic optimization of parameters to increase sensitivity when only a fraction of cells is aberrant (at the cose of decreased specificity) - LRR smoothing
See also 1f81d25 and samtools/htslib@3fcf7c9 Fixes samtools#285 Petentially fixes samtools#284
…ferences in FMT/AD
* add tests for samtools#452 and samtools#439 Resolves samtools#452
Not functional yet. Just copying over the files. * bam_plcmd.c for the mpileup command * bam2bcf.[ch] bam2bcf_indel.c for the VCF/BCF creation * sample.[ch] for RG:SM handling
minimal changes to files copied from samtools in order to compile in bcftools. * use regidx from htslib rather than bedidx from samtools * remove sam_opts calls as sam_opts.h not copied from samtools Todo: * copy over relevant functionality from sam_opts.h * remove text based mpileup output * update options and defaults * bring over `---gvcf` and other changes from @pd3 fork
* deprecate `-g -v -u` options (still functional, but with warning) * exit with message to use `samtools mpileup` if `-s/--output-MQ` used * `-O` option was `--output-BP` and is not `--output-type` for consistency with other bcftools commands. If `[buzv]` not given as an option will warn These catches for old text output options are probably not necessary as users may not expect text output from `bcftools mpileup`.
* mpileup.1.out, mpileup.2.out, mpileup.3.out and mpileup.4.out are from samtools with mpileup.1.out and mpileup.3.out converted from the text output * mpileup.5.out a new test with the newer AD, etc annotations * sam/bam/cram test files all stored. perhaps there is some way to store one version and convert within the test ala the vcf-miniview in samtools?
adds to @4e7c8fb86349761fed1b290357dbc792222ecdcb
* remove deprecated `-g`, `-v`, `-u`, `-D`, `-V`, `-S` * remove `-R` short option to make way for `--regions-file` option later
This commit brings over the `--gvcf` functionality from @pd3's branch, consisting of relevant bits from cf3219c and ee8210d Reference only blocks will be merged into gVCF blocks when the minimum per-sample depth falls in the intervals defined by the argument to the `-gvcf` option. Documentaion added to explain the merging and a test added.
pulling over of cf5c354 adding in `-S,--samples-file` option and exiting if no samples are read from the file or list TODO: add exclude logic with `^` prefix as in other bcftools commands removed `config.h` from `sample.c` as leftover this is in samtools, but not bcftools at the moment
switch `-t/--output-tags` option to `-a/--annotate` to make room for the `-t/--targets` option available annotations are now listed on request with `-a ?` rather than cluttering up the help output.
This is meant as a temporary change while we extend the regidx api, but allow bcftools code to use these changes before they appear in some form in htslib. This commit does not add new features, just copies over `regidx.[ch]` and rejiggers the linking to use these local bcftools copies. the `*_c` are removed due to relying on `hts_internal.h` (see fc9aeb6f77668afed412119701c5c58b0fca8091)
* added functions to loop over all regions * lazy index build in case random access is not required * support for chromosome names only, beg-end coordinates not mandatory * set cap at maximum coordinate at 2147483647, hts_itr does not support larger * tab and reg parsers will throw on finding a `0` to catch user error of using 0-based rather than 1-based coords
* `-r/--region` replaced by `-r/--regions` which will accept a comma separated list of regions as in other bcftools commands. `--region` still accepted * `-R/--regions-file` option added to read regions from a file This commit lifts over work originally done in cfd7cf9 Note: when more than one region is given, all indices are stored in memory, which can be a problem when running on many bams. An alternative would be to cache pre-filled `hts_itr`'s for each region. Resolves samtools#369
…ools commands the point of `--no-version` is to remove invocation specify metadata in the header lines for pipeline systems that are tracking this separately. we are outputting the `##reference` line though in mpileup. could drop this as well when `--no-version` used. seems silly to add a separate option.
* prefix with ^ to negate the selection * assign/rename samples by providing second field: RG_ID_1 SAMPLE_A RG_ID_2 SAMPLE_A RG_ID_3 SAMPLE_B * on read group name conflict give the alignment file, asterisk for all reads in the file: RG_ID_1 FILE_1.bam SAMPLE_A RG_ID_2 FILE_2.bam SAMPLE_A * FILE_3.bam SAMPLE_C Resolves 4th item in samtools#414 (comment) and samtools/samtools#324.
Our first foray into exploiting this is to cache the bam_smpl_get_sample_id return value. We compute this once in the constructor (the first time we see a new bam1_t) instead of for every pileup location in group_smpl. TO DO: Group_smpl itself could now become distributed perhaps. Rather than an N*M loop clustering all sample IDs together, each new bam could be added to sample struct on first appearance and removed when it goes out of sight. The slight caveat preventing this from being implemented immediately is that the constructor/destructors are called for every BAM overlapping the region rather than every filtered base that ultimately ends up in a pileup. Indeed sometimes we get constructors for reads entirely filtered out.
Mann Whitney test now uses a precomputed table instead of continually calculating the same values many times over. calc_mwu_bias now has short-cuts to compute the result faster when one or both of a[] and b[] hold zero values.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Speed ups to bcftools mpileup -Ou file.bam (ie simplest mode, default options, no ref and uncompressed BCF output).
The get_position() function now caches the read length so it doesn't have to scan through the entire cigar string for each and every base it operates on. This was O(N^2) complexity.
WARNING: it does this by shoehorning it into an unused field in BAM.
TODO: bite the bullet and break the ABI so we can put something into pileup1_t instead. I'll leave this up to you to thrash out. Probably fix the ghastly "aux" name too while at it. :-)
The speed gain is HUGE on long low-accuracy reads.
Removal of some floating point divisions in bcf_call_glfgen(). We had things like
baseQ/60.0 * nqual
. Given the size of baseq and nqual we can do(baseQ*nqual) / 60
in integer instead.I tried changing the epos division too but it produces different results. This is actually due to a bug in the existing code with rounding errors. E.g. the integer math got a correct 58 while the floating point match got 57.9999999 and rounded down to 57.
If you're happy to have different results, consider changing the epos calc to:
int epos = ((int64_t)pos*bca->npos)/(len+1);
. Instead I moved the initial division earlier up the function to give it time for the result to be computed before we use it.The Mann Whitney 1947 function is now precomputed for all the values potentially used in this code. See mw.h. Anything outside the bounds is computed on the fly as before, just incase I missed something.
Calc_mwu_bias() main loop now has specific code for dealing with one or both of the a[] and b[] arrays being zero. This seems to be a significant speed gain given the sparsity of these arrays.
Lots of "TRACK_EXTENTS" bits; see the #define in bcf2bam.h.
Basically this keeps track of the maximum values filled out in the
bca->{ref,alt}_{epos,ibq,imq}
arrays. It then uses these to shortcut some of the whole-array calculations as once it hits the runs of zeros the results don't change. The speed difference is very slight though. I leave it there for you to test, but the ifdefs show clearly how to cull it if you deem it not worth the extra complexity.We no longer query the RG field over and over again for each base in the read. There's nowhere to put the value though! So in a complete hack I stashed it in the bam insert size. I don't recommend keeping this as is, but it works for now until we figure out the correct form of the new ABI/API. This is simply demonstration that caching such things is a significant win.
Other potential improvements:
The main loop is still column by column and within each column then seq by seq. We have many cases where the same pileup array occurs for many columns, but we always throw it all away and recompute. All we really need to do is a memmove left 1 byte.
Similarly when finding the sample groupings, we do this over and over again despite the fact they're not changing.
Fixing this is a larger restructuring that I didn't want to do myself. It is left as an exercise to the reader, but there are potentially huge speed gains to be had here.
On an Illumina data set (10 million reads), the timings went from 21m15s to 12m33s (approx 70% faster). On a PacBio data set the time dropped from 5m17s to 0m13s!