Optimize pileup calling for high throughput targeted amplicon bed files #308
Conversation
if row not in header:
    header.append(row)
contig_dict = defaultdict(str)
for vcf_fn in all_files:
The old sorting relied on the VCFs being created with the contig names in their filenames. This new process produces the same result, but the VCF files can be named anything.
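For readers following along, here is a minimal sketch of the header-driven merge-and-sort this comment describes: headers are deduplicated across files, records are grouped by the contig in their first column, and the contig order is recovered from the ##contig header lines rather than from the file names. The function and variable names are illustrative, not the exact Clair3 code.

```python
from collections import defaultdict

def merge_and_sort_vcfs(all_files, output_fn):
    header = []
    contig_dict = defaultdict(list)  # contig name -> record lines for that contig
    contig_order = []                # contig order taken from ##contig header lines

    for vcf_fn in all_files:
        with open(vcf_fn) as f:
            for row in f:
                if row.startswith('#'):
                    if row not in header:
                        header.append(row)
                    if row.startswith('##contig='):
                        contig = row.split('ID=')[1].split(',')[0].rstrip('>\n')
                        if contig not in contig_order:
                            contig_order.append(contig)
                    continue
                contig_dict[row.split('\t', 1)[0]].append(row)

    with open(output_fn, 'w') as out:
        out.writelines(header)
        for contig in contig_order:
            # within a contig, sort records by POS (second VCF column)
            out.writelines(sorted(contig_dict[contig], key=lambda r: int(r.split('\t')[1])))
```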
The old sorting is required because it avoids keeping a contig_dict with too many records; we noticed that without contig splitting, memory use in the sort_vcf submodule increases significantly.
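To illustrate the memory concern: holding every record of every contig in a single contig_dict means the peak memory of sort_vcf scales with the whole call set, whereas splitting by contig first bounds it by the largest single contig. A minimal sketch, assuming a hypothetical vcf_fn_of_contig helper that maps a contig name to its per-contig VCF (not the actual sort_vcf code):

```python
def sort_per_contig(contig_names, vcf_fn_of_contig, output_fn):
    with open(output_fn, 'a') as out:
        for contig in contig_names:
            records = []
            with open(vcf_fn_of_contig(contig)) as f:
                for row in f:
                    if row.startswith('#'):
                        continue
                    records.append(row)
            # only this contig's records are held in memory at any one time
            records.sort(key=lambda r: int(r.split('\t')[1]))
            out.writelines(records)
```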
Hey @zhengzhenxian, I'm curious why you merged this pull request if this sorting method will use too much memory. Should I create another pull request that changes this to be more similar to the old method but also handles the multi-contig VCFs the new code path creates?
I have reverted the code to the previous sorting logic after merging the PR.
@@ -372,17 +372,20 @@ def CheckEnvs(args):
                '[WARNING] Current maximum contig length {} is much smaller than default chunk size {}, You may set a smaller chunk size by setting --chunk_size=$ for better parallelism.'.format(
                    max(contig_length_list), DEFAULT_CHUNK_SIZE)))

-    if is_bed_file_provided:
+    if is_bed_file_provided and default_chunk_num > -1:
We had an issue where contigs with characters that are illegal in directory names were causing the analysis to crash here. This solves the issue for us, but it might be worth considering a fix that allows contigs to contain backslashes and other special characters in all cases.
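A minimal sketch of the alternative fix suggested above: map each contig name to a filesystem-safe directory name instead of using it verbatim. The safe_dir_name helper is hypothetical and not part of Clair3.

```python
import hashlib
import re

def safe_dir_name(contig_name):
    # replace path separators and other characters that are unsafe in
    # directory names, and append a short hash so distinct contigs that
    # sanitize to the same string still get distinct directories
    sanitized = re.sub(r'[^A-Za-z0-9._-]', '_', contig_name)
    digest = hashlib.md5(contig_name.encode()).hexdigest()[:8]
    return '{}_{}'.format(sanitized, digest)
```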
Received and testing.
The outer parallelization of calling variants with the pileup model using GNU parallel is inefficient for targeted amplicon BAM files where large portions of contigs have no supporting reads. In this case the outer parallelization strategy spins up many processes that consume CPU but do no useful work, because they are assigned regions of a contig where there is nothing to call.
This outer parallelization is also inefficient for small targeted amplicon panels, where we tend to run many samples in parallel and therefore want to reduce the number of threads a single sample takes to make room for the others. When many of those threads do nothing, the overhead is more costly: a sample sits idle for long stretches of the run.
To address this I've added new behavior that's invoked by setting chunk_num to -1 (0 already had a special meaning). In this case outer parallelization is disabled and replaced with inner parallelization within TensorFlow, and all candidates are batched within a single process. If a bed file is provided, we achieve further speedups by only looking for candidates in the regions defined by the bed file.
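A minimal sketch of how the chunk_num == -1 path is structured, assuming hypothetical find_candidates and call_batch callables standing in for the actual candidate search and pileup calling; the TensorFlow threading calls show the inner parallelization, and the BED parsing shows the restriction to target regions. None of the names below are the exact Clair3 entry points.

```python
import tensorflow as tf

def load_bed_regions(bed_fn):
    """Parse a BED file into (contig, start, end) tuples."""
    regions = []
    with open(bed_fn) as f:
        for line in f:
            if not line.strip() or line.startswith('#'):
                continue
            contig, start, end = line.split('\t')[:3]
            regions.append((contig, int(start), int(end)))
    return regions

def run_pileup_single_process(bed_fn, threads, find_candidates, call_batch):
    # no outer GNU-parallel fan-out: one process covers every region and
    # relies on TensorFlow's internal thread pools for parallelism
    tf.config.threading.set_intra_op_parallelism_threads(threads)
    tf.config.threading.set_inter_op_parallelism_threads(threads)

    candidates = []
    for contig, start, end in load_bed_regions(bed_fn):
        # candidate search is restricted to the bed regions, skipping the
        # long stretches of each contig with no supporting reads
        candidates.extend(find_candidates(contig, start, end))

    # every candidate is batched and called inside this single process
    call_batch(candidates)
```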
Using an in-house targeted amplicon bed file for profiling, I compared the original chunk_num==0 behavior to the new behavior in both the single-threaded and threads=4 cases. Note that the runtime reported is for the entire analysis, not just the optimized pileup process.
Note that chunk_num==-1 is not appropriate for whole genome analysis because it uses too much RAM. I have an additional commit that fixes that issue, but the old behavior is still significantly faster than the new behavior when coverage is broad enough for the GNU parallel threads to be fully utilized.
I've made this change only for the Cffi portion of the code because that's what we use.
fixes #306