
Optimize pileup calling for high throughput targeted amplicon bed files #308

Merged · 4 commits · Jul 26, 2024

Conversation

@Permafacture commented Jun 3, 2024

The outer parallelization of calling variants with the pileup model using GNU parallel is inefficient for targeted amplicon BAM files where large portions of contigs have no supporting reads. In this case, the outer parallelization strategy spins up many processes that use CPU but do no work because they are assigned regions of a contig where there is nothing to do.

Outer parallelization is also inefficient for small targeted amplicon panels, where we tend to analyze many samples in parallel and therefore want each sample to use as few threads as possible to make room for the others. When many of those threads do nothing, the overhead is costlier still: a sample can sit idle for long periods.

To address this, I've added new behavior that's invoked by setting chunk_num to -1 (0 already had a special meaning). In this case outer parallelization is disabled and replaced with inner parallelization within TensorFlow, and all candidates are batched within a single process. If a BED file is provided, we achieve further speedups by only looking for candidates in the regions it defines.
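For readers skimming the diff, here is a minimal sketch of the dispatch this describes (function and variable names are illustrative, not Clair3's actual API):

```python
def plan_calling(chunk_num, contigs, bed_regions=None):
    """Illustrative sketch only, not Clair3's actual code: chunk_num == -1
    selects the new single-process path; other values keep the original
    GNU parallel chunking."""
    if chunk_num == -1:
        # New path: one process batches all candidates and relies on
        # TensorFlow's internal (intra-op) threading; if a BED file was
        # given, only its regions are searched for candidates.
        return [("single_process", bed_regions or contigs)]
    # Original path: split work into chunks, one worker process per
    # chunk, scheduled by GNU parallel.
    n = chunk_num if chunk_num > 0 else len(contigs)
    return [("worker", contigs[i::n]) for i in range(n)]

# Example: the new behavior with a BED restriction.
print(plan_calling(-1, ["chr1", "chr2"], bed_regions=[("chr1", 100, 500)]))
```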

Using an in-house targeted amplicon BED file for profiling, I compared the original chunk_num==0 behavior with the new behavior in both the single-threaded and threads=4 cases. Note that the reported runtime covers the entire analysis, not just the optimized pileup step.

| chunk_num | BED provided | threads | wall clock execution time |
|-----------|--------------|---------|---------------------------|
| 0         | no           | 1       | 9m                        |
| -1        | no           | 1       | 1m 19s                    |
| -1        | yes          | 1       | 43s                       |
| 0         | no           | 4       | 2m 47s                    |
| -1        | no           | 4       | 51s                       |
| -1        | yes          | 4       | 26s                       |

Note that chunk_num==-1 is not appropriate for whole-genome analysis because it uses too much RAM. I have an additional commit that fixes that issue, but the old behavior is still significantly faster than the new one when coverage is broad enough for the GNU parallel threads to be fully utilized.
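As a rough illustration of why whole-genome runs blow up, and of one way the memory could be bounded (a hypothetical sketch, not the actual follow-up commit), candidates can be flushed in fixed-size batches instead of being accumulated all at once:

```python
def batched(candidates, batch_size=2000):
    """Yield fixed-size batches so peak RAM stays bounded no matter how
    many candidates the genome produces (illustrative sketch; the
    follow-up commit may differ)."""
    batch = []
    for cand in candidates:
        batch.append(cand)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```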

I've made this change only for the CFFI portion of the code because that's what we use.

fixes #306

```python
if row not in header:
    header.append(row)
# accumulate records per contig across all input VCFs
contig_dict = defaultdict(str)
for vcf_fn in all_files:
```
Permafacture (Author) commented:
The old sorting relied on VCFs created with the contig names in them. This new process has the same result, but the VCF files can be named anything.
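A minimal sketch of filename-independent sorting along these lines, assuming tab-separated VCF body rows with the contig in column 1 and the position in column 2 (not the exact code in this PR):

```python
from collections import defaultdict

def collect_and_sort(all_files):
    """Group records by the contig named inside each row, so the input
    VCF files themselves can be named anything (illustrative sketch)."""
    header, contig_dict = [], defaultdict(list)
    for vcf_fn in all_files:
        with open(vcf_fn) as f:
            for row in f:
                if row.startswith("#"):
                    if row not in header:  # deduplicate shared header rows
                        header.append(row)
                elif row.strip():
                    contig_dict[row.split("\t", 1)[0]].append(row)
    for records in contig_dict.values():
        records.sort(key=lambda r: int(r.split("\t")[1]))  # sort by POS
    return header, contig_dict
```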

zhengzhenxian (Collaborator) commented:
The old sorting is required because it does not maintain a contig_dict with too many records; we noticed that without contig splitting, memory use in the sort_vcf submodule increases significantly.
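For contrast, a sketch of the contig-split approach described here, which holds only one contig's records in memory at a time (illustrative, not the actual sort_vcf implementation; header handling is omitted):

```python
def sort_vcf_per_contig(all_files, contig_order, out_fn):
    """Process one contig per pass so memory is bounded by the largest
    contig's record count rather than the whole genome's."""
    with open(out_fn, "w") as out:
        for contig in contig_order:
            records = []
            for vcf_fn in all_files:
                with open(vcf_fn) as f:
                    for row in f:
                        if not row.startswith("#") and row.split("\t", 1)[0] == contig:
                            records.append(row)
            records.sort(key=lambda r: int(r.split("\t")[1]))
            out.writelines(records)
```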

Permafacture (Author) commented:
Hey @zhengzhenxian, I'm curious why you merged this pull request if this sorting method uses too much memory. Should I create another pull request to change this so it is more similar to the old method but also handles the multi-contig VCF the new code path creates?

zhengzhenxian (Collaborator) commented:
I have reverted the code to the previous sorting logic after merging the PR.

```diff
@@ -372,17 +372,20 @@ def CheckEnvs(args):
             '[WARNING] Current maximum contig length {} is much smaller than default chunk size {}, You may set a smaller chunk size by setting --chunk_size=$ for better parallelism.'.format(
                 max(contig_length_list), DEFAULT_CHUNK_SIZE)))

-    if is_bed_file_provided:
+    if is_bed_file_provided and default_chunk_num > -1:
```
Permafacture (Author) commented:
We had an issue where contigs with characters that are illegal in directory names caused the analysis to crash here. This solves the issue for us, but it might be worth considering a fix that allows contigs to have backslashes and other special characters in all cases.
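One possible shape for such a fix: a hypothetical helper (not code from this PR) that maps contig names onto filesystem-safe directory names:

```python
import re

def safe_dirname(contig):
    """Hypothetical helper: replace path separators and other characters
    that are illegal or risky in directory names. Note that distinct
    contigs could collide after sanitizing, so a real fix would need to
    guard against that."""
    return re.sub(r'[\\/:*?"<>|]', "_", contig)

# e.g. an HLA contig such as 'HLA-DRB1*15:01:01' becomes 'HLA-DRB1_15_01_01'
```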

@aquaskyline (Member) commented:
Received and testing.

@zhengzhenxian merged commit 4ac0590 into HKU-BAL:main on Jul 26, 2024.
Successfully merging this pull request may close: Inefficient in high-throughput, targeted amplicon use case (#306).