
Optimize pileup calling for high throughput targeted amplicon bed files #308

Merged · 4 commits · Jul 26, 2024

Conversation

@Permafacture commented Jun 3, 2024

The outer parallelization of calling variants with the pileup model using GNU parallel is inefficient for targeted amplicon BAM files where large portions of contigs have no supporting reads. In this case, the outer parallelization strategy spins up many processes that use CPU but do no work because they are assigned regions of a contig where there is nothing to do.

Outer parallelization is also inefficient for small targeted amplicon panels, where we tend to analyze many samples in parallel and therefore want each sample to use as few threads as possible to make room for the others. When many of those threads do nothing, the overhead is costlier still: a sample can sit idle for long periods.

To address this, I've added new behavior that's invoked by setting chunk_num to -1 (0 already had a special meaning). In this case outer parallelization is disabled and replaced with inner parallelization within TensorFlow, and all candidates are batched within a single process. If a BED file is provided, we achieve further speedups by only looking for candidates in the regions it defines.
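For readers skimming the diff, here is a minimal sketch of the dispatch this describes (function and variable names are illustrative, not Clair3's actual API):

```python
def plan_calling(chunk_num, contigs, bed_regions=None):
    """Illustrative sketch only, not Clair3's actual code: chunk_num == -1
    selects the new single-process path; other values keep the original
    GNU parallel chunking."""
    if chunk_num == -1:
        # New path: one process batches all candidates and relies on
        # TensorFlow's internal (intra-op) threading; if a BED file was
        # given, only its regions are searched for candidates.
        return [("single_process", bed_regions or contigs)]
    # Original path: split work into chunks, one worker process per
    # chunk, scheduled by GNU parallel.
    n = chunk_num if chunk_num > 0 else len(contigs)
    return [("worker", contigs[i::n]) for i in range(n)]

# Example: the new behavior with a BED restriction.
print(plan_calling(-1, ["chr1", "chr2"], bed_regions=[("chr1", 100, 500)]))
```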

Using an in-house targeted amplicon BED file for profiling, I compared the original chunk_num==0 behavior with the new behavior in both the single-threaded and threads=4 cases. Note that the reported runtime covers the entire analysis, not just the optimized pileup step.

| chunk_num | BED provided | threads | wall clock execution time |
|-----------|--------------|---------|---------------------------|
| 0         | no           | 1       | 9m                        |
| -1        | no           | 1       | 1m 19s                    |
| -1        | yes          | 1       | 43s                       |
| 0         | no           | 4       | 2m 47s                    |
| -1        | no           | 4       | 51s                       |
| -1        | yes          | 4       | 26s                       |

Note that chunk_num==-1 is not appropriate for whole-genome analysis because it uses too much RAM. I have an additional commit that fixes that issue, but the old behavior is still significantly faster than the new one when coverage is broad enough for the GNU parallel threads to be fully utilized.
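As a rough illustration of why whole-genome runs blow up, and of one way the memory could be bounded (a hypothetical sketch, not the actual follow-up commit), candidates can be flushed in fixed-size batches instead of being accumulated all at once:

```python
def batched(candidates, batch_size=2000):
    """Yield fixed-size batches so peak RAM stays bounded no matter how
    many candidates the genome produces (illustrative sketch; the
    follow-up commit may differ)."""
    batch = []
    for cand in candidates:
        batch.append(cand)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```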

I've made this change only for the CFFI portion of the code because that's what we use.

fixes #306

```python
if row not in header:
    header.append(row)
# accumulate records per contig across all input VCFs
contig_dict = defaultdict(str)
for vcf_fn in all_files:
```
Permafacture (Author) commented:
The old sorting relied on VCFs created with the contig names in them. This new process has the same result, but the VCF files can be named anything.
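A minimal sketch of filename-independent sorting along these lines, assuming tab-separated VCF body rows with the contig in column 1 and the position in column 2 (not the exact code in this PR):

```python
from collections import defaultdict

def collect_and_sort(all_files):
    """Group records by the contig named inside each row, so the input
    VCF files themselves can be named anything (illustrative sketch)."""
    header, contig_dict = [], defaultdict(list)
    for vcf_fn in all_files:
        with open(vcf_fn) as f:
            for row in f:
                if row.startswith("#"):
                    if row not in header:  # deduplicate shared header rows
                        header.append(row)
                elif row.strip():
                    contig_dict[row.split("\t", 1)[0]].append(row)
    for records in contig_dict.values():
        records.sort(key=lambda r: int(r.split("\t")[1]))  # sort by POS
    return header, contig_dict
```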

zhengzhenxian (Collaborator) commented:
The old sorting is required because it does not maintain a contig_dict with too many records; we noticed that without contig splitting, memory use in the sort_vcf submodule increases significantly.
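For contrast, a sketch of the contig-split approach described here, which holds only one contig's records in memory at a time (illustrative, not the actual sort_vcf implementation; header handling is omitted):

```python
def sort_vcf_per_contig(all_files, contig_order, out_fn):
    """Process one contig per pass so memory is bounded by the largest
    contig's record count rather than the whole genome's."""
    with open(out_fn, "w") as out:
        for contig in contig_order:
            records = []
            for vcf_fn in all_files:
                with open(vcf_fn) as f:
                    for row in f:
                        if not row.startswith("#") and row.split("\t", 1)[0] == contig:
                            records.append(row)
            records.sort(key=lambda r: int(r.split("\t")[1]))
            out.writelines(records)
```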

Permafacture (Author) commented:
Hey @zhengzhenxian, I'm curious why you merged this pull request if this sorting method uses too much memory. Should I create another pull request to change this so it is more similar to the old method but also handles the multi-contig VCF the new code path creates?

zhengzhenxian (Collaborator) commented:
I have reverted the code to the previous sorting logic after merging the PR.

```diff
@@ -372,17 +372,20 @@ def CheckEnvs(args):
             '[WARNING] Current maximum contig length {} is much smaller than default chunk size {}, You may set a smaller chunk size by setting --chunk_size=$ for better parallelism.'.format(
                 max(contig_length_list), DEFAULT_CHUNK_SIZE)))

-    if is_bed_file_provided:
+    if is_bed_file_provided and default_chunk_num > -1:
```
Permafacture (Author) commented:
We had an issue where contigs with characters that are illegal in directory names caused the analysis to crash here. This solves the issue for us, but it might be worth considering a fix that allows contigs to have backslashes and other special characters in all cases.
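One possible shape for such a fix: a hypothetical helper (not code from this PR) that maps contig names onto filesystem-safe directory names:

```python
import re

def safe_dirname(contig):
    """Hypothetical helper: replace path separators and other characters
    that are illegal or risky in directory names. Note that distinct
    contigs could collide after sanitizing, so a real fix would need to
    guard against that."""
    return re.sub(r'[\\/:*?"<>|]', "_", contig)

# e.g. an HLA contig such as 'HLA-DRB1*15:01:01' becomes 'HLA-DRB1_15_01_01'
```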

@aquaskyline (Member) commented:
Received and testing.

@zhengzhenxian merged commit 4ac0590 into HKU-BAL:main on Jul 26, 2024.
Successfully merging this pull request may close: Inefficient in high-throughput, targeted amplicon use case (#306).