Optimizing Performance

BioGraph uses significant CPU, memory, and I/O for BioGraph creation and analysis. There are several parameters that can be adjusted to achieve the best possible performance in your compute environment.

Running other processes on a machine that is actively creating or analyzing BioGraph files will negatively impact performance and should be avoided.

Threads

BioGraph commands attempt to make efficient use of multiple cores. Generally speaking, adding more CPU cores reduces the overall runtime. You can explicitly specify the number of concurrent threads with the --threads option. The default auto setting creates one thread per CPU core.

If your system has less than 2GB of RAM available per core, better performance can often be achieved by reducing --threads to fewer than the total number of cores. On memory-constrained systems, memory pressure can increase to the point that mmaps are no longer retained in memory and must be swapped in and out. Running with too many threads will lead to CPUs spending increasing time in the disk sleep state. A balanced system should show all working CPUs at close to 100% normal usage most of the time.
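
For example, to cap a memory-constrained 32-core machine at 24 worker threads (the thread count here is illustrative; tune it to your own hardware):

(bg7)$ biograph full_pipeline ... --threads 24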

Memory

The biograph create command requires a minimum of 32GB of RAM for very small datasets, and at least 48GB (the default setting) for typical human datasets. Additional RAM will significantly speed up the create step. You can increase the memory limit with the --max-mem option. Increase this value for larger datasets on systems with sufficient memory. For a 30x human dataset, 64GB works well.

Be sure to always leave at least 20GB free for system processes. For example, on a system with 128GB of RAM, setting --max-mem 100 provides plenty of memory to BioGraph for most datasets while still leaving room for the system. This setting applies only to biograph create and is not used by other BioGraph steps.
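
As a concrete sketch, on the 128GB system described above:

(bg7)$ biograph create ... --max-mem 100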

transparent_hugepage support

Memory performance may be further improved by enabling system-wide transparent hugepage support in the Linux kernel. To see if transparent hugepage support is enabled on your system:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

There are three possible settings: always, madvise, and never.

The always and never settings are self-explanatory. The madvise setting in theory allows software to choose to enable hugepage support, but does not guarantee huge pages in all circumstances. BioGraph will request hugepage support but the kernel may not grant it when set to madvise.

Selecting always will fully enable the feature:

$ echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/enabled

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Note that hugepage support may degrade performance for some workloads. It should be set to always only on systems dedicated to running BioGraph or that otherwise see a performance improvement when enabled.
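
The sysfs setting above does not survive a reboot. One way to make it permanent (a sketch, assuming an Ubuntu system booted with GRUB) is to add transparent_hugepage=always to the kernel command line and regenerate the bootloader configuration:

# In /etc/default/grub, append the parameter to the existing options:
GRUB_CMDLINE_LINUX_DEFAULT="... transparent_hugepage=always"

$ sudo update-grub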

Temporary storage

The create step requires a significant amount of temporary storage (about 200GB for a 100GB BAM, 30x human).

Some stages are extremely I/O heavy. Using high-performance scratch space (such as a striped RAID) will greatly improve performance. Many operations require random seeks for small blocks and benefit significantly from high-performance SSDs; NVMe volumes work better still.

BioGraph respects the POSIX convention for temporary storage. The default temporary directory is the value of $TMPDIR, or /tmp/ if $TMPDIR is unset or does not exist.
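
For example, to direct temporary files to a fast scratch volume via the environment (the path is illustrative):

$ export TMPDIR=/big/raid/tmp/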

You may also specify the scratch disk path explicitly with the --tmp option.

(bg7)$ biograph full_pipeline ... --tmp /big/raid/tmp/

Creating an ephemeral RAID on AWS

Many AWS instance types (such as the c5d or r5d series) use NVMe flash for ephemeral storage. Running the following script as root will automatically create an LVM striped RAID from all attached ephemeral NVMe devices and mount it under /mnt. It requires the nvme command, which is available from the nvme-cli Ubuntu package.

This script will destroy data on all ephemeral /dev/nvme* devices. It should be used only from a fresh boot of Ubuntu on an AWS instance with no other data disks attached!

#!/bin/bash
# Create a striped LVM volume from all EC2 ephemeral NVMe devices and mount it on /mnt
set -e

# Find all attached instance-store NVMe devices (requires nvme-cli)
SSDS=$(nvme list | grep 'Amazon EC2 NVMe Instance Storage' | awk '{print $1}')

# Initialize them as LVM physical volumes and group them into vg0
pvcreate $SSDS
vgcreate vg0 $SSDS

# Create a single logical volume striped across every device
lvcreate --extents 100%FREE --stripes `ls $SSDS | wc -l` --stripesize 256 --name lv0 vg0

# Format, mount, and make world-writable with the sticky bit (like /tmp)
mkfs -t ext4 /dev/vg0/lv0
mount /dev/vg0/lv0 /mnt
chmod 1777 /mnt
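
Once the volume is mounted, point BioGraph's scratch space at it:

(bg7)$ biograph full_pipeline ... --tmp /mnt/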

Caching the reference

Modern Linux kernels do an excellent job of managing the filesystem cache. However, copying your reference to a ramdisk can speed up many operations.

On Ubuntu systems, a tmpfs ramdisk is automatically created at boot that can use up to half of the system RAM. It is mounted on /dev/shm/.

On other Linux systems, the /dev/shm/ volume may be constrained to a smaller size. Consult your system documentation for instructions on how to increase this size, or mount a larger tmpfs in another location.

The human reference typically uses about 14GB of storage. This technique should be used only on systems with enough RAM to hold the reference in addition to at least 64GB for BioGraph and about 20GB of overhead for system processes.
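
Before copying, verify that the ramdisk has enough free space (output varies by system):

$ df -h /dev/shm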

$ cp -Rv /slow/volume/hs37d5/ /dev/shm/
(bg7)$ biograph full_pipeline --ref /dev/shm/hs37d5/ ...

Be sure to rm -rf /dev/shm/hs37d5/ when you are finished to release the ramdisk memory back to the system.

Caching everything in-process

BioGraph makes extensive use of random access mmap files. This works well in most environments, but performance may suffer if your BioGraph files or reference files reside on a network file share (such as nfs, cifs, or gpfs). Network shares tend to be optimized for large sequential reads, and offer poor performance when fetching small blocks with random seeking.

In this kind of environment, the best performance is gained by copying the files to local storage.

If there is insufficient local storage space, performance may be improved in these environments by using the --cache option. This makes BioGraph attempt to cache as much information in memory as possible at the expense of using more memory overall.

Note that runs may fail when --cache is enabled if there is insufficient system memory to hold the BioGraph and reference files. It should be used only on systems with significant RAM when data files must be stored on a network share.
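
For example, an illustrative invocation (this assumes --cache is passed as a bare flag) when the data must remain on a network share:

(bg7)$ biograph full_pipeline ... --cache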

Measuring performance

Runtime statistics for the create and discovery steps are automatically saved to JSON files in the qc/ folder inside the BioGraph. Various derived values (such as number of reads, coverage, discovered variant types and sizes, etc.) can be used by your pipeline for validation and QC checks. In addition, overall timings for every step in full_pipeline are saved in hh:mm:ss format to timings.json.

$ jq . my.bg/qc/timings.json
{
  "create": "00:50:13",
  "discovery": "00:33:07",
  "coverage": "00:10:38",
  "grm": "00:02:22",
  "qual_classifier": "00:08:43"
}
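
Individual timings can be extracted for monitoring or automated pipeline checks, for example:

$ jq -r .create my.bg/qc/timings.json
00:50:13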

Using BED regions for faster calling

The biograph discovery command will attempt to assemble reads across the entire reference. To constrain calling to a specific genomic region, use the --bed option. The BED file should contain the regions you wish to include:

$ cat chr1.bed
chr1    10000    248946422
(bg7)$ biograph discovery --in my.bg --ref grch38/ --out chr1.vcf --bed chr1.bed

While BioGraph can call variants on the entire genome, you can save significant time by using a BED file that includes only regions of interest. The public references at s3://spiral-public/references/ include a regions.bed file that contains only the autosomes and sex chromosomes. It specifically excludes alt contigs, decoys, mitochondria, telomeres, centromeres, and known regions of heterochromatin. If your analysis will not use results from these regions, running discovery with --bed regions.bed can save substantial processing time.

(bg7)$ biograph discovery --in my.bg --ref grch38/ --out genome.vcf \
  --bed grch38/regions.bed

The biograph coverage command also accepts a BED file and operates in a similar manner. It is generally more efficient to use --bed at discovery time when running the full pipeline, since all subsequent steps will then work only with calls from the regions of interest.

We recommend at least restricting the analysis to regions that are not centromeric, telomeric, or decoy sequences. Our internal standard pipelines also exclude alternate contigs. The non-centromeric, non-telomeric regions can be created by downloading the centromere/telomere tracks from the UCSC Genome Browser as a BED file, then subtracting that file from a genome.bed containing each chromosome's start (0) and length using bedtools.

$ bedtools subtract -a genome.bed -b cent_telo.bed > regions.bed
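
The genome.bed itself can be generated from a FASTA index (a sketch, assuming the reference has been indexed with samtools faidx; file names are illustrative):

$ awk 'BEGIN{OFS="\t"} {print $1, 0, $2}' reference.fa.fai > genome.bed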

Pipeline considerations

Depending on your computing architecture, it may be useful to store input reads in a location other than a local file system (such as an AWS S3 bucket, Azure blob storage, Google Cloud Storage, a URL endpoint, or other object storage scheme). You can save time and storage space by streaming reads directly into BioGraph.

BioGraph can accept BAM, SAM, CRAM, or uncompressed FASTQ as input on STDIN by specifying --reads -. This allows for high-speed streaming from external storage without the need to save the reads to disk prior to processing:

(bg7)$ aws s3 cp s3://my-bucket/reads.bam - | biograph full_pipeline --reads - ...
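
Streaming works the same way from any source that can write to standard output, for example from a URL endpoint (the URL is illustrative):

(bg7)$ curl -s https://example.com/reads.bam | biograph full_pipeline --reads - ...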

For full details on how to take advantage of streaming, see the biograph create command.


Next: Understanding Read Correction
