Optimizing Performance
BioGraph uses significant CPU, memory, and I/O for BioGraph creation and analysis. There are several parameters that can be adjusted to achieve the best possible performance in your compute environment.
Running other processes on a machine that is actively creating or analyzing BioGraph files will negatively impact performance and should be avoided.
BioGraph commands attempt to make efficient use of multiple cores. Generally speaking, adding more CPU cores reduces the overall runtime. You can explicitly specify the number of concurrent threads with the --threads option. The default auto setting creates one thread per CPU core.
If your system has less than 2GB of RAM available per core, better performance can be achieved by reducing --threads to fewer than the total number of cores. On memory-constrained systems, memory pressure can increase to the point that mmaps are no longer retained in memory and must be swapped in and out. Running with too many threads will lead to CPUs spending increasing time in the disk sleep state. A balanced system should show all working CPUs at close to 100% normal usage most of the time.
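For example, on a hypothetical 32-core machine with only 64GB of RAM (right at the 2GB-per-core boundary), capping the thread count below the core count can keep memory pressure in check:
(bg7)$ biograph full_pipeline ... --threads 24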
The biograph create command requires a minimum of 32GB RAM for very small datasets, and at least 48GB (the default setting) for typical human datasets. Additional RAM will significantly speed up the create step. You can increase the memory limit with the --max-mem option; for a 30x human, 64GB works well, and larger datasets benefit from more memory on systems that have it.
Be sure to always leave at least 20GB free for system processes. For example, on a system with 128GB of RAM, setting --max-mem 100 will provide plenty of memory to BioGraph for most datasets while still leaving room for the system. This setting is not used by other BioGraph steps.
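As an illustration, on a dedicated 128GB machine the create step might be invoked with the memory limit discussed above (other required options are elided here):
(bg7)$ biograph create ... --max-mem 100 --threads 32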
Memory performance may be further improved by enabling system-wide transparent hugepage support in the Linux kernel. To see if transparent hugepage support is enabled on your system:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
There are three possible settings: always, madvise, and never.
The always and never settings are self-explanatory. The madvise setting in theory allows software to choose to enable hugepage support, but does not guarantee huge pages in all circumstances. BioGraph will request hugepage support, but the kernel may not grant it when set to madvise.
Selecting always will fully enable the feature:
$ echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/enabled
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
Note that hugepage support may degrade performance for some workloads. It should be set to always only on systems dedicated to running BioGraph, or that otherwise see a performance improvement when enabled.
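To confirm that transparent huge pages are actually in use while a job is running, you can watch the standard AnonHugePages counter in /proc/meminfo; a large and growing value indicates huge pages are being allocated:
$ grep AnonHugePages /proc/meminfo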
The create step requires a significant amount of temporary storage (about 200GB for a 100GB BAM, 30x human).
Some stages are extremely I/O heavy. Using high-performance scratch space (such as a striped RAID) will greatly improve performance. Many operations require random seeking for small blocks and benefit significantly from high-performance SSD drives. NVMe volumes work better still.
BioGraph respects the POSIX convention for temporary storage. The default temporary directory is the value of $TMPDIR, or /tmp/ if $TMPDIR is unset or does not exist.
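For example, to direct all temporary files to a dedicated scratch volume via the environment (the path below matches the --tmp example that follows):
$ export TMPDIR=/big/raid/tmp/
(bg7)$ biograph full_pipeline ...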
You may also specify the scratch disk path explicitly with the --tmp option.
(bg7)$ biograph full_pipeline ... --tmp /big/raid/tmp/
Many AWS instance types (such as the c5d or r5d series) use NVMe flash for ephemeral storage. Running the following script as root will automatically create an LVM striped RAID from all attached ephemeral NVMe devices and mount it under /mnt. It requires the nvme command, which is available from the nvme-cli Ubuntu package.
This script will destroy data on all ephemeral /dev/nvme* devices. It should be used only from a fresh boot of Ubuntu on an AWS instance with no other data disks attached!
#!/bin/bash
# Create an LVM striped volume across all AWS ephemeral NVMe devices and mount it at /mnt.
set -e
# Find every ephemeral NVMe instance storage device
SSDS=$(nvme list | grep 'Amazon EC2 NVMe Instance Storage' | awk '{print $1}')
# Pool the devices into a single LVM volume group
pvcreate $SSDS
vgcreate vg0 $SSDS
# Create one logical volume striped across all of the devices
lvcreate --extents 100%FREE --stripes $(ls $SSDS | wc -l) --stripesize 256 --name lv0 vg0
# Format, mount, and make it world-writable with the sticky bit (like /tmp)
mkfs -t ext4 /dev/vg0/lv0
mount /dev/vg0/lv0 /mnt
chmod 1777 /mnt
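Once the script completes, you can verify the striped volume is mounted and point BioGraph's scratch space at it with the --tmp option described above:
$ df -h /mnt
(bg7)$ biograph full_pipeline ... --tmp /mnt/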
Modern Linux kernels do an excellent job of managing the filesystem cache. However, copying your reference to a ramdisk can speed up many operations.
On Ubuntu systems, a tmpfs ramdisk is automatically created at boot that can use up to half of the system RAM. It is mounted on /dev/shm/.
On other Linux systems, the /dev/shm/ volume may be constrained to a smaller size. Consult your system documentation for instructions on how to increase this size, or mount a larger tmpfs in another location.
The human reference typically uses about 14GB of storage. This technique should be used only on systems with sufficient memory for the reference, at least 64GB for BioGraph, and about 20GB of overhead for system processes.
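Before copying, it is worth confirming that the tmpfs has enough free space for the reference (roughly 14GB):
$ df -h /dev/shm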
$ cp -Rv /slow/volume/hs37d5/ /dev/shm/
(bg7)$ biograph full_pipeline --ref /dev/shm/hs37d5/ ...
Be sure to rm -rf /dev/shm/hs37d5/ when you are finished to release the ramdisk memory back to the system.
BioGraph makes extensive use of random access mmap files. This works well in most environments, but performance may suffer if your BioGraph files or reference files reside on a network file share (such as NFS, CIFS, or GPFS). Network shares tend to be optimized for large sequential reads, and offer poor performance when fetching small blocks with random seeking.
In this kind of environment, the best performance is gained by copying the files to local storage.
If there is insufficient local storage space, performance may be improved in these environments by using the --cache option. This makes BioGraph attempt to cache as much information in memory as possible, at the expense of using more memory overall.
Note that runs may fail when --cache is enabled if there is insufficient system memory to hold the BioGraph and reference files. It should be used only on systems with significant RAM when data files must be stored on a network share.
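As a sketch (the paths are hypothetical, and --cache is assumed here to be a bare flag), a discovery run against a BioGraph and reference stored on an NFS mount might look like:
(bg7)$ biograph discovery --in /nfs/project/my.bg --ref /nfs/project/grch38/ --out my.vcf --cache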
Runtime statistics for the create and discovery steps are automatically saved to JSON files in the qc/ folder inside the BioGraph. Various derived values (such as number of reads, coverage, discovered variant types and sizes, etc.) can be used by your pipeline for validation and QC checks. In addition, overall timings for every step in full_pipeline are saved in hh:mm:ss format to timings.json.
$ jq . my.bg/qc/timings.json
{
  "create": "00:50:13",
  "discovery": "00:33:07",
  "coverage": "00:10:38",
  "grm": "00:02:22",
  "qual_classifier": "00:08:43"
}
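Individual timings can be pulled out with jq for pipeline logging or QC checks, for example:
$ jq -r '.create' my.bg/qc/timings.json
00:50:13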
The biograph discovery command will attempt to assemble reads across the entire reference. To constrain calling to a specific genomic region, use the --bed option. The BED file should contain the regions you wish to include:
$ cat chr1.bed
chr1 10000 248946422
(bg7)$ biograph discovery --in my.bg --ref grch38/ --out chr1.vcf --bed chr1.bed
While BioGraph can call variants on the entire genome, you can save significant time by using a BED file that includes only regions of interest. The public references at s3://spiral-public/references/ include a regions.bed file that contains only the autosomes and sex chromosomes. It specifically excludes alt contigs, decoys, mitochondria, telomeres, centromeres, and known regions of heterochromatin. If your analysis will not use results from these regions, running discovery with --bed regions.bed can save substantial processing time.
(bg7)$ biograph discovery --in my.bg --ref grch38/ --out genome.vcf \
--bed grch38/regions.bed
The biograph coverage command will also accept a BED file and operates in a similar manner. It is generally more efficient to use --bed at discovery time when running the full pipeline, since it will only include calls from regions of interest for all subsequent steps.
We recommend at least restricting the analysis to regions that are not centromeric, telomeric, or decoy sequences. Our internal standard pipelines also exclude alternate contigs. The non-centromeric, non-telomeric regions can be created by downloading the centromere/telomere tracks from the UCSC Genome Browser as a BED file. This BED file can then be subtracted from a genome.bed containing each chromosome's start (0) and length using bedtools.
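One way to build the genome.bed itself is from the reference's FASTA index, which lists each sequence name and length (the grch38.fa.fai filename here is illustrative):
$ awk 'BEGIN{OFS="\t"} {print $1, 0, $2}' grch38.fa.fai > genome.bed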
$ bedtools subtract -a genome.bed -b cent_telo.bed > regions.bed
Depending on your computing architecture, it may be useful to store input reads in a location other than a local file system (such as an AWS S3 bucket, Azure blob storage, Google Cloud Storage, a URL endpoint, or other object storage scheme). You can save time and storage space by streaming reads directly into BioGraph.
BioGraph can accept BAM, SAM, CRAM, or uncompressed FASTQ as input on STDIN by specifying --reads -. This allows for high-speed streaming from external storage without the need to save the reads to disk prior to processing:
(bg7)$ aws s3 cp s3://my-bucket/reads.bam - | biograph full_pipeline --reads - ...
For full details on how to take advantage of streaming, see the biograph create command.