Performance
GLnexus is a multithreaded C++ program designed to utilize powerful servers flat-out. Here are tips for achieving the intended performance characteristics. They are all important, so please make sure to check each one if you're doing a big project, and most especially if you're benchmarking.
- Install the jemalloc memory allocator and inject it into GLnexus at runtime by setting `LD_PRELOAD`, as described here.
  - On Debian/Ubuntu, install `libjemalloc-dev` and set `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./glnexus_cli ...`
  - This is so important that `glnexus_cli` displays a warning if jemalloc is absent.
  - The prebuilt Docker images should have jemalloc all set up by default.
- The working directory, used intensively as temporary space for external data sorting and scanning, should be on a local SSD to minimize I/O starvation of the available CPUs.
- The program detects the host's threads and memory and tries to use all of them by default. This can be overridden with command-line options for memory and thread budgets (see `glnexus_cli -h`).
- Increase the open file limit using e.g. `ulimit -Sn 65536` in the same shell session preceding `glnexus_cli`. (It doesn't have to open all the gVCF files at once, but the RocksDB database can exceed the typical default limit of 1,024 open files in large projects.)
- We deploy on servers with up to 32 hardware threads (16 cores), and have not yet tuned scalability much beyond that.
- If the host has multiple NUMA nodes, use `numactl --interleave=all ./glnexus_cli ...` to avoid skewed allocation on one node. TMI here.
- Piping the pBCF output stream through single-threaded postprocessing programs might bottleneck the throughput. For example, if the pipeline includes `bgzip`, make sure to use a recent version supporting multithreaded compression, e.g. `bgzip -@ 4` (see the combined invocation sketch after this list).
- If you're using our pre-built Docker image, Docker causes some overhead on high-throughput standard output streams, so perform format conversion and compression in a script running inside the container if possible.
- The internal multithreading scheme is designed with high sample counts in mind, and we haven't tuned it to "scale down" efficiently for small cohorts of <100 samples. In that case the thread scheduling becomes too fine-grained, limiting resource utilization. (Please caveat any micro-benchmarks in light of this.)
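
Putting several of these tips together, here's a minimal sketch of an end-to-end invocation on a Debian/Ubuntu host. It assumes jemalloc, `numactl`, `bcftools`, and a multithreading-capable `bgzip` are installed; the scratch path, the `--config` preset name, and the gVCF file pattern are placeholders to adapt for your project (check `glnexus_cli -h` for the options available in your version).

```bash
# Sketch only: paths, the preset name, and file patterns are placeholders.

# Raise the open file limit for RocksDB in this shell session.
ulimit -Sn 65536

# Work from a local SSD; the temporary GLnexus database is created in the
# current working directory by default.
cd /mnt/local-ssd/scratch

# Preload jemalloc, interleave allocations across NUMA nodes, and keep the
# downstream compression multithreaded so it doesn't bottleneck the pBCF
# stream coming out of glnexus_cli.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
numactl --interleave=all \
    /path/to/glnexus_cli --config gatk /data/gvcf/*.g.vcf.gz \
    | bcftools view - \
    | bgzip -@ 4 -c \
    > cohort.vcf.gz
```

If you run this inside the pre-built Docker image, keep the `bcftools` and `bgzip` steps inside the same container so the high-throughput stream doesn't cross the Docker standard output boundary.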
With all of this in place, the open-source command-line program is quite performant up to medium-size projects of several thousand samples. For very large projects, we deploy GLnexus within a cloud-native framework providing much higher parallelization and incremental operations. The open-source version produces identical scientific results.