
GLnexus is a multithreaded C++ program designed to utilize powerful servers flat-out. Here are tips for achieving the intended performance characteristics. They're all important, so please check each one if you're running a big project, and especially if you're benchmarking.

  • Install the jemalloc memory allocator and inject it into GLnexus at runtime by setting LD_PRELOAD as described here (combined with the other command-line tips in the first sketch after this list).
    • On Debian/Ubuntu, install libjemalloc-dev and run LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./glnexus_cli ...
    • This is so important that glnexus_cli displays a warning if jemalloc is absent.
    • The prebuilt Docker images should have jemalloc set up by default.
  • The working directory, used intensively as temporary space for external data sorting and scanning, should be on a local SSD to minimize I/O starvation of the available CPUs.
  • The program detects the host's threads and memory, and by default tries to use all of them. This can be overridden with command-line options setting memory and thread budgets (see glnexus_cli -h).
  • Increase the open file limit with e.g. ulimit -Sn 65536 in the same shell session, before invoking glnexus_cli (also shown in the sketch after this list). It doesn't have to open all the gVCF files at once, but in large projects its RocksDB database can exceed the typical default limit of 1,024 open files.
  • We deploy on servers with up to 32 hardware threads (16 cores), and have not yet tuned GLnexus's scalability much beyond that.
  • If the host has multiple NUMA nodes, use numactl --interleave=all ./glnexus_cli ... to avoid skewed allocation on one node. TMI here
  • Piping the pBCF output stream through single-threaded postprocessing programs might bottleneck throughput. For example, if the pipeline includes bgzip, make sure to use a recent version supporting multithreaded compression, e.g. bgzip -@ 4 (see the pipeline sketch after this list).
  • If you're using our pre-built Docker image, note that Docker adds some overhead to high-throughput standard output streams; so, if possible, perform format conversion and compression in a script running inside the container (see the last sketch after this list).
  • The internal multithreading scheme is designed with high sample counts in mind, and we haven't tuned it to "scale down" efficiently for small cohorts of <100 samples. In that case the thread scheduling becomes too fine-grained, limiting resource utilization. (Please caveat any micro-benchmarks in light of this.)
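
Putting the jemalloc, open-file-limit, and NUMA tips together, here is a minimal shell sketch of a complete invocation. The jemalloc path matches the Debian/Ubuntu example above; the --config preset, gVCF filenames, and output path are placeholders for illustration.

```bash
# Raise the soft open-file limit for this shell session (RocksDB can exceed
# the usual default of 1,024 open files in large projects).
ulimit -Sn 65536

# Inject jemalloc via LD_PRELOAD and interleave allocations across NUMA nodes.
# The config preset and gVCF filenames below are placeholders.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
numactl --interleave=all \
./glnexus_cli --config DeepVariant \
    sample1.g.vcf.gz sample2.g.vcf.gz sample3.g.vcf.gz \
    > cohort.bcf
```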
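
If the end product should be a compressed project VCF, the pBCF stream can be converted and compressed in a pipeline rather than written to disk first. This sketch assumes bcftools and a recent bgzip are installed; the thread count and filenames are illustrative.

```bash
# Convert the pBCF stream to VCF and compress it with multithreaded bgzip,
# so single-threaded postprocessing doesn't bottleneck glnexus_cli.
./glnexus_cli --config DeepVariant sample*.g.vcf.gz \
    | bcftools view - \
    | bgzip -@ 4 -c \
    > cohort.vcf.gz
```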
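
When using the pre-built Docker image, the same pipeline can run entirely inside the container so the high-throughput stream never crosses Docker's stdout. The image name, bind mount, and inner commands here are placeholders; adapt them to the actual image and your data layout.

```bash
# Mount the data directory and run the whole conversion/compression pipeline
# inside the container; only the final compressed file lands on the host mount.
docker run --rm -v /mnt/project:/data <glnexus_docker_image> \
    bash -c 'cd /data && \
        glnexus_cli --config DeepVariant *.g.vcf.gz \
        | bcftools view - | bgzip -@ 4 -c > cohort.vcf.gz'
```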

With all of this in place, the open-source command-line program performs well for projects of up to several thousand samples. For very large projects, we deploy GLnexus within a cloud-native framework providing much higher parallelization and incremental operations; the open-source version produces identical scientific results.