
GLnexus is a multithreaded C++ program designed to utilize powerful servers flat-out. Here are tips for achieving the intended performance characteristics. They're all important, so please check each one if you're running a big project, and especially if you're benchmarking.

  • Install the jemalloc memory allocator and inject it into GLnexus at runtime by setting LD_PRELOAD as described here (combined with the other command-line tips in the first sketch after this list).
    • On Debian/Ubuntu, install libjemalloc-dev and run LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./glnexus_cli ...
    • This is so important that glnexus_cli displays a warning if jemalloc is absent.
    • The prebuilt Docker images should have jemalloc set up by default.
  • The working directory, used intensively as temporary space for external data sorting and scanning, should be on a local SSD to minimize I/O starvation of the available CPUs.
  • The program detects the host's threads and memory, and by default tries to use all of them. This can be overridden with command-line options setting memory and thread budgets (see glnexus_cli -h).
  • Increase the open file limit with e.g. ulimit -Sn 65536 in the same shell session, before invoking glnexus_cli (also shown in the sketch after this list). It doesn't have to open all the gVCF files at once, but in large projects its RocksDB database can exceed the typical default limit of 1,024 open files.
  • We deploy on servers with up to 32 hardware threads (16 cores), and have not yet tuned GLnexus's scalability much beyond that.
  • If the host has multiple NUMA nodes, use numactl --interleave=all ./glnexus_cli ... to avoid skewed allocation on one node. TMI here
  • Piping the pBCF output stream through single-threaded postprocessing programs might bottleneck throughput. For example, if the pipeline includes bgzip, make sure to use a recent version supporting multithreaded compression, e.g. bgzip -@ 4 (see the pipeline sketch after this list).
  • If you're using our pre-built Docker image, note that Docker adds some overhead to high-throughput standard output streams; so, if possible, perform format conversion and compression in a script running inside the container (see the last sketch after this list).
  • The internal multithreading scheme is designed with high sample counts in mind, and we haven't tuned it to "scale down" efficiently for small cohorts of <100 samples. In that case the thread scheduling becomes too fine-grained, limiting resource utilization. (Please caveat any micro-benchmarks in light of this.)
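
Putting the jemalloc, open-file-limit, and NUMA tips together, here is a minimal shell sketch of a complete invocation. The jemalloc path matches the Debian/Ubuntu example above; the --config preset, gVCF filenames, and output path are placeholders for illustration.

```bash
# Raise the soft open-file limit for this shell session (RocksDB can exceed
# the usual default of 1,024 open files in large projects).
ulimit -Sn 65536

# Inject jemalloc via LD_PRELOAD and interleave allocations across NUMA nodes.
# The config preset and gVCF filenames below are placeholders.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
numactl --interleave=all \
./glnexus_cli --config DeepVariant \
    sample1.g.vcf.gz sample2.g.vcf.gz sample3.g.vcf.gz \
    > cohort.bcf
```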
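
If the end product should be a compressed project VCF, the pBCF stream can be converted and compressed in a pipeline rather than written to disk first. This sketch assumes bcftools and a recent bgzip are installed; the thread count and filenames are illustrative.

```bash
# Convert the pBCF stream to VCF and compress it with multithreaded bgzip,
# so single-threaded postprocessing doesn't bottleneck glnexus_cli.
./glnexus_cli --config DeepVariant sample*.g.vcf.gz \
    | bcftools view - \
    | bgzip -@ 4 -c \
    > cohort.vcf.gz
```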
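
When using the pre-built Docker image, the same pipeline can run entirely inside the container so the high-throughput stream never crosses Docker's stdout. The image name, bind mount, and inner commands here are placeholders; adapt them to the actual image and your data layout.

```bash
# Mount the data directory and run the whole conversion/compression pipeline
# inside the container; only the final compressed file lands on the host mount.
docker run --rm -v /mnt/project:/data <glnexus_docker_image> \
    bash -c 'cd /data && \
        glnexus_cli --config DeepVariant *.g.vcf.gz \
        | bcftools view - | bgzip -@ 4 -c > cohort.vcf.gz'
```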

With all of this in place, the open-source command-line program performs well for projects of up to several thousand samples. For very large projects, we deploy GLnexus within a cloud-native framework providing much higher parallelization and incremental operations; the open-source version produces identical scientific results.