Compiling GenomicsDB
- Version > 0.4.0: If your git commit id is 6bc801d1b1881 or newer, then follow the instructions on this page.
- Version > 0.3.0 and <= 0.4.0: If your git commit id is 6bc801d1b1881 or older and the commit id is 860400623ed4 or newer, then follow the instructions on the page for building GenomicsDB 0.4.
- Version 0.3.0 or older: If your git commit id is e10bf412ddd35 or older, then follow the instructions on the page for building GenomicsDB 0.3 or older.
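The commit ranges above can be checked mechanically. The following is a minimal sketch, assuming a local clone with full history; the function names `commit_in_history` and `genomicsdb_build_page` and the printed labels are illustrative and not part of GenomicsDB. It treats commit 6bc801d1b1881 itself as belonging to the current instructions, which is an assumption about the boundary case.

```shell
# Succeed if commit $1 is HEAD or an ancestor of HEAD in the repository at $2
# (default: current directory).
commit_in_history() {
    git -C "${2:-.}" merge-base --is-ancestor "$1" HEAD
}

# Print a label for the build instructions that apply to the checkout in $1.
# Labels are illustrative: "current" (this page), "0.4", or "pre-0.4"
# (the page above maps commit e10bf412ddd35 or older to the 0.3 instructions).
genomicsdb_build_page() {
    repo="${1:-.}"
    if commit_in_history 6bc801d1b1881 "$repo" 2>/dev/null; then
        echo "current"
    elif commit_in_history 860400623ed4 "$repo" 2>/dev/null; then
        echo "0.4"
    else
        echo "pre-0.4"
    fi
}
```

Running `genomicsdb_build_page <source_dir>` inside a GenomicsDB clone prints one of the three labels; a repository that contains neither commit falls through to `pre-0.4`.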
-
We have tested TileDB/GenomicsDB on the following platforms:
- GNU/Linux:
- CentOS 6 and 7 (almost identical to RHEL 6 and 7). Most of our heavy testing is performed on CentOS-7 systems.
- Ubuntu Trusty (14.04)
- MacOSX - SDK version 10.9
-
CMake build system - version > 2.8
- Example installation commands:
-
On CentOS/RedHat systems:
sudo yum -y install cmake
-
On Ubuntu systems:
sudo apt-get install cmake
-
On MacOSX, you can use Homebrew to obtain CMake.
brew install cmake
-
Dependencies from TileDB
- Zlib headers and libraries
- OpenSSL headers and libraries
- libuuid headers and libraries
- Example installation commands:
-
On CentOS/RedHat systems:
sudo yum -y install openssl-devel zlib-devel libuuid-devel
-
On Ubuntu systems:
sudo apt-get install zlib1g-dev libssl-dev uuid-dev
-
On MacOSX, you can use Homebrew to obtain the OpenSSL and uuid libraries.
brew install openssl
brew install ossp-uuid
-
-
C++ compiler: a compiler that supports the C++ 2011 standard (C++11).
- gcc version >= 4.8. We have been testing with gcc-4.9.1.
- We have tested with clang version >= 7.3.0 on MacOSX.
- Installing a new version of gcc/g++
-
CentOS/RedHat systems: You can use the software collections repository and install the package devtoolset-3 or devtoolset-4.
sudo yum install centos-release-scl
sudo yum install devtoolset-3 #or devtoolset-4
scl enable devtoolset-3 bash
-
Ubuntu: We use the Ubuntu Toolchain PPA to obtain new versions of gcc
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.9
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 60
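The minimum versions listed so far (CMake 2.8, gcc 4.8) can be checked before configuring. This is a minimal sketch, assuming GNU `sort` with the `-V` (version sort) option is available; the helper name `version_ge` is illustrative. Note that it tests "greater than or equal", so it is slightly more permissive than the strict "version > 2.8" stated for CMake.

```shell
# Succeed if dotted version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (commented out so that the block has no side effects):
# version_ge "$(cmake --version | awk 'NR==1 {print $3}')" 2.8 || echo "CMake is too old"
# version_ge "$(g++ -dumpversion)" 4.8 || echo "g++ is too old"
```

The smaller of the two versions sorts first, so the comparison succeeds exactly when the required version is the first line of the sorted output.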
-
-
Google Protocol Buffer: Google Protocol Buffers (protobuf) is a mandatory prerequisite from version 0.4.0 onward. We use protocol buffers to exchange configuration parameters, headers, and the mapping from callset/sample id to TileDB row index between Java and C++. Note that Ubuntu 14.04 LTS and CentOS 6 and 7 ship protobuf version 2.5.0, whereas GenomicsDB specifically depends on protobuf version 3.0.2. We recommend building protobuf locally and linking against it using the appropriate environment variables, rather than overwriting the system protobuf installation. To build protobuf from source:
git clone https://github.com/google/protobuf
cd protobuf
git checkout 3.0.x
sh autogen.sh
./configure --prefix=/path/to/local/installation --with-pic
make -j4
make install
On MacOSX:
brew install protobuf@3.1
-
NOTE: We use git submodules to pull in the remaining mandatory dependencies - you can skip directly to the optional pre-requisites section if you do not wish to manually fetch and build the following mandatory dependencies.
-
TileDB
git clone https://github.com/Intel-HLS/TileDB.git
-
Rapidjson library: Parameters are passed to TileDB tools/examples through a JSON file - Rapidjson is used to parse this JSON file. The library is a header-only library - no compilation needed.
git clone https://github.com/miloyip/rapidjson
-
Htslib for parsing and exporting VCFs. We maintain a fork of htslib with some modifications that are needed for use with GenomicsDB.
git clone https://github.com/Intel-HLS/htslib
cd htslib
git checkout intel_mods
make -j 8
-
OpenMPv4: We use directives from OpenMP specification v4. This is supported on gcc versions >= 4.9.0. The CMake build system will check whether your C compiler supports OpenMP v4 and will disable OpenMP during the build process if it does not. You may lose some performance during loading without OpenMP.
On MacOSX systems, OpenMP is disabled by default during compilation.
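To see ahead of time whether a compiler is likely to pass the OpenMP v4 check, one option is to inspect the `_OPENMP` macro it defines. The sketch below is not the check CMake performs; it only relies on the fact that the OpenMP 4.0 specification defines `_OPENMP` as 201307. The function names are illustrative, and the `-fopenmp -dM -E` invocation assumes a gcc- or clang-compatible compiler driver.

```shell
# Succeed if an _OPENMP macro value (a yyyymm date) is at least 201307,
# the value defined by the OpenMP 4.0 specification.
openmp_macro_is_v4() {
    [ "${1:-0}" -ge 201307 ]
}

# Print the _OPENMP value reported by compiler $1 (default: g++).
# Prints nothing if the compiler is missing or rejects -fopenmp.
compiler_openmp_macro() {
    echo | "${1:-g++}" -fopenmp -dM -E -x c++ - 2>/dev/null |
        awk '$2 == "_OPENMP" {print $3}'
}
```

For example, `openmp_macro_is_v4 "$(compiler_openmp_macro g++)" || echo "OpenMP v4 not detected"` reports compilers for which the build would probably fall back to running without OpenMP.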
-
For executables: If you wish to produce any of the executables provided by GenomicsDB, an MPI compiler, library and runtime are required. We have tested with reasonably new versions of OpenMPI, MPICH and MVAPICH2. If you wish to only build the combined TileDB/GenomicsDB shared library and the Java jar (see below), an MPI compiler is not needed.
-
On CentOS/RedHat systems:
sudo yum install mpich-devel
-
On Ubuntu systems:
sudo apt-get install mpich
-
On MacOSX:
brew install mpich
-
-
For importing CSV files: If you wish to import CSV data into TileDB, then you need libcsv. You also need to pass special flags while invoking make (see below).
-
On CentOS/RedHat systems: if you have the EPEL repo installed and enabled, you can install the libcsv packages using yum:
sudo yum install libcsv libcsv-devel
-
Ubuntu systems: on systems with Vivid(15.04) or newer:
sudo apt-get install libcsv3 libcsv-dev
-
Build from source for older Ubuntu systems:
wget -O libcsv.tar.gz http://downloads.sourceforge.net/project/libcsv/libcsv/libcsv-3.0.3/libcsv-3.0.3.tar.gz
tar xzf libcsv.tar.gz
cd libcsv-<version> && ./configure && make
-
MacOSX:
brew install libcsv
-
-
For the Java/JNI interface
- Java SDK version 8.
- Apache Maven 3
- The other Java dependencies are pulled in by Maven as needed.
-
Get the right branch based on what you wish to do - see the other pages for which branch to get. If you do not know which branch to use, the master branch is your best bet.
-
To get dependencies using git submodule, run:
git clone --recursive https://github.com/Intel-HLS/GenomicsDB.git
-
If you have an existing git repository and wish to pull in the latest changes:
git pull origin master
git submodule update --recursive --init
-
Make sure you have the required gcc version in your PATH.
-
We strongly recommend creating a build directory where all the binaries get compiled. This build directory can be outside the source directory:
mkdir -p <build_dir>
cd <build_dir>
-
It is safe to delete the build directory completely to cleanup all the files produced by cmake.
-
The generated Makefile contains a target called clean-all that will clean out the object files, but keep the Makefiles and CMakeCache.txt files.
-
Assuming you want to use the dependencies pulled in by git, you have the MPI compiler (mpicxx) in your PATH and all other dependencies are in standard system locations where the compiler can find them (for example, under /usr in GNU/Linux):
#release mode - O3, NDEBUG - assertions disabled
cd <build_dir>
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir>
make -j8 && make install
-
Compiling in debug mode:
#debug mode - assertions enabled, can use gdb for stepping, no OPENMP (can enable with the OPENMP=1 flag)
cd <build_dir>
cmake <source_dir> -DCMAKE_BUILD_TYPE=Debug -DDISABLE_OPENMP=1 -DCMAKE_INSTALL_PREFIX=<install_dir>
make -j8 && make install
-
If you do not have the MPI compiler in your PATH:
#release mode - O3, NDEBUG - assertions disabled
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> \
    -DMPI_CC_COMPILER=<mpicc_full_path> -DMPI_CXX_COMPILER=<mpicxx_full_path>
-
If your Protobuf library is located in a custom location:
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> \
    -DPROTOBUF_LIBRARY=<directory>
The Protobuf library is statically linked into the executables and the dynamic library. If you wish to link to the dynamic Protobuf library instead:
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> \
    -DPROTOBUF_LIBRARY=<directory> -DPROTOBUF_STATIC_LINKING=False
You may need to set the environment variable LD_LIBRARY_PATH while running GenomicsDB code.
-
If the header file and library for libcsv are located where the compiler can automatically find them (for example, under /usr in GNU/Linux), then CSV support is enabled automatically. If you have downloaded libcsv from sourceforge and compiled and installed it at a custom location, then pass the directory to the cmake command:
#release mode - O3, NDEBUG - assertions disabled
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DLIBCSV_DIR=<libcsv_dir>
The build process assumes that the library file is located in <libcsv_dir>/.libs or <libcsv_dir>/lib (the default location in the build process of libcsv).
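Before passing a custom libcsv location to cmake, it can help to confirm that one of the two expected subdirectories exists. This is a minimal sketch; the function name `libcsv_library_dir` is illustrative, and checking `.libs` before `lib` is an assumption rather than the documented search order.

```shell
# Print the library directory inside the libcsv tree $1, looking at the two
# locations named above (.libs, then lib). Fail if neither directory exists.
libcsv_library_dir() {
    for sub in .libs lib; do
        if [ -d "$1/$sub" ]; then
            echo "$1/$sub"
            return 0
        fi
    done
    return 1
}
```

For example, `libcsv_library_dir <libcsv_dir> || echo "no .libs or lib directory found"` flags a libcsv tree that has not been built or installed yet.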
-
On a MacOSX system, assuming you installed the pre-requisites using Homebrew under /usr/local/opt, the following command can be used:
#release mode - O3, NDEBUG - assertions disabled
cmake <source_dir> -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> \
    -DMPI_CC_COMPILER=/usr/local/opt/mpich/bin/mpicc -DMPI_CXX_COMPILER=/usr/local/opt/mpich/bin/mpicxx \
    -DOPENSSL_PREFIX_DIR=/usr/local/opt/openssl
If you have downloaded and compiled the dependencies manually, use the following commands:
-
Compiling in release mode
#release mode - O3, NDEBUG - assertions disabled
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DLIBCSV_DIR=<libcsv_dir> \
    -DTILEDB_SOURCE_DIR=<TileDB_dir>
-
Compiling with a custom htslib source directory:
#release mode - O3, NDEBUG - assertions disabled
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DLIBCSV_DIR=<libcsv_dir> \
    -DHTSLIB_SOURCE_DIR=<htslib_dir>
-
To enable light-weight profiling
#release mode - O3, NDEBUG - assertions disabled
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DLIBCSV_DIR=<libcsv_dir> \
    -DDO_PROFILING=True
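Since the cmake invocations above differ only in a handful of options, a small wrapper can assemble them. The following is a sketch only: the function name `genomicsdb_cmake_flags` and the environment variables `GDB_LIBCSV_DIR` and `GDB_PROFILING` are hypothetical names introduced for illustration, while the `-D` options themselves are the ones documented on this page.

```shell
# Print cmake -D options assembled from flags documented on this page.
# Arguments: $1 = build type (Release or Debug), $2 = install prefix.
# Optional environment variables (illustrative names, not part of GenomicsDB):
#   GDB_LIBCSV_DIR   adds -DLIBCSV_DIR=<value>
#   GDB_PROFILING=1  adds -DDO_PROFILING=True
genomicsdb_cmake_flags() {
    flags="-DCMAKE_BUILD_TYPE=$1 -DCMAKE_INSTALL_PREFIX=$2"
    # The debug example on this page also disables OpenMP.
    if [ "$1" = "Debug" ]; then
        flags="$flags -DDISABLE_OPENMP=1"
    fi
    if [ -n "${GDB_LIBCSV_DIR:-}" ]; then
        flags="$flags -DLIBCSV_DIR=$GDB_LIBCSV_DIR"
    fi
    if [ "${GDB_PROFILING:-0}" = "1" ]; then
        flags="$flags -DDO_PROFILING=True"
    fi
    echo "$flags"
}
```

It could be used as `cmake <source_dir> $(genomicsdb_cmake_flags Release <install_dir>)`. The unquoted command substitution splits on whitespace, so this simple form does not handle paths that contain spaces.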
-
With the BUILD_JAVA flag enabled, the build environment compiles both Java and Apache Spark interfaces of GenomicsDB.
-
Remember to use Java SDK version 8 - you must have the right Java executable in your PATH or must set the JAVA_HOME environment variable correctly.
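A quick way to confirm that the Java installation is version 8 is to look at the version string, which starts with 1.8 for Java 8. This is a minimal sketch with illustrative function names; giving `JAVA_HOME` precedence over `PATH` is an assumption about how the build resolves Java, not something stated above.

```shell
# Succeed if a `java -version` line (passed as $1) names a Java 8 runtime,
# whose version strings start with 1.8.
is_java8_version_line() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the version line of the java that would be used:
# $JAVA_HOME/bin/java if JAVA_HOME is set, otherwise java from PATH.
java_version_line() {
    "${JAVA_HOME:+$JAVA_HOME/bin/}java" -version 2>&1 | grep 'version "' | head -n 1
}
```

For example, `is_java8_version_line "$(java_version_line)" || echo "Java 8 not found"` can be run before invoking cmake with the BUILD_JAVA flag.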
-
To build the jar:
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DBUILD_JAVA=1
You don't need an MPI compiler and library to only build the jar and the shared TileDB/GenomicsDB library (no executables will be built).
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DBUILD_JAVA=1 \
    -DDISABLE_MPI=1
The jar file genomicsdb-<version>.jar will be created in the <install_dir>/bin/ directory. You can install this jar file into your local Maven repository for use in downstream Maven/Gradle build systems using:
mvn install:install-file -Dfile=bin/genomicsdb-<version>.jar -DpomFile=<source_dir>/pom.xml
Caveats:
- The shared library (libtiledbgenomicsdb.so) that is packaged in the jar depends on GNU libc (glibc). If you compile the library on one system and run it on another system with a newer version of glibc, the library should work since glibc is backward compatible (for example, you can compile the library on CentOS-6 and run it on CentOS-7). However, if you do the reverse, then very likely you will see errors about missing symbols when loading the library. A quick check is to run ldd bin/libtiledbgenomicsdb.so; you should NOT see errors about missing symbols in a correctly functioning configuration.
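The `ldd` check described in the caveat can be scripted. This is a minimal sketch, assuming the glibc `ldd` output format, in which unresolved libraries and unresolved symbol versions are both reported with the words "not found"; the function name `ldd_reports_missing` is illustrative.

```shell
# Read `ldd` output on stdin and succeed if any line reports something
# as "not found" (a missing library or a missing symbol version).
ldd_reports_missing() {
    grep -q 'not found'
}
```

For example, `ldd bin/libtiledbgenomicsdb.so 2>&1 | ldd_reports_missing && echo "unresolved dependencies"` prints a warning only when the library cannot be fully resolved on the current system.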
Note: For most users this section is not applicable. If you are interested in packaging and distributing a jar file that must not contain any distribution specific dependencies (MPI shared libraries for example), follow the steps in this section:
-
The following libraries should be statically linked in:
-
libgcc: We use the option "-static-libgcc"
-
stdc++: We use the option "-static-libstdc++" while building TileDB/GenomicsDB to create portable binaries. Please ensure that your build system has the static version of this library installed.
-
On CentOS/RedHat systems:
sudo yum -y install libstdc++-static
-
-
OpenSSL
-
-
The following libraries are dynamically linked; however, they are backward compatible. Hence, you should build your jar on an 'older' system (we build on CentOS-6).
- zlib
- glibc
-
Command
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DBUILD_JAVA=1 \
    -DBUILD_DISTRIBUTABLE_LIBRARY=True
Note: For most users this section is not applicable. You will need to regenerate the sources produced by the Protobuf compiler if and only if one of the following conditions is met:
-
You wish to use a different version of Protobuf than the one used to generate the sources distributed as part of the GenomicsDB repo (Protobuf v 3.0.2)
-
You have modified the .proto files
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DBUILD_JAVA=True \
    -DPROTOBUF_LIBRARY=<protobuf_directory> -DPROTOBUF_REGENERATE=True
make -j 8
When the make command is executed, the C++ and Java files will be regenerated for the Protocol buffers in the build directory and compiled into the library and executables.
The default version of Spark core used in GenomicsDB is 2.11. If you wish to compile with Spark core 2.10, please use the GENOMICSDB_SPARK_PROFILE macro in cmake as:
cmake <source_dir> -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<install_dir> -DBUILD_JAVA=True -DGENOMICSDB_SPARK_PROFILE=spark_core-2.10