Single whole genome variation graph or a VG for each chromosome?

To represent a genome with VG, you can build a single whole genome variation graph or you can create a VG for each chromosome.

The latter is the recommended choice for running GRAFIMO, and the VG developers also suggest it because the graphs build faster. This approach also lets GRAFIMO search for potential motif occurrences efficiently and in parallel, without demanding excessive hardware resources.

In our tests we built the hg38 genome variation graph by enriching the reference with the genomic variants of the 1000 Genomes Project phase 3 on GRCh38 (78 million variants), creating one VG per chromosome.

Scanning the genome used ~20 GB of memory on average when running on all available cores (the default option) of our 16-core machine. The number of cores can be limited by setting the --cores parameter, which tells GRAFIMO how many cores to use while running. Since each core loads the XG and GBWT indexes of one chromosome, you can choose a smaller number of cores to fit your hardware resources, taking into account the size of the chromosome VGs. With fewer cores GRAFIMO will run somewhat slower.
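The reasoning above can be turned into a rough estimate of how many cores fit a given memory budget. The following is only a sketch, not part of GRAFIMO: the function name `max_cores_for_budget` and the index sizes are hypothetical, and it assumes that memory use is dominated by the XG and GBWT indexes and that, in the worst case, the largest chromosomes are loaded at the same time.

```python
def max_cores_for_budget(index_sizes_gb, budget_gb, available_cores):
    """Estimate how many cores can run at once within a memory budget.

    index_sizes_gb: combined XG + GBWT size of each chromosome, in GB
                    (hypothetical values, measure them on your own files).
    budget_gb:      memory you are willing to dedicate to the scan, in GB.
    available_cores: number of cores on the machine.
    """
    # Worst case: the largest chromosomes are all in memory together.
    largest_first = sorted(index_sizes_gb, reverse=True)

    cores = 0
    used_gb = 0.0
    for size_gb in largest_first[:available_cores]:
        if used_gb + size_gb > budget_gb:
            break
        used_gb += size_gb
        cores += 1
    return cores


# Hypothetical per-chromosome index sizes, in GB.
sizes_gb = [2.0, 1.9, 1.6, 1.5, 1.0, 0.5]
print(max_cores_for_budget(sizes_gb, budget_gb=6.0, available_cores=16))
```

The returned value is a candidate for --cores. A result of 0 means that even the largest chromosome does not fit the budget on its own, so the budget needs to be raised.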

If you decide to scan a whole genome variation graph, GRAFIMO will use a single core by default. We made this choice because whole genome VGs are usually very large files (on the order of ~10-15 GB), and a parallel search on them can be very expensive in terms of computational resources. When setting --cores, keep in mind that during the scan each core loads its own copy of the whole genome variation graph into memory.
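Because each core holds its own copy of the graph, the memory needed grows linearly with the number of cores. The sketch below illustrates that arithmetic. It is not part of GRAFIMO: the function name `cores_for_whole_genome` and the example sizes are hypothetical, and it assumes the graph copy is the dominant memory cost.

```python
def cores_for_whole_genome(graph_size_gb, budget_gb, available_cores):
    """Estimate a --cores value for scanning a whole genome VG.

    Each core is assumed to load one full copy of the graph, so the
    number of copies that fit the budget bounds the number of cores.
    """
    if graph_size_gb <= 0:
        raise ValueError("graph_size_gb must be positive")

    # Number of full graph copies that fit in the memory budget.
    copies_that_fit = int(budget_gb // graph_size_gb)

    # Never exceed the machine's cores; fall back to the single-core default.
    return max(1, min(copies_that_fit, available_cores))


# Hypothetical example: a 12 GB graph with 32 GB of memory to spare.
print(cores_for_whole_genome(graph_size_gb=12.0, budget_gb=32.0, available_cores=16))
```

When the budget is smaller than the graph itself the function still returns 1, which matches the single-core default, but the scan may then exceed the available memory.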
