This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

Incremental import into GenomicsDB

Karthik Gururaj edited this page May 31, 2016 · 6 revisions

See the terminology page for definitions of bulk import and incremental import.

The vcf2tiledb program supports incremental imports by appending new rows to an existing TileDB array. By default, the program will import all the samples/CallSets specified in the callset_mapping_file, appending rows to an existing TileDB array if needed.

If the callset_mapping_file contains samples/CallSets corresponding to existing rows in the array (same row index values), then the newly supplied data will be assumed to be the latest version for those samples/CallSets. Note that this does not imply that the old data for the sample/CallSet is completely deleted. For example, if row 0 had data at column 5 and vcf2tiledb is invoked again for row 0 with data at column 6, the updated array will contain data for both columns 5 and 6 for row 0.
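
The update behaviour described above can be pictured with a toy model. This is only a sketch, not how vcf2tiledb or TileDB is implemented: the array is modelled as a Python dictionary keyed by (row, column), and the function name apply_import and the cell values are hypothetical. It shows that importing data for an existing row adds to or overwrites individual cells and does not clear the row first.

```python
# Toy model (an assumption for illustration): the array is a mapping
# from (row, column) coordinates to a cell value.
def apply_import(array, new_cells):
    """Return a copy of `array` with `new_cells` written on top of it.

    Cells at coordinates that already exist are replaced by the new
    version; cells at other coordinates are left untouched.
    """
    updated = dict(array)
    updated.update(new_cells)
    return updated


# First import: row 0 has data at column 5.
array = {(0, 5): "data from first import"}

# Second import for the same row: data at column 6 only.
array = apply_import(array, {(0, 6): "data from second import"})

# Row 0 now has data at both columns.
columns_for_row_0 = sorted(col for (row, col) in array if row == 0)
print(columns_for_row_0)
```

If the data at column 5 should no longer be visible for row 0, re-importing the row is not enough in this model, since nothing in the second import touches that cell.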

The loader JSON file accepts two optional parameters for convenience. Users may prefer to keep a single callset_mapping_file for a given array and append new samples/CallSets to it over time. In that case, the vcf2tiledb import program must be told to import only the new samples/CallSets from this file. The following two parameters, set in the loader_config_file, do this:

  • lb_callset_row_idx (optional, type:int64, default: 0): If specified in the loader configuration file, then the import program will only import samples/CallSets with row index >= lb_callset_row_idx. For example, assuming an array already has 100 rows (row indexes: 0-99), the user can import additional samples by appending sample/CallSet information to the callset_mapping_file (from row index 100) and specify in the loader configuration that only samples with row index >= 100 should be imported.
  • ub_callset_row_idx (optional, type:int64, default: INT64_MAX): The upper bound on the row index for samples/CallSets that should be imported. Provided for completeness.
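
As a rough illustration of how the two bounds select samples, the sketch below filters a callset mapping by row index. It is not the vcf2tiledb implementation. The function select_callsets, the sample names, and the layout of the callsets dictionary (each entry carrying a row_idx field) are assumptions made for this example, and treating the upper bound as inclusive is also an assumption. Only the parameter names and their defaults come from the text above.

```python
import json

INT64_MAX = 2**63 - 1  # default for ub_callset_row_idx


def select_callsets(callsets, lb_callset_row_idx=0,
                    ub_callset_row_idx=INT64_MAX):
    """Keep only callsets whose row index lies within the two bounds.

    The upper bound is treated as inclusive here (an assumption).
    """
    return {
        name: info
        for name, info in callsets.items()
        if lb_callset_row_idx <= info["row_idx"] <= ub_callset_row_idx
    }


# Hypothetical mapping: rows 0-99 are already in the array,
# rows 100-102 were appended to the mapping file afterwards.
callsets = {f"sample_{i}": {"row_idx": i} for i in range(103)}

# Fragment that would be added to the loader configuration.
loader_fragment = {"lb_callset_row_idx": 100}
print(json.dumps(loader_fragment))

selected = select_callsets(callsets, **loader_fragment)
selected_rows = sorted(info["row_idx"] for info in selected.values())
print(selected_rows)
```

With both parameters left at their defaults, the filter keeps every entry, which matches the default behaviour of importing all samples/CallSets in the callset_mapping_file.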