Incremental import into GenomicsDB
See the terminology page for definitions of bulk import and incremental import.
The vcf2tiledb program supports incremental imports by appending new rows to an existing TileDB array. By default, the program will import all the samples/CallSets specified in the callset_mapping_file, appending rows to an existing TileDB array if needed.
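As an illustration, a callset_mapping_file maps each sample/CallSet to a row index and a source file. The sample names, file paths, and exact field names below are hypothetical and shown only to convey the shape of the mapping:

```json
{
    "callsets": {
        "SAMPLE_A": {
            "row_idx": 0,
            "idx_in_file": 0,
            "filename": "/data/SAMPLE_A.vcf.gz"
        },
        "SAMPLE_B": {
            "row_idx": 1,
            "idx_in_file": 0,
            "filename": "/data/SAMPLE_B.vcf.gz"
        }
    }
}
```

Each row_idx identifies a row in the TileDB array; re-using an existing row index updates that sample's data as described below.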
If the callset_mapping_file contains samples/CallSets corresponding to existing rows in the array (same row index values), then the newly supplied data will be assumed to be the latest version for those samples/CallSets. Note that this does not imply that the old data for the sample/CallSet is completely deleted. For example, if row 0 had data at column 5 and vcf2tiledb is invoked again for row 0 with data at column 6, the updated array will contain data for both columns 5 and 6 for row 0.
For convenience, the loader JSON file supports two optional parameters. Users may prefer to maintain a single callset_mapping_file per array and append new samples/CallSets to it as they arrive. In that case, the vcf2tiledb import program must be told to import only the new samples/CallSets from this file; this is achieved with the following two parameters in the loader_config_file:
- lb_callset_row_idx (optional, type:int64, default: 0): If specified in the loader configuration file, then the import program will only import samples/CallSets with row index >= lb_callset_row_idx. For example, assuming an array already has 100 rows (row indexes: 0-99), the user can import additional samples by appending sample/CallSet information to the callset_mapping_file (from row index 100) and specify in the loader configuration that only samples with row index >= 100 should be imported.
- ub_callset_row_idx (optional, type:int64, default: INT64_MAX): The upper bound on the row index for samples/CallSets that should be imported. Provided for completeness.
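For example, if an array already holds rows 0-99 and new samples have been appended to the callset_mapping_file starting at row index 100, a loader_config_file fragment restricting the import to the new rows might look like the following (all other loader fields elided; the file path is hypothetical):

```json
{
    "callset_mapping_file": "/data/callset_mapping.json",
    "lb_callset_row_idx": 100
}
```

Since ub_callset_row_idx defaults to INT64_MAX, omitting it imports every sample from row index 100 onward; set it as well to import only a specific range of rows.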