zzAttic Multi-threaded gVCF loading

The overall goal is to allow threads to perform query and import operations concurrently. The information can be thought of as a small amount of metadata pointing to a large amount of data. The metadata includes sample-sets, sample names, and datasets. A dataset is a file containing one or more samples. A sample contains the variant calls for a person (or other organism). A sample-set is a list of samples that a user wishes to work on; it might be all the people in the database, a subset based on ethnicity (all Europeans, all East Asians, ...), or a subset determined by disease.
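
For concreteness, the metadata could be modeled roughly as follows. This is a hypothetical C++ sketch; the type names and fields are assumptions for illustration, not the project's actual code.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// One imported gVCF file, holding calls for one or more samples.
struct Dataset {
    std::string name;
    std::vector<std::string> samples;  // names of the samples in the file
};

// A named list of samples a user wishes to work on, e.g. everyone in the
// database, an ethnicity-based subset, or a disease cohort.
struct SampleSet {
    std::string name;
    std::set<std::string> samples;
};

// The metadata as a whole: small and in-memory, pointing at the bulk
// variant-call data stored in the database.
struct Metadata {
    std::map<std::string, Dataset> datasets;       // keyed by dataset name
    std::set<std::string> samples;                 // all known sample names
    std::map<std::string, SampleSet> sample_sets;  // keyed by sample-set name
};
```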

To allow concurrency, we create two metadata sets:

  1. Active Metadata (AMD): holds the names of all datasets and samples currently undergoing import.
  2. Committed Metadata (CMD): complete samples and datasets that are persisted on disk.

Queries consult only the CMD. Import operations temporarily add to the AMD, and when done, move the new metadata to the CMD.
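
A minimal sketch of the two-set scheme, reusing the hypothetical Metadata type above and assuming a single coarse mutex guards both sets:

```cpp
#include <mutex>

// Both metadata sets behind one lock (the real design could use
// finer-grained locking; this is the simplest arrangement).
struct MetadataStore {
    std::mutex lock;  // taken exclusively for the brief metadata operations
    Metadata amd;     // Active Metadata: imports currently in flight
    Metadata cmd;     // Committed Metadata: durable, and visible to queries
};
// Queries read only cmd; imports reserve entries in amd and move
// them to cmd on completion.
```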

The import procedure can be divided into several parts:

Verify metadata and prepare

  • Exclusively lock the AMD and CMD
  • Verify that the sample(s) and dataset(s) do not already appear in either (the full check-and-reserve sequence is sketched below)
  • Add them to AMD
  • Release locks
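
A sketch of the check-and-reserve sequence, continuing the hypothetical MetadataStore above. The lock is held only for these brief metadata operations, never during the bulk load itself:

```cpp
#include <mutex>

// Returns false (rejecting the import) if the dataset or any of its
// samples already exists, whether committed or mid-import.
bool reserve_import(MetadataStore& store, const Dataset& ds) {
    std::lock_guard<std::mutex> guard(store.lock);  // exclusive lock on AMD+CMD

    // Verify the dataset does not already appear in either metadata set.
    if (store.amd.datasets.count(ds.name) || store.cmd.datasets.count(ds.name)) {
        return false;
    }
    // Verify that none of the samples already appear.
    for (const auto& s : ds.samples) {
        if (store.amd.samples.count(s) || store.cmd.samples.count(s)) {
            return false;
        }
    }
    // Reserve: record the in-flight import in the Active Metadata.
    store.amd.datasets[ds.name] = ds;
    store.amd.samples.insert(ds.samples.begin(), ds.samples.end());
    return true;  // lock released when guard goes out of scope
}
```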

Bulk insert

  • Insert the bulk data. A batched update must fit in memory, and a gVCF file can contain gigabytes of compressed data, so we do not use batched updates; each record is written individually (see the sketch after this list).
  • Perform sanity checks on the gVCF file as it is loaded, and roll back in case of error (future work).
  • A design goal is to allow streaming upload, which requires a single pass over the data.
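
A sketch of the streaming insert. The page says only "the DB"; a RocksDB backend and the GvcfReader interface are assumptions here. Each record is written as soon as it is parsed, with no write batch, so memory use stays bounded regardless of file size:

```cpp
#include <string>
#include <rocksdb/db.h>

// Hypothetical record source (a real loader would wrap an htslib-based
// gVCF parser); yields one call record per key/value pair.
struct GvcfReader {
    virtual bool next(std::string& key, std::string& value) = 0;  // false at EOF
    virtual ~GvcfReader() = default;
};

// Single-pass streaming insert: suitable for streaming upload because it
// never needs to revisit earlier records.
rocksdb::Status bulk_insert(rocksdb::DB* db, GvcfReader& reader) {
    std::string key, value;
    while (reader.next(key, value)) {
        // Sanity checks on the record would go here (e.g. records arriving
        // in sorted order); on failure, stop and leave cleanup to the
        // recovery procedure described below.
        rocksdb::Status s = db->Put(rocksdb::WriteOptions(), key, value);
        if (!s.ok()) return s;
    }
    return rocksdb::Status::OK();
}
```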

Update metadata

  • Exclusively lock the AMD and CMD
  • Update the metadata in memory and in the DB
  • Use a write-batch in the DB to ensure atomicity (see the sketch below)
  • Release locks
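
A sketch of the commit step, again assuming RocksDB and the hypothetical MetadataStore; the key layout is made up for illustration. All metadata keys go into one WriteBatch, so they become durable atomically:

```cpp
#include <mutex>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

rocksdb::Status commit_metadata(MetadataStore& store, rocksdb::DB* db,
                                const Dataset& ds) {
    std::lock_guard<std::mutex> guard(store.lock);  // exclusive lock on AMD+CMD

    // Stage every metadata update in a single atomic batch.
    rocksdb::WriteBatch batch;
    std::string sample_list;
    for (const auto& s : ds.samples) {
        sample_list += s + "\n";
        batch.Put("sample/" + s, ds.name);
    }
    batch.Put("dataset/" + ds.name, sample_list);
    rocksdb::Status status = db->Write(rocksdb::WriteOptions(), &batch);
    if (!status.ok()) {
        return status;  // on failure, the in-memory metadata is left unchanged
    }

    // Mirror the commit in memory: move the entries from AMD to CMD.
    store.cmd.datasets[ds.name] = ds;
    store.cmd.samples.insert(ds.samples.begin(), ds.samples.end());
    store.amd.datasets.erase(ds.name);
    for (const auto& s : ds.samples) {
        store.amd.samples.erase(s);
    }
    return status;  // lock released when guard goes out of scope
}
```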

Rollback and recovery issues

There are two scenarios that could cause a gVCF import to fail: (1) power failure, and (2) user abort. Either could leave wreckage in the DB in the form of key-value pairs belonging to partly loaded samples. We do not currently handle this case, although we plan to in the future.

A potential algorithm for the power-failure case is to roll back all the AMD operations (sketched in code after the list below). This requires the AMD to be persistent:

  • Exclusively lock AMD
  • Remove from the DB all keys that belong to datasets and samples appearing in the AMD
  • Remove the new datasets and samples from the AMD
  • Release locks
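
A sketch of this recovery procedure, under the same assumptions as above plus two more: the AMD has been persisted and reloaded, and every bulk key is prefixed by its dataset's name, so that a prefix scan finds all the wreckage:

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <rocksdb/db.h>

void recover_after_failure(MetadataStore& store, rocksdb::DB* db) {
    std::lock_guard<std::mutex> guard(store.lock);  // exclusive lock on AMD

    for (const auto& entry : store.amd.datasets) {
        // Delete every bulk key belonging to the partially loaded dataset
        // (hypothetical key layout: "data/<dataset>/<record>").
        const std::string prefix = "data/" + entry.first + "/";
        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
             it->Next()) {
            db->Delete(rocksdb::WriteOptions(), it->key());
        }
    }
    // Remove the new datasets and samples from the AMD.
    store.amd.datasets.clear();
    store.amd.samples.clear();
}
```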