zzAttic Multi-threaded gVCF loading

The overall goal is to allow threads to perform query and import operations concurrently. The information can be thought of as a small amount of metadata pointing to a large amount of data. The metadata includes sample-sets, sample names, and datasets. A dataset is a file containing one or more samples. A sample contains the variant calls for a person (or other organism). A sample-set is a list of samples that a user wishes to work on; it might be all the people in the database, a subset based on ethnicity (all Europeans, all East Asians, ...), or a subset determined by disease.
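
For concreteness, the metadata could be modeled roughly as follows. This is a hypothetical C++ sketch; the type names and fields are assumptions for illustration, not the project's actual code.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// One imported gVCF file, holding calls for one or more samples.
struct Dataset {
    std::string name;
    std::vector<std::string> samples;  // names of the samples in the file
};

// A named list of samples a user wishes to work on, e.g. everyone in the
// database, an ethnicity-based subset, or a disease cohort.
struct SampleSet {
    std::string name;
    std::set<std::string> samples;
};

// The metadata as a whole: small and in-memory, pointing at the bulk
// variant-call data stored in the database.
struct Metadata {
    std::map<std::string, Dataset> datasets;       // keyed by dataset name
    std::set<std::string> samples;                 // all known sample names
    std::map<std::string, SampleSet> sample_sets;  // keyed by sample-set name
};
```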

To allow concurrency, we create two metadata sets:

  1. Active Metadata (AMD): holds the names of all datasets and samples currently undergoing import.
  2. Committed Metadata (CMD): complete samples and datasets that are persisted on disk.

Queries consult only the CMD. Import operations temporarily add to the AMD, and when done, move the new metadata to the CMD.
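
A minimal sketch of the two-set scheme, reusing the hypothetical Metadata type above and assuming a single coarse mutex guards both sets:

```cpp
#include <mutex>

// Both metadata sets behind one lock (the real design could use
// finer-grained locking; this is the simplest arrangement).
struct MetadataStore {
    std::mutex lock;  // taken exclusively for the brief metadata operations
    Metadata amd;     // Active Metadata: imports currently in flight
    Metadata cmd;     // Committed Metadata: durable, and visible to queries
};
// Queries read only cmd; imports reserve entries in amd and move
// them to cmd on completion.
```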

The import procedure can be divided into several parts:

Verify metadata and prepare

  • Exclusively lock the AMD and CMD
  • Verify that the sample(s) and dataset(s) do not already appear in either (the full check-and-reserve sequence is sketched below)
  • Add them to AMD
  • Release locks
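
A sketch of the check-and-reserve sequence, continuing the hypothetical MetadataStore above. The lock is held only for these brief metadata operations, never during the bulk load itself:

```cpp
#include <mutex>

// Returns false (rejecting the import) if the dataset or any of its
// samples already exists, whether committed or mid-import.
bool reserve_import(MetadataStore& store, const Dataset& ds) {
    std::lock_guard<std::mutex> guard(store.lock);  // exclusive lock on AMD+CMD

    // Verify the dataset does not already appear in either metadata set.
    if (store.amd.datasets.count(ds.name) || store.cmd.datasets.count(ds.name)) {
        return false;
    }
    // Verify that none of the samples already appear.
    for (const auto& s : ds.samples) {
        if (store.amd.samples.count(s) || store.cmd.samples.count(s)) {
            return false;
        }
    }
    // Reserve: record the in-flight import in the Active Metadata.
    store.amd.datasets[ds.name] = ds;
    store.amd.samples.insert(ds.samples.begin(), ds.samples.end());
    return true;  // lock released when guard goes out of scope
}
```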

Bulk insert

  • Insert the bulk data. A batched update must fit in memory, and a gVCF file can contain gigabytes of compressed data, so we do not use batched updates; each record is written individually (see the sketch after this list).
  • Perform sanity checks on the gVCF file as it is loaded, and roll back in case of error (future work).
  • A design goal is to allow streaming upload, which requires a single pass over the data.
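
A sketch of the streaming insert. The page says only "the DB"; a RocksDB backend and the GvcfReader interface are assumptions here. Each record is written as soon as it is parsed, with no write batch, so memory use stays bounded regardless of file size:

```cpp
#include <string>
#include <rocksdb/db.h>

// Hypothetical record source (a real loader would wrap an htslib-based
// gVCF parser); yields one call record per key/value pair.
struct GvcfReader {
    virtual bool next(std::string& key, std::string& value) = 0;  // false at EOF
    virtual ~GvcfReader() = default;
};

// Single-pass streaming insert: suitable for streaming upload because it
// never needs to revisit earlier records.
rocksdb::Status bulk_insert(rocksdb::DB* db, GvcfReader& reader) {
    std::string key, value;
    while (reader.next(key, value)) {
        // Sanity checks on the record would go here (e.g. records arriving
        // in sorted order); on failure, stop and leave cleanup to the
        // recovery procedure described below.
        rocksdb::Status s = db->Put(rocksdb::WriteOptions(), key, value);
        if (!s.ok()) return s;
    }
    return rocksdb::Status::OK();
}
```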

Update metadata

  • Exclusively lock the AMD and CMD
  • Update the metadata in memory and in the DB
  • Use a write-batch in the DB to ensure atomicity (see the sketch below)
  • Release locks
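
A sketch of the commit step, again assuming RocksDB and the hypothetical MetadataStore; the key layout is made up for illustration. All metadata keys go into one WriteBatch, so they become durable atomically:

```cpp
#include <mutex>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

rocksdb::Status commit_metadata(MetadataStore& store, rocksdb::DB* db,
                                const Dataset& ds) {
    std::lock_guard<std::mutex> guard(store.lock);  // exclusive lock on AMD+CMD

    // Stage every metadata update in a single atomic batch.
    rocksdb::WriteBatch batch;
    std::string sample_list;
    for (const auto& s : ds.samples) {
        sample_list += s + "\n";
        batch.Put("sample/" + s, ds.name);
    }
    batch.Put("dataset/" + ds.name, sample_list);
    rocksdb::Status status = db->Write(rocksdb::WriteOptions(), &batch);
    if (!status.ok()) {
        return status;  // on failure, the in-memory metadata is left unchanged
    }

    // Mirror the commit in memory: move the entries from AMD to CMD.
    store.cmd.datasets[ds.name] = ds;
    store.cmd.samples.insert(ds.samples.begin(), ds.samples.end());
    store.amd.datasets.erase(ds.name);
    for (const auto& s : ds.samples) {
        store.amd.samples.erase(s);
    }
    return status;  // lock released when guard goes out of scope
}
```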

Rollback and recovery issues

There are two scenarios that could cause a gVCF import to fail: (1) power failure, and (2) user abort. Either could leave wreckage in the DB in the form of key-value pairs belonging to partly loaded samples. We do not currently handle this case, although we plan to in the future.

A potential algorithm for the power-failure case is to roll back all the AMD operations (sketched in code after the list below). This requires the AMD to be persistent:

  • Exclusively lock AMD
  • Remove from the DB all keys that belong to datasets and samples appearing in the AMD
  • Remove the new datasets and samples from the AMD
  • Release locks
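
A sketch of this recovery procedure, under the same assumptions as above plus two more: the AMD has been persisted and reloaded, and every bulk key is prefixed by its dataset's name, so that a prefix scan finds all the wreckage:

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <rocksdb/db.h>

void recover_after_failure(MetadataStore& store, rocksdb::DB* db) {
    std::lock_guard<std::mutex> guard(store.lock);  // exclusive lock on AMD

    for (const auto& entry : store.amd.datasets) {
        // Delete every bulk key belonging to the partially loaded dataset
        // (hypothetical key layout: "data/<dataset>/<record>").
        const std::string prefix = "data/" + entry.first + "/";
        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
             it->Next()) {
            db->Delete(rocksdb::WriteOptions(), it->key());
        }
    }
    // Remove the new datasets and samples from the AMD.
    store.amd.datasets.clear();
    store.amd.samples.clear();
}
```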