Bill Katz edited this page Mar 4, 2016 · 23 revisions

Planned and Existing Features for DVID:

Distributed operation: Once a DVID repo is created and loaded with data, it can be pushed to remote sites, optionally restricted to a region of interest (ROI), and it can also be pulled. Each DVID server chooses how much of the data set is held locally.

Status: Repo push with optional data instance specification was added in September 2014. See published one-column repo. Production push/pull operations will be added once core data types (particularly 64-bit label handling) are stable.

Versioning: Each version of a DVID repo corresponds to a node in a version DAG (Directed Acyclic Graph). Each version is identified by a UUID that can be generated locally yet is globally unique. Versioning and distribution follow patterns similar to distributed version control systems like git and mercurial. Provenance is kept in the DAG.

Status: Versioning is currently used for FlyEM production tasks. Conflict-free merging (versions are disjoint at key-value pair level) has been implemented but not thoroughly tested as of July 2015.
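
The version DAG and the disjoint-key merge rule can be illustrated with a minimal sketch. This is not DVID's implementation: `VersionNode`, `branch`, `ancestors`, and `merge` are hypothetical names, and each node here stores only the key-value pairs changed in that version.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class VersionNode:
    """One version in the DAG (hypothetical structure, for illustration only)."""
    parents: list = field(default_factory=list)
    data: dict = field(default_factory=dict)  # key-value pairs changed in this version
    # A random UUID is generated locally but is unique globally in practice.
    version_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def branch(self):
        """Create a child version that descends from this one."""
        return VersionNode(parents=[self])

    def ancestors(self):
        """Return the IDs of every version reachable through parent links."""
        seen, stack = set(), list(self.parents)
        while stack:
            node = stack.pop()
            if node.version_id not in seen:
                seen.add(node.version_id)
                stack.extend(node.parents)
        return seen


def merge(a, b):
    """Merge two versions whose changed keys are disjoint; refuse otherwise."""
    overlap = a.data.keys() & b.data.keys()
    if overlap:
        raise ValueError(f"conflicting keys: {sorted(overlap)}")
    return VersionNode(parents=[a, b], data={**a.data, **b.data})


root = VersionNode()
left, right = root.branch(), root.branch()
left.data["block:0,0,0"] = b"edited at site A"
right.data["block:1,0,0"] = b"edited at site B"
merged = merge(left, right)
```

The merged node has two parents, so walking `ancestors()` from it recovers the provenance of both branches back to the root.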

Denormalized Views: For any node in the version DAG, we can choose to create denormalized views that accelerate particular access patterns. For example, imagetile (quadtree) data can be created for XY, XZ, and YZ orthogonal views, or sparse volumes can compactly describe a neuron. The extra denormalized data is kept in the datastore until a node is archived, which removes all denormalized key-value pairs associated with that version node. Views of the same data will be eventually consistent.

Status: Multi-scale 2d images in XY, XZ, and YZ, as well as sparse volumes, are implemented. Multi-scale 3d is planned with no set timeline. Rapid label surfaces were implemented but are deprecated until the integrated 64-bit label type is completed. A pub/sub framework for syncing is currently used for label voxels and sparse volume synchronization. We are also looking at other methods of guaranteeing synced data using either provably eventually consistent denormalization or distributed transactions as in CockroachDB.
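
To make the idea of a sparse volume as a denormalized view concrete, here is a small sketch that derives a run-length description of one label from a dense label array. Run-length encoding along X is an assumption chosen for illustration, and `sparse_volume` is a hypothetical helper, not a DVID function.

```python
def sparse_volume(labels, target):
    """Return runs (z, y, x_start, length) covering every voxel equal to target.

    `labels` is a nested list indexed as labels[z][y][x].
    """
    runs = []
    for z, plane in enumerate(labels):
        for y, row in enumerate(plane):
            start = None
            for x, value in enumerate(row):
                if value == target and start is None:
                    start = x  # a run begins here
                elif value != target and start is not None:
                    runs.append((z, y, start, x - start))
                    start = None  # the run ended at x - 1
            if start is not None:  # the run reaches the end of the row
                runs.append((z, y, start, len(row) - start))
    return runs


volume = [
    [[0, 7, 7, 7],
     [0, 0, 7, 0]],
    [[7, 7, 0, 7],
     [0, 0, 0, 0]],
]
runs = sparse_volume(volume, target=7)
# runs == [(0, 0, 1, 3), (0, 1, 2, 1), (1, 0, 0, 2), (1, 0, 3, 1)]
```

The runs carry the same information as the label voxels for that one body, which is why such a view can be discarded on archive and rebuilt later.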

Flexible Data Types: DVID provides a well-defined interface to data type code that can be easily added by users. A DVID server provides HTTP and RPC APIs, authentication, authorization, versioning, provenance, and storage engines. It delegates datatype-specific commands and processing to data type code. As long as a DVID type can return data for its implemented commands, we don’t care how it’s implemented.

Status: A variety of voxel types, tiles, labels, label graph, label-aware annotations, key-value, and ROI have been implemented. As an example of a simple proxy datatype, googlevoxels proxies requests between DVID and the Google BrainMaps API, taking care of OAuth2 authentication within the datatype implementation. A FUSE interface for the key-value type works but is not heavily used. Lightweight authentication and authorization support is planned, using something like password-less tokens.
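
The delegation pattern described above can be sketched as a server that owns the shared services and hands each request to pluggable data type code. DVID itself is written in Go and its real interface differs; `DataType`, `KeyValue`, `Server`, and the `put`/`get` commands below are hypothetical names used only to show the shape of the contract.

```python
from abc import ABC, abstractmethod


class DataType(ABC):
    """Hypothetical contract between the server and data type code."""

    @abstractmethod
    def handle(self, command, payload):
        """Run a datatype-specific command and return its result."""


class KeyValue(DataType):
    """Toy key-value data type backed by a dict."""

    def __init__(self):
        self._store = {}

    def handle(self, command, payload):
        if command == "put":
            key, value = payload
            self._store[key] = value
            return None
        if command == "get":
            return self._store.get(payload)
        raise ValueError(f"unsupported command: {command}")


class Server:
    """Owns the shared services and delegates the rest to data type code."""

    def __init__(self):
        self._instances = {}

    def register(self, instance_name, datatype):
        self._instances[instance_name] = datatype

    def request(self, instance_name, command, payload=None):
        # Authentication, versioning, and storage selection would happen here.
        return self._instances[instance_name].handle(command, payload)


server = Server()
server.register("annotations", KeyValue())
server.request("annotations", "put", ("neuron-12", "mushroom body"))
result = server.request("annotations", "get", "neuron-12")
```

The server never inspects how `handle` produces its answer, so a proxy type that forwards requests to a remote service would fit the same contract.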

Scalable Storage Engine: Although DVID may support polyglot persistence (i.e., allow use of relational, graph, or NoSQL databases), we are initially focused on key-value stores. DVID has an abstract key-value interface to its swappable storage engine. We chose a key-value interface because (1) there are a large number of high-performance, open-source implementations that run from embedded to clustered systems, (2) the surface area of the API is very small, even after adding important cases like bulk loads or sequential key read/write, and (3) novel technology tends to match key-value interfaces, e.g., Seagate's Kinetic Open Storage Platform. As storage becomes more log structured, the key-value API becomes a more natural fit.
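
As a rough illustration of how small such an API surface can be, the sketch below defines an ordered key-value interface with point operations plus a sequential range read, and one in-memory engine behind it. The names `OrderedKeyValueStore` and `MemoryStore` and the key layout are assumptions for this example, not DVID's actual storage interface.

```python
import bisect
from abc import ABC, abstractmethod


class OrderedKeyValueStore(ABC):
    """Hypothetical minimal engine interface: point ops plus an ordered range read."""

    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def delete(self, key): ...

    @abstractmethod
    def range(self, start, end): ...


class MemoryStore(OrderedKeyValueStore):
    """In-memory engine; any other engine could be swapped in behind the same API."""

    def __init__(self):
        self._keys = []    # kept sorted so range reads are sequential
        self._values = {}

    def get(self, key):
        return self._values.get(key)

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def delete(self, key):
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def range(self, start, end):
        """Yield (key, value) pairs with start <= key < end in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = MemoryStore()
for k in (b"blk/0003", b"blk/0001", b"meta/x", b"blk/0002"):
    store.put(k, k.upper())
store.delete(b"blk/0002")
blocks = list(store.range(b"blk/", b"blk0"))  # every key with the prefix b"blk/"
```

Because callers depend only on the four methods, replacing `MemoryStore` with an embedded or clustered engine would not change the data type code that uses it.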

Auto-migration of newly committed data from the mutable to the immutable store and concurrent reads of mutable/immutable data are on the roadmap.

A key part of the DVID vision is the flexibility to choose storage engines and trade off speed, storage capacity, and cost. By focusing on key-value stores, we have a variety of solutions.

Figure: Spectrum of key-value stores

Status: Currently built with Basho-tuned leveldb. Other leveldb variants have been tested successfully in the past: Google's open source version and HyperLevelDB.

Google Cloud BigTable support was added by Ignacio Tartavull but has not been thoroughly tested. A Google Cloud Storage (similar to Amazon S3) backend was added by Steve Plaza and is currently used for DVID Spark Services. Use of a petabyte-capable immutable store (MongoDB for ordered indexing + Scality for object store) is being tested at Janelia.

RocksDB and/or ForestDB support is planned. In the past, Lightning MDB and also experimental use of Bolt were tested, although neither was tuned to work as well as the leveldb variants.

As of Oct 2015, DVID allows specification of different mutable, immutable, and metadata storage engines with datatype authors able to tap into each. Direct support of Seagate Kinetic drives via their protobuf protocol will be done if their platform reaches sufficient viability.
