zstor
is an object encoding storage system. It can be run in either a
daemon - client setup, or it can perform single actions without an
associated daemon, which is mainly useful for uploading/retrieving
single items. The daemon is part of the same binary, and will run other
useful features, such as a repair queue which periodically verifies the
integrity of objects.
Zstor uses 0-db's to store the data. It does so by splitting up the data in chunks and distributing them over N 0-db's.
C4Component
title Zstor setup
Component(zstor, "Zstor instance")
Deployment_Node(zerodbgroup0,"0-db group", ""){
System(zerodb1,"0-db 1")
System(zerodb2,"0-db 2")
}
Deployment_Node(zerodbgroup1,"0-db group", ""){
System(zerodbx,"0-db ...")
System(zerodbn,"0-db N")
}
Rel(zstor, zerodb1, "")
Rel(zstor, zerodb2, "")
Rel(zstor, zerodbx, "")
Rel(zstor, zerodbn, "")
Zstor uses forward looking error correcting codes (FLECC) for data consistency and to protect against data loss.
This means zstor constantly tries to spread the data over N, being the expected_shards, 0-db's.
As long as there are M (minimal_shards), M being smaller than N off course, chunks of data intact, zstor can recover the data.
Currently, zstor
expects a stable system to start from, which is user
provided:
zstor
has a redundancy configuration which introduces the notion ofgroups
: a group is a list of 0-db's which have an inherent larger risk of going down together. For example, grid 0-db's which are deployed on the same farm.
The daemon, or monitor, can be started by invoking zstor
with the
monitor
subcommand. This starts a long running process, and opens up a
unix socket on the path specified in the config. Regular command
invocations (example "store") of zstor
will then read the path to the
unix socket from the config, connect to it, send the command, and wait
until the monitor daemon returns a response after executing the command.
This setup is recommended as:
- It exposes optional metrics for prometheus to scrape.
- Only a single upload/download of a file happens at once, meaning you won't burn out your whole cpu by sending multiple upload commands in quick succession.
If the socket path is not specified, zstor
will fall back to its
single command flow, where it executes the command in process, and then
exits. Invoking zstor
multiple times in quick succession might cause
multiple uploads to be performed at the same time, causing multiple cpu
cores to be used for the encryption/compression.
Store
data in multiple chunks on zdb backends, according to a given policyRetrieve
said data, using just the path and the metadata store. Zdbs can be removed, as long as sufficient are left to recover the data.Rebuild
the data, loading existing data (as long as sufficient zdbs are left), reencoding it, and storing it in (new) zdbs according to the current configCheck
a file, returning a 16 byteblake2b
checksum (in hex) if it is present in the backend (by fetching it from the metastore).Status
: get statistics about backends
- Config file hot reloading. If the config file is editted,
SIGUSR1
can be send to a running 0-stor process to reload the configuration. Only the backends in the new configuration are used. Data is not rebuild on these new backends as long as the old backends are still operational - Monitoring of active 0-db backends. An active backend is considered a backend that is tracked in the config, which has sufficient space left to write new blocks.
- Repair queue: periodically, all 0-db's used are checked, to see if the are still online. If a 0-db is unreachable, all objects which have a chunk stored on that 0-db will be rebuild on fully healthy 0-db's.
- Prometheus metrics. The metrics server is bound to all interfaces, on
the port specified in the config. The path is
/metrics
. If no port is set in the config, the metrics server won't be enabled.
Make sure you have the latest Rust stable installed. Clone the repository:
git clone https://github.com/threefoldtech/0-stor_v2
cd 0-stor_v2
Then build with the standard toolchain through cargo:
cargo build
This will produce the executable in ./target/debug/zstor_v2
.
On linux, a fully static binary can be compiled by using the x86_64-unknown-linux-musl
target, as follows:
cargo build --target x86_64-unknown-linux-musl --release
Running zstor
requires a config file. An example config, and
explanation of the parameters is found below.
minimal_shards = 10
expected_shards = 15
redundant_groups = 1
redundant_nodes = 1
root = "/virtualroot"
socket = "/tmp/zstor.sock"
prometheus_port = 9100
zdb_data_dir_path = "/tmp/0-db/data"
max_zdb_data_dir_size = 25600
[encryption]
algorithm = "AES"
key = "0000000000000000000000000000000000000000000000000000000000000000"
[compression]
algorithm = "snappy"
[meta]
type = "zdb"
[meta.config]
prefix = "someprefix"
[meta.config.encryption]
algorithm = "AES"
key = "0101010101010101010101010101010101010101010101010101010101010101"
[[meta.config.backends]]
address = "[2a02:1802:5e::dead:beef]:9900"
namespace = "test2"
password = "supersecretpass"
[[meta.config.backends]]
address = "[2a02:1802:5e::dead:beef]:9901"
namespace = "test2"
password = "supersecretpass"
[[meta.config.backends]]
address = "[2a02:1802:5e::dead:beef]:9902"
namespace = "test2"
password = "supersecretpass"
[[meta.config.backends]]
address = "[2a02:1802:5e::dead:beef]:9903"
namespace = "test2"
password = "supersecretpass"
[[groups]]
[[groups.backends]]
address = "[fe80::1]:9900"
[[groups.backends]]
address = "[fe80::1]:9900"
namespace = "test"
[[groups]]
[[groups.backends]]
address = "[2a02:1802:5e::dead:babe]:9900"
[[groups.backends]]
address = "[2a02:1802:5e::dead:beef]:9900"
namespace = "test2"
password = "supersecretpass"
minimal_shards
: The minimum amount of shards which are needed to recover the original data.expected_shards
: The amount of shards which are generated when the data is encoded. Essentially, this is the amount of shards which is needed to be able to recover the data, and some disposable shards which could be lost. The amount of disposable shards can be calculated asexpected_shards - minimal_shards
.redundant_groups
: The amount of groups which one should be able to loose while still being able to recover the original data.redundant_nodes
: The amount of nodes that can be lost in every group while still being able to recover the original data.root
: virtual root on the filesystem to use, this path will be removed from all files saved. If a file path is loaded, the path will be interpreted as relative to this directorysocket
: Optional path to a unix socket. This socket is required in case zstor needs to run in daemon mode. If this is present, zstor invocations will first try to connect to the socket. If it is not found, the command is run in-process, else it is encoded and send to the socket so the daemon can process it.zdb_data_dir_path
: Optional path to the local 0-db data file directory. If set, it will be monitored and kept within the size limits. This is primarily used when 0-stor is running as part of a QSFS deployment. In this case, a 0-db-fs instance is running, which is using a local 0-db as read/write cache. When this option is set, the size of this cache is monitored, and if needed the least recently accessed files are removed.max_zdb_data_dir_size
: Maximum size of the data dir in MiB, if this is set and the sum of the file sizes in the data dir gets higher than this value, the least used, already encoded file will be removed.zdbfs_mountpoint
: Optional path of a 0-db-fs mount. If present, a syscall will be executed periodically to retrieve file system statistics, which will then be exposed through the build-in prometheus server.prometheus_port
: An optional port on which prometheus metrics will be exposed. If this is not set, the metrics will not get exposed.encryption
: configuration to use for the encryption stage. Currently onlyAES
is supported. The encryptionkey
is 32 random bytes in hexadecimal form.compression
: configuration to use for the compression stage. Currently onlysnappy
is supportedmeta
: configuration for the metadata store to use, currently onlyzdb
is supportedgroups
: The backend groups to write the data to.
Explanation:
When data is encoded, metadata is generated to later retrieve this data. The metadata is stored in 4 0-dbs, with a given prefix.
For every file, we get the full path of the file on the system, generate a 16 byte blake2b hash, and hex encode the bytes. We then append this to the prefix to generate the final key.
The key structure is: /{prefix}/meta/{hashed_path_hex}
The metadata itself is encrypted, binary encoded, and then dispersed in the metadata 0-dbs.
Since the metadata is also encoded before being stored, we need to know the used
encoding to be able to decode again. Since we can't store metadata about metadata
itself, this is a static setup. As said, at present there are 4 metadata storage
0-db's defined. Since the key is defined by the system, these must be run in user
mode. At the moment, it is not possible to define more metadata stores as can be
done with regular data stores.
The actual metadata is encoded in a 2:2 setup, that is, 2 data shards and 2 parity shards. This allows up to 2 (i.e. half) of the metadata stores to be lost, while still retaining access to the data. Any 2 stores can be used to recover the data, there is no specific difference between them.
Because the system is designed to prioritize recoverability over availability, writers will be rejected if the metadata storage is in the degraded state, that is, not all 4 stores are available and writeable.
A metadata store can be replaced by a new one, by removing the old one in the config and inserting the new one. The repair subsystem will take care of rebulding the data, regenerating the shards, and storing the new shards on the new metatada store.