DD and SS
A storage server persists its storage server interface under \xff/serverList/ when starting up. DD scans that part of the keyspace to read the set of existing storage servers when it starts up (and periodically re-scans it in case it changed).
Roughly 1 in every 200 keys is sampled and kept in the \xff\xff keyspace. This byte sample is used by data distribution to make shard movement decisions. In 6.3, Xin added APIs to get the sample data.
DD sends a WaitMetricsRequest to the SS with a min and max (trackShardMetrics() calls waitStorageMetrics()), both initialized to 0. The SS knows the actual value, which is outside the [min, max] range, so it sends a reply. Now DD knows the actual number and immediately sends another request with the actual size -10% as min and +10% as max. Whenever the actual size drifts outside the range, the SS sends a reply back and DD re-issues the request with a new window. Essentially DD tracks per-shard size this way.
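A sketch of that tracking loop as it might look on the DD side (the types and the waitOnSS callback are simplified stand-ins, not the actual Flow code):

```cpp
#include <cstdint>
#include <functional>

// Simplified stand-ins for the real request/reply types (assumptions for illustration).
struct WaitMetricsRequest { int64_t min = 0; int64_t max = 0; };
struct StorageMetricsReply { int64_t bytes = 0; };

// waitOnSS stands in for sending a WaitMetricsRequest to the SS; it is assumed to
// block until the shard's actual size falls outside [req.min, req.max] and return it.
void trackShardSize(const std::function<StorageMetricsReply(WaitMetricsRequest)>& waitOnSS) {
	// Start with [0, 0]: the actual size is outside this window, so the SS replies immediately.
	WaitMetricsRequest req;
	while (true) {
		int64_t size = waitOnSS(req).bytes; // SS replied: the size has left the window
		req.min = size - size / 10;         // new window: actual size -10% ...
		req.max = size + size / 10;         // ... to actual size +10%
	}
}
```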
SplitMetricsRequest: given a key range, the SS returns split keys. The complication is that the range may span more than one SS.
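A minimal sketch of deriving split keys from the byte sample, assuming the single-SS case and hypothetical names (a real implementation has to stitch together answers from every SS that owns part of the range):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Byte sample for one SS: key -> sampledSize (estimated bytes represented by that key).
using ByteSample = std::map<std::string, int64_t>;

// Walk the sample within [begin, end) and emit a split key every time the accumulated
// estimated size reaches splitSize, so consecutive split keys bound roughly equal-sized chunks.
std::vector<std::string> splitRange(const ByteSample& sample,
                                    const std::string& begin,
                                    const std::string& end,
                                    int64_t splitSize) {
	std::vector<std::string> splitKeys;
	int64_t accumulated = 0;
	for (auto it = sample.lower_bound(begin); it != sample.end() && it->first < end; ++it) {
		accumulated += it->second;
		if (accumulated >= splitSize) {
			splitKeys.push_back(it->first); // estimated bytes since the last split ~= splitSize
			accumulated = 0;
		}
	}
	return splitKeys;
}
```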
// The sample will consist of (key, sampled_size) pairs, where sampled_size is
// an estimate of the size of the values associated with that key. These sizes
// are used to estimate the size of key ranges and determine split points.
//
// It's assumed that there's some overhead involved in the sample,
// BYTE_SAMPLING_OVERHEAD, which defaults to 100 bytes per entry.
//
// The rough goal is for the sample size to be a fixed fraction of the total
// size of all keys and values, 1/BYTE_SAMPLING_FACTOR, which defaults to 1/250.
// This includes the overhead, mentioned above.
//
// NOTE: This BYTE_SAMPLING_FACTOR and BYTE_SAMPLING_OVERHEAD knobs can't be
// changed after a database has been created. Data which has been already
// sampled can't be resampled, and the estimates of the size of key ranges
// implicitly includes these constants.
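A minimal sketch of size-proportional sampling consistent with that comment (the probability formula here is illustrative, not necessarily FDB's exact code):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Defaults as described in the comment above.
constexpr double BYTE_SAMPLING_FACTOR = 250.0;   // target: sample ~1/250 of total bytes
constexpr double BYTE_SAMPLING_OVERHEAD = 100.0; // assumed per-entry overhead in the sample

struct SampleDecision {
	bool inSample;      // whether this key enters the byte sample
	double sampledSize; // size credited to the key if sampled (an unbiased estimate)
};

// Size-proportional sampling: larger key/value pairs are more likely to be kept, and
// each kept pair is weighted by 1/probability, so summing sampledSize over a range
// estimates the range's true byte size. A real implementation would hash the key
// deterministically instead of taking an external uniform01 draw, so every replica
// makes the same decision for the same key.
SampleDecision sampleKeyValue(const std::string& key, const std::string& value, double uniform01) {
	double kvSize = key.size() + value.size();
	double probability = kvSize / (key.size() + BYTE_SAMPLING_OVERHEAD) / BYTE_SAMPLING_FACTOR;
	probability = std::min(probability, 1.0);
	return { uniform01 < probability, kvSize / probability };
}
```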
Related to watches: if an SS is too far behind, the watch fires immediately so that the client re-issues it to a different SS.
Parse each message twice:
- First pass: look for shard changes (metadata mutations) and perform eager reads: atomic ops prefetch the keys they touch, and clear ranges are expanded as large as possible -- a clear is extended up to the next key in the DB, essentially expanding the range and adding a read (see the sketch below).
- Second pass: update the data to disk.
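A hedged sketch of the clear-expansion idea, using a plain ordered map as a stand-in for the on-disk key set:

```cpp
#include <map>
#include <string>
#include <utility>

// Stand-in for the on-disk key set (assumption for illustration).
using Storage = std::map<std::string, std::string>;

// Expand a clear range [begin, end) so its end lands on the next key that actually
// exists at or after `end`. The expanded range clears the same data (the gap contains
// no keys), but adjacent clears can now merge, keeping the versioned data compact.
// This lookup is the "eager read" the first pass performs.
std::pair<std::string, std::string> expandClear(const Storage& storage,
                                                std::string begin,
                                                std::string end) {
	auto it = storage.lower_bound(end); // eager read: the next key at or after end
	if (it != storage.end())
		end = it->first;                // push the clear's end up to that key
	return { std::move(begin), end };
}
```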
Proxy applyMetadataMutation() => SS applyPrivateMutation() => addShard()
- AddingShard builds mutations in memory
fetchKeys(): fetchComplete.send(Void()). On the SS, after the fetch, it needs to wait out the MVCC window (5s) before the data can be persisted.
FetchInjectionInfo: a fetched shard needs to be injected into the normal update path. update() receives a vanilla batch from the tlog; fk.send(&fii) lets fetchKeys() inject the rest of its work, i.e., adding a batch of data to the beginning of the version data.
AddingShard phase transitions, done by the fetchKeys() actor (i.e., AddingShard::fetchClient):
- WaitPrevious
- Fetching: fetches data before fetchVersion and writes it to storage, then lets the updater know it is ready to apply the deferred updates. Specifically, a tryGetRange() or getRangeStream() is called to fetch data for the key range, which throws end_of_stream once all data has been fetched. In the catch block, if only partial data was fetched, this fetchKeys() actor finishes committing the keys [keys.begin, nfk) that were already fetched. The remaining unfetched keys [nfk, keys.end) become a separate AddingShard with its own fetchKeys(). Mutations in shard->updates are trimmed to include only those within the finished range; the others are discarded because the new shard is in the WaitPrevious phase (it hasn't chosen a fetchVersion yet).
- Waiting: sends the updater the deferred updates, and waits until they are durable.
  - Wait until the durable version is at least storageVersion + 1.
  - Create a FetchInjectionInfo object and wait for the update() actor to receive a new batch of versions from the tlog, but before eager reads take place. Choose shard->transferredVersion to be <= any version in the batch, a version that can still be mutated in versionedData (PTree?) and has not been committed/persisted to storage. Insert shard->updates into the FetchInjectionInfo object at transferredVersion (see the sketch below). The lie about their versions is acceptable because this shard will never be read at versions < transferredVersion. Eager reads will be applied to the injected mutations by update(), and the mutations will come back through AddingShard::addMutations and be applied to the versionedMap and mutationLog as normal. The key range is then marked available (setAvailableStatus()).
  - Wait for the transferredVersion to be durable.
- The shard's state is then changed from adding to readWrite.
During the Fetching phase, AddingShard::updates saves newer mutations whose versions are greater than or equal to fetchClient's fetchVersion, while the shard is still busy catching up with fetchClient. It applies these updates after fetching completes.
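A rough sketch of the injection step described above (the types and injectFetchedShard are simplified assumptions, not the actual Flow code):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

using Version = int64_t;
struct MutationRef { std::string type, param1, param2; };
struct VerUpdateRef { Version version = 0; std::vector<MutationRef> mutations; };

// Simplified stand-in for FetchInjectionInfo: the updates fetchKeys() wants update()
// to apply at the front of the batch it is currently processing.
struct FetchInjectionInfo {
	std::vector<VerUpdateRef> changes;
};

// Conceptually what happens after fk.send(&fii): the buffered shard->updates are bumped
// to at least transferredVersion and prepended, so update() runs eager reads on them and
// applies them to the versioned map and mutation log like any other mutations. Bumping
// their versions is safe because the shard is never read at versions < transferredVersion.
void injectFetchedShard(FetchInjectionInfo& fii,
                        Version transferredVersion,
                        std::vector<VerUpdateRef> bufferedShardUpdates) {
	for (auto& vu : bufferedShardUpdates)
		vu.version = std::max(vu.version, transferredVersion);
	fii.changes.insert(fii.changes.begin(),
	                   std::make_move_iterator(bufferedShardUpdates.begin()),
	                   std::make_move_iterator(bufferedShardUpdates.end()));
}
```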
ShardInfo: a shard has 3 mutually exclusive states: adding, readWrite and notAssigned.
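One way to picture those states (a simplified sketch, not necessarily the actual ShardInfo layout):

```cpp
#include <memory>

struct AddingShard {};   // placeholder: a fetch is in progress for this key range
struct StorageServer {}; // placeholder: the server owning a fully available range

// The three states are mutually exclusive: notAssigned when neither field is set,
// adding while a fetch is in progress, readWrite once the range is fully available.
struct ShardInfo {
	std::unique_ptr<AddingShard> adding; // non-null => adding
	StorageServer* readWrite = nullptr;  // non-null => readWrite

	bool isAdding() const { return adding != nullptr; }
	bool isReadWrite() const { return readWrite != nullptr; }
	bool notAssigned() const { return !adding && !readWrite; }
};
```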
From: https://forums.foundationdb.org/t/internal-error-keyvaluestoresqlite-actor-cpp/3860/9
Blocks on disk have checksums to prevent corruption, but this cannot protect from “lost writes”, where a page update which is fsync()'d remains at its previous state instead when later read back from disk. Checksums do not help here because the old state of the page had a valid checksum, the page was just never updated to the new content.
A lost write can cause BTree invariant violations which are the type of errors you are seeing. A lost write can also cause stale data to be returned from the SQLite tree without it detecting an invariant violation internally. In the fragment error you saw, SQLite itself did not detect an issue and so it returned a bad result, but this broke an invariant in the FDB KV modeling layer on top of the SQLite tree which adds record fragmenting to avoid a high slack space issue with SQLite’s structure. If the record were not fragmented, the error would not have been detected by the KV modeling layer and a bad result would have been returned from the storage engine.
There is an incredible amount of simulated and real world testing and at-scale usage of the SQLite storage engine, and its code in FDB and the underlying SQLite code have not changed meaningfully in many years. A lost write can cause all of the issues you have seen, and is the most likely cause.
The good news is, we’ve encountered badly behaving disks before and so there is an FDB feature you can turn on which will prove that this is what is happening.
If you set this option in your FDB conf file in the fdbserver section
knob_page_write_checksum_history = 50000000
then FDB will track written block checksums within a process's lifetime for the first 50,000,000 4k blocks of space in your data files, and use these to verify any of those blocks when they are read back from disk later after cache eviction. If a mismatch is found, it will log an AsyncFileLostWriteDetected
event. The memory overhead of this feature is approximately the history size setting * 8, so around 400MB of memory to track 50 million blocks which would cover all blocks of a 200GB data file.
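The bookkeeping behind this is conceptually a per-block checksum table; a hedged sketch of the idea (not the fdbserver implementation):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Remember the checksum most recently written for each 4KiB block (by block index)
// and compare it when the block is read back after cache eviction. At ~8 bytes per
// tracked block, 50,000,000 entries is about 400MB and covers a 200GB file.
class WriteChecksumHistory {
public:
	explicit WriteChecksumHistory(size_t maxBlocks) : history_(maxBlocks, 0) {}

	void onWrite(size_t blockIndex, uint64_t checksum) {
		if (blockIndex < history_.size())
			history_[blockIndex] = checksum;
	}

	// Returns false (a "lost write" detected) if the block was written before but the
	// checksum read back does not match what was last written to it.
	bool onRead(size_t blockIndex, uint64_t checksum) const {
		if (blockIndex >= history_.size() || history_[blockIndex] == 0)
			return true; // block not tracked or never written in this process's lifetime
		if (history_[blockIndex] != checksum) {
			std::fprintf(stderr, "lost write detected at block %zu\n", blockIndex);
			return false;
		}
		return true;
	}

private:
	std::vector<uint64_t> history_; // one slot per 4KiB block index; 0 = not yet written
};
```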