
A storage server persists its storage server interface under \xff/serverList/ when starting up. DD scans that part of the keyspace when it starts up to read the set of existing storage servers, and periodically re-scans it as well.

updateStorage() persists data to disk

byteSample

About 1 in every 200 keys is sampled and kept in the \xff\xff keyspace. Data distribution uses this sample to make shard-movement decisions. In 6.3, Xin added APIs to get the sample data.

DD sends a WaitMetricsRequest to the SS with a min and max (trackShardMetrics() calls waitStorageMetrics()), both initialized to 0. The SS knows the actual value, which falls outside the [min, max] range, so it sends a reply. Now DD knows the actual number and immediately sends another request with that value -10% as min and +10% as max. Whenever the actual size moves outside the range, the SS sends another reply and DD re-arms the request around the new value. Essentially DD tracks per-shard size this way.
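
A minimal, self-contained sketch of that bracketing loop is below. Everything here (waitMetrics(), the simulated shard growth) is an illustrative placeholder, not the actual FDB actors or RPC types.

```cpp
// Sketch of DD bracketing a shard's size with WaitMetrics-style requests.
// The "SS side" is simulated by a counter that keeps growing.
#include <cstdint>
#include <iostream>

static int64_t actualShardBytes = 0; // pretend shard size on the SS

// Simulated SS: block (here, loop) until the shard size leaves [minBytes, maxBytes],
// then reply with the current value.
int64_t waitMetrics(int64_t minBytes, int64_t maxBytes) {
    while (actualShardBytes >= minBytes && actualShardBytes <= maxBytes)
        actualShardBytes += 1'000'000; // writes keep arriving
    return actualShardBytes;
}

int main() {
    int64_t minBytes = 0, maxBytes = 0; // first request: any size is out of range
    for (int i = 0; i < 5; i++) {
        int64_t bytes = waitMetrics(minBytes, maxBytes); // DD learns the actual size
        std::cout << "shard is now " << bytes << " bytes\n";
        // Re-arm with a +/-10% band around the value just reported.
        minBytes = bytes - bytes / 10;
        maxBytes = bytes + bytes / 10;
    }
}
```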

SplitMetricsRequest: given a key range, the SS returns split keys. The complication is that the range may span more than one SS.
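
As a rough illustration of the splitting idea (not the actual SplitMetrics code, which also accounts for write bandwidth and other limits), split keys can be chosen by walking the byte sample and cutting whenever the accumulated estimated bytes reach the target shard size:

```cpp
// Walk a (key -> sampledSize) byte sample and emit a split key whenever the
// accumulated estimate reaches the per-shard target. Illustrative only.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::vector<std::string> pickSplitKeys(const std::map<std::string, int64_t>& sample,
                                       int64_t targetBytesPerShard) {
    std::vector<std::string> splits;
    int64_t accumulated = 0;
    for (const auto& [key, sampledSize] : sample) {
        accumulated += sampledSize;
        if (accumulated >= targetBytesPerShard) {
            splits.push_back(key); // a new shard would start at this sampled key
            accumulated = 0;
        }
    }
    return splits;
}

int main() {
    std::map<std::string, int64_t> sample = {
        { "apple", 40'000'000 }, { "grape", 30'000'000 },
        { "mango", 50'000'000 }, { "peach", 45'000'000 },
    };
    for (const auto& k : pickSplitKeys(sample, 60'000'000))
        std::cout << "split at " << k << "\n"; // grape, peach
}
```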

// The sample will consist of (key, sampled_size) pairs, where sampled_size is
// an estimate of the size of the values associated with that key. These sizes
// are used to estimate the size of key ranges and determine split points.
//
// It's assumed that there's some overhead involved in the sample,
// BYTE_SAMPLING_OVERHEAD, which defaults to 100 bytes per entry.
//
// The rough goal is for the sample size to be a fixed fraction of the total
// size of all keys and values, 1/BYTE_SAMPLING_FACTOR, which defaults to 1/250.
// This includes the overhead, mentioned above.
//
// NOTE: These BYTE_SAMPLING_FACTOR and BYTE_SAMPLING_OVERHEAD knobs can't be
// changed after a database has been created. Data which has already been
// sampled can't be resampled, and the estimates of the size of key ranges
// implicitly include these constants.
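
A sketch of how size-proportional sampling with those defaults could work follows; the hashing and exact formula below are simplifications, not FDB's implementation.

```cpp
// Size-proportional byte sampling, assuming overhead = 100 bytes/entry and a
// target sample fraction of 1/250 (the defaults quoted above). Illustrative only.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

struct SampleResult { bool inSample; int64_t sampledSize; };

SampleResult sampleKeyValue(const std::string& key, const std::string& value) {
    const double overhead = 100.0; // BYTE_SAMPLING_OVERHEAD
    const double factor = 250.0;   // BYTE_SAMPLING_FACTOR
    double size = key.size() + value.size();

    // A stored sample entry costs roughly key.size() + overhead bytes, so choosing
    // probability = size / ((key.size() + overhead) * factor) makes the expected
    // sample cost per key-value pair about size / 250.
    double probability = size / ((key.size() + overhead) * factor);

    // Deterministic draw derived from the key (std::hash is a stand-in; FDB hashes
    // the key so all replicas of a shard sample the same keys).
    double draw = (std::hash<std::string>{}(key) % 1000003) / 1000003.0;

    SampleResult r;
    r.inSample = draw < probability;
    // Scale the recorded size by 1/probability so sums over the sample are an
    // unbiased estimate of the actual bytes in a range.
    r.sampledSize = static_cast<int64_t>(size / std::min(1.0, probability));
    return r;
}

int main() {
    auto r = sampleKeyValue("some-key", std::string(100000, 'x'));
    std::cout << "inSample=" << r.inSample << " sampledSize=" << r.sampledSize << "\n";
}
```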

checkBehind

Related to watches. If an SS is far behind, the watch fires immediately so that the client re-issues the watch to a different SS.

update(): pulling data from tlogs

Parse each message twice:

  • first pass: look for shard changes (metadata mutations) and gather eager reads: atomic ops prefetch the keys' existing values, and clear ranges are made as large as possible (reason?) -- the clear is expanded up to the next key in the DB, essentially widening the range at the cost of an extra read.

  • second pass: apply the mutation data (it is persisted to disk later by updateStorage()); a simplified two-pass sketch follows.
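
A toy, self-contained version of the two-pass idea, assuming a std::map store and made-up mutation types (not FDB's structures):

```cpp
// Pass 1 collects "eager reads": the old values needed by atomic ops, and the next
// existing key after each clear range (so the clear can be expanded). Pass 2 then
// applies the mutations using those prefetched results.
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

enum class Type { Add, ClearRange };
struct Mutation { Type type; std::string key, endKey; int64_t addend = 0; };

int main() {
    std::map<std::string, std::string> store = { {"a", "1"}, {"b", "2"}, {"c", "3"}, {"m", "9"} };
    std::vector<Mutation> batch = {
        { Type::Add, "m", "", 5 },        // atomic add needs m's old value
        { Type::ClearRange, "a", "bb" },  // clear [a, bb)
    };

    // Pass 1: gather eager reads (no mutations applied yet).
    std::map<std::string, std::optional<std::string>> oldValues;
    for (auto& m : batch) {
        if (m.type == Type::Add) {
            auto it = store.find(m.key);
            oldValues[m.key] = (it == store.end()) ? std::nullopt : std::optional(it->second);
        } else {
            // Expand the clear's end up to the next existing key at or after it.
            auto next = store.lower_bound(m.endKey);
            if (next != store.end()) m.endKey = next->first; // "bb" becomes "c"
        }
    }

    // Pass 2: apply mutations using the prefetched values.
    for (const auto& m : batch) {
        if (m.type == Type::Add) {
            int64_t old = oldValues[m.key] ? std::stoll(*oldValues[m.key]) : 0;
            store[m.key] = std::to_string(old + m.addend);
        } else {
            store.erase(store.lower_bound(m.key), store.lower_bound(m.endKey));
        }
    }
    for (const auto& [k, v] : store) std::cout << k << "=" << v << "\n"; // c=3 m=14
}
```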

MoveKey (Evan's DD Part1)

Step 1: startMoveKeys()

Proxy applyMetadataMutation → SS applyPrivateMutation → addShard

  • AddingShard builds mutations in memory

fetchKeys(): fetchComplete.send(Void())

On the SS, after the fetch, it needs to wait out the MVCC window (5s) before persisting the data.

FetchInjectionInfo: a fetched shard's data needs to be injected into update()'s regular batch of versions. fk.send(&fii) resumes the rest of fetchKeys(), which adds the fetched batch data to the beginning of the version data.
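
A toy illustration of the injection (the struct below is a drastically simplified stand-in for the real FetchInjectionInfo, and the version numbers are made up):

```cpp
// fetchKeys hands its deferred mutations to update() through a shared struct, and
// update() applies them at the front of the version batch it is about to process.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct VersionedMutation { uint64_t version; std::string key, value; };

struct FetchInjectionInfo {
    std::vector<VersionedMutation> injected; // filled in by fetchKeys()
};

void applyBatch(std::vector<VersionedMutation> batch, const FetchInjectionInfo& fii) {
    // Injected mutations carry the transferredVersion, which is <= every version
    // in the batch, so prepending keeps the batch sorted by version.
    batch.insert(batch.begin(), fii.injected.begin(), fii.injected.end());
    for (const auto& m : batch)
        std::cout << "apply v" << m.version << " " << m.key << "=" << m.value << "\n";
}

int main() {
    FetchInjectionInfo fii;
    fii.injected = { { 100, "fetched-key", "fetched-value" } }; // transferredVersion = 100
    applyBatch({ { 101, "k1", "a" }, { 102, "k2", "b" } }, fii);
}
```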

AddingShard phase transitions, done by fetchKeys() actor (i.e., AddingShard::fetchClient):

  • WaitPrevious
  • Fetching: fetches data before fetchVersion and writes it to storage, then lets the updater know it is ready to apply the deferred updates. Specifically, tryGetRange() or getRangeStream() is called to fetch data for the key range, and it throws end_of_stream once all data has been fetched. In the catch block, if only partial data was fetched, this fetchKeys() actor finishes committing the keys [keys.begin,nfk) that were already fetched. The remaining unfetched keys [nfk,keys.end) become a separate AddingShard with its own fetchKeys(). Mutations in shard->updates are updated to include only those within the finished range; the others are discarded because the new shard is in the WaitPrevious phase (it hasn't chosen a fetchVersion yet).
  • Waiting: sends the updater the deferred updates and waits until they are durable.
    • Wait until the durable version is at least storageVersion + 1.
    • Create a FetchInjectionInfo object and wait for the update() actor to receive a new batch of versions from the tlog, but before eager reads take place. Choose shard->transferredVersion to be <= any version in the batch, i.e., a version that can still be mutated in versionedData (the PTree) and has not been committed/persisted to storage. Insert shard->updates into the FetchInjectionInfo object at the transferredVersion. The lie about their versions is acceptable because this shard will never be read at versions < transferredVersion. Eager reads are applied to the injected mutations by update(), and the mutations come back through AddingShard::addMutations and are applied to the versionedMap and mutationLog as normal. The key range is then marked available (setAvailableStatus()).
    • Wait for the transferredVersion to be durable.
  • Then the shard's state is changed from adding to readWrite.

During the Fetching phase, AddingShard::updates saves newer mutations whose versions are greater than or equal to fetchClient's fetchVersion while the shard is still busy catching up via fetchClient. These updates are applied after the fetch completes.

ShardInfo: a shard is in exactly one of 3 mutually exclusive states: adding, readWrite, and notAssigned.
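
A rough, illustrative sketch of these pieces (the real ShardInfo/AddingShard in storageserver.actor.cpp are far more involved):

```cpp
// Sketch of the shard states and the AddingShard buffering described above.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Mutation { uint64_t version; std::string key, value; };

struct AddingShard {
    enum Phase { WaitPrevious, Fetching, Waiting } phase = WaitPrevious;
    uint64_t fetchVersion = 0;     // chosen when Fetching starts
    std::vector<Mutation> updates; // deferred mutations to replay later

    // While the fetch is catching up, newer mutations (version >= fetchVersion)
    // are buffered instead of being applied immediately.
    void addMutation(const Mutation& m) {
        if (phase == WaitPrevious) return;          // no fetchVersion yet; dropped
        if (m.version >= fetchVersion) updates.push_back(m);
    }
};

// A shard is in exactly one of three states.
struct ShardInfo {
    enum State { NotAssigned, Adding, ReadWrite } state = NotAssigned;
    AddingShard adding; // only meaningful while state == Adding
};

int main() {
    ShardInfo shard;
    shard.state = ShardInfo::Adding;
    shard.adding.phase = AddingShard::Fetching;
    shard.adding.fetchVersion = 100;
    shard.adding.addMutation({ 90, "k", "old" });  // below fetchVersion: ignored
    shard.adding.addMutation({ 120, "k", "new" }); // deferred for later replay
    std::cout << "deferred mutations: " << shard.adding.updates.size() << "\n";
}
```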

Step 2: waitForShardReady()

Step 3: finishMoveKeys(): waits for the shard to become readable.

Checking "Lost Writes" on SS

From: https://forums.foundationdb.org/t/internal-error-keyvaluestoresqlite-actor-cpp/3860/9

Blocks on disk have checksums to prevent corruption, but this cannot protect against “lost writes”, where a page update that was fsync()'d is nevertheless not applied, so the page is still in its previous state when later read back from disk. Checksums do not help here because the old state of the page had a valid checksum; the page was just never updated to the new content.

A lost write can cause BTree invariant violations which are the type of errors you are seeing. A lost write can also cause stale data to be returned from the SQLite tree without it detecting an invariant violation internally. In the fragment error you saw, SQLite itself did not detect an issue and so it returned a bad result, but this broke an invariant in the FDB KV modeling layer on top of the SQLite tree which adds record fragmenting to avoid a high slack space issue with SQLite’s structure. If the record were not fragmented, the error would not have been detected by the KV modeling layer and a bad result would have been returned from the storage engine.

There is an incredible amount of simulated and real world testing and at-scale usage of the SQLite storage engine, and its code in FDB and the underlying SQLite code have not changed meaningfully in many years. A lost write can cause all of the issues you have seen, and is the most likely cause.

The good news is, we’ve encountered badly behaving disks before and so there is an FDB feature you can turn on which will prove that this is what is happening.

If you set this option in your FDB conf file in the fdbserver section

knob_page_write_checksum_history = 50000000

then FDB will track written block checksums within a process’s lifetime for the first 50,000,000 4k blocks of space in your data files, and use these to verify the checksums of any of those blocks when they are read back from disk later after cache eviction. If a mismatch is found, it will log an AsyncFileLostWriteDetected event. The memory overhead of this feature is approximately the history size setting * 8 bytes, so around 400MB of memory to track 50 million blocks, which would cover all blocks of a 200GB data file.
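
A quick back-of-the-envelope check of those numbers (8 bytes per tracked block is the approximation given above):

```cpp
// Back-of-the-envelope check of the memory overhead and file coverage quoted above.
#include <cstdint>
#include <iostream>

int main() {
    const int64_t historyEntries = 50'000'000; // knob_page_write_checksum_history
    const int64_t bytesPerEntry = 8;           // approx. per-entry tracking cost
    const int64_t blockSize = 4096;            // 4k blocks

    std::cout << "memory overhead ~ "
              << historyEntries * bytesPerEntry / (1024.0 * 1024.0) << " MiB\n";      // ~381 MiB (~400 MB)
    std::cout << "covered file size ~ "
              << historyEntries * blockSize / (1024.0 * 1024.0 * 1024.0) << " GiB\n"; // ~191 GiB (~200 GB)
}
```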