Some Things You Should Know as an FDB Developer
The FoundationDB project started in 2009, when there was no good async programming framework or memory management library. As a result, FDB invented its own, and these come with many nuances that can trip you up if you are not aware of them.
The following assumes you are already familiar with Flow.
Memory management relies heavily on the Arena abstraction. An Arena owns memory blocks that typically back MutationRef, VectorRef, and the other *Ref types. For example, a StringRef object is usually a pointer and a length into one of the blocks the Arena owns. Thus, it is very important that the Arena stays alive while these *Ref objects are accessed.
Arena is an internally reference-counted object. Thus, if you declare Arena b = a;, then both a and b share the memory blocks that a owned before. Even if a goes out of scope, these memory blocks remain valid. Alternatively, you can call b.dependsOn(a), so that b also holds references to the memory blocks a owns at that point. A caveat is that memory allocated by a afterward may NOT be reference counted by b, because the new memory block is only referenced by a, not b.
If you are curious about the implementation, here is the explanation:
// For an arena A, one would expect that B.dependsOn(A) would extend A's lifecycle to B's lifecycle.
// However, this is not quite true. Internally, arenas are implemented as linked lists, or
//
// A.impl = ArenaBlock -> ArenaBlock -> ...
//
// where each ArenaBlock is a piece of allocated memory managed by reference counting. Applying
// B.dependsOn(A) causes B to refer to the first ArenaBlock in A, or
//
// B--------|
//          V
// A.impl = ArenaBlock -> ArenaBlock -> ...
//
// so when A destructs, there is still at least one extra reference to the first ArenaBlock,
// preventing the release of the chained blocks.
// However, when additional memory is requested from A, A will create additional ArenaBlocks (see
// Arena.cpp:ArenaBlock::create), and the new ArenaBlock might be inserted *prior* to the
// existing ArenaBlock, i.e.
//
// B----------------------|
//                        V
// A.impl = ArenaBlock -> ArenaBlock -> ArenaBlock ...
//          ^- newly created
//
// and when A is destructed, the unreferenced block (here the newly created first one) is destructed.
// This causes B to depend on only part of A. The only way to ensure all blocks in A have
// the same life cycle as B is to call B.dependsOn(A) when A is stable, i.e., when no
// extra ArenaBlocks will be created.
FDB implements its own version of FlatBuffers, a serialization format originally invented by Google. FlatBuffers allows accessing serialized data without parsing/unpacking.
struct GetKeyValuesReply : public LoadBalancedReply {
	constexpr static FileIdentifier file_identifier = 1783066;

	Arena arena;
	VectorRef<KeyValueRef, VecSerStrategy::String> data;
	Version version; // useful when latestVersion was requested
	bool more;
	bool cached = false;

	GetKeyValuesReply() : version(invalidVersion), more(false), cached(false) {}

	template <class Ar>
	void serialize(Ar& ar) {
		serializer(ar, LoadBalancedReply::penalty, LoadBalancedReply::error, data, version, more, cached, arena);
	}
};
The above is an example of using FlatBuffers. The function serialize(), surprisingly, does both serialization and deserialization! This is because:
template <class Archive, class Item, class... Items>
typename Archive::WRITER& serializer(Archive& ar, const Item& item, const Items&... items) {
	save(ar, item);
	if constexpr (sizeof...(Items) > 0) {
		serializer(ar, items...);
	}
	return ar;
}

template <class Archive, class Item, class... Items>
typename Archive::READER& serializer(Archive& ar, Item& item, Items&... items) {
	load(ar, item);
	if constexpr (sizeof...(Items) > 0) {
		serializer(ar, items...);
	}
	return ar;
}
The "Reader" (e.g., BinaryReader, ObjectReader, ArenaReader) calls load(ar, item) to deserialize all items, while the "Writer" (e.g., BinaryWriter, ObjectWriter) calls save(ar, item) to serialize all data items.
Sometimes we want serialization and deserialization to behave differently according to the value of the data items. For instance, a single-key clear MutationRef can be compressed by sending the key only once, not twice (key and keyAfter(key)). To achieve this, the serialize() function below checks whether we are serializing (ar.isSerializing) or deserializing (ar.isDeserializing):
struct MutationRef {
	uint8_t type;
	StringRef param1, param2;

	template <class Ar>
	void serialize(Ar& ar) {
		if (ar.isSerializing && type == ClearRange && equalsKeyAfter(param1, param2)) {
			StringRef empty;
			serializer(ar, type, param2, empty);
		} else {
			serializer(ar, type, param1, param2);
		}
		if (ar.isDeserializing && type == ClearRange && param2 == StringRef() && param1 != StringRef()) {
			ASSERT(param1[param1.size() - 1] == '\x00');
			param2 = param1;
			param1 = param2.substr(0, param2.size() - 1);
		}
	}
};
FlatBuffers has a requirement that new fields should be added to the end of the list. During the 7.1.4 release process, we encountered an interesting bug: suddenly the Python and Java clients could no longer issue API calls to a 7.1.2 cluster. On the other hand, the same client using the 7.1.4 libfdb_c.so worked perfectly fine with a 7.1.4 cluster. After several hours, a group of engineers noticed an incompatible change:
struct OpenDatabaseCoordRequest {
...
void serialize(Ar& ar) {
- serializer(ar, issues, supportedVersions, traceLogGroup, knownClientInfoID, clusterKey, coordinators, reply);
+ serializer(ar,
+ issues,
+ supportedVersions,
+ traceLogGroup,
+ knownClientInfoID,
+ clusterKey,
+ hostnames,
+ coordinators,
+ reply);
}
Note hostnames is inserted into the middle of the list of items. This change breaks compatibility between the client and the server, because the 7.1.2 server would interpret hostnames as coordinators: FlatBuffers uses the relative position of items for serialization. The fix is to move hostnames to the end of the list.
A side note: within a major version, such as 6.3 or 7.1, FDB binaries should be compatible, or in FDB's terms, protocol compatible. In this way, a client using the 6.3.9 C library can connect to a 6.3.24 cluster without any issues, and the server cluster can upgrade to new minor release versions without requiring a client C library upgrade.
A surprising fact is that BinaryReader has an internal Arena object, which is used in BinaryReader::arenaRead():
const uint8_t* arenaRead(int bytes) {
	// Reads and returns the next bytes.
	// The returned pointer has the lifetime of this.arena()
	// Could be implemented zero-copy if [begin,end) was in this.arena() already; for now is a copy
	if (!bytes)
		return nullptr;
	uint8_t* dat = new (arena()) uint8_t[bytes];
	serializeBytes(dat, bytes);
	return dat;
}
Note arenaRead() allocates memory with new (arena()) uint8_t[bytes], and then copies bytes into the newly allocated memory. This is important to know, because for StringRef, its serialization (defined in Arena.h) is:
template <class Archive>
inline void load(Archive& ar, StringRef& value) {
	uint32_t length;
	ar >> length;
	value = StringRef(ar.arenaRead(length), length);
}
So if a BinaryReader is used to deserialize a StringRef, then the returned value's life cycle is the same as the BinaryReader object's. As a result, if the BinaryReader object goes out of scope, a MutationRef deserialized by it will point to invalid memory:
MutationRef m;
{
	BinaryReader reader(serialized, IncludeVersion());
	reader >> m;
}
// m.param1 now points to invalid memory
AsyncVar::set() compares the new value with its internal state and only sets it if they are different:
void set(V const& v) {
	if (v != value)
		setUnconditional(v);
}
This usually works. However, if the != operator for the value v is overloaded, there can be unintended consequences. For example, ClientDBInfo overloads the operator as:
bool operator!=(ClientDBInfo const& r) const { return id != r.id; }
And in extractClientInfo(), we have code like:
ClientDBInfo ni = db->get().client;
shrinkProxyList(ni, lastCommitProxyUIDs, lastCommitProxies, lastGrvProxyUIDs, lastGrvProxies);
info->set(ni);
This looks correct, but is wrong! The reason is that shrinkProxyList() modifies the ni object but does not change the id field. As a result, even if ni has changed, info->set(ni) has no effect! This bug caused infinite GRV retries on the client side. Specifically, the version vector feature introduced a change where the client compares the returned proxy ID with its known set of GRV proxies and retries the GRV if the returned proxy ID is not in the set. Due to the above bug, the GRV returned by a proxy was not within the client's set, because the change had not been applied to the "info". The fix is in PR #6877.
Typically in the code, we have:
try {
	...
} catch (Error& e) {
	if (e.code() == error_code_actor_cancelled)
		throw;
	...
}
The idea is that for the actor_cancelled error, we should always throw to kill the actor. This behavior differs from that for other errors, which may be caught, after which execution goes back to the try block again.
To understand the actor_cancelled error, we need to look at Future::cancel():
class Future {
	...
	void cancel() {
		if (sav)
			sav->cancel();
	}
};

struct SAV : private Callback<T>, FastAllocated<SAV<T>> {
	...
	virtual void cancel() {}
};
Recall that a Future object references an actor state. Future::cancel() tries to cancel the actor by calling the virtual function sav->cancel(). Depending on the actual actor implementation, its cancel() function is generated code like:
void cancel() override {
	auto wait_state = this->actor_wait_state;
	this->actor_wait_state = -1;
	switch (wait_state) {
	case 1: this->a_callback_error((ActorCallback< AddMutationActor, 0, Void >*)0, actor_cancelled()); break;
	case 2: this->a_callback_error((ActorCallback< AddMutationActor, 1, Void >*)0, actor_cancelled()); break;
	case 3: this->a_callback_error((ActorCallback< AddMutationActor, 2, Void >*)0, actor_cancelled()); break;
	case 4: this->a_callback_error((ActorCallback< AddMutationActor, 3, Void >*)0, actor_cancelled()); break;
	}
}
It does two things: 1) sets the actor state to -1, i.e., the cancelled state; and 2) cancels all the callbacks by sending the actor_cancelled error. The second part is important: it means the code automatically cancels all actors created inside this actor, and such cancellation is performed recursively. What a convenience! See this doc.
The fdbserver runs almost everything in a single "network" thread (sometimes called the "main" thread in the code). In particular, all Flow-related things, i.e., actors, run on the network thread. There are other threads for less critical tasks like profiling, printing stack traces, etc., if you are curious.
TODOs: flow lock, actor's synchronous callback invocation
- Promise.send() synchronously invokes callbacks. See the storage server's FII and PR 6924.
- A backtrace can be inaccurate; use gdb to find the precise stack trace.
- holdWhile() from https://github.com/apple/foundationdb/pull/7230/: keep a non-state object alive before passing it to an actor; otherwise the object can be freed while the actor accesses it.
- Can a Reference<X> be serialized? In general, Reference<X> can't be serialized. serializeReplicationPolicy gives an example of how a particular Reference type can be serialized, but generally it isn't supported or recommended.
- What's the difference between request_maybe_delivered and broken_promise, and in what situations are they each returned? They are two completely different things. The first means that a connection failed while you tried sending a request, i.e., a network issue. The second means that all promises for some future were destroyed. A broken_promise is also sent over the network if a ReplyPromise is destroyed on the server side, i.e., a logical error not related to the network.
- Joshua says timeout, but gives no seed? Find the seed in the Joshua output (just not the test file). Then you can rerun it by unpacking the correctness package somewhere and running JOSHUA_SEED=3826190815342414232 mono bin/TestHarness.exe joshua-run "/path/to/old/binaries/if/needed" false (replace whatever the seed is).