Skip to content

Some Things You Should Know as an FDB Developer

Jingyu Zhou edited this page May 24, 2022 · 18 revisions

FoundationDB project starts in 2009, when there is not a good async programming framework and memory management library. As a result, FDB invents its own, which has many nuances that may trip you if you are not aware.

The following is assuming you are already familiar with flow.

Arena Memory Management

The memory management uses Arena abstraction a lot, which owns some memory blocks, typically for MutationRef, VectorRef these *Ref types. For example, a StringRef object is usually a pointer and a length to one of the blocks Arena owns. Thus, it's very important that the Arena is live while accessing these *Ref objects.

Arena is internal reference counted objects. Thus, if you declare Arena b = a;, then both a and b share the same memory blocks that a owns before. Even if a goes out of scope, these memory blocks are still valid.

Alternatively, you can have b.dependsOn(a), so that b also holds references to the memory blocks a owns before. A caveat is that memory allocated by a afterward may NOT be reference counted by b. This is because the new memory block is only referenced by a, not b.

If you are curious about the implementation, here is the explanation:

// For an arena A, one would expect using B.dependsOn(A) would extent A's lifecycle to B's lifecycle.
// However this is not quite true. Internally arenas are implemented as linked lists, or
//
//  A.impl =      ArenaBlock -> ArenaBlock -> ...
//
// where each ArenaBlock is a piece of allocated memory managed by reference counting. Applying
// B.dependsOn(A) would cause B refers to the first ArenaBlock in A, or
//
//            B----|
//                 V
//  A.impl =      ArenaBlock -> ArenaBlock -> ...
//
// so when A destructs, there is still at least one extra reference to the first ArenaBlock,
// preventing the release of the chained blocks.
// However, when additional memory is requested to A, A will create additional ArenaBlocks (see
// Arena.cpp:ArenaBlock::create), and the new ArenaBlock might be inserted *prior* to the
// existing ArenaBlock, i.e.
//                           B----|
//                                V
//  A.impl =      ArenaBlock -> ArenaBlock -> ArenaBlock ...
//                   ^- new created
// and when A is destructed, the unreferrenced (hereby the first one) will be destructed.
// This causes B depends on only part of A. The only way to ensure all blocks in A have
// the same life cycle to B is to let B.dependsOn(A) get called when A is stable, or no
// extra ArenaBlocks will be created.

Serialization with FlatBuffers

FDB implements its own FlatBuffers, originally invented by Google. FlatBuffers allow accessing serialized data without parsing/unpacking.

struct GetKeyValuesReply : public LoadBalancedReply {
	constexpr static FileIdentifier file_identifier = 1783066;
	Arena arena;
	VectorRef<KeyValueRef, VecSerStrategy::String> data;
	Version version; // useful when latestVersion was requested
	bool more;
	bool cached = false;

	GetKeyValuesReply() : version(invalidVersion), more(false), cached(false) {}

	template <class Ar>
	void serialize(Ar& ar) {
		serializer(ar, LoadBalancedReply::penalty, LoadBalancedReply::error, data, version, more, cached, arena);
	}
};

The above is an example of using FlatBuffers. The function serialize(), surprisingly, does both serialization and deserialization! This is because:

template <class Archive, class Item, class... Items>
typename Archive::WRITER& serializer(Archive& ar, const Item& item, const Items&... items) {
	save(ar, item);
	if constexpr (sizeof...(Items) > 0) {
		serializer(ar, items...);
	}
	return ar;
}

template <class Archive, class Item, class... Items>
typename Archive::READER& serializer(Archive& ar, Item& item, Items&... items) {
	load(ar, item);
	if constexpr (sizeof...(Items) > 0) {
		serializer(ar, items...);
	}
	return ar;
}

The "Reader" (e.g., BinaryReader, ObjectReader, ArenaReader) calls load(ar, item) to deserialize all items, while the "Writer" (e.g., BinaryWriter, ObjectWriter) calls save(ar, item) to serialize all data items.

Sometimes, we want the serialization and deserialization to behave differently according to the value of data items. For instance, a single key clear MutationRef can be compressed by sending only the key once, not twice (key and keyAfter(key)). To achieve this, the serialize() function below checks if we are serializing ar.isSerializing or deserializing ar.isDeserializing:

struct MutationRef {
	uint8_t type;
	StringRef param1, param2;

	template <class Ar>
	void serialize(Ar& ar) {
		if (ar.isSerializing && type == ClearRange && equalsKeyAfter(param1, param2)) {
			StringRef empty;
			serializer(ar, type, param2, empty);
		} else {
			serializer(ar, type, param1, param2);
		}
		if (ar.isDeserializing && type == ClearRange && param2 == StringRef() && param1 != StringRef()) {
			ASSERT(param1[param1.size() - 1] == '\x00');
			param2 = param1;
			param1 = param2.substr(0, param2.size() - 1);
		}
	}

Append new fields in serializer()

FlatBuffer has a requirement that new fields should be added to the end of the list. During the 7.1.4 release process, we encountered an interesting bug: suddenly the Python or Java clients can no longer issues API calls to the 7.1.2 cluster. On the other hand, the same client that uses the 7.1.4 libfdb_c.so can work perfectly fine with the 7.1.4 cluster. After several hours, a group of engineers noticed there were incompatible changes:

struct OpenDatabaseCoordRequest {
...
        void serialize(Ar& ar) {
-               serializer(ar, issues, supportedVersions, traceLogGroup, knownClientInfoID, clusterKey, coordinators, reply);
+               serializer(ar,
+                          issues,
+                          supportedVersions,
+                          traceLogGroup,
+                          knownClientInfoID,
+                          clusterKey,
+                          hostnames,
+                          coordinators,
+                          reply);
        }

Note hostnames is inserted into the middle of the list of items. This change breaks compatibility between the client and the server, because the 7.1.2 server would interpret hostnames as coordinators----flat buffer use the relative position of items for serialization. The fix is to move hostnames to the end of the list.

A side note is that within a major version, such as 6.3 or 7.1, FDB binaries should be compatible. More specifically using FDB's term, they should be protocol compatible. In this way, a client uses 6.3.9 C library can connect to 6.3.24 cluster without any issues. And the server cluster can upgrade to new minor release versions, without the upgrade requirement for the client C library.

BinaryReader has an Arena

A surprising fact is that BinaryReader has an internal Arena object, which is used in BinaryReader::arenaRead():

	const uint8_t* arenaRead(int bytes) {
		// Reads and returns the next bytes.
		// The returned pointer has the lifetime of this.arena()
		// Could be implemented zero-copy if [begin,end) was in this.arena() already; for now is a copy
		if (!bytes)
			return nullptr;
		uint8_t* dat = new (arena()) uint8_t[bytes];
		serializeBytes(dat, bytes);
		return dat;
	}

Note arenaRead() allocates memory new (arena()) uint8_t[bytes], and then copies bytes into the newly allocated memory. This important to know, because for StringRef, its serialization (defined in Arena.h ) is:

template <class Archive>
inline void load(Archive& ar, StringRef& value) {
	uint32_t length;
	ar >> length;
	value = StringRef(ar.arenaRead(length), length);
}

So if a BinaryReader is used to deserialize a StringRef, then the returned value's life cycle is the same as the BinaryReader object. As a result, if the BinaryReader object goes out of the scope, the MutationRef returned by the BinaryReader object will points to invalid memory:

MutationRef m;
  {
    BinaryReader reader(serialized, IncludeVersion());
    reader >> m;
  }
// m.param1 now points to invalid memory

AsyncVar::set() may not "set" the value

The AsyncVar::set() compares the value with its internal state and only set if they are different:

	void set(V const& v) {
		if (v != value)
			setUnconditional(v);
	}

This usually works. However, if the != operator for the value v is overloaded, there could be unintended consequences. For example, the ClientDBInfo has the overloaded operator as:

bool operator!=(ClientDBInfo const& r) const { return id != r.id; }

And in extractClientInfo(), we have code like

		ClientDBInfo ni = db->get().client;
		shrinkProxyList(ni, lastCommitProxyUIDs, lastCommitProxies, lastGrvProxyUIDs, lastGrvProxies);
		info->set(ni);

This looks correct, but is wrong! The reason is that shrinkProxyList() modifies the ni object, but does not change the id field. As a result, even if ni has changed, info->set(ni) has no effect! This bug has caused infinite retrying GRVs at the client side. Specifically, the version vector feature introduces a change that the client side compares the returned proxy ID with its known set of GRV proxies and will retry GRV if the returned proxy ID is not in the set. Due to the above bug, GRV returned by a proxy is not within the client set, because the change was not applied to the "info". The fix is in PR #6877.

error_code_actor_cancelled is not an Error

Typically in the code, we have:

try {
    ...
} catch (Error& e) {
	if (e.code() == error_code_actor_cancelled)
		throw;
	...
}

The idea is that for actor_cancelled() error, we should always throw to kill the actor. This behavior can be different from other errors, which may be caught and then go back to the try block again.

To understand actor_cancelled() error, we need to look at Future::cancel():

class Future {
...
	void cancel() {
		if (sav)
			sav->cancel();
	}
}

struct SAV : private Callback<T>, FastAllocated<SAV<T>> {
...
	virtual void cancel() {}
}

Recall that a Future object references an actor state. Future::cancel() tries to cancel the actor, by calling the virtual function sav->cancel(). Depending on the actual actor implementation, it's cancel() function is generated code like:

void cancel() override
{
        auto wait_state = this->actor_wait_state;
        this->actor_wait_state = -1;
        switch (wait_state) {
        case 1: this->a_callback_error((ActorCallback< AddMutationActor, 0, Void >*)0, actor_cancelled()); break;
        case 2: this->a_callback_error((ActorCallback< AddMutationActor, 1, Void >*)0, actor_cancelled()); break;
        case 3: this->a_callback_error((ActorCallback< AddMutationActor, 2, Void >*)0, actor_cancelled()); break;
        case 4: this->a_callback_error((ActorCallback< AddMutationActor, 3, Void >*)0, actor_cancelled()); break;
        }
}

It does two things: 1) set the actor state to -1, i.e., cancelled state; and 2) cancel all the callbacks by sending actor_cancelled() error. The second part is important, which means the code automatically cancels all actors created inside this actor, and such cancellation is performed recursively. What a convenience!

Debugging Techniques

See this doc.

Use Valgrind/ASAN for FDB.

"Mostly" Single Threaded

The fdbserver runs almost everything in a single "network" thread (sometimes called "main" thread in the code). Particularly, all "flow" related stuff, i.e., actors, run on the network thread.

There are other threads for not so important stuff like profiling, printing stack traces, etc., if you are curious.

TODOs: flow lock, actor's synchronous callback invocation

  • Promise.send() synchronously invokes callbacks. See storage server's FII and PR 6924
  • backtrace can be inaccurate, use gdb to find the precise stack trace
  • holdWhile() from https://github.com/apple/foundationdb/pull/7230/: keep a non-state object live before passing it to an actor, otherwise the object can be freed when the actor accesses it.

FAQs

  • Can a Reference<X> be serialized? In general Reference<X> can't be serialized. serializeReplicationPolicy gives an example of how a particular Reference type can be serialized, but generally it isn't supported or recommended.

  • What's the difference between request_maybe_delivered and broken_promise. In what situations are they each returned? It’s two completely different things. The first means that a connection failed while you tried sending a request, i.e., a network issue. The second means that all promises for some future were destroyed. A broken_promise is also sent over the network if a ReplyPromise is destroyed on the server side, i.e., a logical error not related to network.

  • Joshua says timeout, but no seed. Find the seed in the Joshua output (just not the test file). Then you can rerun this by: unpack the correctness package somewhere. Run JOSHUA_SEED=3826190815342414232 mono bin/TestHarness.exe joshua-run "/path/to/old/binaries/if/needed" false (replace whatever the seed is).