Each RethinkDB process has a thread pool (see `arch/runtime/thread_pool.hpp`) with a fixed number of threads. Each of these threads runs an event loop that processes IO notifications from the OS and messages from other threads. The `arch/` directory contains primitives for timers, file and network IO, and so on in the context of the RethinkDB thread pool. The main event-loop threads never run blocking IO operations; instead, a separate pool of threads, the "blocker pool" (see `arch/io/blocker_pool.hpp`), is used for blocking operations.
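To make the pattern concrete, here is a minimal, self-contained sketch of the blocker-pool idea; the class and its interface are invented for illustration and are not RethinkDB's actual `blocker_pool_t`. An event-loop thread hands a blocking job plus a completion callback to a small pool of worker threads and returns immediately (in the real system the completion would be posted back to the event-loop thread as a message rather than run on the worker):

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative only: a pool of worker threads that runs blocking jobs so the
// event-loop threads never have to block themselves.
class blocker_pool {
public:
    explicit blocker_pool(int n_threads) {
        for (int i = 0; i < n_threads; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }
    ~blocker_pool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shutting_down_ = true;
        }
        cond_.notify_all();
        for (std::thread &t : workers_) t.join();
    }
    // Called from an event-loop thread: hand over a blocking job plus a
    // completion callback, and return immediately. (In RethinkDB the
    // completion would be delivered back to the event-loop thread as a
    // message; here it simply runs on the worker.)
    void do_blocking(std::function<void()> job, std::function<void()> done) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([job = std::move(job), done = std::move(done)] {
                job();
                done();
            });
        }
        cond_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return shutting_down_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutting down and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // the blocking work happens here, off the event loop
        }
    }
    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool shutting_down_ = false;
};

int main() {
    blocker_pool pool(2);
    pool.do_blocking(
        [] { std::this_thread::sleep_for(std::chrono::milliseconds(50)); },  // "blocking" IO
        [] { std::cout << "blocking job finished\n"; });
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}  // ~blocker_pool joins the worker threads
```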
RethinkDB makes heavy use of cooperatively scheduled coroutines; see `arch/runtime/coroutines.hpp`. The `concurrency/` directory contains many concurrency primitives designed for use with RethinkDB's coroutines. Coroutines can switch between threads by scheduling themselves to be resumed on another thread and then suspending themselves on the current thread.
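The following is a minimal illustration of that thread-hopping idea using standard C++20 coroutines; RethinkDB's coroutines are its own cooperatively scheduled implementation, so none of these names come from the codebase. Awaiting the awaiter suspends the coroutine on the current thread and resumes it on another thread:

```cpp
// Compile with -std=c++20. A sketch of "suspend here, resume on another
// thread", not RethinkDB's coroutine API.
#include <chrono>
#include <coroutine>
#include <iostream>
#include <thread>

struct resume_on_new_thread {
    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) const {
        std::thread([h] { h.resume(); }).detach();  // resume elsewhere
    }
    void await_resume() const noexcept {}
};

struct task {
    struct promise_type {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

task example() {
    std::cout << "before hop, thread " << std::this_thread::get_id() << "\n";
    co_await resume_on_new_thread{};  // suspend on this thread...
    std::cout << "after hop, thread " << std::this_thread::get_id() << "\n";  // ...resume on another
}

int main() {
    example();
    std::this_thread::sleep_for(std::chrono::seconds(1));  // let the hop finish
}
```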
Each RethinkDB server in a cluster maintains a TCP connection to every other server in the cluster. The messages sent across these TCP connections mostly fit into two basic paradigms: "mailboxes" and the "directory". A mailbox is an object that is created dynamically on a specific server. It has an address, which is serializable; if one server has the address of a mailbox on another server, it can send messages to that mailbox. This works well for point-to-point communication but not for service discovery. The directory consists of a set of values that each server broadcasts to every other server it's connected to. Typically a server will set up a mailbox to process a certain type of request and then send that mailbox's address over the directory so other servers can discover it. These mailbox addresses in the directory are often called "business cards". See `rpc/README.md` for more information about RethinkDB's networking primitives.
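Here is a toy, in-process model of the mailbox/directory/business-card pattern. The types and the `make_mailbox()`/`send()` helpers are invented for illustration; in RethinkDB the address is serializable and messages travel over the inter-server TCP connections rather than through a direct function call:

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// A serializable mailbox address: which server, and which mailbox on it.
struct mailbox_addr_t {
    int server_id;
    int mailbox_id;
};

// Registry of live mailboxes, standing in for the real message delivery path.
std::map<std::pair<int, int>, std::function<void(const std::string &)>> mailboxes;

mailbox_addr_t make_mailbox(int server_id, int mailbox_id,
                            std::function<void(const std::string &)> handler) {
    mailboxes[{server_id, mailbox_id}] = std::move(handler);
    return {server_id, mailbox_id};
}

void send(const mailbox_addr_t &addr, const std::string &msg) {
    mailboxes[{addr.server_id, addr.mailbox_id}](msg);
}

// A "business card" advertises the mailbox address for some service; each
// server broadcasts its business cards through the directory.
struct echo_business_card_t {
    mailbox_addr_t request_addr;
};
std::map<int, echo_business_card_t> directory;  // server id -> business card

int main() {
    // Server 2 sets up a request mailbox and publishes it in the directory.
    mailbox_addr_t addr = make_mailbox(2, 1, [](const std::string &msg) {
        std::cout << "server 2 got: " << msg << "\n";
    });
    directory[2] = echo_business_card_t{addr};

    // Server 1 discovers the service through the directory and sends to it.
    send(directory[2].request_addr, "hello");
}
```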
The `main()` function is defined in `main.cc`, but it quickly transfers control to `clustering/administration/main/command_line.cc`. This file sets up the thread pool and event loop, then starts an instance of `do_serve()` (defined in `clustering/administration/main/serve.cc`) in a coroutine inside the new thread pool. `do_serve()` is responsible for setting up and tearing down all of the important components of the RethinkDB server in the right order, and for passing them pointers to each other as necessary.
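As a rough sketch of that setup/teardown discipline (with invented component names, not the real classes from `serve.cc`): components are constructed as locals in dependency order and handed pointers to what they need, so scope exit tears everything down in reverse order.

```cpp
#include <iostream>

struct connectivity_t {
    connectivity_t() { std::cout << "connectivity up\n"; }
    ~connectivity_t() { std::cout << "connectivity down\n"; }
};

struct query_server_t {
    explicit query_server_t(connectivity_t *conn) : conn_(conn) {
        std::cout << "query server up\n";
    }
    ~query_server_t() { std::cout << "query server down\n"; }
    connectivity_t *conn_;
};

void do_serve_sketch() {
    connectivity_t connectivity;                 // built first...
    query_server_t query_server(&connectivity);  // ...then the things that depend on it
    // ...serve until shutdown is requested...
}   // ...torn down in reverse order: query server, then connectivity

int main() { do_serve_sketch(); }
```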
One of the objects constructed by `do_serve()` is an `rdb_query_server_t` instance (see `rdb_protocol/query_server.hpp`), which listens for incoming TCP connections from client drivers. When it receives a message, `ql::compile_term()` (see `rdb_protocol/term.hpp`) converts the message from the client into a tree of `ql::term_t`s. The `rdb_protocol/terms/` directory contains many subclasses of `ql::term_t` for the different ReQL terms. Each subclass has an `eval()` method that does the actual work of executing the term.
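A toy model of that evaluation scheme, with invented classes rather than the real `ql::term_t` hierarchy: each term is a node holding its child terms, and calling `eval()` on the root recursively evaluates the tree.

```cpp
#include <iostream>
#include <memory>
#include <vector>

struct term_t {
    virtual ~term_t() = default;
    virtual double eval() const = 0;  // each subclass does its own work here
};

struct datum_term_t : term_t {
    explicit datum_term_t(double v) : value(v) {}
    double eval() const override { return value; }
    double value;
};

struct add_term_t : term_t {
    explicit add_term_t(std::vector<std::unique_ptr<term_t>> a) : args(std::move(a)) {}
    double eval() const override {
        double sum = 0;
        for (const auto &arg : args) sum += arg->eval();  // evaluate children
        return sum;
    }
    std::vector<std::unique_ptr<term_t>> args;
};

int main() {
    // Roughly what compiling r.add(1, 2, 3) might produce in this toy model.
    std::vector<std::unique_ptr<term_t>> args;
    args.push_back(std::make_unique<datum_term_t>(1));
    args.push_back(std::make_unique<datum_term_t>(2));
    args.push_back(std::make_unique<datum_term_t>(3));
    add_term_t root(std::move(args));
    std::cout << root.eval() << "\n";  // prints 6
}
```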
The `ql::val_t` type (defined in `rdb_protocol/val.hpp`) is the basic type of a ReQL value, such as a string, table, database, etc. Simple ReQL values such as strings, arrays, objects, dates, etc. are called "datums"; they are represented by the `ql::datum_t` type (defined in `rdb_protocol/datum.hpp`).
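As a rough illustration of the datum idea (greatly simplified; the real `ql::datum_t` also handles nested arrays and objects, binary data, and times), a datum can be modeled as a tagged union over a few simple kinds:

```cpp
#include <iostream>
#include <string>
#include <variant>
#include <vector>

// Toy stand-in for ql::datum_t: one value type covering several simple kinds.
using datum_t = std::variant<std::nullptr_t, bool, double, std::string>;

std::string describe(const datum_t &d) {
    switch (d.index()) {
        case 0: return "null";
        case 1: return std::get<bool>(d) ? "true" : "false";
        case 2: return "number " + std::to_string(std::get<double>(d));
        default: return "string \"" + std::get<std::string>(d) + "\"";
    }
}

int main() {
    std::vector<datum_t> datums = {nullptr, true, 3.14, std::string("hello")};
    for (const auto &d : datums) std::cout << describe(d) << "\n";
}
```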
Most queries will need to perform operations on one or more tables. An operation on a table is represented as a `read_t` or `write_t` object (defined in `rdb_protocol/protocol.hpp`); the result of the operation is represented as a `read_response_t` or `write_response_t` object. The ReQL evaluation code passes the `read_t` and `write_t` objects to the `table_query_client_t` (see `clustering/query_routing/table_query_client.hpp`). The `table_query_client_t` finds the relevant `primary_query_server_t` by looking for its business card in the directory, and sends the query to the mailbox listed on the business card. (If the query affects multiple keys, there may be multiple relevant `primary_query_server_t`s; the `read_t` or `write_t` will be broken into several sub-`read_t`s or sub-`write_t`s, which will each be sent to a different `primary_query_server_t`.) The queries pass from the `primary_query_server_t` via a series of types defined in `clustering/immediate_consistency/` until they eventually reach the `store_t`. If `read_mode` is `"outdated"`, read queries will reach the `store_t` through the `direct_query_server_t` instead of the `primary_query_server_t`.
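The following sketch shows the splitting-and-merging idea in isolation, with plain local maps standing in for the shards; the real `table_query_client_t` instead routes each sub-read over a mailbox to the appropriate `primary_query_server_t`, and the types below are invented for illustration.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct shard_t {
    std::string lower_key;                   // inclusive left bound of this shard's range
    std::map<std::string, std::string> kv;   // stand-in for the shard's store_t
};

// Route each requested key to the shard whose range contains it, issue one
// sub-read per shard, and merge the per-shard responses into one response.
std::map<std::string, std::string> read_keys(std::vector<shard_t> &shards,
                                             const std::vector<std::string> &keys) {
    std::map<std::size_t, std::vector<std::string>> sub_reads;
    for (const std::string &key : keys) {
        std::size_t i = 0;
        while (i + 1 < shards.size() && key >= shards[i + 1].lower_key) ++i;
        sub_reads[i].push_back(key);
    }
    std::map<std::string, std::string> response;
    for (const auto &[shard_index, shard_keys] : sub_reads) {
        for (const std::string &key : shard_keys) {
            auto it = shards[shard_index].kv.find(key);
            if (it != shards[shard_index].kv.end()) response[key] = it->second;
        }
    }
    return response;
}

int main() {
    std::vector<shard_t> shards = {
        {"", {{"apple", "1"}}},
        {"m", {{"mango", "2"}, {"pear", "3"}}},
    };
    for (const auto &[k, v] : read_keys(shards, {"apple", "pear"})) {
        std::cout << k << " -> " << v << "\n";
    }
}
```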
`store_t` (defined in `rdb_protocol/store.hpp`) is responsible for executing the `read_t` and `write_t` operations on the RethinkDB storage engine and collecting the results. Every RethinkDB table is represented as a B-tree. The blocks of the B-tree are stored in RethinkDB's custom page cache, `page_cache_t` (defined in `buffer_cache/page_cache.hpp`). The `btree/` directory contains functions for executing insertion, deletion, lookup, traversal, and other operations against the B-tree stored in the page cache. Modified pages are eventually flushed to disk via RethinkDB's log-structured serializer, `log_serializer_t` (defined in `serializer/log/log_serializer.hpp`).
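A very coarse sketch of that layering, with placeholder classes rather than the real `btree/`, `buffer_cache/`, and `serializer/` interfaces: a write updates in-memory state, marks the touched block dirty in the cache, and the dirty block later reaches the serializer on flush.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

struct serializer_t {           // placeholder for log_serializer_t
    void write_block(int block_id) {
        std::cout << "serializer: block " << block_id << " written to disk\n";
    }
};

struct page_cache_t {           // placeholder for the real page cache
    explicit page_cache_t(serializer_t *s) : serializer(s) {}
    void mark_dirty(int block_id) { dirty.insert(block_id); }
    void flush() {
        for (int id : dirty) serializer->write_block(id);
        dirty.clear();
    }
    serializer_t *serializer;
    std::set<int> dirty;
};

struct store_t {                // placeholder for rdb_protocol/store.hpp's store_t
    explicit store_t(page_cache_t *c) : cache(c) {}
    void write(const std::string &key, const std::string &value) {
        data[key] = value;      // the real store_t updates B-tree blocks here...
        cache->mark_dirty(0);   // ...and marks the touched blocks dirty (0 is a dummy id)
    }
    page_cache_t *cache;
    std::map<std::string, std::string> data;
};

int main() {
    serializer_t serializer;
    page_cache_t cache(&serializer);
    store_t store(&cache);
    store.write("key", "value");
    cache.flush();              // dirty blocks eventually reach the serializer
}
```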
When the `store_t` finishes executing the `read_t` or `write_t` against the B-tree, it returns a `read_response_t` or `write_response_t`, which is passed back along the same path the `read_t` or `write_t` took, until it gets back to the `ql::term_t` that initiated the request. When the `eval()` method on the root term finishes executing, the response is sent back over the TCP connection to the client driver that initiated the request.
Changefeeds are handled differently from normal reads. For a normal read, the server handling the query sends a `read_t` to the shards for each chunk of data it wants back; for a changefeed, the shards instead push data to the server handling the query. The server subscribes to a changefeed by sending the mailbox address of its `changefeed::client_t` to the shards inside a `read_t`. The shards then subscribe that client to their `changefeed::server_t` (which is a member variable of `store_t`). The relevant classes can be found in `rdb_protocol/changefeed.hpp`.
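A stripped-down model of that push-based flow (these classes only mirror the roles of `changefeed::client_t` and `changefeed::server_t`, not their real interfaces): the querying server registers a callback with the shard, and the shard pushes each change to every registered subscriber.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct changefeed_server_t {            // lives alongside the shard's store_t
    void subscribe(std::function<void(const std::string &)> subscriber) {
        subscribers.push_back(std::move(subscriber));
    }
    void on_write(const std::string &change) {
        for (auto &s : subscribers) s(change);   // push; don't wait to be asked
    }
    std::vector<std::function<void(const std::string &)>> subscribers;
};

int main() {
    changefeed_server_t shard_feed;

    // The querying server "sends its address" by registering a callback;
    // in RethinkDB this is a mailbox address carried inside a read_t.
    shard_feed.subscribe([](const std::string &change) {
        std::cout << "changefeed update: " << change << "\n";
    });

    // A later write to the shard is pushed to all subscribers.
    shard_feed.on_write("document 42 updated");
}
```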
The above description has mentioned several types involved in the process of executing queries against a table: the `store_t`, the `primary_query_server_t`, the various types defined in `clustering/immediate_consistency/`, and so on. The table's configuration (as exposed to the user in the `rethinkdb.table_config` system table) describes which servers should perform which role for which part of the table. When the user changes the table's configuration, objects on various servers must be created and destroyed to implement the new configuration, while making sure that the table's contents are correctly copied from server to server.
There is one `multi_table_manager_t` (defined in `clustering/table_manager/`) per server, created by `do_serve()`. It keeps a record for each table known to exist, whether or not that table has a replica on that server. If the server is supposed to be a replica for a given table, the `multi_table_manager_t` will construct a `table_manager_t` for that table.
RethinkDB uses the Raft consensus algorithm to manage table metadata, but not to handle user queries. RethinkDB uses a custom implementation of Raft that's designed to work with the RethinkDB event loop and networking primitives; it's implemented by the type `raft_member_t`, defined in `clustering/generic/raft_core.hpp`. There is one Raft cluster per RethinkDB table; each `table_manager_t` instantiates a `raft_member_t`. The state managed by the Raft instance is a `table_raft_state_t`, defined in `clustering/table_contract/contract_metadata.hpp`. Among other things, the `table_raft_state_t` contains the table's current configuration and a collection of `contract_t` objects.
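The exact layout lives in `clustering/table_contract/contract_metadata.hpp`; as a rough mental model only (the field names and types below are simplified guesses, not the real definitions), the Raft-managed state pairs the user-editable config with a map from region to contract:

```cpp
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct region_t {                    // a key range of the table
    std::string left, right;
    bool operator<(const region_t &o) const {
        return std::tie(left, right) < std::tie(o.left, o.right);
    }
};

struct contract_t {                  // what each server should be doing for a region
    std::vector<std::string> replicas;
    std::string primary;
};

struct table_config_t {              // what the user asked for (rethinkdb.table_config)
    std::vector<std::string> replicas;
    std::string primary;
};

struct table_raft_state_t {
    table_config_t config;                     // freely editable by the user
    std::map<region_t, contract_t> contracts;  // changed only in well-defined steps
};

int main() {
    table_raft_state_t state;
    state.config = {{"server_a", "server_b"}, "server_a"};
    state.contracts[{"", ""}] = {{"server_a", "server_b"}, "server_a"};
}
```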
The `contract_t`s describe what each server is supposed to be doing with respect to each region of the table. The difference between the contracts and the config is that the user can change the config freely, but the contracts can only change in specific, well-defined ways that ensure the table's data is preserved and remains consistent. The `table_manager_t` constructs a `contract_coordinator_t` (defined in `clustering/table_contract/coordinator/`) on whichever server is the leader of the Raft cluster; it's responsible for updating the contracts to match the config. The `table_manager_t` also constructs a `contract_executor_t` (defined in `clustering/table_contract/executor/`) on every server; the `contract_executor_t` is responsible for carrying out the instructions in the contracts by constructing the `store_t`, the `primary_query_server_t`, the various types in `clustering/immediate_consistency/`, and so on. The details of how this works are quite complicated; the comments in `clustering/table_contract/contract_metadata.hpp` are a good introduction.