# Overview

## Basic primitives

Each RethinkDB process has a thread pool (see arch/runtime/thread_pool.hpp) with a fixed number of threads. Each of these threads runs an event loop that processes IO notifications from the OS and messages from other threads. The arch/ directory contains primitives for timers, file and network IO, and so on, all designed to work within the RethinkDB thread pool. The main event-loop threads never run blocking IO operations; instead, a separate set of threads, the "blocker pool" (see arch/io/blocker_pool.hpp), is used for blocking operations.
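
The following is a minimal, self-contained sketch of the blocker-pool pattern; the class and method names are invented for illustration and are not the actual arch/io/blocker_pool.hpp interface. The idea is simply that the event loop hands a blocking job to a dedicated worker thread and receives a completion callback, so the loop itself never blocks (in the real blocker pool the completion is delivered back to the event-loop thread; here the callback just runs on the worker).

```cpp
// Toy blocker pool: a queue of blocking jobs drained by worker threads.
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class blocker_pool {
public:
    explicit blocker_pool(size_t n) {
        for (size_t i = 0; i < n; ++i) workers_.emplace_back([this] { run(); });
    }
    ~blocker_pool() {
        { std::lock_guard<std::mutex> l(mu_); stop_ = true; }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
    }
    // `job` may block (disk IO, DNS, ...); `done` is the completion callback.
    void do_blocking(std::function<void()> job, std::function<void()> done) {
        { std::lock_guard<std::mutex> l(mu_); jobs_.push([job, done] { job(); done(); }); }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> l(mu_);
                cv_.wait(l, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // runs on a worker thread, never on an event-loop thread
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_ = false;
};

int main() {
    blocker_pool pool(2);
    pool.do_blocking(
        [] { std::this_thread::sleep_for(std::chrono::milliseconds(50)); },  // "blocking IO"
        [] { std::cout << "blocking job finished\n"; });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```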

RethinkDB makes heavy use of cooperatively-scheduled coroutines; see arch/runtime/coroutines.hpp. The concurrency/ directory contains many concurrency primitives designed for use with RethinkDB's coroutines. Coroutines can switch between threads by scheduling themselves to be resumed on another thread and then suspending themselves on the current thread.
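
The thread-switching idiom can be illustrated with a small C++20 sketch. RethinkDB's coroutines are hand-rolled and predate C++20, so none of this is the real API; loop_thread and hop_to are invented stand-ins. Awaiting hop_to posts the coroutine's resumption onto the target thread's queue and then suspends, which is exactly the "schedule, then suspend" pattern described above.

```cpp
// Compile with -std=c++20. Awaiting hop_to{t} resumes the coroutine on t's thread.
#include <chrono>
#include <condition_variable>
#include <coroutine>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A tiny "event-loop thread": runs queued callbacks in order.
class loop_thread {
public:
    loop_thread() : worker_([this] { run(); }) {}
    ~loop_thread() {
        post([this] { stop_ = true; });
        worker_.join();
    }
    void post(std::function<void()> fn) {
        std::lock_guard<std::mutex> l(mu_);
        q_.push(std::move(fn));
        cv_.notify_one();
    }

private:
    void run() {
        while (!stop_) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> l(mu_);
                cv_.wait(l, [this] { return !q_.empty(); });
                fn = std::move(q_.front());
                q_.pop();
            }
            fn();
        }
    }
    std::queue<std::function<void()>> q_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread worker_;
};

// Awaiting this suspends the coroutine and schedules its resumption on `target`.
struct hop_to {
    loop_thread &target;
    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) const { target.post([h] { h.resume(); }); }
    void await_resume() const noexcept {}
};

struct task {
    struct promise_type {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };
};

task example(loop_thread &a, loop_thread &b) {
    std::cout << "start on thread " << std::this_thread::get_id() << "\n";
    co_await hop_to{a};  // now running on a's thread
    std::cout << "now on thread " << std::this_thread::get_id() << "\n";
    co_await hop_to{b};  // now running on b's thread
    std::cout << "now on thread " << std::this_thread::get_id() << "\n";
}

int main() {
    loop_thread a, b;
    example(a, b);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```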

Each RethinkDB server in a cluster maintains a TCP connection to every other server in the cluster. The messages sent across these TCP connections mostly fit into two basic paradigms: "mailboxes" and the "directory". A mailbox is an object that is created dynamically on a specific server. It has an address, which is serializable; if one server has the address of a mailbox on another server, it can send messages to that mailbox. This works well for point-to-point communication but not for service discovery. The directory consists of a set of values that each server broadcasts to every other server it's connected to. Typically a server will set up a mailbox to process a certain type of request and then send that mailbox's address over the directory so other servers can discover it. These mailbox addresses in the directory are often called "business cards". See rpc/README.md for more information about RethinkDB's networking primitives.
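
Here is a heavily simplified, single-process model of the mailbox and business-card ideas. The names are illustrative only; the real primitives in rpc/ serialize addresses and messages across TCP connections between servers. The point is the shape: a mailbox is registered under an address, the address is advertised through the directory, and any peer that discovers the address can send point-to-point messages.

```cpp
// Toy mailboxes + directory, all in one process for illustration.
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

struct mailbox_addr_t {   // in RethinkDB this is serializable and can be sent to other servers
    int server_id;
    uint64_t mailbox_id;
    bool operator<(const mailbox_addr_t &o) const {
        return std::tie(server_id, mailbox_id) < std::tie(o.server_id, o.mailbox_id);
    }
};

// Pretend cluster: address -> message handler. In reality a message to a remote
// address is serialized and sent over that server's TCP connection.
std::map<mailbox_addr_t, std::function<void(std::string)>> mailboxes;

mailbox_addr_t make_mailbox(int server, std::function<void(std::string)> cb) {
    static uint64_t next_id = 0;
    mailbox_addr_t addr{server, next_id++};
    mailboxes[addr] = std::move(cb);
    return addr;
}

void send(mailbox_addr_t addr, std::string msg) { mailboxes[addr](std::move(msg)); }

// "Directory": each server broadcasts business cards (here, a service name
// mapped to a mailbox address) so peers can discover its services.
std::map<std::string, mailbox_addr_t> directory;

int main() {
    // Server 1 sets up a mailbox and advertises its business card.
    mailbox_addr_t addr = make_mailbox(1, [](std::string m) {
        std::cout << "server 1 got: " << m << "\n";
    });
    directory["echo_service"] = addr;

    // Server 2 discovers the business card and sends a point-to-point message.
    send(directory.at("echo_service"), "hello from server 2");
}
```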

## Server start-up process

The main() function is defined in main.cc, but it quickly transfers control to clustering/administration/main/command_line.cc. This file sets up the thread pool and event loop, then starts an instance of do_serve() (defined in clustering/administration/main/serve.cc) in a coroutine inside the new thread pool. do_serve() is responsible for setting up and tearing down all of the important components of the RethinkDB server in the right order, and passing them pointers to each other as necessary.
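
The pattern is essentially scoped construction: build components in dependency order, pass pointers down to the components that need them, and rely on reverse-order destruction for teardown. A sketch of that pattern, with invented stand-in types rather than the actual objects wired up in serve.cc:

```cpp
// Scoped construction/destruction in dependency order, as do_serve() does.
#include <iostream>

struct connectivity_layer_t {
    connectivity_layer_t() { std::cout << "connectivity up\n"; }
    ~connectivity_layer_t() { std::cout << "connectivity down\n"; }
};

struct query_server_t {
    explicit query_server_t(connectivity_layer_t *conn) : conn_(conn) {
        std::cout << "query server up\n";
    }
    ~query_server_t() { std::cout << "query server down\n"; }
    connectivity_layer_t *conn_;  // later components hold pointers to earlier ones
};

void do_serve_sketch() {
    connectivity_layer_t connectivity;           // built first...
    query_server_t query_server(&connectivity);  // ...then the things that depend on it
    // Serve until shutdown is requested; destructors then run in reverse
    // order: query_server first, connectivity last.
}

int main() { do_serve_sketch(); }
```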

## Executing queries

One of the objects constructed by do_serve() is an rdb_query_server_t instance (see rdb_protocol/query_server.hpp), which listens for incoming TCP connections from client drivers. When a query arrives, ql::compile_term() (see rdb_protocol/term.hpp) converts the client's message into a tree of ql::term_ts. The rdb_protocol/terms/ directory contains many subclasses of ql::term_t for the different ReQL terms. Each subclass has an eval() method that does the actual work of executing the term.
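
A toy version of the term-tree idea is sketched below. The real ql::term_t hierarchy returns ql::val_t values and carries scopes, backtraces, and so on; these types are simplified stand-ins used only to show how evaluating the root term recursively evaluates its children.

```cpp
// Toy term tree: each ReQL term is a node with an eval() method.
#include <iostream>
#include <memory>
#include <vector>

struct term_t {
    virtual ~term_t() = default;
    virtual int eval() const = 0;   // real terms return ql::val_t, not int
};

struct datum_term_t : term_t {      // e.g. r.expr(3)
    int value;
    explicit datum_term_t(int v) : value(v) {}
    int eval() const override { return value; }
};

struct add_term_t : term_t {        // e.g. r.add(a, b, ...)
    std::vector<std::unique_ptr<term_t>> args;
    int eval() const override {
        int sum = 0;
        for (const auto &a : args) sum += a->eval();  // evaluate children first
        return sum;
    }
};

int main() {
    add_term_t root;                                   // r.add(1, 2)
    root.args.push_back(std::make_unique<datum_term_t>(1));
    root.args.push_back(std::make_unique<datum_term_t>(2));
    std::cout << root.eval() << "\n";                  // prints 3
}
```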

The ql::val_t type (defined in rdb_protocol/val.hpp) is the basic type of a ReQL value, such as a string, table, database, etc. Simple ReQL values such as strings, arrays, objects, dates, etc. are called "datums"; they are represented by the ql::datum_t type (defined in rdb_protocol/datum.hpp).
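
Roughly, a datum is a tagged union over the JSON-like ReQL types. The sketch below uses std::variant purely to show that shape; the real ql::datum_t has its own reference-counted representation and also covers arrays, objects, binary data, and times.

```cpp
// Toy datum: one value that can hold any of the basic ReQL types.
#include <cstddef>
#include <string>
#include <variant>

struct toy_datum_t {
    // Arrays and objects recursively contain datums in the real ql::datum_t;
    // they are omitted here to keep the sketch self-contained.
    std::variant<std::nullptr_t, bool, double, std::string> value;
};

int main() {
    toy_datum_t d;
    d.value = std::string("hello");   // a string datum
    d.value = 3.5;                    // now a number datum
    d.value = nullptr;                // ReQL null
}
```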

Most queries will need to perform operations on one or more tables. An operation on a table is represented as a read_t or write_t object (defined in rdb_protocol/protocol.hpp); the result of the operation is represented as a read_response_t or write_response_t object. The ReQL evaluation code passes the read_t and write_t objects to the table_query_client_t (see clustering/query_routing/table_query_client.hpp). The table_query_client_t finds the relevant primary_query_server_t by looking for its business card in the directory, and sends the query to the mailbox listed on the business card. (If the query affects multiple keys, there may be multiple relevant primary_query_server_ts; the read_t or write_t will be broken into several sub-read_ts or sub-write_ts, which will each be sent to a different primary_query_server_t.) The queries pass from the primary_query_server_t via a series of types defined in clustering/immediate_consistency/ until they eventually reach the store_t. If read_mode is "outdated", read queries will reach the store_t through the direct_query_server_t instead of the primary_query_server_t.
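
The sub-read splitting can be pictured as intersecting the read's key range with each shard's region; every non-empty intersection becomes a sub-read routed to that shard's primary. A simplified sketch follows, in which read_t is reduced to a bare key range and the shard-to-server mapping is hard-coded rather than looked up in the directory:

```cpp
// Splitting one key-range read into per-shard sub-reads.
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct read_t { std::string min_key, max_key; };   // half-open range [min, max)

struct shard_t {
    std::string min_key, max_key;                  // region this shard owns
    int primary_server;
};

std::vector<std::pair<shard_t, read_t>> split_read(const read_t &r,
                                                   const std::vector<shard_t> &shards) {
    std::vector<std::pair<shard_t, read_t>> out;
    for (const shard_t &s : shards) {
        std::string lo = std::max(r.min_key, s.min_key);
        std::string hi = std::min(r.max_key, s.max_key);
        if (lo < hi) out.push_back({s, read_t{lo, hi}});   // non-empty overlap
    }
    return out;
}

int main() {
    std::vector<shard_t> shards = {{"", "m", 1}, {"m", "\x7f", 2}};
    for (auto &[shard, sub] : split_read(read_t{"a", "z"}, shards)) {
        std::cout << "send [" << sub.min_key << ", " << sub.max_key
                  << ") to server " << shard.primary_server << "\n";
    }
}
```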

store_t (defined in rdb_protocol/store.hpp) is responsible for executing the read_t and write_t operations on the RethinkDB storage engine and collecting the results. Every RethinkDB table is represented as a B-tree. The blocks of the B-tree are stored in RethinkDB's custom page cache, page_cache_t (defined in buffer_cache/page_cache.hpp). The btree/ directory contains functions for executing insertion, deletion, lookup, traversal, etc. operations against the B-tree stored in the page cache. Modified pages are eventually flushed to disk via RethinkDB's log-structured serializer, log_serializer_t (defined in serializer/log/log_serializer.hpp).
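
A toy cache illustrating that division of labor is shown below. The real page_cache_t is transactional and asynchronous, and the log-structured serializer appends to a log rather than updating blocks in place; this sketch only captures the "B-tree code dirties blocks, a flush hands them to the serializer" relationship.

```cpp
// Toy page cache: blocks are modified in memory and flushed later.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using block_id_t = uint64_t;

struct page_t {
    std::vector<char> data;
    bool dirty = false;
};

class toy_page_cache_t {
public:
    page_t *acquire(block_id_t id) { return &pages_[id]; }     // load-or-create a block
    void mark_dirty(block_id_t id) { pages_[id].dirty = true; }
    void flush() {                                             // called periodically
        for (auto &[id, page] : pages_) {
            if (page.dirty) {
                std::cout << "serializer: write block " << id << "\n";
                page.dirty = false;
            }
        }
    }

private:
    std::map<block_id_t, page_t> pages_;
};

int main() {
    toy_page_cache_t cache;
    cache.acquire(7)->data.assign(4096, 0);  // B-tree code modifies a block...
    cache.mark_dirty(7);
    cache.flush();                           // ...and it later reaches disk
}
```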

When the store_t finishes executing the read_t or write_t against the B-tree, it returns a read_response_t or write_response_t, which is passed back along the same path the read_t or write_t arrived by until it reaches the ql::term_t that initiated the request. When the eval() method on the root term finishes executing, the response is sent back over the TCP connection to the client driver that issued the query.

Changefeeds are handled differently from normal reads. For a normal read, the server handling the query sends a new read_t to the shards for each chunk of data it gets back; for a changefeed, the shards instead push data to the server handling the query. The server handling the query subscribes to a changefeed by sending the mailbox address of its changefeed::client_t to the shards inside a read_t. The shards then subscribe that client to their changefeed::server_t (which is a member variable of store_t). The relevant classes can be found in rdb_protocol/changefeed.hpp.
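
In outline, the shard side of a changefeed is a subscriber list that gets notified on every write. A simplified sketch, in which callbacks stand in for mailbox addresses and none of the real buffering, backpressure, or disconnect handling is shown:

```cpp
// Toy changefeed: the shard pushes each change to every subscriber.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct change_t { std::string old_val, new_val; };

class toy_changefeed_server_t {                  // lives alongside the store
public:
    using subscriber_cb = std::function<void(const change_t &)>;
    void subscribe(subscriber_cb cb) { subs_.push_back(std::move(cb)); }
    void on_write(const change_t &c) {           // called when the shard applies a write
        for (auto &cb : subs_) cb(c);
    }

private:
    std::vector<subscriber_cb> subs_;
};

int main() {
    toy_changefeed_server_t server;
    // A client "sends its mailbox address"; here that is just a callback.
    server.subscribe([](const change_t &c) {
        std::cout << "change: " << c.old_val << " -> " << c.new_val << "\n";
    });
    server.on_write({"{'score': 1}", "{'score': 2}"});
}
```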

## Table configuration

The above description has mentioned several types that are involved in executing queries against a table: the store_t, the primary_query_server_t, the various types defined in clustering/immediate_consistency/, and so on. The table's configuration (exposed to the user in the rethinkdb.table_config system table) describes which servers should perform which role for which part of the table. When the user changes the table's configuration, objects on various servers must be created and destroyed to implement the new configuration, while making sure that the table's contents are correctly copied from server to server.

There is one multi_table_manager_t (defined in clustering/table_manager/) per server, created by do_serve(). It keeps a record for each table known to exist, whether or not that table has a replica on that server. If the server is supposed to be a replica for a given table, the multi_table_manager_t constructs a table_manager_t for that table.
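
A sketch of that bookkeeping is shown below; the names and fields are simplified stand-ins for the types in clustering/table_manager/.

```cpp
// Per-server table records; a manager is created only for local replicas.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct table_manager_t {
    explicit table_manager_t(const std::string &name) {
        std::cout << "managing replica of " << name << "\n";
    }
};

struct table_record_t {
    std::string name;
    bool replica_on_this_server = false;
    std::unique_ptr<table_manager_t> manager;   // only set if this server is a replica
};

int main() {
    std::map<std::string, table_record_t> tables;   // keyed by table id
    tables["uuid-1"] = {"users", true, nullptr};
    tables["uuid-2"] = {"logs", false, nullptr};

    for (auto &[id, rec] : tables) {
        if (rec.replica_on_this_server && !rec.manager) {
            rec.manager = std::make_unique<table_manager_t>(rec.name);
        }
    }
}
```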

RethinkDB uses the Raft consensus algorithm to manage table metadata, but not to manage user queries. RethinkDB uses a custom implementation of Raft that's designed to work with the RethinkDB event loop and networking primitives; it's implemented by the type raft_member_t, defined in clustering/generic/raft_core.hpp. There is one Raft cluster per RethinkDB table, and each table_manager_t instantiates a raft_member_t. The state managed by the Raft instance is a table_raft_state_t, defined in clustering/table_contract/contract_metadata.hpp. Among other things, the table_raft_state_t contains the table's current configuration and a collection of contract_t objects.
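
The rough shape of the replicated state is sketched below. The field names are illustrative; the real table_raft_state_t in contract_metadata.hpp also tracks branch history, server membership, and other details.

```cpp
// Outline of the state replicated through each table's Raft instance.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using server_id_t = std::string;
using region_t = std::pair<std::string, std::string>;    // key range [min, max)

struct table_config_t {          // what the user edits via rethinkdb.table_config
    std::string name;
    std::map<region_t, std::vector<server_id_t>> shards;  // replicas per shard
};

struct contract_t {              // what servers are currently obligated to do
    region_t region;
    server_id_t primary;
    std::vector<server_id_t> replicas;
};

struct table_raft_state_t {      // replicated via raft_member_t, one per table
    table_config_t config;
    std::map<uint64_t, contract_t> contracts;              // keyed by contract id
};

int main() {
    table_raft_state_t state;
    (void)state;
}
```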

The contract_ts describe what each server is supposed to be doing with respect to each region of the table. The difference between the contracts and the config is that the user can change the config freely, but the contracts can only change in specific well-defined ways that ensure that the data in the tables is preserved and remains consistent. The table_manager_t constructs a contract_coordinator_t (defined in clustering/table_contract/coordinator/) on whichever server is the leader of the Raft cluster; it's responsible for updating the contracts to match the config. The table_manager_t also constructs a contract_executor_t (defined in clustering/table_contract/executor/) on every server; the contract_executor_t is responsible for carrying out the instructions in the contracts by constructing the store_t, the primary_query_server_t, the various types in clustering/immediate_consistency/, etc. The details of how this works are quite complicated; the comments in clustering/table_contract/contract_metadata.hpp are a good introduction.