Two Phase Commit Design

This page explains how two-phase commit (XA) works in TokuDB. This page is organized as follows. First it reviews the BerkeleyDB API for two phase commit, (which TokuDB implements), then it explains how those functions are implemented.

This wiki supercedes the design captured in the original two phase commit design document authored by Bradley Kuszmaul.

Relevant bugs: #4298, implement XA in TokuFT and TokuDB.

TokuFT API

Berkeley DB defines three functions for XA: DB_TXN->prepare(), db_env->txn_recover(), and DB_TXN->discard().

`DB_TXN->prepare()`

int DB_TXN->prepare(/* modified: */ DB_TXN *tid, 
                    /* in:       */ u_int8_t gid[DB_GID_SIZE]);

Effect: Initiate two phase commit. Causes TokuFT to write enough information to disk to ensure that if the system crashes, the transaction can be committed or aborted during recovery. The gid is created by the caller (e.g., by mysqld). The DB_GID_SIZE is 128, making it big enough to allow the caller to ensure that no duplicate gid is ever used. For example, if the gid were constructed using the gettimeofday() that the server started, catted with a random number (chosen at server boot time by reading /dev/urandom, catted with a 64-bit counter, it would likely be unique enough. Also, for MySQL as long as the gid is unique during one process's life, it's good enough. The uniqueness issues really show up only when the master and the slave are different processes that may not all crash at the same time.

Parameters:

tid: The transaction.
gid: a global transaction ID which the handler must construct. The transaction ID probably should not be reused. The gid is provided by the caller.

Return values:

0: The transaction can commit or abort.
DB_LOCK_DEADLOCK or DB_LOCK_NOTGRANTED or other nonzero values denoting a failure: The transaction cannot prepare, so the application must call DB_TXN->abort() on the transaction, and proceed.

`DB_ENV->txn_recover()`

int DB_ENV->txn_recover(DB_ENV *dbenv,
                        /* out (filled in by the function) */ DB_PREPLIST preplist[/*count*/],
                        long count, 
                        /*out*/ long *retp,
                        u_int32_t flags);

Effect: Return a list of prepared but not completed transactions. Each transaction must be committed or aborted by the caller. The values are returned a few at a time.

Parameters:

dbenv: A (recovered) environment.
preplist: a pointer to an array of DB_PREPLIST structs. The preplist is memory allocated by the caller and filled in by the function.
count: the size of preplist. The maximum number of transactions to return.
retp: an output value containing the number of transactions actually being returned.
flags: must contain one of these two values:
DB_FIRST: Begin returning a list of prepared but not completed transactions. Specifying this flag begins a new pass over all such transactions regardless of whether they have been returned in previous calls to DB_ENV->txn_recover(). Don't call this function from different threads concurrently.
DB_NEXT: Return more prepared but uncompleted transactions;.

The DB_PREPLIST is a structure that has at least two fields. It might be defined by db.h as

typedef struct __db_preplist {
        DB_TXN *txn;
        u_int8_t gid[DB_GID_SIZE];
} DB_PREPLIST;

The struct may be slightly different, so it should be gotten from db.h rather than redefined by the application.

Note: For BDB, txn_recover() is specified to acquire all the locks held by the original transaction. In TokuDB we don't bother acquiring those locks. After TokuDB recovery, you must resolve all of the prepared but unresolved transactions before performing any new transactions. (In BDB, you may be allowed to run new transactions before resolving the old ones.)

`DB_TXN->discard(DB_TXN *txn, u_int32_t flags)`

Effect: Discard the internal representation of a prepared transaction without committing or aborting it. The DB_TXN handle is no longer valid after this function is called.

Purpose: When the MySQL transaction coordinator is running recovery, it can commit a prepared transaction, abort a prepared transaction, or discard a prepared transaction. The transaction coordinator commits a prepared transaction when the transaction is in the binlog. The transaction coordinator aborts a prepared transaction when the transaction is NOT in the binlog. Finally, the transaction coordinator discards all prepared transactions when it can not find the binlog.

Typical Usage

MySQL (a typical master) does the following

Begin transaction.
Perform TokuDB operations and log statements to the binlog. (The binlog may or may not have been written to disk.)
Create a gid. This is done above the handler.
Tell TokuDB to prepare DB_TXN->prepare(). TokuDB writes an xprepare log entry. At this point, TokuDB has no control over the destiny of the transaction. If MySQL says abort, TokuDB must abort. If MySQL says commit, TokuDB commits. If MySQL says jump, TokuDB says "how high". Even if a crash happens betwen the return from prepare() and the subsequent commit or abort, TokuDB must be able follow MySQL's command after recovery.
Write a commit record (containing the gid) to the binlog, and fsync the binlog.
Tell TokuDB to commit DB_TXN->commit().

During Recovery

TokuDB must recover, and construct a list of all the prepared transactions. Those prepared transactions must be in the same state they were before the crash. All the undo log information must have been constructed.
MySQL then uses DB_ENV->txn_recover() to get a list of those transactions.
MySQL then looks in the binlog to determine which of those transactions must commit and issues DB_TXN->commit() operations on those transactions, and issues DB_TXN->abort() operations on the rest. The commit or abort operations are not allowed to fail.
Repeat, calling DB_ENV->txn_recover() until *retp<count.

TokuFT Implementation

We add to the DB_ENV internal structure:

A (doubly-linked) list of prepared transactions. This list is doubly linked so that the transactions can be added and removed in constant time.

We add new logentries to the transaction log:

xprepare: a log entry that is like commit except that it prepares a transaction. This log entry is typically fsync'd. Then xcommit doesn't need to be fsync'd as long as the application remembers which transactions were committed for a little while. For mysqld this means that the binlog should not be rotated until some subsequent fsync occurs on the tokudb logs.
xstillopenprepared: a log entry used a checkpoint. xstillopenprepared}} is a little bit like an xstillopenfollowed by anxprepare. It denotes a transaction that was open and prepared at the checkpoint. The xstillopenprepared``` does not need a parent xid, but it does need a GID.

When running recovery, if the system encounters a PREPARE, it should add the transaction to the list of prepared transactions.

When running recovery, if the system encounters a COMMIT on a prepared transaction, it should remove it from the list of prepared transactions.

When preparing a transaction, we log a PREPARE entry and perform an fsync(). If there is any reason that the transaction could fail, the system must speak now or forever hold its peace. (For example if we ever use some sort of optimistic locking protocol, we must actually acquire the locks now. For now, since we acquire locks pessimistically, there are no reasons that the transaction could fail.) When logging a PREPARE entry we must perform an fsync().

When logging a COMMIT entry that is previously prepared, we do not need to perform another fsync(). (Normally a COMMIT is fsync()'d.) That's because if the COMMIT entry does not make it to disk, and then we run recovery, then during recovery the transaction will be included in the list returned by txn_recover() and MySQL will look in the bin log to determine that the transaction actually is committed.

There is an obscure race condition built in to the BDB API for XA if we don't fsync on commit: There is no upper bound to how long MySQL must remember the BINLOG. In practice, as soon as any transaction does a prepare then all previous commits will be durable (since the prepare flushes the log). But this could happen:

Prepare
fsync the binlong
Commit (no fsync)
Do a lot of transactions on InnoDB, causing the binlog to rotate.
Crash.
Recover: tokudb never did an fsync on the commit, and so it includes the transaction in the txn_recover list. But MySQL doesn't remember what happened to that transaction. So the system fails.

Possible fixes for this obscure race:

Don't do that. fsync on commit as well as prepare. This is a serious performance penalty
Do the write() during commit. Not as big a performance penalty, but still pretty bad in some cases (e.g., if we start using direct I/O on the log, then a write is as bad as an fsync.)
Don't worry about it. Since TokuDB does checkpoints every 60 seconds or so, the log will be flushed then. It's still slightly racy, but should be fine. This is OK for mysqld as long as no one disables the checkpoints, and the logs rotate less than once per checkpoint interval.
Add a call to DB_ENV->txn_checkpoint() before rotating the log. This will flush the log. This is the prescribed approach that should always work.
MySQL calls handler::flush_logs when rotating the binlog, and tokudb::flush_logs fsync's the tokudb recovery log.

Note that when performing a commit or abort, we need a bunch of DBs open. If a DB has been closed, but there is a pending transaction that needs the DB to perform its commit or abort, then the DB is a ''zombie''. The prepare cannot close those zombie DBs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two Phase Commit Design

Two Phase Commit Design

TokuFT API

`DB_TXN->prepare()`

`DB_ENV->txn_recover()`

`DB_TXN->discard(DB_TXN *txn, u_int32_t flags)`

Typical Usage

During Recovery

TokuFT Implementation

Clone this wiki locally