-
Notifications
You must be signed in to change notification settings - Fork 132
Two Phase Commit Design
This page explains how two-phase commit (XA) works in TokuDB. This page is organized as follows. First it reviews the BerkeleyDB API for two phase commit, (which TokuDB implements), then it explains how those functions are implemented.
This wiki supercedes the design captured in the original two phase commit design document authored by Bradley Kuszmaul.
Relevant bugs: #4298, implement XA in TokuFT and TokuDB.
Berkeley DB defines three functions for XA: DB_TXN->prepare()
, db_env->txn_recover()
, and DB_TXN->discard()
.
int DB_TXN->prepare(/* modified: */ DB_TXN *tid,
/* in: */ u_int8_t gid[DB_GID_SIZE]);
Effect: Initiate two phase commit. Causes TokuFT to write enough information to disk to ensure that if the system crashes, the transaction can be committed or aborted during recovery. The gid
is created by the caller (e.g., by mysqld). The DB_GID_SIZE is 128, making it big enough to allow the caller to ensure that no duplicate gid is ever used. For example, if the gid
were constructed using the gettimeofday()
that the server started, catted with a random number (chosen at server boot time by reading /dev/urandom
, catted with a 64-bit counter, it would likely be unique enough. Also, for MySQL as long as the gid is unique during one process's life, it's good enough. The uniqueness issues really show up only when the master and the slave are different processes that may not all crash at the same time.
Parameters:
-
tid
: The transaction. -
gid
: a global transaction ID which the handler must construct. The transaction ID probably should not be reused. Thegid
is provided by the caller.
Return values:
-
0
: The transaction can commit or abort. -
DB_LOCK_DEADLOCK
orDB_LOCK_NOTGRANTED
or other nonzero values denoting a failure: The transaction cannot prepare, so the application must callDB_TXN->abort()
on the transaction, and proceed.
int DB_ENV->txn_recover(DB_ENV *dbenv,
/* out (filled in by the function) */ DB_PREPLIST preplist[/*count*/],
long count,
/*out*/ long *retp,
u_int32_t flags);
Effect: Return a list of prepared but not completed transactions. Each transaction must be committed or aborted by the caller. The values are returned a few at a time.
Parameters:
-
dbenv
: A (recovered) environment. -
preplist
: a pointer to an array of DB_PREPLIST structs. The preplist is memory allocated by the caller and filled in by the function. -
count
: the size of preplist. The maximum number of transactions to return. -
retp
: an output value containing the number of transactions actually being returned. -
flags
: must contain one of these two values: -
DB_FIRST
: Begin returning a list of prepared but not completed transactions. Specifying this flag begins a new pass over all such transactions regardless of whether they have been returned in previous calls toDB_ENV->txn_recover()
. Don't call this function from different threads concurrently. -
DB_NEXT
: Return more prepared but uncompleted transactions;.
The DB_PREPLIST is a structure that has at least two fields. It might be defined by db.h
as
typedef struct __db_preplist {
DB_TXN *txn;
u_int8_t gid[DB_GID_SIZE];
} DB_PREPLIST;
The struct may be slightly different, so it should be gotten from db.h
rather than redefined by the application.
Note: For BDB, txn_recover()
is specified to acquire all the locks held by the original transaction. In TokuDB we don't bother acquiring those locks. After TokuDB recovery, you must resolve all of the prepared but unresolved transactions before performing any new transactions. (In BDB, you may be allowed to run new transactions before resolving the old ones.)
Effect: Discard the internal representation of a prepared transaction without committing or aborting it. The DB_TXN handle is no longer valid after this function is called.
Purpose: When the MySQL transaction coordinator is running recovery, it can commit a prepared transaction, abort a prepared transaction, or discard a prepared transaction. The transaction coordinator commits a prepared transaction when the transaction is in the binlog. The transaction coordinator aborts a prepared transaction when the transaction is NOT in the binlog. Finally, the transaction coordinator discards all prepared transactions when it can not find the binlog.
MySQL (a typical master) does the following
- Begin transaction.
- Perform TokuDB operations and log statements to the binlog. (The binlog may or may not have been written to disk.)
- Create a gid. This is done above the handler.
- Tell TokuDB to prepare
DB_TXN->prepare()
. TokuDB writes anxprepare
log entry. At this point, TokuDB has no control over the destiny of the transaction. If MySQL says abort, TokuDB must abort. If MySQL says commit, TokuDB commits. If MySQL says jump, TokuDB says "how high". Even if a crash happens betwen the return fromprepare()
and the subsequent commit or abort, TokuDB must be able follow MySQL's command after recovery. - Write a commit record (containing the gid) to the binlog, and fsync the binlog.
- Tell TokuDB to commit
DB_TXN->commit()
.
- TokuDB must recover, and construct a list of all the prepared transactions. Those prepared transactions must be in the same state they were before the crash. All the undo log information must have been constructed.
- MySQL then uses
DB_ENV->txn_recover()
to get a list of those transactions. - MySQL then looks in the binlog to determine which of those transactions must commit and issues
DB_TXN->commit()
operations on those transactions, and issuesDB_TXN->abort()
operations on the rest. The commit or abort operations are not allowed to fail. - Repeat, calling
DB_ENV->txn_recover()
until*retp<count
.
We add to the DB_ENV
internal structure:
- A (doubly-linked) list of prepared transactions. This list is doubly linked so that the transactions can be added and removed in constant time.
We add new logentries to the transaction log:
-
xprepare
: a log entry that is like commit except that it prepares a transaction. This log entry is typically fsync'd. Thenxcommit
doesn't need to be fsync'd as long as the application remembers which transactions were committed for a little while. For mysqld this means that the binlog should not be rotated until some subsequent fsync occurs on the tokudb logs. -
xstillopenprepared
: a log entry used a checkpoint.xstillopenprepared}} is a little bit like an
xstillopenfollowed by an
xprepare. It denotes a transaction that was open and prepared at the checkpoint. The
xstillopenprepared``` does not need a parent xid, but it does need a GID.
When running recovery, if the system encounters a PREPARE
, it should add the transaction to the list of prepared transactions.
When running recovery, if the system encounters a COMMIT
on a prepared transaction, it should remove it from the list of prepared transactions.
When preparing a transaction, we log a PREPARE
entry and perform an fsync()
. If there is any reason that the transaction could fail, the system must speak now or forever hold its peace. (For example if we ever use some sort of optimistic locking protocol, we must actually acquire the locks now. For now, since we acquire locks pessimistically, there are no reasons that the transaction could fail.) When logging a PREPARE
entry we must perform an fsync()
.
When logging a COMMIT
entry that is previously prepared, we do not need to perform another fsync()
. (Normally a COMMIT
is fsync()
'd.) That's because if the COMMIT
entry does not make it to disk, and then we run recovery, then during recovery the transaction will be included in the list returned by txn_recover()
and MySQL will look in the bin log to determine that the transaction actually is committed.
There is an obscure race condition built in to the BDB API for XA if we don't fsync on commit: There is no upper bound to how long MySQL must remember the BINLOG. In practice, as soon as any transaction does a prepare then all previous commits will be durable (since the prepare flushes the log). But this could happen:
- Prepare
- fsync the binlong
- Commit (no fsync)
- Do a lot of transactions on InnoDB, causing the binlog to rotate.
- Crash.
- Recover: tokudb never did an fsync on the commit, and so it includes the transaction in the
txn_recover
list. But MySQL doesn't remember what happened to that transaction. So the system fails.
Possible fixes for this obscure race:
- Don't do that. fsync on commit as well as prepare. This is a serious performance penalty
- Do the
write()
duringcommit
. Not as big a performance penalty, but still pretty bad in some cases (e.g., if we start using direct I/O on the log, then a write is as bad as an fsync.) - Don't worry about it. Since TokuDB does checkpoints every 60 seconds or so, the log will be flushed then. It's still slightly racy, but should be fine. This is OK for mysqld as long as no one disables the checkpoints, and the logs rotate less than once per checkpoint interval.
- Add a call to
DB_ENV->txn_checkpoint()
before rotating the log. This will flush the log. This is the prescribed approach that should always work. - MySQL calls
handler::flush_logs
when rotating the binlog, andtokudb::flush_logs
fsync's the tokudb recovery log.
Note that when performing a commit or abort, we need a bunch of DB
s open. If a DB
has been closed, but there is a pending transaction that needs the DB
to perform its commit or abort, then the DB
is a ''zombie''. The prepare cannot close those zombie DB
s.