- Feature Name: Transactional Analytic Engine
- Status: In Progress
- Start Date: 2022-02-21
- Authors: Xu Peng
- Implementation PR:
- Issue for this RFC:
TAE (Transactional Analytic Engine) is designed for hybrid transactional/analytical (HTAP) query workloads. It can be used as the underlying storage engine of a database management system (DBMS) for online analytical processing of queries.
- Block: Piece of a segment, the minimum unit of table data. The maximum number of rows of a block is fixed
- Segment: Piece of a table which is composed of blocks
- Table: Piece of a database which is composed of segments
- Database: A combination of tables, which shares the same log space
TODO
- Transient Block: A block whose row count has not reached the upper limit, or a block queued to be sorted and flushed
- Sorted Block: A block sorted by primary key
- Unsorted Segment: A segment that has not been merge-sorted
- Sorted Segment: A segment that has been merge-sorted
TODO
- Vector: Data fragment of a column in memory
- Batch: A combination of vectors, and the number of rows in each vector is aligned
TODO
TAE stores data represented as tables. Each table is bound to a schema consisting of a number of column definitions. Table data is organized as a log-structured merge-tree (LSM tree).
Currently, TAE is a three-level LSM tree, called L0, L1 and L2. L0 is small and can be entirely resident in memory, whereas L1 and L2 are both definitely resident on disk. In TAE, L0 consists of transient blocks and L1 consists of sorted blocks. Incoming new data is always inserted into the latest transient block. If the insertion causes the block to exceed the maximum row count of a block, the block will be sorted by primary key and flushed into L1 as a sorted block. If the number of sorted blocks exceeds the maximum block count of a segment, the segment will be sorted by primary key using merge sort.
L1 and L2 are organized into sorted runs of data. Each run contains data sorted by the primary key, which can be represented on disk as a single file. There will be overlapping primary key ranges between sorted runs. The difference between L1 and L2 is that a run in L1 is a block while a run in L2 is a segment.
A segment can be compacted into a new segment if it has many updates (deletions). Segments can be merged into a new segment. The scheduling behind this has some customizable strategies, mainly trading off write amplification against read amplification.
As described above, transient blocks can be entirely resident in memory, but not necessarily so. Because there will be many tables, and each table has transient blocks, keeping them always resident in memory would be hugely wasteful. In TAE, transient blocks from all tables share a dedicated fixed-size LRU cache. An evicted transient block is unloaded from memory and flushed as a transient block file. In practice, transient blocks are constantly flowing to L1 and the number of transient blocks per table at any given time is very small, so those active transient blocks will likely reside in memory even with a small cache.
There's no table-level index in TAE; only segment- and block-level indexes are available.
In TAE, there is a dedicated fixed-size LRU cache for all indexes. Compared with the original data, an index occupies limited space, but it accelerates queries significantly and is accessed very frequently. A dedicated cache avoids a memory copy on each access.
TAE creates an index for each table's primary key by default. Its main functions are deduplication when inserting data and filtering by primary key. Deduplication is the critical path of data insertion. We need to make trade-offs among the following three aspects:
- Query performance
- Memory usage
- Match with the underlying data store layout
From the granularity of the index, we divide indexes into two categories: one is a table-level index, and the other is an index set composed of a series of partition indexes. For example, we can have a table-level B-tree index, or each segment can have its own B-tree index. The table data of TAE consists of multiple segments, and each segment is first unordered and later ordered; compaction, merging, or splitting may take place afterwards. This scenario is very unfriendly to a table-level index, so the index of TAE is a segment-level index set.
There are two types of segments: appendable and non-appendable. For a non-appendable segment, the segment-level index is a two-level structure: bloomfilter and zonemap. There are two options for the bloomfilter: a segment-based bloomfilter and a block-based bloomfilter. The segment-based bloomfilter is a better choice when the index can be fully resident in memory. An appendable segment consists of at least one appendable block plus multiple non-appendable blocks. An appendable block index is a memory-resident ART-tree plus a zonemap, while a non-appendable one is a bloomfilter plus a zonemap.
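The two index layers can be sketched in Go as follows; the types here are illustrative stand-ins for the actual TAE index structures.

```go
package index

import "bytes"

// Bloomfilter abstracts whichever bloomfilter implementation is used.
type Bloomfilter interface {
	MayContain(key []byte) bool
}

// ZoneMap records the min/max primary key of a block or segment.
type ZoneMap struct{ Min, Max []byte }

func (z ZoneMap) MayContain(key []byte) bool {
	return bytes.Compare(key, z.Min) >= 0 && bytes.Compare(key, z.Max) <= 0
}

// BlockIndex is the two-level index of a non-appendable block: the
// zonemap prunes by range first, then the bloomfilter answers point
// membership.
type BlockIndex struct {
	ZM ZoneMap
	BF Bloomfilter
}

func (idx BlockIndex) MayContain(key []byte) bool {
	if !idx.ZM.MayContain(key) {
		return false
	}
	return idx.BF.MayContain(key)
}
```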
The query flow chart of the segment index is as follows:
TODO
TODO
TAE is a column-oriented data store, very friendly to data compression. It supports per-column compression codecs; currently only LZ4 is used. You can easily obtain the meta information of compressed blocks. In TAE, the compression unit is a column of a block.
+-------------+--------------+-------------+----------------+------------------+------------------+
| Magic (8B) | Version (2B) | Format (2B) | Reserved (32B) | MinPartSize (2B) | MaxPartSize (2B) |
+-------------+--------------+-------------+----------------+------------------+------------------+
Magic = Engine identity (0x01346616). TAE only
Version = File version
Format = Layout format
Reserved = 32 bytes reserved space
MinPartSize = Specify the size of the smallest part unit. 1 for 1K bytes. If 4 is specified, the smallest part size is 4K bytes
MaxPartSize = Specify the size of the biggest part. 1 for 1K bytes
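Read as a Go struct, the header above might look like the following sketch; field order matches the diagram, while the endianness and the `ReadHeader` helper are assumptions for illustration.

```go
package layout

import (
	"encoding/binary"
	"fmt"
	"io"
)

// Header mirrors the 48-byte file header diagrammed above.
type Header struct {
	Magic       uint64   // engine identity, 0x01346616, TAE only
	Version     uint16   // file version
	Format      uint16   // layout format
	Reserved    [32]byte // reserved space
	MinPartSize uint16   // smallest part unit, 1 => 1K bytes
	MaxPartSize uint16   // biggest part, 1 => 1K bytes
}

// ReadHeader decodes the header and validates the magic number.
func ReadHeader(r io.Reader) (*Header, error) {
	var h Header
	if err := binary.Read(r, binary.BigEndian, &h); err != nil {
		return nil, err
	}
	if h.Magic != 0x01346616 {
		return nil, fmt.Errorf("not a TAE file: magic %#x", h.Magic)
	}
	return &h, nil
}
```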
Meta-1, Meta-2 = One is stale while the other is active
+----------+----------+-
| <Meta-1> | <Meta-2> |
+----------+----------+-
|
|
+---------------------------------------------------+
| Meta |
+---------------+-----------------+-----------------+
| Version (2B) | PartOffset (4B) | Reserved (128B) |
+---------------+-----------------+-----------------+
Version = Meta version
PartOffset = The first part position
Reserved = 128 bytes reserved space
+------------+---------------------+---- ... ----+
| Size (2B) | NextPartOffset (4B) | Payload |
+------------+---------------------+---- ... ----+
Size = Part size. 1 for 1K bytes
NextPartOffset = Next part pointer
Payload = The size of payload is ($Size * 1K - 2B - 4B)
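In Go terms, a part and its usable payload could be sketched as below; the `Part` type is illustrative, not the actual on-disk codec.

```go
package layout

// Part mirrors the part header above: the size in 1K units and the
// offset of the next part in the chain.
type Part struct {
	Size           uint16
	NextPartOffset uint32
}

// payloadBytes is $Size * 1K minus the 2B Size and 4B NextPartOffset fields.
func payloadBytes(p Part) int {
	return int(p.Size)*1024 - 2 - 4
}
```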
Format | Value | Description |
---|---|---|
SegmentFile | 0x10 | Segment file format |
+-------------------------------------------------------------------------------+
| Segment File Meta Part Payload |
+-----------+---------------+--------------+---------------+--------------------+
| Cols (2B) | BlockCnt (2B) | <ColumnMeta> | <IndexesMeta> | <BlockIndexesMeta> |
+-----------+---------------+--------------+---------------+--------------------+
| | |
| | |
| | +-------------------------------------+
| | | BlockIndexesMeta |
| | +------------------+-------...--------+
| | | <IndexesMeta> | ... |
| | +------------------+-------...--------+
| | |
| | |
| +------------------------------------------------------+
| | IndexesMeta |
| +-------------------+------------+-------------+--...--+
| | CompressAlgo (1B) | Count (2B) | <IndexMeta> | ... |
| +-------------------+------------+-------------+--...--+
| |
| |
+----------------------------------+  +-------------------------------------------------------------------------------------------+
|            ColumnMeta            |  |                                         IndexMeta                                          |
+-------------------+--------------+  +----------------+----------------+-----------------+------------+-----------+--------------+
| CompressAlgo (1B) | <BlocksMeta> |  | IndexType (1B) | ColumnIdx (2B) | PartOffset (4B) | Start (4B) | Size (4B) | RawSize (4B) |
+-------------------+--------------+  +----------------+----------------+-----------------+------------+-----------+--------------+
|
|
+----------------------------------------------------+
| BlocksMeta |
+--------------+-------------+---...---+-------------+
| <BlockMeta> | <BlockMeta> | | <BlockMeta> |
+--------------+-------------+---...---+-------------+
|
|
+----------------------------------------------------------+
| BlockMeta |
+------------------+------------+-----------+--------------+
| PartOffset (4B) | Start (4B) | Size (4B) | RawSize (4B) |
+------------------+------------+-----------+--------------+
Suppose the `MinPartSize` is 4, which stands for 4K. It starts with an empty segment (no parts). Now we do the following in sequence:
- Append a block (Block0) of 3 columns, and the compressed sizes of the three columns are 3.4K, 7.1K, and 2.5K.
  1) Calculate an optimal size for each column: 4K, 8K and 4K. 2) Find unused candidate parts; not found here. 3) Allocate new parts for these 3 columns. 4) Flush these 3 columns to the specified parts. 5) Flush meta.
- Append a block (Block1) of 3 columns, and the compressed sizes of the three columns are 11.3K, 7.8K, and 3.5K.
  1) Calculate an optimal size for each column: 12K, 8K and 4K. 2) Find unused candidate parts; not found here. 3) Allocate new parts for these 3 columns. 4) Flush these 3 columns to the specified new parts. 5) Flush meta.
- There are some updates to the 3rd column of Block0, which is marked as `Block0-2`. Now we start to checkpoint the updates into the segment file. The compressed size of the updated `Block0-2` is 7.3K.
  1) Calculate an optimal size: 8K. 2) Find unused candidate parts; not found here. 3) Allocate a new part. 4) Flush to the specified new part. 5) Flush meta.
- There are some updates to the first column of Block1, which is marked as `Block1-0`. Now we start to checkpoint the updates into the segment file. The compressed size of the updated `Block1-0` is 3.6K.
  1) Calculate an optimal size: 4K. 2) Find unused candidate parts; unused Part(3, 1) found. 3) Flush to the specified unused part. 4) Flush meta.
- Append a block (Block2) of 3 columns, and the compressed sizes of the three columns are 2.8K, 3.1K, and 3.7K.
  1) Calculate an optimal size for each column: 4K, 4K and 4K. 2) Find unused candidate parts; unused Part(4, 1), Part(5, 1), Part(6, 1) found. 3) Flush these 3 columns to the specified unused parts. 4) Flush meta.
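The "optimal size" step above is just rounding the compressed size up to the next multiple of `MinPartSize`; a sketch (the function name is hypothetical):

```go
package layout

import "math"

// optimalPartSize rounds a compressed column size (in KB) up to the next
// multiple of minPartSize: 3.4K -> 4K, 7.1K -> 8K, 11.3K -> 12K.
func optimalPartSize(compressedKB float64, minPartSize int) int {
	units := math.Ceil(compressedKB / float64(minPartSize))
	return int(units) * minPartSize
}
```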
Buffer manager is responsible for the allocation of buffer space. It handles all requests for data pages and temporary blocks of TAE.
- Each page is bound to a buffer node with a unique node ID
- A buffer node has two states:
- Loaded
- Unloaded
- When a requestor Pins a node:
  - If the node is in Loaded state, it will increase the node reference count by 1 and wrap a node handle with the page address in memory
  - If the node is in Unloaded state, it will read the page from disk|remote first, increase the node reference count by 1 and wrap a node handle with the page address in memory. When there is no room left in the buffer, some victim nodes will be unloaded to make room. The current replacement strategy is LRU
- When a requestor Unpins a node, it just calls Close on the node handle. This decreases the node reference count by 1. If the reference count drops to 0, the node becomes a candidate for eviction. A node with a reference count greater than 0 is never evicted.
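The following Go sketch condenses the Pin/Unpin protocol above; the `Manager`, `Node`, and `Handle` types and the stubbed LRU bookkeeping are illustrative assumptions, not the actual TAE buffer manager.

```go
package buffer

import "sync"

type Node struct {
	loaded   bool
	refCount int
	page     []byte
}

type Handle struct {
	mgr *Manager
	n   *Node
}

type Manager struct {
	mu sync.Mutex
	// LRU list and capacity accounting omitted for brevity.
}

func (m *Manager) makeRoom()           {}                  // unload LRU victims with refCount == 0
func (m *Manager) load(n *Node) []byte { return []byte{} } // read the page from disk|remote

// Pin loads the node if necessary and increases its reference count by 1.
func (m *Manager) Pin(n *Node) *Handle {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !n.loaded {
		m.makeRoom() // evict victims when there is no room left
		n.page = m.load(n)
		n.loaded = true
	}
	n.refCount++
	return &Handle{mgr: m, n: n}
}

// Close unpins the node; at refCount 0 it becomes an eviction candidate.
func (h *Handle) Close() {
	h.mgr.mu.Lock()
	defer h.mgr.mu.Unlock()
	h.n.refCount--
}
```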
There are currently four buffer managers for different purposes in TAE
- Mutation buffer manager: A dedicated fixed-size buffer used by L0 transient blocks. Each block corresponds to a node in the buffer
- SST buffer manager: A dedicated fixed-size buffer used by L1 and L2 blocks. Each column within a block corresponds to a node in the buffer
- Index buffer manager: A dedicated fixed-size buffer used by indexes. Each block or a segment index corresponds to a node in the buffer
- Redo log buffer manager: A dedicated fixed-size buffer used by uncommitted transactions. Each transaction local storage consists of at least one buffer node.
An embedded log-structured data store. It is used as the underlying driver of Catalog and WAL.
TODO
Write-ahead logging (WAL) is the key to providing atomicity and durability. All modifications must be written to a log before being applied. In TAE, the REDO log does not need to record every write operation, but it must record the commit of each transaction. The redo log buffer manager reduces IO usage and avoids any IO events for short transactions that may need to be rolled back due to various conflicts. Long or large transactions are also supported.
Item | Size(Byte) | Scope | Description |
---|---|---|---|
`GroupId` | 4 | All | Specify the group id |
`LSN` | 8 | All | Specify the log sequence number |
`Length` | 4 | All | Specify the length of the entry |
`Type` | 1 | All | Specify the entry type |
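As a Go struct (sizes follow the table; an illustrative sketch, not the actual codec):

```go
package wal

// EntryHeader mirrors the log entry header table above.
type EntryHeader struct {
	GroupId uint32 // 4B, group id
	LSN     uint64 // 8B, log sequence number
	Length  uint32 // 4B, length of the entry
	Type    int8   // 1B, entry type: AC, PC, UC, RB or CKP
}
```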
Type | Datatype | Value | Description |
---|---|---|---|
`AC` | int8 | 0x10 | A committed transaction of complete write operations |
`PC` | int8 | 0x11 | A committed transaction of partial write operations |
`UC` | int8 | 0x12 | Partial write operations of an uncommitted transaction |
`RB` | int8 | 0x13 | Rollback of a transaction |
`CKP` | int8 | 0x40 | Checkpoint |
Most transactions have only one log entry; only long or large transactions may need to record multiple log entries. So the log of a transaction may consist of 1+ `UC` type log entries plus one `PC` type log entry, or of only one `AC` type log entry. TAE assigns a dedicated group to log entries of type `UC`. Here is the transaction log of six committed transactions. specifies the log entry in group 2 with LSN 3.
A transaction log entry includes multiple nodes, and there are multiple types of nodes: DML node, delete info node, append info node, update info node. A node is an atomic command, which can be annotated as a sub-entry index of a committed entry. For example, there are 3 nodes in , which can be annotated as
Item | Size(Byte) | Scope | Description |
---|---|---|---|
`Id` | 8 | All | Specify the transaction id |
`InfoLen` | 2 | All | Specify the length of transaction info |
`Info` | - | All | Specify the transaction info |
`PrevPtr` | 12 | `UC`, `PC`, `RB` | Specify the previous entry |
`NodeCnt` | 4 | All | Specify the count of nodes |
Type | Datatype | Value | Description |
---|---|---|---|
`DML_A` | int8 | 0x1 | Append info |
`DML_D` | int8 | 0x2 | Delete info |
`DML_U` | int8 | 0x3 | Update info |
`DDL_CT` | int8 | 0x4 | Create table info |
`DDL_DT` | int8 | 0x5 | Drop table info |
`DDL_CD` | int8 | 0x6 | Create database info |
`DDL_DD` | int8 | 0x7 | Drop database info |
`DDL_CI` | int8 | 0x8 | Create index info |
`DDL_DI` | int8 | 0x9 | Drop index info |
Item | Size(Byte) | Scope | Description |
---|---|---|---|
`Type` | 1 | All | Specify the type of node |
`Len` | 4 | All | Specify the length of node |
`Buf` | - | All | Node payload |
Here are 6 concurrent transactions, arranged chronologically from left to right. specify the second write operation of . specify the commit of . specify the rollback of .
As mentioned in the chapter on transactions, all active transactions share a fixed-size memory space, which is managed by the buffer manager. When the remaining space is not enough, some transaction nodes will be unloaded. If it is the first time the node is unloaded, it will be saved to the redo log as a log entry; when loaded again, the corresponding log entry will be read back from the redo log.
specify the first transaction node of . At first, register a transaction node in buffer manager, and add into it.
- register a transaction node in buffer manager and add into it. Add into .
- register a transaction node in buffer manager and add into it.
- register a transaction node in buffer manager and add into it. Add into . Till now, no log entry generated.
- register a transaction node in buffer manager and add into it.
- register a transaction node in buffer manager and add into it. Add into . Add into . Add into and create a log entry for . Add into and create a log entry for .
- Unregister and from buffer manager. Before adding into , there is no space left. is chosen as a candidate for eviction and it is unloaded as a log entry. Add into . Register a transaction node in buffer manager and add into it. Add into and create a log entry for . Add into and create a log entry for . Add into . Add into and create a log entry for (for a rolled-back transaction, a log entry is only necessary if some of the previous operations have been persisted).
Typically, a checkpoint is a safe point from which the state machine can start applying log entries during restart. Entries before the checkpoint are no longer needed and will be physically destroyed at the appropriate time. A checkpoint can represent the equivalent of the collection of data in the range it indicates. For example, is the equivalent of the entries from to , and entries in the range are no longer needed. Replay starts from the last checkpoint log entry on restart.
However, the checkpoint in TAE is different:
The log of TAE will have different groups, and the LSN of each group is continuously monotonically increasing. There is a dedicated group for checkpoint log entries.
The range indicated by a typical checkpoint is always a continuous interval from the minimum value to a certain LSN, like and . The interval of TAE does not need to be continuous, like . Furthermore, given that each committed entry is a collection of multiple subcommands, each checkpoint should be a collection of sub-command indexes.
Catalog is TAE's in-memory metadata manager that manages all states of the engine, and the underlying driver is an embedded LogStore. Catalog can be fully replayed from the underlying LogStore.
- Storage layout info
- Database and table schema info
- DDL operation info
Catalog is the database of TAE metadata. As a database, it has the following characteristics:
- In-memory database (Data and Indexes)
- Row oriented database
- MVCC
- RR isolation
- DDL like `CREATE|DROP TABLE, CREATE|DROP DATABASE` is not supported. There are only a few built-in tables, which cannot be deleted.
The hierarchical relationship from top to bottom is: Database, Table, Segment, Block. Catalog is responsible for managing these resources and supports addition, deletion, modification and query.
Catalog creates a table for each resource, corresponding to Database Entry, Table Entry, Segment Entry and Block Entry.
The Catalog transaction is actually a sub-transaction of the engine transaction. Each table has an in-memory primary key index, and each node in the index corresponds to a table row. Once there is any update on the row, a version chain will be created for it.
- Search the primary key index; if a row with the same primary key exists, return Duplicate error
- Create a new index node (uncommitted)
- Search the primary key index; if not found, return NotFound error
- If the row was already deleted (committed), return NotFound error
- If there's no version chain on the row, create a version chain and insert a delete node into the chain. Return
- Scan the version chain
  - If a delete node is found
    - If the delete node is committed, return NotFound error
    - If the delete node is uncommitted and belongs to a different transaction, return `W-W` conflict error; else return NotFound error
- Insert a delete node into the chain. Return
- Search the primary key index; if not found, return NotFound error
- If the row was already deleted (committed), return NotFound error
- If there's no version chain on the row, create a version chain and insert an update node into the chain. Return
- Scan the version chain
  - If a delete node is found
    - If the delete node is committed, return NotFound error
    - If the delete node is uncommitted and belongs to a different transaction, return `W-W` conflict error; else return NotFound error
  - If an update node is found
    - If the update node is uncommitted and belongs to a different transaction, return `W-W` conflict error
- Insert an update node into the chain. Return
- Search the primary key index; if not found, return NotFound error
- If the row was already deleted (committed), return NotFound error
- If there's no version chain on the row
  - If the row is uncommitted
    - If it is the same transaction, return the row value
    - If it is a different transaction, return NotFound error
  - If the row is committed, return the row value
- Scan the version chain
  - If a delete node is found
    - If the delete node was committed before the query transaction started, return NotFound error
    - If the delete node is uncommitted and belongs to the same transaction, return NotFound error
  - If an update node is found
    - If the update node was committed before the query transaction started, return the row value
    - If the update node is uncommitted and belongs to the same transaction, return the row value
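Below is a minimal sketch, with hypothetical types, of the visibility rule the query flow applies when scanning a version chain; the actual TAE structures differ.

```go
package catalog

// VersionNode is a hypothetical version-chain node: CommitTS is 0 while
// the owning transaction is uncommitted.
type VersionNode struct {
	TxnID    uint64
	CommitTS uint64
	Deleted  bool
	Value    interface{}
}

// visible reports whether a query from transaction txnID that started at
// startTS should see this node, per the rules listed above.
func visible(n *VersionNode, txnID, startTS uint64) bool {
	if n.CommitTS == 0 {
		// Uncommitted: visible only to its own transaction.
		return n.TxnID == txnID
	}
	// Committed: visible if committed before the query transaction started.
	return n.CommitTS < startTS
}
```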
- All uncommitted changes are stored in transaction's in-memory store
- All rows read in the transaction will be recorded in the txn store.
- Commit pipeline
  - Prepare:
    - Check R-W anti-dependency. If violated, rollback
    - Update all related uncommitted rows' commit info
  - Commit: flush and update visible version
  - Rollback: remove all uncommitted rows
Any operation within a subtransaction corresponds to a sub-command in the engine's transaction log. So a commit or rollback of a Catalog transaction is a checkpoint to some sub-commands in the engine's transaction log.
- In-memory version chain pruning
- Hard delete previous soft deleted rows
- Compact disk data
In TAE, a Table is a Column Family while a Database is a collection of Column Families. The main idea behind Column Families is that they share the write-ahead log (Share Log Space), so that we can implement Database-level atomic writes. The old WAL cannot be compacted when the mutable buffer of a Table is flushed, since it may contain live data from other Tables. It can only be compacted when the mutable buffers of all related Tables are flushed.
TAE supports multiple Databases, that is, one TAE instance can work with multiple Log Spaces. Our MatrixOne DBMS is built upon multi-raft; each node only needs one TAE engine, and each raft group corresponds to a Database. It is complicated, and what makes it more complicated is that the engine shares the external WAL with the Raft log.
TAE uses MVCC to provide snapshot isolation of individual transactions. For SI, the consistent read view of a transaction is determined by the transaction start time, so that data read within the transaction will never reflect changes made by other simultaneous transactions. For example, for , the read view includes , and more fine-grained read view to the block level includes .
TAE provides value-level fine-grained optimistic concurrency control, only updates to the same row and same column will conflict. The transaction uses the value versions that exist when the transaction begins and no locks are placed on the data when it is read. When two transactions attempt to update the same value, the second transaction will fail due to write-write conflict.
In TAE, a table includes multiple segments. A segment is the result of the combined action of multiple transactions, so a segment can be represented as ( is the commit time of the oldest transaction while is the commit time of the newest). Since a segment can be compacted into a new segment and segments can be merged into a new segment, we need to add a dimension to the segment representation to distinguish versions ( is the segment create time while is the segment drop time). means the segment is not dropped. The block representation is the same as the segment .
A transaction can be represented as ( is the transaction start time while is the commit time). The read view of a transaction can be determined by the following formula:
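A plausible formulation under the representations above (an assumption, reconstructing the elided formula; $S$ ranges over segments):

$$ReadView(Txn) = \{\, S \mid S.T_{create} \le Txn.T_{start} < S.T_{drop} \,\}$$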
When a transaction is committed, it is necessary to obtain a read view related to the commit time for deduplication:
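Under the same assumed notation, the commit-time read view substitutes the commit timestamp for the start timestamp:

$$CommitReadView(Txn) = \{\, S \mid S.T_{create} \le Txn.T_{commit} < S.T_{drop} \,\}$$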
For example, the read view of includes while the read view during commit includes .
The block read view is similar to that of a segment.
Compaction is needed for space efficiency, read efficiency, and timely data deletion. In TAE, the following scenarios require compaction:
- . When inserting data, it first flows to L0 in an unordered manner. After certain conditions are met, the data will be reorganized and flowed to L1, sorted by the primary key.
- . Multiple L1 blocks are merge-sorted into an L2 segment.
- . If there are many updates to an L2 block, it is needed to compact the block into a new block to improve read efficiency.
- . If there are many updates to an L2 segment, it is needed to compact the segment into a new segment to improve read efficiency.
- . Multiple L2 segments are merge-sorted into an L2 segment.
is created @ , which contains data from . starts to sort @ , and its block read view is the baseline plus an uncommitted update node, which will be skipped. Sorting and persisting a block may take a long time. There are two committed transactions and one uncommitted before committing the sorted . When committing @ , it will fail because has been terminated. Update nodes that were committed in between will be merged into a new update node, and it will be committed together with @ .
A compaction is the termination of a series of blocks or segments, while atomically creating a new one (building index). It usually takes a long time compared to normal transactions, and we don't want to block update or delete transactions on involved blocks or segments. Here we extend the content of the read view to include the metadata of blocks and segments. When committing a normal transaction, once it is detected that the metadata of blocks (segments) corresponding to a write operation has been changed (committed), it will fail.
For a compaction transaction, write operations include block (segment) soft deletion and addition. During the execution of the transaction, each write will detect a write-write conflict. Once there is a conflict, the transaction will be terminated in advance.
A tuple is the representation of a transaction, where both elements are of type `uint64`, and can be regarded as . The global timestamp is monotonically increasing continuously from 1, and after restarting, the previous value needs to be restored as a new starting point. When starting a transaction, the global timestamp is assigned to the transaction's and incremented by 1. When a transaction is committed, the global timestamp is also assigned to its and incremented by 1.
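A minimal sketch of this clock, assuming an atomic counter (the actual TAE allocator may differ):

```go
package txnclock

import "sync/atomic"

// Clock is the global timestamp generator described above.
type Clock struct{ ts uint64 }

// NewClock restores the previously persisted value as the new start point.
func NewClock(restored uint64) *Clock { return &Clock{ts: restored} }

// Alloc atomically advances the global timestamp and returns the new
// value; it is called once at transaction start and once at commit.
func (c *Clock) Alloc() uint64 { return atomic.AddUint64(&c.ts, 1) }
```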
A transaction usually consists of multiple commands, and each command is usually trivial. When committing a transaction, we want to be able to preprocess some commands ahead of time. It is a requirement of the fuzzy checkpoint mechanism.
Here are all command types:
Command | Datatype | Value | Description |
---|---|---|---|
`CREATE_DB` | int8 | 0x01 | Create a database |
`DELETE_DB` | int8 | 0x02 | Delete a database |
`CREATE_TABLE` | int8 | 0x13 | Create a table |
`UPDATE_TABLE` | int8 | 0x14 | Update a table |
`DELETE_TABLE` | int8 | 0x15 | Delete a table |
`INSERT` | int8 | 0x30 | Insert rows |
`UPDATE_COMMITTED` | int8 | 0x32 | Update committed value |
`DELETE_LOCAL` | int8 | 0x33 | Delete row in transaction local store |
`DELETE_COMMITTED` | int8 | 0x34 | Delete committed row |
The raw command list for a transaction is in the order accepted. is the command sequence and is the command type. Split splits the command list into several lists, which improves parallelism during committing. Split is table-bounded: all commands applied on the same table are grouped into the same list.
Merge transforms a command list into a new list , where .
- There are some types of commands that cannot be merged: `CREATE_DB, DELETE_DB, CREATE_TABLE, UPDATE_TABLE, DELETE_TABLE`
- Some commands can be merged, e.g.
  1. INSERT + INSERT => INSERT
  2. INSERT + DELETE_LOCAL => INSERT
- Command dependencies: prepend all `DELETE_COMMITTED` commands
  `INSERT, INSERT, DELETE_COMMITTED, INSERT => DELETE_COMMITTED, INSERT`
All inserted data is stored in transaction local storage before being committed. Data in transaction local storage is grouped into tables, and an insert request inserts data into the target table. The data for each table contains one or more batches. If the amount of request data is particularly large, it will be split into multiple batches.
Suppose the maximum number of rows in a batch is 10. Here are the commands in a transaction:
1. INSERT 15 ROWS INTO TABLE1
2. INSERT 4 ROWS INTO TABLE2
3. INSERT 5 ROWS INTO TABLE1
Txn Local Storage
|
|--- Table1
| |
| |--- Batch1
| |--- Batch2
|
|--- Table2
|
|--- Batch3
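A sketch of the batch-splitting rule, assuming a cap of 10 rows per batch as in the example; `splitBatches` is a hypothetical helper.

```go
package txncmd

// splitBatches returns the row counts of the batches produced for n
// inserted rows: 15 rows with maxRows=10 yields [10 5], matching
// Batch1 and Batch2 of Table1 above.
func splitBatches(n, maxRows int) []int {
	var sizes []int
	for n > 0 {
		sz := n
		if sz > maxRows {
			sz = maxRows
		}
		sizes = append(sizes, sz)
		n -= sz
	}
	return sizes
}
```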
If any deletes are applied to a batch in the transaction local store, a bitmap for deletion is created. Any update will be transferred to a delete plus an insert.
A delete history will be created for blocks with any DELETE operation.
type UpdateEntry struct {
	txn     *TxnEntry   // owning transaction
	indexes []*LogIndex // sub-command indexes in the transaction log
}

// DeleteHistory records the delete entries of a block, keyed by row offset.
type DeleteHistory struct {
	entries map[int32]*UpdateEntry
}
For an UPDATE of the primary key, it will be converted to a DELETE plus an INSERT. So the UPDATE we are talking about is always an update to a non-primary-key column.
// UpdateNode records updated values of a committed column block, keyed by row offset.
type UpdateNode struct {
	UpdateEntry
	values map[int32]interface{}
}
TODO
A transaction usually contains multiple DDL and DML statements. As mentioned in the Catalog chapter, Catalog has its own transaction mechanism, and a TAE transaction contains both DDL and DML, so we take the Catalog transaction as a sub-transaction of the TAE transaction.
All DDL operations correspond to Catalog DML operations, see the corresponding chapter for details.
As mentioned earlier, the data of a transaction is grouped by table, and each group of data is a combination of the following data types:
- . The i-th uncommitted batch
- . The delete bitmap of
- . The delete node of a committed block.
- . The update node of a committed column block.
- `CREATE_TABLE`
- `DROP_TABLE`
When committing a transaction, all combinations can be summed up in the following pattern:
- P1: Only insert
- P2: Only delete or update to committed data
- P3: Insert and delete or update to inserted data
- P4: Insert and delete or update to committed data
- P5: Only DDL
- P6: DDL and others
- Wrap each batch as a command with a sequence number
- Append the batch to the state machine with annotated command context
- Wrap each update|delete node as a command with a sequence number
- Update the update|delete node commit info
- If there is a delete bitmap for a batch, apply the delete bitmap first to generate a new batch
- Same as P1
- Wrap a command with a sequence number
- Update the commit info
- All operations on a table before a `DROP TABLE` can be eliminated: delete all related uncommitted batches and delete|update nodes, then do as P5
- `CREATE TABLE` should always be the first command, unless there is a `DROP TABLE` later
TODO
TODO
TODO
TODO
- Metadata compaction
- In-memory version chain compaction
- In-memory hard deleted metadata entry compaction
- Persisted data compaction
- Stale data deletion
- Table data with reference count equal to 0
- Log compaction
TODO