XuPeng-SH/tae_design

  • Feature Name: Transactional Analytic Engine
  • Status: In Progress
  • Start Date: 2022-02-21
  • Authors: Xu Peng
  • Implementation PR:
  • Issue for this RFC:

Summary

TAE (Transactional Analytic Engine) is designed for hybrid transactional/analytical workloads and can be used as the underlying storage engine of a database management system (DBMS) for hybrid transactional/analytical processing (HTAP).

Guide-level design

Terms

Layout

  • Block: Piece of a segment; the minimum unit of table data. The maximum number of rows of a block is fixed
  • Segment: Piece of a table, composed of blocks
  • Table: Piece of a database, composed of segments
  • Database: A combination of tables, which share the same log space

TODO

State

  • Transient Block: A block whose row count has not reached the upper limit, or a full block still queued to be sorted and flushed
  • Sorted Block: A sorted block
  • Unsorted Segment: A segment that has not been merge-sorted
  • Sorted Segment: A segment that has been merge-sorted

TODO

Container

  • Vector: Data fragment of a column in memory
  • Batch: A combination of vectors in which the row counts of all vectors are aligned

TODO

Data storage

Table

TAE stores data represented as tables. Each table is bound to a schema consisting of a number of column definitions. Table data is organized as a log-structured merge-tree (LSM tree).

Currently, TAE is a three-level LSM tree, with levels called L0, L1 and L2. L0 is small and can be entirely memory-resident, whereas L1 and L2 are both definitely disk-resident. In TAE, L0 consists of transient blocks and L1 consists of sorted blocks. Incoming new data is always inserted into the latest transient block. If the insertion causes the block to exceed the maximum row count of a block, the block is sorted by primary key and flushed into L1 as a sorted block. If the number of sorted blocks exceeds the maximum block count of a segment, the segment is sorted by primary key using merge sort.
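
This flow can be condensed into a toy sketch; every name and limit below is an illustrative assumption rather than TAE's actual API:

package main

import (
	"fmt"
	"sort"
)

// Toy limits; the real values are configuration-driven.
const (
	maxRowsPerBlock     = 4
	maxBlocksPerSegment = 2
)

type table struct {
	segments  [][][]int // segments of sorted blocks (L1/L2)
	transient []int     // the latest transient block (L0); a row is reduced to its primary key
}

func (t *table) append(key int) {
	t.transient = append(t.transient, key)
	if len(t.transient) < maxRowsPerBlock {
		return
	}
	// L0 -> L1: sort the full transient block by primary key and flush it.
	sort.Ints(t.transient)
	n := len(t.segments)
	if n == 0 || len(t.segments[n-1]) >= maxBlocksPerSegment {
		t.segments = append(t.segments, nil)
		n++
	}
	t.segments[n-1] = append(t.segments[n-1], t.transient)
	t.transient = nil
	// L1 -> L2: a full segment is merge-sorted by primary key (a real
	// implementation would k-way merge and mark the segment non-appendable).
	if len(t.segments[n-1]) == maxBlocksPerSegment {
		var merged []int
		for _, blk := range t.segments[n-1] {
			merged = append(merged, blk...)
		}
		sort.Ints(merged)
		t.segments[n-1] = [][]int{merged}
	}
}

func main() {
	t := &table{}
	for _, k := range []int{9, 3, 7, 1, 8, 2, 6, 4} {
		t.append(k)
	}
	fmt.Println(t.segments) // [[[1 2 3 4 6 7 8 9]]]
}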

L1 and L2 are organized into sorted runs of data. Each run contains data sorted by the primary key and can be represented on disk as a single file. There may be overlapping primary key ranges between sorted runs. The difference between L1 and L2 is that a run in L1 is a block while a run in L2 is a segment.

A segment can be compacted into a new segment if it has accumulated many updates (deletions), and multiple segments can be merged into one. The scheduling behind this follows customizable strategies, mainly trading off write amplification against read amplification.

As described above, transient blocks can be entirely memory-resident, but not necessarily so. Since there are many tables and each table has transient blocks, keeping them all resident in memory would be hugely wasteful. In TAE, transient blocks from all tables share a dedicated fixed-size LRU cache. An evicted transient block is unloaded from memory and flushed as a transient block file. In practice, transient blocks are constantly flowing to L1 and the number of transient blocks per table at any moment is very small, so the active transient blocks will likely reside in memory even with a small cache.

Indexes

There's no table-level index in TAE; only segment-level and block-level indexes are available.

In TAE, there is a dedicated fixed-size LRU cache for all indexes. Compared with the original data, an index occupies limited space, yet it accelerates queries significantly and is accessed very frequently. A dedicated cache avoids a memory copy on each access.

Primary key index

TAE creates an index for each table's primary key by default. Its main functions are to deduplicate when inserting data and to filter by primary key. Deduplication is on the critical path of data insertion. We need to make trade-offs among the following three aspects:

  • Query performance
  • Memory usage
  • Match with the underlying data store layout

By index granularity, we divide indexes into two categories: a table-level index, and an index set composed of a series of partition indexes. For example, we can have a table-level B-tree index, or one B-tree index per segment. The table data of TAE consists of multiple segments, and each segment is first unordered and later ordered; compaction, merging, or splitting may take place afterwards. This scenario is very unfriendly to a table-level index, so the index of TAE is a segment-level index set.

There are two types of segments: appendable and non-appendable. For a non-appendable segment, the segment-level index is a two-level structure: a bloomfilter and a zonemap. There are two options for the bloomfilter, a segment-based bloomfilter and a block-based bloomfilter; the segment-based one is the better choice when the index can be fully memory-resident. An appendable segment consists of at least one appendable block plus multiple non-appendable blocks. The index of an appendable block is a memory-resident ART-tree plus a zonemap, while a non-appendable block's index is a bloomfilter plus a zonemap.
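
A hedged sketch of the resulting lookup order; all type names below are hypothetical stand-ins:

// The zonemap prunes by min/max first; an appendable block consults its
// memory-resident ART tree (exact), a non-appendable block its bloomfilter
// (probabilistic).
type zonemap struct{ min, max int }

func (z zonemap) mayContain(k int) bool { return k >= z.min && k <= z.max }

type blockIndex struct {
	zm         zonemap
	appendable bool
	art        map[int]uint32 // stand-in for the ART tree: key -> row offset
	mayExist   func(int) bool // stand-in for the bloomfilter
}

// lookup reports a definite hit (hit) or a possible hit that must be
// verified against the stored column data (needVerify).
func (b *blockIndex) lookup(key int) (hit, needVerify bool) {
	if !b.zm.mayContain(key) {
		return false, false // pruned by the zonemap
	}
	if b.appendable {
		_, ok := b.art[key]
		return ok, false // the ART tree is exact
	}
	if !b.mayExist(key) {
		return false, false // bloomfilters have no false negatives
	}
	return false, true // possible hit: read the block to confirm
}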

The query flow of the segment index is as follows:

TODO

Secondary index

TODO

Compression

TAE is a column-oriented data store, which is very friendly to data compression. It supports per-column compression codecs; currently only LZ4 is used. The meta information of compressed blocks is easy to obtain. In TAE, the compression unit is one column of one block.
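
The per-column codec can be sketched as a small hook; the interface below is a hypothetical illustration (TAE currently wires in LZ4 only), with the algo byte matching the CompressAlgo field in the file layout:

// Codec compresses and decompresses one column of one block.
type Codec interface {
	Compress(src []byte) ([]byte, error)
	Decompress(src []byte, rawSize int) ([]byte, error)
}

var codecs = map[uint8]Codec{} // CompressAlgo byte -> codec

func RegisterCodec(algo uint8, c Codec) { codecs[algo] = c }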

Layout

Storage File Format

Header
+-------------+--------------+-------------+----------------+------------------+------------------+
|  Magic (8B) | Version (2B) | Format (2B) | Reserved (32B) | MinPartSize (2B) | MaxPartSize (2B) |
+-------------+--------------+-------------+----------------+------------------+------------------+

Magic = Engine identity (0x01346616). TAE only
Version = File version
Format = Layout format
Reserved = 32 bytes reserved space
MinPartSize = Specify the size of the smallest part unit. 1 for 1K bytes. If 4 is specified, the smallest part size is 4K bytes
MaxPartSize = Specify the size of the biggest part. 1 for 1K bytes
Meta
Meta-1, Meta-2 = One is stale while the other is active

+----------+----------+-
| <Meta-1> | <Meta-2> |
+----------+----------+-
       |
       |
+---------------------------------------------------+
|                        Meta                       |
+---------------+-----------------+-----------------+
|  Version (2B) | PartOffset (4B) | Reserved (128B) |
+---------------+-----------------+-----------------+

Version = Meta version
PartOffset = The first part position
Reserved = 128 bytes reserved space
Part
+------------+---------------------+---- ... ----+
|  Size (2B) | NextPartOffset (4B) |   Payload   |
+------------+---------------------+---- ... ----+

Size = Part size. 1 for 1K bytes
NextPartOffset = Next part pointer
Payload = The size of payload is ($Size * 1K - 2B - 4B)

Segment File

Format       Value  Description
-----------  -----  -------------------
SegmentFile  0x10   Segment file format
+-------------------------------------------------------------------------------+
|                       Segment File Meta Part Payload                          |
+-----------+---------------+--------------+---------------+--------------------+
| Cols (2B) | BlockCnt (2B) | <ColumnMeta> | <IndexesMeta> | <BlockIndexesMeta> |
+-----------+---------------+--------------+---------------+--------------------+
                                    |             |                  |
                                    |             |                  |
                                    |             |  +-------------------------------------+
                                    |             |  |           BlockIndexesMeta          |
                                    |             |  +------------------+-------...--------+
                                    |             |  | <IndexesMeta>    |       ...        |
                                    |             |  +------------------+-------...--------+
                                    |             |          |
                                    |             |          |
                                    |  +------------------------------------------------------+
                                    |  |                        IndexesMeta                   |
                                    |  +-------------------+------------+-------------+--...--+
                                    |  | CompressAlgo (1B) | Count (2B) | <IndexMeta> |  ...  |
                                    |  +-------------------+------------+-------------+--...--+
                                    |                                        |
                                    |                                        |
             +----------------------------------+   +----------------+----------------+-----------------+------------+-----------+--------------+
             |             ColumnMeta           |   |                                        IndexMeta                                          |
             +-------------------+--------------+   +----------------+----------------+-----------------+------------+-----------+--------------+
             | CompressAlgo (1B) | <BlocksMeta> |   | IndexType (1B) | ColumnIdx (2B) | PartOffset (4B) | Start (4B) | Size (4B) | RawSize (4B) |
             +-------------------+--------------+   +----------------+----------------+-----------------+------------+-----------+--------------+
                                        |
                                        |
                +----------------------------------------------------+
                |                    BlocksMeta                      |
                +--------------+-------------+---...---+-------------+
                |  <BlockMeta> | <BlockMeta> |         | <BlockMeta> |
                +--------------+-------------+---...---+-------------+
                        |
                        |
+----------------------------------------------------------+
|                        BlockMeta                         |
+------------------+------------+-----------+--------------+
|  PartOffset (4B) | Start (4B) | Size (4B) | RawSize (4B) |
+------------------+------------+-----------+--------------+
Mutation

Suppose MinPartSize is 4, which stands for 4K. We start with an empty segment (no parts) and perform the following steps in sequence (a sketch of the size calculation follows the walkthrough):

  1. Append a block (Block0) of 3 columns, and the compressed sizes of the three columns are 3.4K, 7.1K, and 2.5K.
    1) Calculate an optimal size for each column: 4K, 8K and 4K
    2) Find unused candidate parts. Not found here.
    3) Allocate new parts for these 3 columns
    4) Flush these 3 columns to the specified parts
    5) Flush meta
    

  2. Append a block (Block1) of 3 columns, and the compressed sizes of the three columns are 11.3K, 7.8K, and 3.5K.
    1) Calculate an optimal size for each column: 12K, 8K and 4K
    2) Find unused candidate parts. Not found here.
    3) Allocate new parts for these 3 columns
    4) Flush these 3 columns to the specified new parts
    5) Flush meta
    

  3. There are some updates to the 3rd column of Block0, which is marked as Block0-2. Now we start to checkpoint the updates into the segment file. The compressed size of the updated Block0-2 is 7.3K.
    1) Calculate an optimal size: 8K
    2) Find unused candidate parts. Not found here.
    3) Allocate a new part
    4) Flush to the specified new part
    5) Flush meta
    

image

  4. There are some updates to the first column of Block1, which is marked as Block1-0. Now we start to checkpoint the updates into the segment file. The compressed size of the updated Block1-0 is 3.6K.
    1) Calculate an optimal size: 4K
    2) Find unused candidate parts. Unused Part(3, 1) found
    3) Flush to the specified unused part
    4) Flush meta
    

image

  5. Append a block (Block2) of 3 columns, and the compressed sizes of the three columns are 2.8K, 3.1K, and 3.7K.
    1) Calculate an optimal size for each column: 4K, 4K and 4K
    2) Find unused candidate parts. Unused Part(4, 1), Part(5, 1), Part(6, 1) found
    3) Flush these 3 columns to the specified unused parts
    4) Flush meta
    

image
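
From the walkthrough, the "optimal size" is just the compressed column size rounded up to a multiple of MinPartSize (4K here): 3.4K -> 4K, 7.1K -> 8K, 11.3K -> 12K. A sketch with sizes in KB (the function name is illustrative); note that a part's payload capacity is Size*1K - 6B (the Size and NextPartOffset headers), so a column that exactly fills Size*1K would need the next multiple:

func optimalPartKB(compressedKB float64, minPartKB int) int {
	n := minPartKB
	for float64(n) < compressedKB {
		n += minPartKB // round up to the next multiple of MinPartSize
	}
	return n
}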

Buffer manager

The buffer manager is responsible for allocating buffer space. It handles all requests for data pages and temporary blocks of TAE.

  1. Each page is bound to a buffer node with a unique node ID
  2. A buffer node has two states:
    1. Loaded
    2. Unloaded
  3. When a requestor Pins a node:
    1. If the node is in the Loaded state, the manager increases the node reference count by 1 and wraps a node handle with the page address in memory
    2. If the node is in the Unloaded state, the manager first reads the page from disk or remote storage, increases the node reference count by 1 and wraps a node handle with the page address in memory. When there is no room left in the buffer, some victim nodes will be unloaded to make room. The current replacement strategy is LRU
  4. When a requestor Unpins a node, it just calls Close on the node handle, which decreases the node reference count by 1. If the reference count drops to 0, the node becomes a candidate for eviction. A node with a reference count greater than 0 is never evicted.
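
A sketch of the Pin/Unpin protocol; the API shape below is an assumption, not TAE's actual one:

type bufferNode struct {
	id     uint64
	loaded bool
	ref    int32  // reference count; only ref == 0 nodes can be evicted
	page   []byte // page address in memory when loaded
}

type nodeHandle struct{ n *bufferNode }

// Close is the Unpin: drop one reference; at zero the node becomes an
// eviction candidate (actual eviction is driven by the LRU policy).
func (h *nodeHandle) Close() { h.n.ref-- }

type bufferManager struct {
	nodes map[uint64]*bufferNode
}

// Pin assumes the node id is registered; a real implementation would use
// locking or atomics around the state and reference count.
func (m *bufferManager) Pin(id uint64) *nodeHandle {
	n := m.nodes[id]
	if !n.loaded {
		m.makeRoom()            // may unload ref == 0 victims in LRU order
		n.page = m.readPage(id) // read the page from disk or remote
		n.loaded = true
	}
	n.ref++
	return &nodeHandle{n: n}
}

func (m *bufferManager) makeRoom()              { /* LRU eviction elided */ }
func (m *bufferManager) readPage(uint64) []byte { return nil }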

There are currently four buffer managers for different purposes in TAE:

  1. Mutation buffer manager: A dedicated fixed-size buffer used by L0 transient blocks. Each block corresponds to a node in the buffer
  2. SST buffer manager: A dedicated fixed-size buffer used by L1 and L2 blocks. Each column within a block corresponds to a node in the buffer
  3. Index buffer manager: A dedicated fixed-size buffer used by indexes. Each block index or segment index corresponds to a node in the buffer
  4. Redo log buffer manager: A dedicated fixed-size buffer used by uncommitted transactions. Each transaction local storage consists of at least one buffer node.

LogStore

An embedded log-structured data store. It is used as the underlying driver of Catalog and WAL.

TODO

WAL

Write-ahead logging (WAL) is the key to providing atomicity and durability. All modifications must be written to a log before being applied. In TAE, the REDO log does not need to record every write operation, but it must record the transaction commit. By using the redo log buffer manager, we reduce I/O and avoid any I/O events for short transactions that may need to be rolled back due to various conflicts. Long or large transactions are also supported.

Log Entry

Entry Layout

Entry Header Layout

Item     Size(Byte)  Scope  Description
-------  ----------  -----  -------------------------------
GroupId  4           All    Specify the group id
LSN      8           All    Specify the log sequence number
Length   4           All    Specify the length of the entry
Type     1           All    Specify the entry type

Entry Type

Type  Datatype  Value  Description
----  --------  -----  ------------------------------------------------------
AC    int8      0x10   A committed transaction of complete write operations
PC    int8      0x11   A committed transaction of partial write operations
UC    int8      0x12   Partial write operations of an uncommitted transaction
RB    int8      0x13   Rollback of a transaction
CKP   int8      0x40   Checkpoint
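
For illustration, serializing the 17-byte entry header above (GroupId 4B + LSN 8B + Length 4B + Type 1B); the byte order is an assumption:

import "encoding/binary"

type EntryHeader struct {
	GroupId uint32
	LSN     uint64
	Length  uint32
	Type    int8
}

func (h EntryHeader) Marshal() []byte {
	buf := make([]byte, 17)
	binary.BigEndian.PutUint32(buf[0:4], h.GroupId)
	binary.BigEndian.PutUint64(buf[4:12], h.LSN)
	binary.BigEndian.PutUint32(buf[12:16], h.Length)
	buf[16] = byte(h.Type)
	return buf
}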

Transaction Log Entry

Most transactions have only one log entry; only long or large transactions may need to record multiple log entries. So the log of a transaction consists of one or more UC-type log entries plus one PC-type log entry, or of a single AC-type log entry. TAE assigns a dedicated group to log entries of type UC. A log entry is addressed by its group id and LSN, for example the log entry in group 2 with LSN 3.

A transaction log entry includes multiple nodes, and there are multiple node types: DML node, delete info node, append info node and update info node. A node is an atomic command, which can be annotated as a sub-entry index of a committed entry; a committed entry with 3 nodes, for example, yields sub-entry indexes 1, 2 and 3.

Transaction Payload

Meta Layout

Item     Size(Byte)  Scope     Description
-------  ----------  --------  ---------------------------------------
Id       8           All       Specify the transaction id
InfoLen  2           All       Specify the length of transaction info
Info     -           All       Specify the transaction info
PrevPtr  12          UC,PC,RB  Specify the previous entry
NodeCnt  4           All       Specify the count of nodes

Node Type

Type    Datatype  Value  Description
------  --------  -----  --------------------
DML_A   int8      0x1    Append info
DML_D   int8      0x2    Delete info
DML_U   int8      0x3    Update info
DDL_CT  int8      0x4    Create table info
DDL_DT  int8      0x5    Drop table info
DDL_CD  int8      0x6    Create database info
DDL_DD  int8      0x7    Drop database info
DDL_CI  int8      0x8    Create index info
DDL_DI  int8      0x9    Drop index info

Node Layout

Item  Size(Byte)  Scope  Description
----  ----------  -----  --------------------------
Type  1           All    Specify the type of node
Len   4           All    Specify the length of node
Buf   -           All    Node payload

Example

Here are six concurrent transactions, arranged chronologically from left to right (the original figure labels each transaction's individual write operations, its commit, and its rollback).

As mentioned in the chapter on transactions, all active transactions share a fixed-size memory space managed by the buffer manager. When the remaining space is not enough, some transaction nodes will be unloaded. The first time a node is unloaded, it is saved to the redo log as a log entry; when loaded again, the corresponding log entry is read back from the redo log.

When a transaction's first write arrives, a transaction node is registered in the buffer manager and the operation is added into it; subsequent operations are appended to the same node or to newly registered nodes. Up to the point of commit, no log entry needs to be generated.

When a transaction commits or its node fills up, the node is unregistered from the buffer manager and a log entry is created for it. If there is no space left before adding an operation, a node with reference count 0 is chosen as an eviction candidate and unloaded as a log entry. For a rolled-back transaction, a log entry is only necessary if some of its previous operations have already been persisted.

Checkpoint

Typically, a checkpoint is a safe point from which the state machine can start applying log entries during restart. Entries before the checkpoint are no longer needed and will be physically destroyed at an appropriate time. A checkpoint represents the equivalent of the collection of data in the range it indicates: a checkpoint at some LSN is the equivalent of all entries up to that LSN, which are then no longer needed. On restart, replay starts from the last checkpoint log entry.

However, the checkpoint in TAE is different

Dedicated Group

The log of TAE has different groups, and the LSN of each group is monotonically and continuously increasing. There is a dedicated group for checkpoint log entries.

Fuzzy Checkpoint

The range indicated by a typical checkpoint is always a continuous interval from the minimum LSN up to a certain LSN. The interval of a TAE checkpoint does not need to be continuous. Furthermore, given that each committed entry is a collection of multiple subcommands, each checkpoint should be a collection of subcommand indexes.
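
One possible shape for such a checkpoint entry; this struct is an assumption rather than TAE's actual definition:

// Instead of a single "everything up to LSN" watermark, a fuzzy checkpoint
// lists exactly which entries, and which sub-commands within each entry,
// are covered.
type CheckpointItem struct {
	GroupId uint32
	LSN     uint64
	SubCmds []uint32 // checkpointed sub-command indexes; empty means all
}

type FuzzyCheckpoint struct {
	Items []CheckpointItem // need not form a continuous LSN interval
}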

Catalog

Catalog is TAE's in-memory metadata manager; it manages all states of the engine, and its underlying driver is an embedded LogStore. Catalog can be fully replayed from the underlying LogStore. The managed state includes:

  1. Storage layout info
  2. Database and table schema info
  3. DDL operation info

Yet Another Database-Database

Catalog is the database of TAE metadata. As a database, it has the following characteristics:

  1. In-memory database (Data and Indexes)
  2. Row oriented database
  3. MVCC
  4. RR isolation
  5. DDL such as CREATE|DROP TABLE or CREATE|DROP DATABASE is not supported. There are only a few built-in tables, which cannot be deleted.

Built-in Tables

The hierarchical relationship from top to bottom is: Database, Table, Segment, Block. Catalog is responsible for managing these resources and supports addition, deletion, modification and query.

Catalog creates a table for each resource, corresponding to Database Entry, Table Entry, Segment Entry and Block Entry.

Transactional Operation

The Catalog transaction is actually a sub-transaction of the engine transaction. Each table has an in-memory primary key index, and each node in the index corresponds to a table row. Once there is any update on the row, a version chain will be created for it.

Insert

  1. Search the primary key index; if the same primary key already exists, return a Duplicate error
  2. Create a new index node (uncommitted)

Delete

  1. Search the primary key index; if not found, return a NotFound error
  2. If the row was already deleted (committed), return a NotFound error
  3. If there's no version chain on the row, create a version chain, insert a delete node into the chain and return
  4. Scan the version chain. If a delete node is found
    • If the delete node is committed, return a NotFound error
    • If the delete node is uncommitted: if it belongs to a different transaction, return a W-W conflict error; else return a NotFound error
  5. Insert a delete node into the chain. Return

Update

  1. Search the primary key index; if not found, return a NotFound error
  2. If the row was already deleted (committed), return a NotFound error
  3. If there's no version chain on the row, create a version chain, insert an update node into the chain and return
  4. Scan the version chain.
    • If a delete node is found
      • If the delete node is committed, return a NotFound error
      • If the delete node is uncommitted: if it belongs to a different transaction, return a W-W conflict error; else return a NotFound error
    • If an update node is found
      • If the update node is uncommitted, return a W-W conflict error
  5. Insert an update node into the chain. Return

Query

  1. Search the primary key index; if not found, return a NotFound error
  2. If the row was already deleted (committed), return a NotFound error
  3. If there's no version chain on the row
    • If the row is uncommitted
      • If it belongs to the same transaction, return the row value
      • If it belongs to a different transaction, return a NotFound error
    • If the row is committed, return the row value
  4. Scan the version chain.
    • If a delete node is found
      • If the delete node was committed before the query transaction started, return a NotFound error
      • If the delete node is uncommitted and belongs to the same transaction, return a NotFound error
    • If an update node is found
      • If the update node was committed before the query transaction started, return the row value
      • If the update node is uncommitted and belongs to the same transaction, return the row value
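
A sketch of the visibility walk behind these rules, using hypothetical types; commitTS == 0 marks an uncommitted node and the chain is scanned from newest to oldest:

import "errors"

var errNotFound = errors.New("NotFound")

type versionNode struct {
	deleted  bool
	value    interface{}
	txnID    uint64 // owner transaction
	commitTS uint64 // 0 while uncommitted
}

func readRow(base interface{}, chain []*versionNode, txnID, startTS uint64) (interface{}, error) {
	for _, n := range chain {
		committedBeforeStart := n.commitTS != 0 && n.commitTS < startTS
		ownUncommitted := n.commitTS == 0 && n.txnID == txnID
		if !committedBeforeStart && !ownUncommitted {
			continue // invisible to this transaction
		}
		if n.deleted {
			return nil, errNotFound
		}
		return n.value, nil
	}
	return base, nil // no visible node: the committed base row wins
}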

Commit & Rollback

  1. All uncommitted changes are stored in the transaction's in-memory store
  2. All rows read in the transaction will be recorded in the txn store.
  3. Commit pipeline
    • Prepare:
      • Check R-W anti-dependency. If violated, rollback
      • Update all related uncommitted rows' commit info
    • Commit: flush and update visible version
    • Rollback: remove all uncommitted rows

Checkpoint

Any operation within a subtransaction corresponds to a sub-command in the engine's transaction log. So a commit or rollback of a Catalog transaction is a checkpoint of some sub-commands in the engine's transaction log.

Compaction

  1. In-memory version chain pruning
  2. Hard delete previous soft deleted rows
  3. Compact disk data

Database (Column Families)

In TAE, a Table is a Column Family and a Database is a set of Column Families. The main idea behind Column Families is that they share the write-ahead log (shared log space), so that we can implement Database-level atomic writes. The old WAL cannot be compacted when the mutable buffer of one Table is flushed, since it may contain live data from other Tables; it can only be compacted once the mutable buffers of all related Tables have been flushed.
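
The compaction rule can be sketched as follows; the bookkeeping map is assumed:

// The WAL can only be truncated up to the smallest LSN through which every
// Table in the Database has flushed its mutable buffer.
func safeTruncateLSN(flushedThrough map[string]uint64) uint64 {
	first := true
	var min uint64
	for _, lsn := range flushedThrough { // one entry per Table
		if first || lsn < min {
			min, first = lsn, false
		}
	}
	return min // entries with LSN <= min no longer hold any live data
}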

TAE supports multiple Databases; that is, one TAE instance can work with multiple log spaces. Our MatrixOne DBMS is built upon multi-raft: each node needs only one TAE engine, and each raft group corresponds to a Database. This is complicated, and what makes it more complicated is that the engine shares an external WAL with the Raft log.

Multi-Version Concurrency Control (MVCC)

TAE uses MVCC to provide snapshot isolation (SI) for individual transactions. Under SI, the consistent read view of a transaction is determined by the transaction start time, so data read within the transaction never reflects changes made by other simultaneous transactions. A transaction's read view therefore includes exactly the segments committed before it started, and a more fine-grained read view extends this to the block level.

TAE provides value-level fine-grained optimistic concurrency control, only updates to the same row and same column will conflict. The transaction uses the value versions that exist when the transaction begins and no locks are placed on the data when it is read. When two transactions attempt to update the same value, the second transaction will fail due to write-write conflict.

Read View

In TAE, a table includes multiple segments. A segment is the result of the combined action of multiple transactions, so a segment's data can be represented as [T-start, T-end] (T-start is the commit time of the oldest transaction while T-end is the commit time of the newest). Since a segment can be compacted into a new segment, and segments can be merged into a new segment, we need to add a dimension to the segment representation to distinguish versions: [T-create, T-drop] (T-create is the segment create time while T-drop is the segment drop time). T-drop = ∞ means the segment is not dropped. The block representation is the same as the segment's.

A transaction can be represented as [Txn-start, Txn-commit] (Txn-start is the transaction start time while Txn-commit is the commit time). The read view of a transaction can be determined by the following formula: a segment version is visible to the transaction iff

T-create ≤ Txn-start < T-drop

When a transaction is committed, it is necessary to obtain a read view related to the commit time for deduplication: a segment version is visible at commit iff

T-create ≤ Txn-commit < T-drop

For example, the read view of a transaction includes all segment versions created before its start time and not dropped by then, while the read view during commit additionally includes versions committed between its start time and its commit time.

The block read view is similar to segment.
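
The two visibility predicates as a sketch; the version struct is a hypothetical rendering of the notation above:

import "math"

const notDropped = uint64(math.MaxUint64) // the notation's "infinity"

type version struct {
	create, drop uint64 // [T-create, T-drop) of a segment or block version
}

// visibleAt reports whether this version belongs to the read view taken at
// timestamp ts (Txn-start for reads, Txn-commit for commit-time dedup).
func (v version) visibleAt(ts uint64) bool {
	return v.create <= ts && ts < v.drop
}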

Concurrent Compaction

Compaction is needed for space efficiency, read efficiency, and timely data deletion. In TAE, the following scenarios require compaction:

  • When inserting data, it first flows to L0 in an unordered manner. After certain conditions are met, the data is reorganized and flows to L1, sorted by the primary key.
  • Multiple L1 blocks are merge-sorted into an L2 segment.
  • If there are many updates to an L2 block, the block is compacted into a new block to improve read efficiency.
  • If there are many updates to an L2 segment, the segment is compacted into a new segment to improve read efficiency.
  • Multiple L2 segments are merge-sorted into a new L2 segment.

Block Sort Example

A block is created at some timestamp and contains the data committed before it. The sort starts at a later timestamp, and its block read view is the baseline plus an uncommitted update node, which will be skipped. Sorting and persisting a block may take a long time: two more transactions commit and a third starts but stays uncommitted before the sorted block is committed. When that third transaction later tries to commit an update against the old block, it will fail because the old block has been terminated. Update nodes that were committed in between are merged into a new update node, which is committed together with the sorted block.

image

Compaction As A Transactional Schema Change

A compaction is the termination of a series of blocks or segments, while atomically creating a new one (and building its index). It usually takes a long time compared to normal transactions, and we don't want to block update or delete transactions on the involved blocks or segments. So we extend the content of a read view to include the metadata of blocks and segments. When committing a normal transaction, it will fail if it is detected that the metadata of the blocks (segments) corresponding to its write operations has been changed (committed) in the meantime.

For a compaction transaction, the write operations are block (segment) soft deletions and additions. During the execution of the transaction, each write detects write-write conflicts. Once there is a conflict, the transaction is terminated early.

Transaction

A transaction is represented as a tuple (StartTS, CommitTS), where both elements are of type uint64. The global timestamp increases monotonically and continuously from 1; after a restart, the previous value is restored as the new starting point. When a transaction starts, the global timestamp is assigned to the transaction's StartTS and incremented by 1. When a transaction is committed, the global timestamp is assigned to its CommitTS and incremented by 1.
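
A sketch of this allocation using an atomic counter; the variable and function names are illustrative:

import "sync/atomic"

var globalTS uint64 // restored from persisted state on restart

// nextTS hands out StartTS on transaction start and CommitTS on commit;
// after a fresh start the first value handed out is 1.
func nextTS() uint64 { return atomic.AddUint64(&globalTS, 1) }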

Preprocessing

A transaction usually consists of multiple commands, each of which is usually trivial. When committing a transaction, we want to be able to preprocess some commands ahead of time. This is a requirement of the fuzzy checkpoint mechanism.

Command

Here are all command types:

Command           Datatype  Value  Description
----------------  --------  -----  --------------------------------------
CREATE_DB         int8      0x01   Create a database
DELETE_DB         int8      0x02   Delete a database
CREATE_TABLE      int8      0x13   Create a table
UPDATE_TABLE      int8      0x14   Update a table
DELETE_TABLE      int8      0x15   Delete a table
INSERT            int8      0x30   Insert rows
UPDATE_COMMITTED  int8      0x32   Update committed value
DELETE_LOCAL      int8      0x33   Delete row in transaction local store
DELETE_COMMITTED  int8      0x34   Delete committed row

Split Commands

The raw command list of a transaction is in the order the commands were accepted; each command carries a sequence number and a command type. Split divides the command list into several lists, which improves parallelism during commit. Split is table-bounded: all commands applied to the same table are grouped into the same list.

Merge Commands

Merge transforms a command list into a new command list that is no longer than the original.

Rules
  • Some types of commands are never merged:
    CREATE_DB, DELETE_DB, CREATE_TABLE, UPDATE_TABLE, DELETE_TABLE

  • Commands can be merged, including across types:
    1. INSERT + INSERT => INSERT
    2. INSERT + DELETE_LOCAL => INSERT

  • Command dependency: prepend all DELETE_COMMITTED commands
    INSERT, INSERT, DELETE_COMMITTED, INSERT => DELETE_COMMITTED, INSERT
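
A sketch of Merge under these rules; the Command shape and the fold helper are assumptions, and payload handling is elided:

const (
	CREATE_DB        int8 = 0x01
	DELETE_DB        int8 = 0x02
	CREATE_TABLE     int8 = 0x13
	UPDATE_TABLE     int8 = 0x14
	DELETE_TABLE     int8 = 0x15
	INSERT           int8 = 0x30
	UPDATE_COMMITTED int8 = 0x32
	DELETE_LOCAL     int8 = 0x33
	DELETE_COMMITTED int8 = 0x34
)

type Command struct {
	Type int8
	// payload elided
}

// fold merges another command's payload into c (hypothetical helper).
func (c *Command) fold(other Command) {}

func mergeCommands(cmds []Command) []Command {
	var out, deletes []Command
	var insert *Command
	for _, c := range cmds {
		switch c.Type {
		case CREATE_DB, DELETE_DB, CREATE_TABLE, UPDATE_TABLE, DELETE_TABLE:
			out = append(out, c) // DDL commands are never merged
		case DELETE_COMMITTED:
			deletes = append(deletes, c) // dependency rule: prepended below
		case INSERT, DELETE_LOCAL:
			if insert == nil {
				insert = &Command{Type: INSERT}
			}
			insert.fold(c) // INSERT+INSERT, INSERT+DELETE_LOCAL => INSERT
		default:
			out = append(out, c) // e.g. UPDATE_COMMITTED passes through
		}
	}
	if insert != nil {
		out = append(out, *insert)
	}
	return append(deletes, out...)
}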
    
Example

DML

INSERT

Uncommitted

All inserted data is stored in transaction-local storage before commit. Data in transaction-local storage is grouped into tables, and an insert request inserts data into its target table. The data of each table consists of one or more batches; if the amount of data in a request is particularly large, it is split into multiple batches.

Suppose the maximum number of rows in a batch is 10. Here are commands in a transaction

1. INSERT 15 ROWS INTO TABLE1
  • Split into Batch1 (10 rows) and Batch2 (5 rows)
2. INSERT 4 ROWS INTO TABLE2
3. INSERT 5 ROWS INTO TABLE1
Txn Local Storage
       |
       |--- Table1
       |     |
       |     |--- Batch1
       |     |--- Batch2
       |
       |--- Table2
             |
             |--- Batch3
Committed

UPDATE & DELETE

UPDATE & DELETE INSERTED DATA

If any deletes are applied to a batch in the transaction-local store, a deletion bitmap is created for that batch. Any update is transformed into a delete plus an insert.

UPDATE & DELETE COMMITTED DATA

A delete history will be created for any block with DELETE operations applied.

// UpdateEntry records the transaction that made a change together with
// the log indexes of the change.
type UpdateEntry struct {
  txn *TxnEntry
  indexes []*LogIndex
}

// DeleteHistory records the delete entries of a block, keyed by row offset.
type DeleteHistory struct {
  entries map[int32]*UpdateEntry
}

An UPDATE of the primary key is converted into a DELETE plus an INSERT, so the UPDATE discussed here is always an update to a non-primary-key column.

// UpdateNode records the updated values of a column block, keyed by row offset.
type UpdateNode struct {
  UpdateEntry
  values map[int32]interface{}
}

Version Chain

TODO

DDL

A transaction usually contains multiple DDL and DML statements. As mentioned in the Catalog section, the Catalog has its own transaction mechanism, and a TAE transaction contains both DDL and DML, so we treat the Catalog transaction as a sub-transaction of the TAE transaction.

All DDL operations correspond to Catalog DML operations; see the corresponding chapter for details.

Commit

As mentioned earlier, the data of a transaction is grouped by table, and each group of data is a combination of the following data types:

  • The i-th uncommitted batch
  • The delete bitmap of an uncommitted batch
  • The delete node of a committed block
  • The update node of a committed column block
  • CREATE_TABLE
  • DROP_TABLE

When committing a transaction, all combinations can be summed up in the following pattern:

  1. P1: Only insert
  2. P2: Only delete or update to committed data
  3. P3: Insert and delete or update to inserted data
  4. P4: Insert and delete or update to committed data
  5. P5: Only DDL
  6. P6: DDL and others

P1

  1. Wrap each batch as a command with a sequence number
  2. Append the batch to the state machine with the annotated command context

P2

  1. Wrap each update|delete node as a command with a sequence number
  2. Update the update|delete node commit info

P3

  1. If there is a delete bitmap for a batch, apply the delete bitmap first to generate a new batch
  2. Same as P1

P4

  1. Process delete|update first. Same as P2
  2. Process insert. Same as P1

P5

  1. Wrap a command with a sequence number
  2. Update the commit info

P6

  1. All operations on a table that precede a DROP TABLE can be eliminated: delete all related uncommitted batches and delete|update nodes, then proceed as in P5
  2. CREATE TABLE should always be the first command, unless a DROP TABLE follows later

Commit Pipeline

TODO

Schema Change

TODO

Snapshot

TODO

Split

TODO

GC

  1. Metadata compaction
    1. In-memory version chain compaction
    2. In-memory hard deleted metadata entry compaction
    3. Persisted data compaction
  2. Stale data deletion
    1. Table data with reference count equal 0
    2. Log compaction

TODO
