Skip to content

Commit

Permalink
doc: add relational table schema doc + move files (#3712)
Browse files Browse the repository at this point in the history
* doc: add relational table schema doc + move files

* fix comment

* Update docs/relational_table/relational-table-schema.md

Co-authored-by: congyi <58715567+wcy-fdu@users.noreply.github.com>

Co-authored-by: congyi <58715567+wcy-fdu@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
  • Loading branch information
3 people authored Jul 7, 2022
1 parent 4a92eb6 commit f9362b1
Show file tree
Hide file tree
Showing 3 changed files with 41 additions and 4 deletions.
2 changes: 1 addition & 1 deletion docs/consistent-hash.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,7 @@ We know that [Hummock](./state-store-overview.md#overview), our LSM-Tree-based s
```
table_id | vnode | ...
```
where `table_id` denotes the [state table](./storing-state-using-relational-table.md#relational-table-layer), and `vnode` is computed via $H$ on key of the data.
where `table_id` denotes the [state table](relational_table/storing-state-using-relational-table.md#relational-table-layer), and `vnode` is computed via $H$ on key of the data.

To illustrate this, let's revisit the [previous example](#streaming). Executors of an operator will share the same logical state table, just as is shown in the figure below:

Expand Down
35 changes: 35 additions & 0 deletions docs/relational_table/relational-table-schema.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Relational Table Schema

We introduce the rough cell-based encoding format in [relational states](storing-state-using-relational-table.md#cell-based-encoding)

In this doc, we will take HashAgg with extreme state (`max`, `min`) or value state (`sum`, `count`) for example, and introduce a more detailed design for the internal table schema.

[Code](https://github.com/singularity-data/risingwave/blob/7f9ad2240712aa0cfe3edffb4535d43b42f32cc5/src/frontend/src/optimizer/plan_node/logical_agg.rs#L144)

## Table id
For all relational table states, the keyspace must start with `table_id`. This is a globally unique id allocated in meta. Meta is responsible for traversing the Plan Tree and calculating the total number of Relational Tables needed. For example, the Hash Join Operator needs 2, one for the left table and one for the right table. The number of tables needed for Agg depends on the number of agg calls.

## Value State (Sum, Count)
Query example:
```sql
select sum(v2), count(v3) from t group by v1
```

This query will need to initiate 2 Relational Tables. The schema is `table_id/group_key/column_id`.

## Extreme State (Max, Min)
Query example:
```sql
select max(v2), min(v3) from t group by v1
```

This query will need to initiate 2 Relational Tables. If the upstream is not append-only, the schema becomes `table_id/group_key/sort_key/upstrea_pk/column_id`.

The order of `sort_key` depends on the agg call kind. For example, if it's `max()`, `sort_key` will order with `Ascending`. if it's `min()`, `sort_key` will order with `Descending`.
The `upstream_pk` is also appended to ensure the uniqueness of the key.
This design allows the streaming executor not to read all the data from the storage when the cache fails, but only a part of it. The streaming executor will try to write all streaming data to storage, because there may be `update` or `delete` operations in the stream, it's impossible to always guarantee correct results without storing all data.

If `t` is created with append-only flag, the schema becomes `table_id/group_key/column_id`, which is the same for Value State. This is because in the append-only mode, there is no `update` or `delete` operation, so the cache will never miss. Therefore, we only need to write one value to the storage.



Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,8 @@ We implement a relational table layer as the bridge between stateful executors a
| join | table_id \| join_key \| pk \| col_id| materialized value |
| agg | table_id \| group_key \| col_id| agg_value |

For the detailed schema, please check [doc](relational-table-schema.md)

<!-- Todo: link cconsistence hash doc and state table agg doc -->
## Relational Table Layer
[source code](https://github.com/singularity-data/risingwave/blob/main/src/storage/src/table/state_table.rs)
Expand All @@ -43,13 +45,13 @@ In this part, we will introduce how stateful executors interact with KV state st
Relational table layer consists of State Table, Mem Table and Storage Table. The State Table provides the table operations by these APIs: `get_row`, `scan`, `insert_row`, `delete_row` and `update_row`, which are the read and write interfaces for executors. The Mem Table is an in-memory buffer for caching table operations during one epoch, and the Storage Table is responsible for writing row operations into kv pairs, which performs serialization and deserialization between cell-based encoding and KV encoding.


![Overview of Relational Table](images/relational-table-layer/relational-table-01.svg)
![Overview of Relational Table](../images/relational-table-layer/relational-table-01.svg)
### Write Path
To write into KV state store, executors first perform operations on State Table, and these operations will be cached in Mem Table. Once a barrier flows through one executor, executor will flush the cached operations into state store. At this moment, Storage Table will covert these operations into kv pairs and write to state store with specific epoch.

For example, an executor performs `insert(a, b, c)` and `delete(d, e, f)` through the State Table APIs, Mem Table first caches these two operations in memory. After receiving new barrier, Cell Based Table converts these two operations into KV operations by cell-based format, and write these KV operations into state store (Hummock).

![write example](images/relational-table-layer/relational-table-03.svg)
![write example](../images/relational-table-layer/relational-table-03.svg)
### Read Path
Executors should be able to read the just written data, which means uncommited data is visible. The data in Mem Table (memory) is fresher than that in shared storage (state store). State Table provides both point-get and scan to read from state store by merging data from Mem Table and Storage Table.
#### Get
Expand All @@ -75,4 +77,4 @@ Get(pk = 3): [3, 3333, 3333]
#### Scan
Scan on relational table is implemented by `StateTableIter`, which is a merge iterator of `MemTableIter` and `StorageTableIter`. If a pk exist in both KV state store (StorageTable) and memory (MemTable), result of `MemTableIter` is returned. For example, in the following figure, `StateTableIter` will generate `1->4->5->6` in order.

![Scan example](images/relational-table-layer/relational-table-02.svg)
![Scan example](../images/relational-table-layer/relational-table-02.svg)

0 comments on commit f9362b1

Please sign in to comment.