doc: add relational table schema doc + move files (risingwavelabs#3712)

* doc: add relational table schema doc + move files * fix comment * Update docs/relational_table/relational-table-schema.md Co-authored-by: congyi <58715567+wcy-fdu@users.noreply.github.com> Co-authored-by: congyi <58715567+wcy-fdu@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
nasnoisaac · Jul 7, 2022 · f9362b1 · f9362b1
1 parent 4a92eb6
commit f9362b1
Show file tree

Hide file tree

Showing 3 changed files with 41 additions and 4 deletions.
diff --git a/docs/consistent-hash.md b/docs/consistent-hash.md
@@ -78,7 +78,7 @@ We know that [Hummock](./state-store-overview.md#overview), our LSM-Tree-based s
 ```
 table_id | vnode | ...
 ```
-where `table_id` denotes the [state table](./storing-state-using-relational-table.md#relational-table-layer), and `vnode` is computed via $H$ on key of the data. 
+where `table_id` denotes the [state table](relational_table/storing-state-using-relational-table.md#relational-table-layer), and `vnode` is computed via $H$ on key of the data. 
 
 To illustrate this, let's revisit the [previous example](#streaming). Executors of an operator will share the same logical state table, just as is shown in the figure below:
 

diff --git a/docs/relational_table/relational-table-schema.md b/docs/relational_table/relational-table-schema.md
@@ -0,0 +1,35 @@
+# Relational Table Schema
+
+We introduce the rough cell-based encoding format in [relational states](storing-state-using-relational-table.md#cell-based-encoding)
+
+In this doc, we will take HashAgg with extreme state (`max`, `min`) or value state (`sum`, `count`) for example, and introduce a more detailed design for the internal table schema.
+
+[Code](https://github.com/singularity-data/risingwave/blob/7f9ad2240712aa0cfe3edffb4535d43b42f32cc5/src/frontend/src/optimizer/plan_node/logical_agg.rs#L144)
+
+## Table id
+For all relational table states, the keyspace must start with `table_id`. This is a globally unique id allocated in meta. Meta is responsible for traversing the Plan Tree and calculating the total number of Relational Tables needed. For example, the Hash Join Operator needs 2, one for the left table and one for the right table. The number of tables needed for Agg depends on the number of agg calls.
+
+## Value State (Sum, Count)
+Query example:
+```sql
+select sum(v2), count(v3) from t group by v1 
+```
+
+This query will need to initiate 2 Relational Tables. The schema is `table_id/group_key/column_id`.
+
+## Extreme State (Max, Min)
+Query example:
+```sql
+select max(v2), min(v3) from t group by v1 
+```
+
+This query will need to initiate 2 Relational Tables. If the upstream is not append-only, the schema becomes `table_id/group_key/sort_key/upstrea_pk/column_id`. 
+
+The order of `sort_key` depends on the agg call kind. For example, if it's `max()`, `sort_key` will order with `Ascending`. if it's `min()`, `sort_key` will order with `Descending`. 
+The `upstream_pk` is also appended to ensure the uniqueness of the key.
+This design allows the streaming executor not to read all the data from the storage when the cache fails, but only a part of it. The streaming executor will try to write all streaming data to storage, because there may be `update` or `delete` operations in the stream, it's impossible to always guarantee correct results without storing all data.
+
+If `t` is created with append-only flag, the schema becomes `table_id/group_key/column_id`, which is the same for Value State. This is because in the append-only mode, there is no `update` or `delete` operation, so the cache will never miss. Therefore, we only need to write one value to the storage.
+
+
+
diff --git a/docs/storing-state-using-relational-table.md → ...e/storing-state-using-relational-table.md b/docs/storing-state-using-relational-table.md → ...e/storing-state-using-relational-table.md
@@ -34,6 +34,8 @@ We implement a relational table layer as the bridge between stateful executors a
 | join     | table_id \| join_key \| pk \| col_id| materialized value |
 | agg | table_id \| group_key \| col_id| agg_value |
 
+For the detailed schema, please check [doc](relational-table-schema.md)
+
 <!-- Todo: link cconsistence hash doc and state table agg doc -->
 ## Relational Table Layer
 [source code](https://github.com/singularity-data/risingwave/blob/main/src/storage/src/table/state_table.rs)
@@ -43,13 +45,13 @@ In this part, we will introduce how stateful executors interact with KV state st
 Relational table layer consists of State Table, Mem Table and Storage Table. The State Table provides the table operations by these APIs: `get_row`, `scan`, `insert_row`, `delete_row` and `update_row`, which are the read and write interfaces for executors. The Mem Table is an in-memory buffer for caching table operations during one epoch, and the Storage Table is responsible for writing row operations into kv pairs, which performs serialization and deserialization between cell-based encoding and KV encoding.
 
 
-![Overview of Relational Table](images/relational-table-layer/relational-table-01.svg)
+![Overview of Relational Table](../images/relational-table-layer/relational-table-01.svg)
 ### Write Path
 To write into KV state store, executors first perform operations on State Table, and these operations will be cached in Mem Table. Once a barrier flows through one executor, executor will flush the cached operations into state store. At this moment, Storage Table will covert these operations into kv pairs and write to state store with specific epoch. 
 
 For example, an executor performs `insert(a, b, c)` and `delete(d, e, f)` through the State Table APIs, Mem Table first caches these two operations in memory. After receiving new barrier, Cell Based Table converts these two operations into KV operations by cell-based format, and write these KV operations into state store (Hummock).
 
-![write example](images/relational-table-layer/relational-table-03.svg)
+![write example](../images/relational-table-layer/relational-table-03.svg)
 ### Read Path
 Executors should be able to read the just written data, which means uncommited data is visible. The data in Mem Table (memory) is fresher than that in shared storage (state store). State Table provides both point-get and scan to read from state store by merging data from Mem Table and Storage Table. 
 #### Get
@@ -75,4 +77,4 @@ Get(pk = 3): [3, 3333, 3333]
 #### Scan
 Scan on relational table is implemented by `StateTableIter`, which is a merge iterator of `MemTableIter` and `StorageTableIter`. If a pk exist in both KV state store (StorageTable) and memory (MemTable), result of `MemTableIter` is returned. For example, in the  following figure, `StateTableIter` will generate `1->4->5->6` in order.
 
-![Scan example](images/relational-table-layer/relational-table-02.svg)
+![Scan example](../images/relational-table-layer/relational-table-02.svg)