
feature: make aggregation parallel process to reduce execution time #422

Open
adofsauron opened this issue Aug 19, 2022 · 141 comments · Fixed by #472 or #1279

@adofsauron
Collaborator

adofsauron commented Aug 19, 2022

Documentation for MySQL aggregation:

https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions-and-modifiers.html

Functional Requirements:

  1. Stay consistent with the aggregation conditions handled by the existing logic
  2. Use the Aggregate function as the entry point and ResultSender as the result output

Performance Requirements:

Table data limit:

  1. Up to the capacity of a physical drive
  2. The data may exceed the available memory

Memory usage limit:

  1. Add a parameter to control the maximum amount of memory that aggregation may use
  2. Spill to temporary disk files if the limit is exceeded (see the sketch below)
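
A minimal sketch of the intended control flow, assuming a hypothetical memory-limit parameter; the names here are illustrative, not actual StoneDB variables:

// Sketch only: memory-budget accounting for aggregation with spill-to-disk.
#include <cstddef>

class AggMemoryBudget {
 public:
  explicit AggMemoryBudget(size_t limit_bytes) : limit_(limit_bytes) {}

  // Returns true if the allocation fits the in-memory budget; false means
  // the caller should spill the current hash partition to a temp file.
  bool Charge(size_t bytes) {
    if (used_ + bytes > limit_) return false;
    used_ += bytes;
    return true;
  }
  void Release(size_t bytes) { used_ -= bytes; }

 private:
  size_t limit_;      // e.g. from a hypothetical tianmu_agg_mem_limit setting
  size_t used_ = 0;
};

// Usage sketch:
//   if (!budget.Charge(row_bytes)) SpillPartitionToTempFile();  // hypothetical helper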

Execution time limit:

  1. Under the same conditions, at worst within the same order of magnitude as an InnoDB aggregate query
  2. Aim to take less time than InnoDB for aggregate queries

Aggregate query SQL

	select
		c_name,
		c_custkey,
		o_orderkey,
		o_orderdate,
		o_totalprice,
		sum(l_quantity)
	from
		customer,
		orders,
		lineitem
	where
		c_custkey = o_custkey
		and o_orderkey = l_orderkey
	group by
		c_name,
		c_custkey,
		o_orderkey,
		o_orderdate,
		o_totalprice
	order by
		o_totalprice desc,
		o_orderdate
	limit 10;
mysql> desc part;
+---------------+---------------+------+-----+---------+-------+
| Field         | Type          | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| p_partkey     | int(11)       | NO   | PRI | NULL    |       |
| p_name        | varchar(55)   | NO   |     | NULL    |       |
| p_mfgr        | char(25)      | NO   |     | NULL    |       |
| p_brand       | char(10)      | NO   |     | NULL    |       |
| p_type        | varchar(25)   | NO   |     | NULL    |       |
| p_size        | int(11)       | NO   |     | NULL    |       |
| p_container   | char(10)      | NO   |     | NULL    |       |
| p_retailprice | decimal(15,2) | NO   |     | NULL    |       |
| p_comment     | varchar(23)   | NO   |     | NULL    |       |
+---------------+---------------+------+-----+---------+-------+
9 rows in set (0.00 sec)

mysql> desc partsupp;
+---------------+---------------+------+-----+---------+-------+
| Field         | Type          | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| ps_partkey    | int(11)       | NO   | PRI | NULL    |       |
| ps_suppkey    | int(11)       | NO   | PRI | NULL    |       |
| ps_availqty   | int(11)       | NO   |     | NULL    |       |
| ps_supplycost | decimal(15,2) | NO   |     | NULL    |       |
| ps_comment    | varchar(199)  | NO   |     | NULL    |       |
+---------------+---------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

    select
        p_brand,
        p_type,
        p_size,
        count(distinct ps_suppkey) as supplier_cnt
    from
        partsupp,
        part
    where
        p_partkey = ps_partkey
        and p_brand <> 'Brand#45'
        and p_type not like 'MEDIUM POLISHED%'
        and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
        and ps_suppkey not in (
            select
                s_suppkey
            from
                supplier
            where
                s_comment like '%Customer%Complaints%'
        )
    group by
        p_brand,
        p_type,
        p_size
    order by
        supplier_cnt desc,
        p_brand,
        p_type,
        p_size
    limit 10;
adofsauron added the A-feature (feature with good idea) label Aug 19, 2022
@adofsauron
Collaborator Author

The original overall logic

[image]

@adofsauron
Collaborator Author

The original aggregation initialization logic

[image]

@adofsauron
Collaborator Author

Outline of the parallel aggregation processing logic

[image]

@hustjieke
Collaborator

ACK!
Nice job!

@adofsauron
Collaborator Author

A parallel hash structure we can reference:

https://greg7mdp.github.io/parallel-hashmap/

@adofsauron
Collaborator Author

The current flame graph captured during the aggregate query

[image: flame graph]

@adofsauron
Collaborator Author

For comparison, the EXPLAIN plan for the aggregate query in MySQL 8:

mysql>     explain  select
    ->         p_brand,
    ->         p_type,
    ->         p_size,
    ->         count(distinct ps_suppkey) as supplier_cnt
    ->     from
    ->         partsupp,
    ->         part
    ->     where
    ->         p_partkey = ps_partkey
    ->         and p_brand <> 'Brand#45'
    ->         and p_type not like 'MEDIUM POLISHED%'
    ->         and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    ->         and ps_suppkey not in (
    ->             select
    ->                 s_suppkey
    ->             from
    ->                 supplier
    ->             where
    ->                 s_comment like '%Customer%Complaints%'
    ->         )
    ->     group by
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     order by
    ->         supplier_cnt desc,
    ->         p_brand,
    ->         p_type,
    ->         p_size\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: partsupp
   partitions: NULL
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 7735092
     filtered: 100.00
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: <subquery2>
   partitions: NULL
         type: eq_ref
possible_keys: <auto_distinct_key>
          key: <auto_distinct_key>
      key_len: 5
          ref: tpch.partsupp.ps_suppkey
         rows: 1
     filtered: 100.00
        Extra: Using where; Not exists
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: part
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: tpch.partsupp.ps_partkey
         rows: 1
     filtered: 40.00
        Extra: Using where
*************************** 4. row ***************************
           id: 2
  select_type: MATERIALIZED
        table: supplier
   partitions: NULL
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 98754
     filtered: 100.00
        Extra: Using where
4 rows in set, 1 warning (0.00 sec)

@adofsauron
Collaborator Author

adofsauron commented Aug 19, 2022

For comparison, the execution time of the aggregate query in MySQL 8:

mysql>     select
    ->         p_brand,
    ->         p_type,
    ->         p_size,
    ->         count(distinct ps_suppkey) as supplier_cnt
    ->     from
    ->         partsupp,
    ->         part
    ->     where
    ->         p_partkey = ps_partkey
    ->         and p_brand <> 'Brand#45'
    ->         and p_type not like 'MEDIUM POLISHED%'
    ->         and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    ->         and ps_suppkey not in (
    ->             select
    ->                 s_suppkey
    ->             from
    ->                 supplier
    ->             where
    ->                 s_comment like '%Customer%Complaints%'
    ->         )
    ->     group by
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     order by
    ->         supplier_cnt desc,
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     limit 10;
+----------+--------------------------+--------+--------------+
| p_brand  | p_type                   | p_size | supplier_cnt |
+----------+--------------------------+--------+--------------+
| Brand#44 | STANDARD PLATED TIN      |      9 |          120 |
| Brand#12 | STANDARD POLISHED COPPER |     14 |          100 |
| Brand#11 | LARGE BRUSHED STEEL      |     36 |           96 |
| Brand#23 | PROMO BURNISHED STEEL    |     14 |           96 |
| Brand#34 | MEDIUM BRUSHED STEEL     |     23 |           96 |
| Brand#53 | PROMO BURNISHED BRASS    |     36 |           96 |
| Brand#54 | STANDARD BRUSHED COPPER  |     19 |           96 |
| Brand#32 | LARGE POLISHED COPPER    |     14 |           95 |
| Brand#43 | LARGE PLATED COPPER      |     19 |           95 |
| Brand#11 | SMALL BRUSHED STEEL      |      9 |           92 |
+----------+--------------------------+--------+--------------+
10 rows in set (11.44 sec)

@adofsauron
Collaborator Author

For comparison, the execution time of the same aggregate query on the Tianmu engine in StoneDB:

mysql>     select
    ->         p_brand,
    ->         p_type,
    ->         p_size,
    ->         count(distinct ps_suppkey) as supplier_cnt
    ->     from
    ->         partsupp,
    ->         part
    ->     where
    ->         p_partkey = ps_partkey
    ->         and p_brand <> 'Brand#45'
    ->         and p_type not like 'MEDIUM POLISHED%'
    ->         and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    ->         and ps_suppkey not in (
    ->             select
    ->                 s_suppkey
    ->             from
    ->                 supplier
    ->             where
    ->                 s_comment like '%Customer%Complaints%'
    ->         )
    ->     group by
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     order by
    ->         supplier_cnt desc,
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     limit 10;
+----------+---------------------------+--------+--------------+
| p_brand  | p_type                    | p_size | supplier_cnt |
+----------+---------------------------+--------+--------------+
| Brand#44 | STANDARD PLATED TIN       |      9 |          120 |
| Brand#33 | STANDARD BURNISHED COPPER |     42 |          116 |
| Brand#24 | SMALL BURNISHED NICKEL    |     11 |          104 |
| Brand#33 | SMALL BURNISHED NICKEL    |     12 |          104 |
| Brand#41 | ECONOMY POLISHED BRASS    |     16 |          104 |
| Brand#51 | PROMO PLATED STEEL        |     28 |          104 |
| Brand#55 | ECONOMY BURNISHED NICKEL  |     21 |          104 |
| Brand#11 | ECONOMY ANODIZED COPPER   |     28 |          100 |
| Brand#11 | ECONOMY POLISHED COPPER   |     41 |          100 |
| Brand#12 | ECONOMY ANODIZED STEEL    |     31 |          100 |
+----------+---------------------------+--------+--------------+
10 rows in set (1 min 20.25 sec)

@adofsauron
Collaborator Author

adofsauron commented Aug 19, 2022

Splitting the table tuples

Purpose:

  1. Evenly divide the table data that a single thread would traverse into intervals, so that different worker threads can each process one interval independently (see the sketch after this list)
  2. Correctly handle memory visibility between worker threads, so each worker's own work stays complete and correct and overlapping critical sections cannot make threads interfere with one another
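
A minimal sketch of the interval split, assuming the pack is the unit of division; the function name is illustrative:

// Sketch only: divide [0, pack_count) into near-equal, non-overlapping
// intervals, one per worker thread. The first `rem` workers take one extra
// pack, so every pack is assigned exactly once and no intervals overlap.
#include <cstdint>
#include <vector>

struct PackRange { int64_t begin; int64_t end; };  // [begin, end)

std::vector<PackRange> SplitPacks(int64_t pack_count, int threads) {
  std::vector<PackRange> ranges;
  int64_t base = pack_count / threads;  // packs per worker
  int64_t rem  = pack_count % threads;  // leftover packs
  int64_t pos  = 0;
  for (int t = 0; t < threads; ++t) {
    int64_t len = base + (t < rem ? 1 : 0);
    ranges.push_back({pos, pos + len});
    pos += len;
  }
  return ranges;  // e.g. 154 packs over 4 threads -> 39, 39, 38, 38
}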

@adofsauron
Collaborator Author

adofsauron commented Aug 19, 2022

Problems faced by table tuple splitting

  • How do we obtain the table's tuple description: how many rows, how many columns, how many packs?

  • What rules divide the table space into intervals, how many worker threads are started, and how many intervals does each worker thread handle?

  • GroupByWrapper currently stores the complete information of the table, including per-row flags for whether a column's data block has been processed, which makes it impossible for multiple threads to work on the class in parallel. Should each thread maintain its own GroupByWrapper that stores only the information for its current tuple range, so that each worker thread can operate independently?

  • GroupByWrapper also stores the description of the matched aggregation columns, which is identical in every worker thread. Should that description be separated into its own data structure that each worker thread only reads?

  • Should we keep the current encapsulation layers of GroupByWrapper and GroupTable, GroupDistinctTable and GroupDistinctCache? Do they really partition the functionality correctly? What is the maintenance cost? Is this already the optimal class organization?

  • The ValueMatchingTable class, which previously stored the aggregated data, could be replaced with a parallel hash map; otherwise each worker thread has to store its own data, and the per-thread results must finally be copied together for a global aggregation pass, whose memory-copy overhead is huge and unacceptable. How does the interface of a parallel hash map combine with the upper-level traversal and aggregation logic when it replaces the local hash? (See the sketch after this list.)

  • When writing aggregate data, how does a parallel hash map merge with existing entries to form a unified result, and how do we guarantee the correctness of the aggregate totals?

  • What are the locking rules when multiple threads race to write the value for the same grouping key in a parallel hash map? How do we avoid one worker thread overwriting another's results?
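
A minimal sketch of a concurrent upsert with the parallel-hashmap library referenced above, for an integer grouping key and a SUM aggregate. Per the library's README, try_emplace_l either updates an existing entry under the owning submap's internal lock or inserts a new one; treat the exact signature as an assumption to verify, and all other names as illustrative:

// Sketch only: concurrent SUM aggregation into one shared parallel hash map.
// Requires https://github.com/greg7mdp/parallel-hashmap (header-only).
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>
#include <parallel_hashmap/phmap.h>

// 2^4 = 16 internal submaps, each guarded by its own std::mutex, so writers
// touching different grouping keys rarely contend on the same lock.
using AggMap = phmap::parallel_flat_hash_map<
    int64_t, int64_t,
    phmap::priv::hash_default_hash<int64_t>,
    phmap::priv::hash_default_eq<int64_t>,
    std::allocator<std::pair<const int64_t, int64_t>>,
    4, std::mutex>;

void AggregateRange(AggMap& map,
                    const std::vector<std::pair<int64_t, int64_t>>& rows,
                    size_t begin, size_t end) {
  for (size_t i = begin; i < end; ++i) {
    int64_t key = rows[i].first;
    int64_t qty = rows[i].second;
    // Upsert under the owning submap's lock: add to an existing sum,
    // or insert `qty` as the initial value for a new grouping key.
    map.try_emplace_l(key,
                      [qty](AggMap::value_type& kv) { kv.second += qty; },
                      qty);
  }
}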

adofsauron changed the title from "feature: aggregation parallel for reduce execution time" to "feature: make aggregation parallel process to reduce execution time" Aug 19, 2022
@adofsauron
Collaborator Author

Extracting aggregate attributes

[image]

@adofsauron
Collaborator Author

Strip the column attributes out of aggregate processing:

  1. The meta-attributes of an aggregated column are independent of any particular tuple and invariant, so they can be stripped out (see the sketch below)
  2. The GroupByWrapper class retains only the data tied to tuples during traversal, such as which rows have been processed and which data has been aggregated
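
A minimal sketch of that separation; the struct names are illustrative, not Tianmu classes:

// Sketch only: invariant column meta-attributes shared read-only across
// workers, with the mutable traversal state owned per worker thread.
#include <cstdint>
#include <memory>
#include <vector>

struct AggColumnDesc {      // built once per query, then never mutated
  int col;                  // source column index
  bool is_grouping;         // grouping key vs. aggregated value
  int64_t min_v, max_v;     // value range, as logged by AddAggregatedColumn
};

struct SharedAggMeta {
  std::vector<AggColumnDesc> columns;  // read-only after setup
};

struct WorkerAggState {                         // one instance per worker
  std::shared_ptr<const SharedAggMeta> meta;    // shared, read-only view
  int64_t tuples_left = 0;                      // progress in this range
  // ... plus the per-range hash table, processed-row flags, etc.
};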

@adofsauron
Collaborator Author

Aggregate attribute static class diagram

[image]

@adofsauron
Collaborator Author

We first need to establish a mathematical model of aggregate processing in relational algebra, and then implement that model under the premise of its mathematical rules

@adofsauron
Collaborator Author

adofsauron commented Aug 20, 2022

The projection operation

[image]

Projection is relational algebra's counterpart of existential quantification in predicate logic. The attributes not included correspond to existentially quantified variables in the predicate whose extension the operand relation represents. The example below illustrates this point.

Because of the correspondence with existential quantification, some authorities prefer to define projection in terms of the excluded attributes. In a computer language it is of course possible to provide notations for both, and that was done in ISBL and several languages that have taken their cue from ISBL.

A nearly identical concept occurs in the category of monoids, called a string projection, which consists of removing all of the letters in the string that do not belong to a given alphabet.

When implemented in the SQL standard, the "default projection" returns a multiset instead of a set, and the $\pi$ projection is obtained by adding the DISTINCT keyword to eliminate duplicate data.

$\Pi_{a_1, a_2, a_3, \ldots, a_k}(r)$

[image]
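
A small worked example in the same array notation as the grouping example further down; the relation is illustrative. SQL's default projection (no DISTINCT) would keep both rows with $A = 1$; the set projection $\pi$ collapses them:

$\pi_{A} \left( \begin{array}{ c | c } A & B \\ \hline \hline 1 & a \\ 1 & b \\ 2 & c \end{array} \right) = \begin{array}{ c } A \\ \hline \hline 1 \\ 2 \end{array}$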

@adofsauron
Collaborator Author

adofsauron commented Aug 20, 2022

Grouping

$\pi_{name, \ \text{SUM}(length)} \ \sigma_{\text{MIN}(year) < 1930} \ \gamma_{name, \ \text{MIN}(year), \ \text{SUM}(length)} \ \sigma_{cert = producer}(MovieExec \times Movie)$

$\gamma_{grouping\_attribute, \ func(A) \ \to \ name}(R)$

  • a unary operator that takes the relation R as input
  • the first parameter (grouping_attribute) is the attribute on which R will be grouped
  • the function func is applied to each group, and its result is written to attribute name
  • NB: all other (non-mentioned) attributes are not output to the result!

Example

$\gamma_{A, \ \text{min}(B) \ \to \ D} \left( \begin{array}{ c | c | c } A & B & C \\ \hline \hline 1 & 2 & a \\ 1 & 3 & b \\ 2 & 3 & c \\ 2 & 4 & a \\ 2 & 5 & d \end{array} \right) = \begin{array}{ c | c } A & D \\ \hline \hline 1 & 2 \\ 2 & 3 \end{array}$

@adofsauron
Collaborator Author

adofsauron commented Aug 20, 2022

Domain-driven design (DDD) with mathematical rules as the domain model is how multi-threaded parallel aggregation has to be approached:

  • Only mathematical rules can describe the existing code logic, so we use mathematical modeling

  • The engine code can only achieve domain-driven design by being modeled mathematically, and can only be understood mathematically. This is the way

  • So the code cannot be understood and optimized along module boundaries alone. We need to understand the relational-algebra rules behind it, optimize at the level of the mathematical model, then apply formal verification, and only as the last step write the model back into the code from beginning to end. Changing just one piece in isolation violates the domain model

VC=Column
Pack=<VC>
R=<Pack>
Row=Column(line)
Iterator=<Row>
Vector=<Iter>
EachWorker={Vec1,Vec2...}
Aggrega={EachWorker}
    =P1{Vec1,Vec2...}+P2+...
    =P1{PackRow(Vec1) => Hash[row]->line}

@adofsauron
Collaborator Author

adofsauron commented Aug 21, 2022

Physical operations for selection

[image]

@adofsauron
Collaborator Author

Eliminating duplicates

[image]

[image]

@adofsauron
Collaborator Author

Static class diagram of the existing aggregation process

[image]

@adofsauron
Collaborator Author

Static class diagram for iterators

[image]

@adofsauron
Collaborator Author

adofsauron commented Aug 21, 2022

Table properties

[2022-08-21 10:57:51.504922] [45954] [INFO] [aggregation_algorithm.cpp:49] MSG: Aggregate numOfAttrs: 4 packpower: 16 NumOfObj: -1 NumOfTables: 2 NumOfTuples: 7422784 distinct: false
[2022-08-21 10:57:51.504980] [45954] [INFO] [aggregation_algorithm.cpp:69] MSG: Aggregate AddGroupingColumn attr_num: 0 col: 0
[2022-08-21 10:57:51.505004] [45954] [INFO] [aggregation_algorithm.cpp:69] MSG: Aggregate AddGroupingColumn attr_num: 1 col: 1
[2022-08-21 10:57:51.505011] [45954] [INFO] [aggregation_algorithm.cpp:69] MSG: Aggregate AddGroupingColumn attr_num: 2 col: 2
[2022-08-21 10:57:51.505024] [45954] [INFO] [aggregation_algorithm.cpp:127] MSG: Aggregate AddAggregatedColumn col: 3 max_no_of_distinct: 100000 min_v: 1 max_v: 100000 max_size: 11
[2022-08-21 10:57:51.540405] [45954] [INFO] [aggregation_algorithm.cpp:238] MSG: NumOfDimensions: 2 NumOfTuples: 7422784

Set the number of threads to 4

[2022-08-21 21:41:28.927878] [67237] [INFO] [aggregation_algorithm.cpp:896] MSG: ReadyDist threads: 4 packnum: 154 loopcnt: 4 num: 38 mod: 2
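
Reading the log: with packnum = 154 and 4 threads, each thread gets num = 154 / 4 = 38 packs, and the mod = 154 % 4 = 2 leftover packs are distributed so that every pack is assigned exactly once.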

@adofsauron
Collaborator Author

Sharding needs to be refactored

The reachability of iterators and the identity bookkeeping after the split were not properly considered

void AggregationWorkerEnt::PrepShardingCopy(MIIterator *mit, GroupByWrapper *gb_sharding,
                                            std::vector<std::unique_ptr<GroupByWrapper>> *vGBW) {
  DimensionVector dims(mind->NumOfDimensions());
  // Each shard works on its own copy of the wrapper, so workers never share
  // mutable grouping state.
  std::unique_ptr<GroupByWrapper> gbw_ptr(new GroupByWrapper(*gb_sharding));
  gbw_ptr->FillDimsUsed(dims);
  gbw_ptr->SetDistinctTuples(mit->NumOfTuples());
  // A multi-pass wrapper must know how many tuples remain in this shard.
  if (!gbw_ptr->IsOnePass()) gbw_ptr->InitTupleLeft(mit->NumOfTuples());
  gbw_ptr->AddAllGroupingConstants(*mit);
  // Publish the copy; the lock only guards concurrent access to vGBW.
  std::scoped_lock guard(mtx);
  vGBW->emplace_back(std::move(gbw_ptr));
}

@adofsauron
Collaborator Author

To show why aggregation performance is currently low, run the simplest aggregate query and look at the CPU usage. You can see the serialization and deserialization inside the aggregation hash.

[image]

@adofsauron
Collaborator Author

The Tianmu::core::ColumnBinEncoder::Encode function is very interesting: while generating the hash key for aggregation it both randomly fetches data from RAM, invalidating the CPU cache, and transcodes the fetched data, serializing it into a new region of RAM, so a single key spans several memory-access round trips of CPU time (see the sketch below)
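
A minimal sketch of the access pattern being described, not the actual Encode implementation: each grouping column is fetched from its own region of memory (a cache-unfriendly gather) and then memcpy'd into a contiguous key buffer.

// Sketch only: the gather-then-serialize pattern that makes key encoding
// expensive. Illustrative fixed-width layout, not Tianmu's ColumnBinEncoder.
#include <cstddef>
#include <cstring>
#include <vector>

struct ColumnSlice { const char* base; size_t width; };

// Builds the hash key for `row` by copying each grouping column's bytes into
// `key`. Every iteration touches a different column's memory (random RAM
// access), then memcpy serializes the bytes into yet another buffer.
size_t EncodeKey(const std::vector<ColumnSlice>& cols, size_t row, char* key) {
  size_t off = 0;
  for (const auto& c : cols) {
    std::memcpy(key + off, c.base + row * c.width, c.width);
    off += c.width;
  }
  return off;  // total key length
}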

@adofsauron
Collaborator Author

Besides the obvious data fetching and serialization, another typical cost is the tuple-at-a-time iterator overhead of the classic Volcano model, which leaves the CPU waiting on each next-tuple call instead of doing useful work (a sketch follows the image)

[image]
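
A minimal sketch of that overhead, contrasting the tuple-at-a-time interface with what a block interface would amortize; types are illustrative:

// Sketch only: tuple-at-a-time iteration pays one virtual call per tuple,
// so the call cost can never be amortized the way a vectorized loop
// (one call per block of values) amortizes it.
#include <cstdint>
#include <optional>

struct Tuple { int64_t key; int64_t val; };

struct Iterator {                              // classic Volcano interface
  virtual std::optional<Tuple> Next() = 0;     // one call per tuple
  virtual ~Iterator() = default;
};

int64_t SumAll(Iterator& it) {
  int64_t sum = 0;
  // Each iteration pays a virtual call plus branches; over millions of
  // tuples that dwarfs the single add we actually want to perform.
  while (auto t = it.Next()) sum += t->val;
  return sum;
}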

@adofsauron
Collaborator Author

mysqld 174443 518728.020321:   11642419 cycles:
                 2d4d70b Tianmu::core::MIIterator::operator+++0x11 (/data/stonedb57/install/bin/mysqld)
                 30681e2 Tianmu::core::MIIterator::Increment+0x18 (/data/stonedb57/install/bin/mysqld)
                 2feba05 Tianmu::core::AggregationAlgorithm::AggregatePackrow+0x64d (/data/stonedb57/install/bin/mysqld)
                 2fea0cb Tianmu::core::AggregationAlgorithm::MultiDimensionalGroupByScan+0x789 (/data/stonedb57/install/bin/mysqld)
                 2fe9855 Tianmu::core::AggregationAlgorithm::Aggregate+0xe0f (/data/stonedb57/install/bin/mysqld)
                 2d49f86 Tianmu::core::TempTable::Materialize+0x892 (/data/stonedb57/install/bin/mysqld)
                 2ce4488 Tianmu::core::Engine::Execute+0xa66 (/data/stonedb57/install/bin/mysqld)
                 2ce31ac Tianmu::core::Engine::HandleSelect+0x8ce (/data/stonedb57/install/bin/mysqld)
                 2df2371 Tianmu::DBHandler::ha_my_tianmu_query+0x5c (/data/stonedb57/install/bin/mysqld)
                 24205c6 execute_sqlcom_select+0x254 (/data/stonedb57/install/bin/mysqld)
                 241993c mysql_execute_command+0xe5c (/data/stonedb57/install/bin/mysqld)
                 242162b mysql_parse+0x6b0 (/data/stonedb57/install/bin/mysqld)
                 2416622 dispatch_command+0xcc7 (/data/stonedb57/install/bin/mysqld)
                 2415463 do_command+0x4ba (/data/stonedb57/install/bin/mysqld)
                 25479f9 handle_connection+0x1ee (/data/stonedb57/install/bin/mysqld)
                 2c17238 pfs_spawn_thread+0x173 (/data/stonedb57/install/bin/mysqld)
            7faad6ef81ca start_thread+0xea (/usr/lib64/libpthread-2.28.so)

@adofsauron
Collaborator Author

Memory copies have the worst effect on CPU cycles

mysqld 174443 518766.939690:   25789094 cycles:
                 1d15640 memcpy@plt+0x0 (/data/stonedb57/install/bin/mysqld)
                 3003c55 Tianmu::core::ColumnBinEncoder::Encode+0x95 (/data/stonedb57/install/bin/mysqld)
                 2fee7d1 Tianmu::core::GroupTable::PutGroupingValue+0x5f (/data/stonedb57/install/bin/mysqld)
                 2feeb43 Tianmu::core::GroupByWrapper::PutGroupingValue+0x2f (/data/stonedb57/install/bin/mysqld)
                 2feb807 Tianmu::core::AggregationAlgorithm::AggregatePackrow+0x44f (/data/stonedb57/install/bin/mysqld)
                 2fea0cb Tianmu::core::AggregationAlgorithm::MultiDimensionalGroupByScan+0x789 (/data/stonedb57/install/bin/mysqld)
                 2fe9855 Tianmu::core::AggregationAlgorithm::Aggregate+0xe0f (/data/stonedb57/install/bin/mysqld)
                 2d49f86 Tianmu::core::TempTable::Materialize+0x892 (/data/stonedb57/install/bin/mysqld)
                 2ce4488 Tianmu::core::Engine::Execute+0xa66 (/data/stonedb57/install/bin/mysqld)
                 2ce31ac Tianmu::core::Engine::HandleSelect+0x8ce (/data/stonedb57/install/bin/mysqld)
                 2df2371 Tianmu::DBHandler::ha_my_tianmu_query+0x5c (/data/stonedb57/install/bin/mysqld)
                 24205c6 execute_sqlcom_select+0x254 (/data/stonedb57/install/bin/mysqld)
                 241993c mysql_execute_command+0xe5c (/data/stonedb57/install/bin/mysqld)
                 242162b mysql_parse+0x6b0 (/data/stonedb57/install/bin/mysqld)
                 2416622 dispatch_command+0xcc7 (/data/stonedb57/install/bin/mysqld)
                 2415463 do_command+0x4ba (/data/stonedb57/install/bin/mysqld)
                 25479f9 handle_connection+0x1ee (/data/stonedb57/install/bin/mysqld)
                 2c17238 pfs_spawn_thread+0x173 (/data/stonedb57/install/bin/mysqld)
            7faad6ef81ca start_thread+0xea (/usr/lib64/libpthread-2.28.so)

@adofsauron
Collaborator Author

the cost of the routine call (in the ballpark of 20 cycles) must be amortized over only one operation, which effectively doubles the operation cost.
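
At one value per call, the roughly 20-cycle call overhead is as large as the operation itself; amortized over a vector of, say, 1000 values it drops to about 20 / 1000 = 0.02 cycles per value, which is why the vector-at-a-time execution quoted below pays it only once per block.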

@adofsauron
Collaborator Author

[image]

@adofsauron
Collaborator Author

The data structure here is ready for redesign

[image]

@adofsauron
Collaborator Author

adofsauron commented Apr 17, 2023

Each row's data is extracted from the column store and then copied in memory to generate the hash key, even though the column-store data contains duplicates

[image]

@adofsauron
Collaborator Author

These joins are not required in a Volcano-like pipelined execution model. It can do the selection, computations and aggregation all in a single pass, not materializing any data.

@adofsauron
Collaborator Author

The same vertically partitioned and even compressed disk data layout is used in RAM to save space and bandwidth.

@adofsauron
Copy link
Collaborator Author

Execution proceeds using Volcano-like pipelining, on the granularity of a vector (e.g. 1000 values). The Scan operator retrieves data vector-at-a-time from Monet BATs. Note that only attributes relevant to the query are actually scanned.

@adofsauron
Collaborator Author

Some peculiarities of this algebra are that Project is just used for expression calculation; it does not eliminate duplicates.

@adofsauron
Collaborator Author

Modern CPUs can typically only perform 1 or 2 load/store operations per cycle. In compound primitives, the results from one calculation are passed via a CPU register to the next calculation, with load/stores only occurring at the edges of the expression graph.
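
A minimal sketch of a compound vectorized primitive in the spirit of the quote; the expression (a + b) * c is illustrative:

// Sketch only: the intermediate of (a + b) * c stays in a CPU register;
// loads and stores happen only at the edges of the expression, once per
// element, instead of materializing a temporary vector between two passes.
#include <cstddef>
#include <cstdint>

void map_add_mul(const int64_t* a, const int64_t* b, const int64_t* c,
                 int64_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    int64_t t = a[i] + b[i];  // register-resident intermediate
    out[i] = t * c[i];        // single store per element
  }
}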

@adofsauron
Collaborator Author

adofsauron commented Apr 17, 2023

Column data files store values either deduplicated or sorted. Consider generating a new auxiliary column:

  1. Deduplicate the data in the new auxiliary column

  2. Decide whether the data in the new auxiliary column should also be sorted

  3. During access, locate values through their offsets in the auxiliary column, avoiding the memory copy (see the sketch after the image)

[image]
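
A minimal sketch of such an auxiliary column: a deduplicated dictionary addressed by offset, so aggregation can hash a fixed-width code instead of memcpy'ing the variable-length value. The layout is illustrative, not the Tianmu file format:

// Sketch only: dictionary-style auxiliary column with offset-based access.
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

struct DictColumn {
  std::vector<char>     bytes;    // all distinct values, concatenated
  std::vector<uint32_t> offsets;  // value i lives at [offsets[i], offsets[i+1])
  std::vector<uint32_t> codes;    // per-row index into the dictionary

  // Zero-copy view of the value for `row`; no serialization, no memcpy.
  std::string_view Value(size_t row) const {
    uint32_t id = codes[row];
    return {bytes.data() + offsets[id], offsets[id + 1] - offsets[id]};
  }
  // For grouping, the fixed-width code itself can serve as the hash key.
  uint32_t Code(size_t row) const { return codes[row]; }
};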

@adofsauron
Collaborator Author

  struct {
    std::vector<buf> v;   // backing buffers holding the raw column bytes
    size_t sum_len;       // total byte length across all buffers
    char **index;         // per-row pointers into the buffers
    union {               // per-row lengths; active member chosen by len_mode
      void *lens;
      uint32_t *lens32;
      uint16_t *lens16;
    };
    uint8_t len_mode;     // selects lens32 vs lens16
  } data_{};

@adofsauron
Collaborator Author

Such a column data layout, optimized for writing, is hostile to read operations; the earlier LSM-tree layer built on RocksDB was just as clumsy and inherited this ugly design

@adofsauron
Collaborator Author

Aggregation stage 3, first split: establish an on-disk column-data IO layout suitable for vectorized reads

@adofsauron
Collaborator Author

Horizontal slicing of column data

|   5786 | (X_67=[100095]:bat[:oid], C_68=[4]:bat[:oid]) := group.group(X_40=[100095]:bat[:str]);                                                                                       |
|   5852 | (X_69=[100095]:bat[:oid], C_70=[4]:bat[:oid]) := group.group(X_48=[100095]:bat[:str]);                                                                                       |
|   6182 | (X_71=[100095]:bat[:oid], C_72=[4]:bat[:oid]) := group.group(X_51=[100095]:bat[:str]);                                                                                       |
|   5472 | (X_73=[100095]:bat[:oid], C_74=[4]:bat[:oid]) := group.group(X_57=[100095]:bat[:str]);                                                                                       |
|   8232 | (X_75=[100095]:bat[:oid], C_76=[4]:bat[:oid]) := group.group(X_49=[100095]:bat[:str]);                                                                                       |
|   5879 | (X_77=[100097]:bat[:oid], C_78=[4]:bat[:oid]) := group.group(X_63=[100097]:bat[:str]);                                                                                       |
|  12955 | (X_79=[100095]:bat[:oid], C_80=[4000]:bat[:oid]) := group.subgroup(X_39=[100095]:bat[:int], X_67=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  14632 | (X_81=[100095]:bat[:oid], C_82=[4000]:bat[:oid]) := group.subgroup(X_36=[100095]:bat[:int], X_73=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  12384 | (X_83=[100097]:bat[:oid], C_84=[4000]:bat[:oid]) := group.subgroup(X_34=[100097]:bat[:int], X_77=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  20043 | (X_85=[100095]:bat[:oid], C_86=[4000]:bat[:oid]) := group.subgroup(X_53=[100095]:bat[:int], X_69=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  20069 | (X_87=[100095]:bat[:oid], C_88=[4000]:bat[:oid]) := group.subgroup(X_46=[100095]:bat[:int], X_71=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  18398 | (X_89=[100095]:bat[:oid], C_90=[4000]:bat[:oid]) := group.subgroup(X_38=[100095]:bat[:int], X_75=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  26944 | (X_91=[100097]:bat[:oid], C_92=[78707]:bat[:oid]) := group.subgroupdone(X_54=[100097]:bat[:lng], X_83=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense            |
|  33312 | (X_93=[100095]:bat[:oid], C_94=[78880]:bat[:oid]) := group.subgroupdone(X_44=[100095]:bat[:lng], X_79=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense  
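
If I read MonetDB's MAL operators correctly (this trace is a reference design, not Tianmu): each group.group call builds initial groups on a string column of one roughly 100095-row slice, group.subgroup refines those groups with an int column, and group.subgroupdone finalizes the grouping on the last (lng) column, each step building a partial hash table per slice (GRP_create_partial_hash_table). Those per-slice partial results are exactly what a parallel Tianmu aggregation would likewise have to merge at the end.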

@adofsauron
Collaborator Author

The kernel maintains a central table of all active threads. They are indexed by their tid. The structure contains information on the input/output file descriptors, which should be set before a database operation is started. It ensures that output is delivered to the proper client.

@adofsauron
Collaborator Author


tmp_727#600572@0[str]VE,lo=0,hi=100095 -> tmp_731#100095@0[str]VE

@adofsauron
Collaborator Author

adofsauron commented Apr 18, 2023

Segmented processing of the split packs

DFLOWworker21   calling group.group [17, 18]
DFLOWworker21   calling group.group [17, 18]
DFLOWworker30   calling group.group [65, 66]
DFLOWworker30   calling group.group [65, 66]
DFLOWworker0    calling group.group [49, 50]
DFLOWworker0    calling group.group [49, 50]
DFLOWworker11   calling group.group [81, 82]
DFLOWworker11   calling group.group [81, 82]
DFLOWworker2    calling group.group [33, 34]
DFLOWworker2    calling group.group [33, 34]
DFLOWworker31   calling group.group [97, 98]
DFLOWworker31   calling group.group [97, 98]
DFLOWworker21   calling group.subgroup [18, 19]
DFLOWworker30   calling group.subgroup [66, 67]
DFLOWworker31   calling group.subgroup [98, 99]
DFLOWworker2    calling group.subgroup [34, 35]
DFLOWworker11   calling group.subgroup [82, 83]
DFLOWworker0    calling group.subgroup [50, 51]
DFLOWworker21   calling group.subgroupdone [19, 20]
DFLOWworker31   calling group.subgroupdone [99, 100]
DFLOWworker11   calling group.subgroupdone [83, 84]
DFLOWworker0    calling group.subgroupdone [51, 52]
DFLOWworker2    calling group.subgroupdone [35, 36]
DFLOWworker30   calling group.subgroupdone [67, 68]
DFLOWworker24   calling group.group [129, 130]
DFLOWworker24   calling group.group [129, 130]
DFLOWworker24   calling group.subgroup [130, 131]
DFLOWworker24   calling group.subgroupdone [131, 132]


@adofsauron
Collaborator Author





(X_51=   [100095]:bat[:oid], C_52=[4]:bat[:oid])         := group.group(X_39=[100095]:bat[:str]);                                                                                       |
(X_51=   [100095]:bat[:oid], C_52=[4]:bat[:oid])         := group.group(X_39=[100095]:bat[:str]);                                                                                       |
(X_53=   [100095]:bat[:oid], C_54=[4]:bat[:oid])         := group.group(X_34=[100095]:bat[:str]);                                                                                       |
(X_53=   [100095]:bat[:oid], C_54=[4]:bat[:oid])         := group.group(X_34=[100095]:bat[:str]);                                                                                       |
(X_61=   [100095]:bat[:oid], C_62=[4]:bat[:oid])         := group.group(X_41=[100095]:bat[:str]);                                                                                       |
(X_61=   [100095]:bat[:oid], C_62=[4]:bat[:oid])         := group.group(X_41=[100095]:bat[:str]);                                                                                       |
(X_70=   [100095]:bat[:oid], C_71=[4]:bat[:oid])         := group.group(X_59=[100095]:bat[:str]);                                                                                       |
(X_70=   [100095]:bat[:oid], C_71=[4]:bat[:oid])         := group.group(X_59=[100095]:bat[:str]);                                                                                       |
(X_72=   [100095]:bat[:oid], C_73=[4]:bat[:oid])         := group.group(X_45=[100095]:bat[:str]);                                                                                       |
(X_72=   [100095]:bat[:oid], C_73=[4]:bat[:oid])         := group.group(X_45=[100095]:bat[:str]);                                                                                       |
(X_74=   [100097]:bat[:oid], C_75=[4]:bat[:oid])         := group.group(X_68=[100097]:bat[:str]);                                                                                       |
(X_74=   [100097]:bat[:oid], C_75=[4]:bat[:oid])         := group.group(X_68=[100097]:bat[:str]);    
                                                                                   |
(X_78=   [100095]:bat[:oid], C_79=[4000]:bat[:oid])      := group.subgroup(X_55=[100095]:bat[:int], X_53=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_80=   [100095]:bat[:oid], C_81=[4000]:bat[:oid])      := group.subgroup(X_44=[100095]:bat[:int], X_51=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_82=   [100095]:bat[:oid], C_83=[4000]:bat[:oid])      := group.subgroup(X_37=[100095]:bat[:int], X_61=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_84=   [100097]:bat[:oid], C_85=[4000]:bat[:oid])      := group.subgroup(X_48=[100097]:bat[:int], X_74=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_86=   [100095]:bat[:oid], C_87=[4000]:bat[:oid])      := group.subgroup(X_30=[100095]:bat[:int], X_70=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_88=   [100095]:bat[:oid], C_89=[4000]:bat[:oid])      := group.subgroup(X_32=[100095]:bat[:int], X_72=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense      
           
(X_90=   [100095]:bat[:oid], C_91=[78696]:bat[:oid])     := group.subgroupdone(X_47=[100095]:bat[:lng], X_78=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense            |
(X_98=   [100097]:bat[:oid], C_99=[78707]:bat[:oid])     := group.subgroupdone(X_42=[100097]:bat[:lng], X_84=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense            |
(X_104=  [100095]:bat[:oid], C_105=[78787]:bat[:oid])    := group.subgroupdone(X_76=[100095]:bat[:lng], X_82=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense          |
(X_114=  [100095]:bat[:oid], C_115=[78962]:bat[:oid])    := group.subgroupdone(X_38=[100095]:bat[:lng], X_86=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense          |
(X_129=  [100095]:bat[:oid], C_130=[78618]:bat[:oid])    := group.subgroupdone(X_64=[100095]:bat[:lng], X_80=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense          |
(X_160=  [100095]:bat[:oid], C_161=[78880]:bat[:oid])    := group.subgroupdone(X_31=[100095]:bat[:lng], X_88=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense  
        
(X_0=    [472650]:bat[:oid], C_1=[4]:bat[:oid])          := group.group(X_2=[472650]:bat[:str]);                                                                                          |
(X_0=    [472650]:bat[:oid], C_1=[4]:bat[:oid])          := group.group(X_2=[472650]:bat[:str]);                                                                                          |
(X_3=    [472650]:bat[:oid], C_4=[4000]:bat[:oid])       := group.subgroup(X_5=[472650]:bat[:int], X_0=[472650]:bat[:oid]); # GRP_create_partial_hash_table, dense                     |
(X_202=  [472650]:bat[:oid], C_203=[189919]:bat[:oid])   := group.subgroupdone(X_196=[472650]:bat[:lng], X_3=[472650]:bat[:oid]); # GRP_create_partial_hash_table, dense         |


@adofsauron
Collaborator Author

While the disk IO layout follows the same vectorized structure as RAM, there are several different design strategies for the multithreaded task system, and the key trade-offs need to be analyzed in detail

@adofsauron
Collaborator Author

The problem is memory visibility. A more efficient design would use a completely lock-free approach with no visibility conflicts at all (see the sketch below)
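
A minimal sketch of that lock-free direction: each worker aggregates its range into a private table, so no memory is shared while the workers run, and join() establishes the happens-before edge before a single-threaded merge. Names and types are illustrative:

// Sketch only: share-nothing partial aggregation with a final merge.
#include <algorithm>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Partial = std::unordered_map<int64_t, int64_t>;  // key -> running SUM

Partial ParallelSum(const std::vector<std::pair<int64_t, int64_t>>& rows,
                    unsigned threads) {
  std::vector<Partial> parts(threads);
  std::vector<std::thread> pool;
  size_t chunk = (rows.size() + threads - 1) / threads;
  for (unsigned t = 0; t < threads; ++t) {
    pool.emplace_back([&parts, &rows, chunk, t] {
      size_t begin = t * chunk;
      size_t end = std::min(rows.size(), begin + chunk);
      for (size_t i = begin; i < end; ++i)
        parts[t][rows[i].first] += rows[i].second;  // private table, no locks
    });
  }
  for (auto& th : pool) th.join();  // all partial tables now visible here
  Partial total;
  for (auto& p : parts)
    for (auto& [k, v] : p) total[k] += v;  // single-threaded merge
  return total;
}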

adofsauron assigned and then unassigned adofsauron Apr 27, 2023
adofsauron removed their assignment Jun 14, 2023