
feature: make aggregation parallel process to reduce execution time #422

Open
adofsauron opened this issue Aug 19, 2022 · 141 comments · Fixed by #472 or #1279

@adofsauron
Collaborator

adofsauron commented Aug 19, 2022

Documentation for MySQL aggregation:

https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions-and-modifiers.html

Functional Requirements:

  1. Stay consistent with the aggregation conditions handled by the existing logic
  2. Use the Aggregate function as the entry point and ResultSender as the result output

Performance Requirements:

Table data limit:

  1. Up to the capacity of a physical drive
  2. The data may exceed the available memory

Memory usage limit:

  1. Add a parameter to control the maximum amount of memory that aggregation may use
  2. Spill to temporary disk files if the limit is exceeded (see the sketch below)
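
A minimal sketch of the intended control flow, assuming a hypothetical memory-limit parameter; the names here are illustrative, not actual StoneDB variables:

// Sketch only: memory-budget accounting for aggregation with spill-to-disk.
#include <cstddef>

class AggMemoryBudget {
 public:
  explicit AggMemoryBudget(size_t limit_bytes) : limit_(limit_bytes) {}

  // Returns true if the allocation fits the in-memory budget; false means
  // the caller should spill the current hash partition to a temp file.
  bool Charge(size_t bytes) {
    if (used_ + bytes > limit_) return false;
    used_ += bytes;
    return true;
  }
  void Release(size_t bytes) { used_ -= bytes; }

 private:
  size_t limit_;      // e.g. from a hypothetical tianmu_agg_mem_limit setting
  size_t used_ = 0;
};

// Usage sketch:
//   if (!budget.Charge(row_bytes)) SpillPartitionToTempFile();  // hypothetical helper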

Execution time limit:

  1. Under the same conditions, at worst within the same order of magnitude as an InnoDB aggregate query
  2. Aim to take less time than InnoDB for aggregate queries

Aggregate query SQL

	select
		c_name,
		c_custkey,
		o_orderkey,
		o_orderdate,
		o_totalprice,
		sum(l_quantity)
	from
		customer,
		orders,
		lineitem
	where
		c_custkey = o_custkey
		and o_orderkey = l_orderkey
	group by
		c_name,
		c_custkey,
		o_orderkey,
		o_orderdate,
		o_totalprice
	order by
		o_totalprice desc,
		o_orderdate
	limit 10;
mysql> desc part;
+---------------+---------------+------+-----+---------+-------+
| Field         | Type          | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| p_partkey     | int(11)       | NO   | PRI | NULL    |       |
| p_name        | varchar(55)   | NO   |     | NULL    |       |
| p_mfgr        | char(25)      | NO   |     | NULL    |       |
| p_brand       | char(10)      | NO   |     | NULL    |       |
| p_type        | varchar(25)   | NO   |     | NULL    |       |
| p_size        | int(11)       | NO   |     | NULL    |       |
| p_container   | char(10)      | NO   |     | NULL    |       |
| p_retailprice | decimal(15,2) | NO   |     | NULL    |       |
| p_comment     | varchar(23)   | NO   |     | NULL    |       |
+---------------+---------------+------+-----+---------+-------+
9 rows in set (0.00 sec)

mysql> desc partsupp;
+---------------+---------------+------+-----+---------+-------+
| Field         | Type          | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| ps_partkey    | int(11)       | NO   | PRI | NULL    |       |
| ps_suppkey    | int(11)       | NO   | PRI | NULL    |       |
| ps_availqty   | int(11)       | NO   |     | NULL    |       |
| ps_supplycost | decimal(15,2) | NO   |     | NULL    |       |
| ps_comment    | varchar(199)  | NO   |     | NULL    |       |
+---------------+---------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

    select
        p_brand,
        p_type,
        p_size,
        count(distinct ps_suppkey) as supplier_cnt
    from
        partsupp,
        part
    where
        p_partkey = ps_partkey
        and p_brand <> 'Brand#45'
        and p_type not like 'MEDIUM POLISHED%'
        and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
        and ps_suppkey not in (
            select
                s_suppkey
            from
                supplier
            where
                s_comment like '%Customer%Complaints%'
        )
    group by
        p_brand,
        p_type,
        p_size
    order by
        supplier_cnt desc,
        p_brand,
        p_type,
        p_size
    limit 10;
adofsauron added the A-feature (feature with good idea) label Aug 19, 2022
@adofsauron
Collaborator Author

The original overall logic

[image]

@adofsauron
Collaborator Author

The original aggregation initialization logic

[image]

@adofsauron
Collaborator Author

Outline of the parallel aggregation processing logic

[image]

@hustjieke
Collaborator

ACK!
Nice job!

@adofsauron
Collaborator Author

A parallel hash structure we can reference:

https://greg7mdp.github.io/parallel-hashmap/

@adofsauron
Collaborator Author

The current flame graph captured during the aggregate query

[image: flame graph]

@adofsauron
Collaborator Author

For comparison, the EXPLAIN plan for the aggregate query in MySQL 8:

mysql>     explain  select
    ->         p_brand,
    ->         p_type,
    ->         p_size,
    ->         count(distinct ps_suppkey) as supplier_cnt
    ->     from
    ->         partsupp,
    ->         part
    ->     where
    ->         p_partkey = ps_partkey
    ->         and p_brand <> 'Brand#45'
    ->         and p_type not like 'MEDIUM POLISHED%'
    ->         and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    ->         and ps_suppkey not in (
    ->             select
    ->                 s_suppkey
    ->             from
    ->                 supplier
    ->             where
    ->                 s_comment like '%Customer%Complaints%'
    ->         )
    ->     group by
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     order by
    ->         supplier_cnt desc,
    ->         p_brand,
    ->         p_type,
    ->         p_size\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: partsupp
   partitions: NULL
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 7735092
     filtered: 100.00
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: <subquery2>
   partitions: NULL
         type: eq_ref
possible_keys: <auto_distinct_key>
          key: <auto_distinct_key>
      key_len: 5
          ref: tpch.partsupp.ps_suppkey
         rows: 1
     filtered: 100.00
        Extra: Using where; Not exists
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: part
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: tpch.partsupp.ps_partkey
         rows: 1
     filtered: 40.00
        Extra: Using where
*************************** 4. row ***************************
           id: 2
  select_type: MATERIALIZED
        table: supplier
   partitions: NULL
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 98754
     filtered: 100.00
        Extra: Using where
4 rows in set, 1 warning (0.00 sec)

@adofsauron
Collaborator Author

adofsauron commented Aug 19, 2022

For comparison, the execution time of the aggregate query in MySQL 8:

mysql>     select
    ->         p_brand,
    ->         p_type,
    ->         p_size,
    ->         count(distinct ps_suppkey) as supplier_cnt
    ->     from
    ->         partsupp,
    ->         part
    ->     where
    ->         p_partkey = ps_partkey
    ->         and p_brand <> 'Brand#45'
    ->         and p_type not like 'MEDIUM POLISHED%'
    ->         and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    ->         and ps_suppkey not in (
    ->             select
    ->                 s_suppkey
    ->             from
    ->                 supplier
    ->             where
    ->                 s_comment like '%Customer%Complaints%'
    ->         )
    ->     group by
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     order by
    ->         supplier_cnt desc,
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     limit 10;
+----------+--------------------------+--------+--------------+
| p_brand  | p_type                   | p_size | supplier_cnt |
+----------+--------------------------+--------+--------------+
| Brand#44 | STANDARD PLATED TIN      |      9 |          120 |
| Brand#12 | STANDARD POLISHED COPPER |     14 |          100 |
| Brand#11 | LARGE BRUSHED STEEL      |     36 |           96 |
| Brand#23 | PROMO BURNISHED STEEL    |     14 |           96 |
| Brand#34 | MEDIUM BRUSHED STEEL     |     23 |           96 |
| Brand#53 | PROMO BURNISHED BRASS    |     36 |           96 |
| Brand#54 | STANDARD BRUSHED COPPER  |     19 |           96 |
| Brand#32 | LARGE POLISHED COPPER    |     14 |           95 |
| Brand#43 | LARGE PLATED COPPER      |     19 |           95 |
| Brand#11 | SMALL BRUSHED STEEL      |      9 |           92 |
+----------+--------------------------+--------+--------------+
10 rows in set (11.44 sec)

@adofsauron
Collaborator Author

For comparison, the execution time of the same aggregate query on the Tianmu engine in StoneDB:

mysql>     select
    ->         p_brand,
    ->         p_type,
    ->         p_size,
    ->         count(distinct ps_suppkey) as supplier_cnt
    ->     from
    ->         partsupp,
    ->         part
    ->     where
    ->         p_partkey = ps_partkey
    ->         and p_brand <> 'Brand#45'
    ->         and p_type not like 'MEDIUM POLISHED%'
    ->         and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    ->         and ps_suppkey not in (
    ->             select
    ->                 s_suppkey
    ->             from
    ->                 supplier
    ->             where
    ->                 s_comment like '%Customer%Complaints%'
    ->         )
    ->     group by
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     order by
    ->         supplier_cnt desc,
    ->         p_brand,
    ->         p_type,
    ->         p_size
    ->     limit 10;
+----------+---------------------------+--------+--------------+
| p_brand  | p_type                    | p_size | supplier_cnt |
+----------+---------------------------+--------+--------------+
| Brand#44 | STANDARD PLATED TIN       |      9 |          120 |
| Brand#33 | STANDARD BURNISHED COPPER |     42 |          116 |
| Brand#24 | SMALL BURNISHED NICKEL    |     11 |          104 |
| Brand#33 | SMALL BURNISHED NICKEL    |     12 |          104 |
| Brand#41 | ECONOMY POLISHED BRASS    |     16 |          104 |
| Brand#51 | PROMO PLATED STEEL        |     28 |          104 |
| Brand#55 | ECONOMY BURNISHED NICKEL  |     21 |          104 |
| Brand#11 | ECONOMY ANODIZED COPPER   |     28 |          100 |
| Brand#11 | ECONOMY POLISHED COPPER   |     41 |          100 |
| Brand#12 | ECONOMY ANODIZED STEEL    |     31 |          100 |
+----------+---------------------------+--------+--------------+
10 rows in set (1 min 20.25 sec)

@adofsauron
Collaborator Author

adofsauron commented Aug 19, 2022

Splitting the table tuples

Purpose:

  1. Evenly divide the table data that a single thread would traverse into intervals, so that different worker threads can each process one interval independently (see the sketch after this list)
  2. Correctly handle memory visibility between worker threads, so each worker's own work stays complete and correct and overlapping critical sections cannot make threads interfere with one another
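
A minimal sketch of the interval split, assuming the pack is the unit of division; the function name is illustrative:

// Sketch only: divide [0, pack_count) into near-equal, non-overlapping
// intervals, one per worker thread. The first `rem` workers take one extra
// pack, so every pack is assigned exactly once and no intervals overlap.
#include <cstdint>
#include <vector>

struct PackRange { int64_t begin; int64_t end; };  // [begin, end)

std::vector<PackRange> SplitPacks(int64_t pack_count, int threads) {
  std::vector<PackRange> ranges;
  int64_t base = pack_count / threads;  // packs per worker
  int64_t rem  = pack_count % threads;  // leftover packs
  int64_t pos  = 0;
  for (int t = 0; t < threads; ++t) {
    int64_t len = base + (t < rem ? 1 : 0);
    ranges.push_back({pos, pos + len});
    pos += len;
  }
  return ranges;  // e.g. 154 packs over 4 threads -> 39, 39, 38, 38
}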

@adofsauron
Collaborator Author

adofsauron commented Aug 19, 2022

Problems faced by table tuple splitting

  • How do we obtain the table's tuple description: how many rows, how many columns, how many packs?

  • What rules divide the table space into intervals, how many worker threads are started, and how many intervals does each worker thread handle?

  • GroupByWrapper currently stores the complete information of the table, including per-row flags for whether a column's data block has been processed, which makes it impossible for multiple threads to work on the class in parallel. Should each thread maintain its own GroupByWrapper that stores only the information for its current tuple range, so that each worker thread can operate independently?

  • GroupByWrapper also stores the description of the matched aggregation columns, which is identical in every worker thread. Should that description be separated into its own data structure that each worker thread only reads?

  • Should we keep the current encapsulation layers of GroupByWrapper and GroupTable, GroupDistinctTable and GroupDistinctCache? Do they really partition the functionality correctly? What is the maintenance cost? Is this already the optimal class organization?

  • The ValueMatchingTable class, which previously stored the aggregated data, could be replaced with a parallel hash map; otherwise each worker thread has to store its own data, and the per-thread results must finally be copied together for a global aggregation pass, whose memory-copy overhead is huge and unacceptable. How does the interface of a parallel hash map combine with the upper-level traversal and aggregation logic when it replaces the local hash? (See the sketch after this list.)

  • When writing aggregate data, how does a parallel hash map merge with existing entries to form a unified result, and how do we guarantee the correctness of the aggregate totals?

  • What are the locking rules when multiple threads race to write the value for the same grouping key in a parallel hash map? How do we avoid one worker thread overwriting another's results?
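
A minimal sketch of a concurrent upsert with the parallel-hashmap library referenced above, for an integer grouping key and a SUM aggregate. Per the library's README, try_emplace_l either updates an existing entry under the owning submap's internal lock or inserts a new one; treat the exact signature as an assumption to verify, and all other names as illustrative:

// Sketch only: concurrent SUM aggregation into one shared parallel hash map.
// Requires https://github.com/greg7mdp/parallel-hashmap (header-only).
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>
#include <parallel_hashmap/phmap.h>

// 2^4 = 16 internal submaps, each guarded by its own std::mutex, so writers
// touching different grouping keys rarely contend on the same lock.
using AggMap = phmap::parallel_flat_hash_map<
    int64_t, int64_t,
    phmap::priv::hash_default_hash<int64_t>,
    phmap::priv::hash_default_eq<int64_t>,
    std::allocator<std::pair<const int64_t, int64_t>>,
    4, std::mutex>;

void AggregateRange(AggMap& map,
                    const std::vector<std::pair<int64_t, int64_t>>& rows,
                    size_t begin, size_t end) {
  for (size_t i = begin; i < end; ++i) {
    int64_t key = rows[i].first;
    int64_t qty = rows[i].second;
    // Upsert under the owning submap's lock: add to an existing sum,
    // or insert `qty` as the initial value for a new grouping key.
    map.try_emplace_l(key,
                      [qty](AggMap::value_type& kv) { kv.second += qty; },
                      qty);
  }
}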

adofsauron changed the title from "feature: aggregation parallel for reduce execution time" to "feature: make aggregation parallel process to reduce execution time" Aug 19, 2022
@adofsauron
Collaborator Author

Extracting aggregate attributes

[image]

@adofsauron
Collaborator Author

Strip the column attributes out of aggregate processing:

  1. The meta-attributes of an aggregated column are independent of any particular tuple and invariant, so they can be stripped out (see the sketch below)
  2. The GroupByWrapper class retains only the data tied to tuples during traversal, such as which rows have been processed and which data has been aggregated
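
A minimal sketch of that separation; the struct names are illustrative, not Tianmu classes:

// Sketch only: invariant column meta-attributes shared read-only across
// workers, with the mutable traversal state owned per worker thread.
#include <cstdint>
#include <memory>
#include <vector>

struct AggColumnDesc {      // built once per query, then never mutated
  int col;                  // source column index
  bool is_grouping;         // grouping key vs. aggregated value
  int64_t min_v, max_v;     // value range, as logged by AddAggregatedColumn
};

struct SharedAggMeta {
  std::vector<AggColumnDesc> columns;  // read-only after setup
};

struct WorkerAggState {                         // one instance per worker
  std::shared_ptr<const SharedAggMeta> meta;    // shared, read-only view
  int64_t tuples_left = 0;                      // progress in this range
  // ... plus the per-range hash table, processed-row flags, etc.
};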

@adofsauron
Collaborator Author

Aggregate attribute static class diagram

[image]

@adofsauron
Collaborator Author

We first need to establish a mathematical model of aggregate processing in relational algebra, and then implement that model under the premise of its mathematical rules

@adofsauron
Collaborator Author

adofsauron commented Aug 20, 2022

The projection operation

[image]

Projection is relational algebra's counterpart of existential quantification in predicate logic. The attributes not included correspond to existentially quantified variables in the predicate whose extension the operand relation represents. The example below illustrates this point.

Because of the correspondence with existential quantification, some authorities prefer to define projection in terms of the excluded attributes. In a computer language it is of course possible to provide notations for both, and that was done in ISBL and several languages that have taken their cue from ISBL.

A nearly identical concept occurs in the category of monoids, called a string projection, which consists of removing all of the letters in the string that do not belong to a given alphabet.

When implemented in the SQL standard, the "default projection" returns a multiset instead of a set, and the $\pi$ projection is obtained by adding the DISTINCT keyword to eliminate duplicate data.

$\Pi_{a_1, a_2, a_3, \ldots, a_k}(r)$

[image]
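
A small worked example in the same array notation as the grouping example further down; the relation is illustrative. SQL's default projection (no DISTINCT) would keep both rows with $A = 1$; the set projection $\pi$ collapses them:

$\pi_{A} \left( \begin{array}{ c | c } A & B \\ \hline \hline 1 & a \\ 1 & b \\ 2 & c \end{array} \right) = \begin{array}{ c } A \\ \hline \hline 1 \\ 2 \end{array}$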

@adofsauron
Collaborator Author

adofsauron commented Aug 20, 2022

Grouping

$\pi_{name, \ \text{SUM}(length)} \ \sigma_{\text{MIN}(year) < 1930} \ \gamma_{name, \ \text{MIN}(year), \ \text{SUM}(length)} \ \sigma_{cert = producer}(MovieExec \times Movie)$

$\gamma_{grouping\_attribute, \ func(A) \ \to \ name}(R)$

  • a unary operator that takes the relation R as input
  • the first parameter (grouping_attribute) is the attribute on which R will be grouped
  • the function func is applied to each group, and its result is written to attribute name
  • NB: all other (non-mentioned) attributes are not output to the result!

Example

$\gamma_{A, \ \text{min}(B) \ \to \ D} \left( \begin{array}{ c | c | c } A & B & C \\ \hline \hline 1 & 2 & a \\ 1 & 3 & b \\ 2 & 3 & c \\ 2 & 4 & a \\ 2 & 5 & d \end{array} \right) = \begin{array}{ c | c } A & D \\ \hline \hline 1 & 2 \\ 2 & 3 \end{array}$

@adofsauron
Collaborator Author

adofsauron commented Aug 20, 2022

Domain-driven design (DDD) with mathematical rules as the domain model is how multi-threaded parallel aggregation has to be approached:

  • Only mathematical rules can describe the existing code logic, so we use mathematical modeling

  • The engine code can only achieve domain-driven design by being modeled mathematically, and can only be understood mathematically. This is the way

  • So the code cannot be understood and optimized along module boundaries alone. We need to understand the relational-algebra rules behind it, optimize at the level of the mathematical model, then apply formal verification, and only as the last step write the model back into the code from beginning to end. Changing just one piece in isolation violates the domain model

VC=Column
Pack=<VC>
R=<Pack>
Row=Column(line)
Iterator=<Row>
Vector=<Iter>
EachWorker={Vec1,Vec2...}
Aggrega={EachWorker}
    =P1{Vec1,Vec2...}+P2+...
    =P1{PackRow(Vec1) => Hash[row]->line}

@adofsauron
Collaborator Author

adofsauron commented Aug 21, 2022

Physical operations for selection

[image]

@adofsauron
Collaborator Author

Eliminating duplicates

[image]

[image]

@adofsauron
Collaborator Author

Static class diagram of the existing aggregation process

[image]

@adofsauron
Collaborator Author

Static class diagram for iterators

[image]

@adofsauron
Collaborator Author

adofsauron commented Aug 21, 2022

Table properties

[2022-08-21 10:57:51.504922] [45954] [INFO] [aggregation_algorithm.cpp:49] MSG: Aggregate numOfAttrs: 4 packpower: 16 NumOfObj: -1 NumOfTables: 2 NumOfTuples: 7422784 distinct: false
[2022-08-21 10:57:51.504980] [45954] [INFO] [aggregation_algorithm.cpp:69] MSG: Aggregate AddGroupingColumn attr_num: 0 col: 0
[2022-08-21 10:57:51.505004] [45954] [INFO] [aggregation_algorithm.cpp:69] MSG: Aggregate AddGroupingColumn attr_num: 1 col: 1
[2022-08-21 10:57:51.505011] [45954] [INFO] [aggregation_algorithm.cpp:69] MSG: Aggregate AddGroupingColumn attr_num: 2 col: 2
[2022-08-21 10:57:51.505024] [45954] [INFO] [aggregation_algorithm.cpp:127] MSG: Aggregate AddAggregatedColumn col: 3 max_no_of_distinct: 100000 min_v: 1 max_v: 100000 max_size: 11
[2022-08-21 10:57:51.540405] [45954] [INFO] [aggregation_algorithm.cpp:238] MSG: NumOfDimensions: 2 NumOfTuples: 7422784

Set the number of threads to 4

[2022-08-21 21:41:28.927878] [67237] [INFO] [aggregation_algorithm.cpp:896] MSG: ReadyDist threads: 4 packnum: 154 loopcnt: 4 num: 38 mod: 2
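
Reading the log: with packnum = 154 and 4 threads, each thread gets num = 154 / 4 = 38 packs, and the mod = 154 % 4 = 2 leftover packs are distributed so that every pack is assigned exactly once.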

@adofsauron
Collaborator Author

Sharding needs to be refactored

The reachability of iterators and the identity bookkeeping after the split were not properly considered

void AggregationWorkerEnt::PrepShardingCopy(MIIterator *mit, GroupByWrapper *gb_sharding,
                                            std::vector<std::unique_ptr<GroupByWrapper>> *vGBW) {
  DimensionVector dims(mind->NumOfDimensions());
  // Each shard works on its own copy of the wrapper, so workers never share
  // mutable grouping state.
  std::unique_ptr<GroupByWrapper> gbw_ptr(new GroupByWrapper(*gb_sharding));
  gbw_ptr->FillDimsUsed(dims);
  gbw_ptr->SetDistinctTuples(mit->NumOfTuples());
  // A multi-pass wrapper must know how many tuples remain in this shard.
  if (!gbw_ptr->IsOnePass()) gbw_ptr->InitTupleLeft(mit->NumOfTuples());
  gbw_ptr->AddAllGroupingConstants(*mit);
  // Publish the copy; the lock only guards concurrent access to vGBW.
  std::scoped_lock guard(mtx);
  vGBW->emplace_back(std::move(gbw_ptr));
}

@adofsauron
Collaborator Author

To show why aggregation performance is currently low, run the simplest aggregate query and look at the CPU usage. You can see the serialization and deserialization inside the aggregation hash.

[image]

@adofsauron
Collaborator Author

The Tianmu::core::ColumnBinEncoder::Encode function is very interesting: while generating the hash key for aggregation it both randomly fetches data from RAM, invalidating the CPU cache, and transcodes the fetched data, serializing it into a new region of RAM, so a single key spans several memory-access round trips of CPU time (see the sketch below)
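
A minimal sketch of the access pattern being described, not the actual Encode implementation: each grouping column is fetched from its own region of memory (a cache-unfriendly gather) and then memcpy'd into a contiguous key buffer.

// Sketch only: the gather-then-serialize pattern that makes key encoding
// expensive. Illustrative fixed-width layout, not Tianmu's ColumnBinEncoder.
#include <cstddef>
#include <cstring>
#include <vector>

struct ColumnSlice { const char* base; size_t width; };

// Builds the hash key for `row` by copying each grouping column's bytes into
// `key`. Every iteration touches a different column's memory (random RAM
// access), then memcpy serializes the bytes into yet another buffer.
size_t EncodeKey(const std::vector<ColumnSlice>& cols, size_t row, char* key) {
  size_t off = 0;
  for (const auto& c : cols) {
    std::memcpy(key + off, c.base + row * c.width, c.width);
    off += c.width;
  }
  return off;  // total key length
}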

@adofsauron
Collaborator Author

Besides the obvious data fetching and serialization, another typical cost is the tuple-at-a-time iterator overhead of the classic Volcano model, which leaves the CPU waiting on each next-tuple call instead of doing useful work (a sketch follows the image)

[image]
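
A minimal sketch of that overhead, contrasting the tuple-at-a-time interface with what a block interface would amortize; types are illustrative:

// Sketch only: tuple-at-a-time iteration pays one virtual call per tuple,
// so the call cost can never be amortized the way a vectorized loop
// (one call per block of values) amortizes it.
#include <cstdint>
#include <optional>

struct Tuple { int64_t key; int64_t val; };

struct Iterator {                              // classic Volcano interface
  virtual std::optional<Tuple> Next() = 0;     // one call per tuple
  virtual ~Iterator() = default;
};

int64_t SumAll(Iterator& it) {
  int64_t sum = 0;
  // Each iteration pays a virtual call plus branches; over millions of
  // tuples that dwarfs the single add we actually want to perform.
  while (auto t = it.Next()) sum += t->val;
  return sum;
}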

@adofsauron
Collaborator Author

mysqld 174443 518728.020321:   11642419 cycles:
                 2d4d70b Tianmu::core::MIIterator::operator+++0x11 (/data/stonedb57/install/bin/mysqld)
                 30681e2 Tianmu::core::MIIterator::Increment+0x18 (/data/stonedb57/install/bin/mysqld)
                 2feba05 Tianmu::core::AggregationAlgorithm::AggregatePackrow+0x64d (/data/stonedb57/install/bin/mysqld)
                 2fea0cb Tianmu::core::AggregationAlgorithm::MultiDimensionalGroupByScan+0x789 (/data/stonedb57/install/bin/mysqld)
                 2fe9855 Tianmu::core::AggregationAlgorithm::Aggregate+0xe0f (/data/stonedb57/install/bin/mysqld)
                 2d49f86 Tianmu::core::TempTable::Materialize+0x892 (/data/stonedb57/install/bin/mysqld)
                 2ce4488 Tianmu::core::Engine::Execute+0xa66 (/data/stonedb57/install/bin/mysqld)
                 2ce31ac Tianmu::core::Engine::HandleSelect+0x8ce (/data/stonedb57/install/bin/mysqld)
                 2df2371 Tianmu::DBHandler::ha_my_tianmu_query+0x5c (/data/stonedb57/install/bin/mysqld)
                 24205c6 execute_sqlcom_select+0x254 (/data/stonedb57/install/bin/mysqld)
                 241993c mysql_execute_command+0xe5c (/data/stonedb57/install/bin/mysqld)
                 242162b mysql_parse+0x6b0 (/data/stonedb57/install/bin/mysqld)
                 2416622 dispatch_command+0xcc7 (/data/stonedb57/install/bin/mysqld)
                 2415463 do_command+0x4ba (/data/stonedb57/install/bin/mysqld)
                 25479f9 handle_connection+0x1ee (/data/stonedb57/install/bin/mysqld)
                 2c17238 pfs_spawn_thread+0x173 (/data/stonedb57/install/bin/mysqld)
            7faad6ef81ca start_thread+0xea (/usr/lib64/libpthread-2.28.so)

@adofsauron
Collaborator Author

Memory copies have the worst effect on CPU cycles

mysqld 174443 518766.939690:   25789094 cycles:
                 1d15640 memcpy@plt+0x0 (/data/stonedb57/install/bin/mysqld)
                 3003c55 Tianmu::core::ColumnBinEncoder::Encode+0x95 (/data/stonedb57/install/bin/mysqld)
                 2fee7d1 Tianmu::core::GroupTable::PutGroupingValue+0x5f (/data/stonedb57/install/bin/mysqld)
                 2feeb43 Tianmu::core::GroupByWrapper::PutGroupingValue+0x2f (/data/stonedb57/install/bin/mysqld)
                 2feb807 Tianmu::core::AggregationAlgorithm::AggregatePackrow+0x44f (/data/stonedb57/install/bin/mysqld)
                 2fea0cb Tianmu::core::AggregationAlgorithm::MultiDimensionalGroupByScan+0x789 (/data/stonedb57/install/bin/mysqld)
                 2fe9855 Tianmu::core::AggregationAlgorithm::Aggregate+0xe0f (/data/stonedb57/install/bin/mysqld)
                 2d49f86 Tianmu::core::TempTable::Materialize+0x892 (/data/stonedb57/install/bin/mysqld)
                 2ce4488 Tianmu::core::Engine::Execute+0xa66 (/data/stonedb57/install/bin/mysqld)
                 2ce31ac Tianmu::core::Engine::HandleSelect+0x8ce (/data/stonedb57/install/bin/mysqld)
                 2df2371 Tianmu::DBHandler::ha_my_tianmu_query+0x5c (/data/stonedb57/install/bin/mysqld)
                 24205c6 execute_sqlcom_select+0x254 (/data/stonedb57/install/bin/mysqld)
                 241993c mysql_execute_command+0xe5c (/data/stonedb57/install/bin/mysqld)
                 242162b mysql_parse+0x6b0 (/data/stonedb57/install/bin/mysqld)
                 2416622 dispatch_command+0xcc7 (/data/stonedb57/install/bin/mysqld)
                 2415463 do_command+0x4ba (/data/stonedb57/install/bin/mysqld)
                 25479f9 handle_connection+0x1ee (/data/stonedb57/install/bin/mysqld)
                 2c17238 pfs_spawn_thread+0x173 (/data/stonedb57/install/bin/mysqld)
            7faad6ef81ca start_thread+0xea (/usr/lib64/libpthread-2.28.so)

@adofsauron
Collaborator Author

the cost of the routine call (in the ballpark of 20 cycles) must be amortized over only one operation, which effectively doubles the operation cost.
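
At one value per call, the roughly 20-cycle call overhead is as large as the operation itself; amortized over a vector of, say, 1000 values it drops to about 20 / 1000 = 0.02 cycles per value, which is why the vector-at-a-time execution quoted below pays it only once per block.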

@adofsauron
Collaborator Author

[image]

@adofsauron
Collaborator Author

The data structure here is ready for redesign

[image]

@adofsauron
Collaborator Author

adofsauron commented Apr 17, 2023

Each row's data is extracted from the column store and then copied in memory to generate the hash key, even though the column-store data contains duplicates

[image]

@adofsauron
Collaborator Author

These joins are not required in a Volcano-like pipelined execution model. It can do the selection, computations and aggregation all in a single pass, not materializing any data.

@adofsauron
Collaborator Author

The same vertically partitioned and even compressed disk data layout is used in RAM to save space and bandwidth.

@adofsauron
Copy link
Collaborator Author

Execution proceeds using Volcano-like pipelining, on the granularity of a vector (e.g. 1000 values). The Scan operator retrieves data vector-at-a-time from Monet BATs. Note that only attributes relevant to the query are actually scanned.

@adofsauron
Collaborator Author

Some peculiarities of this algebra are that Project is just used for expression calculation; it does not eliminate duplicates.

@adofsauron
Collaborator Author

Modern CPUs can typically only perform 1 or 2 load/store operations per cycle. In compound primitives, the results from one calculation are passed via a CPU register to the next calculation, with load/stores only occurring at the edges of the expression graph.
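
A minimal sketch of a compound vectorized primitive in the spirit of the quote; the expression (a + b) * c is illustrative:

// Sketch only: the intermediate of (a + b) * c stays in a CPU register;
// loads and stores happen only at the edges of the expression, once per
// element, instead of materializing a temporary vector between two passes.
#include <cstddef>
#include <cstdint>

void map_add_mul(const int64_t* a, const int64_t* b, const int64_t* c,
                 int64_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    int64_t t = a[i] + b[i];  // register-resident intermediate
    out[i] = t * c[i];        // single store per element
  }
}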

@adofsauron
Collaborator Author

adofsauron commented Apr 17, 2023

Column data files store values either deduplicated or sorted. Consider generating a new auxiliary column:

  1. Deduplicate the data in the new auxiliary column

  2. Decide whether the data in the new auxiliary column should also be sorted

  3. During access, locate values through their offsets in the auxiliary column, avoiding the memory copy (see the sketch after the image)

[image]
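
A minimal sketch of such an auxiliary column: a deduplicated dictionary addressed by offset, so aggregation can hash a fixed-width code instead of memcpy'ing the variable-length value. The layout is illustrative, not the Tianmu file format:

// Sketch only: dictionary-style auxiliary column with offset-based access.
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

struct DictColumn {
  std::vector<char>     bytes;    // all distinct values, concatenated
  std::vector<uint32_t> offsets;  // value i lives at [offsets[i], offsets[i+1])
  std::vector<uint32_t> codes;    // per-row index into the dictionary

  // Zero-copy view of the value for `row`; no serialization, no memcpy.
  std::string_view Value(size_t row) const {
    uint32_t id = codes[row];
    return {bytes.data() + offsets[id], offsets[id + 1] - offsets[id]};
  }
  // For grouping, the fixed-width code itself can serve as the hash key.
  uint32_t Code(size_t row) const { return codes[row]; }
};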

@adofsauron
Collaborator Author

  struct {
    std::vector<buf> v;   // backing buffers holding the raw column bytes
    size_t sum_len;       // total byte length across all buffers
    char **index;         // per-row pointers into the buffers
    union {               // per-row lengths; active member chosen by len_mode
      void *lens;
      uint32_t *lens32;
      uint16_t *lens16;
    };
    uint8_t len_mode;     // selects lens32 vs lens16
  } data_{};

@adofsauron
Collaborator Author

Such a column data layout, optimized for writing, is hostile to read operations; the earlier LSM-tree layer built on RocksDB was just as clumsy and inherited this ugly design

@adofsauron
Collaborator Author

Aggregation stage 3, first split: establish an on-disk column-data IO layout suitable for vectorized reads

@adofsauron
Collaborator Author

Horizontal slicing of column data

|   5786 | (X_67=[100095]:bat[:oid], C_68=[4]:bat[:oid]) := group.group(X_40=[100095]:bat[:str]);                                                                                       |
|   5852 | (X_69=[100095]:bat[:oid], C_70=[4]:bat[:oid]) := group.group(X_48=[100095]:bat[:str]);                                                                                       |
|   6182 | (X_71=[100095]:bat[:oid], C_72=[4]:bat[:oid]) := group.group(X_51=[100095]:bat[:str]);                                                                                       |
|   5472 | (X_73=[100095]:bat[:oid], C_74=[4]:bat[:oid]) := group.group(X_57=[100095]:bat[:str]);                                                                                       |
|   8232 | (X_75=[100095]:bat[:oid], C_76=[4]:bat[:oid]) := group.group(X_49=[100095]:bat[:str]);                                                                                       |
|   5879 | (X_77=[100097]:bat[:oid], C_78=[4]:bat[:oid]) := group.group(X_63=[100097]:bat[:str]);                                                                                       |
|  12955 | (X_79=[100095]:bat[:oid], C_80=[4000]:bat[:oid]) := group.subgroup(X_39=[100095]:bat[:int], X_67=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  14632 | (X_81=[100095]:bat[:oid], C_82=[4000]:bat[:oid]) := group.subgroup(X_36=[100095]:bat[:int], X_73=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  12384 | (X_83=[100097]:bat[:oid], C_84=[4000]:bat[:oid]) := group.subgroup(X_34=[100097]:bat[:int], X_77=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  20043 | (X_85=[100095]:bat[:oid], C_86=[4000]:bat[:oid]) := group.subgroup(X_53=[100095]:bat[:int], X_69=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  20069 | (X_87=[100095]:bat[:oid], C_88=[4000]:bat[:oid]) := group.subgroup(X_46=[100095]:bat[:int], X_71=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  18398 | (X_89=[100095]:bat[:oid], C_90=[4000]:bat[:oid]) := group.subgroup(X_38=[100095]:bat[:int], X_75=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
|  26944 | (X_91=[100097]:bat[:oid], C_92=[78707]:bat[:oid]) := group.subgroupdone(X_54=[100097]:bat[:lng], X_83=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense            |
|  33312 | (X_93=[100095]:bat[:oid], C_94=[78880]:bat[:oid]) := group.subgroupdone(X_44=[100095]:bat[:lng], X_79=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense  
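
If I read MonetDB's MAL operators correctly (this trace is a reference design, not Tianmu): each group.group call builds initial groups on a string column of one roughly 100095-row slice, group.subgroup refines those groups with an int column, and group.subgroupdone finalizes the grouping on the last (lng) column, each step building a partial hash table per slice (GRP_create_partial_hash_table). Those per-slice partial results are exactly what a parallel Tianmu aggregation would likewise have to merge at the end.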

@adofsauron
Collaborator Author

The kernel maintains a central table of all active threads. They are indexed by their tid. The structure contains information on the input/output file descriptors, which should be set before a database operation is started. It ensures that output is delivered to the proper client.

@adofsauron
Collaborator Author


tmp_727#600572@0[str]VE,lo=0,hi=100095 -> tmp_731#100095@0[str]VE

@adofsauron
Collaborator Author

adofsauron commented Apr 18, 2023

Segmented processing of the split packs

DFLOWworker21   calling group.group [17, 18]
DFLOWworker21   calling group.group [17, 18]
DFLOWworker30   calling group.group [65, 66]
DFLOWworker30   calling group.group [65, 66]
DFLOWworker0    calling group.group [49, 50]
DFLOWworker0    calling group.group [49, 50]
DFLOWworker11   calling group.group [81, 82]
DFLOWworker11   calling group.group [81, 82]
DFLOWworker2    calling group.group [33, 34]
DFLOWworker2    calling group.group [33, 34]
DFLOWworker31   calling group.group [97, 98]
DFLOWworker31   calling group.group [97, 98]
DFLOWworker21   calling group.subgroup [18, 19]
DFLOWworker30   calling group.subgroup [66, 67]
DFLOWworker31   calling group.subgroup [98, 99]
DFLOWworker2    calling group.subgroup [34, 35]
DFLOWworker11   calling group.subgroup [82, 83]
DFLOWworker0    calling group.subgroup [50, 51]
DFLOWworker21   calling group.subgroupdone [19, 20]
DFLOWworker31   calling group.subgroupdone [99, 100]
DFLOWworker11   calling group.subgroupdone [83, 84]
DFLOWworker0    calling group.subgroupdone [51, 52]
DFLOWworker2    calling group.subgroupdone [35, 36]
DFLOWworker30   calling group.subgroupdone [67, 68]
DFLOWworker24   calling group.group [129, 130]
DFLOWworker24   calling group.group [129, 130]
DFLOWworker24   calling group.subgroup [130, 131]
DFLOWworker24   calling group.subgroupdone [131, 132]


@adofsauron
Collaborator Author





(X_51=   [100095]:bat[:oid], C_52=[4]:bat[:oid])         := group.group(X_39=[100095]:bat[:str]);                                                                                       |
(X_51=   [100095]:bat[:oid], C_52=[4]:bat[:oid])         := group.group(X_39=[100095]:bat[:str]);                                                                                       |
(X_53=   [100095]:bat[:oid], C_54=[4]:bat[:oid])         := group.group(X_34=[100095]:bat[:str]);                                                                                       |
(X_53=   [100095]:bat[:oid], C_54=[4]:bat[:oid])         := group.group(X_34=[100095]:bat[:str]);                                                                                       |
(X_61=   [100095]:bat[:oid], C_62=[4]:bat[:oid])         := group.group(X_41=[100095]:bat[:str]);                                                                                       |
(X_61=   [100095]:bat[:oid], C_62=[4]:bat[:oid])         := group.group(X_41=[100095]:bat[:str]);                                                                                       |
(X_70=   [100095]:bat[:oid], C_71=[4]:bat[:oid])         := group.group(X_59=[100095]:bat[:str]);                                                                                       |
(X_70=   [100095]:bat[:oid], C_71=[4]:bat[:oid])         := group.group(X_59=[100095]:bat[:str]);                                                                                       |
(X_72=   [100095]:bat[:oid], C_73=[4]:bat[:oid])         := group.group(X_45=[100095]:bat[:str]);                                                                                       |
(X_72=   [100095]:bat[:oid], C_73=[4]:bat[:oid])         := group.group(X_45=[100095]:bat[:str]);                                                                                       |
(X_74=   [100097]:bat[:oid], C_75=[4]:bat[:oid])         := group.group(X_68=[100097]:bat[:str]);                                                                                       |
(X_74=   [100097]:bat[:oid], C_75=[4]:bat[:oid])         := group.group(X_68=[100097]:bat[:str]);    
                                                                                   |
(X_78=   [100095]:bat[:oid], C_79=[4000]:bat[:oid])      := group.subgroup(X_55=[100095]:bat[:int], X_53=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_80=   [100095]:bat[:oid], C_81=[4000]:bat[:oid])      := group.subgroup(X_44=[100095]:bat[:int], X_51=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_82=   [100095]:bat[:oid], C_83=[4000]:bat[:oid])      := group.subgroup(X_37=[100095]:bat[:int], X_61=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_84=   [100097]:bat[:oid], C_85=[4000]:bat[:oid])      := group.subgroup(X_48=[100097]:bat[:int], X_74=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_86=   [100095]:bat[:oid], C_87=[4000]:bat[:oid])      := group.subgroup(X_30=[100095]:bat[:int], X_70=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense                 |
(X_88=   [100095]:bat[:oid], C_89=[4000]:bat[:oid])      := group.subgroup(X_32=[100095]:bat[:int], X_72=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense      
           
(X_90=   [100095]:bat[:oid], C_91=[78696]:bat[:oid])     := group.subgroupdone(X_47=[100095]:bat[:lng], X_78=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense            |
(X_98=   [100097]:bat[:oid], C_99=[78707]:bat[:oid])     := group.subgroupdone(X_42=[100097]:bat[:lng], X_84=[100097]:bat[:oid]); # GRP_create_partial_hash_table, dense            |
(X_104=  [100095]:bat[:oid], C_105=[78787]:bat[:oid])    := group.subgroupdone(X_76=[100095]:bat[:lng], X_82=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense          |
(X_114=  [100095]:bat[:oid], C_115=[78962]:bat[:oid])    := group.subgroupdone(X_38=[100095]:bat[:lng], X_86=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense          |
(X_129=  [100095]:bat[:oid], C_130=[78618]:bat[:oid])    := group.subgroupdone(X_64=[100095]:bat[:lng], X_80=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense          |
(X_160=  [100095]:bat[:oid], C_161=[78880]:bat[:oid])    := group.subgroupdone(X_31=[100095]:bat[:lng], X_88=[100095]:bat[:oid]); # GRP_create_partial_hash_table, dense  
        
(X_0=    [472650]:bat[:oid], C_1=[4]:bat[:oid])          := group.group(X_2=[472650]:bat[:str]);                                                                                          |
(X_0=    [472650]:bat[:oid], C_1=[4]:bat[:oid])          := group.group(X_2=[472650]:bat[:str]);                                                                                          |
(X_3=    [472650]:bat[:oid], C_4=[4000]:bat[:oid])       := group.subgroup(X_5=[472650]:bat[:int], X_0=[472650]:bat[:oid]); # GRP_create_partial_hash_table, dense                     |
(X_202=  [472650]:bat[:oid], C_203=[189919]:bat[:oid])   := group.subgroupdone(X_196=[472650]:bat[:lng], X_3=[472650]:bat[:oid]); # GRP_create_partial_hash_table, dense         |


@adofsauron
Collaborator Author

While the disk IO layout follows the same vectorized structure as RAM, there are several different design strategies for the multithreaded task system, and the key trade-offs need to be analyzed in detail

@adofsauron
Collaborator Author

The problem is memory visibility. A more efficient design would use a completely lock-free approach with no visibility conflicts at all (see the sketch below)
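
A minimal sketch of that lock-free direction: each worker aggregates its range into a private table, so no memory is shared while the workers run, and join() establishes the happens-before edge before a single-threaded merge. Names and types are illustrative:

// Sketch only: share-nothing partial aggregation with a final merge.
#include <algorithm>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Partial = std::unordered_map<int64_t, int64_t>;  // key -> running SUM

Partial ParallelSum(const std::vector<std::pair<int64_t, int64_t>>& rows,
                    unsigned threads) {
  std::vector<Partial> parts(threads);
  std::vector<std::thread> pool;
  size_t chunk = (rows.size() + threads - 1) / threads;
  for (unsigned t = 0; t < threads; ++t) {
    pool.emplace_back([&parts, &rows, chunk, t] {
      size_t begin = t * chunk;
      size_t end = std::min(rows.size(), begin + chunk);
      for (size_t i = begin; i < end; ++i)
        parts[t][rows[i].first] += rows[i].second;  // private table, no locks
    });
  }
  for (auto& th : pool) th.join();  // all partial tables now visible here
  Partial total;
  for (auto& p : parts)
    for (auto& [k, v] : p) total[k] += v;  // single-threaded merge
  return total;
}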

adofsauron assigned and then unassigned adofsauron Apr 27, 2023
adofsauron removed their assignment Jun 14, 2023