Skip to content

[Enhancement] Improve the performance of query with IN predicate #3693

@morningman

Description

@morningman

Describe the bug
In the test, we found that a simple single-table aggregate query, even in the case of hit rollup, still can not achieve a high QPS.

Table: id INT, name varchar(32), v1 INT SUM

Query: SELECT SUM(v1) FROM tbl WHERE id IN (1,2,3,4,5,6, ... (more than 100));

The phenomenon is that the CPU util is 100%, and the delay of a single query is significantly increased.

Why
I found that Doris use BE config doris_max_scan_key_num to limit both the number of scan keys and the number of conditions of a single column that can be pushed down to the storage engine.

In above example, when the elements of IN predicate is larger than doris_max_scan_key_num,
no conditions will be pushed to storage engine for column id, So that none of index of column id will be used. This causes the storage engine to read all the rows and need to filter at the query layer, thus making CPU util high.

How to solve
The BE config doris_max_scan_key_num should only be used to limit the number of scan keys. And we need another config to limit the number of conditions of a single column that can be pushed down to the storage engine.

Solution
I add a new BE config max_pushdown_conditions_per_column, default is 1024.

The test case is as follows:

  1. Env: 3 BE and 1 FE, Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz, 56 core
  2. Table:
+-----------------------------+--------------+------+-------+------------+-------+
| Field                       | Type         | Null | Key   | Default    | Extra |
+-----------------------------+--------------+------+-------+------------+-------+
| date_partition              | DATE         | No   | true  | 1900-01-01 |       |
| user_name                   | VARCHAR(256) | No   | true  |            |       |
| id                          | INT          | No   | true  | N/A        |       |
...
...                    
| commit                      | INT          | No   | false | 0          | SUM   |
+-----------------------------+--------------+------+-------+------------+-------+
25 rows in set (0.01 sec)

about 3674904 rows
  1. SQL:
SELECT SUM(`commit`)  FROM tbl WHERE date_partition >= '2019-11-11' AND date_partition <= '2020-05-11' AND id IN (16730, 74818, 2575, 63428, 15571, 74345, 55241, 52375, 18464, 36687, 22306, 49033, 37124, 14697, 75679, 72319, 50219, 74723, 50778, 37782, 22152, 3573, 1567, 65673, 4366, 52784, 46973, 32716, 37056, 75585, 18418, 66659, 72058, 5423, 78679, 10656, 65591, 64949, 26560, 2709, 60128, 64406, 52200, 41163, 60266, 21974, 79801, 32669, 25025, 6374, 56828, 80599, 13163, 30660, 64337, 2134, 75150, 2985, 66069, 19964, 6054, 24464, 36888, 22353, 37502, 42207, 6411, 53184, 45586, 6169, 39933, 40313, 69984, 52314, 34314, 29295, 43091, 51925, 39434, 839, 27399, 25053, 12930, 39808, 68532, 77795, 33113, 34423, 28728, 38296, 53803, 55817, 63150, 3177, 53253, 27514, 24087, 1097, 21326, 33732, 63689, 52950, 1160, 68816, 76697, 55888, 26789, 42823, 63517, 67566, 38109, 41345, 39837, 78576, 32666, 37680, 78646, 55641, 5221, 80058, 65142, 40914, 72088, 55113, 1184, 45254, 54959, 37503, 1881, 66861, 29902, 2017, 34485, 74905, 29711, 3691, 2508, 66662, 23710, 58297, 42831, 80003, 38781, 13246, 21610, 4402, 52361, 47242, 73721, 17362, 51488, 239, 61924, 38887, 29551, 11765, 15387);
  1. Test tool: Apache jmeter, 50 threads.

  2. Result:

Origin Doris version:

251.5/s Avg: 198ms Min: 68ms Max: 805ms, CPU util: 92%

New Doris version:

604.5/s Avg: 82ms Min: 30ms Max: 369ms, CPU util: 65%

New Doris version with bitmap index on column id:

734.9/s Avg: 67ms Min: 18ms Max: 331ms, CPU util: 50%

image

The improve is significant.

I aslo add 2 new session variable max_scan_key_num and doris_max_scan_key_num which can overwrite the 2 configs in BE config. Because there is no CBO in Doris now, we need provide a method to let user change these 2 config to meet different query scenarios.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions