Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Optimize StringLikeSearch faster #8458

Closed
TCeason opened this issue Oct 26, 2022 · 4 comments · Fixed by #8720
Closed

Feature: Optimize StringLikeSearch faster #8458

TCeason opened this issue Oct 26, 2022 · 4 comments · Fixed by #8720
Assignees
Labels
C-feature Category: feature

Comments

@TCeason
Copy link
Collaborator

TCeason commented Oct 26, 2022

Summary

-- Databend first execute

databend> select count() from orders where o_comment like '%pending%deposits%';
+---------+
| count() |
+---------+
|  161870 |
+---------+
1 row in set (0.96 sec)
Read 15000000 rows, 920.67 MiB in 0.958 sec., 15.65 million rows/sec., 960.80 MiB/sec.


-- ClickHouse first execute

ck :) select count() from orders where o_comment not like '%pending%deposits%';

SELECT count()
FROM orders
WHERE o_comment NOT LIKE '%pending%deposits%'

Query id: f7710e81-6682-4309-8964-f5aa8705fd13

┌──count()─┐
│ 14838130 │
└──────────┘

1 row in set. Elapsed: 0.464 sec. Processed 15.00 million rows, 858.40 MB (32.35 million rows/s., 1.85 GB/s.)

After multiple executions: 2x

Databend: 0.97~0.95s 
ClickHouse: 0.45~0.37s
@TCeason TCeason added the C-feature Category: feature label Oct 26, 2022
@TCeason TCeason self-assigned this Oct 26, 2022
@TCeason
Copy link
Collaborator Author

TCeason commented Oct 26, 2022

Use the newest version Databend. Try to config max_storage_io_requests to different value. The like search query has best performance when set the value equal or greater the CPU core.

cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
      8  Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz

#release databend-query start
./target/release/databend-query 
databend> select version();
+------------------------------------------------------------------------------------+
| version()                                                                          |
+------------------------------------------------------------------------------------+
| 8.0.26-v0.8.43-nightly-ef64046(rust-1.66.0-nightly-2022-10-26T02:09:40.432768335Z) |
+------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

databend> set max_storage_io_requests=1;
Query OK, 0 rows affected (0.00 sec)

databend> select count() from orders where o_comment like '%pending%deposits%';
+---------+
| count() |
+---------+
|  161870 |
+---------+
1 row in set (3.26 sec)
Read 15000000 rows, 806.23 MiB in 3.251 sec., 4.61 million rows/sec., 248.00 MiB/sec.

databend> set max_storage_io_requests=2;
Query OK, 0 rows affected (0.01 sec)

databend> select count() from orders where o_comment like '%pending%deposits%';
+---------+
| count() |
+---------+
|  161870 |
+---------+
1 row in set (1.75 sec)
Read 15000000 rows, 806.23 MiB in 1.746 sec., 8.59 million rows/sec., 461.72 MiB/sec.

databend> set max_storage_io_requests=4;
Query OK, 0 rows affected (0.01 sec)

databend> select count() from orders where o_comment like '%pending%deposits%';
+---------+
| count() |
+---------+
|  161870 |
+---------+
1 row in set (1.02 sec)
Read 15000000 rows, 806.23 MiB in 1.008 sec., 14.88 million rows/sec., 799.94 MiB/sec.

databend> set max_storage_io_requests=8;
Query OK, 0 rows affected (0.01 sec)

databend> select count() from orders where o_comment like '%pending%deposits%';
+---------+
| count() |
+---------+
|  161870 |
+---------+
1 row in set (0.85 sec)
Read 15000000 rows, 806.23 MiB in 0.843 sec., 17.8 million rows/sec., 956.77 MiB/sec.

databend> set max_storage_io_requests=16;
Query OK, 0 rows affected (0.01 sec)

databend> select count() from orders where o_comment like '%pending%deposits%';
+---------+
| count() |
+---------+
|  161870 |
+---------+
1 row in set (0.83 sec)
Read 15000000 rows, 806.23 MiB in 0.818 sec., 18.33 million rows/sec., 985.34 MiB/sec.

databend> set max_storage_io_requests=1000;
Query OK, 0 rows affected (0.01 sec)

databend> select count() from orders where o_comment like '%pending%deposits%';
+---------+
| count() |
+---------+
|  161870 |
+---------+
1 row in set (0.84 sec)
Read 15000000 rows, 806.23 MiB in 0.833 sec., 18 million rows/sec., 967.69 MiB/sec.

@TCeason
Copy link
Collaborator Author

TCeason commented Nov 1, 2022

Try to query

databend avg cost 0.088s

clickhouse avg cost 0.087s

'root'@mysqldb 11:12:12 [tpch]> select count() from orders where o_orderkey>36901;
+----------+
| count()  |
+----------+
| 14990771 |
+----------+
1 row in set (0.10 sec)
Read 15000000 rows, 114.44 MiB in 0.088 sec., 170.41 million rows/sec., 1.27 GiB/sec.

default@mysqldb 11:12:59 [tpch]> select count() from orders where o_orderkey>36901;
+----------+
| count()  |
+----------+
| 14990771 |
+----------+
1 row in set (0.09 sec)
Read 14991808 rows, 114.38 MiB in 0.087953918 sec., 170450712 rows/sec., 1.27 GiB/sec.

So now, the like query cost time gap does seem to be in likesearch?

@TCeason
Copy link
Collaborator Author

TCeason commented Nov 2, 2022

record: In databend position is faster than like.

ck is same.

For regular expressions like %needle%, the code is more optimal and works as fast as the position function. For other regular expressions, the code is the same as for the ‘match’ function.

https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions/#likehaystack-pattern-haystack-like-pattern-operator

'root'@mysqldb 09:24:58 [tpch]> select version();
+------------------------------------------------------------------------------------+
| version()                                                                          |
+------------------------------------------------------------------------------------+
| 8.0.26-v0.8.43-nightly-ccd5ae2(rust-1.66.0-nightly-2022-11-01T02:50:57.392225465Z) |
+------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

'root'@mysqldb 09:25:47 [tpch]> select count() from orders where o_clerk like '%lerk#00000177%';
+---------+
| count() |
+---------+
|   15086 |
+---------+
1 row in set (0.42 sec)
Read 15000000 rows, 329.02 MiB in 0.420 sec., 35.72 million rows/sec., 783.40 MiB/sec.

'root'@mysqldb 09:25:51 [tpch]> select count() from orders where position('lerk#00000177' in o_clerk)>0;
+---------+
| count() |
+---------+
|   15086 |
+---------+
1 row in set (0.23 sec)
Read 15000000 rows, 329.02 MiB in 0.222 sec., 67.47 million rows/sec., 1.45 GiB/sec.
default@mysqldb 09:25:26 [tpch]> select version();
+-----------+
| version() |
+-----------+
| 22.8.5.29 |
+-----------+
1 row in set (0.00 sec)
Read 1 rows, 1.00 B in 0.001141889 sec., 875 rows/sec., 875.74 B/sec.

default@mysqldb 09:25:34 [tpch]> select count() from orders where o_clerk like '%lerk#00000177%';
+---------+
| count() |
+---------+
|   15086 |
+---------+
1 row in set (0.11 sec)
Read 15000000 rows, 343.32 MiB in 0.117024642 sec., 128178131 rows/sec., 2.87 GiB/sec.

default@mysqldb 09:25:37 [tpch]> select count() from orders where position('lerk#00000177' in o_clerk)>0;
+---------+
| count() |
+---------+
|   15086 |
+---------+
1 row in set (0.13 sec)
Read 15000000 rows, 343.32 MiB in 0.123843314 sec., 121120789 rows/sec., 2.71 GiB/sec.

@TCeason
Copy link
Collaborator Author

TCeason commented Nov 9, 2022

Try to query

databend avg cost 0.088s

clickhouse avg cost 0.087s

'root'@mysqldb 11:12:12 [tpch]> select count() from orders where o_orderkey>36901;
+----------+
| count()  |
+----------+
| 14990771 |
+----------+
1 row in set (0.10 sec)
Read 15000000 rows, 114.44 MiB in 0.088 sec., 170.41 million rows/sec., 1.27 GiB/sec.

default@mysqldb 11:12:59 [tpch]> select count() from orders where o_orderkey>36901;
+----------+
| count()  |
+----------+
| 14990771 |
+----------+
1 row in set (0.09 sec)
Read 14991808 rows, 114.38 MiB in 0.087953918 sec., 170450712 rows/sec., 1.27 GiB/sec.

So now, the like query cost time gap does seem to be in likesearch?

This result has some mistake.

Test again:

ck: 0.023
databend: 0.09

--databend-quer
mysql> select count() from orders where o_orderkey>36901;
+----------+
| count()  |
+----------+
| 14990771 |
+----------+
1 row in set (0.09 sec)
Read 15000000 rows, 114.44 MiB in 0.076 sec., 197.11 million rows/sec., 1.47 GiB/sec.

--clickhouse
databend-arch :) select count() from orders where o_orderkey>36901;

SELECT count()
FROM orders
WHERE o_orderkey > 36901

Query id: 31305d2a-82dc-4279-b227-10c34607ae56

┌──count()─┐
│ 14990771 │
└──────────┘

1 row in set. Elapsed: 0.023 sec. Processed 14.99 million rows, 119.93 MB (648.38 million rows/s., 5.19 GB/s.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category: feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant