Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(function): change retention return type from Variant to Array<T> #5302

Merged
merged 1 commit into from
May 15, 2022

Conversation

fkuner
Copy link
Contributor

@fkuner fkuner commented May 11, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Now the get method of Array is supported.

We can change the return type of retention function from Variant to Array.

Changelog

  • New Feature

Related Issues

Fixes #5226

@mergify
Copy link
Contributor

mergify bot commented May 11, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@vercel
Copy link

vercel bot commented May 11, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Updated
databend ⬜️ Ignored (Inspect) May 15, 2022 at 8:10AM (UTC)

@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label May 11, 2022
@fkuner fkuner requested review from b41sh and sundy-li and removed request for BohuTANG May 11, 2022 12:23
@BohuTANG
Copy link
Member

BohuTANG commented May 15, 2022

Here is the ClickHouse performance of retention on my laptop, FYI:

SELECT version()

Query id: b0028bdf-53cc-4590-97b8-d5aed83844bd

┌─version()─┐
│ 22.2.2.1  │
└───────────┘


CREATE TABLE retention_test
(
    `date` DATE,
    `uid` INT
)
ENGINE = MergeTree
PARTITION BY date
ORDER BY uid


INSERT INTO retention_test SELECT '2018-08-06', number FROM numbers(10000000);
INSERT INTO retention_test SELECT '2018-08-07', number FROM numbers(9000000);
INSERT INTO retention_test SELECT '2018-08-08', number FROM numbers(1000000);

Performance:

SELECT
    sum(r[1]) AS r1,
    sum(r[2]) AS r2,
    sum(r[3]) AS r3
FROM
(
    SELECT
        uid,
        retention(date = '2018-08-06', date = '2018-08-07', date = '2018-08-08') AS r
    FROM retention_test
    GROUP BY uid
)

Query id: 7e90f2ca-62f4-419a-bf60-4aee4691e0f9

┌───────r1─┬──────r2─┬──────r3─┐
│ 10000000 │ 9000000 │ 1000000 │
└──────────┴─────────┴─────────┘

1 rows in set. Elapsed: 1.224 sec. Processed 20.00 million rows, 120.00 MB (16.34 million rows/s., 98.04 MB/s.)

@fkuner
Copy link
Contributor Author

fkuner commented May 15, 2022

Here is the ClickHouse performance of retention on my laptop, FYI:

SELECT version()

Query id: b0028bdf-53cc-4590-97b8-d5aed83844bd

┌─version()─┐
│ 22.2.2.1  │
└───────────┘


CREATE TABLE retention_test
(
    `date` DATE,
    `uid` INT
)
ENGINE = MergeTree
PARTITION BY date
ORDER BY uid


INSERT INTO retention_test SELECT '2018-08-06', number FROM numbers(10000000);
INSERT INTO retention_test SELECT '2018-08-07', number FROM numbers(9000000);
INSERT INTO retention_test SELECT '2018-08-08', number FROM numbers(1000000);

Performance:

SELECT
    sum(r[1]) AS r1,
    sum(r[2]) AS r2,
    sum(r[3]) AS r3
FROM
(
    SELECT
        uid,
        retention(date = '2018-08-06', date = '2018-08-07', date = '2018-08-08') AS r
    FROM retention_test
    GROUP BY uid
)

Query id: 7e90f2ca-62f4-419a-bf60-4aee4691e0f9

┌───────r1─┬──────r2─┬──────r3─┐
│ 10000000 │ 9000000 │ 1000000 │
└──────────┴─────────┴─────────┘

1 rows in set. Elapsed: 1.224 sec. Processed 20.00 million rows, 120.00 MB (16.34 million rows/s., 98.04 MB/s.)

Got it. I will compare our performance with ck.

@BohuTANG
Copy link
Member

Have tested this PR:

mysql> CREATE TABLE retention_test
    -> (
    ->     `date` DATE,
    ->     `uid` INT
    -> );
Query OK, 0 rows affected (0.01 sec)
Read 0 rows, 0.00 B in 0.008 sec., 0 rows/sec., 0.00 B/sec.

mysql> INSERT INTO retention_test SELECT '2018-08-06', number FROM numbers(10000000);
Query OK, 0 rows affected (0.42 sec)
Read 10000000 rows, 76.29 MiB in 0.412 sec., 24.25 million rows/sec., 185.00 MiB/sec.

mysql> INSERT INTO retention_test SELECT '2018-08-07', number FROM numbers(9000000);
Query OK, 0 rows affected (0.35 sec)
Read 9000000 rows, 68.66 MiB in 0.345 sec., 26.12 million rows/sec., 199.29 MiB/sec.

mysql> INSERT INTO retention_test SELECT '2018-08-08', number FROM numbers(1000000);


mysql> SELECT     sum(r[0]) AS r1,     sum(r[1]) AS r2,     sum(r[2]) AS r3 FROM (     SELECT         uid,         retention(date = '2018-08-06', date = '2018-08-07', date = '2018-08-08') AS r     FROM retention_test     GROUP BY uid );
+----------+---------+---------+
| r1       | r2      | r3      |
+----------+---------+---------+
| 10000000 | 9000000 | 1000000 |
+----------+---------+---------+
1 row in set (5.78 sec)
Read 20000000 rows, 152.59 MiB in 5.760 sec., 3.47 million rows/sec., 26.49 MiB/sec.

@fkuner
Copy link
Contributor Author

fkuner commented May 15, 2022

I have optimized the implementation just now. The performance can decrease from 3 seconds to 2.4 seconds in my machine. But it still has a gap with ck. The reason should be that the cardinal number is too high, but we haven't use the parallel merge.

@BohuTANG BohuTANG requested a review from sundy-li May 15, 2022 08:51
@BohuTANG
Copy link
Member

With the new patch, the performance(5.760 to 4.226):

mysql> SELECT     sum(r[0]) AS r1,     sum(r[1]) AS r2,     sum(r[2]) AS r3 FROM (     SELECT         uid,         retention(date = '2018-08-06', date = '2018-08-07', date = '2018-08-08') AS r     FROM retention_test     GROUP BY uid );
+----------+---------+---------+
| r1       | r2      | r3      |
+----------+---------+---------+
| 10000000 | 9000000 | 1000000 |
+----------+---------+---------+
1 row in set (4.26 sec)
Read 20000000 rows, 152.59 MiB in 4.226 sec., 4.73 million rows/sec., 36.10 MiB/sec.

@BohuTANG BohuTANG merged commit 9597485 into databendlabs:main May 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need-review pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change return type of retention function from JsonValue to Array<UInt8>
5 participants