From d8a4a87831593a16726a8155851b66e475f74e89 Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Sat, 8 Feb 2025 09:38:34 +0800
Subject: [PATCH 1/6] init

---
 .../aggregate-group-by-functions.md | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 6fe8a1069894f..e40570f647d16 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -64,6 +64,32 @@ In addition, TiDB also provides the following aggregate functions:
 
 Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
 
++ `APPROX_COUNT_DISTINCT(expr)`
+
+    This function returns the approximate distinct count of `expr`. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
+
+    The following example shows how to use this function:
+
+    ```sql
+    DROP TABLE IF EXISTS t;
+    CREATE TABLE t(a INT, b INT);
+    INSERT INTO t VALUES(1, 1), (2, 1), (2, 1), (3, 1), (5, 2), (5, 2), (6, 2), (7, 2);
+    ```
+
+    ```sql
+    SELECT APPROX_COUNT_DISTINCT(a) FROM t GROUP BY b;
+    ```
+
+    ```sql
+    +--------------------------+
+    | APPROX_COUNT_DISTINCT(a) |
+    +--------------------------+
+    |                        3 |
+    |                        3 |
+    +--------------------------+
+    1 row in set (0.00 sec)
+    ```
+
 ## GROUP BY modifiers
 
 TiDB does not currently support `GROUP BY` modifiers such as `WITH ROLLUP`. We plan to add support in the future. See [TiDB #4250](https://github.com/pingcap/tidb/issues/4250).
From 87991b9121d2a757db8ed8690ee4d27d16b32595 Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Sat, 8 Feb 2025 16:19:22 +0800
Subject: [PATCH 2/6] address comments

---
 .../aggregate-group-by-functions.md | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index e40570f647d16..bad370c69623a 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -64,30 +64,30 @@ In addition, TiDB also provides the following aggregate functions:
 
 Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
 
-+ `APPROX_COUNT_DISTINCT(expr)`
++ `APPROX_COUNT_DISTINCT(expr, [expr...])`
 
-    This function returns the approximate distinct count of `expr`. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
+    The usage of this function is almost same with `COUNT(DISTINCT)` but returns approximate result. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
 
     The following example shows how to use this function:
 
     ```sql
     DROP TABLE IF EXISTS t;
-    CREATE TABLE t(a INT, b INT);
-    INSERT INTO t VALUES(1, 1), (2, 1), (2, 1), (3, 1), (5, 2), (5, 2), (6, 2), (7, 2);
+    CREATE TABLE t(a INT, b INT, c INT);
+    INSERT INTO t VALUES(1, 1, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1), (5, 1, 2), (5, 1, 2), (6, 1, 2), (7, 1, 2);
     ```
 
     ```sql
-    SELECT APPROX_COUNT_DISTINCT(a) FROM t GROUP BY b;
+    SELECT APPROX_COUNT_DISTINCT(a, b) FROM t GROUP BY c;
     ```
 
     ```sql
-    +--------------------------+
-    | APPROX_COUNT_DISTINCT(a) |
-    +--------------------------+
-    |                        3 |
-    |                        3 |
-    +--------------------------+
-    1 row in set (0.00 sec)
+    +-----------------------------+
+    | approx_count_distinct(a, b) |
+    +-----------------------------+
+    |                           3 |
+    |                           4 |
+    +-----------------------------+
+    2 rows in set (0.00 sec)
     ```
 
 ## GROUP BY modifiers

From 0b14c30ba1000ba6d6b4e959c4edf9a821ad4c7d Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Wed, 12 Feb 2025 10:55:43 +0800
Subject: [PATCH 3/6] Update functions-and-operators/aggregate-group-by-functions.md

Co-authored-by: Grace Cai
---
 functions-and-operators/aggregate-group-by-functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index bad370c69623a..01045c24e03b4 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -66,7 +66,7 @@ Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the pre
 
 + `APPROX_COUNT_DISTINCT(expr, [expr...])`
 
-    The usage of this function is almost same with `COUNT(DISTINCT)` but returns approximate result. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
+    This function is similar to `COUNT(DISTINCT)` in counting the number of distinct values but returns an approximate result. It uses the `BJKST` algorithm, significantly reducing memory consumption when processing large datasets with a power-law distribution. Moreover, for low-cardinality data, this function provides high accuracy while maintaining efficient CPU utilization.
 
     The following example shows how to use this function:
 

From d3f28b92827e57cc520daf5071a483aeccf9cf5d Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Wed, 12 Feb 2025 10:56:47 +0800
Subject: [PATCH 4/6] Update functions-and-operators/aggregate-group-by-functions.md

---
 functions-and-operators/aggregate-group-by-functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 01045c24e03b4..366cfb7199846 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -80,7 +80,7 @@ Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the pre
     SELECT APPROX_COUNT_DISTINCT(a, b) FROM t GROUP BY c;
     ```
 
-    ```sql
+    ```
     +-----------------------------+
     | approx_count_distinct(a, b) |
     +-----------------------------+

From 80f1f50d26ac89315155591407085f94a6ad4d25 Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Wed, 12 Feb 2025 11:24:06 +0800
Subject: [PATCH 5/6] tweaking

---
 functions-and-operators/aggregate-group-by-functions.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 366cfb7199846..7eb458f2f6b5e 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -62,8 +62,6 @@ In addition, TiDB also provides the following aggregate functions:
     1 row in set (0.00 sec)
     ```
 
-Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
-
 + `APPROX_COUNT_DISTINCT(expr, [expr...])`
 
     This function is similar to `COUNT(DISTINCT)` in counting the number of distinct values but returns an approximate result. It uses the `BJKST` algorithm, significantly reducing memory consumption when processing large datasets with a power-law distribution. Moreover, for low-cardinality data, this function provides high accuracy while maintaining efficient CPU utilization.
@@ -90,6 +88,8 @@ Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the pre
     2 rows in set (0.00 sec)
     ```
 
+Except for the `GROUP_CONCAT()`, `APPROX_PERCENTILE()` and `APPROX_COUNT_DISTINCT` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
+
 ## GROUP BY modifiers
 
 TiDB does not currently support `GROUP BY` modifiers such as `WITH ROLLUP`. We plan to add support in the future. See [TiDB #4250](https://github.com/pingcap/tidb/issues/4250).
From 96d838dc9c0f5f1ab9c4ce3b84aafdf4069581ad Mon Sep 17 00:00:00 2001
From: Grace Cai
Date: Wed, 12 Feb 2025 11:29:23 +0800
Subject: [PATCH 6/6] minor punctuation changes

---
 functions-and-operators/aggregate-group-by-functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 7eb458f2f6b5e..cc442c227e9c6 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -88,7 +88,7 @@ In addition, TiDB also provides the following aggregate functions:
     2 rows in set (0.00 sec)
     ```
 
-Except for the `GROUP_CONCAT()`, `APPROX_PERCENTILE()` and `APPROX_COUNT_DISTINCT` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
+Except for the `GROUP_CONCAT()`, `APPROX_PERCENTILE()`, and `APPROX_COUNT_DISTINCT` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
 
 ## GROUP BY modifiers