[Enhancement](function) Add more detail explanation about approx_count_distinct function #1458

Merged 1 commit on Dec 5, 2024
@@ -24,26 +24,27 @@ specific language governing permissions and limitations
under the License.
-->

## APPROX_COUNT_DISTINCT
### Description
#### Syntax

`APPROX_COUNT_DISTINCT(expr)`

An aggregate function that returns an approximation of the result of `COUNT(DISTINCT col)`.

It is implemented with the HyperLogLog algorithm, which estimates the cardinality of a column in a fixed amount of memory. The algorithm relies on an assumption about the distribution of trailing zeros in the hashed values, so its accuracy depends on the data distribution. With the fixed number of buckets used by Doris, the relative standard error of the algorithm is 0.8125%.

It is faster than combining COUNT and DISTINCT, and because its memory use is fixed, it needs less memory for high-cardinality columns.

For a more detailed analysis, see the [related paper](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf).
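
The 0.8125% figure is consistent with the standard HyperLogLog error bound from the paper linked above, which gives a relative standard error of roughly `1.04 / sqrt(m)` for `m` registers (buckets). As a worked check, assume `m = 2^14 = 16384` registers. This value is an assumption chosen because it reproduces the documented figure; it is not stated in this description.

```latex
\sigma_{\mathrm{rel}} \approx \frac{1.04}{\sqrt{m}}
  = \frac{1.04}{\sqrt{16384}}
  = \frac{1.04}{128}
  = 0.008125
  = 0.8125\%
```

Under this bound, quadrupling the number of registers would halve the error, which is why the error is fixed once the bucket count is fixed.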

### Example

```sql
MySQL > select approx_count_distinct(query_id) from log_statis group by datetime;
+-----------------------------------+
| approx_count_distinct(`query_id`) |
+-----------------------------------+
| 17721                             |
+-----------------------------------+
```
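
To make the description of HyperLogLog more concrete, here is a minimal Python sketch of the estimator. It is an illustration under stated assumptions, not Doris's implementation: the function name `hll_estimate`, the parameter `p`, the use of truncated SHA-1 as the hash, and the default of `2**14` registers are all choices made for this sketch.

```python
import hashlib
import math


def hll_estimate(values, p=14):
    """Estimate the number of distinct items in `values` with HyperLogLog.

    `p` is the number of index bits, so m = 2**p registers are used
    no matter how many values are fed in (hypothetical helper for this sketch).
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value; truncated SHA-1 is an arbitrary choice here.
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h & (m - 1)   # low p bits select the register
        rest = h >> p       # remaining 64 - p bits
        if rest == 0:
            rank = 64 - p + 1
        else:
            # 1 + number of trailing zero bits in `rest`
            rank = (rest & -rest).bit_length()
        if rank > registers[idx]:
            registers[idx] = rank

    alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    # 400,000 inputs but only 200,000 distinct values.
    data = (i % 200_000 for i in range(400_000))
    print(round(hll_estimate(data)))
```

Each register only keeps the largest rank it has seen, so memory stays at `m` small integers regardless of input size, and duplicates never change the result. That is the property the description refers to as fixed-size memory.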
### Keywords

APPROX_COUNT_DISTINCT
@@ -24,25 +24,27 @@ specific language governing permissions and limitations
under the License.
-->

## APPROX_COUNT_DISTINCT
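### Description
#### Syntax

`APPROX_COUNT_DISTINCT(expr)`

An aggregate function that returns an approximation of the result of `COUNT(DISTINCT col)`.

It is implemented with the HyperLogLog algorithm, which estimates the cardinality of a column in a fixed amount of memory. The algorithm relies on an assumption about the distribution of trailing zeros, so its accuracy depends on the data distribution. With the fixed bucket size used by Doris, the relative standard error of the algorithm is 0.8125%.

It is faster than combining COUNT and DISTINCT, and because its memory use is fixed, it needs less memory for high-cardinality columns.

For a more detailed analysis, see the [related paper](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf).

### Example

```sql
MySQL > select approx_count_distinct(query_id) from log_statis group by datetime;
+-----------------------------------+
| approx_count_distinct(`query_id`) |
+-----------------------------------+
| 17721                             |
+-----------------------------------+
```

### Keywords

APPROX_COUNT_DISTINCT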