
Improve performance of COUNT (distinct x) for dictionary columns #258 #5554

Closed
wants to merge 7 commits

Conversation

jaylmiller
Contributor

@jaylmiller jaylmiller commented Mar 11, 2023

Which issue does this PR close?

Closes #258.

Rationale for this change

The count distinct physical expr was doing a lot of unnecessary hashing when run on dictionary types. Previously, every cell in the dictionary array was added to the distinct-values hashset; with this change, each dictionary value only needs to be added at most once (and not at all if it is never referenced by the keys).

What changes are included in this PR?

A new accumulator (CountDistinctDictAccumulator) that is returned by DistinctCount in the case that a dictionary array is being counted. There is a fair amount of shared logic between the accumulators, so that was also pulled out into helper funcs.
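The approach can be sketched in plain Rust without the Arrow types (`DictArray` and `update_distinct` here are illustrative stand-ins, not the actual DataFusion API): model a dictionary array as a values list plus per-row keys, mark which values the keys actually reference, and only then touch the hash set:

```rust
use std::collections::HashSet;

/// Illustrative stand-in for an Arrow dictionary array: `values` holds
/// the dictionary entries, `keys` holds one index (or None for a null)
/// per row.
struct DictArray {
    values: Vec<String>,
    keys: Vec<Option<usize>>,
}

/// Count-distinct update over a dictionary column: instead of hashing
/// every row, first mark which dictionary entries are referenced, then
/// insert each referenced entry into the set at most once.
fn update_distinct(set: &mut HashSet<String>, arr: &DictArray) {
    let mut seen = vec![false; arr.values.len()];
    for key in arr.keys.iter().flatten() {
        seen[*key] = true;
    }
    for (idx, was_seen) in seen.iter().enumerate() {
        if *was_seen {
            set.insert(arr.values[idx].clone());
        }
    }
}

fn main() {
    let arr = DictArray {
        values: vec!["a".into(), "b".into(), "c".into()],
        keys: vec![Some(0), Some(0), None, Some(2), Some(0)],
    };
    let mut set = HashSet::new();
    update_distinct(&mut set, &arr);
    // only "a" and "c" are referenced by the keys
    println!("{}", set.len()); // prints 2
}
```

With five rows but only two referenced dictionary entries, the set is touched twice instead of four times (nulls are skipped either way).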

Are these changes tested?

Added some new unit tests.

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Mar 11, 2023
@jaylmiller jaylmiller marked this pull request as ready for review March 11, 2023 22:29
Member

@waynexia waynexia left a comment


This is a good improvement 👍 But I have some questions about its correctness when the input dictionaries are not that normalized:
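For context, a dictionary being "not normalized" means its values array can contain duplicate entries and entries that no key references, so counting dictionary entries directly over-counts. A minimal illustration (plain Rust; `referenced_distinct` is a hypothetical helper, not DataFusion code):

```rust
use std::collections::HashSet;

/// Count distinct values actually referenced by the keys; the
/// dictionary entries may be duplicated or never referenced at all.
fn referenced_distinct(values: &[&str], keys: &[Option<usize>]) -> usize {
    let set: HashSet<&str> = keys.iter().flatten().map(|&k| values[k]).collect();
    set.len()
}

fn main() {
    // "a" appears twice in the dictionary; "c" is never referenced.
    let values = ["a", "a", "b", "c"];
    let keys = [Some(0), Some(1), Some(2), None];

    // Wrong: treat every dictionary entry as a distinct, referenced value.
    assert_eq!(values.len(), 4);
    // Right: dedupe the values the keys actually point at.
    assert_eq!(referenced_distinct(&values, &keys), 2);
}
```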

@jaylmiller
Contributor Author

@waynexia thanks for correcting my assumption about how dicts are normalized 😀. I've made changes to correct this. A bit of hashing is still required, but significantly less than before, since we only need to hash each distinct value once (instead of hashing every cell).

Also, since there was now a lot of shared logic between the two accumulators, I've pulled that out into funcs so both accumulators can use it.

@alamb
Contributor

alamb commented Mar 13, 2023

I plan to review this PR tomorrow. Thank you @waynexia for the review

@comphead
Contributor

@jaylmiller thanks for the PR. It would be great to get some idea of how much performance increased.

Contributor

@alamb alamb left a comment


Thank you @jaylmiller -- this looks great. Thank you @waynexia for the initial review and for ensuring that the dictionaries are handled correctly.

I double checked that the logic now appears to handle different dictionaries correctly -- could you give it one more review @waynexia ?

Also, I wonder if you have had a chance to do any sort of benchmarking to show the improvement?

datafusion/physical-expr/src/aggregate/count_distinct.rs
@@ -31,7 +32,7 @@ use datafusion_common::{DataFusionError, Result};
use datafusion_expr::Accumulator;

type DistinctScalarValues = ScalarValue;

type ValueSet = HashSet<DistinctScalarValues, RandomState>;
Contributor


I wonder what value these type aliases add. The extra indirection of DistinctScalarValues --> ScalarValue simply seems to make things more complicated 🤔

Contributor Author


I was a bit confused about the purpose of DistinctScalarValues as well to be honest but I kind of figured it was there for a good reason so I left it in 😅

In terms of the added ValueSet alias, I personally thought it made the code a bit more readable but that is kind of subjective of course.

Contributor Author


I think maybe we remove the DistinctScalarValues alias but keep ValueSet?

datafusion/physical-expr/src/aggregate/count_distinct.rs
@waynexia
Member

Sorry for the delay, I plan to review it tomorrow!

jaylmiller and others added 2 commits March 14, 2023 11:42
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@jaylmiller
Contributor Author

Also, I wonder if you have had a chance to do any sort of benchmarking to show the improvement?

Currently looking into this. Will post findings today

@jackwener
Member

jackwener commented Mar 14, 2023

Great job. I also noticed this performance issue (I noticed it via ClickBench). Thanks for your work @jaylmiller .
I plan to review it tomorrow.

Also, I wonder if you have had a chance to do any sort of benchmarking to show the improvement?

Currently looking into this. Will post findings today

I think some cases in ClickBench will be improved.

@jackwener jackwener requested review from jackwener and removed request for waynexia March 14, 2023 16:27
@jaylmiller
Contributor Author

I think some cases in ClickBench will be improved.

Ok I'll look into getting some results on these cases

@jaylmiller
Contributor Author

jaylmiller commented Mar 14, 2023

ClickBench count distinct query when using dictionary columns is getting killed (this is on main as well as the PR) 🤔

❯ CREATE EXTERNAL TABLE hits_base
STORED AS PARQUET
LOCATION 'hits.parquet';
0 rows in set. Query took 0.041 seconds.
❯ CREATE TABLE hits as
select
  arrow_cast("UserID", 'Dictionary(Int32, Utf8)') as "UserID"
FROM hits_base;

0 rows in set. Query took 13.887 seconds.
❯ SELECT COUNT(DISTINCT "UserID") from hits;
Killed

"UserID" table is pretty high cardinality though: is there a better clickbench query/column pair to bench with?

@jackwener
Member

cc @sundy-li

@sundy-li
Contributor

sundy-li commented Mar 15, 2023

type ValueSet = HashSet<DistinctScalarValues, RandomState>;

For numeric/string args in distinct, I think it's better to have special states rather than putting the enum into the HashSet.

SELECT COUNT(DISTINCT "UserID") from hits;

Another approach is to rewrite this SQL to select count() from (select userid from hits group by userid).

arrow_cast("UserID", 'Dictionary(Int32, Utf8)') as "UserID"

The cast could add overhead; if UserID is already a Utf8 array, we just need to siphash it to u128, which is safe in a cryptographic way.
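The "special states" idea above can be sketched as follows (illustrative only; the real DataFusion ScalarValue enum is far larger than this stand-in): a set keyed by the native primitive avoids wrapping every distinct value in an enum.

```rust
use std::collections::HashSet;

/// Stand-in for a ScalarValue-style enum: every distinct value pays
/// for the enum tag, and string values pay for a heap allocation too.
#[derive(Hash, PartialEq, Eq)]
#[allow(dead_code)]
enum Scalar {
    Int64(i64),
    Utf8(String),
}

/// Generic state: one enum instance per distinct value.
fn count_generic(input: &[i64]) -> usize {
    let set: HashSet<Scalar> = input.iter().map(|&v| Scalar::Int64(v)).collect();
    set.len()
}

/// Specialized state for an Int64 column: hash the native value
/// directly, with no enum wrapping.
fn count_specialized(input: &[i64]) -> usize {
    let set: HashSet<i64> = input.iter().copied().collect();
    set.len()
}

fn main() {
    let input = [1i64, 2, 2, 3, 1];
    // Same answer, but the specialized set stores plain i64s.
    assert_eq!(count_generic(&input), count_specialized(&input));
    println!("{}", count_specialized(&input)); // prints 3
}
```

Both produce the same count; the specialized variant stores 8-byte keys rather than tagged enum values, which is the space/speed win being suggested.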

Member

@waynexia waynexia left a comment


The current implementation generally looks good to me. I left some little suggestions about style. Looking forward to the bench result 🚀

Comment on lines +130 to +131
return Err(DataFusionError::Internal(
"Dict key has invalid datatype".to_string(),
Member


nit: I would prefer to add the concrete type in the error message

Comment on lines +148 to +151
// calculating the size of values hashset for fixed length values,
// taking first batch size * number of batches.
// This method is faster than full_size(), however it is not suitable for variable length
// values like strings or complex types
Member


Suggested change
// calculating the size of values hashset for fixed length values,
// taking first batch size * number of batches.
// This method is faster than full_size(), however it is not suitable for variable length
// values like strings or complex types
/// calculating the size of values hashset for fixed length values,
/// taking first batch size * number of batches.
/// This method is faster than full_size(), however it is not suitable for variable length
/// values like strings or complex types

style: prefer to use document comments

Comment on lines +159 to +160
}
// calculates the size as accurate as possible, call to this method is expensive
Member


Suggested change
}
// calculates the size as accurate as possible, call to this method is expensive
}
// calculates the size as accurate as possible, call to this method is expensive

style: add empty line between two fns

Comment on lines +260 to +261
}
impl<K: ArrowDictionaryKeyType + std::marker::Send + std::marker::Sync>
Member


Suggested change
}
impl<K: ArrowDictionaryKeyType + std::marker::Send + std::marker::Sync>
}
impl<K: ArrowDictionaryKeyType + std::marker::Send + std::marker::Sync>

style: add empty line between two blocks

@jaylmiller
Contributor Author

jaylmiller commented Mar 16, 2023

I'm not seeing any noticeable ClickBench improvements when changing the "RegionID" column to use dict encoding (setting other columns, e.g. UserID, to use dict encoding causes my datafusion process to be killed...)

Maybe this change is not worth merging if ClickBench improvements aren't being seen

@mingmwang
Contributor

mingmwang commented Mar 16, 2023

@jaylmiller
How do you run the benchmark test? I have not had a chance to take a closer look at this PR yet.
There is a logical optimization rule in DataFusion, SingleDistinctToGroupBy, which rewrites a single distinct aggregate into a normal group-by aggregate and replaces the distinct aggregator with a normal count aggregator.

This optimization makes sense in most cases but might conflict with the optimization in this PR.

@jaylmiller
Contributor Author

I am essentially just running clickbench queries against my PR and against main. The only change is I am setting RegionID column to a dict:

CREATE EXTERNAL TABLE hits_base
STORED AS PARQUET
LOCATION 'hits.parquet';
CREATE TABLE hits as
select 
  arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') as "RegionID"
....

Do you have any recommendations, @mingmwang ?

@mingmwang
Contributor

I am essentially just running clickbench queries against my PR and against main. The only change is I am setting RegionID column to a dict:

CREATE EXTERNAL TABLE hits_base
STORED AS PARQUET
LOCATION 'hits.parquet';
CREATE TABLE hits as
select 
  arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') as "RegionID"
....

Do you have any recommendations, @mingmwang ?

I have never checked the ClickBench queries. Maybe you can comment out the rule SingleDistinctToGroupBy, run the benchmark again, and see whether there are improvements.

@jaylmiller
Contributor Author

Thanks I'll try that out

@alamb
Contributor

alamb commented Mar 16, 2023

ClickBench count distinct query when using dictionary columns is getting killed (this is on main as well as the PR) 🤔

I wonder if we can try a smaller subset 🤔

❯ CREATE TABLE hits as select
  arrow_cast("UserID", 'Dictionary(Int32, Utf8)') as "UserID"
FROM 'hits.parquet'
limit 10000000;
0 rows in set. Query took 0.776 seconds.
❯ select count(distinct "UserID") from hits;
+-----------------------------+
| COUNT(DISTINCT hits.UserID) |
+-----------------------------+
| 1530334                     |
+-----------------------------+
1 row in set. Query took 71.388 seconds.

I will try this on my benchmark machine

@mingmwang
Contributor

Any luck?
I think the rule SingleDistinctToGroupBy conflicts with this optimization, and rewriting Distinct to Group By is not always beneficial. Maybe we should add a configuration option to turn this rewriting on/off.

let arr = as_dictionary_array::<K>(&values[0])?;
let nvalues = arr.values().len();
// map keys to whether their corresponding value has been seen or not
let mut seen_map = vec![false; nvalues];
Contributor


seen_map could become a bitmap to save space.

High-cardinality inputs relative to the batch size (like UserID in ClickBench) probably don't benefit much from this map. I don't know the standard batch size that DataFusion uses for that query, but a much larger batch size could improve the performance in this case.

Contributor Author

@jaylmiller jaylmiller Mar 20, 2023


Thanks for the suggestion. Just to clarify, you mean using an arrow bitmap, correct? Something like

let mut seen_map = arrow::array::BooleanBufferBuilder::new(nvalues);

@jaylmiller
Contributor Author

jaylmiller commented Mar 20, 2023

Any luck? I think the rule SingleDistinctToGroupBy conflicts with this optimization, and rewriting Distinct to Group By is not always beneficial. Maybe we should add a configuration option to turn this rewriting on/off.

@mingmwang Sorry for delay. I haven't had a chance to get back to this PR yet (currently working on #5292).

@alamb
Contributor

alamb commented Mar 28, 2023

Marking as draft to signify this PR has feedback and is not waiting for another review at the moment.

@alamb alamb marked this pull request as draft March 28, 2023 20:27
@alamb
Contributor

alamb commented Apr 8, 2024

Since this has been open for more than a year, I am closing it down. Feel free to reopen if/when you keep working on it.

@alamb alamb closed this Apr 8, 2024
Labels: physical-expr (Physical Expressions)
8 participants