Improve performance of COUNT (distinct x) for dictionary columns #258 #5554
Conversation
This is a good improvement 👍 But I have some questions about its correctness when the input dictionaries are not that normalized:
@waynexia thanks for correcting my assumption about how dicts are normalized 😀. I've made changes correcting this. There is a bit of hashing required, but significantly less than before, since we only need to hash once for each value (instead of hashing every cell). Also, since there is now a lot of shared logic between the two accumulators, I've pulled that out into funcs so both accumulators can use it.
I plan to review this PR tomorrow. Thank you @waynexia for the review
@jaylmiller thanks for the PR. It would be great to know how much the performance increased.
Thank you @jaylmiller -- this looks great. Thank you @waynexia for the initial review and for ensuring that the dictionaries are handled correctly.
I double checked that the logic now appears to handle different dictionaries correctly -- could you give it one more review @waynexia ?
Also, I wonder if you have had a chance to do any sort of benchmarking to show the improvement?
@@ -31,7 +32,7 @@ use datafusion_common::{DataFusionError, Result};
use datafusion_expr::Accumulator;

type DistinctScalarValues = ScalarValue;

type ValueSet = HashSet<DistinctScalarValues, RandomState>;
I wonder what value these type aliases add. The extra indirection of DistinctScalarValues --> ScalarValue simply seems to make things more complicated 🤔
I was a bit confused about the purpose of DistinctScalarValues as well, to be honest, but I kind of figured it was there for a good reason so I left it in 😅
In terms of the added ValueSet alias, I personally thought it made the code a bit more readable, but that is kind of subjective of course.
I think maybe we remove the DistinctScalarValues alias but keep ValueSet?
Sorry for the delay, I plan to review it tomorrow!
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Currently looking into this. Will post findings today
Great job. I also noticed this performance issue (I noticed it via ClickBench). Thanks for your work @jaylmiller.
I think some cases in ClickBench will be improved.
Ok I'll look into getting some results on these cases
The ClickBench count distinct query using dictionary columns is getting killed (this happens on main as well as on the PR) 🤔
The "UserID" column is pretty high cardinality though: is there a better ClickBench query/column pair to bench with?
cc @sundy-li |
For numeric/string args in distinct, I think it's better to have special states rather than putting the enum into the HashSet.
Another approach is to rewrite this SQL to
Cast could be overhead, if
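To illustrate the point above about specialized states: a ScalarValue-style enum pays the size of its largest variant for every entry in the hash set, while a type-specialized state stores only the raw value. A minimal std-only sketch (the Scalar enum here is a hypothetical stand-in, not DataFusion's actual ScalarValue):

```rust
use std::mem::size_of;

// Hypothetical stand-in for a ScalarValue-style enum: every variant occupies
// the size of the largest variant plus a discriminant.
#[allow(dead_code)]
enum Scalar {
    Int32(i32),
    Utf8(String), // String alone is 24 bytes on 64-bit targets
}

fn main() {
    // A HashSet<Scalar> pays the full enum size per Int32 entry (~32 bytes
    // on 64-bit); a specialized HashSet<i32> state would pay 4.
    println!("enum entry:  {} bytes", size_of::<Scalar>());
    println!("typed entry: {} bytes", size_of::<i32>());
}
```

This is the space side of the argument; a typed set also avoids the per-insert branch on the variant tag.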
The current implementation generally looks good to me. I left some little suggestions about style. Looking forward to the bench result 🚀
return Err(DataFusionError::Internal(
    "Dict key has invalid datatype".to_string(),
nit: I would prefer to add the concrete type in the error message
// calculating the size of values hashset for fixed length values,
// taking first batch size * number of batches.
// This method is faster than full_size(), however it is not suitable for variable length
// values like strings or complex types
Suggested change:
// calculating the size of values hashset for fixed length values,
// taking first batch size * number of batches.
// This method is faster than full_size(), however it is not suitable for variable length
// values like strings or complex types
→
/// calculating the size of values hashset for fixed length values,
/// taking first batch size * number of batches.
/// This method is faster than full_size(), however it is not suitable for variable length
/// values like strings or complex types
style: prefer to use document comments
}
// calculates the size as accurate as possible, call to this method is expensive
Suggested change:
}
// calculates the size as accurate as possible, call to this method is expensive
→
}

// calculates the size as accurate as possible, call to this method is expensive
style: add empty line between two fns
}
impl<K: ArrowDictionaryKeyType + std::marker::Send + std::marker::Sync>
Suggested change:
}
impl<K: ArrowDictionaryKeyType + std::marker::Send + std::marker::Sync>
→
}

impl<K: ArrowDictionaryKeyType + std::marker::Send + std::marker::Sync>
style: add empty line between two blocks
I'm not seeing any noticeable ClickBench improvements when changing the "RegionID" column to use dict encoding (setting other columns, e.g. UserID, to use dict encoding causes my datafusion process to be killed...). Maybe this change is not worth merging if ClickBench improvements aren't being seen.
@jaylmiller This optimization makes sense in most cases but might conflict with this optimization in this PR.
I am essentially just running the ClickBench queries against my PR and against main. The only change is that I am setting the RegionID column to a dict:
Do you have any recommendations, @mingmwang ?
I have never checked the ClickBench queries. Maybe you can comment out the rule
Thanks I'll try that out
I wonder if we can try a smaller subset 🤔

❯ CREATE TABLE hits as select
    arrow_cast("UserID", 'Dictionary(Int32, Utf8)') as "UserID"
  FROM 'hits.parquet'
  limit 10000000;
0 rows in set. Query took 0.776 seconds.

❯ select count(distinct "UserID") from hits;
+-----------------------------+
| COUNT(DISTINCT hits.UserID) |
+-----------------------------+
| 1530334                     |
+-----------------------------+
1 row in set. Query took 71.388 seconds.

I will try this on my benchmark machine
Any luck?
let arr = as_dictionary_array::<K>(&values[0])?;
let nvalues = arr.values().len();
// map keys to whether their corresponding value has been seen or not
let mut seen_map = vec![false; nvalues];
seen_map could become a bitmap to save space.
High-cardinality inputs relative to the batch size (like UserID in ClickBench) probably don't benefit that much from this map. I don't know the standard batch size that datafusion uses for that query, but a much larger batch size could improve the performance in this case.
Thanks for the suggestion. Just to clarify, you mean using an arrow bitmap, correct? Something like
let mut seen_map = arrow::array::BooleanBufferBuilder::new(nvalues);
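For reference, here is a minimal std-only sketch of what a bit-packed seen map could look like (SeenBitmap is a hypothetical name; in practice arrow's BooleanBufferBuilder would fill this role):

```rust
// A bit-packed "seen" map: one bit per dictionary slot instead of one byte,
// an 8x space saving over `vec![false; nvalues]`.
struct SeenBitmap {
    bits: Vec<u64>,
}

impl SeenBitmap {
    fn new(nvalues: usize) -> Self {
        // Round up to whole 64-bit words.
        SeenBitmap { bits: vec![0u64; (nvalues + 63) / 64] }
    }
    fn set(&mut self, i: usize) {
        self.bits[i / 64] |= 1u64 << (i % 64);
    }
    fn get(&self, i: usize) -> bool {
        self.bits[i / 64] & (1u64 << (i % 64)) != 0
    }
}

fn main() {
    let mut seen = SeenBitmap::new(100);
    seen.set(3);
    seen.set(99);
    assert!(seen.get(3) && seen.get(99) && !seen.get(4));
    // 100 slots fit in two u64 words (16 bytes) vs 100 bytes for Vec<bool>.
    assert_eq!(seen.bits.len(), 2);
}
```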
@mingmwang Sorry for the delay. I haven't had a chance to get back to this PR yet (currently working on #5292).
Marking as draft to signify this PR has feedback and is not waiting for another review at the moment.
Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it.
Which issue does this PR close?
Closes #258.
Rationale for this change
The count distinct physical expr was doing a lot of unnecessary hashing when run on dictionary types. Previously, every cell in the dictionary array was added to the distinct values hashset; with this change, we only need to add each dictionary value to the hashset at most once (or never, if the value is not referenced by the array).
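The idea can be sketched with a simplified std-only model of a dictionary array (DictArray and the function names here are illustrative, not the PR's actual code):

```rust
use std::collections::HashSet;

// A minimal model of a dictionary array: `values` holds the distinct strings,
// `keys` holds one index per row.
struct DictArray {
    values: Vec<String>,
    keys: Vec<usize>,
}

// Old approach: hash the materialized value of every row.
fn count_distinct_per_cell(arr: &DictArray) -> usize {
    let mut set: HashSet<&str> = HashSet::new();
    for &k in &arr.keys {
        set.insert(&arr.values[k]); // hashes once per row
    }
    set.len()
}

// PR's idea: mark which dictionary slots are referenced, then hash each
// referenced value at most once, no matter how many rows point at it.
fn count_distinct_per_value(arr: &DictArray, set: &mut HashSet<String>) {
    let mut seen_map = vec![false; arr.values.len()];
    for &k in &arr.keys {
        seen_map[k] = true;
    }
    for (i, seen) in seen_map.iter().enumerate() {
        if *seen {
            set.insert(arr.values[i].clone()); // hashes once per used slot
        }
    }
}

fn main() {
    let arr = DictArray {
        values: vec!["a".into(), "b".into(), "c".into()],
        keys: vec![0, 1, 0, 0, 1], // "c" is never referenced
    };
    let mut set = HashSet::new();
    count_distinct_per_value(&arr, &mut set);
    assert_eq!(set.len(), 2);
    assert_eq!(count_distinct_per_cell(&arr), 2);
}
```

Both paths agree on the count; the per-value path just trades one pass over the keys (cheap boolean writes) for far fewer hash insertions when rows outnumber dictionary slots.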
What changes are included in this PR?
A new accumulator (CountDistinctDictAccumulator) that is returned by DistinctCount in the case that a dictionary array is being counted. There is a fair amount of shared logic between the accumulators, so that was also pulled out into helper funcs.
Are these changes tested?
Added some new unit tests.
Are there any user-facing changes?