Grouping on arrays as arrays #12078
Conversation
I don't think putting the output format into the data source spec is a good design. Is it possible to extend current
Hey @FrankChen021 Thanks for looking into it. I am not sure that I understand your concern. We generally put the column type in the dimension spec, no? Check out the class. The PR extends the functionality of the group-by engine. The grouping keys themselves are changing. I am not sure how
Just wondering - can't you achieve the same thing with MV_TO_STRING?
@dbardbar Good question. The goal here is to make the Druid SQL layer similar to other SQL layers. In traditional DBs like Postgres, the SQL construct for declaring an array is an 'array'. I am using that construct itself to hook the SQL layer into the native layer. Just finishing up some test cases for that.
@cryptoe - two questions
Is it possible to share a sample query here? I am trying to understand the use case. Do you want to group by on an array?
If the original string dimension is of low cardinality, then IMHO this implementation will be slightly more performant with regard to memory usage across the historicals and brokers, as we are doing optimizations in the way we look up values.
The use case is that some of the values in the MV field are not interesting for a specific query, and we would like to ignore them for the purpose of the GROUP BY. They are kept there because those ignored tags might be used for filtering, or might be used for GROUP BY when performing a different query. An example query, using the example data appearing at the top of this PR: SELECT MV_FILTER_ONLY(tags, ARRAY['t3', 't4']), COUNT(*) FROM test GROUP BY 1. With your new code enabled, I would expect the following to return:
thanks for looking into this, super into being able to group on arrays natively 👍
@@ -1357,6 +1373,12 @@ private RowBasedKeySerdeHelper makeSerdeHelper(
  )
  {
    switch (valueType.getType()) {
      case ARRAY:
        return new ArrayRowBasedKeySerdeHelper(
this doesn't look like it handles all arrays, though using the TypeStrategy added in #11888 would maybe give the necessary comparators to handle the other types of arrays, but arrayDictionary would need to be able to hold any type of array, not just strings.
Acked. Working on it.
if (plannerContext.getQueryContext()
                  .getOrDefault(
                      QueryContexts.ENABLE_UNNESTED_ARRAYS_KEY,
                      QueryContexts.DEFAULT_ENABLE_UNNESTED_ARRAYS
                  ).equals(Boolean.FALSE)) {
  outputType = Calcites.getValueTypeForRelDataTypeFull(dataType);
} else {
  outputType = Calcites.getColumnTypeForRelDataType(dataType);
}
if grouping on arrays supported all array types, the correct thing to do instead of a flag, I think, would just be to consolidate getColumnTypeForRelDataType and getValueTypeForRelDataTypeFull to remove the string coercion and let arrays stay as arrays. They are separate basically because we allow using array functions on string-typed columns, but grouping on them requires that they remain string-typed in the native layer, and if that were no longer true we wouldn't have to do this coercion any longer.
acked.
@cryptoe It's my fault for misunderstanding your design. Just ignore the comment I left before.
Thanks for the review.
1. Removing ResultRowDeserializer
2. Removing dimension spec as part of columnSelector
Moving the PR out of draft.
Sure. Collating all the next steps:
case DOUBLE:
  return new ArrayDoubleGroupByColumnSelectorStrategy();
case FLOAT:
  // Array<Float> not supported in expressions, ingestion
note to self, double check this is actually true at this layer (if it is not, it might possibly be handled with the Double strategy). While it is definitely true that FLOAT doesn't exist in expressions, and so within expressions there exists no float array, this type might still be specified by the SQL planner, for example whenever some float column is added into an array. I'm unsure if the expression selector's column capabilities would report ARRAY<FLOAT> or ARRAY<DOUBLE> as the type of the virtual column; I know it coerces DOUBLE back to FLOAT when the planner requests FLOAT types, but I don't think it does the same thing for ARRAY types, so this is probably true.
* only coerce multi-value string null values when `ExpressionPlan.Trait.NEEDS_APPLIED` is set
* correct return type inference for ARRAY_APPEND, ARRAY_PREPEND, ARRAY_SLICE, ARRAY_CONCAT
* fix bug with ExprEval.ofType when actual type of object from binding doesn't match its claimed type
lot of stuff I can think to do as follow-ups, but the basic behavior lgtm, i think this is a good start 👍
As part of #12078, one of the follow-ups was to have a specific config which does not allow accidental unnesting of multi-value columns if such columns become part of the grouping key. Added a config, groupByEnableMultiValueUnnesting, which can be set in the query context. The default value of groupByEnableMultiValueUnnesting is true, so it does not change the current engine behavior. If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.
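A hedged sketch of how this flag would surface (the datasource "test" and multi-value column "tags" are hypothetical names, not taken from this PR):

SELECT tags, COUNT(*)
FROM "test"
GROUP BY tags

With the query context {"groupByEnableMultiValueUnnesting": false}, a query like this would be expected to fail because "tags" is a multi-value column used directly as a grouping key, whereas rewriting the key as MV_TO_ARRAY(tags) keeps it as an array key and should still be accepted.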
Description
Currently, grouping on multi-value columns inside Druid works by exploding each value, as described in the Druid docs: for an example datasource, a GROUP BY on the multi-value column returns a result in which the original rows are "exploded" into multiple rows and then merged (a hypothetical sketch of this behavior follows).
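A minimal sketch of the existing behavior, using hypothetical names (datasource "test", multi-value string column "tags" with rows such as ["t1", "t2"] and ["t2", "t3"]):

SELECT tags, COUNT(*)
FROM "test"
GROUP BY tags

With the current engine, this returns one output row per individual tag value (t1, t2, t3, ...), with counts accumulated per value, rather than one row per distinct array.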
The goal of this PR is to group on multi-value columns as arrays, without exploding them. Please look at the virtual column and the dimensionSpec to see how we are activating this behavior in the native layer. To activate this behavior in SQL we use a new function, mv_to_array, which takes in a multi-value/string column; grouping on its output returns one row per distinct array (a hedged example follows).
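A hedged sketch of the SQL form, reusing the hypothetical "test" datasource and "tags" column from the sketch above:

SELECT MV_TO_ARRAY(tags), COUNT(*)
FROM "test"
GROUP BY 1

With this PR, a row whose tags value is ["t1", "t2"] contributes to a single group keyed by the whole array ["t1", "t2"], instead of to separate groups for "t1" and "t2".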
Core Engine changes
In Druid, we were coercing arrays to strings in the Calcite layer. With this PR, we have removed the coercion, and arrays are passed as arrays to the native layer.
The idea was to copy concepts from how DictionaryBuildingStringGroupByColumnSelectorStrategy generates an int <-> List mapping on the fly. Some optimizations are done for Array[String] while generating the dictionary, keeping in mind that we want to store each string only once on the heap and reference it with an integer. As we are dealing with arrays, we reference the indexed array with yet another int, so that the least amount of heap is used when string cardinality is low.
So if we have two rows of string arrays, the global dictionary assigns an integer id to each distinct string, and each distinct array of those integer ids is in turn referenced by its own integer id.
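As a purely hypothetical illustration (the values are invented for this sketch, not taken from the PR): given the rows ["a", "b"] and ["b", "c"], the global string dictionary might be a -> 0, b -> 1, c -> 2, and the corresponding int arrays [0, 1] and [1, 2] would each be stored once and referenced by their own ids 0 and 1 as the grouping keys.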
Also, the in-memory structures needed to implement Comparable would be called per row. Hence ComparableStringArrays and ComparableIntArrays are coded in such a way that they don't have the overhead of Lists and Integers.
Key changed/added classes in this PR
ComparableStringArray.java
ComparableIntArray.java
ResultRowDeserializer.java
RowBasedGrouperHelper.java
ArrayConstructorOperatorConversion.java
ArrayStringGroupByColumnSelectorStrategy.java
ListGroupByColumnSelectorStrategy.java
ComparableList.java
Calcites.java
Function.java
To enable the legacy behavior, for whatever reason, one can set the runtime property -Ddruid.expressions.processArraysAsMultiValueStrings=true while starting the broker.