
Grouping on arrays as arrays #12078

Merged: 42 commits into apache:master on Jan 26, 2022

Conversation

@cryptoe (Contributor) commented Dec 17, 2021

Description

Currently, grouping on multi-value columns inside Druid works by exploding each value. From the Druid docs:

For an example datasource:

{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "tags",
      "outputName": "tags"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}

This query returns the following result:

[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t1"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t2"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t3"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t4"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t5"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t6"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t7"
    }
  }
]

Notice that the original rows are "exploded" into multiple rows and then merged.

The goal of this PR is to group on multi-value columns as arrays, without exploding them. Please look at the virtual column and the dimensionSpec to see how we are activating this behavior. For example:

{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "all"
  },
  "virtualColumns" : [ {
    "type" : "expression",
    "name" : "v0",
    "expression" : "mv_to_array(\"tags\")",
    "outputType" : "ARRAY<STRING>"
  } ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "v0",
      "outputName": "tags"
      "outputType":"ARRAY<STRING>"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}

will return the following results:

[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": []
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": ["t1","t2","t3"]
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": ["t3","t4","t5"]
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": ["t5","t6","t7"]
    }
  }
]

To activate this behavior in SQL, we use a new function, mv_to_array, which takes a multi-value string column:

select mv_to_array("tags"), count(*) from inline_data group by 1

Core Engine changes

In Druid, we were coercing arrays to strings in the Calcite layer. With this PR, that coercion is removed and arrays are passed as arrays to the native layer.

The idea was to copy concepts from how DictionaryBuildingStringGroupByColumnSelectorStrategy generates an int <-> value mapping on the fly.
Some optimizations are done for ARRAY<STRING> while generating the dictionary: each string is stored only once on the heap and referenced by an integer, and, since we are dealing with arrays, each distinct array of those integers is in turn referenced by yet another int, so that the least amount of heap is used when string cardinality is low.
So if we have 2 rows:

 [a,b]
 [b,c]

the global dictionary would look like this:

"a"<>1
"b"<>2
"c"<>3

and the corresponding arrays would look like

[1,2]<>1
[2,3]<>2 
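
To make this concrete, here is a minimal, self-contained Java sketch of the two-level dictionary described above. This is illustrative only, not the PR's actual code; all names are hypothetical, and it uses 0-based ids whereas the example above is 1-based.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each distinct string is stored once and mapped to an
// int; each distinct array of those ints is mapped to yet another int, so the
// grouping key becomes a single integer per row.
class TwoLevelArrayDictionary
{
  private final Map<String, Integer> stringToId = new HashMap<>();
  private final List<String> idToString = new ArrayList<>();

  private final Map<IntArrayKey, Integer> arrayToId = new HashMap<>();
  private final List<int[]> idToArray = new ArrayList<>();

  /** Returns a single int identifying the whole array. */
  int lookupOrAdd(String[] values)
  {
    final int[] ids = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      ids[i] = stringToId.computeIfAbsent(values[i], v -> {
        idToString.add(v);
        return idToString.size() - 1;
      });
    }
    final IntArrayKey key = new IntArrayKey(ids);
    final Integer existing = arrayToId.get(key);
    if (existing != null) {
      return existing;
    }
    idToArray.add(ids);
    arrayToId.put(key, idToArray.size() - 1);
    return idToArray.size() - 1;
  }

  /** Reverse lookup, used when materializing result rows. */
  String[] lookup(int arrayId)
  {
    final int[] ids = idToArray.get(arrayId);
    final String[] out = new String[ids.length];
    for (int i = 0; i < ids.length; i++) {
      out[i] = idToString.get(ids[i]);
    }
    return out;
  }

  // Wrapper so an int[] can be used as a HashMap key.
  private static final class IntArrayKey
  {
    private final int[] ids;

    IntArrayKey(int[] ids)
    {
      this.ids = ids;
    }

    @Override
    public boolean equals(Object o)
    {
      return o instanceof IntArrayKey && Arrays.equals(ids, ((IntArrayKey) o).ids);
    }

    @Override
    public int hashCode()
    {
      return Arrays.hashCode(ids);
    }
  }
}

With the two rows above, lookupOrAdd(new String[]{"a", "b"}) and lookupOrAdd(new String[]{"b", "c"}) yield two distinct array ids, while "a", "b", and "c" are each stored on the heap exactly once.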

Also, the in-memory structures need to implement Comparable and are invoked per row. Hence ComparableStringArray and ComparableIntArray are coded so that they avoid the overhead of Lists and boxed Integers.
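
As a rough illustration of avoiding that overhead, a Comparable wrapper over a raw int[] might look like the following. This is a hypothetical sketch, not the PR's actual ComparableIntArray.

import java.util.Arrays;

// Hypothetical sketch: wraps a primitive int[] so it can serve as a grouping
// key, avoiding List<Integer> boxing; compareTo runs once per grouped row.
final class ComparableIntArraySketch implements Comparable<ComparableIntArraySketch>
{
  private final int[] delegate;
  private int hashCode;
  private boolean hashCodeComputed;

  ComparableIntArraySketch(int[] delegate)
  {
    this.delegate = delegate;
  }

  @Override
  public int compareTo(ComparableIntArraySketch rhs)
  {
    // Lexicographic order; on a shared prefix, the shorter array sorts first.
    final int common = Math.min(delegate.length, rhs.delegate.length);
    for (int i = 0; i < common; i++) {
      final int cmp = Integer.compare(delegate[i], rhs.delegate[i]);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(delegate.length, rhs.delegate.length);
  }

  @Override
  public boolean equals(Object o)
  {
    return o instanceof ComparableIntArraySketch
           && Arrays.equals(delegate, ((ComparableIntArraySketch) o).delegate);
  }

  @Override
  public int hashCode()
  {
    // Cache the hash, since the wrapper is hashed repeatedly as a map key.
    if (!hashCodeComputed) {
      hashCode = Arrays.hashCode(delegate);
      hashCodeComputed = true;
    }
    return hashCode;
  }
}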


Key changed/added classes in this PR
  • ComparableStringArray.java
  • ComparableIntArray.java
  • ResultRowDeserializer.java
  • RowBasedGrouperHelper.java
  • ArrayConstructorOperatorConversion.java
  • ArrayStringGroupByColumnSelectorStrategy.java
  • ListGroupByColumnSelectorStrategy.java
  • ComparableList.java
  • Calcites.java
  • Function.java

To enable the legacy behavior, for whatever reason, one can set a runtime property
-Ddruid.expressions.processArraysAsMultiValueStrings=true while starting the broker.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@FrankChen021 (Member) commented:

I don't think putting the output format into the data source spec is a good design. Is it possible to extend current ResultFormat to meet your need?

@cryptoe (Contributor, Author) commented Dec 19, 2021

Hey @FrankChen021, thanks for looking into it. I am not sure that I understand your concern. We generally put the column type in the dimension spec, no? Check out the class DefaultDimensionSpec; I am using that.

The PR extends the functionality of the groupBy engine; the grouping keys themselves are changing. I am not sure how ResultFormat would help here.

@dbardbar (Contributor) commented:

Just wondering - can't you achieve the same thing with MV_TO_STRING?
Also, have you given any thought about how your code will be generated from SQL?

@cryptoe (Contributor, Author) commented Dec 20, 2021

@dbardbar Good question. The goal here is to make the Druid SQL layer similar to other SQL layers. In traditional DBs like Postgres, the SQL construct for declaring an array is ARRAY. I am using that construct itself to hook the SQL layer into the native layer. Just finishing up some test cases for that.

@dbardbar (Contributor) commented:

@cryptoe - two questions

  1. Would this work also for a VirtualColumn which produces a multi-value? Specifically, VirtualColumn of type "mv-filtered".
  2. We needed the capability in this PR, but since it was missing we created another string dimension during ingestion, which holds the values of the MV column as a simple string. We then used that extra column in our GROUP BY, to achieve the same functionality as this PR. Do you think that your method would be comparable in performance?

@cryptoe (Contributor, Author) commented Dec 21, 2021

1. Would this work also for a VirtualColumn which produces a multi-value? Specifically, VirtualColumn of type "mv-filtered".

Is it possible to share a sample query here? I am trying to understand the use case. Do you want to group by an array?
https://github.com/apache/druid/pull/12078/files#diff-8bc53aec44924e671b3c6f6f34c6f0c499f873d9000649c47c237444707aea4bR975. This PR does support virtual columns.

2. We needed the capability in this PR, but since it was missing we created another string dimension during ingestion, which holds the values of the MV column as a simple string. We then used that extra column in our GROUP BY, to achieve the same functionality as this PR. Do you think that your method would be comparable in performance?

If the original string dimension is of low cardinality, then IMHO this implementation will be slightly more performant with regard to memory usage across the historicals and brokers, as we optimize the way lookups are done.

@dbardbar (Contributor) commented Dec 21, 2021

@cryptoe,

The use-case is that some of the values in the MV field are not interesting for a specific query, and we would like to ignore them for the purpose of the GROUP BY. They are kept there because those ignored tags might be used for filtering, or might be used for GROUP BY when performing a different query.

An example query, using the example data appearing at the top of this PR:

SELECT MV_FILTER_ONLY(tags, ARRAY['t3', 't4']), COUNT(*) FROM test GROUP BY 1

With your new code enabled, I would expect the following to be returned:

["t3"],       1            (from row1)
["t3", "t4"], 1            (from row2)
null,         2            (from row3+row4)

@clintropolis (Member) left a comment:

thanks for looking into this, super into being able to group on arrays natively 👍

@@ -1357,6 +1373,12 @@ private RowBasedKeySerdeHelper makeSerdeHelper(
)
{
switch (valueType.getType()) {
case ARRAY:
return new ArrayRowBasedKeySerdeHelper(
@clintropolis (Member) commented:
This doesn't look like it handles all arrays. Using the TypeStrategy added in #11888 would maybe give the necessary comparators to handle the other types of arrays, but arrayDictionary would need to be able to hold any type of array, not just strings.
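
As an illustration of that point (not code from this PR or from #11888), a single lexicographic comparator parameterized by a per-element comparator could serve arrays of any element type:

import java.util.Comparator;

// Hypothetical sketch: compares two arrays element by element using a
// supplied element comparator, so one helper works for strings, longs, etc.
final class LexicographicArrayComparator implements Comparator<Object[]>
{
  private final Comparator<Object> elementComparator;

  LexicographicArrayComparator(Comparator<Object> elementComparator)
  {
    this.elementComparator = elementComparator;
  }

  @Override
  public int compare(Object[] lhs, Object[] rhs)
  {
    final int common = Math.min(lhs.length, rhs.length);
    for (int i = 0; i < common; i++) {
      final int cmp = elementComparator.compare(lhs[i], rhs[i]);
      if (cmp != 0) {
        return cmp;
      }
    }
    // On a shared prefix, the shorter array sorts first.
    return Integer.compare(lhs.length, rhs.length);
  }
}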

@cryptoe (Contributor, Author) replied:

Acked. Working on it.

Comment on lines 408 to 416
if (plannerContext.getQueryContext()
.getOrDefault(
QueryContexts.ENABLE_UNNESTED_ARRAYS_KEY,
QueryContexts.DEFAULT_ENABLE_UNNESTED_ARRAYS
).equals(Boolean.FALSE)) {
outputType = Calcites.getValueTypeForRelDataTypeFull(dataType);
} else {
outputType = Calcites.getColumnTypeForRelDataType(dataType);
}
@clintropolis (Member) commented:
If grouping on arrays supported all array types, the correct thing to do instead of a flag, I think, would just be to consolidate getColumnTypeForRelDataType and getValueTypeForRelDataTypeFull to remove the string coercion and let arrays stay as arrays. They are separate basically because we allow using array functions on string-typed columns, but grouping on them requires that they remain string-typed in the native layer; if that were no longer true, we wouldn't have to do this coercion any longer.

@cryptoe (Contributor, Author) replied:

Acked.

@FrankChen021 (Member) commented:

@cryptoe It's my fault that I misunderstood your design. Please ignore the comment I left before.

@cryptoe (Contributor, Author) left a comment:

Thanks for the review.

1. Removing ResultRowDeserializer
2. Removing dimension spec as part of columnSelector
@cryptoe (Contributor, Author) left a comment:

Moving the PR out of draft.

@cryptoe marked this pull request as ready for review on January 6, 2022 07:04
@cryptoe (Contributor, Author) commented Jan 14, 2022

Sure. Collating all the next steps:

  • Getting ordering on array grouping key to work
  • Null coercion
  • Optimized Limit pushdown in array comparators
  • mv_to_array to support expressions. Move to native cast.
  • Query context flag to not allow grouping on multi value columns
  • Refactor stuff to remove DimensionHandlerUtils returning a comparable
  • Tighter validation on matching dimension spec with column type

@cryptoe closed this Jan 18, 2022
@cryptoe reopened this Jan 18, 2022
case DOUBLE:
return new ArrayDoubleGroupByColumnSelectorStrategy();
case FLOAT:
// Array<Float> not supported in expressions, ingestion
@clintropolis (Member) commented:
Note to self: double check this is actually true at this layer (if it is not, it might possibly be handled with the Double strategy). While it's definitely true that FLOAT doesn't exist in expressions, and so within expressions there exists no float array, this type might still be specified by the SQL planner, for example whenever some float column is added into an array. I'm unsure whether the expression selector's column capabilities would report ARRAY<FLOAT> or ARRAY<DOUBLE> as the type of the virtual column; I know it coerces DOUBLE back to FLOAT when the planner requests FLOAT types, but I don't think it does the same thing for ARRAY<FLOAT>, so this is probably true.

clintropolis and others added 5 commits January 25, 2022 08:29
* only coerce multi-value string null values when `ExpressionPlan.Trait.NEEDS_APPLIED` is set
* correct return type inference for ARRAY_APPEND,ARRAY_PREPEND,ARRAY_SLICE,ARRAY_CONCAT
* fix bug with ExprEval.ofType when actual type of object from binding doesn't match its claimed type
@clintropolis (Member) left a comment:

lot of stuff I can think to do as follow-ups, but the basic behavior lgtm, i think this is a good start 👍

@clintropolis merged commit 96b3498 into apache:master on Jan 26, 2022
abhishekagarwal87 pushed a commit that referenced this pull request Feb 16, 2022
As part of #12078, one of the follow-ups was to have a specific config which does not allow accidental unnesting of multi-value columns if such columns become part of the grouping key.
Added a config groupByEnableMultiValueUnnesting which can be set in the query context.

The default value of groupByEnableMultiValueUnnesting is true, therefore it does not change the current engine behavior.
If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.
@abhishekagarwal87 added this to the 0.23.0 milestone on May 11, 2022