-
Notifications
You must be signed in to change notification settings - Fork 5.5k
Enable subfield pushdown for map_subset function #25394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f894a58
to
63eace6
Compare
63eace6
to
fb0de3c
Compare
@@ -704,7 +713,23 @@ public Void visitCall(CallExpression call, Context context) | |||
} | |||
if (!isPushDownSubfieldsFromLambdasEnabled) { | |||
context.setLambdaSubfields(Context.ALL_SUBFIELDS_OF_ARRAY_ELEMENT_OR_MAP_VALUE); | |||
call.getArguments().forEach(argument -> argument.accept(this, context)); | |||
// map_subset(feature, constant_array) is only accessing fields specified in feature map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does isPushDownSubfieldsFromLambdasEnabled have to be false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not need to, it's just the lambda handling part is so complex and I haven't figured out how to make it work for it. Since this isPushDownSubfieldsFromLambdasEnabled is default to false, and what is added here is more like a new feature, so I chose to fix for the case when isPushDownSubfieldsFromLambdasEnabled for now. We can create a new issue for it when this PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just followed @kevintang2022 's suggestion and the fix worked for lambda too (and also added test)
call.getArguments().forEach(argument -> argument.accept(this, context)); | ||
// map_subset(feature, constant_array) is only accessing fields specified in feature map. | ||
// For example map_subset(feature, array[1, 2]) is equivalent to calling element_at(feature, 1) and element_at(feature, 2) for subfield extraction | ||
if (isPushdownSubfieldsForMapSubsetEnabled && functionResolution.isMapSubSetFunction(call.getFunctionHandle()) && call.getArguments().get(0) instanceof VariableReferenceExpression && call.getArguments().get(1) instanceof ConstantExpression) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if PushdownSubfields is the right place for this optimization because here only the context is changed and the name of this visitor also is SubfieldExtractor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I have been thinking to add a new optimizer which translates all map_subset(feature, array[1, 2, 3]) with constant array elements to something like map(array[1, 2, 3], array[element_at(feature, 1), element_at(feature, 2), element_at(feature, 3)]).
However then I realized that this rewrite will not be useful if there are other reference to this feature map. In order to make it work, we need to check for other non subfield access of feature, which is nontrivial and is basically duplicating most of the logic here.
And also I think it kind of make sense to put it here, as the optimizer is to extract subfield access and push it down, which is what map_subset is doing, i.e. accessing only a part of a map.
Thanks for the release note entry! Suggest adding a link to the doc. See Formatting in the Release Notes Guidelines. Also, can you describe what user-visible effect this might have? Reading the PR description I came up with this. Let me know what you think!
|
I think there's a place in the same file PushdownSubfields.java that is doing something similar. Is there a way to add this change in the same place? It's in the toSubfield method
|
16f3445
to
750a2e7
Compare
@@ -570,17 +574,18 @@ private static String getColumnName(Session session, Metadata metadata, TableHan | |||
return metadata.getColumnMetadata(session, tableHandle, columnHandle).getName(); | |||
} | |||
|
|||
private static Optional<Subfield> toSubfield( | |||
private static Optional<List<Subfield>> toSubfield( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map_subset can return a list of subfields
@kevintang2022 Thanks for the suggestion, updated the PR to move the change here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the optimization, LGTM! There is an unused import in TestHiveLogicalPlanner
that causes the checkstyle failure.
99cf482
750a2e7
to
99cf482
Compare
if (index == null) { | ||
return Optional.empty(); | ||
} | ||
if (index instanceof Number) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought map_subset specifies map keys and not indexes? why can't they be negative? And why do they need to be numeric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case we cannot translate to elementAt function i think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
element_at for maps also takes a key and not an index (see docs here: https://prestodb.io/docs/current/functions/map.html). element_at returns the value for a given key. map_subset returns a subset of the map that includes only the entries with keys in the specified array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, I have tried removing the restriction on negative values and adding support for varchar type here, and found that both can be successfully pushed down and got the expected results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought map_subset specifies map keys and not indexes?
Rename to mapKey
And why do they need to be numeric?
Add case for varchar keys
why can't they be negative?
You are right, the associated issue is for array, not for map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, since subfield pushdown of negative keys work for map, I drafted another PR #25445 to fix the current implementation for element_at and subscript functions.
@@ -1461,6 +1462,31 @@ public void testPushdownSubfields() | |||
assertUpdate("DROP TABLE test_pushdown_struct_subfields"); | |||
} | |||
|
|||
@Test | |||
public void testPushdownSubfieldsForMapSubset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test where the specified keys to subset are not also the index of that key? Also for non-numeric map keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a test where the specified keys to subset are not also the index of that key
What do you mean by not also the index of that key? Do you mean the type of index in array is different from map key?
If map key is of integer type, and element in array is of bigint type, the map will be wrapped with a cast and thie pushdown field will not work (and this needs to be resolved with this PR #25395)
If element type of array is not compatible with map key type, for example integer vs. varchar, query will fail during during parsing
Also for non-numeric map keys.
Added tests for varchar keys too
e0d8dbe
to
f1f5138
Compare
f1f5138
to
3928122
Compare
3928122
to
fd18a02
Compare
Description
MAP_SUBSET function is commonly used in ML workload, which is used to extract a subset of features. For example, map_subset(feature, array[301, 205]) is used to extract the features with key to be 301 and 205.
If this is the only case where feature is accessed, we only need to read two fields of the map. However currently PushDownSubfields does not recognize this pattern. In this PR I enabled subfield pushdown for map_subset when the input array is a constant array.
In one of the motivating queries we found, this enhancement reduce the resource usage by 17% (orig vs. improved)
Without change:
With change:
map2 := map2:map<int,int>:1:REGULAR (1:67)
vs.map2 := map2:map<int,int>:1:REGULAR:[map2[1]] (1:67)
Motivation and Context
As in description
Impact
In one of the motivating queries we found, this enhancement reduce the resource usage by 17%
Test Plan
unit tests, also end to end test locally
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.