-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable ORC stripe prunning based on the DECIMAL predicates #5001
Conversation
create(ValueSet.ofRanges(greaterThanOrEqual(LONG_DECIMAL, longDecimal("-1234567890.0987654321"))), true)); | ||
} | ||
|
||
private static ColumnStatistics decimalColumnStats(Long numberOfValues, String minimum, String maximum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move below dataColumnStats
lgtm |
604bfff
to
dd4d29a
Compare
public static long shortDecimalToBigint(long value, int sourceScale) | ||
{ | ||
BigDecimal bigDecimal = new BigDecimal(BigInteger.valueOf(value), sourceScale); | ||
return bigDecimal.setScale(0, FLOOR).longValueExact(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really the most efficient way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saturated floor cast is going to be use only in DomainTranslator
to translate user-entered predicates. I don't think that good performance really matter here.
The ORC bits look good, but @erichwang should review the changes to |
Actually I have floor casts implemented for |
All the variables which are involved in calculation must be explicitly declared. After Signature binding refactor there will be no such cases when we need to handle undeclared variables in TypeCalculation.
In order to avoid extra cast of both decimal arguments to the same type ADD,SUBSTRACT and DIVIDE operators must accept different decimal types, e.g. (DECIMAL(3,2), DECIMAL(5,1)). We are going to remove arguments coercions to the same type because it is not required by the all operators. For instance Multiply operator does not require casting, because multiply operator has the semantic that doesn't require re-scaling. Compulsory casting of decimal arguments to the same type will be removed in a very next commit that incudes "matchAndBind" algorithm refactoring.
Bind both Types and Literal variables in single traversal. Simplify function signature parameters binding process. Remove decimal parameters cast to the same type. Move all the signature binding related code to the single class. Before this patch signature variables binding has been made in 2 separate traversals. First we called `matchAndBind(declaredSignature, actualParameters)` to bind type parameters. Than we called `calculateLiteralParameters(declatedSignature, actualParameters)` to bind literal parameters. Similary 2 traversals has been made to bind calculated variables to the declared signatures. This commit simplifies binding process with introducing algorithm that requires only one single reqursive traversal for binding both variable types (type, literal). All related code for type variables matching and binding has been moved to the single `SingatureBinder` class.
As the signature parameters binging code has been moved to SignatureBinder, all the parameters binding tests moved to the TestSignatureBinder class. Assertions in TestSignatureBinder has been refactored in "assertj style". This resolves prestodb#4405
If more that one function is selected as applicable for some particular parameters, the latest one in the candidates list is selected for execution. Candidates list is built with reflection. `Class#getMethods` method doesn't guarrantee any fixed order, so the candidates list order is runtime specific. This commit introduces most specific function selection algorithm. This algorithm is suppose to ensure that selected function for some particular parameters will always be the same, and it will always be the most specific. The main algorithm idea is inspired by Java language specifications: https://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html#jls-15.12.2.5 However the actual implemented algorithm has a single substantial difference. Implemented algorithm in Presto is always trying to select applicable function that doesn't require any coercions for invokation, no matter what type of function it is - parametrized, variable arity or generic. The actual implemented algorithm is next: 1. Try to select function with explicitly declared parameters by direct match. 2. If more than one function selected than throw "ambigulous method invocation exception" 3. Try to select generic function without coercion 4. If more than one function selected than throw "ambigulous method invocation exception" 5. Select all applicable function with coercions 6. Select most specific function One function is more specific than another if the invocation handled by the first method could be passed on to the other one 7. If the most specific function can't be selected than throw "ambigulous method invocation exception"
As we have introduced the most specific function selection algorithm, that throws an error in the situation when the most specific method can't be selected, we must resolve the conflicts explicitly for the UNKNOWN type operators. For instance for the invokation `BIGINT + UKNNOWN` two operators are selected as applicable: - `ADD(BIGINT, BIGINT)` - `ADD(BIGINT, INTERVAL)` Before the `most specific function` algorithm was introduced the random operators from those two was considered for invokation. Most specific function selection algorithm can't resolve this conflict because the `BIGINT` and the `INTERVAL` types are not linked with the implicit coercion. For the algorithm theese two types are just 2 completely different types. But bassicaly the default sematic for UNKNOWN should be simple. `OPERATOR(T, UNKNOWN)` must return `T`. To keep this semantic and resolve the ambigulity it was decided to introduce explicit arithmetic and inequallity operators for the `UNKNOWN` type.
Resolve `length(varchar), length(varbinary)` ambiguous call for unknown type
Resolve `approx_set(varchar), approx_set(bigint)` ambigulous call for unknown type
+ Added StandardErrorCode.AMBIGUOUS_FUNCTION_CALL. + Ambiguous exception is thrown as PrestoException instead of IllegalStateException. + Ambiguous exception has consistent message for same input.
We need to test operators that return decimal not only in dedicated `DECIMAL` test classes.
Changes the algorithm to resolve functions as follows: 1. look for an exact match 2. find the most specific function after coercing arguments 3. If more than one exists, find the most specific function that only coerces null arguments. 4. If more than one exists, look for one whose parameters are of the same type. Also, only create UnknownOperators for argument types that would result in ambiguous functions.
4c795d2
to
fb72094
Compare
In order to introduce |
if (value.isNull()) { | ||
return Optional.of(NullableValue.asNull(targetType)); | ||
} | ||
Object coercedValue = new FunctionInvoker(metadata.getFunctionRegistry()) | ||
if (!TypeRegistry.canCoerce(value.getType(), targetType)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can move this coercion check after the null check. That would imply that the any null type can be coerced into any other type.
private Optional<Signature> getSaturatedFloorCastOperator(Type fromType, Type toType) | ||
{ | ||
if (!metadata.getFunctionRegistry().canResolveOperator(SATURATED_FLOOR_CAST, toType, ImmutableList.of(fromType))) { | ||
return Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have used map
in every other places that you noticed, but it can't be used here, cause there are no method in FunctionRegistry
that return Optional
.
@dain, I looked at the DomainTranslator bits and it looks good other than just a few minor comments there |
DomainTranslator extracts the TupleDomain for the simple comparisons (>=, <=, =, <>, etc.). TupleDomain is further used by connectors to implement the partition pruning based on the query predicates. If the column is compared with the value of narrower type (column_double >= BIGINT '1'), the value can be easily coerced to the wider type (column_double >= DOUBLE '1.0'), and no workarounds are needed. If the column is compared with the value of wider type (column_integer >= DOUBLE '1.0'), the value must be rounded to the narrower type. Rounding algorithm for `column_integer >= const_double` comparation was already implemented. This commit is a generalization of that algoritm. Now in order to make some types pair elligable for rounding and domain translation the `SATURATED_FLOOR_CAST` operator must be registered for that types pair. Resolves: prestodb#5013
fb72094
to
f56b415
Compare
@martint, dain looked at the ORC parts, and I cleared the TupleDomain stuffs |
@arhimondr, can we close this? I've merge the ORC-related bits (#5190), and the other commits are covered by a different PR, no? |
@martint There are still 3 commits left unmerged.
Those commits depend on decimal coercions. |
No description provided.