ARROW-14062: [Format] Initial arrow-internal specification of compute IR #10934
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
Looks great. Just a few small comments.
Force-pushed from d7ed1fe to 767af50.
format/ComputeIR.fbs (Outdated)
```
table InlineBuffer {
  // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers
  // and other alignment sensitive blobs can be stored here
  bytes: [ulong] (required);
```
this has implications for languages that aren't C++ (I don't think the ergonomics of converting long[] to byte[] is quite as easy in Java).
I think this can be resolved by taking a leaf out of parquet2's book and defining InlineBuffer as a union of vectors of each primitive type; I'll try that.
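For illustration, a union-of-per-primitive-vectors encoding along those lines might look like the sketch below; the table and field names are hypothetical, not taken from this PR:

```
// Hypothetical sketch: each variant stores a single primitive type, so a
// consumer knows the element width (and therefore the alignment) from the
// union tag alone, and no byte reinterpretation is needed.
table UInt8Buffer   { items: [ubyte]  (required); }
table UInt64Buffer  { items: [ulong]  (required); }
table Float64Buffer { items: [double] (required); }
// ... one table per remaining primitive type

union InlineBufferImpl { UInt8Buffer, UInt64Buffer, Float64Buffer }

table InlineBuffer { impl: InlineBufferImpl (required); }
```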
I'm not sure unions in FB are equivalent to unions in C++ (I don't know that if you select a byte variant, you end up sharing the same memory).
you might be able to guarantee alignment by having a required long followed by a bytes: [byte] field
actually, I think creating a struct { ulong padding_for_struct_alignment, int padding_for_byte_align, bytes: byte[] } and making that a member should be guaranteed to align.
I notice the ForceVectorAlignment function in C++. We should look at whether

```
table InlineBuffer {
  padding_for_alignment: ulong = 0 (required);
  bytes: [ubyte] (required);
}
```

guarantees `bytes` to be aligned.
I've tried an approach inspired by arrow2's decision to include the primitive type being stored in buffers (and thus also alignment information) all the way down to struct Bytes: https://github.com/jorgecarleitao/arrow2/blob/main/src/buffer/bytes.rs#L39

I think this guarantees alignment without requiring padding fields or reinterpret_casts.
format/ComputeIR.fbs (Outdated)
```
}

table Expression {
  impl: ExpressionImpl (required);
```
What's this indirection for?
The generators for some languages don't support vector-of-unions. I'll add a comment explaining that
Per below, if you lift the output type (field) into Expression (if we agree that every Expression needs an output Field — name and type), then you can simply call this ExpressionOp, since then Expression serves more purpose than simply being a wrapper for Expression.impl.
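A minimal sketch of that suggestion, with placeholder operation names standing in for the PR's actual union members:

```
// Hypothetical sketch: the output Field lives on Expression itself, so the
// union of concrete operations can simply be named ExpressionOp.
// Literal, FieldRef, and Call are placeholder names.
table Literal {}
table FieldRef {}
table Call {}

union ExpressionOp { Literal, FieldRef, Call }

table Expression {
  op: ExpressionOp (required);
  output: org.apache.arrow.flatbuf.Field (required);
}
```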
I agree that every Expression needs a type, but I disagree that name is meaningful for Expressions. To my mind a name is ascribed to the referent by the referring entity; to put it in graph-theory terms, it's an edge property rather than a node property. To give a practical context: in a projection like SELECT $complicated_expr as A, $complicated_expr as B it should not be an error to use a single memoized instance of $complicated_expr, but including name as an expression property would make these two incompatible.
format/ComputeIR.fbs (Outdated)
```
/// A specification of a query.
table Plan {
  /// One or more output relations.
  sinks: [Relation] (required);
```
What is the interpretation of multiple outputs (versus a single output)?
This is a difference in execution model for this proposal (for which rpc_service Interactive is provided semi-didactically to clarify).

If the root_type is a TableExpr which evaluates to batches, that implies that there is a channel open between the consumer and the producer along which those batches can be returned. This is frequently the case, but in general I think we'll want to be able to express execution plans which don't rely on an interactive channel. For example: plans generated as fragments of larger plans in distributed execution, or plans which represent an ETL job. Therefore I think it's preferable that a Plan explicitly include the destination for all batches, even if that will quite commonly be operation=InteractiveOutput (just pipe them back to the user).

Only a single instance of InteractiveOutput is permitted in a Plan (so no interactive producer will need to deinterleave batches piped back along the interactive channel, which I think was your concern here). However, any number of other sinks are permitted. For example, a Plan may specify that one set of batches be streamed to tcp://somehost.com:890 for consumption by a service on that host and an unfiltered superset of those batches cached locally into file://tmp/cache/my_query for debugging.
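To make that concrete, here is a sketch of what such sinks could look like; InteractiveOutput is named in the comment above, while UriSink and the field names are hypothetical:

```
// Hypothetical sketch: a Plan lists explicit destinations for all batches.
// At most one InteractiveOutput is permitted per Plan; any number of other
// destinations (tcp://, file://, ...) may appear.
table InteractiveOutput {}               // pipe batches back to the user

table UriSink {
  uri: string (required);                // e.g. "tcp://somehost.com:890"
}

union SinkImpl { InteractiveOutput, UriSink }

table Sink { impl: SinkImpl (required); }
```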
In general, this feels like it is beyond the scope of a plan and is more about a specific read or write relation, as well as communication of plans between systems. Let's try to avoid baking those concepts into the core plan.
format/ComputeIR.fbs (Outdated)
```
}

union Function {
  CanonicalFunction, NonCanonicalFunction
```
Presumably we would want to add a UserDefinedFunction here which is able to pass an inline buffer containing the serialized function
I'm not sure how that's distinct from providing a NonCanonicalFunction with the serialized function in the options blob
NonCanonicalFunction and UserDefinedFunction are synonymous here. NonCanonicalFunction I think is probably a better name, as it leaves no room for ambiguity.
Per my comments elsewhere, if we want to support serialized functions, we should make them structured, not just a bucket of bytes. For example:
```
{
  type: "python_pickle",
  argument_types: [int],
  output_type: "int",
  dynamic: "false",
  maintains: [sort, cluster, distribution],
  python_prerequisites: [pyarrow, pytorch],
  bytes: <bytes>
}
```
The structure should be such that (a sketch follows this list):
- tools don't have to know how to decode the bytes to do plan transformations
- transformations can move operations without having to worry about invalidating correctness (for example, moving where distribution of data happens in a plan)
- the plan can be validated without decoding the bytes
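A FlatBuffers rendering of that structure might look like the sketch below; the table and field names are hypothetical, with only the shape taken from the example above:

```
// Hypothetical sketch: everything needed for validation and plan
// transformation is structured metadata; only the function body is opaque.
table UserDefinedFunction {
  type: string (required);                    // e.g. "python_pickle"
  argument_types: [org.apache.arrow.flatbuf.Field] (required);
  output_type: org.apache.arrow.flatbuf.Field (required);
  dynamic: bool = false;
  maintains: [string];                        // e.g. ["sort", "distribution"]
  prerequisites: [string];                    // e.g. ["pyarrow", "pytorch"]
  body: [ubyte] (required);                   // opaque serialized function
}
```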
I think that makes sense. We'll have to do a lot more legwork to make user-defined structures a thing.
format/ComputeIR.fbs (Outdated)
```
  /// Parameters for `operation`; content/format may be unique to each
  /// value of `operation`.
  options: InlineBuffer;
```
Having the options as a nested / independent flatbuffer for all canonical operations still makes me squirm. What do you think about putting a union of options for the canonical types in CanonicalOperation so that dealing with the InlineBuffer for "built-ins" is not necessary?
-1. This adds yet more special casing for the canonical operations and makes it harder to write generic pattern matching utilities while also giving us another discriminant to query and validate.
Another pattern might be to make CanonicalOperationId a "CanonicalOperation" which is a union of the options. It doesn't fully get out of slightly harder pattern-matching code, but it would eliminate a separate discriminant.
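Roughly, that pattern might look like the following sketch, with hypothetical per-operation option tables:

```
// Hypothetical sketch: the union tag itself identifies the canonical
// operation, eliminating the separate CanonicalOperationId discriminant.
table FilterOptions {}
table ProjectOptions {}
table JoinOptions {}

union CanonicalOperation { FilterOptions, ProjectOptions, JoinOptions }
```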
I think there's a tradeoff here. Having two code paths for handling canonical versus non-canonical increases complexity, since a consumer now has to handle canonical things in addition to still handling InlineBuffer. If we're going to allow InlineBuffer, I don't think there's a good reason not to have just a single way to deal with options.
Since we don't yet have a very specific idea of how people are going to use this, I think we should leave this as InlineBuffers.
format/ComputeIR.fbs (Outdated)
```
  /// Parameters for `function_name`; content/format may be unique to each
  /// value of `function_name`.
  options: InlineBuffer;
```
Per Operation below, I might prefer to have the opaque options only in the NonCanonicalFunction table. In the CanonicalFunction, if needed (it may not be yet), we could add a union-of-tables providing function-specific options without the need for InlineBuffer.
format/ComputeIR.fbs (Outdated)
```
  Product,
  Sum,
  Tdigest,
  Quantile,
```
We can haggle on the details later but I don't think quantile is well supported enough to be a canonical function.
format/ComputeIR.fbs (Outdated)
```
  Mode,
  Product,
  Sum,
  Tdigest,
```
I think for now maybe we leave out everything except the bare bones functions that we need.
Did a first pass here. A bunch of details, but at a high level I like the direction in general, though I'm -1 on the places where we just have buckets of bytes. I'd like to see a lot more definition around how and what people can customize. I also think the coupling of IR version to the introduction of each new canonical function is a mistake and have added comments to that effect.
```
/// A canonical (probably SQL equivalent) function
//
// TODO: variadics
enum CanonicalFunctionId : uint32 {
```
I don't think specific functions should be part of the IR. A couple of reasons:
- Each of these will actually have many variations, and each should be identified separately: add(int,int) => int vs. add(bigint,bigint) => bigint.
- The speed at which functions are created/added should happen at a much more rapid pace than changes to the core IR.
- Just because new functions are introduced doesn't mean we should change the format version number. The absence of a function in a particular consumer shouldn't really matter.
My suggestion is ids here, plus a separate YAML or similar structured doc that lists canonical functions, including not only structured data around type, etc., but also description/details around specific behavior.
That seems like it will be just as hard to maintain as appending to an enum.

> Each of these will actually have many variations and each should be identified separately add(int,int) => int vs add(bigint,bigint) => bigint.

I don't follow why operations with multiple overloads need to be dealt with at all in the IR. Wouldn't a function have a singular definition (or be singularly derivable) for a given IR? Other than performance concerns about passing lots of strings around, what's the problem with a name and an optional namespace string for every function?
> My suggestion is ids here, a separate yaml or similar structured doc that lists canonical functions that includes not only structured data around type, etc but also description/details around specific behavior.

I think for scalar functions, since there are so many (but not necessarily relational operators), I also prefer this approach to having all the functions listed in an enum in the Flatbuffers file, so that we can have an append-only YAML file with all the functions. Using an integer id to identify a function versus a string makes the IR smaller (good — only ever need 4 bytes, or even only 2 bytes, to identify a function) and implementations marginally less complicated, since many engines will have string identifiers for functions that are different from the "canonical" names in our function inventory. The inventory of scalar functions is likely to grow very fast, and not having to modify the Flatbuffers files would be beneficial.

> Just because new functions are introduced doesn't mean we should change the format version number. The absence of a function in a particular consumer shouldn't really matter.

I agree with this.

> Each of these will actually have many variations and each should be identified separately add(int,int) => int vs add(bigint,bigint) => bigint.

I see arguments both ways on using a different function id for each overload of a function. The main annoyance I would see in having a different id for each overload of a function is that IR implementations would have to maintain a huge IR mapping table, versus using a common id for all variants of "add", for example. On the other hand, if there is a different id for every overload, then it leaves no ambiguity about what the input and output types should be. But that is a massive scope increase for this project to enumerate all the function signatures of every function overload contemplated...
> But that is a massive scope increase for this project to...

I guess I don't really see this as a massive increase in scope. Maybe it's just about being more formal sooner? Without being formal about this I fear that this will become a singly used representation and the implementation will define the specification, rather than the other way around.

> ...enumerate all the function signatures of every function overload contemplated...

Part of it to me is also that this will be a growing list. It doesn't have to start with all possible functions, only the functions we want to initially specify. I would expect adding new functions to be relatively straightforward.

> I don't follow why operations with multiple overloads need to be dealt with at all in the IR. Wouldn't a function have a singular definition (or be singularly derivable) for a given IR?

From an engine implementation point of view, as an example, I feel like decimal division has much more in common with decimal multiplication than it does with integer division. It also has a very different output type resolution system. Overloading a single concept of division to state different output type derivation systems seems quite a bit more complex than simply saying that there is no "overloading". To me, we simply need to consider each function's key to be the name + the input argument types. Sure, two functions may have the same "name", but that doesn't mean they have the same key (or have anything to do with each other).
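As a sketch of that keying scheme (the table and field names are hypothetical, not part of this PR's schema):

```
// Hypothetical sketch: a function is keyed by name plus exact input types,
// so add(int32,int32) and add(int64,int64) are distinct keys rather than
// overloads of a single "add".
table FunctionKey {
  name: string (required);                    // e.g. "add"
  argument_types: [org.apache.arrow.flatbuf.Field] (required);
}
```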
Are we actually sure this is possible to do for all of the types we want to support? What is the key for, say, the == operator for a complex type like list or struct? I don't think a wildcard type is well-defined here without more clarification. For example, List<T> can only be compared with List<U> if T == U, but if T != U the operation is undefined. Unnest is another example. Without a type system that handles generics, you can't write down the type of all possible instantiations of any type that has a type parameter, such as list, map, and struct.

What is the issue with having a list of functions in some structured format that indicates the canonical name of the function and its arity? If a producer sends over a call to the add function with input types int32, int32 and output type int32, then the consumer would look that up, and if it's able to execute that IR, then it does, and if it's not able to do so it returns an error.
```
}

/// A table write
table Write {
```
I think Read & Write should be removed from this patch. It seems like we need to spend a decent amount of time deciding what kind of structure could be there. A required string (and string parsing?) seems like the wrong place to start.
Agreed, though I don't think IR is that useful without Read. I'm not sure if there's actually a common denominator here. I will look around at the different ways this has been done. So far, it looks like DuckDB implements different sources as table-producing functions; maybe that's a good place to start generalizing from.
I'd suggest we start with just reading a path of Parquet files. It seems like a simple set of properties and a realistic initial use case.

Additional comments: given this interest in targeting physical engines, it is weird that we are missing common physical operations like HashJoin, MergeJoin, HashAggregate, StreamingAggregate, TopN (order + limit operator), as well as a bunch related to RedistributeSend, OrderedRedistribute, UnorderedReception, OrderedReception, etc. I also think we need to go to each operation and declare the following:
```
  ASCENDING_THEN_NULLS,
  DESCENDING_THEN_NULLS,
  NULLS_THEN_ASCENDING,
  NULLS_THEN_DESCENDING
```
I think we also need the ability to control NaN ordering here in the case of floating-point columns, i.e. some systems treat NaN as larger than all other values but smaller than null; others do differently.
In that case, I think there's probably an argument to split these options into 3 different enums: one each for non-null ordering, null ordering, and NaN ordering.
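A sketch of that split, with hypothetical enum and field names:

```
// Hypothetical sketch: three independent ordering axes instead of one
// combined enum. (The NaN-versus-null question raised below is still open.)
enum ValueOrdering : uint8 { ASCENDING, DESCENDING }
enum NullOrdering : uint8 { NULLS_FIRST, NULLS_LAST }
enum NanOrdering : uint8 { NANS_FIRST, NANS_LAST }

table SortKey {
  values: ValueOrdering = ASCENDING;
  nulls: NullOrdering = NULLS_LAST;
  nans: NanOrdering = NANS_LAST;
}
```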
We may also need an enum for the ordering of NaN with respect to null, i.e. is it NaN then null then ascending values, or null then NaN then ascending values?
Why not follow the IEEE754 recommendation?
- Iceberg uses it: https://github.com/apache/iceberg/blob/master/site/docs/spec.md#sorting
- Java uses it
- Rust follows it in practice (see e.g. "impl TotalEq/TotalOrd for floats with the standard (IEEE754) total ordering" rust-lang/rust#5585 and ord-float)
I figured there'd be some engines that handle NaN ordering differently. E.g. Pandas with a float64 dtype column allows positioning NaN similarly to what we're allowing for nulls here, using an na_position argument. This does work as expected when using the experimental Float64 nullable type, though.
Handful of comments. I will leave some comments later on Relation.fbs
There are some difficult questions here that merit additional analysis / feedback from others. If we're taking a more rigid stance with respect to output type derivations, especially with built-in functions, then I agree that moving that information outside of the serialized format makes sense. A reasonable metric for what constitutes "built-in" or "canonical" would therefore be that there is a well-determined set of output type derivations (the derivations might have to allow for wildcard types in some cases, e.g. * == * -> boolean).
```
/// A canonical (probably SQL equivalent) function
enum CanonicalAggregateId : uint32 {
  All,
```
I agree that whatever decision is made with the scalar functions should be consistent here
```
/// Boundary is following rows, determined by the contained expression
table Following {
  impl: ConcreteBoundImpl (required);
```
I agree that where a table is being introduced solely to work around Flatbuffers' union issues, calling it Wrapper would make it more clear. The rule then would be to never use a ThingWrapper as a member of any other table, only where you need to put the wrapped union in an array or serialize it as a top-level Flatbuffers object.
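For instance, applying the rule to the Following/ConcreteBoundImpl case quoted above might look like this sketch (the wrapper and frame names are hypothetical):

```
// Hypothetical sketch: BoundWrapper exists only because some generators
// don't support vectors of unions; it never appears as an ordinary member
// of another table.
table Preceding {}
table Following {}
table CurrentRow {}

union Bound { Preceding, Following, CurrentRow }

table BoundWrapper { impl: Bound (required); }

table WindowFrame {
  bounds: [BoundWrapper];   // wrapped solely so the union can live in a vector
}
```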
```
  Cast,
  Extract,
  WindowCall,
  AggregateCall,
```
I think it would be useful to be able to serialize and send e.g. sum($expr_0) / mean($expr_1) (with these expressions being possibly unbound to a particular table schema) without having to build an aggregation relational operator. If aggregation function calls are "different", then the type system to achieve this is probably a bit more complex; curious if you have ideas.
```
  /// This is a field, because the Type union in Schema.fbs
  /// isn't self-contained: Fields are necessary to describe complex types
  /// and there's currently no reason to optimize the storage of this.
  type: org.apache.arrow.flatbuf.Field;
```
I think this is an important question — connected to the above question about enumerating function signatures — and should be raised and discussed more broadly on the mailing list and try to include the larger group of people who commented in the Google Doc that I made originally.
For built-in / canonical functions where we expect $func($type_0, ...) to yield a deterministic output type, there isn't much motivation to serialize the output type — you would only want to put the output type there when it is adding useful information. If you always put it there, you're paying the cost of serializing and deserializing a Field for every expression. An IR producer could run in "verbose" mode and put in all the output types (if it were useful to an IR consumer that doesn't have the type derivation logic).

I suspect that there will be a need to build an inventory of function signatures to reduce ambiguity and to make things more straightforward for IR implementations (for example, an implementation could read the input/output type rules from a text file — or generate code — rather than having to enter the type derivations by hand).

I'm assuming you mean that you want to avoid having to build a projection on top of an aggregate? (Please confirm my interpretation of what you said.) My initial intuition is that supporting arbitrary expressions in aggregations creates more complexity (and a heavier requirement on semantic analysis to confirm plan validity). I agree that there are situations where you might want a compound relational operation that does an aggregation calculation followed immediately by a non-aggregate calculation (e.g. the division in your expression). However, I think that "compound aggregate" could be an additional relational operator we introduce at a later stage, as opposed to being one of the initial primitives (or possibly even be an internal concern/optimization of a particular execution engine).
See also apache#10856. Differing design decisions from the above:

- Don't special case for any `Expression`s or `Relation`s. All array functions and relations are identified by name, which may include a namespace for differentiating between extenders.
- Freely extensible without recompilation of flatbuffers (with the cost of being a less "pure" flatbuffers format, since bytes blobs are used liberally).
- The root type is a Plan rather than a Relation: instead of expressing a value, it is a specification of a side effect which includes the destination for output rows.

Closes apache#10934 from bkietz/compute-ir

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Signed-off-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>