
ARROW-14062: [Format] Initial arrow-internal specification of compute IR #10934

Closed
wants to merge 96 commits

Conversation

bkietz
Member

@bkietz bkietz commented Aug 13, 2021

See also #10856

Differing design decisions from the above:

  • Don't special-case any Expressions or Relations. All array functions and relations are identified by name, which may include a namespace for differentiating between extenders.
  • Freely extensible without recompilation of flatbuffers (with the cost of being a less "pure" flatbuffers format since bytes blobs are used liberally).
  • The root type is a Plan rather than a Relation: instead of expressing a value, it is a specification of a side effect which includes the destination for output rows.

@bkietz changed the title to [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion Aug 13, 2021
@apache apache deleted a comment from github-actions bot Aug 13, 2021
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Contributor

@cpcloud cpcloud left a comment


Looks great. Just a few small comments.

@bkietz bkietz force-pushed the compute-ir branch 2 times, most recently from d7ed1fe to 767af50 Compare August 17, 2021 16:41
table InlineBuffer {
// ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers
// and other alignment sensitive blobs can be stored here
bytes: [ulong] (required);
Contributor

this has implications for languages that aren't C++ (I don't think the ergonomics of converting long[] to byte[] are quite as easy in Java).

Member Author

I think this can be resolved by taking a leaf out of parquet2's book and defining InlineBuffer as a union of vectors of each primitive type, I'll try that.

Contributor

I'm not sure unions in FB are equivalent to unions in C++ (I don't know that if you select a byte union, you end up sharing the same memory).

Contributor

you might be able to guarantee alignment by having a required long followed by a bytes: [byte] field

Contributor

@emkornfield emkornfield Aug 17, 2021

actually, I think creating a struct { ulong padding_for_struct_alignment, int padding_for_byte_align, bytes: byte[] } and making that a member should be guaranteed to align.

Member

I notice the ForceVectorAlignment function in C++

https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L1755

We should look at whether

table InlineBuffer {
  padding_for_alignment:ulong = 0 (required);
  bytes:[ubyte] (required);
}

guarantees bytes to be aligned

Member Author

I've tried an approach inspired by arrow2's decision to include the primitive type being stored in buffers (and thus also alignment information) all the way down to struct Bytes

https://github.com/jorgecarleitao/arrow2/blob/main/src/buffer/bytes.rs#L39

I think this guarantees alignment without requiring padding fields or reinterpret_casts
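A minimal sketch of that union-of-typed-vectors shape (the table and field names here are hypothetical illustrations, not the actual patch):

```
// Hypothetical sketch only. Storing a buffer as a vector of its logical
// primitive type lets the flatbuffers builder align the data for that type,
// with no padding fields or reinterpret_casts.
table Int8Buffer    { items: [byte]   (required); }
table Int64Buffer   { items: [long]   (required); }
table Float64Buffer { items: [double] (required); }

union InlineBufferImpl { Int8Buffer, Int64Buffer, Float64Buffer }

table InlineBuffer {
  impl: InlineBufferImpl (required);
}
```

One table per primitive type would be needed in practice; only three are shown here for brevity.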

}

table Expression {
impl: ExpressionImpl (required);
Contributor

what is this indirection for?

Member Author

The generators for some languages don't support vector-of-unions. I'll add a comment explaining that

Member

Per below, if you lift the output type (field) into Expression (if we agree that every Expression needs an output Field — name and type), then you can simply call this ExpressionOp, since then Expression serves more purpose than simply being a wrapper for Expression.impl

Member Author

I agree that every Expression needs a type, but I disagree that name is meaningful for Expressions. To my mind a name is ascribed to the referent by the referring entity; to speak in graph theory, it's an edge property rather than a node property. To give a practical context: in a projection like SELECT $complicated_expr as A, $complicated_expr as B it should not be an error to use a single memoized instance of $complicated_expr, but including name as an expression property would make these two incompatible.


/// A specification of a query.
table Plan {
/// One or more output relations.
sinks: [Relation] (required);
Member

What is the interpretation of multiple outputs (versus a single output)?

Member Author

This is a difference in execution model for this proposal (for which rpc_service Interactive is provided semi-didactically to clarify).

If the root_type is a TableExpr which evaluates to batches, that implies that there is a channel open between the consumer and the producer along which those batches can be returned. This is frequently the case but in general I think we'll want to be able to express execution plans which don't rely on an interactive channel. For example: Plans generated as fragments of larger plans in distributed execution or plans which represent an ETL job.

Therefore I think it's preferable that a Plan explicitly include the destination for all batches, even if that will quite commonly be operation=InteractiveOutput (just pipe them back to the user).

Only a single instance of InteractiveOutput is permitted in a Plan (so no interactive producer will need to deinterleave batches piped back along the interactive channel, which I think was your concern here). However any number of other sinks are permitted. For example, a Plan may specify that one set of batches be streamed to tcp://somehost.com:890 for consumption by a service on that host and an unfiltered superset of those batches cached locally into file://tmp/cache/my_query for debugging.

Contributor

In general, this feels like it is beyond the scope of a plan and is more about a specific read or write relation as well as communication of plan between systems. Let's try to avoid baking those concepts into the core plan.

}

union Function {
CanonicalFunction, NonCanonicalFunction
Member

Presumably we would want to add a UserDefinedFunction here which is able to pass an inline buffer containing the serialized function

Member Author

I'm not sure how that's distinct from providing a NonCanonicalFunction with the serialized function in the options blob

Contributor

NonCanonicalFunction and UserDefinedFunction are synonymous here. NonCanonicalFunction I think is probably a better name as it leaves no room for ambiguity.

Contributor

@jacques-n jacques-n Aug 26, 2021

Per my comments elsewhere, if we want to support serialized functions, we should make them structured, not just a bucket of bytes. For example:

{
  type: "python_pickle",
  argument_types: [int],
  output_type: "int",
  dynamic: "false",
  maintains: [sort, cluster, distribution],
  python_prerequisites: [pyarrow, pytorch],
  bytes: <bytes>
}

The structure should be such that:

  • tools don't have to know how to decode the bytes to do plan transformations
  • transformations can move operations without having to worry about invalidating correctness (for example move where distribution of data is happening in a plan)
  • the plan can be validated without doing a byte decoding
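In the flatbuffers IDL used elsewhere in this proposal, a structured serialized function along those lines might be sketched as follows (every name here is hypothetical, invented for illustration):

```
// Hypothetical sketch only; field names are illustrative, not from the patch.
table SerializedFunction {
  /// encoding of `body`, e.g. "python_pickle"
  type: string (required);
  argument_types: [org.apache.arrow.flatbuf.Field];
  output_type: org.apache.arrow.flatbuf.Field;
  /// physical properties the function maintains (e.g. "sort", "distribution"),
  /// so tools can move it during plan transformations without decoding `body`
  maintains: [string];
  /// the opaque serialized function itself
  body: [ubyte] (required);
}
```

The point of the structure is exactly the three bullets above: everything a plan transformer or validator needs lives in typed fields, and only an executor ever decodes `body`.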

Contributor

I think that makes sense. We'll have to do a lot more legwork to make user-defined structures a thing.


/// Parameters for `operation`; content/format may be unique to each
/// value of `operation`.
options: InlineBuffer;
Member

Having the options as a nested / independent flatbuffer for all canonical operations still makes me squirm. What do you think about putting a union of options for the canonical types in CanonicalOperation so that dealing with the InlineBuffer for "built-ins" is not necessary?

Member Author

-1. This adds yet more special casing for the canonical operations and makes it harder to write generic pattern matching utilities while also giving us another discriminant to query and validate.

Contributor

Another pattern might be to make CanonicalOperationId "CanonicalOperation", which is a union of the options. It doesn't fully get out of slightly harder pattern matching code, but it would eliminate a separate discriminant.

Contributor

I think there's a tradeoff here. Having two code paths for handling canonical versus non-canonical increases complexity, since a consumer now has to handle canonical things in addition to still handling InlineBuffer. If we're going to allow InlineBuffer, I don't think there's a good reason not to have just a single way to deal with options.

Contributor

Since we don't yet have a very specific idea of how people are going to use this, I think we should leave this as InlineBuffers.


/// Parameters for `function_name`; content/format may be unique to each
/// value of `function_name`.
options: InlineBuffer;
Member

Per Operation below, I might prefer to have the opaque options only in the NonCanonicalFunction table. In the CanonicalFunction, if needed (it may not be yet), we could add a union-of-tables providing function-specific options without the need for InlineBuffer.

Product,
Sum,
Tdigest,
Quantile,
Contributor

We can haggle on the details later but I don't think quantile is well supported enough to be a canonical function.

Mode,
Product,
Sum,
Tdigest,
Contributor

I think for now maybe we leave out everything except the bare bones functions that we need.

Contributor

@jacques-n jacques-n left a comment

Did a first pass here. A bunch of details, but I'd say at a high level I like the direction in general but am -1 on the places where we just have buckets of bytes. I'd like to see a lot more definition around how and what people can customize. I also think the coupling of IR version to the introduction of each new canonical function is a mistake and have added comments to that effect.

/// A canonical (probably SQL equivalent) function
//
// TODO: variadics
enum CanonicalFunctionId : uint32 {
Contributor

I don't think specific functions should be part of the IR. A couple of reasons:

  • Each of these will actually have many variations and each should be identified separately: add(int,int) => int vs add(bigint,bigint) => bigint.
  • The speed at which functions are created/added should happen at a much more rapid pace than changes to the core IR.
  • Just because new functions are introduced doesn't mean we should change the format version number. The absence of a function in a particular consumer shouldn't really matter.

Contributor

My suggestion is ids here, plus a separate yaml or similar structured doc that lists canonical functions and includes not only structured data around types etc. but also description/details around specific behavior.
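A hedged sketch of what one entry in such a structured inventory might look like (every field name here is hypothetical, invented for illustration):

```
# Hypothetical canonical-function inventory entry; not part of the proposal.
- id: 7                      # stable integer referenced from the IR
  name: add
  description: element-wise addition of two numeric arrays
  signatures:
    - args: [int32, int32]
      output: int32
    - args: [int64, int64]
      output: int64
  null_handling: propagate   # a null in either argument yields a null
```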

Contributor

That seems like it will be just as hard to maintain as appending to an enum.

Each of these will actually have many variations and each should be identified separately add(int,int) => int vs add(bigint,bigint) => bigint.

I don't follow why operations with multiple overloads need to be dealt with at all in the IR. Wouldn't a function have a singular definition (or be singularly derivable) for a given IR?

Other than performance concerns about passing lots of strings around, what's the problem with a name and an optional namespace string for every function?

Member

My suggestion is ids here, a separate yaml or similar structured doc that lists canonical functions that includes not only structured data around type, etc but also description/details around specific behavior.

I think for scalar functions, since there are so many (but not necessarily relational operators), I also prefer this approach to having all the functions listed in an enum on the Flatbuffers file, so that we can have an append-only yaml file with all the functions. Using an integer id to identify a function versus a string makes the IR smaller (good — only ever need 4 bytes, even only 2 bytes, to identify a function) and implementations marginally less complicated since many engines will have string identifiers for functions that are different than the "canonical" names in our function inventory. The inventory of scalar functions is likely to grow very fast and not having to modify the Flatbuffers files would be beneficial

Just because new functions are introduced doesn't mean we should change the format version number. The absence of a function in a particular consumer shouldn't really matter.

I agree with this

Each of these will actually have many variations and each should be identified separately add(int,int) => int vs add(bigint,bigint) => bigint.

I see arguments both ways on using a different function id for each overload of a function. The main annoyance I would see in having a different id for each overload of a function is that IR implementations would have to maintain a huge IR mapping table, versus using a common id for all variants of "add" for example. On the other hand, if there is a different id for every overload, then it leaves no ambiguity about what the input and output types should be. But that is a massive scope increase for this project to enumerate all the function signatures of every function overload contemplated...

Contributor

But that is a massive scope increase for this project to...

I guess I don't really see this as a massive increase in scope. Maybe it's just about being more formal sooner? Without being formal about this, I fear that this will become a singly-used representation and the implementation will define the specification, rather than the other way around.

...enumerate all the function signatures of every function overload contemplated...

Part of it to me is also that this will be a growing list. It doesn't have to start with all possible functions, only the functions we want to initially specify. I would expect adding new functions would be relatively straightforward.

I don't follow why operations with multiple overloads need to be dealt with at all in the IR. Wouldn't a function have a singular definition (or be singularly derivable) for a given IR?

From an engine implementation point of view, as an example, I feel like decimal division has much more in common with decimal multiplication than it does with integer division. It also has a very different output type resolution system. Overloading a single concept of division to state different output type derivation systems seems quite a bit more complex than simply saying that there is no "overloading". To me, we simply need to consider each function's key to be the name + the input argument types. Sure, two functions may have the same "name" but that doesn't mean they have the same key (or have anything to do with each other).

Contributor

@cpcloud cpcloud Aug 30, 2021

Are we actually sure this is possible to do for all of the types we want to support?

What is the key for, say, the == operator for a complex type like list or struct?

I don't think a wildcard type is well-defined here without more clarification. For example, List<T> can only be compared with List<U> if T == U, but if T != U the operation is undefined.

Unnest is another example.

Without a type system that handles generics, you can't write down the type of all possible instantiations of any type that has a type parameter, such as list, map, and struct.

What is the issue with having a list of functions in some structured format, that indicates the canonical name of the function and its arity?

If a producer sends over a call to the add function with input types int32, int32 and output type int32, then the consumer would look that up, and if it's able to execute that IR, then it does and if it's not able to do so it returns an error.

}

/// A table write
table Write {
Contributor

I think Read & Write should be removed from this patch. It seems like we need to spend a decent amount of time deciding what kind of structure could be there. A required string (and string parsing?) seems like the wrong place to start.

Contributor

Agreed, though I don't think IR is that useful without Read. I'm not sure if there's actually a common denominator here.

I will look around at the different ways this has been done. So far, it looks like DuckDB implements different sources as table producing functions, maybe that's a good place to start generalizing from.

Contributor

I'd suggest we start with just a read of a path of Parquet files. It seems like a simple set of properties and a realistic initial use case.

@jacques-n
Contributor

Additional comments: given this interest in targeting physical engines, it is weird that we are missing common physical operations like HashJoin, MergeJoin, HashAggregate, StreamingAggregate, TopN (order + limit operator), as well as a bunch related to RedistributeSend, OrderedRedistribute, UnorderedReception, OrderedReception, etc.

I also think we need to go to each operation and declare the following:

  • Properties each operation maintains. (e.g. if you have Read > Sort > Filter > Write, is the write expected to be sorted?)
  • The order of the output fields/columns. For example, Join (A, B), Project(a+b, c+d), etc.

Comment on lines 192 to 195
ASCENDING_THEN_NULLS,
DESCENDING_THEN_NULLS,
NULLS_THEN_ASCENDING,
NULLS_THEN_DESCENDING
Contributor

I think we also need the ability to control NaN ordering here in the case of floating point columns, i.e. some systems treat NaN as larger than all other values but smaller than null; others do it differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I think there's probably an argument to split these options out into 3 different enums: one for non-null ordering, one for null ordering, and one for NaN ordering.

Contributor

We may also need an enum for the ordering of NaN with respect to null, i.e. is it NaN then null then ascending values, or null then NaN then ascending values?


Contributor

I figured there'd be some engines that handle NaN ordering differently, e.g. pandas with a float64 dtype column allows positioning NaN similarly to what we're allowing for nulls here, using an na_position argument. This does work as expected when using the experimental Float64 nullable type, though.

Member

@wesm wesm left a comment

Handful of comments. I will leave some comments later on Relation.fbs

There are some difficult questions here that merit additional analysis / feedback from others. If we're taking a more rigid stance with respect to output type derivations, especially with built-in functions, then I agree that moving that information outside of the serialized format makes sense. A reasonable metric for what constitutes "built-in" or "canonical" would therefore be that there is a well-determined set of output type derivations (the derivations might have to allow for wildcard types in some cases, e.g. * == * -> boolean)



/// A canonical (probably SQL equivalent) function
enum CanonicalAggregateId : uint32 {
All,

I agree that whatever decision is made with the scalar functions should be consistent here


/// Boundary is following rows, determined by the contained expression
table Following {
impl: ConcreteBoundImpl (required);

I agree that where a table is being introduced solely to work around Flatbuffers' union issues, calling it Wrapper would make it more clear. The rule then would be to never use a ThingWrapper as a member of any other table, only where you need to put the wrapped union in an array or serialize it as a top-level Flatbuffers object.
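For illustration, the naming rule might look like this (table and union names are hypothetical, not taken from the PR):

```
// Illustrative sketch: FlatBuffers unions can't be placed directly in
// a vector or used as a root type, so a wrapper table exists solely
// for those positions.
union Bound { Preceding, Following, CurrentRow }

// Used ONLY where a bare union is not allowed, never as an ordinary
// member of another table:
table BoundWrapper {
  bound: Bound (required);
}

table Frame {
  // a vector of unions is exactly the case that forces the wrapper
  bounds: [BoundWrapper];
}
```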

Cast,
Extract,
WindowCall,
AggregateCall,

I think it would be useful to be able to serialize and send e.g. sum($expr_0) / mean($expr_1) (with these expressions being possibly unbound to a particular table schema) without having to build an aggregation relational operator. If aggregate function calls are "different" from scalar calls, then the type system to achieve this is probably a bit more complex; curious if you have ideas.

/// This is a field, because the Type union in Schema.fbs
/// isn't self-contained: Fields are necessary to describe complex types
/// and there's currently no reason to optimize the storage of this.
type: org.apache.arrow.flatbuf.Field;

I think this is an important question — connected to the above question about enumerating function signatures — and should be raised and discussed more broadly on the mailing list and try to include the larger group of people who commented in the Google Doc that I made originally.

For built-in / canonical functions where we expect $func($type_0, ...) to yield a deterministic output type, there isn't much motivation to serialize the output type — you would only want to put the output type there when it is adding useful information. If you always put it there, you're paying the cost of serializing and deserializing a Field for every expression. An IR producer could run in "verbose" mode and put all the output types (if it were useful to a IR consumer that doesn't have the type derivation logic)

I suspect that there will be a need to build an inventory of function signatures to reduce ambiguity and to make things more straightforward for IR implementations (for example, an implementation could read the input/output type rules from a text file — or generate code — rather than having to enter the type derivations by hand)
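One way to express the "verbose mode" idea is to make the derived type an optional annotation rather than a required field. A hypothetical sketch (field names illustrative, not the PR's actual schema):

```
// Illustrative only: the output type as an optional annotation.
table Call {
  function_id: uint32;
  /// Optional derived output type. A producer running in "verbose"
  /// mode populates this for consumers that lack type derivation
  /// logic; when absent, the consumer derives the type from the
  /// function signature inventory instead.
  output_type: org.apache.arrow.flatbuf.Field;
}
```

This keeps the common case cheap (no Field serialized per expression) while still allowing a producer to spell out every type when that helps a simpler consumer.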

@jacques-n

I think it would be useful to be able to serialize and send e.g. sum($expr_0) / mean($expr_1) (with these expressions being possibly unbound to a particular table schema) without having to build an aggregation relational operator

I'm assuming you mean that you want to avoid having to build a project on top of an aggregate? (Please confirm my interpretation of what you said.)

My initial intuition is that supporting arbitrary expressions in aggregation creates more complexity (and a heavier requirement on semantic analysis to confirm plan validity). I agree that there are situations where you might want a compound relational operation that does an aggregation calculation followed immediately by a non-aggregate calculation (e.g. the division in your expression). However, I think that "compound aggregate" would be an additional relational operator we could introduce at a later stage as opposed to being one of the initial primitives (or possibly even being an internal concern/optimization of a particular execution engine).
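The "project on top of an aggregate" shape could be sketched as follows (table and field names are illustrative, not the PR's actual schema, and the union-in-vector wrapper issue is glossed over):

```
// sum($expr_0) / mean($expr_1) expressed as two relational operators:
// an Aggregate producing the raw measures, then a Project computing
// the division over the Aggregate's output columns.
table Aggregate {
  rel: Relation (required);
  groupings: [Expression];
  measures: [Expression];   // restricted to direct aggregate calls
}

table Project {
  rel: Relation (required);    // here: the Aggregate above
  expressions: [Expression];   // here: divide(field(0), field(1))
}
```

Keeping measures restricted to direct aggregate calls makes plan validation simple; the scalar arithmetic lives entirely in the Project.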

@cpcloud cpcloud changed the title [Format] Arrow Compute Serialized Intermediate Representation draft for discussion ARROW-14062: [Format] Initial arrow-internal specification of compute IR Sep 21, 2021
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@cpcloud cpcloud closed this in ce34ea1 Sep 21, 2021
@bkietz bkietz deleted the compute-ir branch September 22, 2021 14:59
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Closes apache#10934 from bkietz/compute-ir

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Signed-off-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>