
ARROW-14062: [Format] Initial arrow-internal specification of compute IR #10934

Closed
wants to merge 96 commits

Conversation

bkietz
Member

@bkietz bkietz commented Aug 13, 2021

See also #10856

Differing design decisions from the above:

  • Don't special-case any Expressions or Relations. All array functions and relations are identified by name, which may include a namespace for differentiating between extenders.
  • Freely extensible without recompilation of flatbuffers (with the cost of being a less "pure" flatbuffers format since bytes blobs are used liberally).
  • The root type is a Plan rather than a Relation: instead of expressing a value, it is a specification of a side effect which includes the destination for output rows.

@bkietz changed the title to [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion Aug 13, 2021
@apache apache deleted a comment from github-actions bot Aug 13, 2021
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Contributor

@cpcloud cpcloud left a comment


Looks great. Just a few small comments.

@bkietz bkietz force-pushed the compute-ir branch 2 times, most recently from d7ed1fe to 767af50 Compare August 17, 2021 16:41
table InlineBuffer {
// ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers
// and other alignment sensitive blobs can be stored here
bytes: [ulong] (required);
Contributor

this has implications for languages that aren't C++ (I don't think the ergonomics of converting long[] to byte[] are quite as easy in Java).

Member Author

I think this can be resolved by taking a leaf out of parquet2's book and defining InlineBuffer as a union of vectors of each primitive type, I'll try that.

Contributor

I'm not sure unions in FB are equivalent to unions in C++ (I don't know that if you select a byte union, you end up sharing the same memory).

Contributor

you might be able to guarantee alignment by having a required long followed by a bytes: [byte] field

Contributor

@emkornfield emkornfield Aug 17, 2021

actually, I think creating a struct { ulong padding_for_struct_alignment, int padding_for_byte_align, bytes: byte[] } and making that a member should be guaranteed to align.

Member

I notice the ForceVectorAlignment function in C++

https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L1755

We should look at whether

table InlineBuffer {
  padding_for_alignment:ulong = 0 (required);
  bytes:[ubyte] (required);
}

guarantees bytes to be aligned

Member Author

I've tried an approach inspired by arrow2's decision to include the primitive type being stored in buffers (and thus also alignment information) all the way down to struct Bytes

https://github.com/jorgecarleitao/arrow2/blob/main/src/buffer/bytes.rs#L39

I think this guarantees alignment without requiring padding fields or reinterpret_casts
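A minimal sketch of that union-of-typed-vectors shape (the table and field names here are hypothetical illustrations, not the actual patch):

```
// Hypothetical sketch only. Storing a buffer as a vector of its logical
// primitive type lets the flatbuffers builder align the data for that type,
// with no padding fields or reinterpret_casts.
table Int8Buffer    { items: [byte]   (required); }
table Int64Buffer   { items: [long]   (required); }
table Float64Buffer { items: [double] (required); }

union InlineBufferImpl { Int8Buffer, Int64Buffer, Float64Buffer }

table InlineBuffer {
  impl: InlineBufferImpl (required);
}
```

One table per primitive type would be needed in practice; only three are shown here for brevity.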

}

table Expression {
impl: ExpressionImpl (required);
Contributor

what is this indirection for?

Member Author

The generators for some languages don't support vector-of-unions. I'll add a comment explaining that

Member

Per below, if you lift the output type (field) into Expression (if we agree that every Expression needs an output Field — name and type), then you can simply call this ExpressionOp, since then Expression serves more purpose than simply being a wrapper for Expression.impl

Member Author

I agree that every Expression needs a type, but I disagree that name is meaningful for Expressions. To my mind a name is ascribed to the referent by the referring entity; to speak in graph theory, it's an edge property rather than a node property. To give a practical context: in a projection like SELECT $complicated_expr as A, $complicated_expr as B it should not be an error to use a single memoized instance of $complicated_expr, but including name as an expression property would make these two incompatible.


/// A specification of a query.
table Plan {
/// One or more output relations.
sinks: [Relation] (required);
Member

What is the interpretation of multiple outputs (versus a single output)?

Member Author

This is a difference in execution model for this proposal (for which rpc_service Interactive is provided semi-didactically to clarify).

If the root_type is a TableExpr which evaluates to batches, that implies that there is a channel open between the consumer and the producer along which those batches can be returned. This is frequently the case but in general I think we'll want to be able to express execution plans which don't rely on an interactive channel. For example: Plans generated as fragments of larger plans in distributed execution or plans which represent an ETL job.

Therefore I think it's preferable that a Plan explicitly include the destination for all batches, even if that will quite commonly be operation=InteractiveOutput (just pipe them back to the user).

Only a single instance of InteractiveOutput is permitted in a Plan (so no interactive producer will need to deinterleave batches piped back along the interactive channel, which I think was your concern here). However any number of other sinks are permitted. For example, a Plan may specify that one set of batches be streamed to tcp://somehost.com:890 for consumption by a service on that host and an unfiltered superset of those batches cached locally into file://tmp/cache/my_query for debugging.

Contributor

In general, this feels like it is beyond the scope of a plan and is more about a specific read or write relation as well as communication of plan between systems. Let's try to avoid baking those concepts into the core plan.

}

union Function {
CanonicalFunction, NonCanonicalFunction
Member

Presumably we would want to add a UserDefinedFunction here which is able to pass an inline buffer containing the serialized function

Member Author

I'm not sure how that's distinct from providing a NonCanonicalFunction with the serialized function in the options blob

Contributor

NonCanonicalFunction and UserDefinedFunction are synonymous here. NonCanonicalFunction I think is probably a better name as it leaves no room for ambiguity.

Contributor

@jacques-n jacques-n Aug 26, 2021

Per my comments elsewhere, if we want to support serialized functions, we should make them structured, not just a bucket of bytes. For example:

{
  type: "python_pickle",
  argument_types: [int],
  output_type: "int",
  dynamic: "false",
  maintains: [sort, cluster, distribution],
  python_prerequisites: [pyarrow, pytorch],
  bytes: <bytes>
}

The structure should be such that:

  • tools don't have to know how to decode the bytes to do plan transformations
  • transformations can move operations without having to worry about invalidating correctness (for example move where distribution of data is happening in a plan)
  • the plan can be validated without doing a byte decoding
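In the flatbuffers IDL used elsewhere in this proposal, a structured serialized function along those lines might be sketched as follows (every name here is hypothetical, invented for illustration):

```
// Hypothetical sketch only; field names are illustrative, not from the patch.
table SerializedFunction {
  /// encoding of `body`, e.g. "python_pickle"
  type: string (required);
  argument_types: [org.apache.arrow.flatbuf.Field];
  output_type: org.apache.arrow.flatbuf.Field;
  /// physical properties the function maintains (e.g. "sort", "distribution"),
  /// so tools can move it during plan transformations without decoding `body`
  maintains: [string];
  /// the opaque serialized function itself
  body: [ubyte] (required);
}
```

The point of the structure is exactly the three bullets above: everything a plan transformer or validator needs lives in typed fields, and only an executor ever decodes `body`.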

Contributor

I think that makes sense. We'll have to do a lot more legwork to make user-defined structures a thing.


/// Parameters for `operation`; content/format may be unique to each
/// value of `operation`.
options: InlineBuffer;
Member

Having the options as a nested / independent flatbuffer for all canonical operations still makes me squirm. What do you think about putting a union of options for the canonical types in CanonicalOperation so that dealing with the InlineBuffer for "built-ins" is not necessary?

Member Author

-1. This adds yet more special casing for the canonical operations and makes it harder to write generic pattern matching utilities while also giving us another discriminant to query and validate.

Contributor

Another pattern might be to make CanonicalOperationId "CanonicalOperation", which is a union of the options. It doesn't fully get out of slightly harder pattern matching code, but it would eliminate a separate discriminant.

Contributor

I think there's a tradeoff here. Having two code paths for handling canonical versus non-canonical increases complexity, since a consumer now has to handle canonical things in addition to still handling InlineBuffer. If we're going to allow InlineBuffer, I don't think there's a good reason not to have just a single way to deal with options.

Contributor

Since we don't yet have a very specific idea of how people are going to use this, I think we should leave this as InlineBuffers.


/// Parameters for `function_name`; content/format may be unique to each
/// value of `function_name`.
options: InlineBuffer;
Member

Per Operation below, I might prefer to have the opaque options only in the NonCanonicalFunction table. In the CanonicalFunction, if needed (it may not be yet), we could add a union-of-tables providing function-specific options without the need for InlineBuffer.

Product,
Sum,
Tdigest,
Quantile,
Contributor

We can haggle on the details later but I don't think quantile is well supported enough to be a canonical function.

Mode,
Product,
Sum,
Tdigest,
Contributor

I think for now maybe we leave out everything except the bare bones functions that we need.

Contributor

@jacques-n jacques-n left a comment

Did a first pass here. A bunch of details, but I'd say at a high level I like the direction in general but am -1 on the places where we just have buckets of bytes. I'd like to see a lot more definition around how and what people can customize. I also think the coupling of IR version to the introduction of each new canonical function is a mistake and have added comments to that effect.

/// A canonical (probably SQL equivalent) function
//
// TODO: variadics
enum CanonicalFunctionId : uint32 {
Contributor

I don't think specific functions should be part of the IR. A couple of reasons:

  • Each of these will actually have many variations and each should be identified separately: add(int,int) => int vs add(bigint,bigint) => bigint.
  • The speed at which functions are created/added should happen at a much more rapid pace than changes to the core IR.
  • Just because new functions are introduced doesn't mean we should change the format version number. The absence of a function in a particular consumer shouldn't really matter.

Contributor

My suggestion is ids here, plus a separate yaml or similar structured doc that lists canonical functions and includes not only structured data around types etc. but also description/details around specific behavior.
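A hedged sketch of what one entry in such a structured inventory might look like (every field name here is hypothetical, invented for illustration):

```
# Hypothetical canonical-function inventory entry; not part of the proposal.
- id: 7                      # stable integer referenced from the IR
  name: add
  description: element-wise addition of two numeric arrays
  signatures:
    - args: [int32, int32]
      output: int32
    - args: [int64, int64]
      output: int64
  null_handling: propagate   # a null in either argument yields a null
```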

Contributor

That seems like it will be just as hard to maintain as appending to an enum.

Each of these will actually have many variations and each should be identified separately add(int,int) => int vs add(bigint,bigint) => bigint.

I don't follow why operations with multiple overloads need to be dealt with at all in the IR. Wouldn't a function have a singular definition (or be singularly derivable) for a given IR?

Other than performance concerns about passing lots of strings around, what's the problem with a name and an optional namespace string for every function?

Member

My suggestion is ids here, a separate yaml or similar structured doc that lists canonical functions that includes not only structured data around type, etc but also description/details around specific behavior.

I think for scalar functions, since there are so many (but not necessarily relational operators), I also prefer this approach to having all the functions listed in an enum on the Flatbuffers file, so that we can have an append-only yaml file with all the functions. Using an integer id to identify a function versus a string makes the IR smaller (good — only ever need 4 bytes, even only 2 bytes, to identify a function) and implementations marginally less complicated since many engines will have string identifiers for functions that are different than the "canonical" names in our function inventory. The inventory of scalar functions is likely to grow very fast and not having to modify the Flatbuffers files would be beneficial

Just because new functions are introduced doesn't mean we should change the format version number. The absence of a function in a particular consumer shouldn't really matter.

I agree with this

Each of these will actually have many variations and each should be identified separately add(int,int) => int vs add(bigint,bigint) => bigint.

I see arguments both ways on using a different function id for each overload of a function. The main annoyance I would see in having a different id for each overload of a function is that IR implementations would have to maintain a huge IR mapping table, versus using a common id for all variants of "add" for example. On the other hand, if there is a different id for every overload, then it leaves no ambiguity about what the input and output types should be. But that is a massive scope increase for this project to enumerate all the function signatures of every function overload contemplated...

Contributor

But that is a massive scope increase for this project to...

I guess I don't really see this as a massive increase in scope. Maybe it's just about being more formal sooner? Without being formal about this, I fear that this will become a singly-used representation and the implementation will define the specification, rather than the other way around.

...enumerate all the function signatures of every function overload contemplated...

Part of it to me is also that this will be a growing list. It doesn't have to start with all possible functions, only the functions we want to initially specify. I would expect adding new functions would be relatively straightforward.

I don't follow why operations with multiple overloads need to be dealt with at all in the IR. Wouldn't a function have a singular definition (or be singularly derivable) for a given IR?

From an engine implementation point of view, as an example, I feel like decimal division has much more in common with decimal multiplication than it does with integer division. It also has a very different output type resolution system. Overloading a single concept of division to state different output type derivation systems seems quite a bit more complex than simply saying that there is no "overloading". To me, we simply need to consider each function's key to be the name + the input argument types. Sure, two functions may have the same "name" but that doesn't mean they have the same key (or have anything to do with each other).

Contributor

@cpcloud cpcloud Aug 30, 2021

Are we actually sure this is possible to do for all of the types we want to support?

What is the key for, say, the == operator for a complex type like list or struct?

I don't think a wildcard type is well-defined here without more clarification. For example, List<T> can only be compared with List<U> if T == U, but if T != U the operation is undefined.

Unnest is another example.

Without a type system that handles generics, you can't write down the type of all possible instantiations of any type that has a type parameter, such as list, map, and struct.

What is the issue with having a list of functions in some structured format, that indicates the canonical name of the function and its arity?

If a producer sends over a call to the add function with input types int32, int32 and output type int32, then the consumer would look that up, and if it's able to execute that IR, then it does and if it's not able to do so it returns an error.

}

/// A table write
table Write {
Contributor

I think Read & Write should be removed from this patch. It seems like we need to spend a decent amount of time deciding what kind of structure could be there. A required string (and string parsing?) seems like the wrong place to start.

Contributor

Agreed, though I don't think IR is that useful without Read. I'm not sure if there's actually a common denominator here.

I will look around at the different ways this has been done. So far, it looks like DuckDB implements different sources as table producing functions, maybe that's a good place to start generalizing from.

Contributor

I'd suggest we start with just a read of a path of Parquet files. It seems like a simple set of properties and a realistic initial use case.

@jacques-n
Contributor

Additional comments: given this interest in targeting physical engines, it is weird that we are missing common physical operations like HashJoin, MergeJoin, HashAggregate, StreamingAggregate, TopN (order + limit operator), as well as a bunch related to RedistributeSend, OrderedRedistribute, UnorderedReception, OrderedReception, etc.

I also think we need to go to each operation and declare the following:

  • Properties each operation maintains. (e.g. if you have Read > Sort > Filter > Write, is the write expected to be sorted?)
  • The order of the output fields/columns. For example, Join (A, B), Project(a+b, c+d), etc.

Comment on lines 192 to 195
ASCENDING_THEN_NULLS,
DESCENDING_THEN_NULLS,
NULLS_THEN_ASCENDING,
NULLS_THEN_DESCENDING
Contributor

I think we also need the ability to control NaN ordering here in the case of floating point columns, i.e. some systems treat NaN as larger than all other values but smaller than null; others do it differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I think there's probably an argument to split these options out into 3 different enums: one for non-null ordering, one for null ordering, and one for NaN ordering.

Contributor

We may also need an enum for the ordering of NaN with respect to null, i.e. is it NaN then null then ascending values, or null then NaN then ascending values?


Contributor

I figured there'd be some engines that handle NaN ordering differently, e.g. pandas with a float64 dtype column allows positioning NaN similarly to what we're allowing for nulls here, using an na_position argument. This does work as expected when using the experimental Float64 nullable type, though.

Member

@wesm wesm left a comment

Handful of comments. I will leave some comments later on Relation.fbs

There are some difficult questions here that merit additional analysis / feedback from others. If we're taking a more rigid stance with respect to output type derivations, especially with built-in functions, then I agree that moving that information outside of the serialized format makes sense. A reasonable metric for what constitutes "built-in" or "canonical" would therefore be that there is a well-determined set of output type derivations (the derivations might have to allow for wildcard types in some cases, e.g. * == * -> boolean)



/// A canonical (probably SQL equivalent) function
enum CanonicalAggregateId : uint32 {
All,

I agree that whatever decision is made with the scalar functions should be consistent here


/// Boundary is following rows, determined by the contained expression
table Following {
impl: ConcreteBoundImpl (required);

I agree that where a table is being introduced solely to work around Flatbuffers' union issues, calling it Wrapper would make it more clear. The rule then would be to never use a ThingWrapper as a member of any other table, only where you need to put the wrapped union in an array or serialize it as a top-level Flatbuffers object.
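For illustration, the naming rule might look like this (table and union names are hypothetical, not taken from the PR):

```
// Illustrative sketch: FlatBuffers unions can't be placed directly in
// a vector or used as a root type, so a wrapper table exists solely
// for those positions.
union Bound { Preceding, Following, CurrentRow }

// Used ONLY where a bare union is not allowed, never as an ordinary
// member of another table:
table BoundWrapper {
  bound: Bound (required);
}

table Frame {
  // a vector of unions is exactly the case that forces the wrapper
  bounds: [BoundWrapper];
}
```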

Cast,
Extract,
WindowCall,
AggregateCall,

I think it would be useful to be able to serialize and send e.g. sum($expr_0) / mean($expr_1) (with these expressions being possibly unbound to a particular table schema) without having to build an aggregation relational operator. If aggregate function calls are "different" from scalar calls, then the type system to achieve this is probably a bit more complex; curious if you have ideas.

/// This is a field, because the Type union in Schema.fbs
/// isn't self-contained: Fields are necessary to describe complex types
/// and there's currently no reason to optimize the storage of this.
type: org.apache.arrow.flatbuf.Field;

I think this is an important question — connected to the above question about enumerating function signatures — and should be raised and discussed more broadly on the mailing list and try to include the larger group of people who commented in the Google Doc that I made originally.

For built-in / canonical functions where we expect $func($type_0, ...) to yield a deterministic output type, there isn't much motivation to serialize the output type — you would only want to put the output type there when it is adding useful information. If you always put it there, you're paying the cost of serializing and deserializing a Field for every expression. An IR producer could run in "verbose" mode and put all the output types (if it were useful to a IR consumer that doesn't have the type derivation logic)

I suspect that there will be a need to build an inventory of function signatures to reduce ambiguity and to make things more straightforward for IR implementations (for example, an implementation could read the input/output type rules from a text file — or generate code — rather than having to enter the type derivations by hand)
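One way to express the "verbose mode" idea is to make the derived type an optional annotation rather than a required field. A hypothetical sketch (field names illustrative, not the PR's actual schema):

```
// Illustrative only: the output type as an optional annotation.
table Call {
  function_id: uint32;
  /// Optional derived output type. A producer running in "verbose"
  /// mode populates this for consumers that lack type derivation
  /// logic; when absent, the consumer derives the type from the
  /// function signature inventory instead.
  output_type: org.apache.arrow.flatbuf.Field;
}
```

This keeps the common case cheap (no Field serialized per expression) while still allowing a producer to spell out every type when that helps a simpler consumer.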

@jacques-n

I think it would be useful to be able to serialize and send e.g. sum($expr_0) / mean($expr_1) (with these expressions being possibly unbound to a particular table schema) without having to build an aggregation relational operator

I'm assuming you mean that you want to avoid having to build a project on top of an aggregate? (Please confirm my interpretation of what you said.)

My initial intuition is that supporting arbitrary expressions in aggregation creates more complexity (and a heavier requirement on semantic analysis to confirm plan validity). I agree that there are situations where you might want a compound relational operation that does an aggregation calculation followed immediately by a non-aggregate calculation (e.g. the division in your expression). However, I think that "compound aggregate" would be an additional relational operator we could introduce at a later stage as opposed to being one of the initial primitives (or possibly even being an internal concern/optimization of a particular execution engine).
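The "project on top of an aggregate" shape could be sketched as follows (table and field names are illustrative, not the PR's actual schema, and the union-in-vector wrapper issue is glossed over):

```
// sum($expr_0) / mean($expr_1) expressed as two relational operators:
// an Aggregate producing the raw measures, then a Project computing
// the division over the Aggregate's output columns.
table Aggregate {
  rel: Relation (required);
  groupings: [Expression];
  measures: [Expression];   // restricted to direct aggregate calls
}

table Project {
  rel: Relation (required);    // here: the Aggregate above
  expressions: [Expression];   // here: divide(field(0), field(1))
}
```

Keeping measures restricted to direct aggregate calls makes plan validation simple; the scalar arithmetic lives entirely in the Project.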

@cpcloud cpcloud changed the title [Format] Arrow Compute Serialized Intermediate Representation draft for discussion ARROW-14062: [Format] Initial arrow-internal specification of compute IR Sep 21, 2021
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@cpcloud cpcloud closed this in ce34ea1 Sep 21, 2021
@bkietz bkietz deleted the compute-ir branch September 22, 2021 14:59
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Closes apache#10934 from bkietz/compute-ir

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Signed-off-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>