Summary
Merge uses a special spread kernel that keeps track of past values for each column being merged. The behavior may be latched or unlatched, depending on whether the column is interpolated as continuous or discrete.
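As a minimal sketch of the difference between the two modes (not the actual kernel): the `spread` function and its `present` mask below are hypothetical names, and the sketch assumes that latched means "repeat the previous value on rows where this input has nothing new" while unlatched means "emit null on those rows".

```python
from typing import List, Optional, Sequence


def spread(values: Sequence[Optional[int]],
           present: Sequence[bool],
           latched: bool) -> List[Optional[int]]:
    """Place one input's values onto the merged rows.

    present[i] is True when merged row i has a value from this input;
    values holds exactly one entry per True in present, in row order.
    """
    out: List[Optional[int]] = []
    remaining = iter(values)
    last: Optional[int] = None  # state carried from row to row
    for has_value in present:
        if has_value:
            last = next(remaining)
            out.append(last)
        else:
            # Latched: repeat the previous value. Unlatched: emit null.
            out.append(last if latched else None)
    return out


latched_out = spread([5, 6], [True, False, True], latched=True)
unlatched_out = spread([5, 6], [True, False, True], latched=False)
print(latched_out)    # [5, 5, 6]
print(unlatched_out)  # [5, None, 6]
```

The `last` variable is the state that makes the kernel depend on the rows that came before it.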
The new partitioned code attempts to simplify the output types by using
where data keeps the individual columns. This allows us to work with datatypes rather than passing a separate Schema and individual columns around. It also simplifies accessing the columns of a merge output: rather than having to name each column and look it up by name, we can keep the columns in each side's output struct and access them by side and index. This avoids naming collisions.
However, a consequence is that we can no longer (with the current implementation) interpolate each column individually. Since the values are now in a struct array, each column is interpreted as part of the struct value at that time, so we cannot support latched and unlatched values within the same input.
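To make the limitation concrete, here is a rough sketch under assumed semantics. The field names `x` and `y`, the helper names, and the choice of `x` as continuous and `y` as discrete are all hypothetical. With one latch decision per struct, every field shares the same mode; with one decision per column, the modes can differ.

```python
from typing import Dict, List, Optional

Row = Optional[Dict[str, int]]  # None: this input has no row at that time


def spread_struct(rows: List[Row], latched: bool) -> List[Row]:
    """Spread whole struct values: a single latch decision for all fields."""
    out: List[Row] = []
    last: Row = None
    for row in rows:
        if row is not None:
            last = row
            out.append(row)
        else:
            out.append(last if latched else None)
    return out


def spread_columns(rows: List[Row],
                   latched: Dict[str, bool]) -> List[Dict[str, Optional[int]]]:
    """Spread each field on its own, with one latch flag per field."""
    out: List[Dict[str, Optional[int]]] = []
    last: Dict[str, Optional[int]] = {name: None for name in latched}
    for row in rows:
        if row is not None:
            last = dict(row)
            out.append(dict(row))
        else:
            out.append({name: (last[name] if latched[name] else None)
                        for name in latched})
    return out


rows: List[Row] = [{"x": 1, "y": 10}, None, {"x": 2, "y": 20}]

# Per column: x is carried forward, y is null on the middle row.
per_column = spread_columns(rows, {"x": True, "y": False})
print(per_column[1])                    # {'x': 1, 'y': None}

# Per struct: both fields are carried forward, or the whole row is null.
print(spread_struct(rows, True)[1])     # {'x': 1, 'y': 10}
print(spread_struct(rows, False)[1])    # None
```

Neither setting of the struct-level flag reproduces the mixed middle row that the per-column version gives.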
For example, if we have {a: 5} at 10:00 AM, {a: 6} at 11:00 AM, and {c: 4} at 10:30 AM, we likely expect:
But with interpolating each struct individually, we would see
Ideas
Fastest to parity:
The fastest way to get back to parity with the old merge behavior is likely to flatten the data back into individual columns. This may require refactoring to work with named columns rather than indices, but it is preferable to revert to existing behavior that we believe works.
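A minimal sketch of what flattening could look like, using the example values above. It assumes `a` and `c` come from different sides of the merge, and the `side.field` naming is a hypothetical way to keep names from colliding, not something the current code does.

```python
from typing import Dict, List, Optional

Row = Optional[Dict[str, int]]  # None: this side has no row at that time


def flatten(side: str, rows: List[Row],
            fields: List[str]) -> Dict[str, List[Optional[int]]]:
    """Turn one side's struct rows into one named column per field."""
    columns: Dict[str, List[Optional[int]]] = {
        f"{side}.{name}": [] for name in fields
    }
    for row in rows:
        for name in fields:
            value = None if row is None else row.get(name)
            columns[f"{side}.{name}"].append(value)
    return columns


# Merged rows at 10:00, 10:30 and 11:00.
left: List[Row] = [{"a": 5}, None, {"a": 6}]
right: List[Row] = [None, {"c": 4}, None]

columns = {**flatten("left", left, ["a"]), **flatten("right", right, ["c"])}
print(columns)  # {'left.a': [5, None, 6], 'right.c': [None, 4, None]}
```

Once the data is back in individual columns, each column can be spread with its own latched or unlatched setting.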
Future ideas:
The logical -> physical conversion may be able to detect interpolation of merge columns and add a project step after the merge that uses the spread kernel.
The benefits of this approach are:
Merge is now stateless, meaning we can perform concurrent merging
Spread is separated out of merge, significantly reducing merge pipeline complexity
Unlatched spread is the default behavior, so we wouldn't need functionality for that
We can visualize the spread functionality within the plan, rather than it being an opaque attribute of merge.
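As a rough illustration of the first benefit: if merge only computes the merged rows and which side contributed to each row, it is a pure function of its inputs, so independent batches can be merged concurrently. The sketch assumes integer timestamps (minutes since midnight) and batches that do not overlap in time; `merge_times` and the batch layout are hypothetical, not the actual merge implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Merged = Tuple[List[int], List[bool], List[bool]]


def merge_times(batch: Tuple[List[int], List[int]]) -> Merged:
    """Merge two sides' timestamps; return merged times and presence masks.

    No state is kept between calls, so batches can be merged in any order.
    """
    left, right = batch
    left_set, right_set = set(left), set(right)
    merged = sorted(left_set | right_set)
    return (merged,
            [t in left_set for t in merged],
            [t in right_set for t in merged])


batches = [
    ([600, 660], [630]),       # 10:00, 11:00 on the left; 10:30 on the right
    ([720], [720, 750]),       # a later, independent batch
]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(merge_times, batches))

for times, left_present, right_present in results:
    print(times, left_present, right_present)
```

The presence masks are what a later spread step would consume, which is where the state (the last seen value) would live instead.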
Open questions:
The implementation of spread(???) needs to determine "should I latch here".
It needs to distinguish between a) null because there was no Step 1 at this time (Step 2 existed and caused a row to be created during merge), and b) null because the aggregation cleared itself (through windowing).
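One possible way to tell the two cases apart, sketched under the assumption that merge can hand spread a presence mask alongside the values. This is only an illustration of the question, not a decision; `spread_latched` and `present` are hypothetical names.

```python
from typing import List, Optional


def spread_latched(values: List[Optional[int]],
                   present: List[bool]) -> List[Optional[int]]:
    """Latched spread that treats the two kinds of null differently.

    present[i] is False in case (a): the input had no row at merged row i.
    A None inside values is case (b): the input had a row, and it was null.
    """
    out: List[Optional[int]] = []
    remaining = iter(values)
    last: Optional[int] = None
    for has_row in present:
        if has_row:
            # Case (b) lands here too: a real null overwrites the latch.
            last = next(remaining)
        # Case (a): no row from this input, so keep whatever was latched.
        out.append(last)
    return out


# The input produced 3, was absent, produced a real null, was absent again.
out = spread_latched([3, None], [True, False, True, False])
print(out)  # [3, 3, None, None]
```

Without the mask, the merged column would be `[3, None, None, None]`, and the two kinds of null could not be told apart.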