Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Decouple logical from physical types #12622

Open
3 of 6 tasks
notfilippo opened this issue Sep 25, 2024 · 14 comments
Open
3 of 6 tasks

[EPIC] Decouple logical from physical types #12622

notfilippo opened this issue Sep 25, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@notfilippo
Copy link
Contributor

notfilippo commented Sep 25, 2024

Is your feature request related to a problem or challenge?

This epic tracks an ordered list tasks related to the proposal: Decouple logical from physical types (#11513). The goal is:

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and physical execution.

LogicalPlans will use LogicalType while PhysicalPlans will use DataType.

Describe the solution you'd like

Make ScalarValue values logical:

@notfilippo notfilippo added the enhancement New feature or request label Sep 25, 2024
@notfilippo
Copy link
Contributor Author

notfilippo commented Sep 26, 2024

I would like to organise a call so we can discuss the plan of action of this epic (as proposed #12536 (review) and in the ASF slack). In the meantime I'll try to collect as many open questions as possible and outline a meeting document.

cc @alamb , @findepi , @jayzhan211 , @ozankabak and anyone else interested in this effort.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Sep 26, 2024

I suggest to propose the plan directly, since it requires thinking to response, maybe not that efficiently to work in a call synchronously. And it would be more open to random people in the community.

@findepi
Copy link
Member

findepi commented Sep 26, 2024

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and physical execution.

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
see #11513 (comment)
and also @andygrove 's feedback from Comet perspective #11513 (comment)

We should also make it a goal that function are expressible directly on logical types. Otherwise we won't be able to use them in logical planning. While function invocation can continue to work on physical representation, it does not necessarily have to be function implementor's burden to cater for them. See #12635

@notfilippo
Copy link
Contributor Author

notfilippo commented Sep 26, 2024

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
We should also make it a goal that function are expressible directly on logical types.

While I support both ideas (one of them was also mentioned in the original proposal) I think that there is still some discussion to be had before making them goals of this effort (#11513 (comment), #11513 (comment), and #11513 (comment)).

I suggest we first define a plan (I'm drafting it now) for decoupling logical types from physical types (including the issue with functions that you are mentioning) and in parallel we can continue to validate this two ideas.

@findepi
Copy link
Member

findepi commented Sep 27, 2024

Thanks @notfilippo . I understand it's not your goal to remove Arrow types from physical plans.
It looks like we have same discussion in two places now. I won't copy my reply from #11513, it's here for reference: #11513 (comment)
TL;DR is. -- the goals impact the design; incremental improvements lead to local optimum.

@notfilippo
Copy link
Contributor Author

notfilippo commented Sep 30, 2024

What follows is my idea of the plan we might want to take to tackle this issue.


Status and goals

Current status

image

  • Currently Datafusion uses DFSchema (a qualified arrow::Schema) as the type source for both its logical and physical plan.
  • During execution a physical plan gets interpreted against some physical data, in the form of record batches which contain (if everything is correct) the same schema expected as input by the plan.

#11513 initial goal

image

  • Datafusion will be using a more generic schema as input for it's logical plan, containing only logical type information.
  • Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
  • Execution will remain unchanged.

"Record batches as the physical type source" goal

image

  • Datafusion will only use logical type information for both logical and physical planning
  • During execution the physical type information will be sourced by the record batches as the single source of truth.

How do we get there?

a.k.a. changing the tires of a moving vehicle.

image

Introducing logical type limitations internally while keeping inputs and outputs the same should be helpful in order not to introduce too many breaking changes at once. At a high level, we could keep the type sources unchanged and just gradually limit the internal behaviour of the engine in order to reason about logical types instead of physical.

Towards a Logical LogicalPlan

The Scalar type

image

Currently the LogicalPlan uses physical type information in the tree. This information comes from table scans and scalar values. We could initially look into modifying the scalar value in order to potentially represent fewer variants of types (from Utf8, LargeUtf8, Utf8View, Dictionary(Utf8) and more to just Utf8). But the physical type information required to be understandable by the current system still needs to be temporarily taken into account.

image

This would require the introduction of a scalar type which would track the physical type of the scalar in the current logical plan, wrapping the "logical" value of the scalar.

image

Casting to logically equivalent types should become straightforward.

This effort is being tracked in this epic in Make ScalarValue values logical section.

Introduce the logical type

Introduce LogicalType and the "logically equivalent" notion. Now ScalarValue will not have a data_type() method, instead it will return its LogicalType, and data_type() will be instead defined in Scalar. Currently there is some discussion in #11513 around how should it be represented.

Logical expressions

image

It's time to make Expr logical. While type sources like Scalar and TableScans's columns will still provide physical types, their interaction internally will happen via logical type comparisons.

What about functions?

image

UDFs should be able to the work in the logical type system. A user defined function should be able to:

  • Decide, given a signature, if a / some logical type can fullfill the requirements
  • Identify, given some logical inputs, a logical return type

Discussion is happening here: #12635

Logical LogicalPlan

image

Once expressions are logical we can move type sources to the logical plane of translate their type information from physical to logical upstream. (Like a TableProvider schema() (physical) call being immediately translated to logical).


👷 WIP "Record batches as the physical type source" goal

cc @alamb, @jayzhan211, @findepi lmk if this plan seems logical 😄

@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

#11513 initial goal

Datafusion will be using a more generic schema as input for it's logical plan, containing only logical type information.
Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
Execution will remain unchanged.

This seems like a good first step to me (and also a massive project in itself)

"Record batches as the physical type source" goal

I think it is wise to split this out into its own second goal (as it can scope down the first part). It is probably good to note that this means we won't have "adaptive schema" in DataFusion at runtime until the second part is complete

UDFs should be able to the work in the logical type system. A user defined function should be able to:

I think UDFs will need to retain physical information (at least in invoke() which manipulates the data directly)

All in all this sounds like an epic project. I hope we have the people who are excited to help make it happen!

@notfilippo
Copy link
Contributor Author

notfilippo commented Oct 7, 2024

I've opened #12793 in order to continue the effort according to the plan. cc @jayzhan211 @findepi

@jayzhan211
Copy link
Contributor

jayzhan211 commented Oct 8, 2024

I can work on the logical type that replace current function signature on main. #12751 (comment)

@notfilippo
Copy link
Contributor Author

I can work on the logical type that replace current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

@jayzhan211
Copy link
Contributor

I can work on the logical type that replace current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

Sure

@findepi
Copy link
Member

findepi commented Oct 9, 2024

What follows is my idea of the plan we might want to take to tackle this issue.

@notfilippo this (#12622 (comment)) is a very nice graphical representation of what we want to do. thank you

  • I was planning on introducing the logical types

@notfilippo awesome!
i was hoping we get logical types sooner than later, even if nothing uses them initially.
simple functions #12635 is currently blocked on this

@notfilippo
Copy link
Contributor Author

i was hoping we get logical types sooner than later, even if nothing uses them initially. Simple functions #12635 is currently blocked on this

I've opened #12853 targeting main.

@notfilippo
Copy link
Contributor Author

Can a committer merge #13016 on logical-types? cc @alamb @jayzhan211

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants