-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[draft] Add LogicalType
, try to support user-defined types
#11160
Conversation
fa0ef9e
to
c30746c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lot's to absorb here, so I'll be looking closer soon. But left some initial observations.
pub mod fields; | ||
|
||
#[derive(Clone, Debug)] | ||
pub enum LogicalType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably a discussion to be had about what level we want logical type to be here. We previously discussed this, but I can't find that discussion anymore. (@alamb might remember.)
Some notes:
- IMO the
Large
list/binary/string variants should be considered encodings of their respective logical types. IMO considering these different types is one of the biggest sources of friction in the Arrow ecosystem and one I would hope would be solved by a - There's probably a good argument for signed and unsigned integers, as well as decimals, to be grouped together as more general types. That is,
DataType::Int8
andDataType::Int64
could both be an encoded version ofLogicalType::Int
. Date32
andDate64
might be consolidated.Date64
is a weird type that exists for Pandas compatibility, and doesn't actually provide more precision.- Types where the choice provides meaningful differences in precision should be preserved as logical types. For example, I think
Float64
andFloat16
shouldn't be considered the same logical type. Similarly for timestamp precisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally agree with all the points discussed and this matches my intuition.
The current state of the PR mostly reflects the state for the original proposal in #8143 as I didn't want to include any additional changes without discussing them first. I'm not too familiar on how these changes should be discussed and if these points need to be also circulated in the mailing list. Let me know what's best :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably a discussion to be had about what level we want logical type to be here. We previously discussed this, but I can't find that discussion anymore. (@alamb might remember.)
I also remember but can't find this discussion
As I recall, the conclusion was that types that can store different ranges of values shoudl be different logical types.
- So that would mean
Int32
should be different thanUInt32
- I don't recall consensus about String vs LargeString, but I agree that treating these as the same logical type with different phsical encodings would be the easiest to reason about (even though theoretically LargeString can represent strings larger than 2GB that can not be represented in String)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added support for treating large types as their non-large logical type. I agree that types that can store different ranges of values should be different logical types
pub trait ExtensionType: std::fmt::Debug { | ||
fn display_name(&self) -> &str; | ||
fn type_signature(&self) -> TypeSignature; | ||
fn physical_type(&self) -> DataType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this mean we need to have a variant of each extension type for each physical storage type?
A good built-in example is string. Right now LogicalType::String
always returns DataType::String
, but arguably could be DataType::StringView
or DataType::LargeString
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup! Any LogicalType which has different physical representations should probably be implemented through multiple ExtensionTypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored the code a bit to keep a reference to the physical type.
I plan to check this out shortly -- thanks @notfilippo |
use std::sync::Arc; | ||
|
||
#[derive(Debug, Clone, PartialEq, Eq, Hash)] | ||
pub struct TypeSignature { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have existing TypeSignature for function
pub enum TypeSignature { |
Therefore, I think what we need is rewriting existing signature with LogicalType, instead of introducing another similar concept here. Or we create a similar one and gradually replace the existing signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @notfilippo for reviving this PR.
I think this PR does a pretty good job of sketching out the structures required for logical types.
One thing I think would help me understand how complete / not complete this proposal is would be to show how we would use this code to create a user defined type. Would it be possible to sketch out an example, perhaps in https://github.com/apache/datafusion/tree/main/datafusion-examples, showing how we would use LogicalType to create a User Defined Type?
Obviously if we pursue this approach, there will be substantial churn for all downstream crates.
cc @yukkit @lewiszlw @samuelcolvin and @jayzhan211
Also this could be related to StringViewArray #10918 / @XiangpengHao
pub mod fields; | ||
|
||
#[derive(Clone, Debug)] | ||
pub enum LogicalType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably a discussion to be had about what level we want logical type to be here. We previously discussed this, but I can't find that discussion anymore. (@alamb might remember.)
I also remember but can't find this discussion
As I recall, the conclusion was that types that can store different ranges of values shoudl be different logical types.
- So that would mean
Int32
should be different thanUInt32
- I don't recall consensus about String vs LargeString, but I agree that treating these as the same logical type with different phsical encodings would be the easiest to reason about (even though theoretically LargeString can represent strings larger than 2GB that can not be represented in String)
datafusion/common/src/scalar/mod.rs
Outdated
type Error = DataFusionError; | ||
|
||
/// Create a Null instance of ScalarValue for this datatype | ||
fn try_from(data_type: &LogicalType) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would tend to agree that ScalarValue should follow LogicalType -- in fact we have had several issues when scalar value types don't quite match (like a logical string encoded as ScalarValue::String
was compared to a DictionaryArray
and was treated as a different type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense. I'll add a note to refactor ScalarValues at some point
c30746c
to
f8facb1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool @notfilippo
In addition to supporting user defined types, I think adding the notion of a LogicalType would make it easier for DataFusion to take better advantage of different encodings (specifically RunArray, StringViewArray and DictionaryArray)
We already have ok support for DictionaryArray but it is somewhat unnatural and required adding special DataType::Dictionary support all over the codebase and had various bugs like #11145 and #11032 that resulted when we missed support in some non obvious place
As we begin adding support for StringViewArray
in #10918 the same pattern is emerging
That is all to say, I think we should pursue this idea
The biggest challenge of making this kind of change I think will be to manage the rollout and migration with downstream crates / make the transition as smooth as possible
So my suggestions for next steps are:
- Try to update some downstream dependency to see what the change might require. Perhaps we can try with https://github.com/apache/datafusion-comet as that is open source
- Write a more detailed plan (the proposal seems similar to what is in [Proposal] Support User-Defined Types (UDT) #7923) for what we are going to change (with a specific section "What this would mean for downstream crates")
I think I can find time to help with the writeup and communication, and maybe the code. I want to finish up some other inflight work first (specifically finishing porting aggregate functions, cleaning up parquet statistics, and the user defined sql parser)
let df = ctx.sql("SELECT * FROM example").await?; | ||
let records = df.collect().await?; | ||
|
||
println!("{}", pretty_format_batches(&records)?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the primary usecase of the LogicalType so that we could write functions that take the logical type, or define custom behavior?
Like I wonder is the idea that we could now create a ScalarFunctionImpl
whose signature refers to LogicalType rather than PhysicalType 🤔 Or somehow plan a binary operation on logical types?
BTW I think the work we are doing with @samuelcolvin @dharanad and @jayzhan211 in #11207 would make it straightforward to implement a custom comparsion (via a function) for this magical type
That might be a good thing to show too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I wonder is the idea that we could now create a ScalarFunctionImpl whose signature refers to LogicalType rather than PhysicalType 🤔 Or somehow plan a binary operation on logical types?
Potentially both. The motivation behind this change is to simplify the interaction between standard types represented by different encodings (like RunArray, the various Views and DictionaryArray but potentially also user defined ones via Arrow Extension Types).
Completely agree. I will try to experiment with the user facing APIs (e.g. what's returned by the |
Thanks @notfilippo -- I will try and get the other projects I have under way to a better state so I can more fully help plan / communicate / coordinate this one. |
Sounds good! Feel free to follow up here / on slack / on discord 😄 I've also noticed that this potential change could greatly benefit the substrait encoding / decoding of the logical plan. Its current implementation has troubles dealing with dictionaries. I'll look into that as well while waiting for further instructions. |
@alamb, @wjones127, @jayzhan211 -- I've found some time to finally draft a proposal: [Proposal] Decouple logical from physical types (this draft PR was updated to DataFusion v40 in order to test |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Proposal
The proposal now lives here: #11513, the original draft PR follows:
Which issue does this PR close?
Closes #7923
Follows up on #8143, which is stale.
In the current state the PR is a draft implementation to validate the idea based on the discussion from #7421.
New additions
LogicalType
enum.ExtensionType
trait. Abstraction for extension types.TypeSignature
struct. Uniquely identifies a data type.LogicalSchema
&LogicalField
, equivalent to arrow'sSchema
andField
without changing much of the logic of of
DFSchema
. In next iterationsDFSchema
andLogicalSchema
could potentially merge.Major changes
DFSchema
usesLogicalSchema
&LogicalField
.ExprSchemable
andExprSchema
now useLogicalType
.ast
to logical plan conversion now usesLogicalType
.To be implemented
ContextProvider
.To be determined
Note
Most of these open questions remain similar to the initial PR.
ScalarValue
useLogicalType
or arrowDataType
?LogicalType
DataType
TableSource
returnLogicalSchema
or arrowSchema
?Schema
LogicalSchema
(for now)