feat(logical-types): add NativeType and LogicalType #12853

notfilippo · 2024-10-10T14:50:53Z

This PR is part of #12622. It introduces the notion of DataFusion NativeType and the LogicalType trait.

The changes stem from various discussions of #11513 #12644 #12635 and other. The goal of this PR is to unify those discussions into a concrete code proposal to be merged to main [1] in order to unblock work on #12644 #12635.

[1] as opposed to the logical-types branch used currently for experimenting #12622

datafusion/common/src/types/logical.rs

notfilippo · 2024-10-10T14:52:49Z

datafusion/common/src/types/logical.rs

+impl PartialEq for dyn LogicalType {
+    fn eq(&self, other: &Self) -> bool {
+        self.native().eq(other.native()) && self.name().eq(&other.name())
+    }
+}
+
+impl Eq for dyn LogicalType {}
+
+impl PartialOrd for dyn LogicalType {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl Ord for dyn LogicalType {
+    fn cmp(&self, other: &Self) -> Ordering {
+        self.name()
+            .cmp(&other.name())
+            .then(self.native().cmp(other.native()))
+    }
+}
+
+impl Hash for dyn LogicalType {
+    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
+        self.name().hash(state);
+        self.native().hash(state);
+    }
+}


Should we implement these traits manually or should we leave to the implementors?

That's a good question. It all comes form Eq for me.
If we implement Eq by name (or name + native), then we disallow implementations from having any attributes that could affect their semantics (ie an implementation may have a field caching something, but it should effectively be a function of the name). And then we can implement other traits for the type as well.

And in fact this makes sense. We need a way to identify a type. We will store this information in the field metadata, so either type needs to be serializable, or it's name needs to be serializable and resolvable. The latter sounds like the way to go, especially given extension types (#12644), which won't be known at compile time

datafusion/common/src/types/logical.rs

datafusion/common/src/types/native.rs

jayzhan211 · 2024-10-11T00:15:49Z

datafusion/common/src/types/logical.rs

+pub type LogicalTypeRef = Arc<dyn LogicalType>;
+
+pub trait LogicalType: fmt::Debug {
+    fn native(&self) -> &NativeType;


I previously propose can_decode_to(DataType) -> bool, so given logical type and DataType, we can know whether they are paired.

How can we do the equivalent check by the current design?

Given say arrow Int64 data, i want to know whether these is numbers, timestamp, time, date or something else (eg user-defined enum). The fact that any of these hypothetical logical types could be stored as Int64 doesn't help me know. Asking logical type "could you please decode this arrow type?" doesn't help me know.
Thus, going from arrow type to logical type is not an option. We simply need to know what logical type this should be.

I think the idea is that we have LogicalType already. In logical level, they are either LogicalNumber, LogicalTimestamp or LogicalDate, and we can differ them in logical level. They can also decode as i64, i32 in physical level. So asking logical type "could you please decode this arrow type?" is to tell the relationship between logical type and physical type. We don't need to know whether the arrow i64 is number or timestamp, because we already know that.

I'm not sure I can follow. @jayzhan211 -- can you write a small practical example? I want to make sure I understand the use case. Thanks :)

impl From<DataType> for NativeType is enough for native type since we can know whether the ArrayRef matches the LogicalType we have. But for LogicalType::UserDefined, I think we need to define what kind of DataType it could be decoded to.

We can figure this out if we meet any practical usage.

For any user defined logical type you still know the backing native type (via the native() method), so you should be able to use the same logic to know if your DataType can represent that logical type.

datafusion/common/src/types/native.rs

datafusion/common/src/types/builtin.rs

datafusion/common/src/types/native.rs

datafusion/common/src/types/logical.rs

findepi · 2024-10-14T07:27:28Z

datafusion/common/src/types/logical.rs

+impl PartialEq for dyn LogicalType {
+    fn eq(&self, other: &Self) -> bool {
+        self.native().eq(other.native()) && self.name().eq(&other.name())
+    }
+}
+
+impl Eq for dyn LogicalType {}
+
+impl PartialOrd for dyn LogicalType {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl Ord for dyn LogicalType {
+    fn cmp(&self, other: &Self) -> Ordering {
+        self.name()
+            .cmp(&other.name())
+            .then(self.native().cmp(other.native()))
+    }
+}
+
+impl Hash for dyn LogicalType {
+    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
+        self.name().hash(state);
+        self.native().hash(state);
+    }
+}


That's a good question. It all comes form Eq for me.
If we implement Eq by name (or name + native), then we disallow implementations from having any attributes that could affect their semantics (ie an implementation may have a field caching something, but it should effectively be a function of the name). And then we can implement other traits for the type as well.

And in fact this makes sense. We need a way to identify a type. We will store this information in the field metadata, so either type needs to be serializable, or it's name needs to be serializable and resolvable. The latter sounds like the way to go, especially given extension types (#12644), which won't be known at compile time

findepi · 2024-10-14T07:30:08Z

cc @sadboy @milevin @wolfram-s @wizardxz

datafusion/common/src/types/logical.rs

…-types

findepi · 2024-10-18T11:07:54Z

@alamb @andygrove @comphead please take a look

comphead · 2024-10-18T16:37:21Z

I dont remember the roots, so wondering, can we investigate and use a single type system which is Arrow Types and get rid of other types. At the end of the day it is based on Arrow Types.

…-types

notfilippo · 2024-10-19T22:36:28Z

dont remember the roots, so wondering, can we investigate and use a single type system which is Arrow Types and get rid of other types. At the end of the day it is based on Arrow Types.

I can try to summarise the discussion that has happened in the last months regarding this proposal.
Currently DataFusion:

Doesn’t support extension types, both in the logical sense (e.g. JSON) and in the physical sense (e.g. a logical string natively is Utf8, LargeUtf8, Utf8View but ideally a user might want to also define a special physical type like List(u8) to be logically equivalent to a string)
Has a lot of redundant code to handle logically equivalent values during logical planning (i.e. ScalarValue, function signatures)
Doesn’t support any form of runtime adaptability and assumes that all record batches as input are of the same exact schema while potentially having all the infrastructure needed to be able to support late coercion of logically equivalent values during physical execution (which seems to be a problem that comet is also looking to solve)

while “arrow datatype everywhere” is definitely working for DataFusion currently, my opinion is that this is a needed step towards extensibility and it will help enterprise users looking to migrate their existing engine, custom file format and types to DataFusion, efficiently.

* Backport native and logical types from #12853 * Fix clippy error on wasmtest (#12844) --------- Co-authored-by: Jonah Gao <jonahgao@msn.com>

timsaucer

The code looks straightforward. I hesitate to hit approve since I'm not familiar with all of the details of the epic.

datafusion/common/src/types/native.rs

timsaucer · 2024-10-22T11:12:43Z

datafusion/common/src/types/logical.rs

+#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
+pub enum TypeParameter<'a> {
+    Type(TypeSignature<'a>),
+    Number(i128),
+}


As someone unfamiliar with the entire discussion of this feature, it's not clear to me from reading the above link to the extension-types of arrow how these type parameters map to the metadata described in the link.

Ideally everything could be encoded in the name using ARROW:extension:name but we might leave the option handle parameters in a more structured manner to allow better matching when executing.

findepi · 2024-10-22T12:15:01Z

datafusion/common/src/types/native.rs

+    /// has two children: key type and the second the value type. The names of the
+    /// child fields may be respectively "entries", "key", and "value", but this is
+    /// not enforced.
+    Map(NativeFieldRef),


as in #12853 (comment)

Map should be something like Map(key_type, value_type)

value_type should be any type, not necessarily native.

That's a good point. It should be LogicalTypeRef here and in the NativeField type then (which I guess will become the LogicalField type).

Actually there is a bit of a problem translating DataType::Map to NativeType::Map since there is no way of retrieving the key and the value because of the "but this is not enforced." regarding the naming of the struct field types.

findepi · 2024-10-22T12:17:08Z

datafusion/common/src/types/logical.rs

+    ///
+    /// The `name` should contain the same value as 'ARROW:extension:name'.
+    Extension {
+        name: &'a str,


should be owned String, so that TypeSignature can live independently on its own

TypeSignature uses lifetimes just to make pattern matching easier when dealing with parameters (or lack thereof).

pub fn important_stuff(logical_type: &dyn LogicalType) { match logical_type.signature() { TypeSignature::Native(NativeType::Utf8) => todo!(), TypeSignature::Extension { name: "JSON", // <-- changing from &'a str to String leads to: expected `String`, but found `&str`, parameters: &[] } => todo!(), _ => unimplemented!() } }

notfilippo · 2024-10-25T14:21:26Z

@jayzhan211 I have a working prototype of the ideas we discussed above.

datafusion/common/src/types/native.rs

jayzhan211

I have no further concern so far, we can always add more function or modify it when we meet the use case.

datafusion/common/src/types/logical.rs

datafusion/common/src/types/native.rs

alamb · 2024-10-26T11:45:06Z

I have no further concern so far, we can always add more function or modify it when we meet the use case.

Thank you for reviewing and helping this project along @jayzhan211 🙏

notfilippo · 2024-10-26T16:40:32Z

Thanks a lot for the review @jayzhan211 . I've added all the suggestion mentioned :)

goldmedal

Thanks, @notfilippo. I took some time to understand the context of this topic. I only have two questions; everything else makes sense to me. We can merge it to continue with the follow-up work.

datafusion/common/src/types/native.rs

alamb · 2024-10-29T11:32:41Z

Shall we merge this PR?

notfilippo · 2024-10-29T12:14:23Z

Shall we merge this PR?

I think there is enough agreement to start working on the other pieces of the puzzle.

ozankabak · 2024-10-29T13:16:51Z

@findepi - what do you think?

goldmedal · 2024-10-31T16:34:43Z

If there are no more comments on this PR, I think we can merge it tomorrow.
@alamb @jayzhan211 @ozankabak - what do you think?

comphead · 2024-10-31T16:51:34Z

datafusion/common/src/types/logical.rs

+    /// (<https://arrow.apache.org/docs/format/Columnar.html#extension-types>)
+    ///
+    /// The `name` should contain the same value as 'ARROW:extension:name'.
+    Extension {


maybe we can add the comment why the extension is needed?

It is because arrow has extension type too.

Got there is a link to extension Arrow types above, it should be enough

comphead · 2024-10-31T16:53:54Z

datafusion/common/src/types/native.rs

+// mapping solutions to provide backwards compatibility while transitioning from
+// the purely physical system to a logical / physical system.
+
+impl From<DataType> for NativeType {


NativeType name might be confusing or we need more docs? I'm reading the code and first comes to my mind Arrow Types or maybe rust native types? I mean the name should be sufficient to understand what kind of native it is

Think of it as DataFusion’s type, or its built-in LogicalType. The term NativeType aligns with the concept of native types in Rust, so I believe it’s the preferred name.

If I read the structure doc

/// Representation of a type that DataFusion can handle natively. It is a subset /// of the physical variants in Arrow's native [`DataType`].

its is not a Rust native types its more related to Arrow Types which eventually come to rust types. I feel we need to clarify this a little bit, but I cannot come up with the better phrase right now, lets leave it for follow up PR

comphead

lgtm thanks @notfilippo and everyone.
But I still feel we need to improve the wording a little bit

jayzhan211 · 2024-11-03T03:22:44Z

I think this is ready to merge. Additional improvements can be made in a follow-up

alamb · 2024-11-05T17:03:56Z

Epic work!

[logical-types] add NativeType and LogicalType

3827974

github-actions bot added the common Related to common crate label Oct 10, 2024

notfilippo commented Oct 10, 2024

View reviewed changes

datafusion/common/src/types/logical.rs Outdated Show resolved Hide resolved

notfilippo commented Oct 10, 2024

View reviewed changes

Add license header

17a70d8

notfilippo mentioned this pull request Oct 10, 2024

[EPIC] Decouple logical from physical types #12622

Open

6 tasks

jayzhan211 reviewed Oct 11, 2024

View reviewed changes

findepi reviewed Oct 14, 2024

View reviewed changes

notfilippo added 2 commits October 15, 2024 09:38

Add NativeField and derivates

3dba963

Support TypeSignatures

4217970

notfilippo commented Oct 15, 2024

View reviewed changes

datafusion/common/src/types/logical.rs Outdated Show resolved Hide resolved

Fix doc

01f0089

notfilippo commented Oct 15, 2024

View reviewed changes

datafusion/common/src/types/logical.rs Show resolved Hide resolved

notfilippo changed the title ~~[logical-types] add NativeType and LogicalType~~ feat(logical-types): add NativeType and LogicalType Oct 16, 2024

notfilippo added 4 commits October 17, 2024 15:54

Add documentation

5b5f4c1

Fix doc tests

88e1b3c

Remove dummy test

ab16a2d

Merge remote-tracking branch 'origin/main' into fr/native-and-logical…

0b2ed2d

…-types

Merge remote-tracking branch 'origin/main' into fr/native-and-logical…

6150ea9

…-types

notfilippo added a commit to notfilippo/datafusion that referenced this pull request Oct 19, 2024

Backport native and logical types from apache#12853

4ec5fd3

notfilippo mentioned this pull request Oct 19, 2024

[logical-types branch] Backport native and logical types #13016

Merged

alamb pushed a commit that referenced this pull request Oct 21, 2024

[logical-types branch] Backport native and logical types (#13016)

861a8b7

* Backport native and logical types from #12853 * Fix clippy error on wasmtest (#12844) --------- Co-authored-by: Jonah Gao <jonahgao@msn.com>

notfilippo requested a review from jayzhan211 October 22, 2024 07:29

timsaucer reviewed Oct 22, 2024

View reviewed changes

findepi reviewed Oct 22, 2024

View reviewed changes

From NativeField to LogicalField

7ed7891

Add default_cast_for

bdd0155

notfilippo force-pushed the fr/native-and-logical-types branch from 1530c5b to bdd0155 Compare October 25, 2024 14:20

jayzhan211 reviewed Oct 25, 2024

View reviewed changes

datafusion/common/src/types/native.rs Outdated Show resolved Hide resolved

Add type order with can_cast_types

056a8f1

notfilippo requested a review from jayzhan211 October 25, 2024 15:58

jayzhan211 approved these changes Oct 26, 2024

View reviewed changes

jayzhan211 reviewed Oct 26, 2024

View reviewed changes

datafusion/common/src/types/logical.rs Show resolved Hide resolved

jayzhan211 reviewed Oct 26, 2024

View reviewed changes

datafusion/common/src/types/native.rs Outdated Show resolved Hide resolved

jayzhan211 reviewed Oct 26, 2024

View reviewed changes

datafusion/common/src/types/native.rs Show resolved Hide resolved

notfilippo added 3 commits October 26, 2024 18:06

Rename NativeType Utf8 to String

117f23b

NativeType from &DataType

e0923e2

Add builtin types

f1ff963

From LazyLock to OnceLock

7942f91

notfilippo force-pushed the fr/native-and-logical-types branch from d29afa1 to 7942f91 Compare October 26, 2024 18:26

goldmedal approved these changes Oct 27, 2024

View reviewed changes

datafusion/common/src/types/native.rs Show resolved Hide resolved

datafusion/common/src/types/native.rs Show resolved Hide resolved

comphead reviewed Oct 31, 2024

View reviewed changes

comphead approved these changes Nov 1, 2024

View reviewed changes

jayzhan211 merged commit b40a298 into apache:main Nov 3, 2024
24 checks passed

alamb mentioned this pull request Nov 5, 2024

Nov 5. 2024: This week in DataFusion #13265

Closed

3 tasks

alamb mentioned this pull request Nov 14, 2024

Why not use DataFusion as the backend engine (rather than rewriting it all from scratch)? Eventual-Inc/Daft#3282

Closed

feat(logical-types): add NativeType and LogicalType #12853

feat(logical-types): add NativeType and LogicalType #12853

Conversation

notfilippo commented Oct 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 Oct 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 Oct 19, 2024 • edited Loading

Choose a reason for hiding this comment

notfilippo Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

findepi commented Oct 14, 2024

findepi commented Oct 18, 2024

comphead commented Oct 18, 2024

notfilippo commented Oct 19, 2024

timsaucer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

notfilippo commented Oct 25, 2024

jayzhan211 left a comment

Choose a reason for hiding this comment

alamb commented Oct 26, 2024

notfilippo commented Oct 26, 2024

goldmedal left a comment

Choose a reason for hiding this comment

alamb commented Oct 29, 2024

notfilippo commented Oct 29, 2024

ozankabak commented Oct 29, 2024

goldmedal commented Oct 31, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

comphead Oct 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

comphead left a comment

Choose a reason for hiding this comment

jayzhan211 commented Nov 3, 2024

alamb commented Nov 5, 2024

jayzhan211 Oct 14, 2024 •

edited

Loading

jayzhan211 Oct 19, 2024 •

edited

Loading

notfilippo Oct 21, 2024 •

edited

Loading

comphead Oct 31, 2024 •

edited

Loading