feat: adopt kernel schema types #2495

roeap · 2024-05-09T14:11:01Z

Description

First pass adopting delta_kernel in delta-rs.

depends on delta-io/delta-kernel-rs#189 being merged and released.

This PR focuses on the schema types. Adopting the action types would be a follow-up, but might have an even greater blast radius than this one.

Related Issue(s)

part of #2489

Documentation

ion-elgreco · 2024-05-09T14:17:50Z

crates/core/src/kernel/arrow/mod.rs

-            ArrowDataType::Decimal128(p, s) => {
-                Ok(DataType::Primitive(PrimitiveType::Decimal(*p, *s)))
-            }
-            ArrowDataType::Decimal256(p, s) => DataType::decimal(*p, *s).map_err(|_| {


This arrow type is missing in kernel, I think that should be upstreamed there as well.

hmm, given that precision/scale are limited to 38, does that make sense? not entirely sure anymore but I think the 256 bit type only makes sense for larger p/s values.

It's mostly for convenience. Users might have their source data using decimal256, but still precision/scale below 38.

roeap · 2024-05-24T14:01:33Z

python/tests/test_writer.py

@@ -1458,7 +1458,7 @@ def test_invalid_decimals(tmp_path: pathlib.Path, engine):

    with pytest.raises(
        SchemaMismatchError,
-        match=re.escape("Invalid data type for Delta Lake: decimal(39,1)"),
+        match=re.escape("Invalid data type for Delta Lake: Decimal256(39, 1)"),


this change is a bit more significant than meets the eye - i.e. we are no longer trying to convert 256 bit decimals to a compliant type in delta. The main reason: for precisions that fit into 128 bit decimals, the user should really be using those; if the larger type is required, we cannot store it in the table anyhow.
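
To make the behaviour change concrete, here is a minimal sketch of the rule described above. It does not use the real arrow or delta_kernel types: `ArrowDecimal`, `DeltaDecimal` and `to_delta_decimal` are hypothetical stand-ins, and the exact validity checks (precision between 1 and 38, scale between 0 and the precision) are an assumption based on the limit of 38 mentioned in this thread.

```rust
/// Hypothetical, simplified stand-in for the two Arrow decimal types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowDecimal {
    Decimal128(u8, i8),
    Decimal256(u8, i8),
}

/// Hypothetical stand-in for a Delta decimal type.
#[derive(Debug, PartialEq)]
struct DeltaDecimal {
    precision: u8,
    scale: u8,
}

/// Assumed upper bound on precision, taken from the discussion above.
const MAX_PRECISION: u8 = 38;

/// Convert only 128 bit decimals; every 256 bit decimal is rejected,
/// even when its precision would fit into the 128 bit type.
fn to_delta_decimal(dt: ArrowDecimal) -> Result<DeltaDecimal, String> {
    match dt {
        ArrowDecimal::Decimal128(p, s)
            if (1..=MAX_PRECISION).contains(&p) && s >= 0 && (s as u8) <= p =>
        {
            Ok(DeltaDecimal {
                precision: p,
                scale: s as u8,
            })
        }
        // The Debug rendering of the rejected type becomes part of the message.
        other => Err(format!("Invalid data type for Delta Lake: {other:?}")),
    }
}
```

With this shape, the error text carries the Arrow type name rather than a Delta type string, which is the kind of change the updated test assertion reflects.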

roeap · 2024-05-24T19:42:38Z

crates/core/src/kernel/scalars.rs

+            Struct(fields) => {
+                let struct_fields = fields
+                    .iter()
+                    .flat_map(|f| TryFrom::try_from(f.as_ref()))
+                    .collect::<Vec<_>>();
+                let values = arr
+                    .as_any()
+                    .downcast_ref::<StructArray>()
+                    .and_then(|struct_arr| {
+                        struct_fields
+                            .iter()
+                            .map(|f: &StructField| {
+                                struct_arr
+                                    .column_by_name(f.name())
+                                    .and_then(|c| Self::from_array(c.as_ref(), index))
+                            })
+                            .collect::<Option<Vec<_>>>()
+                    })?;
+                if struct_fields.len() != values.len() {
+                    return None;
+                }
+                Some(Self::Struct(
+                    StructData::try_new(struct_fields, values).ok()?,
+                ))
+            }


@scovich - here is an example of creating a struct scalar in the wild.
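
For readers without the arrow types at hand, the core pattern in the hunk above (look up one value per field, and give up on the whole struct if any lookup fails) can be sketched with plain standard-library types. `Scalar` and `struct_scalar` below are hypothetical simplifications, not the delta-rs or kernel API.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified scalar type.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Integer(i32),
    String(String),
    Struct(Vec<(String, Scalar)>),
}

/// Build a struct scalar from a list of field names and one "row" of values.
/// Returns None as soon as any field has no value in the row.
fn struct_scalar(field_names: &[&str], row: &HashMap<&str, Scalar>) -> Option<Scalar> {
    // Collecting an iterator of Option<T> into Option<Vec<T>> short-circuits
    // on the first None, mirroring the `collect::<Option<Vec<_>>>()` above.
    let values = field_names
        .iter()
        .map(|name| row.get(name).cloned())
        .collect::<Option<Vec<_>>>()?;

    // Pair every field name with its value, preserving field order.
    let fields = field_names
        .iter()
        .map(|name| name.to_string())
        .zip(values)
        .collect();
    Some(Scalar::Struct(fields))
}
```

Calling `struct_scalar(&["id", "name"], &row)` yields a `Scalar::Struct` when both keys are present in `row`, and `None` when either is missing.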

alexwilcoxson-rel · 2024-05-28T15:05:01Z

crates/core/src/kernel/snapshot/mod.rs

@@ -315,8 +315,8 @@ impl Snapshot {
        let stats_fields = if let Some(stats_cols) = self.table_config().stats_columns() {
            stats_cols
                .iter()
-                .map(|col| match schema.field_with_name(col) {
-                    Ok(field) => match field.data_type() {
+                .map(|col| match schema.field(col) {


The stats columns are defined for nested fields like parent_a.child_b. Looking at kernel code, schema.field will just look up by this name and fail. That is the reason I introduced #2519.

Is #2519 appropriate for the kernel, or should we support the nested lookup here and parse the stats col? Other uses of field_with_name should be audited to ensure they don't accept a nested name.

good question, let me investigate.

@alexwilcoxson-rel - started asking around whether it would be legal to have a nested field defined in that property.

There are a couple of questions that also come to mind, and maybe you have an opinion? So using . as a separator is - AFAIK - just a convention (certainly a common one though). In delta one could for instance define a column name that does contain a . character under the columnMapping feature. In kernel we have some places where we do represent nested columns as dot separated, but we do already know that we eventually need an array representation.

One reason one might want to specify a nested field, is if we cannot compute stats for some child fields. There we could also just be lenient and simply omit these (actually not sure what we do now :D). Is that your case as well, or do you have others?

btw - the fact that .fields() does not do any parsing / traversing is by design. i.e. in this case we would need to split here...

So turns out nested fields are perfectly valid, but they do require some more parsing, as it is also legal to escape field names that contain special characters.

https://github.com/delta-io/delta/blob/4b102d34a2ce881b2a851b4c6cfbf2ab3ab5534f/spark/src/main/scala/org/apache/spark/sql/delta/DeltaConfig.scala#L549-L561

i opened #2572 to track this.
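
As a rough illustration of the extra parsing involved, the sketch below splits a stats column name into path segments. It assumes backtick quoting in the style of Spark SQL identifiers (a quoted segment may contain dots, and a doubled backtick inside quotes is a literal backtick); that quoting rule and the name `parse_column_path` are assumptions for illustration, not the kernel or delta-rs API.

```rust
/// Split a column path such as `parent_a.child_b` into its segments.
/// Dots inside a backtick-quoted segment do not act as separators.
fn parse_column_path(path: &str) -> Result<Vec<String>, String> {
    let mut segments = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = path.chars().peekable();

    while let Some(c) = chars.next() {
        // Look one character ahead to detect a doubled backtick.
        let next_is_tick = chars.peek() == Some(&'`');
        match c {
            '`' if in_quotes && next_is_tick => {
                // A doubled backtick inside quotes is a literal backtick.
                chars.next();
                current.push('`');
            }
            '`' => in_quotes = !in_quotes,
            '.' if !in_quotes => {
                if current.is_empty() {
                    return Err(format!("empty path segment in '{path}'"));
                }
                // Finish the current segment and start a new one.
                segments.push(std::mem::take(&mut current));
            }
            other => current.push(other),
        }
    }

    if in_quotes {
        return Err(format!("unterminated quote in '{path}'"));
    }
    if current.is_empty() {
        return Err(format!("empty path segment in '{path}'"));
    }
    segments.push(current);
    Ok(segments)
}
```

Under these assumptions, `parent_a.child_b` splits into two segments, while a quoted name containing a dot stays a single segment. Returning the segments as a list rather than a joined string matches the array representation mentioned earlier in this thread.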

In our case, if nested fields are omitted, that's fine; we only care about the stats on the root fields for query performance.

We only included the nested fields because they are non-nullable fields of nullable parent fields, and if we didn't collect them, checkpoint parsing would break.

Also, at the time, delta-rs did not respect the stats configuration, so we just used the parquet writer properties to collect stats only on what we needed.

rtyler

Suggested some changes, but I would rather see this merged than wait for my nitpicks 😆

(also don't want to block up more testing before Data and AI Summit)

rtyler · 2024-06-05T02:27:49Z

Cargo.toml

+delta_kernel = { version = "0.1" }
+# delta_kernel = { path = "../delta-kernel-rs/kernel" }


Suggested change

-delta_kernel = { version = "0.1" }
-# delta_kernel = { path = "../delta-kernel-rs/kernel" }
+delta_kernel = { version = "0.1", path = "../delta-kernel-rs/kernel" }

These can be combined for ease of development, since cargo allows multiple locations.

rtyler · 2024-06-05T02:28:59Z

crates/core/Cargo.toml

+delta_kernel.workspace = true
+


Suggested change

-delta_kernel.workspace = true
+delta-kernel = { workspace = true }

github-actions bot added the binding/python and binding/rust labels May 9, 2024

ion-elgreco reviewed May 9, 2024


roeap force-pushed the feature/kernelize branch from cd589f9 to 13e3ac8 May 24, 2024 07:37

roeap commented May 24, 2024


roeap force-pushed the feature/kernelize branch from e9fc656 to 813ab1b May 25, 2024 06:13

roeap mentioned this pull request May 26, 2024

feat: implement transaction identifiers - continued #2539

Merged

roeap force-pushed the feature/kernelize branch from 813ab1b to c7acebb May 27, 2024 19:18

alexwilcoxson-rel reviewed May 28, 2024


roeap added 10 commits June 4, 2024 19:30

feat: adopt kernel schema types

ebdf5e7

fix: remove tests upstreamed to kernel

018c1fb

fix: test cleanup

e306bc2

feat: adopt more kernel

7ed93ef

fix: bing back python expresssions

e18c54d

fix: python tests

6ae76f4

fix: update to ScalarData

14e5053

fix: convert tests

f626822

fix: simplify to_array for scalars

a7f70e0

chore: use released kernel

a4b92cd

roeap force-pushed the feature/kernelize branch from 6ccc3b5 to a4b92cd June 4, 2024 17:30

fix: more fixes after rebase

a78d5ba

roeap marked this pull request as ready for review June 4, 2024 18:25

roeap requested review from wjones127, fvaleye and rtyler as code owners June 4, 2024 18:25

rtyler enabled auto-merge (rebase) June 5, 2024 02:26

rtyler approved these changes Jun 5, 2024


rtyler merged commit bc3bdb7 into delta-io:main Jun 5, 2024
22 of 23 checks passed
