Add support for "name" column mapping #205

nicklan · 2024-05-15T22:24:07Z

This only "just works" if you're using Scan::execute or Scan::get_scan_data in conjunction with transform_to_logical.

It does also add enough information into the GlobalScanState for an engine to remap things on its own.

Additionally this does some work to optimize how we set up and execute scans. In particular, it calls the expensive get_state_info only when building the scan, or when calling transform_to_logical (which is assumed to be potentially running on another node/thread from the scan). This required two main changes:

ScanBuilder::build() is now fallible, and does more work.
The ColumnType enum can't have a lifetime on it anymore: we want to store it in the scan, and it would then reference data in the scan, which is just messy in Rust.
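
As a rough illustration of the first change, the sketch below shows a builder whose `build()` resolves physical column names once, up front, and therefore has to return a `Result`. The `Field`, `Scan`, and `ScanBuilder` definitions here are simplified, hypothetical stand-ins and not the kernel's actual types or signatures.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins for the kernel types (assumption, not the real API).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub logical_name: String,
    /// Physical name taken from column mapping metadata, if present.
    pub physical_name: Option<String>,
}

#[derive(Debug)]
pub struct Scan {
    logical_schema: Arc<Vec<Field>>,
    /// Physical column names to read, computed once in `ScanBuilder::build()`.
    read_columns: Vec<String>,
}

impl Scan {
    pub fn logical_schema(&self) -> &Arc<Vec<Field>> {
        &self.logical_schema
    }

    pub fn read_columns(&self) -> &[String] {
        &self.read_columns
    }
}

pub struct ScanBuilder {
    schema: Arc<Vec<Field>>,
    name_mapping_enabled: bool,
}

impl ScanBuilder {
    pub fn new(schema: Arc<Vec<Field>>, name_mapping_enabled: bool) -> Self {
        Self { schema, name_mapping_enabled }
    }

    /// Fallible: resolving physical names can fail, so the error surfaces
    /// at build time instead of on every later use of the scan.
    pub fn build(self) -> Result<Scan, String> {
        let mut read_columns = Vec::with_capacity(self.schema.len());
        for field in self.schema.iter() {
            let name = if self.name_mapping_enabled {
                field.physical_name.clone().ok_or_else(|| {
                    format!("field `{}` has no physical name", field.logical_name)
                })?
            } else {
                field.logical_name.clone()
            };
            read_columns.push(name);
        }
        Ok(Scan { logical_schema: self.schema, read_columns })
    }
}
```

Because `Scan` owns its `read_columns` and shares the schema through an `Arc`, it needs no lifetime parameter, which is the motivation behind the second change.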

Tested by running the column_mapping test from dat

nicklan · 2024-05-16T00:24:28Z

kernel/src/scan/mod.rs

-pub enum ColumnType<'a> {
- Selected(&'a StructField),
- Partition(&'a StructField),
+/// Scan uses this to set up what kinds of columns it is scanning. For Selected and Partition, the


Note: I've included both ways we could do things here. Selected stores an actual copy of the field, which we have to clone() when we do ScanBuilder::build(), but it does simplify things slightly later when we need the field. Partition just stores the index, and then when needed we can index into the schema to get the actual field. This introduces the problem that things could get out of sync (although that would likely indicate bigger issues), but it should be more efficient than cloning the whole field.
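
To make the trade-off concrete, here is a minimal sketch of the two storage strategies side by side. The `StructField` struct and the `resolve` helper are hypothetical simplifications for illustration; only the variant names come from the diff above.

```rust
// Hypothetical, simplified field type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct StructField {
    pub name: String,
    pub nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ColumnType {
    /// Owns a copy of the field: costs a clone at build time, no lookup later.
    Selected(StructField),
    /// Stores only the position of the field in the schema.
    Partition(usize),
}

/// Resolves a column back to its field. The index-based variant can fail if
/// the schema and the stored index have gotten out of sync.
pub fn resolve<'a>(
    col: &'a ColumnType,
    schema: &'a [StructField],
) -> Result<&'a StructField, String> {
    match col {
        ColumnType::Selected(field) => Ok(field),
        ColumnType::Partition(idx) => schema.get(*idx).ok_or_else(|| {
            format!(
                "partition index {idx} out of bounds for schema with {} fields",
                schema.len()
            )
        }),
    }
}
```

Neither variant borrows from the schema, so the enum carries no lifetime; the price for the index variant is the extra error path in `resolve`.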

roeap

looking good.

left a few minor questions, but nothing major :).

acceptance/src/data.rs

roeap · 2024-05-16T05:21:21Z

kernel/src/column_mapping.rs

+pub(crate) fn get_name_mapped_physical_field(
+ logical_field: &StructField,
+) -> DeltaResult<(StructField, &str)> {
+ match logical_field.metadata.get(ColumnMetadataKey::ColumnMappingPhysicalName.as_ref()) {
+ Some(val) => match val {
+ MetadataValue::Number(_) => {
+ Err(Error::generic("{ColumnMetadataKey::ColumnMappingPhysicalName} must be a string in name mapping mode"))
+ }
+ MetadataValue::String(name) => {
+ Ok((
+ StructField::new(name, logical_field.data_type().clone(), logical_field.is_nullable()),
+ name
+ ))
+ }
+ }
+ None => {
+ Err(Error::generic("fields MUST have a {ColumnMetadataKey::ColumnMappingPhysicalName} key in their metadata in name mapping mode"))
+ }
+ }
+}


In delta-rs we just extended the StructField with get_physical_name. I don't have too strong a feeling about this, since it's only internal, but personally I find it a little bit more convenient to work with.

"more convenient" as in field.get_physical_name() rather than get_name_mapped_physical_field(field)? I would tend to agree with that.

Also, based on my experience debugging column mapping issues in delta-spark -- we should strongly consider making logical vs. physical be a first class concept everywhere in kernel. At any point when we have a field name (or schema), it should be immediately clear from the context whether that name/schema is logical or physical. In case column mapping is disabled, logical and physical name are the same. But almost all code shouldn't have to care about that corner case and should instead assume column mapping is enabled.

Otherwise, we risk trying to map a physical name, or failing to map a logical name.

Basically, the column mapping function would take the current column mapping mode as input, and then would use that to require that the appropriate metadata is available:

impl StructField {
    // Returns the physical name of this field, based on the column mapping mode in effect.
    fn physical_name(&self, mode: ColumnMappingMode) -> DeltaResult<&str> {
        let name = self.metadata.get(ColumnMetadataKey::ColumnMappingPhysicalName.as_ref());
        match (mode, name) {
            (ColumnMappingMode::None, None) => Ok(self.name.as_str()),
            (ColumnMappingMode::Name, Some(MetadataValue::String(name))) => Ok(name.as_str()),
            _ => Err(...),
        }
    }
}

Yep, I've done mostly that.

Re the logical/physical naming comment, makes a lot of sense, and I've converted a lot of stuff to be explicit here. I'll take another pass to see what else could be renamed though.
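
For readers following along, a self-contained version of the mode-aware lookup discussed in this thread might look like the sketch below. All types here are simplified stand-ins, the error type is a plain `String` rather than the kernel's error type, and the match arms mirror the reviewer's outline rather than the code that was eventually merged.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the kernel types (assumptions for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnMappingMode {
    None,
    Name,
}

#[derive(Debug, Clone, PartialEq)]
pub enum MetadataValue {
    Number(i64),
    String(String),
}

// Metadata key defined by the Delta protocol for the physical column name.
pub const PHYSICAL_NAME_KEY: &str = "delta.columnMapping.physicalName";

pub struct StructField {
    pub name: String,
    pub metadata: HashMap<String, MetadataValue>,
}

impl StructField {
    /// Returns the physical name of this field for the given mapping mode.
    pub fn physical_name(&self, mode: ColumnMappingMode) -> Result<&str, String> {
        let physical = self.metadata.get(PHYSICAL_NAME_KEY);
        match (mode, physical) {
            // No mapping and no metadata: logical and physical names coincide.
            (ColumnMappingMode::None, None) => Ok(self.name.as_str()),
            // Name mapping: the metadata must carry a string-valued physical name.
            (ColumnMappingMode::Name, Some(MetadataValue::String(name))) => Ok(name.as_str()),
            // Every other combination is treated as an error in this sketch.
            (mode, other) => Err(format!(
                "unexpected physical name metadata {other:?} for field `{}` in mode {mode:?}",
                self.name
            )),
        }
    }
}
```

Callers then always ask a field for its physical name and never need to branch on whether column mapping is enabled.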

kernel/src/scan/mod.rs

scovich · 2024-05-20T16:43:26Z

kernel/src/column_mapping.rs

+// key to look in metadata.configuration for to get column mapping mode
+pub(crate) const COLUMN_MAPPING_MODE_KEY: &str = "delta.columnMapping.mode";
+
+impl TryFrom<&str> for ColumnMappingMode {


Suggested change

-impl TryFrom<&str> for ColumnMappingMode {
+impl<T: AsRef<str>> TryFrom<T> for ColumnMappingMode {

and then below

fn try_from(mode: T) -> DeltaResult<Self> { match mode.as_ref() {

(allows callers to pass a variety of string-like things more easily)

Yep, I wanted to do that, but you run into rust-lang/rust#50133
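
For context, the generic impl collides with the standard library's blanket `impl<T, U> TryFrom<U> for T where U: Into<T>`, so the compiler rejects it as a conflicting implementation. One possible workaround, sketched here with a hypothetical `parse_mode` helper and a plain `String` error type (neither is from the PR), is to keep the concrete `TryFrom<&str>` impl and put the `AsRef<str>` flexibility in a free function.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnMappingMode {
    None,
    Id,
    Name,
}

// The concrete impl compiles fine; only the fully generic one conflicts.
impl TryFrom<&str> for ColumnMappingMode {
    type Error = String;

    fn try_from(s: &str) -> Result<Self, Self::Error> {
        match s {
            "none" => Ok(Self::None),
            "id" => Ok(Self::Id),
            "name" => Ok(Self::Name),
            other => Err(format!("invalid column mapping mode: {other}")),
        }
    }
}

/// Hypothetical helper: accepts `&str`, `String`, `&String`, and other
/// string-like values, then delegates to the concrete impl.
pub fn parse_mode(s: impl AsRef<str>) -> Result<ColumnMappingMode, String> {
    ColumnMappingMode::try_from(s.as_ref())
}
```

With this shape, `parse_mode("name")` and `parse_mode(String::from("id"))` both work, while `ColumnMappingMode::try_from` stays available for `&str` callers.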

kernel/src/snapshot.rs

roeap

👍

kernel/src/scan/mod.rs

scovich · 2024-05-22T14:22:32Z

kernel/src/scan/mod.rs

+ // A column, selected from the data, as is
+ Selected(String),
+ // A partition column that needs to be added back in
+ Partition(usize),


Now that column mapping is no longer exposed here, can Partition be a reference again?
Then there's nothing to get out of sync any more and we can revert the corresponding error checking.
(related to #205 (comment))

See https://github.com/delta-incubator/delta-kernel-rs/pull/205/files#r1610345479

kernel/src/scan/mod.rs

scovich · 2024-05-22T15:28:39Z

kernel/src/scan/mod.rs

+/// to materialize the partition column.
+pub enum ColumnType {
+ // A column, selected from the data, as is
+ Selected(String),


If I'm not mistaken, this could be &'a str since it's either a reference to the field's name, or to an entry from the field's metadata map? That way, we could avoid creating a String unless partitions and/or CMM are actually involved.

I still keep a Vec<ColumnType> in Scan (the all_fields field). This is to avoid having to recompute everything each time.

If we make this a reference with a lifetime, then Scan has to have a lifetime, and then everything gets... messy.

I'd actually argue for going the "other way" if we want to remove string allocation, and just have this also store an index in the schema.

Ah, I forgot that you hoisted all_fields to be a field computed previously, where it used to be converted internally...

However -- I don't think we can "just" store a schema index for the column name, when it could be name-mapped or not?

kernel/src/scan/mod.rs

kernel/src/schema.rs

scovich · 2024-05-22T15:43:56Z

kernel/src/schema.rs

+ }
+ MetadataValue::String(name) => Ok(name),
+ }
+ (ColumnMappingMode::Id, _) => Err(Error::generic("Don't support id column mapping yet")),


We should think about how to eventually support id mapping mode, because it's likely to be more invasive than name mapping, and name mapping is expressible in terms of id mapping. Maybe we can get away with a single implementation?

Actually, we can't get away with a single implementation because (a) field ids are really a problem for the parquet reader to deal with; and (b) existing tables converted to column mapping mode have parquet files whose schema lacks field ids (which is one reason why the parquet reader has to handle field ids).

But by the same token... if the parquet reader is handling field ids, then maybe id mapping mode isn't so invasive after all. Could kernel "implement" id mapping mode by just verifying, at snapshot creation time, that the table schema contains field ids iff the mode is enabled, so the parquet reader doesn't get confused?

Yeah. My feeling was that id mapping will require some changes to either the api, or at least the api contract, that we expose for parquet reading.

Regardless, I've heard name mapping is more common and more important to support, so merging this and then looking at adding id mapping seems to make sense to me. Hopefully it won't require big changes to any of this code, and will mostly just be adding new logic in the places it matters.
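
The snapshot-time check floated in this thread could be sketched roughly as follows. This is not code from the PR: the `Field` struct and `validate_field_ids` function are hypothetical, and leaving name mode unconstrained is an assumption made for this sketch, since the thread only talks about id mode being enabled or not.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnMappingMode {
    None,
    Id,
    Name,
}

// Hypothetical, simplified field: just a name and an optional field id.
pub struct Field {
    pub name: String,
    pub field_id: Option<i64>,
}

/// Checks that field ids are present exactly when the mode calls for them.
/// Assumption of this sketch: `Name` mode is left unconstrained.
pub fn validate_field_ids(mode: ColumnMappingMode, fields: &[Field]) -> Result<(), String> {
    for field in fields {
        match (mode, field.field_id) {
            (ColumnMappingMode::Id, None) => {
                return Err(format!("field `{}` is missing a field id in id mode", field.name));
            }
            (ColumnMappingMode::None, Some(id)) => {
                return Err(format!(
                    "field `{}` has field id {id} but column mapping is disabled",
                    field.name
                ));
            }
            _ => {}
        }
    }
    Ok(())
}
```

Running such a check once at snapshot creation would let the parquet reader assume a consistent schema afterwards.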

scovich · 2024-05-22T19:10:12Z

kernel/src/scan/mod.rs

 .schema
 .unwrap_or_else(|| self.snapshot.schema().clone().into());
- Scan {
+ let (all_fields, read_fields, have_partition_cols) = get_state_info(


aside: This takes a lot of self.snapshot args... should it be Snapshot::get_scan_state_info or similar?

maybe? is 2 a lot? :)

Snapshot::get_scan_state_info would need to take a logical schema, so I'm not clear it's a win

Yeah, you're probably right. I was just thinking out loud

scovich · 2024-05-22T19:12:44Z

kernel/src/scan/mod.rs

+ logical_schema: self.logical_schema.as_ref().clone(),
+ read_schema: self.physical_schema.as_ref().clone(),


aside: Is there a particular reason GlobalScanState takes Schema instead of SchemaRef?
Why force a clone like this?

Yeah, the idea is that GlobalScanState should be (de)serializable so it would be easy for multi-node systems to send between nodes. Schema is serializable and SchemaRef isn't, so storing Schema requires less code.

We could add a custom De/Serialize for it and just go into the arc and serialize the inner, but it wasn't clear to me that complexity was worth it (yet...)

Serialization is an engine design problem, not a kernel problem. At most kernel might provide a default serializer mechanism -- but even then I don't know what we'd provide, given the wide variety of ways engines handle serialization? And anyway this is useless for FFI because we don't even expose our own schema type to extern engines.

Meanwhile, large schemas (10k+ columns) could be pretty expensive to clone?

(we want to be friendly to distributed execution, but making single-thread clients pay extra overhead for a potential future use case seems cart-before-horse?)

We would need to provide a de/serialize_global_state FFI call, yes. If we want to let engines handle serialization all by themselves, we'll need to have more complex APIs to allow reconstruction of state that the kernel understands. That's possible, but it's something we'll need to design.

ideally we'll sort most of this out with the "single expression" mode of fixup.

I've added #216 to at least get the clones gone, and then we can iterate further.
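
The cost difference under discussion can be illustrated with the standard library alone. In this sketch `Schema` is a hypothetical stand-in holding only column names, and `OwnedState` and `SharedState` are made-up names for the two designs; cloning the `Arc` bumps a reference count, while cloning the inner value copies every field.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the kernel's schema type.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

pub type SchemaRef = Arc<Schema>;

/// State that owns its schema: building it requires a deep copy.
pub struct OwnedState {
    pub logical_schema: Schema,
}

/// State that shares the schema: building it only bumps a reference count.
pub struct SharedState {
    pub logical_schema: SchemaRef,
}

pub fn owned_state(schema: &SchemaRef) -> OwnedState {
    // Copies every field, which gets expensive for very wide schemas.
    OwnedState { logical_schema: schema.as_ref().clone() }
}

pub fn shared_state(schema: &SchemaRef) -> SharedState {
    // Cheap: no field data is copied.
    SharedState { logical_schema: Arc::clone(schema) }
}
```

The shared variant points at the same allocation as the original schema, which is what makes it cheap for single-threaded clients; the serialization question raised above is a separate concern.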

nicklan added 10 commits April 3, 2024 15:16

checkpoint

ce4e4d1

move col mapping stuff into own module

de2aa36

fix for rebase

19459aa

Merge branch 'main' into column-mapping

1c3efbb

make name column mapping work

8bc26bf

if using `Scan::execute` or `transform_to_logical`

test works

1284a47

slightly more clever read schema building

0391d09

ScanBuilder::build() is fallible

9879eee

some final build() -> build()?, and move column-mapping into Scan

8835fc3

not so much state getting

c5a6ae3

nicklan commented May 16, 2024

nicklan requested review from zachschuermann, roeap, tdas and scovich and removed request for zachschuermann May 16, 2024 00:28

roeap reviewed May 16, 2024

nicklan added 2 commits May 16, 2024 09:53

Merge branch 'main' into column-mapping

601df22

same init style

34d4d41

This was referenced May 16, 2024

Add a scan_builder() method to Snapshot #206

Closed

don't fully enforce matching schema #210

Merged

scovich reviewed May 20, 2024

nicklan added 2 commits May 20, 2024 16:01

put physical_name on the StructField

a5e495f

cleanup

5e89463

nicklan force-pushed the column-mapping branch from bec80c0 to 5e89463 Compare May 20, 2024 23:13

nicklan and others added 4 commits May 20, 2024 16:17

comment fixup

027714e

more comment fixup

3e8c1e7

get rid of unwraps

6a17d2e

Update kernel/src/snapshot.rs

b239d7a

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>

suggestion fixup

e514c7b

nicklan requested review from roeap and scovich May 20, 2024 23:31

nicklan added 2 commits May 21, 2024 11:06

more logical name

af52294

Merge branch 'main' into column-mapping

d46b6de

roeap approved these changes May 22, 2024

scovich reviewed May 22, 2024

nicklan added 2 commits May 22, 2024 10:06

pass by value + with_name for StructField

b808214

fix err

3b6d0ce

nicklan requested a review from scovich May 22, 2024 17:41

nicklan added 3 commits May 22, 2024 10:49

make snapshot have the mapping mode

d18c624

Merge branch 'main' into column-mapping

953724d

Merge branch 'main' into column-mapping

3a6c072

scovich approved these changes May 22, 2024

nicklan merged commit c965697 into delta-incubator:main May 22, 2024
9 checks passed

jtanx mentioned this pull request Jul 14, 2024

Does not work with column mapping duckdb/duckdb_delta#50

Open
