Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does not work with column mapping #50

Open
jtanx opened this issue Jul 12, 2024 · 8 comments
Open

Does not work with column mapping #50

jtanx opened this issue Jul 12, 2024 · 8 comments

Comments

@jtanx
Copy link

jtanx commented Jul 12, 2024

I don't actually know if this is a bug with this extension or in delta-kernel-rs (or maybe I'm doing something wrong?)

Test table created with pyspark:

import pyspark
from delta import *
import os


builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql("""CREATE TABLE delta.`/tmp/delta_table` USING DELTA AS SELECT col1 as id FROM VALUES 0,1,2,3,4;""")
spark.sql("""
ALTER TABLE delta.`/tmp/delta_table` SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
)
""")

spark.sql("""
ALTER TABLE delta.`/tmp/delta_table` RENAME COLUMN id TO id2
""")

Querying with duckdb

import duckdb
con = duckdb.connect()
print(con.query("select * from delta_scan('/tmp/delta_table')"))

returns

┌───────┐
│  id2  │
│ int32 │
├───────┤
│  NULL │
│  NULL │
│  NULL │
│  NULL │
│  NULL │
└───────┘

i.e. returns all nulls instead of the expected values

@jtanx
Copy link
Author

jtanx commented Jul 14, 2024

From my brief look, it seems like this is missing the column mapping:

// FIXME: this override is pretty hacky: for missing columns we just insert NULL constants
auto &global_type = global_types[global_id];
Value val (global_type);
reader_data.constant_map.push_back({i, val});
continue;

(see also: delta-incubator/delta-kernel-rs#205)

I don't know if there's a better way, but the struct schema visitor also does not pass the mapping through:
https://github.com/delta-incubator/delta-kernel-rs/blob/d19b125519a5834377e6442f1ed73bdb344d0101/ffi/src/lib.rs#L865

(I think you can do field.physical_name(column_mapping_type))

@samansmink
Copy link
Collaborator

Thanks for reporting @jtanx! I will do some debugging and forward the bug report to the delta-kernel-rs repo if necessary

@jsum
Copy link

jsum commented Aug 8, 2024

Hello, I am facing the same issue with delta_scan as well. If I use read_parquet, I get some random column names like "col-xxxxx" instead of actual columns names when the 'delta.columnMapping.mode' is set to 'name'. Is there a work around for this. Thanks for you help in advance.

image

@aersam
Copy link

aersam commented Aug 8, 2024

Hello, I am facing the same issue with delta_scan as well. If I use read_parquet, I get some random column names like "col-xxxxx" instead of actual columns names when the 'delta.columnMapping.mode' is set to 'name'. Is there a work around for this. Thanks for you help in advance.

image

Well, this is by design. You could use our deltalake2db package as a workaround 😉

@samansmink
Copy link
Collaborator

We are still working on implementing column mapping, that currently does not work yet with duckdb_delta

@jsum
Copy link

jsum commented Aug 8, 2024

Thanks @samansmink and @aersam.

@kmatt
Copy link

kmatt commented Aug 21, 2024

We are still working on implementing column mapping, that currently does not work yet with duckdb_delta

Is the root issue with delta-kernel-rs or DuckDB? Wondering if this is a solution, or changes to DuckDB will be need to incorporate it: delta-incubator/delta-kernel-rs#205

Similar to #11

@samansmink
Copy link
Collaborator

samansmink commented Aug 26, 2024

@kmatt The pr you link is currently not yet exposed in the ffi in a way DuckDB can use it. This is something we are working on though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants