Z Order can't find column with a capitalised character #1586

Closed
mrfuxi opened this issue Aug 7, 2023 · 6 comments · Fixed by #1982
Labels: binding/python (Issues for the Python package), bug (Something isn't working)

Comments

@mrfuxi

mrfuxi commented Aug 7, 2023

Environment

Delta-rs version: 0.10.1

Binding: python

Environment:

  • Cloud provider: local
  • OS: Linux Fedora fc38
  • Other: Python 3.8, pyarrow==12.0.1, arrow==1.2.3

Bug

What happened:
Running Z-order optimization on a table that has a column name containing an uppercase character raises an exception, even when the column being Z-ordered is all lowercase.

Error

Traceback (most recent call last):
  File "./case.py", line 16, in <module>
    main()
  File "./case.py", line 12, in main
    dt.optimize.z_order(columns=["col_a"])
  File "/.../python3.8/site-packages/deltalake/table.py", line 694, in z_order
    metrics = self.table._table.z_order_optimize(
_internal.DeltaError: Generic DeltaTable error: Schema error: No field named col_b. Valid fields are "?table?".col_a, "?table?"."col_B", __zorder_key.

What you expected to happen:
Z-order optimization completes successfully.

How to reproduce it:

from deltalake import write_deltalake, DeltaTable
import pandas as pd


def main():
    path = "./test_caps"

    df = pd.DataFrame({"col_a": [1, 2, 3], "col_B": ["a", "b", "c"]})
    write_deltalake(path, df, mode="overwrite")

    dt = DeltaTable(path)
    dt.optimize.z_order(columns=["col_a"])


if __name__ == "__main__":
    main()
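
As a quick way to tell whether a table is likely to hit this on the affected version, one could check for column names that contain uppercase characters before calling z_order. This is only a sketch: `columns_with_uppercase` is a hypothetical helper, not part of deltalake, and it works on a plain list of names such as the DataFrame columns in the script above.

```python
from typing import Iterable, List


def columns_with_uppercase(names: Iterable[str]) -> List[str]:
    """Return the column names that change when lowercased.

    Hypothetical helper for illustration; it is not part of deltalake.
    """
    return [name for name in names if name != name.lower()]


# Column names taken from the reproduction script.
affected = columns_with_uppercase(["col_a", "col_B"])
print(affected)
```

In the reproduction only `col_B` is flagged, even though `col_a` is the column passed to z_order. That is consistent with the error message, which complains about `col_b` rather than the Z-order column.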

More details:

@mrfuxi mrfuxi added the bug Something isn't working label Aug 7, 2023
@wjones127
Collaborator

I suspect this boils down to DataFusion not being case sensitive. Will need more investigation though.
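
To illustrate the suspicion, here is a simplified model of SQL-style identifier handling in which unquoted identifiers are folded to lowercase and double-quoted ones keep their case. It is a sketch under that assumption, not DataFusion's actual code, and `normalize_ident` and `resolve` are hypothetical names.

```python
from typing import List


def normalize_ident(ident: str) -> str:
    """Model SQL identifier normalization (simplified assumption).

    Double-quoted identifiers keep their case; unquoted ones are lowercased.
    """
    if len(ident) >= 2 and ident.startswith('"') and ident.endswith('"'):
        # Strip the surrounding quotes and undo doubled embedded quotes.
        return ident[1:-1].replace('""', '"')
    return ident.lower()


def resolve(ident: str, schema: List[str]) -> str:
    """Look up a normalized identifier with an exact, case-sensitive match."""
    name = normalize_ident(ident)
    if name not in schema:
        raise KeyError(f"No field named {name}. Valid fields are {schema}")
    return name


schema = ["col_a", "col_B"]
print(resolve("col_a", schema))    # unquoted and already lowercase
print(resolve('"col_B"', schema))  # quoted, so the case is preserved
try:
    resolve("col_B", schema)       # unquoted, so it is folded to col_b
except KeyError as err:
    print(err)
```

Under this model an unquoted `col_B` is looked up as `col_b`, which does not exist in the schema, matching the shape of the error in the report.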

@watfordkcf
Contributor

watfordkcf commented Aug 17, 2023

Are columns being quoted somehow? I know I've had issues with quoted columns being case sensitive in datafusion.
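
On the quoting question: if column names end up going through a SQL parser, one way to preserve their case would be to wrap each name in double quotes and double any embedded quote characters. A minimal sketch follows, with `quote_ident` as a hypothetical helper. Whether the bindings quote names this way, or should, is not established in this thread.

```python
def quote_ident(name: str) -> str:
    """Wrap a column name in double quotes, doubling embedded quotes.

    Hypothetical helper illustrating standard SQL identifier quoting.
    """
    return '"' + name.replace('"', '""') + '"'


for column in ["col_a", "col_B", 'odd"name']:
    print(quote_ident(column))
```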

EDIT: I can reproduce the failure from Python every time, but not when I run effectively the same code in Rust.

➜ RUST_LOG=debug RUST_BACKTRACE=full python delta_bug.py
[2023-08-17T22:51:22Z DEBUG deltalake::action] loading checkpoint from _delta_log/_last_checkpoint
[2023-08-17T22:51:22Z DEBUG deltalake::delta] update without checkpoint
[2023-08-17T22:51:22Z DEBUG deltalake::delta] incremental update with version(-1) and max_version(None)
[2023-08-17T22:51:22Z DEBUG deltalake::delta] parsing commit with version 0...
[2023-08-17T22:51:22Z DEBUG deltalake::delta] merging table state with version: 0
[2023-08-17T22:51:22Z DEBUG deltalake::operations::optimize] Rewriting files in partition: PartitionTuples([])
[2023-08-17T22:51:22Z DEBUG sqlparser::parser] Parsing sql '?table?'...
[2023-08-17T22:51:22Z DEBUG sqlparser::parser] Parsing sql 'col_a'...
[2023-08-17T22:51:22Z DEBUG sqlparser::parser] Parsing sql 'col_B'...
[2023-08-17T22:51:22Z DEBUG sqlparser::parser] Parsing sql 'col_a'...
[2023-08-17T22:51:22Z DEBUG sqlparser::parser] Parsing sql '__zorder_key'...
Traceback (most recent call last):
  File "/Users/watford/scratch/python/delta_bug.py", line 16, in <module>
    main()
  File "/Users/watford/scratch/python/delta_bug.py", line 12, in main
    dt.optimize.z_order(columns=["col_a"])
  File "/Users/watford/scratch/python/.venv/lib/python3.10/site-packages/deltalake/table.py", line 694, in z_order
    metrics = self.table._table.z_order_optimize(
_internal.DeltaError: Generic DeltaTable error: Schema error: No field named col_b. Valid fields are "?table?".col_a, "?table?"."col_B", __zorder_key.

This code works just fine:

    #[cfg(all(feature = "datafusion", feature = "arrow"))]
    #[tokio::test]
    async fn test_optimize_casing() {
        use std::future::IntoFuture;

        let mut dt = DeltaTableBuilder::from_uri("/Users/watford/scratch/python/test_caps")
            .build()
            .unwrap();

        dt.load().await.unwrap();

        let schema = dt.get_schema().unwrap();
        let fields = schema
            .get_fields()
            .iter()
            .map(|f| String::from(f.get_name()))
            .collect::<Vec<String>>()
            .join(",");
        assert_eq!("col_a,col_B", fields);

        let v = Vec::<PartitionFilter<&str>>::new();
        let cmd = crate::operations::optimize::OptimizeBuilder::new(
            dt.object_store(),
            dt.get_state().clone(),
        )
        .with_max_concurrent_tasks(4)
        .with_type(crate::operations::optimize::OptimizeType::ZOrder(vec![
            String::from("col_a"),
        ]))
        .with_filters(&v);

        let (_dtx, metrics) = cmd.into_future().await.unwrap();

        dt.update_incremental(None).await.unwrap();
    }

And this code from inside the python bindings works just fine:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_optimize_casing() {
        // approximate the steps taken by the python code
        let mut dt = RawDeltaTable::new("/Users/watford/scratch/python/test_caps", None, None, false).unwrap();
        let _metadata = dt.metadata().unwrap();
        dt.z_order_optimize(vec![String::from("col_a")], None, None, None).unwrap();
        dt.update_incremental().unwrap();
    }
}

@Trebla2010

I'm having issues with z_order as well.
dt.schema()
Schema([Field(FieldA, PrimitiveType("string"), nullable=True), Field(FieldB, PrimitiveType("string"), nullable=True), Field(Value, PrimitiveType("decimal(12,4)"), nullable=True), Field(Date, PrimitiveType("date"), nullable=True)])


DeltaError Traceback (most recent call last)
Cell In[144], line 1
----> 1 dt.optimize.z_order(["Date"])
DeltaError: Generic DeltaTable error: Schema error: No field named date. Valid fields are "?table?"."FieldA", "?table?"."FieldB", "?table?"."Value", "?table?"."Date".

@rtyler rtyler added the binding/python Issues for the Python package label Sep 29, 2023
@sachin-snt

sachin-snt commented Oct 18, 2023

@wjones127 We are also seeing the same issue, but while updating the Delta table.

@shumaev99

> @wjones127 We are also seeing the same issue but while Updating the delta table

Same here.

@djouallah

same here :(

roeap pushed a commit that referenced this issue Dec 19, 2023
# Description
Enable usage of z-order optimization on columns that have
capitalization.

# Related Issue(s)
- closes #1586

8 participants