Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame::cache errors with Plan("Mismatch between schema and batches") but query works when not cached #8476

Closed
tv42 opened this issue Dec 9, 2023 · 2 comments · Fixed by #8510
Labels
bug Something isn't working

Comments

@tv42
Copy link
Contributor

tv42 commented Dec 9, 2023

Describe the bug

Dataframe::cache gives an error where an execution that doesn't first cache results succeeds.

I would have expected caching to have no effect on success/failure.

To Reproduce

use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let sql = "SELECT CASE WHEN true THEN NULL ELSE 1 END;";
    let ctx = SessionContext::new();
    let plan = ctx.state().create_logical_plan(sql).await?;
    let df = ctx.execute_logical_plan(plan).await?;
    // Comment out the next line to make the error go away.
    let df = df.cache().await?;
    let batches = df.collect().await?;
    let display = datafusion::arrow::util::pretty::pretty_format_batches(&batches).unwrap();
    println!("{}", display);
    Ok(())
}

Expected behavior

Behavior with and without let df = df.cache().await? to be functionally same, only changing performance and memory use.

Additional context

No response

@tv42 tv42 added the bug Something isn't working label Dec 9, 2023
@tv42 tv42 changed the title DataFrame::cache errors with Plan("Mismatch between schema and batches") but work when not cached DataFrame::cache errors with Plan("Mismatch between schema and batches") but query works when not cached Dec 9, 2023
@Asura7969
Copy link
Contributor

The reason is that schema comparison:
https://github.com/apache/arrow-datafusion/blob/d091b55be6a4ce552023ef162b5d081136d3ff6d/datafusion/core/src/datasource/memory.rs#L68

schema:

Field { name: "CASE WHEN Boolean(true) THEN NULL ELSE Int64(1) END", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }

batches_schema:

Field { name: "CASE WHEN Boolean(true) THEN NULL ELSE Int64(1) END", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }

data_type:Null & Int64

Can we use the optimized schema to compare,like:
before:

    pub async fn cache(self) -> Result<DataFrame> {
        let context = SessionContext::new_with_state(self.session_state.clone());
        let mem_table = MemTable::try_new(
            SchemaRef::from(self.schema().clone()),
            self.collect_partitioned().await?,
        )?;

        context.read_table(Arc::new(mem_table))
    }

after:

    pub async fn cache(self) -> Result<DataFrame> {
        let context = SessionContext::new_with_state(self.session_state.clone());
        let physical_plan = self.create_physical_plan().await?;
        let mem_table = MemTable::try_new(
            physical_plan.schema(),
            self.collect_partitioned().await?,
        )?;

        context.read_table(Arc::new(mem_table))
    }

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

In general, DataType::Null is normally resolved (via Coercion) to an actual type of the target schema

Maybe we could apply the same approach here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants