DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached #8476

tv42 · 2023-12-09T02:27:39Z

Describe the bug

Dataframe::cache gives an error where an execution that doesn't first cache results succeeds.

I would have expected caching to have no effect on success/failure.

To Reproduce

use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let sql = "SELECT CASE WHEN true THEN NULL ELSE 1 END;";
    let ctx = SessionContext::new();
    let plan = ctx.state().create_logical_plan(sql).await?;
    let df = ctx.execute_logical_plan(plan).await?;
    // Comment out the next line to make the error go away.
    let df = df.cache().await?;
    let batches = df.collect().await?;
    let display = datafusion::arrow::util::pretty::pretty_format_batches(&batches).unwrap();
    println!("{}", display);
    Ok(())
}

Expected behavior

Behavior with and without let df = df.cache().await? to be functionally same, only changing performance and memory use.

Additional context

No response

The text was updated successfully, but these errors were encountered:

Asura7969 · 2023-12-11T06:56:32Z

The reason is that schema comparison：
https://github.com/apache/arrow-datafusion/blob/d091b55be6a4ce552023ef162b5d081136d3ff6d/datafusion/core/src/datasource/memory.rs#L68

schema：

Field { name: "CASE WHEN Boolean(true) THEN NULL ELSE Int64(1) END", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }

batches_schema：

Field { name: "CASE WHEN Boolean(true) THEN NULL ELSE Int64(1) END", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }

data_type：Null & Int64

Can we use the optimized schema to compare，like：
before：

    pub async fn cache(self) -> Result<DataFrame> {
        let context = SessionContext::new_with_state(self.session_state.clone());
        let mem_table = MemTable::try_new(
            SchemaRef::from(self.schema().clone()),
            self.collect_partitioned().await?,
        )?;

        context.read_table(Arc::new(mem_table))
    }

after:

    pub async fn cache(self) -> Result<DataFrame> {
        let context = SessionContext::new_with_state(self.session_state.clone());
        let physical_plan = self.create_physical_plan().await?;
        let mem_table = MemTable::try_new(
            physical_plan.schema(),
            self.collect_partitioned().await?,
        )?;

        context.read_table(Arc::new(mem_table))
    }

alamb · 2023-12-11T14:31:30Z

In general, DataType::Null is normally resolved (via Coercion) to an actual type of the target schema

Maybe we could apply the same approach here

tv42 added the bug Something isn't working label Dec 9, 2023

tv42 changed the title ~~DataFrame::cache errors with Plan("Mismatch between schema and batches") but work when not cached~~ DataFrame::cache errors with Plan("Mismatch between schema and batches") but query works when not cached Dec 9, 2023

Asura7969 mentioned this issue Dec 12, 2023

Fix DataFrame::cache errors with Plan("Mismatch between schema and batches") #8510

Merged

alamb closed this as completed in #8510 Dec 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached #8476

DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached #8476

tv42 commented Dec 9, 2023

Asura7969 commented Dec 11, 2023

alamb commented Dec 11, 2023

DataFrame::cache errors with Plan("Mismatch between schema and batches") but query works when not cached #8476

DataFrame::cache errors with Plan("Mismatch between schema and batches") but query works when not cached #8476

Comments

tv42 commented Dec 9, 2023

Describe the bug

To Reproduce

Expected behavior

Additional context

Asura7969 commented Dec 11, 2023

alamb commented Dec 11, 2023

DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached #8476

DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached #8476