ListingTable cannot handle partition evolution #13270

@adriangb

Describe the bug

With CSV:

printf 'a,b\n1,2\n' > data1.csv
mkdir a=2
printf 'b\n3\n' > a=2/data2.csv
datafusion-cli
> SELECT * FROM '**/*.csv';
Arrow error: Csv error: incorrect number of fields for line 1, expected 2 got 1

With Parquet:

import os
import polars as pl

pl.DataFrame({'a': [1], 'b': [2]}).write_parquet('data1.parquet')
os.mkdir('a=2')
pl.DataFrame({'b': [3]}).write_parquet('a=2/data2.parquet')
datafusion-cli
> SELECT * FROM '**/*.parquet';
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 |   |
+---+---+
2 row(s) fetched.
Elapsed 0.055 seconds.

To Reproduce

No response

Expected behavior

Partition evolution is handled, and both cases return:

+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 | 2 |
+---+---+
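
In the expected result, the second row's value for a would come from the a=2 directory name rather than from the file. A minimal sketch of pulling Hive-style key=value segments out of a relative path, in plain Python; `partition_values_from_path` is a hypothetical helper for illustration, not a DataFusion API:

```python
from pathlib import PurePosixPath


def partition_values_from_path(path: str) -> dict[str, str]:
    """Collect Hive-style `key=value` directory segments from a relative path.

    Hypothetical helper for illustration only; not part of DataFusion.
    """
    values = {}
    # Only directory components can carry partition values, so skip the file name.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


print(partition_values_from_path("data1.csv"))      # {}
print(partition_values_from_path("a=2/data2.csv"))  # {'a': '2'}
```

Values come back as strings here; a real implementation would still need to cast them to the column's type.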

Additional context

Having played around quite a bit with ParquetExec and the SchemaAdapter machinery, I think what should happen is:

  • Partition values are tracked on a per-file basis, i.e. on each PartitionedFile and not on the FileScanConfig
  • Partition values are passed into the SchemaAdapter machinery, which decides for each file whether it needs to add a column generated from partition values
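
The two points above can be sketched in plain Python. This only illustrates the proposed per-file decision and is not DataFusion's SchemaAdapter API: `adapt_file_rows` and its parameters are hypothetical, and rows are modeled as dictionaries rather than Arrow batches.

```python
def adapt_file_rows(rows, file_columns, partition_values, table_columns):
    """Map one file's rows onto the table schema (hypothetical sketch).

    For each table column, in order of preference:
      1. use the value stored in the file, if the file has that column;
      2. otherwise use this file's partition value, if there is one;
      3. otherwise fill with None (null).
    """
    adapted = []
    for row in rows:
        out = {}
        for col in table_columns:
            if col in file_columns:
                out[col] = row[col]
            elif col in partition_values:
                out[col] = partition_values[col]
            else:
                out[col] = None
        adapted.append(out)
    return adapted


table_columns = ["b", "a"]

# data1 stores `a` inside the file and has no partition directory.
first = adapt_file_rows([{"a": 1, "b": 2}], {"a", "b"}, {}, table_columns)
# a=2/data2 stores only `b`; `a` comes from that file's partition values
# (assumed here to be already cast to an integer).
second = adapt_file_rows([{"b": 3}], {"b"}, {"a": 2}, table_columns)

print(first + second)  # [{'b': 2, 'a': 1}, {'b': 3, 'a': 2}]
```

Because the choice is made per file, the same column can be read from data in one file and derived from the path in another, which is what the expected output requires.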

Labels: bug
