Open
Labels
bug: Something isn't working
Description
Describe the bug
With CSV:
printf 'a,b\n1,2\n' > data1.csv
mkdir a=2
printf 'b\n3\n' > a=2/data2.csv
datafusion-cli
> SELECT * FROM '**/*.csv';
Arrow error: Csv error: incorrect number of fields for line 1, expected 2 got 1
With Parquet:
import os
import polars as pl
pl.DataFrame({'a': [1], 'b': [2]}).write_parquet('data1.parquet')
os.mkdir('a=2')
pl.DataFrame({'b': [3]}).write_parquet('a=2/data2.parquet')
datafusion-cli
> SELECT * FROM '**/*.parquet';
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 |   |
+---+---+
2 row(s) fetched.
Elapsed 0.055 seconds.
To Reproduce
No response
Expected behavior
Partition evolution is handled, and both cases return:
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 | 2 |
+---+---+
Additional context
Having played around quite a bit with ParquetExec and the SchemaAdapter machinery, I think what should happen is:
- Partition values are on a per-file basis, in particular on each PartitionedFile and not on the FileScanConfig
- Partition values are passed into the SchemaAdapter machinery, which decides for each file whether it needs to add a column generated from the partition values
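
To make the proposal concrete, here is a minimal sketch of the per-file decision in plain Python. It is only an illustration of the idea, not DataFusion code: `partition_values` and `adapt_rows` are hypothetical helpers, values are kept as strings, and type casting is left out.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Collect Hive-style key=value directory segments from a file path."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def adapt_rows(path, file_columns, rows, table_columns):
    """Map one file's rows onto the table schema (hypothetical helper).

    For each table column: use the file's own column if it has one,
    otherwise fall back to this file's partition value, otherwise None.
    """
    parts = partition_values(path)
    index = {name: i for i, name in enumerate(file_columns)}
    adapted = []
    for row in rows:
        out = []
        for col in table_columns:
            if col in index:
                out.append(row[index[col]])  # column is stored in the file
            else:
                out.append(parts.get(col))  # column comes from the path, if present
        adapted.append(tuple(out))
    return adapted


# The two files from the reproduction above, with the table schema (b, a).
table_columns = ["b", "a"]
result = (
    adapt_rows("data1.csv", ["a", "b"], [("1", "2")], table_columns)
    + adapt_rows("a=2/data2.csv", ["b"], [("3",)], table_columns)
)
print(result)
```

The point of the sketch is that the fallback is resolved per file: `data1.csv` supplies `a` from its own data, while `a=2/data2.csv` supplies it from its path.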