scan_parquet with allow_missing_columns does not include the missing columns #20639
Comments
Note that the docs for `allow_missing_columns` explicitly mention that it is based on the first file; I don't think this is a bug.
The docs say: […]

This behavior does not match my example. Column […]

Since […]
Ah, I see. This is referring to schema inconsistency in the opposite order to the one in my example. Ok. How are users supposed to handle inconsistency in the order in my example? Just use the concat approach? I don't really understand the use case being targeted here. I can see how treating the first file as special is useful from an implementation perspective, but from a user's perspective I don't think this is desirable. Was this behavior motivated by implementation constraints, or is this asymmetry desired in some use case I haven't thought of?

A common situation is that you have files generated over time (e.g. one per month), and over the years new columns are added to the data. If the files are sorted alphabetically, the first file will be the oldest, which is missing the new columns.

The impression I get (from this, and some bugs I've mentioned in Discord last year) is that users should generally avoid passing multiple files to one `scan_parquet` call. Or, what if we add an extra argument to […]
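For reference, the "concat approach" mentioned above might look like the following minimal sketch (the file paths are hypothetical): scan each file separately and combine the lazy frames with a relaxed diagonal concat, which unions the schemas regardless of file order.

```python
import polars as pl

# Hypothetical file list; the files may have different column sets.
paths = ["data_2023.parquet", "data_2024.parquet"]

# Scan each file individually, then union the schemas with a relaxed
# diagonal concat. Missing columns are filled with nulls, and no file
# is treated as special.
lf = pl.concat(
    [pl.scan_parquet(p) for p in paths],
    how="diagonal_relaxed",
)
print(lf.collect())
```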
The reason this is not allowed is mostly that it would mean we would need to scan every file before we can create a query plan. What I think we should do is allow […]
As a workaround, you can make an empty parquet file with your good schema and then use that as your first file.

```python
import io

import polars as pl

f_with = io.BytesIO()     # has columns "a" and "b"
f_without = io.BytesIO()  # has only column "a"
f_schema = io.BytesIO()   # empty file that only carries the full schema

pl.DataFrame({"a": [1], "b": [1]}).write_parquet(f_with)
pl.DataFrame({"a": [1]}).write_parquet(f_without)
pl.DataFrame(schema={"a": pl.Int64, "b": pl.Int64}).write_parquet(f_schema)

f_with.seek(0)
f_without.seek(0)
f_schema.seek(0)

# The schema-only file goes first, so its schema becomes the reference.
print(
    pl.scan_parquet(
        [f_schema, f_without, f_with],
        allow_missing_columns=True,
    )
    .select(pl.all())
    .collect()
)
```
```
shape: (2, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 1   ┆ 1    │
└─────┴──────┘
```
Hmm, that's more work than just doing the concat approach. E.g. here is one dataset that I'm scanning: it's got about 70 columns, and I'm scanning about 10 datasets like that. I don't want to hard-code that, so I'll stick to the concat approach.

I still think it's worth modifying the documentation. How about adding to […]
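If hard-coding the schema is the sticking point, one way to generalize the workaround above is to union the per-file schemas programmatically and generate the schema-only file from that. A sketch (the paths and the `_schema.parquet` temp file are hypothetical; `pl.read_parquet_schema` reads only parquet metadata, not row data):

```python
import polars as pl

paths = ["dataset_01.parquet", "dataset_02.parquet"]  # hypothetical paths

# Build the union of all file schemas from parquet metadata only.
# If dtypes conflict across files, the first occurrence wins here.
schema: dict = {}
for p in paths:
    for name, dtype in pl.read_parquet_schema(p).items():
        schema.setdefault(name, dtype)

# Write an empty, schema-only file and put it first in the scan,
# as in the workaround above.
pl.DataFrame(schema=schema).write_parquet("_schema.parquet")

lf = pl.scan_parquet(["_schema.parquet"] + paths, allow_missing_columns=True)
print(lf.collect())
```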
It is probably going to be a lot more efficient and optimization friendly. That is partially why we would want to support a better version of this.
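To make the efficiency point concrete, one can compare the optimized plans of the two approaches with `LazyFrame.explain()` (a sketch with hypothetical paths; the exact plan text depends on the Polars version). The multi-file scan stays a single scan node, while the concat approach becomes a union over per-file scans.

```python
import polars as pl

paths = ["data_2023.parquet", "data_2024.parquet"]  # hypothetical paths

# One scan node covering all files.
single = pl.scan_parquet(paths, allow_missing_columns=True)
# A union of one scan node per file.
concat = pl.concat([pl.scan_parquet(p) for p in paths], how="diagonal_relaxed")

# Compare the optimized query plans, e.g. to see how projections and
# predicates are pushed down in each case.
print(single.select("a").explain())
print(concat.select("a").explain())
```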
Checks
Reproducible example
Log output
Issue description
When you try to scan multiple files in one `scan_parquet` call with `allow_missing_columns=True`, the 'missing' columns are still missing in the final output.

Might be related to #20361.
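A minimal sketch of the scenario described (an assumption-based reconstruction: two in-memory parquet files, the first missing column `b`):

```python
import io

import polars as pl

# Two parquet files; the first one is missing column "b".
f1, f2 = io.BytesIO(), io.BytesIO()
pl.DataFrame({"a": [1]}).write_parquet(f1)
pl.DataFrame({"a": [2], "b": [3]}).write_parquet(f2)
f1.seek(0)
f2.seek(0)

# Reported behavior: column "b" is absent from the result, because the
# schema of the first file is used as the reference schema.
print(pl.scan_parquet([f1, f2], allow_missing_columns=True).collect())
```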
Expected behavior
I expect that the return value of `scan_parquet` is exactly the same as if I manually scanned each individual file and did a `diagonal_relaxed` concat.

I expect that if I set `allow_missing_columns=True`, then the order of the files does not matter (other than impacting the order of rows), and that the first file is not treated as special.

Installed versions