
scan_parquet with allow_missing_columns does not include the missing columns #20639

Open
mdavis-xyz opened this issue Jan 9, 2025 · 8 comments
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@mdavis-xyz
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import io
import polars as pl

f_with = io.BytesIO()
f_without = io.BytesIO()

pl.DataFrame({ 'a': [1], 'b': [1] }).write_parquet(f_with)
pl.DataFrame({ 'a': [1] }).write_parquet(f_without)

f_with.seek(0)
f_without.seek(0)

print(pl.scan_parquet([f_without, f_with], allow_missing_columns=True).select(pl.all()).collect())

Log output

parquet scan with parallel = None
parquet scan with parallel = None
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 1   │
└─────┘

Issue description

When you try to scan multiple files in one scan_parquet call, with allow_missing_columns=True, the 'missing' columns are still missing in the final output.

Might be related to #20361.

Expected behavior

I expect that the return value of scan_parquet is exactly the same as if I manually scanned each individual file and did a diagonal_relaxed concat.

pl.concat([pl.scan_parquet(f) for f in [f_without, f_with]], how='diagonal_relaxed').select(pl.all()).collect()
shape: (2, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 1   ┆ 1    │
└─────┴──────┘

I expect that if I set allow_missing_columns=True, then the order of the files does not matter (other than affecting the order of rows), and that the first file is not treated as special.
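
For comparison, reversing the file order so that the file containing both columns comes first does yield the full-NULL column for the other file (untested sketch; the buffers must be rewound first):

f_with.seek(0)
f_without.seek(0)

# First file has both columns, so the inferred schema is {a: Int64, b: Int64}
# and the file missing 'b' gets a full-NULL column.
print(pl.scan_parquet([f_with, f_without], allow_missing_columns=True).collect())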

Installed versions

--------Version info---------
Polars:              1.19.0
Index type:          UInt32
Platform:            Linux-6.8.0-51-generic-x86_64-with-glibc2.39
Python:              3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0]
LTS CPU:             False

----Optional dependencies----
nest_asyncio         1.6.0
numpy                2.2.1
openpyxl             3.1.5
pandas               2.2.3
pyarrow              18.1.0
(all other optional dependencies: <not installed>)
@mdavis-xyz mdavis-xyz added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 9, 2025
@coastalwhite
Collaborator

Note that allow_missing_columns explicitly mentions that it is based on the first file:

allow_missing_columns
When reading a list of parquet files, if a column existing in the first file cannot be found in
subsequent files, the default behavior is to raise an error. However, if allow_missing_columns is set to
True, a full-NULL column is returned instead of erroring for the files that do not contain the column.

I don't think this is a bug.

@coastalwhite coastalwhite added invalid A bug report that is not actually a bug and removed needs triage Awaiting prioritization by a maintainer labels Jan 9, 2025
@mdavis-xyz
Contributor Author

mdavis-xyz commented Jan 9, 2025

The docs say:

if allow_missing_columns is set to True, a full-NULL column is returned instead of erroring for the files that do not contain the column.

This behavior does not match my example. Column b is not returned as a full-NULL or partly-NULL column. Column b is missing.
So I still think this is a bug.

@coastalwhite
Collaborator

coastalwhite commented Jan 9, 2025

Since f_without is the first file and does not have column b, the schema is assumed to be { a: Int64 }, as described in the docs.

@mdavis-xyz
Contributor Author

Ah, I see. This refers to schema inconsistency in the opposite order from what's in my example.

Ok. How are users supposed to handle inconsistency in the order in my example? Just use the concat approach?

I don't really understand the use case being targeted here. I can see how treating the first file as special is useful from an implementation perspective. However, from a user's perspective, I don't think this is desirable. Was this behavior motivated by implementation constraints, or is this asymmetry desired in some use case I haven't thought of?

A common situation is that you have files generated over time (e.g. one per month), and over the years new columns are added to the data. If the files are sorted alphabetically, the first file will be the oldest, which is missing the new columns.

The impression I get (from this, and some bugs I've mentioned in discord last year) is that users should generally avoid passing multiple files to one scan_parquet call, unless they know for sure that each file has every column. I think the documentation should make this clearer, and suggest the explicit concat approach as a workaround (noting that there may be some performance penalty).

Or, what if we add an extra argument to scan_parquet, which defaults to False (current behavior), and if set to True, does the concat approach for us? (It gets annoying copy-pasting the line pl.concat([pl.scan_parquet(os.path.join(mydir, f)) for f in os.listdir(mydir)], how='diagonal_relaxed') all the time.)
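
Something like this user-side helper is what I keep rewriting (a sketch; the name scan_parquet_diagonal and the one-directory layout are just for illustration):

import os
import polars as pl

def scan_parquet_diagonal(directory: str) -> pl.LazyFrame:
    # Lazily scan every parquet file in the directory and combine them
    # with a relaxed diagonal concat, so each file may have a different
    # subset of columns.
    paths = sorted(os.path.join(directory, f) for f in os.listdir(directory))
    return pl.concat([pl.scan_parquet(p) for p in paths], how="diagonal_relaxed")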

@coastalwhite
Collaborator

coastalwhite commented Jan 9, 2025

The reason this is not allowed is mostly that it would mean we would need to scan every file before we can create a query plan. What I think we should do is allow allow_missing_columns=True to also work if an explicit schema argument is given.
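
Usage might then look like this (hypothetical; the interaction between schema and allow_missing_columns shown here is the proposal, not current behavior):

# Hypothetical: the user-provided schema fixes the query plan up front,
# and any file missing a column would get a full-NULL column for it.
lf = pl.scan_parquet(
    [f_without, f_with],
    schema={"a": pl.Int64, "b": pl.Int64},
    allow_missing_columns=True,
)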

@deanm0000
Collaborator

As a workaround, you can make an empty parquet with your good schema and then use that as your first file.

import io
import polars as pl

f_with = io.BytesIO()
f_without = io.BytesIO()
f_schema = io.BytesIO()

pl.DataFrame({ 'a': [1], 'b': [1] }).write_parquet(f_with)
pl.DataFrame({ 'a': [1] }).write_parquet(f_without)
pl.DataFrame(schema={"a":pl.Int64, "b":pl.Int64}).write_parquet(f_schema)

f_with.seek(0)
f_without.seek(0)
f_schema.seek(0)
print(pl.scan_parquet([f_schema, f_without, f_with], allow_missing_columns=True).select(pl.all()).collect())
shape: (2, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 1   ┆ 1    │
└─────┴──────┘

@deanm0000 deanm0000 added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working invalid A bug report that is not actually a bug labels Jan 9, 2025
@mdavis-xyz
Contributor Author

mdavis-xyz commented Jan 9, 2025

Hmm, that's more work than just doing the concat approach.

e.g. here is one dataset that I'm scanning. It's got about 70 columns. I'm scanning about 10 datasets like that. I don't want to hard-code that. I'll stick to the concat approach.

I still think it's worth modifying the documentation. How about adding to allow_missing_columns' explanation:

If a column that does not exist in the first file is found in subsequent files, it will be omitted from the returned value, regardless of the value of allow_missing_columns.
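
For what it's worth, the schema file could be generated rather than hard-coded (untested sketch using pl.read_parquet_schema; it still has to open every file, which is exactly the planning cost mentioned above):

import functools

files = [f_without, f_with]  # buffers from the example above
schemas = [pl.read_parquet_schema(f) for f in files]
# Later files win on dtype conflicts; new columns are appended.
merged = functools.reduce(lambda acc, s: {**acc, **s}, schemas, {})

f_schema = io.BytesIO()
pl.DataFrame(schema=merged).write_parquet(f_schema)
# Rewind everything before scanning.
for f in files + [f_schema]:
    f.seek(0)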

@coastalwhite
Collaborator

Hmm, that's more work than just doing the concat approach.

It is probably going to be a lot more efficient and optimization friendly. That is partially why we would want to support a better version of this.

@vyasr vyasr moved this to Todo in cuDF Python Feb 26, 2025
@vyasr vyasr removed this from cuDF Python Feb 26, 2025