Big file with dtype=pl.List(pl.UInt32) are written to parquet incorrectly (row_group_size) #6289

Closed
2 tasks done
Vincenthays opened this issue Jan 17, 2023 · 2 comments
Labels
A-io (Area: reading and writing data), bug (Something isn't working), python (Related to Python Polars)

Comments

@Vincenthays
Contributor

Vincenthays commented Jan 17, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Under very specific conditions, apparently when row_group_size is about a third of the DataFrame height, a column of type pl.List(pl.UInt32) is not written to Parquet correctly: the final list is read back as [null, null] instead of [1, 2].

Reproducible example

import polars as pl

df = pl.Series('a', [*[None]*900_000, [1, 2]], dtype=pl.List(pl.UInt32)).to_frame()
print(df.tail(1))  # every read-back below should match this output
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [1, 2]       │
# └──────────────┘

df.write_parquet('test.pq')
print(pl.read_parquet('test.pq').tail(1)) # not the same (not working)
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [null, null] │
# └──────────────┘

df.write_parquet('test.pq', row_group_size=300_000)
print(pl.read_parquet('test.pq').tail(1)) # the same (working)
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [1, 2]       │
# └──────────────┘

df.write_parquet('test.pq', row_group_size=300_001)
print(pl.read_parquet('test.pq').tail(1)) # not the same (not working)
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [null, null] │
# └──────────────┘
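
To see where the final [1, 2] row lands for the two row_group_size values above, here is a small sketch of the arithmetic. It assumes the writer simply cuts the frame into consecutive chunks of row_group_size rows, which may not match what Polars does internally; row_group_lengths is a hypothetical helper, not a Polars function.

```python
def row_group_lengths(height: int, row_group_size: int) -> list[int]:
    """Lengths of consecutive row groups under a plain fixed-size split (assumption)."""
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    full, rest = divmod(height, row_group_size)
    # `full` complete groups, plus one shorter trailing group if rows remain.
    return [row_group_size] * full + ([rest] if rest else [])


height = 900_001  # 900_000 nulls plus the single [1, 2] row
for size in (300_000, 300_001):
    lengths = row_group_lengths(height, size)
    print(f"row_group_size={size}: {len(lengths)} groups, last group has {lengths[-1]} rows")
```

Under that assumption, row_group_size=300_000 leaves the [1, 2] row alone in a fourth group, while row_group_size=300_001 puts it at the end of a third group of 299_999 rows that is otherwise all null.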

Expected behavior

>> import polars as pl
>> df = pl.Series('a', [*[None]*900_000, [1, 2]], dtype=pl.List(pl.UInt32)).to_frame()
>> df.tail(2)
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[u32] │
╞═══════════╡
│ null      │
│ [1, 2]    │
└───────────┘
>> df.write_parquet('test.pq')
>> pl.read_parquet('test.pq').tail(2)
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[u32] │
╞═══════════╡
│ null      │
│ [1, 2]    │
└───────────┘

Installed versions

---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: macOS-13.1-arm64-arm-64bit
Python: 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.11.0
connectorx: 0.3.1
xlsx2csv: <not installed>
matplotlib: <not installed>
Vincenthays added the bug and python labels on Jan 17, 2023
@ritchie46
Member

Thanks for the issue report. In the meantime you can circumvent the issue by writing with use_pyarrow=True.

@stinodego
Member

This has been fixed.

stinodego removed the needs triage label on Jan 18, 2024