Why does s3.read_parquet() return different data types depending on chunk size? #3123

Closed
jakov7 opened this issue Apr 2, 2025 · 2 comments · Fixed by #3127
Labels
question Further information is requested

jakov7 commented Apr 2, 2025

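import awswrangler as wr
import pandas as pd

# Write a three-row categorical column as a Parquet dataset.
df = pd.DataFrame({"cat1": ['a', 'b', 'b']})
df["cat1"] = df["cat1"].astype("category")

wr.s3.to_parquet(
    df=df,
    path='s3://DWH/test',
    dataset=True
)
# Read it back with several chunk sizes and print the dtype of each chunk.
for chunk_size in range(1, 4):
    print(f"chunk_size: {chunk_size}")
    for chunk in wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size):
        print(chunk["cat1"].dtype)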

This returns category for every chunk:

chunk_size: 1
category
category
category
chunk_size: 2
category
category
chunk_size: 3
category

# Write a second frame that introduces a new category, 'xxx'.
df = pd.DataFrame({"cat1": ['a', 'b', 'b', 'xxx']})
df["cat1"] = df["cat1"].astype("category")

wr.s3.to_parquet(
    df=df,
    path='s3://DWH/test',
    dataset=True
)
# Read it back again and print the dtype of each chunk.
for chunk_size in range(1, 8):
    print(f"chunk_size: {chunk_size}")
    for chunk in wr.s3.read_parquet("s3://DWH/test", chunked=chunk_size):
        print(chunk["cat1"].dtype)

This returns a mix of category and object dtypes:

chunk_size: 1
category
category
category
category
category
category
category
chunk_size: 2
category
category
category
category
chunk_size: 3
category
object
object
chunk_size: 4
category
category
chunk_size: 5
object
object
chunk_size: 6
object
object
chunk_size: 7
object

The problem occurs when a new category is introduced. Is there a way to make sure category is always returned? pyarrow_additional_kwargs does not help with this.

kukushking (Contributor) commented

Hi @jakov7, this is related to pandas-dev/pandas#51362. It looks like pd.concat() does not preserve the category dtype when the categories of the inputs do not match.
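
The pd.concat() behaviour described here can be reproduced without S3. The following is a minimal sketch in plain pandas; the two series are illustrative stand-ins for chunks read from files with different category sets, not output from awswrangler itself.

```python
import pandas as pd

# Two categorical series: the second one introduces the extra category "xxx".
first = pd.Series(["a", "b", "b"], dtype="category")
second = pd.Series(["a", "b", "b", "xxx"], dtype="category")

# Identical category sets: the result stays categorical.
same = pd.concat([first, first], ignore_index=True)

# Different category sets: pandas falls back to object.
mixed = pd.concat([first, second], ignore_index=True)

print(same.dtype)   # category
print(mixed.dtype)  # object
```

This would be consistent with the output above if a chunk that spans rows from both writes is assembled by concatenation, though the sketch does not show how awswrangler builds its chunks.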

kukushking (Contributor) commented

We might be able to fix it with union_categoricals() - I'll test it and let you know.
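
As a rough illustration of the union_categoricals() idea, the sketch below aligns the categories of several frames before concatenating them. The helper name concat_preserving_categories is hypothetical, and this is not necessarily how the eventual fix was implemented; it only shows that pandas.api.types.union_categoricals can be used to keep the category dtype.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concat_preserving_categories(frames, column):
    """Hypothetical helper: concatenate frames while keeping `column` categorical."""
    # Build the union of all categories seen in the column across the frames.
    union = union_categoricals([frame[column] for frame in frames])
    dtype = pd.CategoricalDtype(categories=union.categories)
    # Recast every frame to the shared dtype so pd.concat sees matching categories.
    aligned = [frame.assign(**{column: frame[column].astype(dtype)}) for frame in frames]
    return pd.concat(aligned, ignore_index=True)


df1 = pd.DataFrame({"cat1": pd.Series(["a", "b", "b"], dtype="category")})
df2 = pd.DataFrame({"cat1": pd.Series(["a", "b", "b", "xxx"], dtype="category")})

result = concat_preserving_categories([df1, df2], "cat1")
print(result["cat1"].dtype)  # category
```

Note that union_categoricals() raises for ordered categoricals whose categories differ, so this sketch assumes unordered categories as in the example from the question.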
