
fix: handle partitions with empty table in read_parquet with dataset=True #2983

Open

wants to merge 6 commits into main
Conversation


@cournape cournape commented Oct 2, 2024

When reading a set of parquet files with dataset=True, if the first partition is empty, the current logic for dtype inference fails and raises an exception like the following:

pyarrow.lib.ArrowTypeError: Unable to merge: Field col0 has incompatible
types: dictionary<values=null, indices=int32, ordered=0> vs
dictionary<values=string, indices=int32, ordered=0>

To fix this, we filter out the empty table(s) before merging them into a single table.

Note: I have only run the mock test suite; I can't easily run the full suite against actual AWS services.
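
For illustration, a minimal sketch of that approach (the helper name and signature are made up here, not taken from the actual diff), assuming the per-file results are pyarrow Tables:

```python
from typing import List

import pyarrow as pa


def _merge_non_empty(tables: List[pa.Table]) -> pa.Table:
    # Drop empty tables so their null-typed dictionary partition columns
    # do not break schema merging.
    non_empty = [table for table in tables if table.num_rows > 0]
    # If every table is empty, fall back to the originals so an empty
    # result with the original schema is still returned.
    return pa.concat_tables(non_empty or tables)
```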

…mpty.

When reading a set of parquet files with dataset=True, if the first
partition is empty, the current logic for dtype inference fails and
raises an exception like the following:

    ```
    pyarrow.lib.ArrowTypeError: Unable to merge: Field col0 has incompatible
    types: dictionary<values=null, indices=int32, ordered=0> vs
    dictionary<values=string, indices=int32, ordered=0>
    ```

To fix this, we filter out the empty table(s) before merging them into a
single table.
@cournape cournape changed the title BUG: fix read_parquet with dataset=True when the first partition is e… FIX: fix read_parquet with dataset=True when the first partition is e… Oct 2, 2024

@cournape cournape changed the title FIX: fix read_parquet with dataset=True when the first partition is e… fix: handle partitions with empty table in read_parquet with dataset=True Oct 2, 2024

While that corner case was caught in the full test suite, we add a mock
test for it for a quick turnaround.

@jaidisido
Contributor

I am a bit confused about the way you are generating your partitioned dataset. Specifically, this bit:

    for i, df in enumerate(dataframes):
        wr.s3.to_parquet(
            df=df,
            path=f"{s3_key}/part{i}.parquet",
        )

You are artificially creating the partitions instead of relying on awswrangler or pandas to do it.

When I rewrite your test with a proper partitioning call, the exception is not raised:

def test_s3_dataset_empty_table(moto_s3_client: "S3Client") -> None:
    """Test that a dataset split into multiple parquet files whose first
    partition is an empty table still loads properly.
    """
    s3_key = f"s3://bucket/"

    dtypes = {"id": "string[python]"}
    df1 = pd.DataFrame({"id": []}).astype(dtypes)
    df2 = pd.DataFrame({"id": ["1"] * 2}).astype(dtypes)
    df3 = pd.DataFrame({"id": ["1"] * 3}).astype(dtypes)

    dataframes = [df1, df2, df3]
    r_df = pd.concat(dataframes, ignore_index=True)
    r_df = r_df.assign(col0=pd.Categorical(["1"] * len(r_df)))

    wr.s3.to_parquet(r_df, path=s3_key, dataset=True, partition_cols=["col0"])

    result_df = wr.s3.read_parquet(path=s3_key, dataset=True)
    pd.testing.assert_frame_equal(result_df, r_df, check_dtype=True)

The difference is that, in the code above, the files are written as a single dataset and the metadata for it is preserved.

@cournape
Author

cournape commented Oct 2, 2024

> The difference is that, in the code above, the files are written as a single dataset and the metadata for it is preserved.

In that test example, there is no empty table written to S3, since you are concatenating the dataframes before writing to S3.

To trigger the error, the execution needs to go through `_add_table_partitions` with more than one table, including an empty one.

The exception I am trying to fix is triggered by some real parquet datasets created from Spark. If that helps, I am happy to share details, including the dataset, through our internal Slack at Amazon.
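
For reference, a minimal reproduction sketch along these lines (the bucket name, key layout, and column names are invented for illustration; this is not the PR's test):

```python
import pandas as pd
import awswrangler as wr

# Hypothetical layout: the "col0=1/" component in the key is what makes
# read_parquet(dataset=True) add a dictionary-encoded partition column.
path = "s3://bucket/dataset/"

empty = pd.DataFrame({"id": pd.Series([], dtype="string")})
full = pd.DataFrame({"id": pd.Series(["1", "2"], dtype="string")})

# Write the files individually (as e.g. a Spark job would), the first one empty.
wr.s3.to_parquet(df=empty, path=f"{path}col0=1/part0.parquet")
wr.s3.to_parquet(df=full, path=f"{path}col0=1/part1.parquet")

# Without the fix, this raises pyarrow.lib.ArrowTypeError while merging the
# per-file tables, because the empty file's partition column is inferred as
# dictionary<values=null, ...>.
df = wr.s3.read_parquet(path=path, dataset=True)
```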

@cournape
Author

cournape commented Oct 2, 2024

To trigger the bug, you need all of the following to be true:

  1. dataset=True
  2. the dataset being read has more than one parquet file, with the first one read being empty (order may not matter)
  3. the S3 key must contain some partition so that awswrangler adds the corresponding column(s)

When that happens, the underlying error comes from the type inference happening here: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/_arrow.py#L41.

If a table is not empty, part_value will contain the right "type" for the added partition column. But if the table is empty, you get a type that is independent of the value, e.g.

>>> pa.array(["1"] * 0).dictionary_encode()
<pyarrow.lib.DictionaryArray object at 0x7f63e97d9cb0>

-- dictionary:
0 nulls
-- indices:
  []

vs. (non-empty table)

>>> pa.array(["1"] * 1).dictionary_encode()
<pyarrow.lib.DictionaryArray object at 0x7f63bda8eb20>

-- dictionary:
  [
    "1"
  ]
-- indices:
  [
    0
  ]

When both cases happen in the tables list, you get an exception when merging them because of incompatible column types.
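
To make that concrete, a small standalone sketch of the failure and of the workaround this PR takes (the column name is invented; this mirrors the failure mode rather than awswrangler's exact read path):

```python
import pyarrow as pa

# Empty partition: pa.array([]) infers the null type, so the dictionary-encoded
# partition column becomes dictionary<values=null, indices=int32>.
empty = pa.table({"col0": pa.array([]).dictionary_encode()})

# Non-empty partition: the same column becomes dictionary<values=string, ...>.
full = pa.table({"col0": pa.array(["1"]).dictionary_encode()})

# On the pyarrow versions where this bug reproduces, merging the two schemas
# fails with ArrowTypeError: "Unable to merge: Field col0 has incompatible
# types: dictionary<values=null, ...> vs dictionary<values=string, ...>".
try:
    pa.unify_schemas([empty.schema, full.schema])
except pa.lib.ArrowTypeError as exc:
    print(exc)

# Filtering out empty tables first, as this PR does, sidesteps the
# null-valued dictionary type entirely.
tables = [t for t in (empty, full) if t.num_rows > 0]
print(pa.concat_tables(tables).schema)
```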

@cournape
Author

@jaidisido anything else I could provide to move this PR forward? Happy to add more tests if needed.

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 8170283
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 8170283
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido jaidisido requested a review from kukushking December 3, 2024 12:06
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 6ffa9f8
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 8170283
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 6ffa9f8
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 4996373
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 4996373
  • Result: FAILED
  • Build Logs (available for 30 days)

@kukushking kukushking requested a review from jaidisido December 3, 2024 22:07