[data] Handle nullable fields in schema across blocks for parquet files #48478

Merged
merged 9 commits into from
Nov 14, 2024

Contributor

@rickyyx rickyyx commented Oct 31, 2024

Why are these changes needed?

When writing blocks to Parquet, some blocks may have fields that differ ONLY in nullability. By default the write is rejected, because those blocks have a different schema than the one the ParquetWriter was created with. However, we can allow the write by adjusting the schema.

This PR goes through all blocks before writing them to Parquet and merges schemas that differ only in the nullability of their fields.
It also casts each table to the newly merged schema so that the write can proceed.
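
As a rough, library-free illustration of what "differ only in nullability" means, the sketch below models a schema as a mapping from field name to `(type name, nullable)`. The `Schema` alias and the `merge_nullable_fields` helper are hypothetical names made up for this example; they are not the code in this PR, which operates on PyArrow schemas.

```python
from typing import Dict, List, Tuple

# Hypothetical model: field name -> (type name, nullable). Not a PyArrow type.
Schema = Dict[str, Tuple[str, bool]]


def merge_nullable_fields(schemas: List[Schema]) -> Schema:
    """Merge schemas that differ only in nullability; raise on any other difference."""
    merged: Schema = dict(schemas[0])
    for schema in schemas[1:]:
        if list(schema) != list(merged):
            raise ValueError("schemas have different field names or order")
        for name, (type_name, nullable) in schema.items():
            merged_type, merged_nullable = merged[name]
            if type_name != merged_type:
                raise ValueError(f"field {name!r}: {merged_type} vs {type_name}")
            # A field is nullable in the result if any block has it nullable.
            merged[name] = (merged_type, merged_nullable or nullable)
    return merged


block_schemas = [
    {"a": ("int64", False), "b": ("int64", True)},
    {"a": ("int64", True), "b": ("int64", False)},
]
merged_schema = merge_nullable_fields(block_schemas)
```

In this model, a field ends up nullable if any block marks it nullable, while a difference in type or field names is still treated as an error.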

Related issue number

Closes #48102

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Member

@bveeramani bveeramani left a comment

High level approach LGTM

@@ -75,10 +75,12 @@ def write(

def write_blocks_to_path():
    with self.open_output_stream(write_path) as file:
        schema = BlockAccessor.for_block(blocks[0]).to_arrow().schema
        tables = [BlockAccessor.for_block(block).to_arrow() for block in blocks]
        schema = self._try_merge_nullable_fields(tables)
Member

Rather than introducing a new method, we could extend the existing unify_schemas function:

Comment on lines 82 to 83
if not table.schema.equals(schema):
    table = table.cast(schema)
Member

What happens if we don't explicitly cast the tables?

Contributor Author

The table would still have a mismatched schema, i.e. table.schema.equals(schema) would still be false in this case.

Member

Gotcha. I wasn't sure whether PyArrow would implicitly cast tables to match the specified schema under the hood.

Contributor Author

Yeah, it doesn't do the casting because there's a check on schema equality here:

https://github.com/apache/arrow/blob/main/python/pyarrow/parquet/core.py#L1110-L1114

Contributor Author

rickyyx commented Nov 4, 2024

Updates:

  • Use pyarrow.unify_schemas to unify the schemas from the various blocks. (I didn't use transform_arrows.py::unify_schemas, since that routine seems to do a lot of extra work that isn't required here, but I'm open to using it as well.)
  • Added some simple tests.

@rickyyx rickyyx marked this pull request as ready for review November 4, 2024 21:10
Member

@bveeramani bveeramani left a comment

LGTM

Comment on lines 1191 to 1193

if OperatorFusionRule in _PHYSICAL_RULES:
    _PHYSICAL_RULES.remove(OperatorFusionRule)
Member

Add a note why we're removing operator fusion here?

Contributor

Wait, why are we removing fusion?

We'd not need to do that

Contributor Author

If we don't do this, I think there ends up being only 1 block, so the repro here didn't work. I guess we would need some other example or repro if we don't remove the rule?

Contributor

Oh, I see what you're saying.

Sure, we can disable operator fusion, but that should be done through configuration, not by "physically" removing the rule from the list (just add a config to DataContext that disables it).

Member

> just add config to DataContext disabling it

@alexeykudinkin do you envision us adding a config for each optimization rule, or special-case operator fusion?

In any case, adding an interface for disabling optimization rules seems orthogonal to the goal of this PR, and can probably be handled as a follow-up?

Contributor

@rickyyx no need to block this PR on this, let's just reshape your test a bit:

  • Instead of using ray.data.range as source, create 2 parquet files -- 1 without nulls, another with nulls
  • Read both of these and then write out as single one

Contributor Author

@alexeykudinkin I might be missing something here, but I am not sure how I can force writing the 2 files with a single block without disabling operator fusion.

Something like the code below still writes the file from a single block (so technically no schema unification is needed):

    # Write each row to a separate file.
    for i, row in enumerate(row_data):
        ray.data.from_pandas(pd.DataFrame([row])).write_parquet(
            os.path.join(tmp_path, f"file_{i}.parquet")
        )

    # Read files and merge into a single file shouldn't error.
    ray.data.read_parquet(tmp_path).write_parquet(tmp_path, num_rows_per_file=2)

Member

> Instead of using ray.data.range as source, create 2 parquet files -- 1 without nulls, another with nulls
> Read both of these and then write out as single one

I don't think this'd reproduce the error. IIRC Ray Data will read both files in a single task, and then BlockOutputBuffer will combine the read data into a single block before passing it to the datasink

Contributor

Yeah, that might require some fidgeting to make it work.

An alternative path is to specify num_cpus, which should make them diverge and hence avoid fusion.

Contributor Author

I was able to force it by changing target_max_block_size, if that's a better approach.

[
[{"a": 1, "b": None}, {"a": 1, "b": 2}],
[{"a": None, "b": None}, {"a": 1, "b": 2}],
[{"a": 1, "b": 2}, {"a": 1, "b": "hi"}],
Member

What does the type get promoted to for "b" in this case?

Contributor Author

Oh, this shouldn't pass, actually. It was somehow passing when the fusion rule wasn't removed.


Signed-off-by: rickyx <rickyx@anyscale.com>
@rickyyx rickyyx requested a review from srinathk10 as a code owner November 13, 2024 02:31
Contributor Author

rickyyx commented Nov 13, 2024

Updates

  • Change the test to avoid removing the operator fusion rule
  • Resolve conflict.

],
ids=["row1_b_null", "row1_a_null", "row_each_null"],
)
def test_write_auto_infer_nullable_fields(tmp_path, ray_start_regular_shared, row_data):
Member

Add the restore_data_context fixture so that changes aren't persisted across tests

Suggested change
def test_write_auto_infer_nullable_fields(tmp_path, ray_start_regular_shared, row_data):
def test_write_auto_infer_nullable_fields(tmp_path, ray_start_regular_shared, row_data, restore_data_context):

ctx = DataContext.get_current()
# So that we force multiple blocks on mapping.
ctx.target_max_block_size = 1
ds = ray.data.range(len(row_data)).map(lambda i: row_data[i["id"]])
Member

Nit: The name i makes me think that i is an int (index).

Suggested change
ds = ray.data.range(len(row_data)).map(lambda i: row_data[i["id"]])
ds = ray.data.range(len(row_data)).map(lambda row: row_data[row["id"]])

(Feel free to keep it as-is, too)

Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
@rickyyx rickyyx enabled auto-merge (squash) November 13, 2024 22:31
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 13, 2024
@rickyyx rickyyx merged commit 138e59a into ray-project:master Nov 14, 2024
7 checks passed
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…es (ray-project#48478)

Signed-off-by: rickyx <rickyx@anyscale.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…es (ray-project#48478)

Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
…es (ray-project#48478)

Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Schema error while writing Parquet files
3 participants