Add write operator in new logical plan #32440
Conversation
Signed-off-by: jianoaix <iamjianxiao@gmail.com>
```python
super().__init__(
    "Write",
    input_op,
    fn=lambda x: x,
```
It's kind of awkward that we can't support the write operator without a dummy `fn`. Could we mark `fn` as optional in `AbstractMap`?
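For what it's worth, a minimal sketch of what that could look like. The `LogicalOperator` stub and all signatures here are illustrative, not the actual Ray Data internals:

```python
from typing import Callable, List, Optional


class LogicalOperator:
    # Stub standing in for Ray Data's logical-plan base class.
    def __init__(self, name: str, input_dependencies: List["LogicalOperator"]):
        self._name = name
        self._input_dependencies = input_dependencies


class AbstractMap(LogicalOperator):
    def __init__(
        self,
        name: str,
        input_op: LogicalOperator,
        fn: Optional[Callable] = None,  # optional, so Write needn't pass a dummy fn
    ):
        super().__init__(name, [input_op])
        self._fn = fn
```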
Shouldn't `fn` be the write function for the data here?
Like `fn=datasource.write`?
The `fn` is the UDF, and `datasource.write` is the `transform_fn`, which is internal. Maybe we should rename the user-supplied `fn` to `udf` for clarity.
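To illustrate the distinction being made here (all names are hypothetical, not Ray Data's actual internals): the user-supplied UDF operates on user-visible data, while the internal `transform_fn` operates on blocks and is what the physical operator actually runs.

```python
from typing import Callable, Iterable, List

Block = List  # stand-in block type for the sketch


def make_udf_transform_fn(udf: Callable[[Block], Block]) -> Callable:
    # Internal transform for a map: applies the user's UDF to each block.
    def transform_fn(blocks: Iterable[Block]) -> Iterable[Block]:
        for block in blocks:
            yield udf(block)
    return transform_fn


def make_write_transform_fn(write: Callable[[Block], None]) -> Callable:
    # Internal transform for a write: calls datasource.write, no UDF involved.
    def transform_fn(blocks: Iterable[Block]) -> Iterable[Block]:
        for block in blocks:
            write(block)
            yield block
    return transform_fn
```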
I see. I think you actually want a hierarchy of `AbstractMap > BatchMap` and `AbstractMap > Write` then? Both generate maps, just with different strategies.
If we want a logical-level base operator for all map-like operators, then we should probably introduce a new base `AbstractMap`, with the existing `AbstractMap` renamed to `AbstractUDFMap` (rough code sketch after the outline):
AbstractMap(ray_remote_args)
- Read
- Write
- AbstractUDFMap(fn, compute)
  - MapBatches
  - MapRows
  - Filter
  - FlatMap
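For concreteness, a rough sketch of that hierarchy. Only the class names come from the outline above; the constructor details are elided and illustrative:

```python
from typing import Callable, Dict, List, Optional


class LogicalOperator:
    def __init__(self, name: str, input_dependencies: List["LogicalOperator"]):
        self._name = name
        self._input_dependencies = input_dependencies


class AbstractMap(LogicalOperator):
    """Base for all map-like logical ops; carries only ray_remote_args."""

    def __init__(
        self,
        name: str,
        input_op: Optional[LogicalOperator],  # None for Read, which has no upstream
        ray_remote_args: Optional[Dict] = None,
    ):
        super().__init__(name, [input_op] if input_op else [])
        self._ray_remote_args = ray_remote_args or {}


class Read(AbstractMap):
    ...


class Write(AbstractMap):
    ...


class AbstractUDFMap(AbstractMap):
    """Adds the user-supplied UDF and compute strategy on top of AbstractMap."""

    def __init__(
        self,
        name: str,
        input_op: LogicalOperator,
        fn: Callable,
        compute: Optional[str] = None,
        ray_remote_args: Optional[Dict] = None,
    ):
        super().__init__(name, input_op, ray_remote_args)
        self._fn = fn
        self._compute = compute


class MapBatches(AbstractUDFMap):
    ...


class MapRows(AbstractUDFMap):
    ...


class Filter(AbstractUDFMap):
    ...


class FlatMap(AbstractUDFMap):
    ...
```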
Uplifting `AbstractMap` looks good to me, and it seems to correspond to the `MapOperator` on the physical operator side.
I'll leave it to you to decide, since I don't have much context on the fusion/planner code yet.
The only point I want to make is that I don't think this class really needs to exist, except as an alias: `Write ~= MapBatches(batch_size=None, fn=datasource.write_fn, ray_remote_args={})`. But this may just be a non-useful shortcut.
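As a hypothetical illustration of that alias view, reusing the `MapBatches` signature from the hierarchy sketch above (the real `MapBatches` would presumably also take `batch_size`):

```python
def write_as_alias(input_op, datasource):
    # Write ~= a batch map whose "UDF" is the datasource's write function.
    return MapBatches("Write", input_op, fn=datasource.write)
```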
I feel quite strongly that that's a decision for the planner to make. There's limited utility in making that translation/assertion at the logical level via operator abstraction hierarchies; we should only create abstraction hierarchies at the logical operator level if they're useful for the optimization rules or the planner. That shortcut only becomes useful at the planning level, when we're creating the physical `MapOperator`.
Btw, let's check whether the write operator breaks the current `randomize_blocks_order` reordering.
```python
@@ -518,6 +518,12 @@ def test_read_map_chain_operator_fusion_e2e(ray_start_regular_shared, enable_opt
        assert name in ds.stats()


def test_write_operator(ray_start_regular_shared, enable_optimizer, tmp_path):
```
Can we also add a unit test for the `Write` operator similar to the others, such as `test_sort_operator` below?
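Something along these lines, perhaps. This mirrors the shape of the existing operator tests; the exact constructor and planner signatures here are assumptions:

```python
def test_write_operator(ray_start_regular_shared, enable_optimizer):
    # Assumes the same imports as the surrounding test file
    # (Planner, LogicalPlan, Read, Write, ParquetDatasource, MapOperator).
    planner = Planner()
    datasource = ParquetDatasource()
    read_op = Read(datasource)
    op = Write(read_op, datasource)
    plan = LogicalPlan(op)
    physical_op = planner.plan(plan).dag

    assert op.name == "Write"
    assert isinstance(physical_op, MapOperator)
    assert len(physical_op.input_dependencies) == 1
```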
```python
@@ -119,6 +120,26 @@ def __init__(
    )


class Write(AbstractMap):
```
@c21 @jianoaix Should `Write` derive from `AbstractMap`, or should it be a standalone logical operator like `Read`? https://github.com/ray-project/ray/blob/2248ea602fbf6c53db1c5afc58f8bd386a66e1de/python/ray/data/_internal/logical/operators/read_operator.py
Then we could avoid this awkwardness with `fn` and the like. I don't see a compelling reason for `Write` to derive from `AbstractMap`, on first pass.
@clarkzinzow - I think if we make `Write` not an `AbstractMap`, we need to add some extra code during planning (as in the current code).
Would it break anything for operator fusion and zero-copy batching? If not, I'm also in favor of making `Write` extend `LogicalOperator` directly, because `Read` does that, and we can add a separate `plan_write_op`.
@c21 Yeah, I think it's worth breaking out the separate planning bit, and it shouldn't break anything with operator fusion or zero-copy batching; it will actually line up better with the latter!
I have this bit refactored to make the dispatch easier to centralize beyond just the `AbstractMap` ops: https://github.com/ray-project/ray/pull/32178/files#diff-4caa8ddd8103dd8f8d6a3e8c1237aec4eaa168a81dc914ae83b4f6042d68a1da
Cool, then +1 to making `Write` extend `LogicalOperator` directly.
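For concreteness, a hedged sketch of that direction: `Write` extends `LogicalOperator` directly (like `Read`), and the planner gets a dedicated `plan_write_op`. Aside from the names already discussed in this thread, everything here (the `generate_write_fn` helper, the `MapOperator` constructor, the block handling) is illustrative; it also leans on the `LogicalOperator` stub from the earlier sketches:

```python
from typing import Any, Callable, Iterable


def generate_write_fn(datasource: "Datasource", **write_args) -> Callable:
    # Hypothetical helper: builds the blocks -> write transform for the planner.
    def transform_fn(blocks: Iterable[Any]) -> Iterable[Any]:
        for block in blocks:
            datasource.write(block, **write_args)
            yield block
    return transform_fn


class Write(LogicalOperator):
    def __init__(self, input_op: LogicalOperator, datasource: "Datasource", **write_args):
        super().__init__("Write", [input_op])
        self._datasource = datasource
        self._write_args = write_args


def plan_write_op(op: Write, input_physical_dag) -> "MapOperator":
    # The planner lowers the logical Write to a physical MapOperator whose
    # transform calls datasource.write on incoming blocks (the MapOperator
    # constructor here is a stand-in, not the actual physical-operator API).
    transform_fn = generate_write_fn(op._datasource, **op._write_args)
    return MapOperator(transform_fn, input_physical_dag, name="Write")
```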
Yeah, I was feeling the `fn` was weird to have, since it's not relevant for `Write`.
I had local changes not pushed here; I'll make a followup.
With the new `write` added (from #32015 and #32440), Ray Data intends to support both the `write` and `do_write` functions for now. The check currently uses `hasattr()` to ensure the datasource object has a `write` method before using it. However, this is insufficient for a custom datasource that inherits from `Datasource`, since `Datasource` has a `write` method implemented: if the custom datasource only implements `do_write`, `hasattr(datasource, "write")` will still return True, because `hasattr()` detects methods inherited from the base class. The solution is to check whether the `write` method was overridden from `Datasource.write`; any class that has not overridden `write` will have the equality check return True.
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
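A minimal sketch of that override check, assuming the public `Datasource` base class (the helper name is illustrative):

```python
from ray.data.datasource import Datasource


def implements_write(datasource: Datasource) -> bool:
    # hasattr(datasource, "write") is always True here, because `write` is
    # inherited from Datasource; compare against the base implementation
    # instead, so only genuine overrides count.
    return type(datasource).write is not Datasource.write
```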
Why are these changes needed?
This is a followup to #32015.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.