[Data] Make zip operator accept multiple input #56524

owenowenisme · 2025-09-15T08:19:58Z

Why are these changes needed?

Before turning the zip operator into a streaming operator, we first make it accept multiple inputs.

The zip operator can now be used with multiple datasets:

>>> import ray
>>> ds1 = ray.data.range(5)
>>> ds2 = ray.data.range(5)
>>> ds3 = ray.data.range(5)
>>> ds1.zip(ds2, ds3).take_batch()
{'id': array([0, 1, 2, 3, 4]), 'id_1': array([0, 1, 2, 3, 4]), 'id_2': array([0, 1, 2, 3, 4])}
>>> ds1.zip(ds2, ds3).take_all()
[{'id': 0, 'id_1': 0, 'id_2': 0}, {'id': 1, 'id_1': 1, 'id_2': 1}, {'id': 2, 'id_1': 2, 'id_2': 2}, {'id': 3, 'id_1': 3, 'id_2': 3}, {'id': 4, 'id_1': 4, 'id_2': 4}]
>>>
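
For readers without a Ray cluster at hand, the column naming in the output above can be reproduced with a plain-Python sketch. This is not Ray's implementation: `zip_rows` is a hypothetical helper, and the `_1`, `_2` suffix rule is inferred only from the output shown in this description.

```python
from typing import Dict, List


def zip_rows(*datasets: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Zip lists of row dicts positionally (hypothetical helper, not Ray code).

    A column name that already exists in the output row gets a numeric
    suffix, which mirrors the 'id', 'id_1', 'id_2' names seen above.
    """
    if len(datasets) < 2:
        raise ValueError("zip needs at least two inputs")
    lengths = {len(rows) for rows in datasets}
    if len(lengths) != 1:
        raise ValueError(
            f"Cannot zip datasets of different number of rows: {sorted(lengths)}"
        )

    zipped = []
    for row_group in zip(*datasets):
        out: Dict[str, int] = {}
        for row in row_group:
            for name, value in row.items():
                new_name, suffix = name, 0
                # Bump the suffix until the name is unused in this output row.
                while new_name in out:
                    suffix += 1
                    new_name = f"{name}_{suffix}"
                out[new_name] = value
        zipped.append(out)
    return zipped


rows = [{"id": i} for i in range(5)]
result = zip_rows(rows, rows, rows)
print(result[0])  # {'id': 0, 'id_1': 0, 'id_2': 0}
```

The sketch also rejects inputs with different row counts, matching the ValueError the existing operator raises.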

Related issue number

#56504

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme · 2025-09-15T13:34:33Z

@richardliaw @gvspraveen @alexeykudinkin PTAL, thanks!

python/ray/data/_internal/execution/operators/zip_operator.py

python/ray/data/tests/test_zip.py

goutamvenkat-anyscale · 2025-09-15T17:50:32Z

Thanks for your contribution! Overall the first change looks good. Just a few minor comments.

iamjustinhsu · 2025-09-16T23:45:03Z

ci/lint/pydoclint-baseline.txt

 python/ray/data/_internal/execution/operators/zip_operator.py
    DOC101: Method `ZipOperator.__init__`: Docstring contains fewer arguments than in function signature.
-    DOC103: Method `ZipOperator.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [data_context: DataContext, left_input_op: PhysicalOperator]. Arguments in the docstring but not in the function signature: [left_input_ops: ].
+    DOC103: Method `ZipOperator.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [*input_ops: PhysicalOperator, data_context: DataContext]. Arguments in the docstring but not in the function signature: [input_ops: ].


Typically, when there are changes to the baseline, we want to fix them. Is this possible to fix, or is it a bug in the pydoclint linting?

Yeah, I'm aware of this, but I thought it was intended? Other operators don't have data_context in their docstrings either, e.g. UnionOperator:

    def __init__(
        self,
        data_context: DataContext,
        *input_ops: PhysicalOperator,
    ):
        """Create a UnionOperator.

        Args:
            input_ops: Operators generating input data for this operator to union.
        """

Ah, in the future we can just add data_context to the docstring; I think that's a good thing to fix.
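
A minimal sketch of what a fully documented signature could look like, assuming pydoclint matches variadic arguments by their starred name (the DOC103 message above lists `*input_ops` on the signature side and `input_ops` on the docstring side, which suggests this). `DataContext`, `PhysicalOperator`, and `ZipOperatorSketch` are hypothetical stand-ins, the argument order follows the UnionOperator snippet quoted above, and the `>= 2` check is only an assumed reading of the input-count assert.

```python
class DataContext:
    """Hypothetical stand-in for Ray Data's DataContext."""


class PhysicalOperator:
    """Hypothetical stand-in for Ray Data's PhysicalOperator."""


class ZipOperatorSketch:
    """Illustrative only; not the ZipOperator from this PR."""

    def __init__(self, data_context: DataContext, *input_ops: PhysicalOperator):
        """Create a zip operator over two or more inputs.

        Args:
            data_context: Execution context shared with the other operators.
            *input_ops: Operators generating input data for this operator
                to zip.
        """
        # Assumed lower bound: zipping needs at least two inputs.
        assert len(input_ops) >= 2, len(input_ops)
        self._data_context = data_context
        self._input_ops = list(input_ops)


op = ZipOperatorSketch(DataContext(), PhysicalOperator(), PhysicalOperator())
print(len(op._input_ops))  # 2
```

With both arguments documented, the DOC101 and DOC103 entries for the method would presumably drop out of the baseline instead of being rewritten.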

alexeykudinkin · 2025-09-17T00:14:54Z

python/ray/data/_internal/execution/operators/zip_operator.py

+            if num_outputs is None:
+                num_outputs = input_num_outputs
+            else:
+                num_outputs = max(num_outputs, input_num_outputs)


This should be min, not max

Let's make sure this is covered with tests also

Thanks! On second thought, neither max nor min seems accurate, right? Since the number of input blocks for each output should be the same (to perform a zip), and we already assert:

    total_left_rows = sum(left_block_rows)
    total_right_rows = sum(right_block_rows)
    if total_left_rows != total_right_rows:
        raise ValueError(
            "Cannot zip datasets of different number of rows: "
            f"{total_left_rows}, {total_right_rows}"
        )

Maybe we don't actually need to calculate num_outputs here?
Correct me if I'm wrong, thanks!

Even if we use min for the number of output rows right now, this logic will need to change when user-directed dropping or padding is introduced.

Padding would require max, while dropping would require min, so calculating the number of rows here is redundant.

Discussed with @gvspraveen offline.
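
The three semantics discussed in this thread can be put side by side in a small sketch. `zipped_row_count` and its `mode` values are hypothetical: only the strict behaviour (raise on mismatch) exists in the operator today, and dropping or padding are future options mentioned above, not implemented features.

```python
from typing import Sequence


def zipped_row_count(input_rows: Sequence[int], mode: str = "strict") -> int:
    """Row count of a zip over inputs with the given row counts.

    Hypothetical helper for discussion purposes only:
      - "strict": all inputs must match (current behaviour, raises otherwise)
      - "drop":   truncate to the shortest input, so min applies
      - "pad":    extend to the longest input, so max applies
    """
    if not input_rows:
        raise ValueError("need at least one input")
    if mode == "strict":
        if len(set(input_rows)) != 1:
            raise ValueError(
                "Cannot zip datasets of different number of rows: "
                f"{list(input_rows)}"
            )
        return input_rows[0]
    if mode == "drop":
        return min(input_rows)
    if mode == "pad":
        return max(input_rows)
    raise ValueError(f"unknown mode: {mode!r}")


print(zipped_row_count([5, 5, 5]))          # 5
print(zipped_row_count([5, 3, 4], "drop"))  # 3
print(zipped_row_count([5, 3, 4], "pad"))   # 5
```

In strict mode min and max coincide whenever the zip is valid, which is why neither choice adds information while mismatched inputs are rejected.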

alexeykudinkin · 2025-09-17T00:15:10Z

python/ray/data/_internal/execution/operators/zip_operator.py

+            if num_rows is None:
+                num_rows = input_num_rows
+            else:
+                num_rows = max(num_rows, input_num_rows)


Same comment

alexeykudinkin · 2025-09-17T00:17:00Z

python/ray/data/_internal/logical/operators/n_ary_operator.py

+            num_outputs = input.estimated_num_outputs()
+            if num_outputs is None:
+                return None
+            total_num_outputs = max(total_num_outputs, num_outputs)


Same comment

cursor · 2025-09-25T16:12:44Z

python/ray/data/_internal/execution/operators/zip_operator.py

-        else:
-            self._right_buffer.append(refs)
-            self._metrics.on_input_queued(refs)
+        assert 0 <= input_index <= len(self._input_dependencies), input_index


Bug: Index Assertion Error Causes Buffer Access Failure

The assertion for input_index in _add_input_inner allows an out-of-bounds index equal to len(self._input_dependencies). This can lead to an IndexError when accessing self._input_buffers.
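
The off-by-one can be reproduced outside Ray with a short sketch. The names below are hypothetical, and it assumes there is one buffer per input dependency, so that the length checked by the assertion equals the length of the buffer list.

```python
from typing import Callable, List

# One buffer per input, standing in for the operator's input buffers.
buffers: List[list] = [[], [], []]


def add_input_loose(refs: str, input_index: int) -> None:
    # Bound as written in the diff: input_index == len(buffers) passes.
    assert 0 <= input_index <= len(buffers), input_index
    buffers[input_index].append(refs)


def add_input_strict(refs: str, input_index: int) -> None:
    # Exclusive upper bound: only valid list indices pass.
    assert 0 <= input_index < len(buffers), input_index
    buffers[input_index].append(refs)


def outcome(fn: Callable[[str, int], None], input_index: int) -> str:
    """Return 'ok' or the name of the exception raised by fn."""
    try:
        fn("refs", input_index)
    except Exception as exc:
        return type(exc).__name__
    return "ok"


print(outcome(add_input_loose, 3))   # IndexError
print(outcome(add_input_strict, 3))  # AssertionError
print(outcome(add_input_strict, 2))  # ok
```

With the inclusive bound the bad index surfaces later as an IndexError on the buffer access; with the exclusive bound the assertion itself reports the offending index.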

owenowenisme and others added 4 commits September 13, 2025 09:27

inherit NAry operator first

4fd074c

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

fix zip operator

13441b7

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

make zip operator accept list of input

3de48b1

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

Merge branch 'master' into data/make-zip-operator-streaming

fb2eaff

owenowenisme marked this pull request as ready for review September 15, 2025 08:22

owenowenisme requested a review from a team as a code owner September 15, 2025 08:22

owenowenisme changed the title ~~Data/make zip operator accept multiple input~~ [Data] Make zip operator accept multiple input Sep 15, 2025

ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Sep 15, 2025

gvspraveen assigned alexeykudinkin and goutamvenkat-anyscale Sep 15, 2025

goutamvenkat-anyscale reviewed Sep 15, 2025

python/ray/data/_internal/execution/operators/zip_operator.py

goutamvenkat-anyscale reviewed Sep 15, 2025

python/ray/data/tests/test_zip.py

owenowenisme added 2 commits September 16, 2025 03:00

add assert for input_op length > 2

3e70237

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

modify test for zip operator to make it parametrize test with configu…

1548a44

…rable dataset size Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme requested a review from goutamvenkat-anyscale September 16, 2025 06:30

goutamvenkat-anyscale added the go (add ONLY when ready to merge, run all tests) label Sep 16, 2025

goutamvenkat-anyscale approved these changes Sep 16, 2025

iamjustinhsu reviewed Sep 16, 2025

iamjustinhsu approved these changes Sep 16, 2025

alexeykudinkin reviewed Sep 17, 2025

owenowenisme added 2 commits September 17, 2025 04:00

Merge remote-tracking branch 'upstream/master' into data/make-zip-ope…

19dbb20

…rator-accept-multiple-input

update pydo lint

592479d

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

owenowenisme marked this pull request as draft September 25, 2025 11:59

owenowenisme marked this pull request as ready for review September 25, 2025 16:07

richardliaw merged commit 09a9970 into ray-project:master Sep 25, 2025
8 checks passed

cursor bot reviewed Sep 25, 2025

snorkelopstesting3-bot mentioned this pull request Oct 22, 2025

[Data] Make zip operator accept multiple input snorkel-marlin-repos/ray-project_ray_pr_56524_2dd3e4b0-709f-4113-bec6-7d0226f2ba2a#1

Merged
