[Data] Allow unknown estimate of operator output bundles and `ProgressBar` totals #46601

scottjlee · 2024-07-12T20:07:32Z

Ray Data initially assumes that each read task produces exactly one block. In addition, one-to-one operators assume that they output the same number of blocks as their upstream operator. Neither assumption is guaranteed to hold, which leads to inaccurate progress bar estimates and can confuse users. This PR updates PhysicalOperator.num_outputs_total() so that the estimated number of output bundles can be unknown. That is the case before any task has finished, when no reasonable estimate is available.

For example, given the following reproducible script:

import time
import numpy as np
import ray
ray.init(num_cpus=1)

target_block_size = ray.data.DataContext.get_current().target_max_block_size

def sleep(batch):
    for _ in range(100):
        time.sleep(0.1)
        yield {"batch": np.zeros((target_block_size,), dtype=np.uint8)}

ray.data.range(10, override_num_blocks=10).map_batches(
    sleep, batch_size=None
).materialize()

We can compare the behavior before and after this PR (video links):

Before
After
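The idea in the PR summary (no estimate until at least one task has finished, and a progress bar that tolerates an unknown total) can be sketched roughly as follows. This is a minimal illustration, not Ray's code: `ToyOperator`, `on_task_finished`, and `format_progress` are hypothetical names, and the extrapolation rule (average outputs per finished task times the number of tasks) is an assumption made for the example.

```python
from typing import Optional


class ToyOperator:
    """Hypothetical stand-in for an operator; not Ray's PhysicalOperator."""

    def __init__(self, num_tasks: int) -> None:
        self._num_tasks = num_tasks
        self._tasks_finished = 0
        self._outputs_seen = 0
        # None means "unknown": no task has finished yet.
        self._estimated_num_output_bundles: Optional[int] = None

    def on_task_finished(self, num_outputs: int) -> None:
        self._tasks_finished += 1
        self._outputs_seen += num_outputs
        # Assumed rule: scale the average outputs per finished task
        # by the total number of tasks.
        avg = self._outputs_seen / self._tasks_finished
        self._estimated_num_output_bundles = round(avg * self._num_tasks)

    def num_outputs_total(self) -> Optional[int]:
        return self._estimated_num_output_bundles


def format_progress(done: int, total: Optional[int]) -> str:
    """Render a counter, showing '?' while the total is unknown."""
    shown = "?" if total is None else str(total)
    return f"{done}/{shown} bundle"


op = ToyOperator(num_tasks=10)
before = format_progress(0, op.num_outputs_total())
op.on_task_finished(num_outputs=100)
after = format_progress(100, op.num_outputs_total())
```

With the reproduction script above, each of the 10 tasks yields 100 batches, so a fixed "one block per task" total would be far too low, while an unknown total avoids showing a misleading number at the start.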

Why are these changes needed?

Related issue number

Closes #46420

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen · 2024-07-15T18:01:23Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

@@ -273,7 +275,7 @@ def num_outputs_total(self) -> int:
            return self._estimated_num_output_bundles
        if len(self.input_dependencies) == 1:
            return self.input_dependencies[0].num_outputs_total()


should we remove the above if statement? this is the code logic that "assumes that each read task produces exactly one block"

also, I'm thinking if we should make this method abstract to force each subclass to have a reasonable implementation.

> should we remove the above if statement? this is the code logic that "assumes that each read task produces exactly one block"

Yeah, let me remove the if block, thanks.

In terms of forcing each operator to implement the method, I think the majority of operators would simply return self._estimated_num_output_bundles if available, otherwise None. As a follow-up PR, I can refactor each operator's logic for calculating self._estimated_num_output_bundles into its num_outputs_total() method (for example, by moving this logic into num_outputs_total()). Let me know if you'd rather I include it in this PR instead.

Actually, I included the necessary changes in this PR; it turned out to be simpler than I originally thought. The default behavior lives in PhysicalOperator.num_outputs_total(), which returns self._estimated_num_output_bundles. Operators like AllToAllOperator, Limit, Union, and Zip have their own implementations, which add some more logic.
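The pattern discussed in this thread (a default that returns the stored estimate, with some operators overriding it) might look roughly like the sketch below. The classes `BaseOp` and `ToyUnion` are hypothetical and only mirror the attribute names quoted in the review; the rule that a union sums its inputs and is unknown if any input is unknown is an assumption for illustration, not a statement about Ray's actual Union operator.

```python
from typing import List, Optional


class BaseOp:
    """Hypothetical base class mirroring the names used in the review."""

    def __init__(self, inputs: Optional[List["BaseOp"]] = None) -> None:
        self.input_dependencies: List["BaseOp"] = list(inputs or [])
        self._estimated_num_output_bundles: Optional[int] = None

    def num_outputs_total(self) -> Optional[int]:
        # Default: subclasses either override this method or keep
        # _estimated_num_output_bundles up to date. None means unknown.
        return self._estimated_num_output_bundles


class ToyUnion(BaseOp):
    """Assumed override: sum of the inputs, unknown if any input is unknown."""

    def num_outputs_total(self) -> Optional[int]:
        total = 0
        for op in self.input_dependencies:
            n = op.num_outputs_total()
            if n is None:
                return None
            total += n
        return total


left, right = BaseOp(), BaseOp()
union = ToyUnion([left, right])
unknown_at_start = union.num_outputs_total()
left._estimated_num_output_bundles = 3
still_unknown = union.num_outputs_total()
right._estimated_num_output_bundles = 4
known = union.num_outputs_total()
```

Propagating None rather than guessing keeps a downstream consumer, such as a progress bar, from displaying a total that is not backed by any finished work.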

Looks much nicer. Can you update OutputSplitter as well?

Added logic to increment self._estimated_num_output_bundles in OutputSplitter._get_next_inner(). Let me know if you have different logic in mind.

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen · 2024-07-15T20:07:59Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

-        if len(self.input_dependencies) == 1:
-            return self.input_dependencies[0].num_outputs_total()
-        raise AttributeError
+        return self._estimated_num_output_bundles


can you add a comment that subclasses should either override this method or update _estimated_num_output_bundles?

bveeramani

Nice

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen · 2024-07-16T17:33:19Z

python/ray/data/_internal/execution/operators/output_splitter.py

@@ -91,6 +91,9 @@ def has_next(self) -> bool:
    def _get_next_inner(self) -> RefBundle:
        output = self._output_queue.popleft()
        self._metrics.on_output_dequeued(output)
+        if self._estimated_num_output_bundles is None:
+            self._estimated_num_output_bundles = 0
+        self._estimated_num_output_bundles += 1


I think it's better to just inherit num_outputs_total from the input op, because the output splitter doesn't change the number of blocks.
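The suggestion here, delegating to the upstream operator instead of counting dequeued bundles, could be sketched as below. `ToySource` and `ToySplitter` are hypothetical classes written for this example; they only illustrate the delegation and are not the OutputSplitter implementation.

```python
from typing import Optional


class ToySource:
    """Hypothetical upstream operator with a possibly unknown estimate."""

    def __init__(self, estimate: Optional[int] = None) -> None:
        self._estimated_num_output_bundles = estimate

    def num_outputs_total(self) -> Optional[int]:
        return self._estimated_num_output_bundles


class ToySplitter:
    """Hypothetical pass-through operator: it routes bundles to consumers
    without changing how many there are, so it reuses the upstream total."""

    def __init__(self, input_op: ToySource) -> None:
        self.input_dependencies = [input_op]

    def num_outputs_total(self) -> Optional[int]:
        return self.input_dependencies[0].num_outputs_total()


source = ToySource()
splitter = ToySplitter(source)
first = splitter.num_outputs_total()
source._estimated_num_output_bundles = 20
second = splitter.num_outputs_total()
```

Compared with incrementing a counter on every dequeue, delegation gives the splitter a total as soon as the upstream operator has one, and the two can never disagree.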

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee added 3 commits July 12, 2024 13:03

allow unknown progbar total and op output estimates

943cd1f

Signed-off-by: Scott Lee <sjl@anyscale.com>

add space to progress bar unit

d4d81b4

Signed-off-by: Scott Lee <sjl@anyscale.com>

check for no estimate

f54d3ce

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee added the go add ONLY when ready to merge, run all tests label Jul 12, 2024

fix tests

28bd950

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee marked this pull request as ready for review July 12, 2024 23:36

scottjlee requested review from ericl, scv119, c21, amogkam, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners July 12, 2024 23:36

scottjlee assigned raulchen and bveeramani Jul 12, 2024

raulchen reviewed Jul 15, 2024


scottjlee added 3 commits July 15, 2024 11:43

remove logic to take upstream num outputs

18f6cb9

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0712-read-prog-estimate

43db890

Signed-off-by: Scott Lee <sjl@anyscale.com>

clean up

c6ab156

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen reviewed Jul 15, 2024


bveeramani approved these changes Jul 15, 2024


update output splitter

d8cf1ec

Signed-off-by: Scott Lee <sjl@anyscale.com>

anyscalesam added data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 16, 2024

raulchen reviewed Jul 16, 2024


scottjlee added 3 commits July 16, 2024 11:12

inherit num outputs total

bde5f8a

Signed-off-by: Scott Lee <sjl@anyscale.com>

update tests

dc5384b

Signed-off-by: Scott Lee <sjl@anyscale.com>

simplify

f28b50b

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee requested a review from raulchen July 16, 2024 21:00

raulchen approved these changes Jul 17, 2024


raulchen merged commit abbc6f7 into ray-project:master Jul 17, 2024
5 checks passed
