Skip to content

Conversation

@daiping8
Copy link
Contributor

@daiping8 daiping8 commented Aug 28, 2025

Why are these changes needed?

Implement the 4th proposal in issue #55052. Add total input/output row counts to the information printed after calling Dataset.stats(), thereby enriching the operator-level metrics.

Related issue number

Closes the 4th proposal in issue #55052

Description

The main modification is to obtain the number of input rows for each Operator. The settings are as follows.

For sub-operators: Inputs are inherited based on the order in the current list

  • The first sub-operator: total output from all parent nodes
  • Subsequent sub-operators: output of the previous sub-operator

For an individual operator: Total output from all parent nodes

In addition, the following modifications have been made:

  • Updated operator stats display to include input and output row nums.
  • Adjusted tests to validate new throughput information.

Verification

We can use the following code to verify the modification of Operator throughput information.

import ray

data = [{"id": i, "value": i * 1.5, "category": i % 5} for i in range(500)]
ds = ray.data.from_items(data)

ds = ds.map(lambda x: {**x, "value_squared": x["value"] ** 2})
ds = ds.filter(lambda x: x["value_squared"] > 100)
ds = ds.map(lambda x: {**x, "is_large": x["value"] > 50})
ds = ds.flat_map(lambda x: [x, {**x, "value": x["value"] * 0.5}])
ds = ds.filter(lambda x: x["category"] in [1, 3])
ds = ds.sort("id", descending=True)
ds = ds.limit(10)
ds = ds.map(lambda x: {**x, "is_top3": x["id"] >= x["id"] if x["id"] else False})

print(ds.materialize().stats())
image

It can be seen that the corresponding content has been added to the Operator throughput information.

This is the complete stats() output:
stats.log

Potential Issue

I observed that the current version (Ray 2.49) does not output stats information for the union operator, so the Total input num rows for its child operators is reported as 0. This could be a notable limitation to be aware of.

The test code is as follows:

import ray

data = [{"id": i, "value": i * 1.5, "category": i % 5} for i in range(500)]
ds = ray.data.from_items(data)

ds2 = ray.data.from_items([{"id": i + 100, "value": i} for i in range(50)])

ds = ds.union(ds2) 
ds = ds.map_batches(lambda b: b, batch_size=10) 

print(ds.materialize().stats())
image

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@daiping8 daiping8 force-pushed the explain_api branch 2 times, most recently from a1af744 to bdbd676 Compare August 29, 2025 03:18
- Introduced `total_input_num_rows` to `OperatorStatsSummary` for enhanced metrics.
- Updated operator stats display to include input and output row counts.
- Adjusted tests to validate new throughput information.

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
…in tests.

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@daiping8 daiping8 marked this pull request as ready for review September 1, 2025 01:21
@daiping8 daiping8 requested a review from a team as a code owner September 1, 2025 01:21
@richardliaw
Copy link
Contributor

Should we just only print output row count per operator? Then we wouldn't need to do extra accounting.

@ray-gardener ray-gardener bot added docs An issue or change related to documentation data Ray Data-related issues community-contribution Contributed by the community labels Sep 1, 2025
@daiping8
Copy link
Contributor Author

daiping8 commented Sep 1, 2025

Should we just only print output row count per operator? Then we wouldn't need to do extra accounting.

@richardliaw, I believe it is necessary to print both the input and output row counts for each operator. This way, the stats information of individual operators can be extracted for analysis. For instance, regarding a filter operator, we can check whether the filtering operation has truly met the business requirements by comparing its input and output row counts.

If we only print the output row count, it would go against our original intention of providing more sufficiently granular information in stats() (see the 4th proposal in issue #55052).

@richardliaw
Copy link
Contributor

richardliaw commented Sep 4, 2025

Hi @daiping8, sorry for the slow response - but the input count will always be the same as the previous output count right?

I guess it's fine, I can see how this might be nicer as a user experience.

… accordingly.

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@daiping8 daiping8 changed the title [Data] Add total input/output row counts of Dataset or Operator in the output of Dataset.stats() [Data] Add total input/output row counts of Operator in the output of Dataset.stats() Sep 5, 2025
@richardliaw
Copy link
Contributor

Also could you add a test for actual correctness?

@daiping8
Copy link
Contributor Author

daiping8 commented Sep 5, 2025

@richardliaw I have added a test for actual correctness. Thank you!

@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Sep 6, 2025
@richardliaw
Copy link
Contributor

hmm @daiping8 seems like tests are failing:


[2025-09-05T12:15:13Z] TIMEOUT: //python/ray/data:test_stats (Summary)
--
  | [2025-09-05T12:15:13Z]       /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_stats/test.log
  | [2025-09-05T12:15:13Z]       /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_stats/test_attempts/attempt_1.log


Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@daiping8
Copy link
Contributor Author

daiping8 commented Sep 6, 2025

hmm @daiping8 seems like tests are failing:


[2025-09-05T12:15:13Z] TIMEOUT: //python/ray/data:test_stats (Summary)
--
  | [2025-09-05T12:15:13Z]       /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_stats/test.log
  | [2025-09-05T12:15:13Z]       /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_stats/test_attempts/attempt_1.log

ds = ds1.join(ds2, join_type="left_outer", num_partitions=2)

There was a small mistake here. The number of num_partitions was too high, causing a mismatch with ray.init(num_cpus=2), which led the test to wait indefinitely for resources and eventually time out. This has been fixed. Thank you for your attention. @richardliaw

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@richardliaw richardliaw merged commit 1fa52d2 into ray-project:master Sep 6, 2025
5 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
… Dataset.stats() (ray-project#56040)

Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
… Dataset.stats() (ray-project#56040)

Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
… Dataset.stats() (ray-project#56040)

Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
… Dataset.stats() (ray-project#56040)

Signed-off-by: zac <zac@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
… Dataset.stats() (#56040)

Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues docs An issue or change related to documentation go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants