-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Data] Add total input/output row counts of Operator in the output of Dataset.stats() #56040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a1af744 to
bdbd676
Compare
- Introduced `total_input_num_rows` to `OperatorStatsSummary` for enhanced metrics. - Updated operator stats display to include input and output row counts. - Adjusted tests to validate new throughput information. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
…in tests. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
|
Should we just only print output row count per operator? Then we wouldn't need to do extra accounting. |
@richardliaw, I believe it is necessary to print both the input and output row counts for each operator. This way, the stats information of individual operators can be extracted for analysis. For instance, regarding a filter operator, we can check whether the filtering operation has truly met the business requirements by comparing its input and output row counts. If we only print the output row count, it would go against our original intention of providing more sufficiently granular information in stats() (see the 4th proposal in issue #55052). |
|
I guess it's fine, I can see how this might be nicer as a user experience. |
… accordingly. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
|
Also could you add a test for actual correctness? |
…ounts. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
|
@richardliaw I have added a test for actual correctness. Thank you! |
|
hmm @daiping8 seems like tests are failing: |
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
ray/python/ray/data/tests/test_stats.py Line 1744 in dc9d4f9
There was a small mistake here. The number of |
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
… Dataset.stats() (ray-project#56040) Signed-off-by: sampan <sampan@anyscale.com>
… Dataset.stats() (ray-project#56040) Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
… Dataset.stats() (ray-project#56040) Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
… Dataset.stats() (ray-project#56040) Signed-off-by: zac <zac@anyscale.com>
… Dataset.stats() (#56040) Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Why are these changes needed?
Implement the 4th proposal in issue #55052. Add total input/output row counts to the information printed after calling
Dataset.stats(), thereby enriching the operator-level metrics.Related issue number
Closes the 4th proposal in issue #55052
Description
The main modification is to obtain the number of input rows for each Operator. The settings are as follows.
For sub-operators: Inputs are inherited based on the order in the current list
For an individual operator: Total output from all parent nodes
In addition, the following modifications have been made:
Verification
We can use the following code to verify the modification of Operator throughput information.
It can be seen that the corresponding content has been added to the Operator throughput information.
This is the complete
stats()output:stats.log
Potential Issue
I observed that the current version (Ray 2.49) does not output stats information for the
unionoperator, so theTotal input num rowsfor its child operators is reported as 0. This could be a notable limitation to be aware of.The test code is as follows:
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.