[Data] - Add to_pyarrow() to Expr #57271
Conversation
Signed-off-by: Goutam V. <goutam@anyscale.com>
Code Review
This pull request introduces a to_pyarrow() method to the Expr class, enabling the conversion of Ray Data expressions into PyArrow compute expressions. This is a valuable addition for integrating with PyArrow's execution engine. The implementation correctly handles various expression types by recursively building the PyArrow expression tree. My feedback includes a minor suggestion to refactor the local imports for better code clarity and to avoid redundancy.
python/ray/data/expressions.py (Outdated)
```
from ray.data._expression_evaluator import _ARROW_EXPR_OPS_MAP

if self.op in _ARROW_EXPR_OPS_MAP:
    return _ARROW_EXPR_OPS_MAP[self.op](left, right)
else:
    raise ValueError(f"Unsupported binary operation for PyArrow: {self.op}")

elif isinstance(self, UnaryExpr):
    operand = self.operand.to_pyarrow()

    from ray.data._expression_evaluator import _ARROW_EXPR_OPS_MAP
```
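And a hedged usage sketch (`col`/`lit` come from `ray.data.expressions`; the file name and exact printed form are made up):

```
import pyarrow.dataset as pds
from ray.data.expressions import col, lit

# Build a Ray Data expression, convert it, and use it as a PyArrow filter.
expr = (col("age") > lit(21)) & (col("country") == lit("US"))
pa_expr = expr.to_pyarrow()  # -> a pyarrow.compute.Expression

dataset = pds.dataset("users.parquet")  # hypothetical file
table = dataset.to_table(filter=pa_expr)
```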
Signed-off-by: Goutam <goutam@anyscale.com>
* [Data] - Add to_pyarrow() to Expr (#57271)
## Why are these changes needed?
Adds a converter from Ray Data `Expr` to PyArrow compute expressions.
---------
Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
* [Data] Fixing prefetch loop to avoid being blocked on the block being fetched (#57613)
## Why are these changes needed?
1. Fixes the prefetch loop to avoid blocking on the next block being
fetched
2. Adds missing metrics for `BatchIterator`
---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
* [serve] Prevent rechunking during streaming tests (#57592)
## Why are these changes needed?
By adding a delay, we can prevent middle layers in the routing path from
coalescing separate chunks together, since we assert each chunk arrives
independently.
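The pattern is roughly the following (a hedged sketch, not the exact test code):

```
import asyncio

# Yield chunks with a small delay so middle layers flush each one
# separately instead of coalescing them into a single read.
async def stream_chunks():
    for chunk in [b"one", b"two", b"three"]:
        yield chunk
        await asyncio.sleep(0.1)
```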
---------
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
* [release] Upload & download blobs from Azure (#57540)
This supports the following functionality to enable Azure on release
test pipeline:
- Upload working directory to Azure
- Upload metrics/results json file to Azure
- Download files (metrics/results json file) from Azure
- Helper function to parse an ABFSS URI into account, container, and path (a sketch follows this list)
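A hedged sketch of such a parser (the actual helper's name and error handling may differ):

```
from urllib.parse import urlparse

# abfss://<container>@<account>.dfs.core.windows.net/<path>
def parse_abfss_uri(uri):
    parsed = urlparse(uri)
    if parsed.scheme != "abfss":
        raise ValueError(f"not an ABFSS URI: {uri}")
    container, _, host = parsed.netloc.partition("@")
    account = host.split(".")[0]
    return account, container, parsed.path.lstrip("/")

# parse_abfss_uri("abfss://data@myacct.dfs.core.windows.net/release/results.json")
# -> ("myacct", "data", "release/results.json")
```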
This PR is broken down from
https://github.com/ray-project/ray/pull/57252 which also includes a
sample hello world test on Azure to test e2e. Proof that it works:
https://buildkite.com/ray-project/release/builds/62278
---------
Signed-off-by: kevin <kevin@anyscale.com>
* [Data] Avoid `iter_batches` inference benchmark (#57618)
## Why are these changes needed?
Subject
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
* Revert "Revert "[core][metric] Redefine gcs STATS using metric interface"" (#57255)
Reverts ray-project/ray#57248
Please review
https://github.com/ray-project/ray/pull/57255/commits/227b84139f2d31187b1bffd3764da4528424e869
which is the fix for a previously accepted PR.
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [core] Clean up worker/raylet client pools on node death (#56613)
Signed-off-by: joshlee <joshlee@anyscale.com>
* [Data] Updating `streaming_split` tests & benchmarks (#57569)
## Why are these changes needed?
1. Updates `streaming_split` tests to increase coverage
2. Updates release tests to cover `equal=False` cases
---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
* [llm][ci] Enable Serve LLM doc tests (#57619)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
* [core][autoscaler][v2] fix: num_workers_dict calculation by observation (#57539)
Signed-off-by: Rueian <rueiancsie@gmail.com>
* [Doc][Core] Update accelerator-type Doc Regarding CPU-only Nodes (#57596)
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
* Add perf metrics for 2.50.0 (#57585)
```
REGRESSION 16.07%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 5.261194854317881 to 4.415850247347108 in microbenchmark.json
REGRESSION 11.29%: placement_group_create/removal (THROUGHPUT) regresses from 751.064903521573 to 666.2773993932936 in microbenchmark.json
REGRESSION 11.14%: single_client_tasks_sync (THROUGHPUT) regresses from 900.96738867954 to 800.5633840543425 in microbenchmark.json
REGRESSION 10.14%: actors_per_second (THROUGHPUT) regresses from 566.4200586217125 to 508.9808896382363 in benchmarks/many_actors.json
REGRESSION 8.91%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1374.047824125402 to 1251.6025859481733 in microbenchmark.json
REGRESSION 8.70%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 9176.686326011131 to 8378.589542828342 in microbenchmark.json
REGRESSION 6.79%: 1_n_async_actor_calls_async (THROUGHPUT) regresses from 6964.257909926722 to 6491.439808045807 in microbenchmark.json
REGRESSION 6.68%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 1826.440590474467 to 1704.5035425495187 in microbenchmark.json
REGRESSION 6.62%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.142098493341212 to 12.272053704608084 in microbenchmark.json
REGRESSION 6.61%: n_n_actor_calls_async (THROUGHPUT) regresses from 24808.730524179864 to 23168.372784365154 in microbenchmark.json
REGRESSION 6.19%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4795.051007052156 to 4498.3519827438895 in microbenchmark.json
REGRESSION 6.05%: 1_1_actor_calls_async (THROUGHPUT) regresses from 7925.658042658907 to 7445.809146193413 in microbenchmark.json
REGRESSION 5.20%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 21602.16598513169 to 20479.183697143773 in microbenchmark.json
REGRESSION 5.18%: single_client_tasks_async (THROUGHPUT) regresses from 7418.67591750316 to 7034.736389002367 in microbenchmark.json
REGRESSION 5.16%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.350152593657818 to 19.30103208209274 in microbenchmark.json
REGRESSION 5.11%: tasks_per_second (THROUGHPUT) regresses from 388.36439061844453 to 368.5098005212305 in benchmarks/many_tasks.json
REGRESSION 4.27%: pgs_per_second (THROUGHPUT) regresses from 13.028153672527967 to 12.47149444972938 in benchmarks/many_pgs.json
REGRESSION 2.33%: single_client_wait_1k_refs (THROUGHPUT) regresses from 4.8129125825624035 to 4.700920788730696 in microbenchmark.json
REGRESSION 1.88%: client__put_gigabytes (THROUGHPUT) regresses from 0.10294244610916167 to 0.10100883378233687 in microbenchmark.json
REGRESSION 1.17%: 1_n_actor_calls_async (THROUGHPUT) regresses from 7563.474741840271 to 7474.798821945149 in microbenchmark.json
REGRESSION 46.59%: stage_3_creation_time (LATENCY) regresses from 1.8725192546844482 to 2.7449533939361572 in stress_tests/stress_test_many_tasks.json
REGRESSION 41.39%: dashboard_p99_latency_ms (LATENCY) regresses from 35.162 to 49.716 in benchmarks/many_nodes.json
REGRESSION 23.68%: dashboard_p99_latency_ms (LATENCY) regresses from 188.103 to 232.641 in benchmarks/many_pgs.json
REGRESSION 20.72%: dashboard_p99_latency_ms (LATENCY) regresses from 3446.344 to 4160.517 in benchmarks/many_actors.json
REGRESSION 15.85%: dashboard_p50_latency_ms (LATENCY) regresses from 4.26 to 4.935 in benchmarks/many_pgs.json
REGRESSION 15.44%: dashboard_p50_latency_ms (LATENCY) regresses from 5.544 to 6.4 in benchmarks/many_tasks.json
REGRESSION 12.31%: avg_iteration_time (LATENCY) regresses from 1.2971700072288512 to 1.4568077945709228 in stress_tests/stress_test_dead_actors.json
REGRESSION 11.15%: 1000000_queued_time (LATENCY) regresses from 179.146127773 to 199.115312395 in scalability/single_node.json
REGRESSION 10.11%: dashboard_p95_latency_ms (LATENCY) regresses from 2612.102 to 2876.107 in benchmarks/many_actors.json
REGRESSION 8.41%: dashboard_p50_latency_ms (LATENCY) regresses from 10.833 to 11.744 in benchmarks/many_actors.json
REGRESSION 8.25%: stage_1_avg_iteration_time (LATENCY) regresses from 12.93162693977356 to 13.99826169013977 in stress_tests/stress_test_many_tasks.json
REGRESSION 6.18%: stage_2_avg_iteration_time (LATENCY) regresses from 33.983641386032104 to 36.08304100036621 in stress_tests/stress_test_many_tasks.json
REGRESSION 6.07%: 3000_returns_time (LATENCY) regresses from 5.790547841000006 to 6.1422604579999955 in scalability/single_node.json
REGRESSION 4.99%: 10000_args_time (LATENCY) regresses from 19.077259766999987 to 20.028864411 in scalability/single_node.json
REGRESSION 4.97%: dashboard_p95_latency_ms (LATENCY) regresses from 10.799 to 11.336 in benchmarks/many_pgs.json
REGRESSION 4.83%: dashboard_p95_latency_ms (LATENCY) regresses from 13.338 to 13.982 in benchmarks/many_nodes.json
REGRESSION 4.73%: 10000_get_time (LATENCY) regresses from 24.000713915999995 to 25.136106761999997 in scalability/single_node.json
REGRESSION 4.40%: stage_3_time (LATENCY) regresses from 1821.4706330299377 to 1901.6145586967468 in stress_tests/stress_test_many_tasks.json
REGRESSION 2.90%: dashboard_p50_latency_ms (LATENCY) regresses from 6.935 to 7.136 in benchmarks/many_nodes.json
REGRESSION 2.74%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 13.41017694899999 to 13.777409734000003 in scalability/object_store.json
REGRESSION 1.89%: stage_0_time (LATENCY) regresses from 7.735846281051636 to 7.882433891296387 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.62%: avg_pg_remove_time_ms (LATENCY) regresses from 1.396923734234321 to 1.419495533032749 in stress_tests/stress_test_placement_group.json
REGRESSION 1.34%: stage_4_spread (LATENCY) regresses from 0.5580154959703073 to 0.565494858742622 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.09%: avg_pg_create_time_ms (LATENCY) regresses from 1.5636188018035782 to 1.5650917102100173 in stress_tests/stress_test_placement_group.json
```
Signed-off-by: kevin <kevin@anyscale.com>
* [release] Add depset filename as part of custom image hash (#56952)
This ensures that images sharing the same post_build_script name but
using different depset files get unique image tags.
---------
Signed-off-by: kevin <kevin@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
* [core] Change config defaults to enable io_service metrics (#57614)
After some experimentation, the main culprit for the performance
degradation is actually the lag probe being too aggressive. The
previous default lag-probe interval of 250ms caused as much as a 20%
degradation in performance when combined with enabling io_context
metrics. Setting the default above 60s seems to mitigate the issue. To
reach this conclusion, we tested the following:
Trial 1: ~400 actors/s <-- way too slow
- RAY_emit_main_serivce_metrics = 1
Trial 2: ~500+ actors/s <-- where we want to be
- RAY_emit_main_serivce_metrics = -1
Trial 3: ~500+ actors/s
- RAY_emit_main_serivce_metrics = 1
- RAY_io_context_event_loop_lag_collection_interval_ms = -1 <-- disabled
Trial 4: ~500+ actors/s <-- bingo!
- RAY_emit_main_serivce_metrics = 1
- RAY_io_context_event_loop_lag_collection_interval_ms = 6000
The default value of 250ms, combined with the increased use of lag
probes when the metrics are enabled, causes enough degradation to be
noticeable. Increasing the interval sufficiently avoids the slowdown
while keeping our metrics.
---------
Signed-off-by: zac <zac@anyscale.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
* [Data][LLM] Add Video Processor and vllm example for ray data (#56785)
## Summary
This PR introduces a production-grade video preprocessing stage for Ray
LLM batch pipelines. It parses video sources from OpenAI-style chat
messages, resolves sources with stream-first I/O and optional caching,
decodes via PyAV (FFmpeg), samples frames by fps or fixed count, and
outputs frames (PIL or NumPy) with per-video metadata. It aligns with
the image stage’s conventions while addressing video-specific needs.
## Motivation
Many multimodal and VLM workloads require a reliable, performant, and
testable video preprocessing step with:
- Explicit and deterministic frame sampling
- Stream-first networking with optional caching
- Robust error reporting and retries
- Consistent outputs for downstream model inference (PIL/NumPy)
## Key Features
- Source extraction: supports “video” and “video_url” in OpenAI chat
message content.
- Source resolution: HTTP(S), data URI, and local file; cache_mode:
memory | disk | auto.
- Sampling: fps-based or num_frames-based; num_frames deterministically
takes the first N decoded frames; optional max_sampled_frames cap;
bounded target generation (see the sketch after this list).
- Outputs: PIL or NumPy; channels_first for NumPy; optional PIL
preprocessing (resize/crop/convert) with NumPy path routed through PIL
for consistency.
- Reliability: decode in a thread, async orchestration with concurrency
limits, retries with backoff, strict “no frames → failure,” enriched
error metadata.
- Safety caps: bounded target list and per-source decode frame count to
avoid pathological behavior on malformed streams.
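A hedged sketch of the two sampling modes (semantics inferred from the description above, not the stage's exact code):

```
def plan_sample_count(num_decoded_frames, stream_fps,
                      fps=None, num_frames=None, max_sampled_frames=None):
    if num_frames is not None:
        # num_frames mode: deterministically take the first N decoded frames.
        count = min(num_frames, num_decoded_frames)
    else:
        # fps mode: keep roughly every (stream_fps / fps)-th decoded frame.
        step = max(1, round(stream_fps / fps))
        count = len(range(0, num_decoded_frames, step))
    if max_sampled_frames is not None:
        count = min(count, max_sampled_frames)  # safety cap on targets
    return count
```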
## Design and Implementation Notes
- HTTP extraction lives in a shared `_util.py` (HTTPConnection) to:
- Centralize networking behavior (timeouts, retries, chunked download,
file download)
- Improve testability (single patch point across stages)
- Ensure consistent semantics between image and video stages
- Avoid coupling the stage to a dataset/planner runtime
- Optional dependencies (PyAV, Pillow, NumPy) are imported lazily with
clear error messages.
- Decode is CPU-bound; it runs in a thread, while async orchestration
ensures concurrency limits and order-preserving emission.
Why we cannot directly reuse the download API
(https://github.com/ray-project/ray/pull/55824):
- That commit introduces a Ray Data download expression at the
planner/op/expression layer for high-throughput dataset ingestion and
better block-size estimation. It is ideal for offline ETL and bulk
workloads.
- This PR targets an online batch inference stage implemented as a UDF
with asyncio and a lightweight HTTP utility, optimized for low latency,
per-request ordering, and controlled concurrency within a batch stage.
- Directly embedding the planner path would:
- Introduce scheduling and planning overhead unsuited for low-latency
UDFs
- Complicate execution semantics (order preservation, per-request
grouping)
- Increase dependency surface (Data planner) inside LLM batch stages
- Recommended composition: use Ray Data’s download expression offline to
materialize bytes/local files; then feed those paths/data into this
video stage for decoding/processing.
## Usage
- Package entrypoints:
- `PrepareVideoStage` / `PrepareVideoUDF` in
`ray.llm._internal.batch.stages.prepare_video_stage`
### Example 1: Use the UDF in a batch stage
- Input rows must contain `messages` in OpenAI chat format with video
entries (“video” or “video_url”).
```
import asyncio

from ray.llm._internal.batch.stages.prepare_video_stage import PrepareVideoUDF

udf = PrepareVideoUDF(
    data_column="__data",
    expected_input_keys=["messages"],
    sampling={"num_frames": 4},     # or {"fps": 3.0}
    output_format="numpy",          # or "pil"
    channels_first=True,            # NumPy-only
    cache_mode="auto",              # "memory" | "disk" | "auto"
    cache_dir="/tmp/video-cache",   # optional for disk/auto
)

batch = {
    "__data": [
        {
            "messages": [
                {
                    "content": [
                        {"type": "video", "video": "https://host/video.mp4"},
                        {"type": "video_url", "video_url": {"url": "file:///data/v2.mp4"}},
                    ]
                }
            ]
        }
    ]
}

# Consume the async UDF
async def run():
    outs = []
    async for out in udf(batch):
        outs.append(out["__data"][0])
        # out["video"] -> List[List[Frame]]
        # out["video_meta"] -> List[Dict] per video (size, timestamps, num_frames, failed, etc.)
    return outs

results = asyncio.run(run())
```
We can run the corresponding tests directly:
``` text
pytest test_prepare_video_stage.py -v
============================================== test session starts ===============================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- /ray-workspace/ray/python/requirements/llm/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /ray-workspace/ray
configfile: pytest.ini
plugins: anyio-4.10.0, asyncio-1.2.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 19 items
test_prepare_video_stage.py::test_udf_extract_and_process_basic PASSED [ 5%]
test_prepare_video_stage.py::test_num_frames_sampling_exact PASSED [ 10%]
test_prepare_video_stage.py::test_data_uri_handling PASSED [ 15%]
test_prepare_video_stage.py::test_local_file_path_handling PASSED [ 21%]
test_prepare_video_stage.py::test_auto_cache_to_disk_when_num_frames PASSED [ 26%]
test_prepare_video_stage.py::test_av_missing_import_error_metadata PASSED [ 31%]
test_prepare_video_stage.py::test_multiple_videos_order_preserved PASSED [ 36%]
test_prepare_video_stage.py::test_preprocess_convert_numpy_consistency PASSED [ 42%]
test_prepare_video_stage.py::test_bytesio_format_guess_fallback PASSED [ 47%]
test_prepare_video_stage.py::test_retries_success_and_counts PASSED [ 52%]
test_prepare_video_stage.py::test_non_retriable_no_retry PASSED [ 57%]
test_prepare_video_stage.py::test_target_cap_limits_frames PASSED [ 63%]
test_prepare_video_stage.py::test_numpy_output_channels_first PASSED [ 68%]
test_prepare_video_stage.py::test_strict_no_fallback_when_no_frames PASSED [ 73%]
test_prepare_video_stage.py::test_e2e_with_pyav_synth PASSED [ 78%]
test_prepare_video_stage.py::test_e2e_num_frames_pil PASSED [ 84%]
test_prepare_video_stage.py::test_e2e_fps_sampling PASSED [ 89%]
test_prepare_video_stage.py::test_e2e_preprocess_resize_numpy_channels_first PASSED [ 94%]
test_prepare_video_stage.py::test_e2e_max_sampled_frames_cap PASSED [100%]
=============================================== 19 passed in 1.77s ===============================================
```
### Example 2: Multimodal inference with vLLM
- Sample a few frames, preprocess to PIL/NumPy, then feed frames as
images to your multimodal prompt (one common pattern is to select top-k
frames and attach them as image inputs).
```
import asyncio
import os
import tempfile

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

from ray.llm._internal.batch.stages.prepare_video_stage import VideoProcessor

async def process_video_with_vlm():
    # 1. Extract video frames
    vp = VideoProcessor(
        sampling={"num_frames": 4},
        output_format="pil",
        preprocess={"resize": {"size": [384, 384]}, "convert": "RGB"},
    )
    frames_and_meta = await vp.process(["./2-20.mp4"])
    frames = frames_and_meta[0]["frames"]
    print(f"Extracted {len(frames)} frames")

    # 2. Save frames to temporary files
    with tempfile.TemporaryDirectory() as tmp_dir:
        image_paths = []
        for i, frame in enumerate(frames):
            temp_path = os.path.join(tmp_dir, f"frame_{i}.jpg")
            frame.save(temp_path)
            image_paths.append(temp_path)

        # 3. Initialize model
        MODEL_PATH = "/vllm-workspace/tmp/vlm"
        llm = LLM(
            model=MODEL_PATH,
            limit_mm_per_prompt={"image": 10},
            trust_remote_code=True,
            enforce_eager=True,
        )

        # 4. Construct messages
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    *[{"type": "image", "image": path} for path in image_paths],
                    {"type": "text", "text": "Summarize the content of this video"},
                ],
            },
        ]

        # 5. Process input
        processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
        prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, _ = process_vision_info(messages)

        # 6. Generate results
        sampling_params = SamplingParams(temperature=0.1, top_p=0.001, max_tokens=512)
        outputs = llm.generate(
            [{"prompt": prompt, "multi_modal_data": {"image": image_inputs}}],
            sampling_params=sampling_params,
        )
        print("Generated result:", outputs[0].outputs[0].text)

asyncio.run(process_video_with_vlm())
```
Notes:
- If your vLLM interface expects byte-encoded images, convert PIL frames
to bytes (e.g., PNG/JPEG) before passing.
- If it expects NumPy tensors, use `output_format="numpy"` with
`channels_first` as needed.
## Dependencies
- Runtime (optional-by-use): `av` (PyAV), `pillow`, `numpy`.
- Tests: require the above; E2E tests synthesize MP4 with PyAV and
validate decode/processing.
## Backward Compatibility
- Additive functionality; does not break existing stages or APIs.
## Testing
- Unit tests cover:
- fps/num_frames sampling, data URI, local path, auto cache to disk
- Missing dependency metadata, order preservation, NumPy output/channel
ordering
- BytesIO format guess fallback, retries and non-retriable paths,
sampling caps
- E2E tests (default enabled) synthesize MP4s with PyAV and validate
end-to-end behavior.
## Related Issues
Closes #56424, #56767
cc @GuyStone @richardliaw @nrghosh
---
> [!NOTE]
> Introduce a production-ready video preprocessing stage (with sampling,
caching, and metadata), centralize HTTP utilities, add env-based
tunables, and refactor image stage to use the shared HTTP client with
comprehensive tests.
>
> - **LLM Batch Env Tunables**:
> - Add `python/ray/llm/_internal/batch/envs.py` with lazy env getters:
`RAY_LLM_BATCH_MAX_TARGETS`, `RAY_LLM_BATCH_MAX_DECODE_FRAMES`.
> - **Shared Utilities**:
> - New `python/ray/llm/_internal/batch/stages/_util.py` providing
`HTTPConnection` (sync/async GET, bytes/text/json, chunked download,
file download).
> - **Image Stage Refactor**:
> - `prepare_image_stage.py`: replace inline `HTTPConnection` with
`_util.HTTPConnection`; adjust tests to patch new path.
> - **Video Processing Stage (example)**:
> - Add `video_processor.py` implementing `VideoProcessor`,
`PrepareVideoUDF`, `PrepareVideoStage` with HTTP/data/local source
resolution, optional disk/memory cache, PyAV decode, fps/num_frames
sampling, PIL/NumPy outputs, preprocessing, retries/backoff, and
metadata.
> - Add CLI `main.py` and README for usage.
> - **Tests**:
> - New `test_video_processor.py` covering sampling modes, data
URI/local/http sources, caching, retries, numpy/PIL outputs,
preprocessing, caps, and E2E with PyAV.
> - Update `test_prepare_image_stage.py` to patch
`_util.HTTPConnection`.
>
---------
Signed-off-by: PAN <1162953505@qq.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [Data] Retain bits of paths with semicolons in them (#57231)
## Why are these changes needed?
See #57226. (I got my env working; it was on Python 3.13 by accident.)
## Related issue number
Solves #57226
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
* Fix test_api.py::test_max_constructor_retry_count failing for windows (#57541)
`test_api.py::test_max_constructor_retry_count` was failing on Windows.
I tried expanding the timeout on wait_on_condition in the last part of
the test to 20s-40s and added a debug statement to check how far the
counter increments. It goes up by a varying amount, but I only observed
9-12, never reaching 13.
Did some drilling, and it seems our Ray actor worker processes are
forked on Linux, while Windows uses `CreateProcessA`, which builds each
process from scratch rather than forking. This difference causes the
count to grow more slowly on Windows, IIUC. The Windows call to
`CreateProcessA` is available
[here](https://github.com/ray-project/ray/blob/1296dc4699a3c1681fe3de6dd9f63af51d287582/src/ray/util/process.cc#L171),
and the forking path for Linux is in the same file.
Hence, the solution is to lower the test's resource requirements by
launching 3 replicas instead of 4 and attempting fewer retries, to
satisfy both Linux and Windows.
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
* [1/n] add application level autoscaling policy in schema (#57535)
part 1 of https://github.com/ray-project/ray/pull/56149
1. move `_serialized_policy_def` into `AutoscalingPolicy` from
`AutoscalingConfig`. We need this in order to reuse `AutoscalingPolicy`
for application-level autoscaling.
2. Make `autoscaling_policy` a top-level config in
`ServeApplicationSchema`.
---------
Signed-off-by: abrar <abrar@anyscale.com>
* Introduce sub-tabs with full Grafana dashboard embeds on Metrics tab (#57561)
* Add ray docs for custom autoscaling in serve (#57600)
1. Add docs under advanced autoscaling.
2. Promote `autoscaling_context` to a public API.
---------
Signed-off-by: abrar <abrar@anyscale.com>
* [serve][llm] Enable engine metrics by default (#57615)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
* [Data] - Add _source_paths to filedatasources (#57574)
## Why are these changes needed?
Adds `_unresolved_paths` to file-based datasources for lineage-tracking
capabilities (illustrated below).
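Illustratively (a minimal sketch; only the attribute name comes from the description above):

```
class ExampleFileDatasource:
    def __init__(self, paths):
        # Keep the user-supplied, unresolved paths so lineage tooling can
        # recover where a dataset originally came from.
        self._unresolved_paths = list(paths)
```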
---------
Signed-off-by: Goutam <goutam@anyscale.com>
* Add task_execution_time histogram (#56355)
## Why are these changes needed?
Adds histograms for block completion time, task completion time, and
block size.
Note: to compute the block completion time histogram we must
approximate, because we only measure task completion time. To
approximate, we assume each block took an equal amount of time within a
task and split the task's duration among its blocks, as sketched below.
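A hedged sketch of that approximation (bucket boundaries here are invented for illustration):

```
import bisect

BUCKETS_S = [0.1, 1, 5, 30, 60, 300]  # assumed histogram bucket boundaries

def observe_blocks(histogram_counts, task_duration_s, num_blocks):
    # Credit each block an equal share of the task's duration, then
    # count it into the matching histogram bucket.
    per_block_s = task_duration_s / num_blocks
    for _ in range(num_blocks):
        idx = bisect.bisect_left(BUCKETS_S, per_block_s)
        histogram_counts[idx] += 1

counts = [0] * (len(BUCKETS_S) + 1)
observe_blocks(counts, task_duration_s=10.0, num_blocks=4)  # 2.5s per block
```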
---
> [!NOTE]
> Introduce histogram metrics for task/block completion time and block
sizes, wire them through export, and add Grafana bar chart panels to
visualize these distributions.
>
> - **Ray Data metrics**:
> - Add histogram metrics in `OpRuntimeMetrics`:
> - `task_completion_time`, `block_completion_time`, `block_size_bytes`,
`block_size_rows` with predefined bucket boundaries and thread-safe
updates.
> - New helpers `find_bucket_index` and bucket constants; support reset
via `as_dict(..., reset_histogram_metrics=True)`.
> - **Stats pipeline**:
> - `_StatsManager` now snapshots and resets histogram metrics before
sending to `_StatsActor` (both periodic and force updates).
> - `_StatsActor.update_execution_metrics` handles histogram lists by
observing per-bucket values.
> - **Dashboard/Grafana**:
> - Add `HISTOGRAM_BAR_CHART` target and `BAR_CHART` panel templates in
`dashboards/common.py`.
> - Replace `Task Completion Time` with histogram bar chart; add new
panels: `Block Completion Time Histogram (s)`, `Block Size (Bytes)
Histogram`, `Block Size (Rows) Histogram` in `data_dashboard_panels.py`
and include them in `Outputs` and `Tasks` rows.
> - Normalize time units from `seconds` to `s` for several panels.
> - **Factory**:
> - Panel generation respects template-specific settings (no behavioral
change beyond using new templates).
> - **Tests**:
> - Add tests for histogram initialization, bucket indexing, and
counting for task/block durations and block sizes in
`test_op_runtime_metrics.py`.
>
---------
Signed-off-by: Alan Guo <aguo@anyscale.com>
* [deps] raydepsets import all config (2/3) (#57584)
Adds workspace functions to parse all configs; they are not currently
used in raydepsets.
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
* fix ray_serve_deployment_queued_queries discrepancy in docs (#57629)
## Why are these changes needed?
We seem to make this mistake only in our docs. See
https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/router.py#L545,
specifically this part:
```python
metrics.Gauge(
"serve_deployment_queued_queries",
description=(
"The current number of queries to this deployment waiting"
" to be assigned to a replica."
),
tag_keys=("deployment", "application", "handle", "actor_id"),
)
```
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
* [Core] Remove success field from the release test json result (#57630)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* [Data] Make `test_hanging_detector_detects_issues` more robust (#57567)
## Why are these changes needed?
`test_hanging_detector_detects_issues` checks that Ray Data emits a log
if one task takes much longer than the others. The issue is that the
test doesn't capture the log output correctly, so the test fails even
though Ray Data correctly emits the log.
To make this test more robust, this PR uses pytest's `caplog` fixture to
capture the logs rather than a bespoke custom handler (see the sketch
after the log excerpt below).
```
[2025-10-08T09:00:41Z] > assert hanging_detected, log_output
| [2025-10-08T09:00:41Z] E AssertionError:
| [2025-10-08T09:00:41Z] E assert False
| [2025-10-08T09:00:41Z]
| [2025-10-08T09:00:41Z] python/ray/data/tests/test_issue_detection_manager.py:153: AssertionError
|
```
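A hedged sketch of the `caplog` approach (the helper and exact log text are assumptions):

```
import logging

def test_hanging_detector_detects_issues(caplog):
    with caplog.at_level(logging.WARNING, logger="ray.data"):
        run_pipeline_with_one_slow_task()  # hypothetical helper
    assert any(
        "hang" in record.getMessage().lower() for record in caplog.records
    ), caplog.text
```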
---------
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
* [core] (cgroups 14/n) Clean up bazel targets and support cross-platform build. (#57244)
For more details about the resource isolation project see
https://github.com/ray-project/ray/issues/54703.
This PR introduces two public bazel targets from the
`//src/ray/common/cgroup2` subsystem.
* `CgroupManagerFactory` is a cross-platform target that exports a
working CgroupManager on Linux if resource isolation is enabled. It
exports a Noop implementation if running on a non-Linux platform or if
resource isolation is not enabled on Linux.
* `CgroupManagerInterface` is the public API of CgroupManager.
It also introduces a few other changes
1. All resource isolation related configuration parsing and input
validation has been moved into CgroupManagerFactory.
2. NodeManager now controls the lifecycle (and destruction) of
CgroupManager.
3. SysFsCgroupDriver uses a Linux header file to find the path of the
mount file instead of hardcoding it, because different Linux
distributions can use different files.
---------
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* add test for multiple task consumers in a single app (#56618)
- Adds a test for multiple task consumers in a single Ray Serve
application
---------
Signed-off-by: harshit <harshit@anyscale.com>
* [Data] Adding MCAP datasource (#55716)
* [Data] Fix driver hang during streaming generator block metadata retrieval (#56451)
## Why are these changes needed?
This PR fixes a critical driver hang issue in Ray Data's streaming
generator. The problem occurs when computation completes and block data
is generated, but the worker crashes before the metadata object is
generated, causing the driver to hang completely until the task's
metadata is successfully rebuilt. This creates severe performance
issues, especially in cluster environments with significant resource
fluctuations.
## What was the problem?
**Specific scenario:**
1. Computation completes, block data is generated
2. Worker crashes before the metadata object is generated
3. Driver enters the
[physical_operator.on_data_ready()](https://github.com/ray-project/ray/blob/ray-2.46.0/python/ray/data/_internal/execution/interfaces/physical_operator.py#L124)
logic and waits indefinitely for metadata until the task retry succeeds
and the metadata object becomes available
4. If cluster resources are insufficient, the task cannot be retried
successfully, causing driver to hang for hours (actual case: 12 hours)
**Technical causes:**
- Using `ray.get(next(self._streaming_gen))` for metadata content
retrieval, which may hang indefinitely
- Lack of timeout mechanisms and state tracking, preventing driver
recovery from hang state
- No proper handling when worker crashes between block generation and
metadata generation
## What does this fix do?
- Adds `_pending_block_ref` and `_pending_meta_ref` state tracking to
properly handle block/metadata pairs
- Uses `ray.get(meta_ref, timeout=1)` with timeout for metadata content
retrieval
- Adds error handling for `GetTimeoutError` with warning logs
- Prevents unnecessary re-fetching of already obtained block references
- **Key improvement: Prevents the driver from hanging for extended
periods when a worker crashes between block and metadata generation**
(see the sketch below)
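A rough sketch of the described pattern; the attribute and method names
follow the PR text, but the control flow here is an assumption, not the
actual implementation:
```
import logging

import ray
from ray.exceptions import GetTimeoutError

logger = logging.getLogger(__name__)

class _StreamingOutputReader:
    # Hypothetical container for the two new state-tracking fields.
    def __init__(self, streaming_gen):
        self._streaming_gen = streaming_gen
        self._pending_block_ref = None
        self._pending_meta_ref = None

    def on_data_ready(self):
        # Cache the refs so a timed-out metadata fetch doesn't re-consume
        # the streaming generator on the next attempt.
        if self._pending_block_ref is None:
            self._pending_block_ref = next(self._streaming_gen)
        if self._pending_meta_ref is None:
            self._pending_meta_ref = next(self._streaming_gen)
        try:
            # Bounded wait instead of a ray.get() that can hang forever.
            meta = ray.get(self._pending_meta_ref, timeout=1)
        except GetTimeoutError:
            logger.warning("Timed out waiting for block metadata; retrying.")
            return None  # Caller retries on the next scheduling iteration.
        block_ref = self._pending_block_ref
        self._pending_block_ref = self._pending_meta_ref = None
        return block_ref, meta
```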
## Related issue number
Fixes a critical performance issue in streaming data processing that
causes the driver to hang for extended periods (up to 12 hours) when
workers crash between block generation and metadata generation,
especially in cluster environments with significant resource
fluctuations.
## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- **Testing Strategy**
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
---------
Signed-off-by: dragongu <andrewgu@vip.qq.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
* [train] Add template for Ray Train + DeepSpeed LLM fine-tuning (#57118)
This PR adds a template that demonstrates how to use Ray Train with
DeepSpeed.
At a high level, this template covers:
- A hands-on example of fine-tuning a language model.
- Saving and loading model checkpoints with Ray Train and DeepSpeed.
- Key DeepSpeed configurations (ZeRO stages, offloading, mixed
precision).
Tested environment: 1 CPU head node + 2 NVIDIA T4 GPU worker nodes.
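For orientation, a minimal launch sketch (the `train_func` body and
scaling values here are illustrative; the template's actual script is
`train.py`):
```
import ray.train
from ray.train.torch import TorchTrainer

def train_func(config):
    # DeepSpeed engine init, training loop, and checkpoint reporting via
    # ray.train.report(...) would go here.
    ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ray.train.ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```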
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Adds a DeepSpeed + Ray Train LLM fine-tuning example with a training
script, tutorial notebook, and cloud configs, and links it in the
examples index.
>
> - **New example: `examples/pytorch/deepspeed_finetune`**
> - `train.py`: End-to-end LLM fine-tuning with Ray Train + DeepSpeed
(ZeRO config, BF16, checkpoint save/load, resume support, debug steps).
> - `README.ipynb`: Step-by-step tutorial covering setup, dataloader,
model init, training loop, checkpointing, and launch via `TorchTrainer`.
> - Cluster configs: `configs/aws.yaml`, `configs/gce.yaml` for 1 head +
2 T4 workers.
> - **Docs**
> - Add example entry in `doc/source/train/examples.yml` linking to
`examples/pytorch/deepspeed_finetune/README`.
<!-- /CURSOR_SUMMARY -->
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
* add celery default serializer and add new fields in celery adapter config (#56707)
* [Data] Streamline concurrency parameter semantic (#57035)
To remove overloaded semantics, we will undeprecate the `compute`
parameter, deprecate the `concurrency` parameter, and ask users to use
`ActorPoolStrategy` and `TaskPoolStrategy` directly, which makes the
logic straightforward. An illustrative call follows below.
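As an illustration of the intended call pattern after this change (exact
parameter names and availability depend on the Ray version):
```
import ray
from ray.data import ActorPoolStrategy

class AddOne:
    def __call__(self, batch):
        batch["id"] = batch["id"] + 1
        return batch

ds = ray.data.range(8).map_batches(
    AddOne,
    # Explicit strategy instead of the overloaded `concurrency` parameter.
    compute=ActorPoolStrategy(size=2),
)
print(ds.take_all())
```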
* [release test] remove unused url_exists (#57655)
and related `test_wheels` test
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [release test] extract bazel.py into its own py_library (#57656)
Starting to break down `ray_release` into smaller bazel build rules.
This is required to fix Windows CI builds, where `test_in_docker` now
imports third-party libraries that do not yet work on Windows machines.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [ci] exclude `build_in_docker_windows` from `ray_ci_lib` (#57660)
it is part of the `py_binary`
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [release test] exclude scripts from ray_release (#57662)
the scripts are `py_binary` entrypoints, not for ray_release binary.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [release test] make more files a standalone py_library's (#57661)
working towards breaking down the `ray_release` package into smaller
modules
* [release test] let custom byod tests only depend on the binary file (#57663)
also fixes typo in build file
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [Doc][KubeRay] Add ReconcileConcurrency configuration instructions to Troubleshooting Guide (#55236)
## Why are these changes needed?
Some users may not know how to configure `ReconcileConcurrency` in
KubeRay.
Docs link:
https://anyscale-ray--55236.com.readthedocs.build/en/55236/cluster/kubernetes/troubleshooting/troubleshooting.html#how-to-configure-reconcile-concurrency-when-there-are-large-mount-of-crs
https://github.com/ray-project/kuberay/issues/3909
---------
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
* [core] (cgroups 15/n) Adding a user cgroup subtree for non-ray processes. (#57269)
This PR stacks on #57244.
For more details about the resource isolation project see
https://github.com/ray-project/ray/issues/54703.
In the previous Ray cgroup hierarchy, all processes found under
`--cgroup-path` were moved into the system cgroup. This PR changes the
hierarchy so that all non-ray processes get their own cgroup.
The new cgroup hierarchy looks like
```
cgroup_path (e.g. /sys/fs/cgroup)
|
ray-node_<node_id>
| |
system user
| | |
leaf workers non-ray
```
The cgroups contain the following processes:
* system/leaf (all ray non-worker processes e.g. raylet,
runtime_env_agent, gcs_server, ...)
* user/workers (all ray worker processes)
* user/non-ray (all non-ray processes migrated from cgroup_path).
Note: if you're running Ray inside a container, all non-Ray processes
running in the container will be migrated to `user/non-ray`.
The following controllers will be enabled:
* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)
The following constraints are applied:
* system (cpu.weight, memory.min)
* user (cpu.weight)
---------
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* [core][autoscaler][v1] use cluster_resource_state for node state info to fix over-provisioning (#57130)
This PR is related to
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864)
The v1 autoscaler monitor currently is pulling metrics from two
different modules in GCS:
- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_`
and updates it at two different intervals (`UpdateNodeResourceUsage`
every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_`
and updates it via `UpdateResourceLoadAndUsage`. This module is already
the source for the v2 autoscaler.
| Field | Source | Update path | Interval | Code |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |
Because these two modules update asynchronously, the autoscaler can end
up seeing inconsistent resource states. That causes a race condition
where extra nodes may be launched before the updated availability
actually shows up. In practice, this means clusters can become
over-provisioned even though the demand was already satisfied.
In the long run, the right fix is to fully switch the v1 autoscaler over
to GcsAutoscalerStateManager::HandleGetClusterResourceState, just like
v2 already does. But since v1 will eventually be deprecated, this PR
takes a practical interim step: it merges the necessary info from both
GcsResourceManager::HandleGetAllResourceUsage and
GcsAutoscalerStateManager::HandleGetClusterResourceState in a hybrid
approach.
This keeps v1 correct without big changes, while still leaving the path
open for a clean migration to v2 later on.
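Conceptually, the hybrid read looks like the following sketch; the method
and field names are illustrative, not the actual GCS client API:
```
# Toy sketch of the hybrid approach: node states come from the v2
# consistent snapshot, while demand fields are still read from v1.
def update_load_metrics(gcs_client):
    v2_state = gcs_client.get_cluster_resource_state()  # one snapshot/tick
    v1_usage = gcs_client.get_all_resource_usage()      # legacy fields
    return {
        "node_states": v2_state.node_states,                 # from v2
        "waiting_bundles": v1_usage.resource_load_by_shape,  # still v1
        "pending_placement_groups": v1_usage.placement_group_load,
        "cluster_full": v1_usage.cluster_full_of_actors_detected,
    }
```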
## Details
This PR follows the fix suggested by @rueian in #52864 by switching the
v1 autoscaler's node state source from
`GcsResourceManager::HandleGetAllResourceUsage` to
`GcsAutoscalerStateManager::HandleGetClusterResourceState`.
Root cause: the v1 autoscaler previously pulled data from two
asynchronous update cycles:
- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`
This created a race condition where newly allocated resources would be
visible before demand metrics updated, causing the autoscaler to
incorrectly perceive unmet demand and launch extra nodes.
The Fix: By using v2's `HandleGetClusterResourceState` for node
iteration, both current resources and pending demands now come from the
same consistent snapshot (same tick), so the extra-node race condition
goes away.
## Proposed Changes in update_load_metrics()
This PR updates how the v1 autoscaler collects cluster metrics.
Most node state information is now taken from **v2
(`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while
certain fields still rely on **v1
(`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 doesn't
have an equivalent yet.
| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |
### Additional Notes
- This hybrid approach addresses the race condition while still using
legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the
change is backward-compatible and does not break existing behavior.
## Changed Behavior (Observed)
(The autoscaler config & serving code are the same as in
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864))
After switching to v2 autoscaler state (cluster resource), the issue no
longer occurs:
- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, Node
Provider only launches a single `ray.worker.4090.standard` node. (No
extra requests for additional nodes are observed.)
[debug.log](https://github.com/user-attachments/files/22659163/debug.log)
## Related issue number
Closes #52864
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
* [release test] split out global_config, wheels and retry (#57667)
towards fixing test_in_docker for windows
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [release test] move cluster env utils to `anyscale_util` (#57669)
as cluster envs are anyscale specific concepts
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [release test] move azure related functions to `cloud_util.py` (#57675)
out of `util.py`. also adding its own `py_library`
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [core] Add a test for nested sub-processes cleanup (#57638)
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->
## Why are these changes needed?
I've encountered an issue where Ray sends SIGKILL to child processes
(grandchild will not receive the signal) launched by a Ray actor. As a
result, the subprocess cannot catch the signal to gracefully clean up
its child processes. Therefore, the grandchild processes of the actor
will leak.
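A hypothetical repro of the leak (names and workload are illustrative;
assumes a POSIX shell is available):
```
import subprocess

import ray

@ray.remote
class Launcher:
    def start(self):
        # Child shell spawns a long-lived grandchild, then waits on it.
        self._child = subprocess.Popen(["bash", "-c", "sleep 600 & wait"])
        return self._child.pid

ray.init()
launcher = Launcher.remote()
child_pid = ray.get(launcher.start.remote())
# The child receives SIGKILL, which it cannot trap, so the `sleep`
# grandchild survives and leaks.
ray.kill(launcher)
```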
I'm glad to see https://github.com/ray-project/ray/pull/56476 by
@codope, and I also built a similar solution myself. This PR adds the
case that I encountered.
@codope why not enable this feature by default?
## Related issue number
<!-- For example: "Closes #1234" -->
## Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
---------
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
* [release test] move in `join_cloud_storage_paths` (#57679)
into `anyscale_job_runner`. it is only used in `anyscale_job_runner` now
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [release test] move `upload_working_dir_to_gcs` to `glue.py` (#57681)
it is only used in glue.py
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [release test] remove the use of typing extension for TypedDict (#57682)
should all be using hermetic python with python 3.8 or above now
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
* [ci] fix windows test launch and building (#57684)
Ensure that `test_in_docker` does not depend on the entire `ray_release`
library, only on the python files required for the test DB to work.
This removes the dependency on the `cryptography` library from
`ray_ci`, so that Windows wheels can be built and Windows tests can run
again.
* [core] (cgroups 16/n) Changing the algorithm for default values for system reserved resources (#57653)
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: israbbani <israbbani@gmail.com>
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* [core] Some plasma client cleaning (#57670)
Cleaning out plasma, the plasma client, and its neighbors.
The plasma client had a pimpl implementation even though plasma didn't
need anything that pimpl provides, so this removes the separate impl
class and keeps just the plasma client and its interface. One note: the
client needs `shared_from_this`, and the old plasma client always held
a shared ptr to the impl, so the raylet had to be refactored to hold a
shared ptr to the plasma client in order to keep using
`shared_from_this`.
Other cleanup:
- many of the ipc functions always returned status::ok, so they were
changed to void
- some extra reserving of vectors and moving
- removed unnecessary consts in the pull manager that would prevent
moves
etc.
---------
Signed-off-by: dayshah <dhyey2019@gmail.com>
* [core] Tune timeout to reduce run time of `test_transient_error_retry` (#57626)
The test times out frequently in CI.
Before this change, the test took `~40s` to run on my laptop. After the
change, the test took `~15s` to run on my laptop.
There also seems to be hanging related to in-order execution semantics,
so for now flipping to `allow_out_of_order_execution=True`. @dayshah
will add the `False` variant when he fixes the underlying issue.
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* [core][event] rename TaskExecutionEvent to TaskLifecycleEvent (#56853)
This PR refactors the `TaskExecutionEvent` proto in two ways:
- Rename the file to `events_task_lifecycle_event.proto`
- Refactor `task_state` from a map to an ordered array of TaskState and
timestamp pairs, and rename the field to `state_transitions` for
consistency (see the toy sketch below).
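As a toy illustration of the schema change (field names here are
illustrative, not the exact proto layout):
```
# An unordered {state: timestamp} map becomes an ordered list of
# state transitions.
task_state_map = {"RUNNING": 1_700_000_001, "PENDING": 1_700_000_000}

state_transitions = sorted(
    ({"state": s, "timestamp": ts} for s, ts in task_state_map.items()),
    key=lambda t: t["timestamp"],
)
print(state_transitions)
# [{'state': 'PENDING', ...}, {'state': 'RUNNING', ...}]
```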
This PR depends on the upstream updating its logic to consume this new
schema.
Test:
- CI
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Renames task execution event to task lifecycle event and changes its
schema from a state map to an ordered state_transitions list, updating
core, GCS, dashboard, builds, tests, and docs.
>
> - **Proto/API changes (breaking)**
> - Rename `TaskExecutionEvent` → `TaskLifecycleEvent` and update
`RayEvent.EventType` (`TASK_EXECUTION_EVENT` → `TASK_LIFECYCLE_EVENT`).
> - Replace `task_state` map with `state_transitions` (list of `{state,
timestamp}`) in `events_task_lifecycle_event.proto`.
> - Update `events_base_event.proto` field from `task_execution_event` →
`task_lifecycle_event` and imports/BUILD deps accordingly.
> - **Core worker**
> - Update buffer/conversion logic in
`src/ray/core_worker/task_event_buffer.{h,cc}` to populate/emit
`TaskLifecycleEvent` with `state_transitions`.
> - **GCS**
> - Update `GcsRayEventConverter` to consume `TASK_LIFECYCLE_EVENT` and
convert `state_transitions` to `state_ts_ns`.
> - **Dashboard/Aggregator**
> - Switch exposable type defaults/env to `TASK_LIFECYCLE_EVENT` in
`python/.../aggregator_agent.py`.
> - **Tests**
> - Adjust unit tests to new event/type and schema across core worker,
GCS, and dashboard.
> - **Docs**
> - Update event export guide references to new lifecycle event proto.
<!-- /CURSOR_SUMMARY -->
Signed-off-by: Cuong Nguyen <can@anyscale.com>
* remove async capability enum (#57666)
- removing the async capability enum as it is not being used; for more
details:
https://github.com/ray-project/ray/pull/56707#discussion_r2407901033
---------
Signed-off-by: harshit <harshit@anyscale.com>
* [core] Update `local_mode` warning to `FutureWarning` so it prints by default (#57623)
Previously we were using `DeprecationWarning`, which is silenced by
default. Now this is printed:
```
>>> ray.init(local_mode=True)
/Users/eoakes/code/ray/python/ray/_private/client_mode_hook.py:104: FutureWarning: `local_mode` is an experimental feature that is no longer maintained and will be removed in the near future. For debugging consider using the Ray distributed debugger.
return func(*args, **kwargs)
```
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* add note for celery workers (#57686)
- adding a new note about using the filesystem as a broker in Celery
---------
Signed-off-by: harshit <harshit@anyscale.com>
* [ci] update Pull Request template (#57193)
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->
## Description
<!-- Briefly describe what this PR accomplishes and why it's needed -->
Improved the Ray pull request template to make it less overwhelming for
contributors while giving maintainers better information for reviews and
release notes. The new template has clearer sections and organized
checklists that are much easier to fill out. This should encourage more
contributions while making the review process smoother and release note
generation more straightforward.
## Related issues
<!-- Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234" -->
## Types of change
- [ ] Bug fi…