Daily merge: master → main 2025-11-17 #678
Conversation
…t#57579) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: joshlee <joshlee@anyscale.com> Co-authored-by: dayshah <dhyey2019@gmail.com>
…AggType, U]) (ray-project#57281) ## Why are these changes needed? The current Generic types in `AggregateFnV2` are not tied to the class, so they are not picked up properly by static type checkers such as mypy. By adding `Generic[]` to the class definition, we get full type-checking support. ## Checks - [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [N/A] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [N/A] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Arthur <atte.book@gmail.com> Co-authored-by: Goutam <goutam@anyscale.com>
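The typing change described above can be sketched with a toy example (class and method names simplified; this is not Ray's actual `AggregateFnV2` API): binding the TypeVars via `Generic[...]` on the class itself is what lets mypy tie them to each subclass.

```python
from typing import Generic, TypeVar

AggType = TypeVar("AggType")
U = TypeVar("U")

class AggregateFn(Generic[AggType, U]):
    """Toy stand-in for the pattern; not Ray's actual AggregateFnV2."""
    def init(self) -> AggType:
        raise NotImplementedError
    def finalize(self, accumulator: AggType) -> U:
        raise NotImplementedError

class Count(AggregateFn[int, int]):
    # Because AggType and U are bound via Generic[...] on the base class,
    # mypy checks that these overrides agree with AggregateFn[int, int].
    def init(self) -> int:
        return 0
    def finalize(self, accumulator: int) -> int:
        return accumulator

count_result = Count().finalize(5)
```

Without `Generic[AggType, U]` on the base class, the subscripting `AggregateFn[int, int]` would be rejected and mypy would treat the overrides as unrelated methods.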
```
REGRESSION 51.89%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 8378.589542828342 to 4030.5453313124744 in microbenchmark.json
REGRESSION 33.66%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 4105.951978131054 to 2723.9690685388855 in microbenchmark.json
REGRESSION 29.43%: client__tasks_and_put_batch (THROUGHPUT) regresses from 12873.97871447783 to 9085.541921590711 in microbenchmark.json
REGRESSION 27.77%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 13779.913284159866 to 9952.762154617178 in microbenchmark.json
REGRESSION 26.34%: client__get_calls (THROUGHPUT) regresses from 1129.1409512898194 to 831.689705073893 in microbenchmark.json
REGRESSION 25.07%: multi_client_put_gigabytes (THROUGHPUT) regresses from 36.697734067834084 to 27.49866375110667 in microbenchmark.json
REGRESSION 23.94%: actors_per_second (THROUGHPUT) regresses from 508.9808896382363 to 387.1219957094043 in benchmarks/many_actors.json
REGRESSION 17.22%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4826.171895058453 to 3995.0258578261814 in microbenchmark.json
REGRESSION 16.58%: single_client_tasks_async (THROUGHPUT) regresses from 7034.736389002367 to 5868.239300602419 in microbenchmark.json
REGRESSION 14.30%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.9643252583791863 to 0.8264370292993273 in microbenchmark.json
REGRESSION 14.27%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1037.1186014627438 to 889.14113884267 in microbenchmark.json
REGRESSION 12.90%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1014.64288638885 to 883.7347522770161 in microbenchmark.json
REGRESSION 11.37%: client__put_calls (THROUGHPUT) regresses from 805.4069136266919 to 713.82381443796 in microbenchmark.json
REGRESSION 11.02%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5222.99132120111 to 4647.56843532278 in microbenchmark.json
REGRESSION 7.95%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 12.272053704608084 to 11.296427707979271 in microbenchmark.json
REGRESSION 7.86%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 524.6134993747014 to 483.38098840508496 in microbenchmark.json
REGRESSION 7.26%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4498.3519827438895 to 4171.572402867286 in microbenchmark.json
REGRESSION 6.47%: single_client_wait_1k_refs (THROUGHPUT) regresses from 4.700920788730696 to 4.396844484606209 in microbenchmark.json
REGRESSION 6.40%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 2766.907182403518 to 2589.906655726785 in microbenchmark.json
REGRESSION 5.06%: single_client_put_gigabytes (THROUGHPUT) regresses from 19.30103208209274 to 18.324991353469613 in microbenchmark.json
REGRESSION 1.73%: 1_n_actor_calls_async (THROUGHPUT) regresses from 7474.798821945149 to 7345.613928457275 in microbenchmark.json
REGRESSION 421.24%: dashboard_p99_latency_ms (LATENCY) regresses from 232.641 to 1212.615 in benchmarks/many_pgs.json
REGRESSION 377.77%: dashboard_p95_latency_ms (LATENCY) regresses from 11.336 to 54.16 in benchmarks/many_pgs.json
REGRESSION 306.02%: dashboard_p99_latency_ms (LATENCY) regresses from 749.022 to 3041.184 in benchmarks/many_tasks.json
REGRESSION 162.57%: dashboard_p50_latency_ms (LATENCY) regresses from 11.744 to 30.836 in benchmarks/many_actors.json
REGRESSION 94.47%: dashboard_p95_latency_ms (LATENCY) regresses from 487.355 to 947.76 in benchmarks/many_tasks.json
REGRESSION 35.48%: dashboard_p99_latency_ms (LATENCY) regresses from 49.716 to 67.355 in benchmarks/many_nodes.json
REGRESSION 33.15%: dashboard_p95_latency_ms (LATENCY) regresses from 2876.107 to 3829.61 in benchmarks/many_actors.json
REGRESSION 27.55%: dashboard_p95_latency_ms (LATENCY) regresses from 13.982 to 17.834 in benchmarks/many_nodes.json
REGRESSION 7.56%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 13.777409734000003 to 14.81957527099999 in scalability/object_store.json
REGRESSION 4.48%: 3000_returns_time (LATENCY) regresses from 6.1422604579999955 to 6.417386639000014 in scalability/single_node.json
REGRESSION 3.41%: avg_pg_remove_time_ms (LATENCY) regresses from 1.419495533032749 to 1.4678401576576676 in stress_tests/stress_test_placement_group.json
REGRESSION 1.48%: 10000_get_time (LATENCY) regresses from 25.136106761999997 to 25.508083513999992 in scalability/single_node.json
REGRESSION 0.76%: stage_2_avg_iteration_time (LATENCY) regresses from 36.08304100036621 to 36.358218574523924 in stress_tests/stress_test_many_tasks.json
```
Signed-off-by: kevin <kevin@anyscale.com>
…ay-project#57631) ## Why are these changes needed? To prevent unknown issues with the streaming executor not completing after hours, we add assert statements to rule out potential issues. This PR ensures that when an operator is completed: the internal input queue is empty, the internal output queue is empty, and the external input queue is empty. The external output queue can be non-empty, because downstream operators consume from it. ## Checks - [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
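The completion invariants described above can be modeled with a small sketch (a toy class, not Ray's actual operator implementation):

```python
from collections import deque

class OperatorQueues:
    """Toy model of the operator-completion invariants; not Ray's actual class."""
    def __init__(self):
        self.internal_input = deque()
        self.internal_output = deque()
        self.external_input = deque()
        self.external_output = deque()  # may stay non-empty: downstream drains it

    def assert_completed(self):
        # The three queues that must be drained when the operator completes.
        assert not self.internal_input, "internal input queue not empty"
        assert not self.internal_output, "internal output queue not empty"
        assert not self.external_input, "external input queue not empty"

op = OperatorQueues()
op.external_output.append("pending block")  # allowed at completion
op.assert_completed()  # passes: only the external output holds data

bad = OperatorQueues()
bad.internal_input.append("stuck block")
try:
    bad.assert_completed()
    caught = False
except AssertionError:
    caught = True  # a stuck internal queue is surfaced immediately
```

The point of such asserts is to turn a silent multi-hour hang into an immediate, diagnosable failure.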
…roject#58234) ## Description We need to make sure we're running tests on at least SF100, so that we capture regressions that could otherwise fall under the noise level. Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…/5) (ray-project#58068) Updating docgpubuild to run on Python 3.10; updating the minbuild-multiply job name to minbuild-serve. Post-merge test that uses the docgpubuild image: https://buildkite.com/ray-project/postmerge/builds/14073 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
…optimized download function (ray-project#57854) Signed-off-by: ahao-anyscale <ahao@anyscale.com>
…ay-project#58261) ## Description This change makes Ray Data dump verbose telemetry for `ResourceManager` into `ray-data.log` by default. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…GRADES(3/5) " (ray-project#58266) Reverts ray-project#58068
## Description Update formatting of FailurePolicy log message to be more readable. ## Additional information **Before:** ``` [FailurePolicy] Decision: FailureDecision.RAISE, Error source: controller, Error count / maximum errors allowed: 1/0, Error: Training failed due to controller error: Worker group is not active. Call WorkerGroup.create() to create a new worker group. ``` **After:** ``` [FailurePolicy] RAISE Source: controller Error count: 1 (max allowed: 0) Training failed due to controller error: Worker group is not active. Call WorkerGroup.create() to create a new worker group. ``` Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com> Signed-off-by: Yicheng-Lu-llll <51814063+Yicheng-Lu-llll@users.noreply.github.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…#56742) ## Why are these changes needed? Fifth split of ray-project#56416 ## Checks - [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --- > [!NOTE] > Enables Ruff import sorting for rllib/examples by narrowing per-file ignores and updates example files' imports accordingly, with no functional changes. > > - **Tooling/Lint**: > - Update `pyproject.toml` Ruff per-file-ignores (replace blanket `rllib/*` with targeted subpaths) to enable import-order linting for `rllib/examples`. > - **Examples**: > - Reorder and normalize imports across `rllib/examples/**` to satisfy Ruff isort rules; no logic or behavior changes. --------- Signed-off-by: Gagandeep Singh <gdp.1807@gmail.com> Signed-off-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com> Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com> Co-authored-by: Mark Towers <mark.m.towers@gmail.com> Co-authored-by: Mark Towers <mark@anyscale.com>
…_options. (ray-project#58275) ## Description Exclude IMPLICIT_RESOURCE_PREFIX from ReplicaConfig.ray_actor_options. ## Related issues Fixes ray-project#58085 Signed-off-by: xingsuo-zbz <zhao_abc_123@163.com>
… + discrepancy fix in Python API 'serve.start' function (ray-project#57622) ## Why are these changes needed? 1. Fix bug with 'proxy_location' set for 'serve run' CLI command The `serve run` CLI command ignores `proxy_location` from the config and uses the default value `EveryNode`. Steps to reproduce: - have a script: ```python # hello_world.py from ray.serve import deployment @deployment async def hello_world(): return "Hello, world!" hello_world_app = hello_world.bind() ``` Execute: ``` ray stop ray start --head serve build -o config.yaml hello_world:hello_world_app ``` - change `proxy_location` in the `config.yaml`: EveryNode -> Disabled ``` serve run config.yaml curl -s -X GET "http://localhost:8265/api/serve/applications/" | jq -r '.proxy_location' ``` Output: ``` Before change: EveryNode - but Disabled expected After change: Disabled ``` 2. Fix discrepancy for 'proxy_location' in the Python API 'start' method The `serve.start` function in the Python API sets a different `http_options.location` depending on whether `http_options` is provided.
Steps to reproduce: - have a script: ```python # discrepancy.py import time from ray import serve from ray.serve.context import _get_global_client if __name__ == '__main__': serve.start() client = _get_global_client() print(f"Empty http_options: `{client.http_config.location}`") serve.shutdown() time.sleep(5) serve.start(http_options={"host": "0.0.0.0"}) client = _get_global_client() print(f"Non empty http_options: `{client.http_config.location}`") ``` Execute: ``` ray stop ray start --head python -m discrepancy ``` Output: ``` Before change: Empty http_options: `EveryNode` Non empty http_options: `HeadOnly` After change: Empty http_options: `EveryNode` Non empty http_options: `EveryNode` ``` ------------------------------------------------------------- It changes current behavior in the following ways: 1. `serve run` CLI command respects `proxy_location` parameter from config instead of using the hardcoded `EveryNode`. 2. `serve.start` function in Python API stops using the default `HeadOnly` in case of empty `proxy_location` and provided `http_options` dictionary without `location` specified. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> Aims to simplify changes in the PR: ray-project#56507 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. 
Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
## Description ### Status Quo PR ray-project#54667 addressed OOM issues by sampling a few lines of the file. However, that code always assumes the input file is seekable (i.e., not compressed). This breaks zipped files, as reported in ray-project#55356. ### Potential Workaround - Refactor the code reused between JsonDatasource and FileDatasource - Default to 10000 if a zipped file is found ## Related issues ray-project#55356 --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
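The seekability check at the heart of the workaround can be sketched as follows (helper name and behavior are illustrative, not Ray Data's actual code; the fallback constant is the 10000 default from the description):

```python
import io

FALLBACK_NUM_ROWS = 10000  # fallback default mentioned in the description

def sample_lines(f, n=3):
    """Return up to n sampled lines if the stream is seekable, else None.

    A compressed or streaming source cannot be cheaply rewound after
    sampling, so the caller should fall back to FALLBACK_NUM_ROWS
    instead of sampling it.
    """
    if not f.seekable():
        return None
    pos = f.tell()
    lines = [line for line in (f.readline() for _ in range(n)) if line]
    f.seek(pos)  # rewind so the real read still sees the whole file
    return lines

buf = io.BytesIO(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n{"a": 4}\n')
sampled = sample_lines(buf)
remaining = buf.read()  # unaffected by sampling: stream was rewound
```

The bug being fixed is exactly the missing `seekable()` branch: calling `seek()` on a decompressing wrapper either fails or silently returns garbage positions.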
…ray-project#58180) ## Expose Route Patterns in Proxy Metrics fixes ray-project#52212 ### Problem Proxy metrics (`ray_serve_num_http_requests_total`, `ray_serve_http_request_latency_ms`) only expose `route_prefix` (e.g., `/api`) instead of actual route patterns (e.g., `/api/users/{user_id}`). This prevents granular monitoring of individual endpoints without causing high cardinality from unique request paths. ### Design **Route Pattern Extraction & Propagation:** - Replicas extract route patterns from ASGI apps (FastAPI/Starlette) at initialization using `extract_route_patterns()` - Patterns propagate: Replica → `ReplicaMetadata` → `DeploymentState` → `EndpointInfo` → Proxy - Works with both normal patterns (routes in class) and factory patterns (callable returns app) **Proxy Route Matching:** - `ProxyRouter.match_route_pattern()` matches incoming requests to specific patterns using cached mock Starlette apps - Metrics tag requests with parameterized routes (e.g., `/api/users/{user_id}`) instead of prefixes - Fallback to `route_prefix` if patterns are unavailable or matching fails **Performance:**

Metric | Before | After
-- | -- | --
Requests per second (RPS) | 403.39 | 397.82
Mean latency (ms) | 247.9 | 251.37
p50 (ms) | 224 | 223
p90 (ms) | 415 | 428
p99 (ms) | 526 | 544

### Testing - Unit tests for `extract_route_patterns()` - Integration test verifying metrics use patterns and avoid high cardinality - Parametrized for both normal and factory patterns --------- Signed-off-by: abrar <abrar@anyscale.com>
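The match-with-fallback idea can be sketched with a minimal regex-based matcher (illustrative only; Ray's `ProxyRouter.match_route_pattern()` uses cached mock Starlette apps rather than this regex trick):

```python
import re

def match_route_pattern(path, patterns, fallback):
    """Return the parameterized pattern matching `path`, else `fallback`.

    Tagging metrics with the pattern (e.g. "/api/users/{user_id}") keeps
    cardinality bounded by the number of routes, not of request paths.
    """
    for pattern in patterns:
        # Turn "{param}" segments into a single-path-segment wildcard.
        regex = "^" + re.sub(r"\{[^/}]+\}", "[^/]+", pattern) + "$"
        if re.match(regex, path):
            return pattern
    return fallback  # pattern unavailable or no match: fall back to prefix

patterns = ["/api/users/{user_id}", "/api/items/{item_id}/reviews"]
tag_hit = match_route_pattern("/api/users/42", patterns, "/api")
tag_miss = match_route_pattern("/api/unknown", patterns, "/api")
```

Every distinct user ID thus maps to the one `/api/users/{user_id}` tag instead of minting a new metric series per request path.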
Signed-off-by: Future-Outlier <eric901201@gmail.com> Signed-off-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…#58060) This PR replaces STATS with Metric as the way to define metrics inside Ray (a unification effort) in all core worker components. For the most part, metrics are defined at the top-level component (core_worker_process.cc) and passed down to the sub-components as an interface. **Details** Full context of this refactoring work: - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead-code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <can@anyscale.com>
updating tune release tests to run on python 3.10 Successful release test run: https://buildkite.com/ray-project/release/builds/65655 (failing tests are already disabled) --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Remove the actor handle from the object that gets passed around in long-poll communication. Returning an actor handle in nested objects from a task makes the caller of that task a borrower from the reference-counting point of view. This pattern, although allowed, is not very well tested, so we avoid it by passing `actor_name` from `listen_for_change` instead. --------- Signed-off-by: abrar <abrar@anyscale.com>
## Description The full name was probably hallucinated by an LLM. Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…ross-node parallelism (ray-project#57261) Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
…imit` (ray-project#58303) ## Description ## Related issues Fix comment ray-project#58264 (comment) ## Additional information Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
…egy (ray-project#58306) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
…ist with nixl (ray-project#58263) ## Description For nixl, reuse previous metadata if transferring the same tensor list. This is to avoid repeated `register_memory` before `deregister_memory`. --------- Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
…tuned_examples/`` in ``rllib`` (ray-project#56746) ## Why are these changes needed? Seventh split of ray-project#56416 ## Checks - [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Gagandeep Singh <gdp.1807@gmail.com> Signed-off-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com> Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com> Co-authored-by: Mark Towers <mark.m.towers@gmail.com>
…ct#57835) ## Description Builds atop ray-project#58047. This PR ensures the following when `auth_mode` is `token`: calling `ray.init()` (without passing an existing cluster address) checks if a token is present, and generates and stores one at the default path if not; calling `ray.init(address="xyz")` (connecting to an existing cluster) checks if a token is present and raises an exception if one is not. --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
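The described token flow can be sketched as follows (the function, its signature, and the path handling are hypothetical, not Ray's actual API):

```python
import os
import secrets
import tempfile

def ensure_auth_token(token_path, connecting_to_existing_cluster):
    """Hypothetical helper sketching the flow described above."""
    if os.path.exists(token_path):
        with open(token_path) as f:
            return f.read().strip()
    if connecting_to_existing_cluster:
        # ray.init(address=...) with no local token: fail fast.
        raise RuntimeError("auth_mode=token requires a token to connect")
    # ray.init() starting a fresh cluster: generate and persist a token.
    token = secrets.token_hex(16)
    with open(token_path, "w") as f:
        f.write(token)
    return token

token_path = os.path.join(tempfile.mkdtemp(), "auth_token")
try:
    ensure_auth_token(token_path, connecting_to_existing_cluster=True)
    missing_token_raised = False
except RuntimeError:
    missing_token_raised = True  # connecting without a token is an error

generated = ensure_auth_token(token_path, connecting_to_existing_cluster=False)
```

The asymmetry is the point: a fresh cluster can mint its own token, but a client joining an existing cluster must already hold one.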
Not used anymore; all tests moved to Python 3.10. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
It is already imported, and the version being imported is not actually used. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Description In IMPALA, we access an attribute `self._minibatch_size` which does not exist anymore; this PR fixes the attribute access. While this check is nice, it was effectively untested code, so the PR also adds a small test that triggers the relevant code path.
Minor follow ups from: ray-project#58539 Example error message: ``` Task failed because the node it was running on is dead or unavailable. Node IP: 127.0.0.1, node ID: e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646. This can happen if the node was preempted, had a hardware failure, or its raylet crashed unexpectedly. To see node death information, use `ray list nodes --filter node_id=e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646`, check the Ray dashboard cluster page, search the node ID in the GCS logs, or use `ray logs raylet.out -ip 127.0.0.1`. ``` --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ators (ray-project#58555) ## Description Push `Filter` past `Join` (depending on the join op), `Filter` into `Union` branches, `Filter` past projections (accounting for renames), and past all shuffle ops. --------- Signed-off-by: Goutam <goutam@anyscale.com>
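The trickiest of these rewrites, pushing a filter past a projection with renames, can be sketched as follows (a toy helper, not Ray Data's actual optimizer rule): the predicate's column references are rewritten through the inverse rename map so that `Filter(Project(X))` becomes `Project(Filter(X))`.

```python
def push_filter_past_projection(filter_cols, renames):
    """Rewrite filter column references to the projection's input names.

    `renames` maps input column -> output column; the returned list names
    the columns the pushed-down filter must reference below the projection.
    """
    inverse = {out: inp for inp, out in renames.items()}
    # Columns the projection passes through unrenamed keep their name.
    return [inverse.get(col, col) for col in filter_cols]

# The projection renames "user_id" -> "uid"; a filter on "uid" above the
# projection becomes a filter on "user_id" below it, while "age" is untouched.
pushed = push_filter_past_projection(["uid", "age"], {"user_id": "uid"})
```

Pushing the filter below the projection (and below shuffles and union branches) lets upstream operators drop rows early, shrinking every intermediate block.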
## Description Similar to ray-project#58599, should have added avg metric for generation length as well. ## Related issues ## Additional information --------- Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…ay-project#57241) ## Why are these changes needed? The gcs_client behavior today is hard to overload/override. This is bad for a few reasons, one being that it can be difficult to instrument testing. There are examples today amongst accessors that need to give "Friend"-level access to methods of different classes in order to accomplish some of the tests. To resolve this, we're going to introduce some new abstractions. This is the first PR that sets up some of the framework. From here, we'll do the remainder of the accessors. ### New Classes ``` actor_info_accessor.cc <-- Concretion implementations actor_info_accessor.h <-- Concretion declarations (contains private methods and members) actor_info_accessor_interface.h <-- Interface function declarations ``` gcs_client and all other accessor users will now interact with an ActorInfoAccessorInterface object. Next, in order to make this injectable, we need to pull away the inline concretion instantiation going on in gcs_client.connect(). To do this, we are going to create a new factory interface called AccessorFactoryInterface, which is included in this PR. There is one other class/abstraction we need to introduce. Ideally, in order to make this pluggable, we need separated build targets between interfaces, implementations, and gcs_client. Before, everything was bundled together. But since we're moving things out, we need to break the circular dependency between different accessors and the gcs_client. The main reason the gcs_client passes itself into each accessor is that today these different accessors try to share a subscriber and an rpc_client (and potentially more in the future). So we introduce a new interface to break this cycle: the GcsClientContext. The intention of this object is to be a "grab bag" of objects that are needed by accessor implementations. ray-project#54805 --------- Signed-off-by: zac <zac@anyscale.com>
… deleted (ray-project#58605) The tests that exercised actor failures when actors go out of scope, such as `test_actor_ray_shutdown_called_on_del` and `test_actor_ray_shutdown_called_on_scope_exit`, [were flaky](https://buildkite.com/ray-project/postmerge/builds/14336#019a7abe-73d3-46e0-8dc2-13351e12b7c3/613-1919). This PR fixes the flakiness by ensuring actors use graceful shutdown when GCS polling detects that actor refs are deleted. **Problem** When actors go out of scope, GCS uses two mechanisms to detect reference deletion: 1. Push model (`GcsActorManager::HandleReportActorOutOfScope`) - already fixed in ray-project#57090 2. Pull model (`GcsActorManager::PollOwnerForActorRefDeleted`) - was still using force kill The pull model was calling `DestroyActor(..., force_kill=true)`, which skips `__ray_shutdown__` and immediately terminates the actor. This created a race condition: whichever mechanism completed first determined whether cleanup callbacks ran, causing test flakiness. To fix the issue, `PollOwnerForActorRefDeleted` now uses graceful shutdown with a timeout (same as `HandleReportActorOutOfScope`). I ran all the actor failure tests that exercise this shutdown path 20 times locally; where they previously failed 3/20, they succeeded every time after the fix. Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ray-project#57982) ## Description Currently, when running ray start --block, Ray prints the exit code of subprocesses such as the raylet only to stdout. If users don't redirect or save the console output, this diagnostic information is lost. ## Related issues Closes ray-project#57941 ## Additional information <img width="1607" height="228" alt="image" src="https://github.com/user-attachments/assets/9042e750-5cc2-4f2e-882d-ced114bdfe67" /> - Added a persistent log file ray_process_exit.log under the node's logs directory. - On unexpected subprocess termination, exit codes are now: - Printed to stdout as before. - Appended to ray_process_exit.log for later inspection. --------- Signed-off-by: Dongjun Na <kmu5544616@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
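A minimal sketch of the dual-sink behavior (helper name and record format hypothetical; only the `ray_process_exit.log` filename comes from the description):

```python
import os
import tempfile

def record_exit(logs_dir, proc_name, exit_code):
    """Mirror a subprocess exit record to stdout and a persistent log file."""
    line = f"{proc_name} exited with code {exit_code}\n"
    print(line, end="")  # stdout behavior preserved, as before
    with open(os.path.join(logs_dir, "ray_process_exit.log"), "a") as f:
        f.write(line)  # appended so the record survives the console session

logs_dir = tempfile.mkdtemp()
record_exit(logs_dir, "raylet", -11)
record_exit(logs_dir, "gcs_server", 1)
with open(os.path.join(logs_dir, "ray_process_exit.log")) as f:
    content = f.read()
```

Appending rather than truncating means repeated crashes across a node's lifetime accumulate into one inspectable history.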
## Description
Before this PR, the metrics would follow this path:

1. `StreamingExecutor` collects metrics per operator.
2. `_StatsManager` creates a thread to export metrics.
3. `StreamingExecutor` sends metrics to `_StatsManager`, which performs a copy and holds a `_stats_lock`.
4. The stats thread reads the metrics sent in 3).
5. The stats thread sleeps every 5-10 seconds before exporting metrics to `_StatsActor`. These metrics come in two forms: iteration and execution metrics.

I believe the purpose of the stats thread created in 2) was two-fold:

- Don't export stats very frequently.
- Don't export iteration and execution stats separately (have them sent in the same RPC call).

However, this creates a lot of complexity (handling idle threads, etc.) and also makes it harder to support histogram metrics, which need to copy an entire list of values. See ray-project#57851 for more details. By removing the stats thread in 2), we reduce management complexity and avoid wasteful copying of metrics. The downside is that iteration and execution metrics are now sent separately, increasing the number of RPC calls. I don't think this is a concern, because the updates to the `_StatsActor` were already happening asynchronously, and we can also tweak the update interval.

~~It's important to note that `_stats_lock` still lives on to update the last timestamps of each dataset. See * below for more details.~~

Now the new flow is:

1. `StreamingExecutor` collects metrics per operator.
2. `StreamingExecutor` checks the last time `_StatsActor` was updated. If more than a default 5 seconds has passed since the last update, we send metrics to `_StatsActor` through the `_StatsManager` and then refresh the last-updated timestamp. See * below for a caveat.

~~\*[important] Ray Data supports running multiple datasets concurrently, so I must keep track of each dataset's last-updated timestamp. `_stats_lock` is used to update that dictionary[dataset, last_updated] safely on `register_dataset` and on `shutdown`. On update, we don't require the lock because it does not change the dictionary's size. If we want to remove the lock entirely, I can think of 2 workarounds:~~

1. ~~Create a per-dataset `StatsManager`. Pros: no thread lock. Cons: many more code changes. The iteration metrics go through a separate code path that is independent of the streaming executor, which makes this more challenging.~~
2. ~~Update on every unix_epoch_timestamp % interval == 0, so that at 12:00, 12:05, etc., the updates land on that interval. Pros: easy to implement and stateless. Cons: breaks down for slower streaming executors.~~
3. I removed the lock by keeping the state in 2 places: `BatchIterator` and `StreamingExecutor`.

I also verified that ray-project#55163 still solves the original issue.

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
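The timestamp-based throttle in the new flow can be sketched as below. Class and method names are illustrative, not Ray's actual API; the point is that the decision to push metrics is a lock-free check of a per-dataset last-updated timestamp.

```python
import time


class MetricsThrottle:
    """Decide, per dataset, whether enough time has passed to push
    metrics to the stats actor (illustrative sketch, not Ray's code)."""

    def __init__(self, interval_s: float = 5.0, clock=time.monotonic):
        self._interval_s = interval_s
        self._clock = clock  # injectable for testing
        self._last_update = {}  # dataset_id -> last push timestamp

    def should_update(self, dataset_id: str) -> bool:
        now = self._clock()
        last = self._last_update.get(dataset_id)
        if last is None or now - last >= self._interval_s:
            # Record the push time so subsequent calls within the
            # interval are suppressed.
            self._last_update[dataset_id] = now
            return True
        return False
```

Because each executor only ever touches its own dataset's entry, no lock is needed on the hot path; only registration/shutdown mutate the dictionary's membership.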
…58601)

## Description
Now on `write()` the raw data is written to the underlying Parquet files and the metadata, namely `DataFiles`, is returned. On `on_write_complete()` we commit the transaction. For upsert, the data has to be read back into memory, and we do that in a separate Ray task.

## Related issues

## Additional information

---------

Signed-off-by: Goutam <goutam@anyscale.com>
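The write/commit split described above is a standard two-phase pattern. A minimal sketch, with names and shapes that are assumptions rather than Ray's actual Datasink API:

```python
class TwoPhaseSink:
    """Illustrative two-phase sink: write() stages per-task metadata,
    on_write_complete() commits everything in one transaction."""

    def __init__(self):
        self._staged = []
        self.committed = []

    def write(self, block):
        # Phase 1: persist the raw data and return its metadata
        # (analogous to the Iceberg DataFiles mentioned above).
        metadata = {"rows": len(block)}
        self._staged.append(metadata)
        return metadata

    def on_write_complete(self):
        # Phase 2: commit all staged metadata atomically.
        self.committed.extend(self._staged)
        self._staged.clear()
```

Keeping the commit in `on_write_complete()` means a failed write task leaves no partially visible table state: nothing is committed until every task's metadata has been collected.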
## Description
Adds Ray Data metrics documentation for visibility. This should be periodically updated with the latest metrics.

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ay-project#58466)

Running multimodal inference tests on Python 3.10.

Successful release test runs: https://buildkite.com/ray-project/release/builds/66846#019a60d7-521e-414a-b1ab-0d58b7d8074e

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Deprecate the `raw_metrics` API and replace all its uses with the `raw_metric_timeseries` API. `raw_metrics` returns the current snapshot of a set of metrics, while `raw_metric_timeseries` returns the full time series. The latter is more reliable when checking the latest instance of several independent metrics.

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…eterministic (ray-project#58631)

## Description
N/A

## Related issues
Fixes ray-project#58560

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
so that it does not need or create project config files; we just need the self-contained binary

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Description
This PR supports returning a generator object from a `map_groups` UDF. If the UDF has a large output, we can return an iterator to reduce memory cost.

## Related issues
Close ray-project#57935

## Additional information
This change centers on the `_apply_udf_to_groups` helper function within ray/data/grouped_data.py. `map_groups` internally calls `map_batches`, providing a wrapper function (`wrapped_fn`) that in turn calls `_apply_udf_to_groups` to apply the user's UDF to each group. The key modification is that instead of directly yielding the UDF's return value, the logic now inspects the result first. If the result is an `Iterator`, it is consumed with `yield from` to produce each data batch individually. If it is not an iterator, the single data batch is yielded directly, preserving the original behavior.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
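The dispatch described above can be sketched as follows. This is a simplified stand-in for `_apply_udf_to_groups`, not the actual Ray code:

```python
from collections.abc import Iterator


def apply_udf_to_group(udf, group):
    """Apply a UDF to one group, streaming its output if it returns
    an iterator/generator (simplified sketch of the new behavior)."""
    result = udf(group)
    if isinstance(result, Iterator):
        # Generator UDF: emit each batch as it is produced, so the
        # whole output never has to be materialized at once.
        yield from result
    else:
        # Plain return value: preserve the original single-yield behavior.
        yield result
```

A generator function's return value is an `Iterator`, so generator-style UDFs take the streaming path while ordinary UDFs are unaffected.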
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file sizes, which returns results in completion order rather than submission order. This scrambled the file sizes list so it no longer corresponded to the input URI order. When multiple URI columns are used, `_estimate_nrows_per_partition` calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which assumes file sizes from different columns align by row index. The scrambled ordering caused file sizes from different rows to be incorrectly combined, producing incorrect partition size estimates.

## Changes
- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission order
- Place results at their correct positions regardless of completion order
- Add an assertion to verify the list length matches the expected size

## Related issues
ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
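The fix pattern listed in the changes above (pre-allocation plus a future-to-index map) can be sketched generically. Function names here are illustrative, not the actual `_sample_sizes` code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sizes_in_submission_order(uris, fetch_size):
    """Fetch sizes concurrently but return them in input order:
    as_completed() yields in completion order, so each result is
    placed at its original index instead of being appended."""
    file_sizes = [None] * len(uris)  # pre-allocate to the input length
    with ThreadPoolExecutor() as pool:
        future_to_index = {
            pool.submit(fetch_size, uri): i for i, uri in enumerate(uris)
        }
        for future in as_completed(future_to_index):
            file_sizes[future_to_index[future]] = future.result()
    assert len(file_sizes) == len(uris)
    return file_sizes
```

Appending inside the `as_completed()` loop is the bug the PR fixes; indexed assignment makes the result independent of which future finishes first.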
## Why are these changes needed?
Adds a `version` arg to `read_delta_lake` to support reading from a specific version.

## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
it uses `enum.Enum` values that are not deepcopy-able

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
The pull request #678 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5397.
Summary of Changes: This pull request is an automated daily merge from the `master` branch into `main`.
Code Review
This pull request is an automated daily merge from master to main. It contains a large number of changes, primarily focused on refactoring the CI/CD pipelines and Bazel build system. Key changes include moving to uv for dependency management, modularizing Bazel build files, refactoring Docker image build processes, and improving the test selection logic. I've identified a potential issue with error handling in one of the C++ files.
```cpp
memory_store_->Put(
    ::ray::RayObject(buffer, nullptr, std::vector<rpc::ObjectReference>()),
    object_id,
    /*has_reference=*/false);
```
The status check for `memory_store_->Put` has been removed. While the method signature might have changed to `void` and now throws exceptions on error, the `GetRaw` method in this same file still performs a status check on the `Status` object returned from `memory_store_->Get`. This inconsistency suggests that error handling might have been unintentionally removed here. Please verify whether `memory_store_->Put` can fail and, if so, how errors are propagated. If it can fail without throwing, the status check should be restored.
This Pull Request was created automatically to merge the latest changes from the `master` branch into the `main` branch.

- Created: 2025-11-17
- Merge direction: `master` → `main`
- Triggered by: Scheduled
Please review and merge if everything looks good.