daily merge: master → main 2025-11-26 #689
base: main
Conversation
be consistent with doc build environment Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
migrating all doc related things to run on python 3.12 Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
excluding `*_tests` directories for now to reduce the impact Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
using `bazelisk run //java:gen_ray_java_pkg` everywhere Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This PR adds 2 new metrics to core_worker by way of the reference counter. The two new metrics track the count and size of objects owned by the worker, as well as their states. States are defined as:
- **PendingCreation**: An object that is pending creation and hasn't finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't spilled
- **Spilled**: An object which has an assigned node address and is spilled
- **InMemory**: An object which has no assigned address but isn't pending creation (and therefore must be local)

The approach used by these new metrics is to examine the state before and after any mutation of the reference in the reference_counter. This is required in order to do the appropriate bookkeeping (decrementing some values and incrementing others). Admittedly, there is potential for miscounting in between the decrements/increments, depending on when the RecordMetrics loop runs. This unfortunate side effect, however, seems preferable to doing mutual exclusion with metric collection, as this is potentially a high-throughput code path. In addition, performing live counts seemed preferable to doing a full accounting of the object store and across all references at metric-collection time: the reference counter may be tracking millions of objects, so each metric scan could be very expensive. So running the incremental accounting (despite being potentially inaccurate for short periods) seemed the right call.

This PR also allows an object's size to change due to potentially non-deterministic instantiation (say an object is initially created, but its primary copy dies, and then the recreation fails). This is an edge case, but seems important for completeness' sake.

--------- Signed-off-by: zac <zac@anyscale.com>
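The before/after bookkeeping described above can be sketched in Python (hypothetical names and shapes; the real implementation lives in the C++ reference counter):

```python
from collections import Counter
from enum import Enum


class ObjectState(Enum):
    PENDING_CREATION = "PendingCreation"
    IN_PLASMA = "InPlasma"
    SPILLED = "Spilled"
    IN_MEMORY = "InMemory"


class OwnedObjectMetrics:
    """Toy sketch of before/after state bookkeeping (not Ray's C++ code)."""

    def __init__(self):
        self.count_by_state = Counter()
        self.bytes_by_state = Counter()

    def on_mutation(self, before, after, size_before, size_after):
        # Decrement the old bucket and increment the new one, so the totals
        # stay consistent without rescanning every tracked reference.
        if before is not None:
            self.count_by_state[before] -= 1
            self.bytes_by_state[before] -= size_before
        if after is not None:
            self.count_by_state[after] += 1
            self.bytes_by_state[after] += size_after


m = OwnedObjectMetrics()
m.on_mutation(None, ObjectState.PENDING_CREATION, 0, 0)  # object created, sizeless
m.on_mutation(ObjectState.PENDING_CREATION, ObjectState.IN_PLASMA, 0, 1024)  # sealed
```

A metrics loop reading `count_by_state`/`bytes_by_state` between the decrement and the increment would see a transiently off-by-one value, which is the trade-off the PR description accepts.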
to 0.21.0; supports wanda priority now. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…8498) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…#58286) ## Description Predicate pushdown (ray-project#58150) in conjunction with this PR should speed up reads from Iceberg. Once the above change lands, we can add the pushdown interface support for IcebergDatasource. --------- Signed-off-by: Goutam <goutam@anyscale.com>
## Description
* Does the work to bump pydoclint up to the latest version
* And allowlists any new violations it finds

## Related issues
n/a

## Additional information
n/a

--------- Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>
fix pattern_async_actor demo typo. Add `self.`. --------- Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com>
…hboard agent (ray-project#58405) Add a gRPC service interceptor to intercept all dashboard agent RPC calls and validate the presence of the auth token (when auth mode is token). --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
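The token check the interceptor performs can be sketched as follows (a hypothetical helper and metadata key, not Ray's actual dashboard-agent code):

```python
# Hedged sketch of token-mode auth validation. The metadata-dict shape and
# the "authorization: Bearer <token>" convention are assumptions.
def is_request_authorized(metadata, expected_token, auth_mode="token"):
    """Allow the call if token auth is disabled or the bearer token matches."""
    if auth_mode != "token":
        return True  # token auth not enabled: every call passes
    return metadata.get("authorization") == f"Bearer {expected_token}"
```

In a real gRPC server interceptor, `metadata` would come from the call's invocation metadata, and a failed check would terminate the RPC with `UNAUTHENTICATED`.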
…tests (ray-project#58528) The auth token test setup in `conftest.py` is breaking macOS tests. There are two test scripts (`test_microbenchmarks.py` and `test_basic.py`) that run after the wheel is installed but without editable mode. For these tests to pass, `conftest.py` cannot import anything under `ray.tests`. This PR moves `authentication_test_utils` into `ray._private` to fix the issue. Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com>
This PR enables OpenTelemetry as the default backend for the Ray metric stack. The bulk of this PR is actually fixing tests that were written with assumptions that no longer hold. For ease of reviewing, I inline the reason for each change together with the change in the comments. This PR also depends on a release of vLLM (so that we can update the minimal supported version of vLLM in Ray). Test: - CI

> [!NOTE]
> Enable OpenTelemetry metrics backend by default and refactor metrics/Serve tests to use timeseries APIs and updated `ray_serve_*` metric names.
>
> - **Core/Config**:
>   - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to `true` in `ray_constants.py` and `ray_config_def.h`.
>   - Metrics `Counter`: use `CythonCount` by default; keep legacy `CythonSum` only when OTEL is explicitly disabled.
> - **Serve/Metrics Tests**:
>   - Replace text scraping with `PrometheusTimeseries` and `fetch_prometheus_metric_timeseries` throughout.
>   - Update metric names/tags to `ray_serve_*` and counter suffixes `*_total`; adjust latency metric names and processing/queued gauges.
>   - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and pass it through helpers.
> - **General Test Fixes**:
>   - Remove OTEL parametrization/fixtures; simplify expectations where counters-as-gauges no longer apply; drop related tests.
>   - Cardinality tests: include `"low"` level and remove OTEL gating; stop injecting `enable_open_telemetry` in system config.
>   - Actor/state/thread tests: migrate to cluster fixtures, wait for the dashboard agent, and adjust expected worker thread counts.
> - **Build**:
>   - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env from the C++ stats test.

--------- Signed-off-by: Cuong Nguyen <can@anyscale.com>
…mmended (ray-project#57726)

## Description
If users schedule a detached actor into a placement group, the Raylet will kill the actor when the placement group is removed. If the actor is restartable, it will be stuck in the `RESTARTING` state forever until users explicitly kill it. In that case, if users call `get_actor` with the actor's name, it still returns the restarting actor, but no process exists. The actor will never be restarted because the PG is gone, and no PG with the same ID will be created during the cluster's lifetime.

The better behavior would be for Ray to transition a task/actor's state to dead when it is impossible to restart. However, this would add too much complexity to the core, so I don't think it's worth it. Therefore, this PR adds a warning log, and users should use detached actors or PGs correctly.

Example: Run the following script and then run `ray list actors`.
```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.placement_group import placement_group, remove_placement_group


@ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
class Actor:
    pass


ray.init()
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())
actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()
ray.get(actor.__ray_ready__.remote())
```

## Types of change
- [x] Enhancement

## Checklist
**Does this PR introduce breaking changes?**
- [x] No

**Testing:**
- [x] Tested the changes manually

**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

--------- Signed-off-by: Kai-Hsun Chen <khchen@x.ai> Signed-off-by: Robert Nishihara <robertnishihara@gmail.com> Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…58182) Signed-off-by: dayshah <dhyey2019@gmail.com>
…y-project#57715) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
…-project#56783) Signed-off-by: dayshah <dhyey2019@gmail.com>
The python test step is failing on master now because of this. Probably a logical merge conflict.

```
FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
...
[2025-11-11T22:11:54Z] from ray.tests.authentication_test_utils import (
[2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
```

Signed-off-by: dayshah <dhyey2019@gmail.com>
be consistent with the default build environment Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#58543) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
## Description
- Renamed the `RAY_auth_mode` environment variable to `RAY_AUTH_MODE` across the codebase
- Excluded healthcheck endpoints from authentication for Kubernetes compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale tokens when switching clusters

--------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ls (ray-project#58424)

## Description
- Use a client interceptor to add auth tokens to gRPC calls when `AUTH_MODE=token`
- `BuildChannel()` will automatically include the interceptor
- Removed the `auth_token` parameter from `ClientCallImpl`
- Removed manual auth from `python_gcs_subscriber.cc`
- Added tests to verify auth works for autoscaler APIs

--------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…`) (ray-project#57090) When actors terminate gracefully, Ray calls the actor's `__ray_shutdown__()` method if defined, allowing for cleanup of resources. But this is not invoked when an actor goes out of scope due to `del actor`.

### Why `del actor` doesn't invoke `__ray_shutdown__`
I traced through the entire code path; here's what happens when `del actor` is called:
1. **Python side**: `ActorHandle.__del__()` -> `worker.core_worker.remove_actor_handle_reference(actor_id)` https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040
2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` -> `reference_counter_->RemoveLocalReference()` - when the ref count reaches 0, this triggers the `OnObjectOutOfScopeOrFreed` callback https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506
3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` -> `AsyncReportActorOutOfScope()` to GCS https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183 https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51
4. **GCS receives notification**: `HandleReportActorOutOfScope()` - **THE PROBLEM IS HERE** ([line 279 in `src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
```cpp
DestroyActor(actor_id,
             GenActorOutOfScopeCause(actor),
             /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
             [reply, send_reply_callback]() {
```
5. **Actor worker receives kill signal**: `HandleKillActor()` in [`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
```cpp
if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
  ForceExit(...)  // Skips __ray_shutdown__
} else {
  Exit(...)  // Would call __ray_shutdown__
}
```
6. **ForceExit path**: Bypasses graceful shutdown -> no `__ray_shutdown__` callback is invoked.

This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE actors. Also updated the docs.

--------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
Currently, a node is considered idle while pulling objects from the remote object store. This can lead to situations where a node is terminated as idle, causing the cluster to enter an infinite loop when pulling large objects that exceed the node idle termination timeout. This PR fixes the issue by treating object pulling as a busy activity. Note that nodes can still accept additional tasks while pulling objects (since pulling consumes no resources), but the auto-scaler will no longer terminate the node prematurely. Closes ray-project#54372 Test: - CI Signed-off-by: Cuong Nguyen <can@anyscale.com>
…_FACTOR` to 2 (ray-project#58262)

## Description
This was setting the value to align with the previous default of 4. However, after some consideration I've realized that 4 is too high, so this lowers it to 2.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ray-project#58504) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…y-project#58523)

## Description
This PR improves documentation consistency in the `python/ray/data` module by converting all remaining rST-style docstrings (`:param:`, `:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).

## Additional information
**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted `StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted `unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()` and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` - Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…oject#58549)

## Description
The original `test_concurrency` function combined multiple test scenarios into a single test with complex control flow and expensive Ray cluster initialization. This refactoring extracts the parameter validation tests into focused, independent tests that are faster, clearer, and easier to maintain. Additionally, the original test included "validation" cases that tested valid concurrency parameters but didn't actually verify that concurrency was being limited correctly; they only checked that the output was correct, which isn't useful for validating the concurrency feature itself.

**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and `test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to `ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element, since we're only validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior

## Additional information
The validation tests now execute significantly faster and provide clearer failure messages. Each test has a single, well-defined purpose, making maintenance and debugging easier.

--------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Previously it was actually using 0.4.0, which is set up by the grpc repo; the declaration in the workspace file was being shadowed. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…roject#58864)

## Description
### [Data] Fix obj_store_mem_max_pending_output_per_task reporting
Fix `obj_store_mem_max_pending_output_per_task` when a sample is unavailable to factor in: `bytes_per_output` = `MAX_SAFE_BLOCK_SIZE_FACTOR` * `target_max_block_size`.

--------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
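The fallback formula above can be sketched like this (function and parameter names are assumptions, and the factor of 2 is taken from the `MAX_SAFE_BLOCK_SIZE_FACTOR` default set in ray-project#58262):

```python
# Hypothetical sketch of the no-sample fallback described above.
MAX_SAFE_BLOCK_SIZE_FACTOR = 2


def estimate_bytes_per_output(sampled_bytes_per_output, target_max_block_size):
    """Use the sampled size when available; otherwise fall back to a
    conservative multiple of the target block size."""
    if sampled_bytes_per_output is None:  # no sample available yet
        return MAX_SAFE_BLOCK_SIZE_FACTOR * target_max_block_size
    return sampled_bytes_per_output
```

Returning a positive estimate instead of a sentinel like -1 is what keeps the downstream memory accounting from going negative.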
…te matching (ray-project#58927) The correct route value is already part of RequestMetadata after ray-project#58180, so there is no need to recompute it. No observed perf diff in the microbenchmark.

After
```
Type  Name                 # Requests  # Fails  Median (ms)  95%ile (ms)  99%ile (ms)  Average (ms)  Min (ms)  Max (ms)  Average size (bytes)  Current RPS  Current Failures/s
GET   /echo?message=hello  28068       0        200          410          470          228.27        80        592      26                    430.3        0
      Aggregated           28068       0        200          410          470          228.27        80        592      26                    430.3        0
```

Before
```
Type  Name                 # Requests  # Fails  Median (ms)  95%ile (ms)  99%ile (ms)  Average (ms)  Min (ms)  Max (ms)  Average size (bytes)  Current RPS  Current Failures/s
GET   /echo?message=hello  27427       0        210          410          470          232.12        76        604      26                    429.7        0
      Aggregated           27427       0        210          410          470          232.12        76        604      26                    429.7        0
```

Additionally, the old implementation wrongly assumed that there would only be one method (GET, PUT) corresponding to a route. This PR fixes that assumption and adds tests for it.

--------- Signed-off-by: abrar <abrar@anyscale.com>
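A minimal sketch of route matching that allows multiple HTTP methods per route (a hypothetical structure, not Serve's actual code):

```python
# Keying handlers by (route, method) rather than by route alone means
# GET and PUT on the same path resolve to different handlers.
class RouteTable:
    def __init__(self):
        self._handlers = {}  # route -> {method -> handler}

    def add(self, route, method, handler):
        self._handlers.setdefault(route, {})[method.upper()] = handler

    def match(self, route, method):
        # Returns None when the route exists but the method doesn't.
        return self._handlers.get(route, {}).get(method.upper())


table = RouteTable()
table.add("/echo", "GET", lambda: "get-echo")
table.add("/echo", "PUT", lambda: "put-echo")
```

Under the one-method-per-route assumption the second `add` would have silently replaced the first handler.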
Signed-off-by: irabbani <irabbani@anyscale.com>
## Description
### [Data] Add iter_prefetched_blocks stats
Report prefetched bytes per iterator as stats.

--------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: Srinath Krishnamachari <68668616+srinathk10@users.noreply.github.com>
…58299) This PR replaces STATS with Metric as the way to define metrics inside Ray (a unification effort) in all common components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, in this case, because the common component is used as an API across components, doing so would be unnecessarily cumbersome, so I decided to define the metrics inline within each client and server class instead. Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities.

**Details** (full context of this refactoring work):
- Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component.
- In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface.
- This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding.
- There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point.

Test: - CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
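The injection pattern described in the details can be rendered in Python roughly like this (illustrative names only; the real code is C++ passing `ray::observability::MetricInterface` down from each component's entry point):

```python
# Subcomponents depend on a narrow metric interface; the component entry
# point owns the concrete metric objects and injects them.
class MetricInterface:
    def record(self, value, tags=None):
        raise NotImplementedError


class InMemoryGauge(MetricInterface):
    """Stand-in for an OpenTelemetry-backed gauge."""

    def __init__(self):
        self.last = None

    def record(self, value, tags=None):
        self.last = (value, tags or {})


class SubComponent:
    def __init__(self, queue_size_metric: MetricInterface):
        self._queue_size = queue_size_metric  # injected, not defined here

    def tick(self, depth):
        self._queue_size.record(depth)


gauge = InMemoryGauge()       # defined once at the "entry point"
SubComponent(gauge).tick(7)   # handed down to the subcomponent
```

Because subcomponents only see the interface, changing the metric backend rebuilds just the entry points, which is the rebuild-time win the PR describes.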
ray-project#58710)

## Description
ray-project#58711 decreased the scale of the `map_groups` tests from scale-factor 100 to scale-factor 10 because some of the `map_groups` release tests were failing. However, after more investigation, I realized that the only variant that doesn't work with scale-factor 100 is the hash shuffle with autoscaling variant (see ray-project#58734). This PR re-increases the scale and only disables the cases that fail.

--------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Summary
This PR removes `test_large_args_scheduling_strategy` from `test_stats.py` because it's flaky and not worth keeping (it tests implementation details rather than behavior and conflates multiple concerns). See https://buildkite.com/ray-project/premerge/builds/54495#019ab720-249f-49c5-8e25-5e9005cc41e2

## Motivation
1. **Hardcodes scheduling strategy values** - The test assumes large args use `'DEFAULT'` and small args use `'SPREAD'`. If these defaults change in `context.py`, the test fails even though the system is working correctly.
2. **Tests stats format, not scheduling behavior** - The test doesn't verify that the correct scheduling strategy is actually passed to Ray tasks. It only checks that a specific string appears in stats output.
3. **Mixes two concerns** - The test conflates:
   - Scheduling strategy selection based on data size (belongs in a map-related test)
   - Stats output including scheduling strategy info (belongs in a general stats formatting test)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Why are these changes needed? We introduced an improved error message when environments fail in ray-project#55567. At the same time, this bypasses the silencing of env step errors. This PR consolidates the messages. --------- Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
…-project#58915)

# Description
This PR refactors the `PhysicalOperator` class to eliminate hidden side effects in the `completed()` method. Previously, calling `completed()` could inadvertently modify the internal state of the operator, which could lead to unexpected behavior. This change separates the logic for checking whether the operator is marked as finished from the logic that computes whether it is actually finished. Key changes include:
- Renaming `_execution_finished` to `_is_execution_marked_finished` to clarify its purpose.
- Renaming `execution_finished()` to `has_execution_finished()` and making it a pure computed property without side effects.
- Updating the `completed()` method to call `has_execution_finished()` instead of modifying internal state.
- Ensuring that `mark_execution_finished()` correctly sets the renamed field.

## Related issues
Fixes ray-project#58884

## Additional information
This refactor ensures that both `has_execution_finished()` and `completed()` are pure query methods, allowing them to be called multiple times without altering the state of the operator.

--------- Signed-off-by: Simeet Nayan <simeetnayan.8100@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
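The command/query separation described above, sketched as a toy class (names assumed from the PR description, not Ray's actual operator code):

```python
class PhysicalOperatorSketch:
    """Toy model: mutation happens only in mark_* methods; all checks are pure."""

    def __init__(self):
        self._is_execution_marked_finished = False
        self._inputs_complete = False

    def mark_execution_finished(self):
        # The single place where the "finished" flag is mutated.
        self._is_execution_marked_finished = True

    def mark_inputs_complete(self):
        self._inputs_complete = True

    def has_execution_finished(self) -> bool:
        # Pure query: reading it never changes operator state.
        return self._is_execution_marked_finished

    def completed(self) -> bool:
        # Also pure: composed only of other pure queries.
        return self.has_execution_finished() and self._inputs_complete


op = PhysicalOperatorSketch()
op.mark_inputs_complete()
op.mark_execution_finished()
```

Because `completed()` no longer mutates anything, callers can poll it freely without the ordering bugs that hidden side effects invite.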
## Description The links for APPO were referencing the PPO paper. I updated them to link to the IMPACT paper Signed-off-by: Philipp Schmutz <2059887+pschmutz@users.noreply.github.com>
… completed episodes when sampling a fixed number of episodes (ray-project#58931)

## Description
The `MultiAgentEnvRunner` would previously call the callback twice for the final episode of a batch (when sampling a fixed number of episodes). This PR fixes the problem by ensuring that the callback only happens once for the finished episode.

## Related issues
Closes ray-project#55452

--------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
## Description
When the Autoscaler receives a resource request and decides which type of node to scale up, only the `UtilizationScore` is considered (that is, Ray tries to avoid launching a large node for a small resource request, which would lead to resource waste). If multiple node types in the cluster have the same `UtilizationScore`, Ray always requests the same node type. In Spot scenarios, cloud resources change dynamically, so we want the Autoscaler to be aware of cloud resource availability: if a certain node type becomes unavailable, the Autoscaler should be able to automatically switch to requesting other node types. In this PR, I added the `CloudResourceMonitor` class, which records node types that have failed resource allocation and, in future scaling events, reduces the weight of these node types.

## Related issues
Related to ray-project#49983 Fixes ray-project#53636 ray-project#39788 ray-project#39789

## Implementation details
1. `CloudResourceMonitor`: This is a subscriber of instances. When an instance reaches the `ALLOCATION_FAILED` status, `CloudResourceMonitor` records the node_type and lowers its availability score.
2. `ResourceDemandScheduler`: This class determines how to select the best node_type to handle a resource request. I modified the part that selects the best node type:
```python
# Sort the results by score.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True,
)
```
The sorting includes:
2.1. UtilizationScore: to maximize resource utilization.
2.2. Cloud resource availabilities: prioritize node types with the most available cloud resources, in order to minimize allocation failures.

--------- Signed-off-by: xiaowen.wxw <wxw403883@alibaba-inc.com> Co-authored-by: xiaowen.wxw <wxw403883@alibaba-inc.com>
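A runnable mini-version of that tie-breaking sort, with hypothetical node types and scores:

```python
# Two node types tie on UtilizationScore; the availability score (lowered
# after an allocation failure) breaks the tie in favor of the healthy type.
candidates = [
    {"node_type": "spot-a", "score": 0.9},
    {"node_type": "spot-b", "score": 0.9},  # same UtilizationScore as spot-a
    {"node_type": "large", "score": 0.2},
]
availability = {"spot-a": 0.1}  # spot-a recently failed allocation

best_first = sorted(
    candidates,
    key=lambda r: (r["score"], availability.get(r["node_type"], 1.0)),
    reverse=True,
)
```

Without the availability component in the key, `spot-a` and `spot-b` would sort arbitrarily and the autoscaler could keep retrying the failing type.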
this is for kuberay 1.5.1 release, for ray auth token mode Docs link: https://anyscale-ray--58885.com.readthedocs.build/en/58885/cluster/getting-started.html --------- Signed-off-by: Future-Outlier <eric901201@gmail.com>
…t events (ray-project#58953) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
## Why are these changes needed?
The memory leak being tested ([apache/arrow#45493](apache/arrow#45493)) specifically occurs when inferring types from **ndarray objects**, not from lists containing ndarrays. Testing the `list` case added no value since the leak doesn't manifest there; it only added execution time and obscured the test's purpose.

More importantly, the previous 1 MiB threshold was too tight and caused flaky failures. Memory measurements via RSS are inherently noisy due to OS-level allocation behavior, garbage collection timing, and memory fragmentation. A test that occasionally uses 1.1 MiB would fail despite no actual leak.

The new approach:
- **Calls `_infer_pyarrow_type` 8 times in a loop**, which leaks 1 GiB without Ray Data's workaround (admittedly, 8 is a magic number here)
- **Uses a 64 MiB threshold**, providing a much larger margin above normal variation while still catching any real leak with a clear signal

This creates a much stronger test: if the leak exists, we'd see memory growth approaching 1 GiB (with repeated runs), making failures unambiguous. Meanwhile, normal RSS fluctuations of a few MiB won't trigger false positives.

--------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
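The amplify-and-threshold idea can be sketched like this (a hypothetical helper, not the actual test; the memory probe is injected so the sketch stays deterministic, whereas a real test would read process RSS):

```python
import gc

MB = 1024 * 1024


def check_for_leak(fn, measure, repeats=8, threshold_bytes=64 * MB):
    """Amplify a suspected leak by calling `fn` repeatedly, then compare
    memory growth against a generous threshold.

    `measure` is a zero-arg callable returning current memory use in bytes.
    """
    gc.collect()
    before = measure()
    for _ in range(repeats):
        fn()
    gc.collect()
    growth = measure() - before
    assert growth < threshold_bytes, f"possible leak: grew {growth} bytes"
    return growth
```

Repetition makes a real leak grow far past the threshold while leaving a wide margin for ordinary RSS noise, which is exactly the trade-off the PR argues for.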
## Description
Based on the comment here: ray-project#58630 (comment)

The current `IssueDetector` base class requires all its subclasses to take the `StreamingExecutor` as an argument, making them hard to mock and test because we have to mock all of StreamingExecutor. In this PR, we did the following:
1. Removed the constructor in the `IssueDetector` base class and added `from_executor()` to set up the class based on the executor
2. Refactored subclasses of `IssueDetector` to use this format

## Related issues
Related to ray-project#58562

--------- Signed-off-by: machichima <nary12321@gmail.com>
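The `from_executor()` pattern might look roughly like this (illustrative names only, not Ray Data's actual classes):

```python
# A plain constructor taking only the data the detector needs, plus a
# classmethod that extracts that data from the heavyweight executor.
class IssueDetectorSketch:
    def __init__(self, op_names):
        self.op_names = op_names

    @classmethod
    def from_executor(cls, executor):
        # Only this factory touches the executor; tests can skip it entirely.
        return cls(op_names=[op.name for op in executor.operators])


class FakeOp:
    def __init__(self, name):
        self.name = name


class FakeExecutor:
    operators = [FakeOp("ReadParquet"), FakeOp("Map")]


detector = IssueDetectorSketch.from_executor(FakeExecutor())
```

Tests that don't care about the executor can call `IssueDetectorSketch(["Map"])` directly, which is the mocking win the PR describes.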
## Description
`asv.conf.json` appears to be a legacy benchmarking file in `python` and `rllib` that hasn't been modified in 5 years. Core benchmarks run nightly, and RLlib is moving to the same approach; therefore, this file shouldn't be necessary anymore. Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
## Description `test_backpressure_e2e` occasionally fails without any traceback or warning message: ``` [2025-11-24T21:42:12Z] ==================== Test output for //python/ray/data:test_backpressure_e2e: -- [2025-11-24T21:42:12Z] /opt/miniforge/lib/python3.12/site-packages/paramiko/pkey.py:82: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0. [2025-11-24T21:42:12Z] "cipher": algorithms.TripleDES, [2025-11-24T21:42:12Z] /opt/miniforge/lib/python3.12/site-packages/paramiko/transport.py:253: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0. [2025-11-24T21:42:12Z] "class": algorithms.TripleDES, [2025-11-24T21:42:12Z] ============================= test session starts ============================== [2025-11-24T21:42:12Z] platform linux -- Python 3.12.9, pytest-7.4.4, pluggy-1.3.0 -- /opt/miniforge/bin/python3 [2025-11-24T21:42:12Z] cachedir: .pytest_cache [2025-11-24T21:42:12Z] rootdir: /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray [2025-11-24T21:42:12Z] configfile: pytest.ini [2025-11-24T21:42:12Z] plugins: repeat-0.9.3, anyio-3.7.1, fugue-0.8.7, aiohttp-1.1.0, asyncio-0.17.2, docker-tools-3.1.3, forked-1.4.0, pytest_httpserver-1.1.3, lazy-fixtures-1.1.2, mock-3.14.0, remotedata-0.3.2, rerunfailures-11.1.2, sphinx-0.5.1.dev0, sugar-0.9.5, timeout-2.1.0, typeguard-2.13.3 [2025-11-24T21:42:12Z] asyncio: mode=Mode.AUTO [2025-11-24T21:42:12Z] timeout: 180.0s [2025-11-24T21:42:12Z] timeout method: signal [2025-11-24T21:42:12Z] timeout func_only: False [2025-11-24T21:42:12Z] collecting ... 
collected 12 items [2025-11-24T21:42:12Z] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_large_e2e_backpressure_no_spilling PASSED [ 8%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-3-500] PASSED [ 16%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-4-100] PASSED [ 25%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-3-100] PASSED [ 33%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-3-500] PASSED [ 41%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-4-100] PASSED [ 50%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-3-100] PASSED [ 58%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_resource_contention[False] PASSED [ 66%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_resource_contention[True] PASSED [ 75%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_with_preserve_order PASSED [ 83%] [2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_input_backpressure_e2e PASSED [ 91%]================================================================================ ``` To make this easier to debug, this PR enables the `-s` flag to log more information. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
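For context, pytest's `-s` flag (shorthand for `--capture=no`) streams test stdout/stderr straight to the console instead of capturing it, which is what surfaces extra information when a test dies silently. A minimal repro sketch (the temp path and test name are illustrative, not from the PR):

```shell
# Create a trivial test whose print output pytest normally captures and hides
# on success; rerunning with -s streams it live to the console.
cat > /tmp/test_capture_demo.py <<'EOF'
def test_prints():
    print("backpressure debug output")
EOF

python -m pytest /tmp/test_capture_demo.py      # print is captured and hidden
python -m pytest -s /tmp/test_capture_demo.py   # print appears in the output
```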
…ect#58754) ## Description Previously, if `DataContext.get_current().enable_get_object_locations_for_metrics=False` (the default), we would return `(-1, -1, -1)`. This sentinel wasn't being handled properly, so we would emit negative metrics. This PR addresses that.

This PR also fixes `run_index=-1` for **streaming split**. For all iterators except streaming split, the flow is:
1. Create the executor with the `dataset_tag` from step 2.
2. Increment `dataset_tag`.
3. Get the `dataset_tag` (`dataset_-1`).

However, streaming split skips step 2. This PR addresses that. ## Related issues ## Additional information --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
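A minimal sketch of the sentinel-handling idea (the function name and tuple shape are hypothetical, assuming the locations API returns `(-1, -1, -1)` when collection is disabled):

```python
# Hypothetical guard: map the disabled-collection sentinel (-1, -1, -1) to
# zeros before reporting, so dashboards never see negative metric values.
def sanitize_location_metrics(locations):
    if all(v == -1 for v in locations):
        return (0, 0, 0)
    return locations

print(sanitize_location_metrics((-1, -1, -1)))  # (0, 0, 0)
print(sanitize_location_metrics((3, 7, 2)))     # (3, 7, 2)
```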
## Description RLlib is missing nightly testing, making it difficult to track training performance over time. This PR re-enables it, starting with APPO on Atari and MuJoCo environments. I've also removed the AutoROM comment, as ALE no longer uses it to install Atari ROMs. --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ashboard's reporter_head. (ray-project#58978) There is no user of this endpoint in the codebase. Removing it has the added benefit of reducing the surface area of our Cython bindings for GcsClient by removing ActorInfoAccessor::AsyncKillActor. Signed-off-by: irabbani <irabbani@anyscale.com>
The pull request #689 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5512.
Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request represents a significant refactoring and modernization of the project's continuous integration (CI) pipeline, build system, and dependency management. It introduces a more modular CI structure.
Code Review
This PR is a massive and impressive refactoring of the entire CI/CD and build system. The changes introduce better modularity, adopt modern tools like uv and pre-commit, and improve dependency management with the new raydepsets tool. The build process is now more structured with multi-stage Docker builds and pre-built components. The overall direction is excellent and will significantly improve maintainability and developer experience. I've reviewed the changes and have a couple of minor corrections for test cases to align with the new build ID handling logic.
```diff
 with mock.patch("subprocess.check_call", side_effect=_mock_subprocess):
     LinuxTesterContainer("team", build_type="debug")
-    docker_image = f"{_DOCKER_ECR_REPO}:{_RAYCI_BUILD_ID}-team"
+    docker_image = f"{_DOCKER_ECR_REPO}:team"
```
The `_RAYCI_BUILD_ID` is set to `a1b2c3d4` in the test setup, and the `get_docker_image` utility function prepends it to the Docker tag. The expected image name here should therefore include the build ID to match the implementation.
```suggestion
docker_image = f"{_DOCKER_ECR_REPO}:{os.environ.get('RAYCI_BUILD_ID')}-team"
```
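To illustrate the reviewer's point, here is a sketch of tag construction that prepends the build ID. The utility name and its exact fallback behavior are assumptions based on the review comment, not the actual implementation:

```python
import os

# Assumed behavior of a get_docker_image-style utility: prepend RAYCI_BUILD_ID
# to the tag when it is set; fall back to the bare tag otherwise.
def get_docker_image(repo: str, tag: str) -> str:
    build_id = os.environ.get("RAYCI_BUILD_ID", "")
    return f"{repo}:{build_id}-{tag}" if build_id else f"{repo}:{tag}"

os.environ["RAYCI_BUILD_ID"] = "a1b2c3d4"
print(get_docker_image("ecr/repo", "team"))  # ecr/repo:a1b2c3d4-team
```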
| "C:\\rayci", | ||
| "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:unknown-test", | ||
| "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:test", | ||
| "bash", |
The _get_docker_image method in WindowsContainer will produce an image tag with a leading hyphen (e.g., ...:-test) when RAYCI_BUILD_ID is empty, which is its new default. This test seems to expect the hyphen to be absent. The test should be updated to reflect the actual output. A better long-term fix would be to update WindowsContainer._get_docker_image to use the shared get_docker_image utility, which handles empty build IDs gracefully.
| "bash", | |
| "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:-test", |
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.

📅 Created: 2025-11-26
🔀 Merge direction: `master` → `main`
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.