Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-26
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits November 10, 2025 14:00
be consistent with doc build environment

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
migrating all doc related things to run on python 3.12

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
excluding `*_tests` directories for now to reduce the impact

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
using `bazelisk run //java:gen_ray_java_pkg` everywhere

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics keep track of the count and size of objects
owned by the worker, as well as their states. States are
defined as:

- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)

The approach used by these new metrics is to examine the state 'before
and after' any mutation on the reference in the reference_counter. This
is required in order to do the appropriate bookkeeping (decrementing
some values and incrementing others). Admittedly, there is potential for
momentarily inconsistent counts in between the decrements/increments,
depending on when the RecordMetrics loop runs. This unfortunate side
effect, however, seems preferable to doing mutual exclusion with metric
collection, as this is potentially a high-throughput code path.

In addition, performing live counts seemed preferable to doing full
accounting of the object store and across all references at the time of
metric collection. The reason is that the reference counter may be
tracking millions of objects, so each full metric scan could be very
expensive. Running the live accounting (despite being potentially
inaccurate for short periods) seemed the right call.

This PR also allows an object's size to change due to potentially
non-deterministic instantiation (say an object is initially created,
but its primary copy dies, and then the recreation fails). This is an
edge case, but it seems important for completeness' sake.
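
As an illustration of the before/after bookkeeping pattern described above, here is a minimal Python sketch. The real implementation lives in the C++ reference counter; the class, method, and state names below are illustrative only.

```python
from collections import Counter

# Illustrative state names mirroring the list above.
STATES = ("PendingCreation", "InPlasma", "Spilled", "InMemory")

class OwnedObjectMetrics:
    """Tracks the count and total size of owned objects per state."""

    def __init__(self):
        self.count = Counter()
        self.bytes = Counter()

    def on_reference_mutation(self, before, after, before_size, after_size):
        # Decrement the old bucket and increment the new one, so the periodic
        # RecordMetrics loop only needs to read the current counters.
        if before is not None:
            self.count[before] -= 1
            self.bytes[before] -= before_size
        if after is not None:
            self.count[after] += 1
            self.bytes[after] += after_size

# Example: an object finishes creation and lands in plasma.
metrics = OwnedObjectMetrics()
metrics.on_reference_mutation(None, "PendingCreation", 0, 0)
metrics.on_reference_mutation("PendingCreation", "InPlasma", 0, 1024)
print(metrics.count["InPlasma"], metrics.bytes["InPlasma"])  # 1 1024
```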

---------

Signed-off-by: zac <zac@anyscale.com>
to 0.21.0; supports wanda priority now.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…#58286)

## Description
Predicate pushdown (ray-project#58150) in
conjunction with this PR should speed up reads from Iceberg.


Once the above change lands, we can add the pushdown interface support
for IcebergDatasource

---------

Signed-off-by: Goutam <goutam@anyscale.com>
## Description
* Does the work to bump pydoclint up to the latest version
* And allowlist any new violations it finds

## Related issues
n/a

## Additional information
n/a

---------

Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>
fix pattern_async_actor demo typo. Add `self.`.

---------

Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com>
…hboard agent (ray-project#58405)

Add a gRPC service interceptor to intercept all dashboard agent RPC
calls and validate the presence of the auth token (when auth mode is
token).
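
For reference, here is a minimal sketch of a token-validating server interceptor using the synchronous Python gRPC API. Ray's dashboard agent uses the asyncio server and its own header name, so the header key and class name below are assumptions for illustration only.

```python
import grpc

class TokenAuthInterceptor(grpc.ServerInterceptor):
    """Rejects unary RPCs that do not carry the expected auth token."""

    def __init__(self, expected_token: str):
        self._expected = expected_token

        def deny(request, context):
            context.abort(grpc.StatusCode.UNAUTHENTICATED, "missing or invalid auth token")

        # Only handles unary-unary methods; enough for a sketch.
        self._deny_handler = grpc.unary_unary_rpc_method_handler(deny)

    def intercept_service(self, continuation, handler_call_details):
        metadata = dict(handler_call_details.invocation_metadata)
        if metadata.get("authorization") == self._expected:
            return continuation(handler_call_details)
        return self._deny_handler

# Usage (sketch): grpc.server(executor, interceptors=[TokenAuthInterceptor(token)])
```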

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…tests (ray-project#58528)

The auth token test setup in `conftest.py` is breaking the macOS tests.
There are two test scripts (`test_microbenchmarks.py` and
`test_basic.py`) that run after the wheel is installed but without
editable mode. For these tests to pass, `conftest.py` cannot import
anything under `ray.tests`.

This PR moves `authentication_test_utils` into `ray._private` to fix
this issue.

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
This PR enables OpenTelemetry as the default backend for the Ray metrics
stack. The bulk of this PR is actually fixing tests that were written
with assumptions that no longer hold true. For ease of reviewing, I
inline the reason for each test change together with the change itself
in the comments.

This PR also depends on a release of vLLM (so that we can update the
minimum supported version of vLLM in Ray).

Test:
- CI


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Enable OpenTelemetry metrics backend by default and refactor
metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
metric names.
> 
> - **Core/Config**:
> - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
`true` in `ray_constants.py` and `ray_config_def.h`.
> - Metrics `Counter`: use `CythonCount` by default; keep legacy
`CythonSum` only when OTEL is explicitly disabled.
> - **Serve/Metrics Tests**:
> - Replace text scraping with `PrometheusTimeseries` and
`fetch_prometheus_metric_timeseries` throughout.
> - Update metric names/tags to `ray_serve_*` and counter suffixes
`*_total`; adjust latency metric names and processing/queued gauges.
> - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
pass through helpers.
> - **General Test Fixes**:
> - Remove OTEL parametrization/fixtures; simplify expectations where
counters-as-gauges no longer apply; drop related tests.
> - Cardinality tests: include `"low"` level and remove OTEL gating;
stop injecting `enable_open_telemetry` in system config.
> - Actor/state/thread tests: migrate to cluster fixtures, wait for
dashboard agent, and adjust expected worker thread counts.
> - **Build**:
> - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
from C++ stats test.
> 
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…mmended (ray-project#57726)


## Description

If users schedule a detached actor into a placement group, Raylet will
kill the actor when the placement group is removed. The actor will be
stuck in the `RESTARTING` state forever if it's restartable until users
explicitly kill it.

In that case, if users try to `get_actor` with the actor's name, it can
still return the restarting actor, but no process exists. It will no
longer be restarted because the PG is gone, and no PG with the same ID
will be created during the cluster's lifetime.

The better behavior would be for Ray to transition a task/actor's state
to dead when it is impossible to restart. However, this would add too
much complexity to the core, so I think it's not worth it. Therefore,
this PR adds a warning log, and users should use detached actors or PGs
correctly.

Example: Run the following script and run `ray list actors`.

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.placement_group import placement_group, remove_placement_group

@ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
class Actor:
  pass

ray.init()

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()

ray.get(actor.__ray_ready__.remote())
```

## Related issues


## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context


---------

Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…y-project#57715)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
The python test step is failing on master now because of this. Probably
a logical merge conflict.
```
FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
...

[2025-11-11T22:11:54Z]     from ray.tests.authentication_test_utils import (
[2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
```

Signed-off-by: dayshah <dhyey2019@gmail.com>
be consistent with the default build environment

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#58543)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
## Description
- rename RAY_auth_mode β†’ RAY_AUTH_MODE environment variable across
codebase
- Excluded healthcheck endpoints from authentication for Kubernetes
compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale
tokens when switching clusters

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ls (ray-project#58424)

## Description
- Use a client interceptor for adding auth tokens to gRPC calls when
`AUTH_MODE=token` (see the sketch after this list)
- BuildChannel() will automatically include the interceptor
- Removed the `auth_token` parameter from `ClientCallImpl`
- Removed manual auth from `python_gcs_subscriber.cc`
- Added tests to verify auth works for autoscaler APIs
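
The PR itself is implemented in C++; as an illustration of the client-interceptor pattern, here is a minimal sketch in Python gRPC that attaches a token header to every outgoing unary call. The header name and class names are assumptions for illustration only.

```python
import collections
import grpc

class _CallDetails(
    collections.namedtuple("_CallDetails", ("method", "timeout", "metadata", "credentials")),
    grpc.ClientCallDetails,
):
    pass

class TokenAuthClientInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Adds an auth token to the metadata of every unary-unary call."""

    def __init__(self, token: str):
        self._token = token

    def intercept_unary_unary(self, continuation, client_call_details, request):
        metadata = list(client_call_details.metadata or [])
        metadata.append(("authorization", self._token))  # illustrative header name
        details = _CallDetails(
            client_call_details.method,
            client_call_details.timeout,
            metadata,
            client_call_details.credentials,
        )
        return continuation(details, request)

# Usage (sketch):
# channel = grpc.intercept_channel(
#     grpc.insecure_channel("localhost:50051"), TokenAuthClientInterceptor(token)
# )
```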

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…`) (ray-project#57090)

When actors terminate gracefully, Ray calls the actor's
`__ray_shutdown__()` method if defined, allowing for cleanup of
resources. However, this is not invoked when the actor goes out of scope
due to `del actor`.

### Why `del actor` doesn't invoke `__ray_shutdown__`

I traced through the entire code path, and here's what happens:

Flow when `del actor` is called:

1. **Python side**: `ActorHandle.__del__()` ->
`worker.core_worker.remove_actor_handle_reference(actor_id)`

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040

2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
`reference_counter_->RemoveLocalReference()`
- When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
callback

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506

3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
`AsyncReportActorOutOfScope()` to GCS

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51

4. **GCS receives notification**: `HandleReportActorOutOfScope()` 
- **THE PROBLEM IS HERE** ([line 279 in
`src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
   ```cpp
   DestroyActor(actor_id,
                GenActorOutOfScopeCause(actor),
                /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
                [reply, send_reply_callback]() {
   ```

5. **Actor worker receives kill signal**: `HandleKillActor()` in
[`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
   ```cpp
   if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
       ForceExit(...)  // Skips __ray_shutdown__
   } else {
       Exit(...)  // Would call __ray_shutdown__
   }
   ```

6. **ForceExit path**: Bypasses graceful shutdown -> No
`__ray_shutdown__` callback invoked.

This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
actors. It also updates the docs.
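
For reference, a minimal usage sketch of the behavior this enables: an actor that defines `__ray_shutdown__()` now gets its cleanup hook invoked even when the handle is simply deleted. The resource being cleaned up here is illustrative.

```python
import ray
import tempfile

@ray.remote
class LogWriter:
    def __init__(self):
        # Illustrative resource that needs cleanup on shutdown.
        self._file = tempfile.NamedTemporaryFile(mode="w", delete=False)

    def log(self, msg: str) -> None:
        self._file.write(msg + "\n")

    def __ray_shutdown__(self):
        # Graceful-shutdown hook. With this change it also runs when the
        # handle goes out of scope, since the GCS no longer force-kills
        # OUT_OF_SCOPE actors.
        self._file.close()

ray.init()
actor = LogWriter.remote()
ray.get(actor.log.remote("hello"))
del actor  # graceful exit now invokes __ray_shutdown__
```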

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
Currently, a node is considered idle while pulling objects from the
remote object store. This can lead to situations where a node is
terminated as idle, causing the cluster to enter an infinite loop when
pulling large objects that exceed the node idle termination timeout.

This PR fixes the issue by treating object pulling as a busy activity.
Note that nodes can still accept additional tasks while pulling objects
(since pulling consumes no resources), but the auto-scaler will no
longer terminate the node prematurely.

Closes ray-project#54372

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…_FACTOR` to 2 (ray-project#58262)


## Description

This was setting the value to be aligned with the previous default of 4.

However, after some consideration, I've realized that 4 is too high, so
this lowers it to 2.


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…y-project#58523)

## Description

This PR improves documentation consistency in the `python/ray/data`
module by converting all remaining rST-style docstrings (`:param:`,
`:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
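
For example, the conversion looks like this (the function and parameter names below are illustrative, not copied from the files listed below):

```python
# Before: rST-style docstring
def hash_column(values, seed):
    """Hash the column values.

    :param values: The column values to hash.
    :param seed: Seed for the hash function.
    :return: A list of hash values.
    """

# After: Google-style docstring
def hash_column(values, seed):
    """Hash the column values.

    Args:
        values: The column values to hash.
        seed: Seed for the hash function.

    Returns:
        A list of hash values.
    """
```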

## Additional information

**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted
`StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted
`unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…oject#58549)

## Description

The original `test_concurrency` function combined multiple test
scenarios into a single test with complex control flow and expensive Ray
cluster initialization. This refactoring extracts the parameter
validation tests into focused, independent tests that are faster,
clearer, and easier to maintain.

Additionally, the original test included "validation" cases that tested
valid concurrency parameters but didn't actually verify that concurrency
was being limited correctly; they only checked that the output was
correct, which isn't useful for validating the concurrency feature
itself.

**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and
`test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to
`ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element since we're only
validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior

## Related issues

N/A

## Additional information

The validation tests now execute significantly faster and provide
clearer failure messages. Each test has a single, well-defined purpose
making maintenance and debugging easier.
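
As a rough sketch of the resulting test shape (the specific invalid values and the exact exception type are assumptions, not copied from the PR):

```python
import pytest
import ray

@pytest.mark.parametrize("concurrency", [0, -1])
def test_invalid_func_concurrency_raises(ray_start_regular_shared, concurrency):
    # A single element is enough: we only care that parameter validation
    # rejects the value, not that any real work happens.
    ds = ray.data.range(1)
    with pytest.raises(ValueError):
        ds.map(lambda row: row, concurrency=concurrency).materialize()
```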

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Previously it was actually using 0.4.0, which is set up by the gRPC
repo. The declaration in the workspace file was being shadowed.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
srinathk10 and others added 22 commits November 24, 2025 15:57
…roject#58864)


## Description

### [Data] Fix obj_store_mem_max_pending_output_per_task reporting

Fix `obj_store_mem_max_pending_output_per_task` so that, when no sample
is available, it falls back to

- `bytes_per_output` = `MAX_SAFE_BLOCK_SIZE_FACTOR` *
`target_max_block_size`.
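
A minimal sketch of this fallback logic; the constant value and parameter names are assumptions for illustration, not the actual Ray Data defaults:

```python
MAX_SAFE_BLOCK_SIZE_FACTOR = 2  # assumed value for illustration
TARGET_MAX_BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MiB target block size

def max_pending_output_bytes_per_task(sampled_bytes_per_output=None):
    # When no output sample is available yet, fall back to a conservative
    # upper bound instead of reporting a misleading value.
    if sampled_bytes_per_output is None:
        return MAX_SAFE_BLOCK_SIZE_FACTOR * TARGET_MAX_BLOCK_SIZE
    return sampled_bytes_per_output
```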


---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…te matching (ray-project#58927)

The correct route value is already part of RequestMetadata after
ray-project#58180, no need to recompute it
again.

No observed perf diff in the microbenchmark.

After
```
Type	Name	# Requests	# Fails	Median (ms)	95%ile (ms)	99%ile (ms)	Average (ms)	Min (ms)	Max (ms)	Average size (bytes)	Current RPS	Current Failures/s
GET	/echo?message=hello	28068	0	200	410	470	228.27	80	592	26	430.3	0
Aggregated	28068	0	200	410	470	228.27	80	592	26	430.3	0
```

Before
```
Type	Name	# Requests	# Fails	Median (ms)	95%ile (ms)	99%ile (ms)	Average (ms)	Min (ms)	Max (ms)	Average size (bytes)	Current RPS	Current Failures/s
GET	/echo?message=hello	27427	0	210	410	470	232.12	76	604	26	429.7	0
Aggregated	27427	0	210	410	470	232.12	76	604	26	429.7	0
```

Additionally, the old implementation wrongly assumed that there would
only be one method (GET, PUT) corresponding to a route. This PR fixes
that assumption and adds tests for it (see the sketch below).
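
A minimal illustration of why the lookup must map a route to a set of methods rather than a single method (names are illustrative):

```python
# One route can serve multiple HTTP methods.
route_methods = {
    "/echo": {"GET", "PUT"},
}

def is_registered(route: str, method: str) -> bool:
    return method in route_methods.get(route, set())

assert is_registered("/echo", "GET")
assert is_registered("/echo", "PUT")
```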

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>

## Description

### [Data] Add iter_prefetched_blocks stats

Report prefetched bytes per iterator as stats.



---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <68668616+srinathk10@users.noreply.github.com>
…58299)

This PR replaces STATS with Metric as a way to define metrics inside Ray
(as a unification effort) in all common components. Normally, metrics
are defined at the top-level component and passed down to
sub-components. However, in this case, because the common components are
used as APIs across the codebase, doing so would feel unnecessarily
cumbersome, so I decided to define the metrics inline within each client
and server class instead.

Note that the metric classes (Metric, Gauge, Sum, etc.) are simply
wrappers around static OpenCensus/OpenTelemetry entities.

**Details**
Full context of this refactoring work.
- Each component (e.g., gcs, raylet, core_worker, etc.) now has a
metrics.h file located in its top-level directory. This file defines all
metrics for that component.
- In most cases, metrics are defined once in the main entry point of
each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for
Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.).
These metrics are then passed down to subcomponents via the
ray::observability::MetricInterface.
- This approach significantly reduces rebuild time when metric
infrastructure changes. Previously, a change would trigger a full Ray
rebuild; now, only the top-level entry points of each component need
rebuilding.
- There are a few exceptions where metrics are tracked inside object
libraries (e.g., task_specification). In these cases, metrics are
defined within the library itself, since there is no corresponding
top-level entry point.

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
ray-project#58710)


## Description

ray-project#58711 decreased the scale of the
`map_groups` tests from scale-factor 100 to scale-factor 10 because some
of the `map_groups` release tests were failing. However, after more
investigation, I realized that the only variant that doesn't work with
scale-factor 100 is the hash shuffle with autoscaling variant (see
ray-project#58734).

This PR re-increases the scale and only disables the cases that fail.


---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Summary
This PR removes `test_large_args_scheduling_strategy` from
`test_stats.py` because it's flaky and not worth keeping (it tests
implementation details rather than behavior and conflates multiple
concerns).

See
https://buildkite.com/ray-project/premerge/builds/54495#019ab720-249f-49c5-8e25-5e9005cc41e2

## Motivation

1. **Hardcodes scheduling strategy values** - The test assumes large
args use `'DEFAULT'` and small args use `'SPREAD'`. If these defaults
change in `context.py`, the test fails even though the system is working
correctly.

2. **Tests stats format, not scheduling behavior** - The test doesn't
verify that the correct scheduling strategy is actually passed to Ray
tasks. It only checks that a specific string appears in stats output.

3. **Mixes two concerns** - The test conflates:
- Scheduling strategy selection based on data size (belongs in a
map-related test)
- Stats output including scheduling strategy info (belongs in a general
stats formatting test)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Why are these changes needed?

We introduced an improved error message when environments fail in
ray-project#55567.
At the same time, this bypasses the silencing of env step errors.
This PR consolidates the messages.

---------

Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
…-project#58915)

# Description
This PR refactors the `PhysicalOperator` class to eliminate hidden side
effects in the `completed()` method. Previously, calling `completed()`
could inadvertently modify the internal state of the operator, which
could lead to unexpected behavior. This change separates the logic for
checking if the operator is marked as finished from the logic that
computes whether it is actually finished.

Key changes include:
- Renaming `_execution_finished` to `_is_execution_marked_finished` to
clarify its purpose.
- Renaming `execution_finished()` to `has_execution_finished()` and
making it a pure computed property without side effects.
- Updating the `completed()` method to call `has_execution_finished()`
instead of modifying internal state.
- Ensuring that `mark_execution_finished()` correctly sets the renamed
field.


## Related issues
Fixes ray-project#58884

## Additional information
This refactor ensures that both `has_execution_finished()` and
`completed()` are pure query methods, allowing them to be called
multiple times without altering the state of the operator.
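
A rough sketch of the resulting shape; the method names follow the PR description, while the internal fields are assumptions for illustration:

```python
class OperatorCompletionSketch:
    def __init__(self):
        self._is_execution_marked_finished = False
        self._inputs_complete = False
        self._internal_queues_empty = True

    def mark_execution_finished(self) -> None:
        # The only method that mutates the completion flag.
        self._is_execution_marked_finished = True

    def has_execution_finished(self) -> bool:
        # Pure query: computes completion without mutating any state.
        return self._is_execution_marked_finished or (
            self._inputs_complete and self._internal_queues_empty
        )

    def completed(self) -> bool:
        # Also a pure query now: safe to call repeatedly.
        return self.has_execution_finished() and self._internal_queues_empty
```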

---------

Signed-off-by: Simeet Nayan <simeetnayan.8100@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
## Description

The links for APPO were referencing the PPO paper. I updated them to
link to the IMPACT paper.

Signed-off-by: Philipp Schmutz <2059887+pschmutz@users.noreply.github.com>
… completed episodes when sampling a fixed number of episodes (ray-project#58931)

## Description
The `MultiAgentEnvRunner` would previously call the callback twice for
the final episode of a batch (when sampling a fixed number of episodes).
This PR fixes this problem by ensuring that the callback only happens
once per finished episode.

## Related issues
Closes ray-project#55452

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
## Description
When the Autoscaler receives a resource request and decides which type
of node to scale up, only the `UtilizationScore` is considered (that
is, Ray tries to avoid launching a large node for a small resource
request, which would lead to resource waste). If multiple node types in
the cluster have the same `UtilizationScore`, Ray always requests the
same node type.

In Spot scenarios, cloud resources are dynamically changing. Therefore,
we want the Autoscaler to be aware of cloud resource availability: if a
certain node type becomes unavailable, the Autoscaler should be able to
automatically switch to requesting other node types.

In this PR, I added the `CloudResourceMonitor` class, which records node
types that have failed resource allocation, and in future scaling
events, reduces the weight of these node types.

## Related issues
Related to ray-project#49983 
Fixes ray-project#53636  ray-project#39788 ray-project#39789 


## Implementation details
1. `CloudResourceMonitor`
This is a subscriber of Instances. When an Instance gets the status
`ALLOCATION_FAILED`, `CloudResourceMonitor` records the node_type and
lowers its availability score.
2. `ResourceDemandScheduler`
This class determines how to select the best node_type to handle a
resource request. I modified the part that selects the best node type:
```python
# Sort the results by score.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True
)
```
The sorting includes:
2.1. UtilizationScore: to maximize resource utilization.
2.2. Cloud resource availabilities: prioritize node types with the most
available cloud resources, in order to minimize allocation failures.

---------

Signed-off-by: xiaowen.wxw <wxw403883@alibaba-inc.com>
Co-authored-by: 葌筠 <wxw403883@alibaba-inc.com>
This is for the KubeRay 1.5.1 release, for Ray auth token mode.

Docs link:
https://anyscale-ray--58885.com.readthedocs.build/en/58885/cluster/getting-started.html

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
…t events (ray-project#58953)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
## Why are these changes needed?

The memory leak being tested
([apache/arrow#45493](apache/arrow#45493))
specifically occurs when inferring types from **ndarray objects**, not
from lists containing ndarrays. Testing the `list` case added no value
since the leak doesn't manifest there; it only added execution time and
obscured the test's purpose.

More importantly, the previous 1 MiB threshold was too tight and caused
flaky failures. Memory measurements via RSS are inherently noisy due to
OS-level allocation behavior, garbage collection timing, and memory
fragmentation. A test that occasionally uses 1.1 MiB would fail despite
no actual leak.

The new approach:
- **Calls `_infer_pyarrow_type` 8 times in a loop**, which leaks 1 GiB
without Ray Data's workaround (admittedly, 8 is a magic number here)
- **Uses a 64 MiB threshold**, providing a much larger margin above
normal variation while still catching any real leak with a clear signal

This creates a much stronger test: if the leak exists, we'd see memory
growth approaching 1 GiB (with repeated runs), making failures
unambiguous. Meanwhile, normal RSS fluctuations of a few MiB won't
trigger false positives.
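
A rough sketch of this kind of RSS-threshold check (`infer_type_fn` stands in for the internal `_infer_pyarrow_type` helper, whose import path isn't shown here; the array shape is illustrative):

```python
import numpy as np
import psutil

def rss_mib() -> float:
    return psutil.Process().memory_info().rss / (1024 ** 2)

def assert_no_type_inference_leak(infer_type_fn, iterations: int = 8, threshold_mib: float = 64.0):
    # Inferring from an ndarray (not a list of ndarrays) is what triggered the leak.
    data = np.zeros((1024, 1024), dtype=np.float64)
    baseline = rss_mib()
    for _ in range(iterations):
        infer_type_fn(data)
    growth = rss_mib() - baseline
    # 64 MiB leaves plenty of headroom for RSS noise while still catching a
    # leak that would otherwise grow toward ~1 GiB over repeated calls.
    assert growth < threshold_mib, f"possible leak: RSS grew by {growth:.1f} MiB"
```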

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Description

Based on the comment here:
ray-project#58630 (comment)

Currently, the `IssueDetector` base class requires all its subclasses to
take the `StreamingExecutor` as an argument, making the classes hard to
mock and test because we have to mock all of `StreamingExecutor`.

In this PR, we did the following (see the sketch after this list):
1. Remove the constructor in the `IssueDetector` base class and add
`from_executor()` to set up the class based on the executor
2. Refactor subclasses of `IssueDetector` to use this format
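
A rough sketch of the resulting shape; the executor attribute accessed in `from_executor()` is an assumption for illustration:

```python
class IssueDetectorSketch:
    def __init__(self, data_context):
        # Subclasses take only the narrow dependencies they need, so tests
        # can construct them directly with a small fake instead of mocking
        # the whole StreamingExecutor.
        self._data_context = data_context

    @classmethod
    def from_executor(cls, executor) -> "IssueDetectorSketch":
        # Factory that pulls the needed pieces out of the executor.
        return cls(data_context=executor._data_context)

    def detect(self):
        raise NotImplementedError
```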

## Related issues

Related to ray-project#58562


---------

Signed-off-by: machichima <nary12321@gmail.com>
## Description
`asv.conf.json` appears to be a legacy file in `python` and `rllib` used
for benchmarking that hasn't been modified in 5 years. Core already has a
nightly benchmark and RLlib is moving to adding one; therefore, this file
shouldn't be necessary anymore.

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
## Description

`test_backpressure_e2e` occasionally fails without any traceback or
warning message:
```
[2025-11-24T21:42:12Z] ==================== Test output for //python/ray/data:test_backpressure_e2e:
--
[2025-11-24T21:42:12Z] /opt/miniforge/lib/python3.12/site-packages/paramiko/pkey.py:82: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
[2025-11-24T21:42:12Z]   "cipher": algorithms.TripleDES,
[2025-11-24T21:42:12Z] /opt/miniforge/lib/python3.12/site-packages/paramiko/transport.py:253: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
[2025-11-24T21:42:12Z]   "class": algorithms.TripleDES,
[2025-11-24T21:42:12Z] ============================= test session starts ==============================
[2025-11-24T21:42:12Z] platform linux -- Python 3.12.9, pytest-7.4.4, pluggy-1.3.0 -- /opt/miniforge/bin/python3
[2025-11-24T21:42:12Z] cachedir: .pytest_cache
[2025-11-24T21:42:12Z] rootdir: /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray
[2025-11-24T21:42:12Z] configfile: pytest.ini
[2025-11-24T21:42:12Z] plugins: repeat-0.9.3, anyio-3.7.1, fugue-0.8.7, aiohttp-1.1.0, asyncio-0.17.2, docker-tools-3.1.3, forked-1.4.0, pytest_httpserver-1.1.3, lazy-fixtures-1.1.2, mock-3.14.0, remotedata-0.3.2, rerunfailures-11.1.2, sphinx-0.5.1.dev0, sugar-0.9.5, timeout-2.1.0, typeguard-2.13.3
[2025-11-24T21:42:12Z] asyncio: mode=Mode.AUTO
[2025-11-24T21:42:12Z] timeout: 180.0s
[2025-11-24T21:42:12Z] timeout method: signal
[2025-11-24T21:42:12Z] timeout func_only: False
[2025-11-24T21:42:12Z] collecting ... collected 12 items
[2025-11-24T21:42:12Z]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_large_e2e_backpressure_no_spilling PASSED [  8%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-3-500] PASSED [ 16%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-4-100] PASSED [ 25%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-3-100] PASSED [ 33%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-3-500] PASSED [ 41%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-4-100] PASSED [ 50%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-3-100] PASSED [ 58%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_resource_contention[False] PASSED [ 66%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_resource_contention[True] PASSED [ 75%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_with_preserve_order PASSED [ 83%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_input_backpressure_e2e PASSED [ 91%]================================================================================
```

To make this easier to debug, this PR enables the `-s` flag to log more
information.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ect#58754)

## Description
Previously, if
`DataContext.get_current().enable_get_object_locations_for_metrics=False`
(which it is by default), then we would return `(-1, -1, -1)` by default.
This wasn't being handled properly, so we would get negative metrics.
This PR addresses that.

This PR also fixes run_index=-1 for **streaming split**. All iterators
except streaming split:
1. Create the executor with the `dataset_tag` from step 2
2. Increment `dataset_tag`
3. Get the dataset_tag (dataset_-1)

However, streaming_split skips step 2. This PR addresses that.

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
## Description
RLlib is missing nightly testing, making it difficult to track training
performance over time.
This PR re-enables it, just for APPO to start with, on Atari and MuJoCo
environments.

I've removed the AutoROM comment as it's no longer used by ALE to install
Atari ROMs.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ashboard's reporter_head. (ray-project#58978)

There's no user of this endpoint in the codebase. This has the added
benefit of reducing the surface area for our cython-bindings for
GcsClient by removing ActorInfoAccessor::AsyncKillActor.

Signed-off-by: irabbani <irabbani@anyscale.com>

@sourcery-ai sourcery-ai bot left a comment


The pull request #689 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5512.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a significant refactoring and modernization of the project's continuous integration (CI) pipeline, build system, and dependency management. It introduces a more modular CI structure, adopts uv and raydepsets for enhanced Python dependency control, and updates the Bazel build configurations for improved efficiency and maintainability. These changes aim to streamline development workflows, ensure more reproducible builds, and prepare the project for future scalability and platform support.

Highlights

  • CI Pipeline Modernization: The Buildkite CI pipeline has undergone extensive refactoring, introducing new modular YAML configurations for image builds and dependency management, and consolidating various build and test steps for improved efficiency and maintainability.
  • Python Dependency Management with uv and raydepsets: The project transitions to uv and a new raydepsets system for managing Python dependencies, replacing older pip-compile and miniconda setups, leading to more hermetic and reproducible Python environments across various platforms.
  • Bazel Build System Enhancements: Significant updates to Bazel configurations include new packaging rules for C++ and Python artifacts, more granular C++ target definitions, and improved Python toolchain management, alongside enabling strict_action_env by default.
  • Updated Python and CUDA Support: CI configurations have been updated to reflect changes in supported Python versions (e.g., dropping Python 3.9 in some areas, defaulting to 3.10) and expanding CUDA versions, ensuring compatibility with newer environments.
  • C++ API and Runtime Refinements: The C++ API and runtime components have undergone refactoring, including changes to remote function handling, metric recording, object store behavior, and network utility functions, enhancing consistency and maintainability.
  • Documentation Tooling Integration: New linting tools like vale and semgrep have been integrated into the pre-commit hooks and CI, alongside updates to the documentation build process and style guide, improving code quality and consistency.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR is a massive and impressive refactoring of the entire CI/CD and build system. The changes introduce better modularity, adopt modern tools like uv and pre-commit, and improve dependency management with the new raydepsets tool. The build process is now more structured with multi-stage Docker builds and pre-built components. The overall direction is excellent and will significantly improve maintainability and developer experience. I've reviewed the changes and have a couple of minor corrections for test cases to align with the new build ID handling logic.

with mock.patch("subprocess.check_call", side_effect=_mock_subprocess):
LinuxTesterContainer("team", build_type="debug")
docker_image = f"{_DOCKER_ECR_REPO}:{_RAYCI_BUILD_ID}-team"
docker_image = f"{_DOCKER_ECR_REPO}:team"


medium

The _RAYCI_BUILD_ID is set to a1b2c3d4 in the test setup, and the get_docker_image utility function prepends it to the docker tag. The expected image name here should include the build ID to match the implementation.

Suggested change
docker_image = f"{_DOCKER_ECR_REPO}:team"
docker_image = f"{_DOCKER_ECR_REPO}:{os.environ.get('RAYCI_BUILD_ID')}-team"

"C:\\rayci",
"029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:unknown-test",
"029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:test",
"bash",


medium

The _get_docker_image method in WindowsContainer will produce an image tag with a leading hyphen (e.g., ...:-test) when RAYCI_BUILD_ID is empty, which is its new default. This test seems to expect the hyphen to be absent. The test should be updated to reflect the actual output. A better long-term fix would be to update WindowsContainer._get_docker_image to use the shared get_docker_image utility, which handles empty build IDs gracefully.

Suggested change
"bash",
"029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:-test",

