
[pull] master from ray-project:master #140

Open

wants to merge 5,262 commits into base: master
Conversation

pull[bot] commented Jun 29, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

can-anyscale and others added 28 commits November 13, 2024 19:33
https://github.com/ray-project/ray/pull/48693/files#diff-81d7373cd5567e997c002b244c491e6b3498d206b12c093f4dc4d30e9b5848af
added a test that uses tensorflow. We currently need to skip all
tensorflow-related tests for python 3.12 since we don't support
tensorflow for python 3.12.

Also this is a test for the deprecation of tensorflow ;)

Test:
- CI

Signed-off-by: can <can@anyscale.com>
…tensors > 2Gb) (#48629)

Enabling V2 Arrow Tensor extension type by default (allowing tensors >
2Gb)

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…Error (#48636)

In a recent investigation, we found that when
`ray._private.internal_api.free()` is called from a task at the same
time as a Raylet is gracefully shutting down, the task might fail with
an application-level broken-pipe IOError. This resulted in job failure
without any task retries.

However, because the broken pipe is caused by the unhealthiness of the
local Raylet, the error should be a system-level error and should be
retried automatically.

Updated changes in commit
[01f5f11](01f5f11):
This PR adds the logic for the above behavior:
* When IOError is received in the `CoreWorker::Delete`, throw a system
error exception so that the task can retry
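The error-conversion step in that bullet can be sketched in Python; `RaySystemError`, `delete_objects`, and `DeadRaylet` here are illustrative stand-ins, not Ray's actual internals:

```python
class RaySystemError(Exception):
    """Stand-in for a system-level, retryable error."""

def delete_objects(object_ids, raylet_client):
    """Mimic the CoreWorker::Delete behavior described above: surface
    an IOError from the Raylet connection as a system-level error so
    the scheduler retries the task instead of failing the job."""
    try:
        raylet_client.free(object_ids)
    except IOError as exc:
        raise RaySystemError(f"Raylet connection lost: {exc}") from exc

class DeadRaylet:
    """Simulates a Raylet whose socket is broken mid-shutdown."""
    def free(self, object_ids):
        raise IOError("Broken pipe")

caught = None
try:
    delete_objects(["obj-1"], DeadRaylet())
except RaySystemError as exc:
    caught = exc
```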

Why not add the exception check in the `free_objects` function?
* It is better to add the logic in the `CoreWorker::Delete` because it
can cover the case for other languages as well.
* The `CoreWorker::Delete` function is intended to be open to all
languages to call and is not called in other ray internal code paths.

Why not crash the worker when IOError is encountered in the
`WriteMessage` function?
* The `QuickExit()` function directly exits the process without
executing any shutdown logic for the worker. Calling it directly during
task execution might cause a resource leak.
* At the same time, the write message function is also called in the
graceful shutdown scenario, and it is possible that the local Raylet
becomes unreachable during the graceful shutdown process. Therefore, in
that scenario, we shouldn't exit early but let the shutdown logic
finish.
* Also, the intended behavior of graceful vs. force shutdown is not
clear in the code. We might need some effort to make it clear; a TODO
is added in the PR.

Updated changes in commit
[2029d36](2029d36):
> This PR adds the logic for the above behavior:
> * When IOError is received in the `free_objects()` function, throw a
system error exception so that the task can retry

Changes in commit ([9d57b29](9d57b29)):
> This PR adds the logic for the above behavior:
> * Today, the internal `free` API deletes the objects from the local
Raylet object store by writing a message through a socket
> * When the write fails because the local Raylet has terminated, there
is already logic to quick-exit the task
> * However, the current termination check didn't cover the case where
the local Raylet process is a zombie process and an IOError happens
while writing messages.
> * This fix updates the check criteria and fails the task when the
Raylet process is terminated or the write message function returns an
IOError


Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
## Why are these changes needed?
When we call `xxx.map_groups(..., batch_format="...")`, we may invoke
the sort function and create empty blocks, which still use pyarrow by
default. Then, when we invoke another sort call on top of it, we hit
`AttributeError: 'DataFrame' object has no attribute 'num_rows'`,
because we use the first block's type even though the blocks may have
different types. See more details in #46748
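A minimal stand-in for the mismatch, with `ArrowBlock` and `PandasBlock` as illustrative substitutes for pyarrow and pandas blocks, shows why assuming the first block's type fails and how type-aware dispatch avoids it:

```python
class ArrowBlock:
    """Stand-in for a pyarrow.Table, which exposes num_rows."""
    def __init__(self, rows):
        self._rows = rows

    @property
    def num_rows(self):
        return len(self._rows)

class PandasBlock:
    """Stand-in for a pandas.DataFrame: no num_rows, only len()."""
    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

def block_num_rows(block):
    # Dispatch on each block's actual type rather than assuming the
    # first block's type applies to every block in the dataset.
    if hasattr(block, "num_rows"):
        return block.num_rows
    return len(block)

blocks = [ArrowBlock([1, 2]), PandasBlock([3, 4, 5])]
counts = [block_num_rows(b) for b in blocks]
```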

## Related issue number

Close #46748

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
Created by release automation bot.

Update with commit e393a71

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…es (#48478)


## Why are these changes needed?

When writing blocks to parquet, there might be blocks with fields that
differ ONLY in nullability - by default, this would be rejected since
some blocks might have a different schema than the ParquetWriter.
However, we could potentially allow it to happen by tweaking the schema.

This PR goes through all blocks before writing them to parquet, and
merges schemas that differ only in the nullability of their fields.
It also casts each table to the newly merged schema so that the write
can proceed.
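The nullability-merge rule can be sketched without pyarrow; representing schemas as plain `(name, dtype, nullable)` tuples is an assumption for illustration only, not the PR's actual code:

```python
def merge_nullability(schemas):
    """Merge schemas that differ only in field nullability.

    Each schema is a list of (name, dtype, nullable) tuples. A field is
    nullable in the merged schema if it is nullable in any input
    schema. Raises ValueError if schemas differ in anything else.
    """
    base = [(name, dtype) for name, dtype, _ in schemas[0]]
    for schema in schemas[1:]:
        if [(n, t) for n, t, _ in schema] != base:
            raise ValueError("schemas differ in more than nullability")
    merged = []
    for i, (name, dtype) in enumerate(base):
        # OR the nullability flags across all blocks for this field.
        nullable = any(schema[i][2] for schema in schemas)
        merged.append((name, dtype, nullable))
    return merged

s1 = [("id", "int64", False), ("name", "string", True)]
s2 = [("id", "int64", True), ("name", "string", False)]
merged = merge_nullability([s1, s2])
```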


## Related issue number

Closes #48102

---------

Signed-off-by: rickyx <rickyx@anyscale.com>
…48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the
SortKey object, similar to how SortTaskSpec is implemented.


## Related issue number

This is related to #42776 and
#42142


Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: dentiny <dentinyhao@gmail.com>
ray.util.state.get_actor currently has a type annotation specifying
that the function should return an Optional[Dict], but it actually
returns
[ActorState](https://github.com/ray-project/ray/blob/3141dfe4031cc715515b365278cd1d6b8955154e/python/ray/util/state/common.py#L416)
(as the docstring specifies). This pull request simply changes this
type annotation.


Signed-off-by: Miguel Teixeira <miguel.teixeira@poli.ufrj.br>

## Why are these changes needed?

We currently report `iter_total_blocked_seconds` and `iter_user_seconds`
as **Gauge** metrics while we track them as counters, i.e.:
- For each iteration, we had a timer that sums locally for each
iteration into an aggregated value (which is the sum of total blocked
seconds)
- When the iteration ends or the iterator is GCed, the gauge metric
value is currently set to 0.
- This creates confusion for users as a counter value (total time
blocked on a dataset) should not be going back to 0, generating charts
like below.
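A toy Gauge/Counter pair illustrates the confusion: resetting a gauge to 0 on iterator GC destroys the cumulative total that a counter would preserve. Both classes are illustrative stand-ins, not Ray's metrics API:

```python
class Gauge:
    """Last-value metric; setting it to 0 loses the running total."""
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

class Counter:
    """Monotonic metric; it only ever increases."""
    def __init__(self):
        self.value = 0.0

    def inc(self, delta):
        self.value += delta

gauge, counter = Gauge(), Counter()
for blocked_seconds in [1.5, 2.0, 0.5]:
    gauge.set(gauge.value + blocked_seconds)
    counter.inc(blocked_seconds)

# The old "reset on iterator GC" hack: the gauge drops back to 0,
# producing the confusing charts described above.
gauge.set(0.0)
total_from_gauge = gauge.value
total_from_counter = counter.value
```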

---------

Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Some C++ improvements to our codebase.

---------

Signed-off-by: dentiny <dentinyhao@gmail.com>
…coloring (#48473)

Signed-off-by: dayshah <dhyey2019@gmail.com>
## Why are these changes needed?

Adds a sentinel value to make it possible to sort.

Fixes #42142 
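One way such a sentinel could work (a hypothetical sketch, not necessarily this PR's implementation) is a value that compares less than everything else, so rows with missing keys sort deterministically:

```python
import functools

@functools.total_ordering
class Sentinel:
    """Compares less than any non-sentinel value, so None-like keys
    always sort to the front."""
    def __eq__(self, other):
        return isinstance(other, Sentinel)

    def __lt__(self, other):
        return not isinstance(other, Sentinel)

NULL_SENTINEL = Sentinel()

# Plain sort of [3, None, 1, ...] would raise a TypeError; swapping
# None for the sentinel makes the list totally ordered.
values = [3, None, 1, None, 2]
keyed = [NULL_SENTINEL if v is None else v for v in values]
keyed.sort()
sorted_values = [None if isinstance(v, Sentinel) else v for v in keyed]
```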

## Related issue number


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
… batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)
Signed-off-by: Aziz Belaweid <40893766+azayz@users.noreply.github.com>
Signed-off-by: ltbringer <amresh.venugopal@wise.com>
…ard (#48745)

## Why are these changes needed?

Currently, there are some cases where the `Rows Outputted` value on the
Ray Job page's `Ray Data Overview` section says "0", even after the
dataset execution completes. The root cause of the bug is that we clear
iteration/execution metrics after the dataset completes. This was
previously used to "reset" the metrics to 0 after dataset completion, so
that the last emitted value would not persist on the dashboard, even
after the job finishes. Now that we display rates on the dashboard, this
hack is no longer needed, and we can skip the metrics clearing.

Fixed result:
<img width="1860" alt="Screenshot at Nov 14 12-11-24"
src="https://github.com/user-attachments/assets/35061b3f-9359-412b-8ab2-f4bcce412994">

## Related issue number

Closes #44635

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>

## Why are these changes needed?
Fixed typo


## Related issue number


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: mohitjain2504 <87856435+mohitjain2504@users.noreply.github.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Gene Der Su <gdsu@ucdavis.edu>
Signed-off-by: dayshah <dhyey2019@gmail.com>
too ancient; and there are no tests.

this should remove the outdated paramiko package from the ray image

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
I randomly found a lancedb issue:
lancedb/lancedb#1480
which discloses a high-severity CVE.

Considering that, like lancedb, Ray only has one use case for the
`retry` package, I took the same approach as lancedb/lancedb#1749,
which names all variables better, with units and default values.

---------

Signed-off-by: dentiny <dentinyhao@gmail.com>
Nightly is inherently unstable, and latest is just the last stable
release. Neither of them should block releases.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
fix NCCL_BFLOAT16 typo in TORCH_NCCL_DTYPE_MAP
also uses 1.3 syntax and heredoc for multiline commands

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Support read from Hudi table into Ray dataset.

---------

Signed-off-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
…s download (#48756)

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
dayshah and others added 30 commits December 9, 2024 10:28
Signed-off-by: dayshah <dhyey2019@gmail.com>
…49017)

Primary changes:
* Makes sort and shuffle tests use an autoscaling cluster
* Removes GCE variants (because we never run them)

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…#49116)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

This addresses #45541 and
#49014

## Related issue number


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
…hrough shared memory (#48957)

Some torch.dtypes don't have a numpy equivalent. We use numpy to store
the tensor data zero-copy in the object store. To support these tensors,
we first view the array with a common dtype (uint8), and then view as a
np array. During deserialization, we use another view back to the
original dtype.

Closes #48141.
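The same view-as-bytes round-trip can be demonstrated with the standard library alone; a float64 `array` stands in for the tensor here, since the actual change uses torch and numpy:

```python
from array import array

# A buffer of float64 values, standing in for a tensor whose dtype
# has no direct serialization path.
original = array("d", [1.0, -2.5, 3.25])

# "Serialize": reinterpret the buffer as raw uint8 bytes. This is a
# zero-copy view over the same memory, not a copy.
as_bytes = memoryview(original).cast("B")

# "Deserialize": view the same bytes back as the original dtype.
restored = as_bytes.cast("d")
roundtrip = list(restored)
```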

---------

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
#48547)

The OptunaSearch currently uses in-memory storage and does not provide
a way to configure any other storage (e.g. a database). This update
adds a configurable parameter to OptunaSearch that can be set to a
valid Optuna storage.

Signed-off-by: Ravi Dalal <ravidalal@google.com>
Signed-off-by: Ravi Dalal <12639199+ravi-dalal@users.noreply.github.com>
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Co-authored-by: Hongpeng Guo <hg5@illinois.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Hongpeng Guo <hpguo@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
In an effort to have fewer but more useful release tests, this PR removes all of the training release tests except for the distributed Parquet training and chaos distributed Parquet training release tests. Notably, this PR removes variants for different input formats, and single-node training release tests.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: dayshah <dhyey2019@gmail.com>
…48117)

Signed-off-by: Matti Picus <matti.picus@gmail.com>
… logs (#49038)

Signed-off-by: win5923 <ken89@kimo.com>
Signed-off-by: Colton Woodruff <coltwood93@gmail.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
…ker (#48708)

Make it clear that you can specify Ray Core custom resources here.

Signed-off-by: Matthew Deng <matt@anyscale.com>

## Why are these changes needed?


## Related issue number


## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: jukejian <jukejian@bytedance.com>
Co-authored-by: srinathk10 <68668616+srinathk10@users.noreply.github.com>
…s to read data from the Raylet socket (#49163)

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: jukejian <jukejian@bytedance.com>
## Why are these changes needed?

This PR enables passing kwargs to map tasks, which will be accessible
via `TaskContext.kwargs`.

This is a prerequisite for fixing
#49207. Optimization rules
can also use this API to pass additional arguments to the map tasks.
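A rough sketch of the idea, with `TaskContext` and the map runner as illustrative stand-ins rather than Ray Data's real classes:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class TaskContext:
    """Stand-in for a per-task context carrying extra kwargs."""
    task_idx: int
    kwargs: Dict[str, Any] = field(default_factory=dict)

def run_map_task(fn: Callable, rows: List[Any], ctx: TaskContext) -> List[Any]:
    # The map function receives the extra arguments stashed in
    # ctx.kwargs, e.g. by an optimization rule.
    return [fn(row, **ctx.kwargs) for row in rows]

def scale(row, factor=1):
    return row * factor

ctx = TaskContext(task_idx=0, kwargs={"factor": 10})
out = run_map_task(scale, [1, 2, 3], ctx)
```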

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
## Why are these changes needed?

Add an ExecutionCallback interface to allow hooking custom callback
logic into certain execution events. This can be useful for optimization
rules.
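The general shape of such a callback hook might look like the following; the class and method names are assumptions, not the merged interface:

```python
class ExecutionCallback:
    """Base class: override hooks to observe execution events."""
    def before_execution_starts(self, dataset_tag):
        pass

    def after_execution_completes(self, dataset_tag):
        pass

class RecordingCallback(ExecutionCallback):
    """Example callback that records the events it sees."""
    def __init__(self):
        self.events = []

    def before_execution_starts(self, dataset_tag):
        self.events.append(("start", dataset_tag))

    def after_execution_completes(self, dataset_tag):
        self.events.append(("done", dataset_tag))

def execute(dataset_tag, callbacks):
    # Fire hooks around the (elided) plan execution.
    for cb in callbacks:
        cb.before_execution_starts(dataset_tag)
    # ... run the execution plan here ...
    for cb in callbacks:
        cb.after_execution_completes(dataset_tag)

cb = RecordingCallback()
execute("ds-1", [cb])
```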

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
## Why are these changes needed?

Add a `DEPLOY_FAILED` deployment status.
- If replicas fail to start or health checks fail _during_ deploy, the
deployment status transitions to `DEPLOY_FAILED`
- If any deployments are in `DEPLOY_FAILED`, the application is also in
`DEPLOY_FAILED`.

## Related issue number

#48654


---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: Gene Der Su <gdsu@ucdavis.edu>
Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which
provides a convenient way to read data from ClickHouse into Ray
Datasets. The ClickHouseDatasource is particularly useful for users who
are working with large datasets stored in ClickHouse and want to
leverage Ray's distributed computing capabilities for AI and ML
use-cases. We found this functionality useful while evaluating ML
technologies and wanted to contribute this back.

Key Features and Benefits:
1. **Seamless Integration**: The ClickHouseDatasource allows for
seamless integration of ClickHouse data into Ray workflows, enabling
users to easily access their data and apply Ray's powerful parallel
computation.
2. **Custom Query Support**: Users can specify custom columns and
orderings, allowing for flexible query generation directly from the Ray
interface, which helps in reading only the necessary data, thereby
improving performance.
3. **User-Friendly API**: The connector abstracts the complexity of
setting up and querying ClickHouse, providing a simple API that allows
users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.
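The custom-query feature boils down to generating a SELECT from the requested columns and orderings; the function below is an illustrative sketch, not the connector's actual API:

```python
def build_query(table, columns=None, order_by=None, descending=False):
    """Build a ClickHouse SELECT from optional column and ordering
    specs, so only the needed data is read."""
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    if order_by:
        direction = "DESC" if descending else "ASC"
        query += " ORDER BY " + ", ".join(
            f"{col} {direction}" for col in order_by
        )
    return query

q = build_query("default.events", columns=["ts", "value"], order_by=["ts"])
```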

<img width="1340" alt="Screenshot 2024-11-20 at 3 52 42 AM"
src="https://github.com/user-attachments/assets/2421e48a-7169-4a9e-bb4d-b6b96f7e502b">

PLEASE NOTE: This PR is a continuation of
#48817, which was closed without
merging.

---------

Signed-off-by: Connor Sanders <connor@elastiflow.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
## Why are these changes needed?

Allow deployment to transition out of `DEPLOY_FAILED` into `UPDATING`.


Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>