forked from ray-project/ray
[pull] master from ray-project:master #140
Open: pull wants to merge 5,262 commits into garymm:master from ray-project:master
+520,193 −402,840
Conversation
https://github.com/ray-project/ray/pull/48693/files#diff-81d7373cd5567e997c002b244c491e6b3498d206b12c093f4dc4d30e9b5848af added a test that uses TensorFlow. We currently need to skip all TensorFlow-related tests on Python 3.12, since we don't support TensorFlow on Python 3.12. Fittingly, this is also a test for the deprecation of TensorFlow ;) Test: - CI Signed-off-by: can <can@anyscale.com>
…tensors > 2Gb) (#48629) Enable the V2 Arrow Tensor extension type by default, allowing tensors larger than 2 GB. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…Error (#48636) In a recent investigation, we found that when `ray._private.internal_api.free()` is called from a task at the same time the Raylet is gracefully shutting down, the task might fail with an application-level broken-pipe IOError. This resulted in job failure without any task retries. However, because the broken pipe is caused by the unhealthiness of the local Raylet, it should be a system-level error and should be retried automatically.

Updated changes in commit [01f5f11](01f5f11): this PR adds the logic for the above behavior:
* When an IOError is received in `CoreWorker::Delete`, throw a system-error exception so that the task can retry.

Why not add the exception check in the `free_objects` function?
* Adding the logic in `CoreWorker::Delete` also covers the case for other languages.
* `CoreWorker::Delete` is intended to be callable from all languages and is not called from other Ray-internal code paths.

Why not crash the worker when an IOError is encountered in `WriteMessage`?
* `QuickExit()` exits the process directly without executing any of the worker's shutdown logic; calling it during task execution might leak resources.
* `WriteMessage` is also called during graceful shutdown, and during graceful shutdown the local Raylet may legitimately be unreachable; in that scenario we shouldn't exit early but should let the shutdown logic finish.
* The behavior of graceful vs. forced shutdown is not clear in the code; making it clear will take some effort, so a TODO is added in this PR.
Updated changes in commit [2029d36](2029d36):
> This PR adds the logic for the above behavior:
> * When an IOError is received in the `free_objects()` function, throw a system-error exception so that the task can retry.

Changes in commit [9d57b29](9d57b29):
> This PR adds the logic for the above behavior:
> * Today, the internal `free` API deletes objects from the local Raylet object store by writing a message through a socket.
> * When the write fails because the local Raylet has terminated, there is already logic to quick-exit the task.
> * However, the current termination check didn't cover the case where the local Raylet process is a zombie process and an IOError happens during the write.
> * This fix updates the check criteria and fails the task when the Raylet process is terminated or the write returns an IOError.

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
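The retry pattern the two commits above describe can be sketched in plain Python. `RaySystemError`, `free_objects`, and `run_with_retries` here are hypothetical stand-ins for the real Core Worker machinery, not Ray's actual API:

```python
class RaySystemError(Exception):
    """Stand-in for a system-level error that triggers automatic task retry."""


def free_objects(write_message):
    # The free request is written to the local Raylet over a socket. If the
    # Raylet died mid-shutdown, the write raises IOError; re-raise it as a
    # system-level error so the scheduler retries the task instead of
    # surfacing an application-level failure.
    try:
        write_message(b"free")
    except IOError as e:
        raise RaySystemError(f"Raylet unreachable: {e}") from e


def run_with_retries(task, attempts=3):
    # Minimal retry loop: only system-level errors are retried.
    for i in range(attempts):
        try:
            return task()
        except RaySystemError:
            if i == attempts - 1:
                raise
```

With this split, an application-level exception from the task still fails immediately; only the Raylet-unreachable case is transparently retried.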
## Why are these changes needed? When we call `xxx.map_groups(..., batch_format="...")`, the sort it triggers may create empty blocks, which still use pyarrow by default. When another sort is then invoked on top of the result, we hit `AttributeError: 'DataFrame' object has no attribute 'num_rows'`, because we use the first block's type even though the blocks may be of different types. See #46748 for more details. ## Related issue number Closes #46748 --------- Signed-off-by: Xingyu Long <xingyulong97@gmail.com> Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
Created by release automation bot. Update with commit e393a71 Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…es (#48478) ## Why are these changes needed? When writing blocks to Parquet, there might be blocks whose fields differ ONLY in nullability; by default this is rejected, since some blocks then have a different schema than the ParquetWriter. However, we can allow it by tweaking the schema: this PR goes through all blocks before writing them to Parquet, merges schemas that differ only in the nullability of their fields, and casts each table to the newly merged schema so that the write can proceed. ## Related issue number Closes #48102 --------- Signed-off-by: rickyx <rickyx@anyscale.com>
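The merge rule described above can be illustrated with a plain-Python sketch. Representing a schema as `(name, type, nullable)` tuples is an assumption for illustration only, not how pyarrow models schemas:

```python
def merge_nullability(schema_a, schema_b):
    """Merge two schemas that differ only in field nullability.

    Each schema is a list of (field_name, type_name, nullable) tuples. A
    field is nullable in the result if it is nullable in either input; any
    other difference (name or type) is still rejected, matching the idea
    in #48478.
    """
    if len(schema_a) != len(schema_b):
        raise ValueError("schemas have different field counts")
    merged = []
    for (name_a, type_a, null_a), (name_b, type_b, null_b) in zip(schema_a, schema_b):
        if (name_a, type_a) != (name_b, type_b):
            raise ValueError(f"schemas differ beyond nullability: {name_a} vs {name_b}")
        merged.append((name_a, type_a, null_a or null_b))
    return merged
```

In the real PR the same idea is applied with pyarrow schemas, followed by casting each table to the merged schema before writing.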
Signed-off-by: dentiny <dentinyhao@gmail.com>
ray.util.state.get_actor currently has a type annotation specifying that the function returns an Optional[Dict], but it actually returns [ActorState](https://github.com/ray-project/ray/blob/3141dfe4031cc715515b365278cd1d6b8955154e/python/ray/util/state/common.py#L416), as the docstring specifies. This pull request simply changes that type annotation. Signed-off-by: Miguel Teixeira <miguel.teixeira@poli.ufrj.br>
…48746) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
## Why are these changes needed? We currently report `iter_total_blocked_seconds` and `iter_user_seconds` as **Gauge** metrics while we actually track them as counters, i.e.:
- For each iteration, a timer sums locally into an aggregated value (the sum of total blocked seconds).
- When the iteration ends or the iterator is GCed, the gauge value is currently set back to 0.
- This creates confusion for users: a counter value (total time blocked on a dataset) should never go back to 0, yet this produces charts where it does. --------- Signed-off-by: rickyx <rickyx@anyscale.com>
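The counter-vs-gauge distinction above can be made concrete with a small sketch (an illustration, not Ray's metrics code): blocked time is accumulated into a value that only ever grows, rather than being set and later reset like a gauge.

```python
import time
from contextlib import contextmanager


class Counter:
    """Monotonic metric: suitable for something like iter_total_blocked_seconds."""

    def __init__(self):
        self.total = 0.0

    def inc(self, amount):
        self.total += amount


@contextmanager
def track_blocked(counter):
    # Accumulate wall-clock time spent blocked into the counter. Unlike a
    # gauge, the counter is never reset to 0 when an iterator is GCed, so
    # dashboard charts never show the confusing drop back to zero.
    start = time.monotonic()
    try:
        yield
    finally:
        counter.inc(time.monotonic() - start)
```

A dashboard would then plot the counter's rate of change rather than its raw value.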
Signed-off-by: dayshah <dhyey2019@gmail.com>
Some C++ improvements to our codebase. --------- Signed-off-by: dentiny <dentinyhao@gmail.com>
…coloring (#48473) Signed-off-by: dayshah <dhyey2019@gmail.com>
## Why are these changes needed? Adds a sentinel value to make it possible to sort. Fixes #42142 --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
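One common use of a sentinel when sorting is handling keys that may be None, which would otherwise raise a comparison TypeError. A minimal sketch of the technique (an assumed illustration, not the PR's actual code):

```python
def sort_key(value):
    # Wrap every key in a tuple: the leading flag acts as the sentinel,
    # ordering None values after all real values so mixed None / non-None
    # keys never trigger a comparison TypeError.
    return (1, 0) if value is None else (0, value)


def sort_with_none(values):
    return sorted(values, key=sort_key)
```

The same wrapping works as the `key=` of any sort over rows whose sort column may contain nulls.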
… batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)
Signed-off-by: Aziz Belaweid <40893766+azayz@users.noreply.github.com>
Signed-off-by: ltbringer <amresh.venugopal@wise.com>
…ard (#48745) ## Why are these changes needed? Currently, there are cases where the `Rows Outputted` value in the `Ray Data Overview` section of the Ray Job page says "0" even after dataset execution completes. The root cause is that we clear iteration/execution metrics after the dataset completes. This was previously used to "reset" the metrics to 0 after dataset completion, so that the last emitted value would not persist on the dashboard after the job finishes. Now that we display rates on the dashboard, this hack is no longer needed, and we can skip the metrics clearing. Fixed result: (screenshot) ## Related issue number Closes #44635 --------- Signed-off-by: Scott Lee <sjl@anyscale.com>
## Why are these changes needed? Fixed a typo. --------- Signed-off-by: mohitjain2504 <87856435+mohitjain2504@users.noreply.github.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Gene Der Su <gdsu@ucdavis.edu>
Signed-off-by: dayshah <dhyey2019@gmail.com>
The package is too old and there are no tests; this removes the outdated paramiko package from the Ray image. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
I came across a lancedb issue, lancedb/lancedb#1480, which discloses a high-severity CVE in the `retry` package. Since, like lancedb, Ray has only one use of the `retry` package, I took the same approach as lancedb/lancedb#1749: an in-house replacement that names all variables with explicit units and default values. --------- Signed-off-by: dentiny <dentinyhao@gmail.com>
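The approach described above can be sketched as a small decorator; the parameter names (with units and defaults) are illustrative of the naming style, not Ray's actual helper:

```python
import time


def retry(max_attempts=3, initial_backoff_seconds=0.1, backoff_multiplier=2.0):
    """Minimal in-house replacement for the `retry` package.

    Every knob carries its unit and a default, so call sites read
    unambiguously, e.g. retry(max_attempts=5, initial_backoff_seconds=0.5).
    """
    def decorator(func):
        def wrapper(*args, **kwargs):
            backoff = initial_backoff_seconds
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the real error
                    time.sleep(backoff)
                    backoff *= backoff_multiplier
        return wrapper
    return decorator
```

Dropping the dependency also removes the CVE exposure for a single call site.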
nightly is inherently unstable, and latest is just the last stable release; neither of them should block releases. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
fix NCCL_BFLOAT16 typo in TORCH_NCCL_DTYPE_MAP
also uses 1.3 syntax and heredoc for multiline commands Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Support reading a Hudi table into a Ray Dataset. --------- Signed-off-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
…s download (#48756) Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
…49017) Primary changes: * Makes sort and shuffle tests use an autoscaling cluster * Removes GCE variants (because we never run them) --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…#49116) ## Why are these changes needed? This addresses #45541 and #49014. --------- Signed-off-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
…hrough shared memory (#48957) Some torch.dtypes don't have a numpy equivalent. We use numpy to store tensor data zero-copy in the object store. To support these tensors, we first view the tensor with a common dtype (uint8) and then view it as a numpy array; during deserialization, we use another view to get back to the original dtype. Closes #48141. --------- Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
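The round trip described above can be demonstrated in pure Python with `memoryview`, standing in for the torch/numpy views: reinterpret typed data as raw uint8 bytes for storage, then view the same bytes back as the original element type.

```python
def to_uint8_view(buf):
    # "Serialize": view the underlying bytes as unsigned 8-bit integers
    # (format code "B"), without copying the data.
    return memoryview(buf).cast("B")


def from_uint8_view(view, format_char):
    # "Deserialize": view the same bytes back as the original element type,
    # e.g. "d" for float64. Again no copy is made.
    return view.cast(format_char)
```

The actual PR does the equivalent with `torch.Tensor.view(torch.uint8)` and numpy array views, which is what makes unsupported-by-numpy dtypes storable zero-copy.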
#48547) OptunaSearch currently uses in-memory storage and does not provide a way to configure any other storage backend (e.g., a database). This update adds a configurable parameter to OptunaSearch that can be set to a valid Optuna storage. Signed-off-by: Ravi Dalal <ravidalal@google.com> Signed-off-by: Ravi Dalal <12639199+ravi-dalal@users.noreply.github.com> Signed-off-by: Hongpeng Guo <hpguo@anyscale.com> Co-authored-by: Hongpeng Guo <hg5@illinois.edu> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Hongpeng Guo <hpguo@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
…nt_t` system, etc..). (#49191)
In an effort to have fewer but more useful release tests, this PR removes all of the training release tests except for the distributed Parquet training and chaos distributed Parquet training release tests. Notably, this PR removes variants for different input formats, and single-node training release tests. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: dayshah <dhyey2019@gmail.com>
…48117) Signed-off-by: Matti Picus <matti.picus@gmail.com>
…49174) Signed-off-by: Alan Guo <aguo@anyscale.com>
… logs (#49038) Signed-off-by: win5923 <ken89@kimo.com>
Signed-off-by: Colton Woodruff <coltwood93@gmail.com>
…cal training example script. (#49127)
Signed-off-by: hjiang <dentinyhao@gmail.com>
…ker (#48708) Make it clear that you can specify Ray Core custom resources here. Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: jukejian <jukejian@bytedance.com> Co-authored-by: srinathk10 <68668616+srinathk10@users.noreply.github.com>
…s to read data from the Raylet socket (#49163) Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: jukejian <jukejian@bytedance.com>
## Why are these changes needed? This PR enables passing kwargs to map tasks, accessible via `TaskContext.kwargs`. This is a prerequisite for fixing #49207, and optimization rules can use this API to pass additional arguments to map tasks. --------- Signed-off-by: Hao Chen <chenh1024@gmail.com>
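The shape of the API described above can be sketched as follows; `TaskContext` and `run_map_task` here are hypothetical mirrors of the interface, not Ray Data's actual classes:

```python
from dataclasses import dataclass, field


@dataclass
class TaskContext:
    # Extra keyword arguments attached by optimization rules; the map task
    # can read them at execution time.
    kwargs: dict = field(default_factory=dict)


def run_map_task(fn, batch, ctx):
    # The map function receives the per-task extras as ordinary kwargs.
    return fn(batch, **ctx.kwargs)
```

An optimization rule would populate `ctx.kwargs` when planning, and the map function would pick the values up without any change to how batches flow.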
## Why are these changes needed? Add an ExecutionCallback interface to allow hooking custom callback logic into certain execution events. This can be useful for optimization rules. --------- Signed-off-by: Hao Chen <chenh1024@gmail.com>
## Why are these changes needed? Add a `DEPLOY_FAILED` deployment status. - If replicas fail to start or health checks fail _during_ deploy, the deployment status transitions to `DEPLOY_FAILED` - If any deployments are in `DEPLOY_FAILED`, the application is also in `DEPLOY_FAILED`. ## Related issue number #48654 --------- Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: Gene Der Su <gdsu@ucdavis.edu>
Greetings from ElastiFlow! This PR introduces a new ClickHouseDatasource connector for Ray, which provides a convenient way to read data from ClickHouse into Ray Datasets. The ClickHouseDatasource is particularly useful for users working with large datasets stored in ClickHouse who want to leverage Ray's distributed computing capabilities for AI and ML use cases. We found this functionality useful while evaluating ML technologies and wanted to contribute it back. Key features and benefits: 1. **Seamless integration**: the ClickHouseDatasource allows seamless integration of ClickHouse data into Ray workflows, letting users easily access their data and apply Ray's parallel computation. 2. **Custom query support**: users can specify custom columns and orderings, allowing flexible query generation directly from the Ray interface; reading only the necessary data improves performance. 3. **User-friendly API**: the connector abstracts the complexity of setting up and querying ClickHouse, providing a simple API that lets users focus on data analysis rather than data extraction. Tested locally with a ClickHouse table containing ~12M records. (screenshot) PLEASE NOTE: this PR is a continuation of #48817, which was closed without merging. --------- Signed-off-by: Connor Sanders <connor@elastiflow.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
## Why are these changes needed? Allow deployment to transition out of `DEPLOY_FAILED` into `UPDATING`. Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
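The two Serve deployment-status commits above describe a small state machine: a deploy that fails lands in `DEPLOY_FAILED`, and a subsequent deploy moves it back into `UPDATING`. A minimal sketch (the status names beyond `DEPLOY_FAILED` and `UPDATING`, and the transition table itself, are assumptions for illustration):

```python
from enum import Enum


class DeploymentStatus(Enum):
    UPDATING = "UPDATING"
    HEALTHY = "HEALTHY"
    DEPLOY_FAILED = "DEPLOY_FAILED"


ALLOWED = {
    # A deploy in progress either succeeds or fails.
    DeploymentStatus.UPDATING: {DeploymentStatus.HEALTHY, DeploymentStatus.DEPLOY_FAILED},
    # The change above: a failed deploy may be retried with a new deploy.
    DeploymentStatus.DEPLOY_FAILED: {DeploymentStatus.UPDATING},
    DeploymentStatus.HEALTHY: {DeploymentStatus.UPDATING},
}


def transition(current, target):
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Under the first commit, an application is itself `DEPLOY_FAILED` whenever any of its deployments is; under the second, redeploying moves it out of that state.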
See Commits and Changes for more details.
Created by pull[bot]