[pull] master from ray-project:master #2625
Merged
Move `test_deployment_scheduler.py` from small to medium. Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
If `args` is not a list, an exception will be thrown in the if statement above.
…ate_failed_env_runners=False` to `True`). (#48286)
Implement full recursive cancellation at the Serve layer. Currently, at the time a parent request is cancelled, if the child request has not been assigned to a replica (meaning the replica scheduler in the router is still processing that request) AND the parent request is not directly blocking and waiting on the result of the child request, then the child request will not be correctly cancelled. Note that if the second condition were flipped, i.e. the parent request _was_ waiting on the result at the time of cancellation, an `asyncio.CancelledError` would be sent directly to the replica scheduler and the child request would be cancelled successfully; obviously this cannot be guaranteed. This PR adds a cache of the asyncio tasks that the router launches to assign a request to a replica. When a parent request is cancelled, any "pending assignment requests" in the cache are cancelled as well. Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
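The PR text doesn't include the implementation, but the pending-assignment cache can be sketched in plain asyncio. Everything below (`_pending_assignments`, `_schedule`, the function names) is hypothetical illustration, not Serve's actual router code:

```python
import asyncio
from collections import defaultdict

# Hypothetical cache of in-flight "assign this request to a replica" tasks,
# keyed by the parent request's ID.
_pending_assignments: dict[str, set[asyncio.Task]] = defaultdict(set)

async def _schedule(request):
    # Stand-in for the replica scheduler; in reality this picks a replica.
    await asyncio.sleep(3600)

async def assign_to_replica(parent_request_id: str, request):
    task = asyncio.create_task(_schedule(request))
    _pending_assignments[parent_request_id].add(task)
    task.add_done_callback(
        lambda t: _pending_assignments[parent_request_id].discard(t)
    )
    return await task

def cancel_pending_assignments(parent_request_id: str) -> None:
    # Called when the parent request is cancelled: cancel every child request
    # that has not yet been assigned to a replica.
    for task in _pending_assignments.pop(parent_request_id, set()):
        task.cancel()
```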
Stacked on: #48223 Reimplements the `.bind()` codepath to avoid using the Ray DAG codepath which adds a lot of complexity for little benefit. The DAG traversal code is much simpler, around 100 lines of self-contained code in `build_app.py`. I've also added unit tests for it. There should be no behavior change here. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…joco 3.2.4, pettingzoo 1.24.3 supersuit 3.9.3)."" (#48308)
Previously the error message was injected directly as the HTTP status message Signed-off-by: Richo Healey <richo@anyscale.com>
Fix `router.py`. Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
…g compilation (#47757) Leaf nodes are nodes that are not output nodes and have no downstream nodes. If a leaf node raises an exception, it will not be propagated to the driver, so this PR raises an exception when a leaf node is found during compilation. An alternative solution would be to implicitly add leaf nodes to the MultiOutputNode: `execute` can return multiple CompiledDAGRefs, and the desired UX is to implicitly add leaf nodes to the MultiOutputNode without returning their references. For example, if a MultiOutputNode contains 3 DAG nodes (2 normal DAG nodes + 1 leaf node), then `x, y = compiled_dag.execute(input_vals)` would not return the ref for the leaf node. However, the leaf node's ref would then be GC'ed inside `execute`, and `CompiledDAGRef.__del__` calls `get` if it was never called, which would turn `execute` into a synchronous rather than an asynchronous operation, which is not acceptable.
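As a rough illustration of the chosen behavior (not the actual compiler code), the compile-time check boils down to flagging any node that is neither an output node nor upstream of anything; the toy DAG below deliberately contains one leaf, so the sketch raises:

```python
# Toy DAG: edges map each node to its downstream nodes.
downstream = {"a": ["b", "c"], "b": ["out"], "c": [], "out": []}
output_nodes = {"out"}

def find_leaf_nodes(downstream, output_nodes):
    # Leaf nodes: not an output node and no downstream nodes, so an
    # exception they raise would never propagate to the driver.
    return [
        node for node, downs in downstream.items()
        if node not in output_nodes and not downs
    ]

leaves = find_leaf_nodes(downstream, output_nodes)
if leaves:
    raise ValueError(
        f"Compiled DAG contains leaf nodes {leaves}; their exceptions would "
        "never reach the driver. Add them to the MultiOutputNode instead."
    )
```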
…#48358) Some tests are failing when we duplicate the original release tests. Marking them as unstable to be non-blocking. Will mark as stable once they are truly stable. Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
See #48228 (comment). Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
GcsServer manages several IO Contexts and their threads, and hard-codes which class uses which IO Context. This is not flexible: we can't easily declare another IO Context and adapt some classes to it without a pervasive code change. Introduces an IoContextProvider that manages IO Contexts based on a static, compile-time Policy whose author determines the class-to-IO Context mapping. This way we can change the mappings in a central place without a big chunk of changes everywhere. Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…ls.cache decorators on methods can lead to memory leaks (#48186) When running the flake8 pre-commit hook on Python 3.12, the following error occurs because the pinned version of flake8 is too old. ![image](https://github.com/user-attachments/assets/bf56370f-04d7-4b9a-aea9-e1c578547ed4) Versions: python 3.12.7, flake8 7.1.1, flake8-bugbear 24.8.19. Signed-off-by: win5923 <ken89@kimo.com>
…3FileSystem` (#48216) When PyArrow does not have support for S3, accessing `pyarrow.fs.S3FileSystem` raises an exception. Fix by checking whether `S3FileSystem` is importable before doing this typecheck. Signed-off-by: Charles Coulombe <ccoulombe@users.noreply.github.com> Signed-off-by: Charles Coulombe <charles.coulombe@calculquebec.ca>
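A minimal sketch of the guarded type check (the helper name is hypothetical; the real fix lives in Ray Data's filesystem utilities):

```python
import pyarrow.fs

def _is_s3_filesystem(fs: "pyarrow.fs.FileSystem") -> bool:
    # PyArrow builds without S3 support raise ImportError as soon as
    # S3FileSystem is touched, so guard the isinstance check.
    try:
        from pyarrow.fs import S3FileSystem
    except ImportError:
        return False
    return isinstance(fs, S3FileSystem)
```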
…oing_requests` (#47681) (#48274) This PR modifies the `actor_options` used when deploying replicas. A deployment will use the configured `max_ongoing_requests` attribute of the deployment config as the replica's `max_concurrency` if the concurrency is not explicitly set. This prevents the replica's `max_concurrency` from capping `max_ongoing_requests`. Closes #47681. Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
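Conceptually the change is a defaulting rule on the replica actor's options; a hedged sketch (function and argument names are illustrative, not Serve's internals):

```python
def resolve_replica_actor_options(actor_options: dict, max_ongoing_requests: int) -> dict:
    # If the user didn't set max_concurrency explicitly, raise the actor's
    # asyncio concurrency cap to match max_ongoing_requests so it can't
    # silently throttle the deployment below its configured limit.
    options = dict(actor_options)
    options.setdefault("max_concurrency", max_ongoing_requests)
    return options

print(resolve_replica_actor_options({}, max_ongoing_requests=100))
# {'max_concurrency': 100}
```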
This PR moves `ProxyStatus` out of the `_private` directory, allowing it to be included in the API docs. This is the final attribute of `ServeStatus` that needs to be included in the documentation. Closes #43394. Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Pyarrow 18 causes errors on the MacOS build. Since Pyarrow 18 is not officially supported on Ray Data yet, we can just pin it for now. Signed-off-by: Scott Lee <sjl@anyscale.com>
#48415) I was initially confused that I couldn't join another paused task while a debugger was in "continue" mode. Signed-off-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: bhuang <bhuang@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Adds a metric to monitor event loop lag for `instrumented_io_context`. By default this is applied only to IO Contexts in GCS. Every 250ms we post an async task and use std::chrono::steady_clock to record the lag. The metric is a GAUGE with the thread name recorded. I tested this on my laptop and it works: it's ~0 on an idle cluster, and ~1000ms if I use `RAY_testing_asio_delay_us=event_loop_lag_probe=1000000:1000000`. Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
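The actual implementation is C++/asio inside the GCS, but the probing idea translates directly to an asyncio sketch (the callback and names below are illustrative):

```python
import asyncio
import time

async def probe_event_loop_lag(report, interval_s: float = 0.25) -> None:
    # Every interval_s, ask the loop to wake us up and measure how much
    # later than requested it actually did; the delta is the loop lag.
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval_s)
        lag_ms = max(time.monotonic() - start - interval_s, 0.0) * 1000.0
        report(lag_ms)

# Example: feed the samples into a gauge-like callback.
# asyncio.run(probe_event_loop_lag(lambda ms: print(f"event loop lag: {ms:.1f} ms")))
```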
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
CoreWorkerMemoryStore has a GetAsync method that invokes callbacks when the object is ready, either immediately or on a Put() call. However, despite the name, the call is not async: it just runs the callbacks on the GetAsync() or Put() call stack. If a mutex is held by any frame in that call stack, it remains held in the callback, ripe for deadlocks. This adds another ctor argument, io_context, and posts callbacks to that context to avoid those deadlocks. Note that many C++ tests create CoreWorkerMemoryStore without an io_context, so I kept the option of creating an owned io_context with a thread. Signed-off-by: Ruiyang Wang <rywang014@gmail.com> Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
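A toy Python analogue of the change (the real code is C++; the point is that callbacks are posted to an event loop instead of being invoked on the caller's stack while the store's lock is held):

```python
import asyncio
import threading

class ToyMemoryStore:
    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._lock = threading.Lock()
        self._objects = {}
        self._waiters = {}

    def get_async(self, key, callback):
        with self._lock:
            if key not in self._objects:
                # Not ready yet: remember the callback for a later put().
                self._waiters.setdefault(key, []).append(callback)
                return
            value = self._objects[key]
        # Ready: post the callback instead of calling it under the lock.
        self._loop.call_soon_threadsafe(callback, value)

    def put(self, key, value):
        with self._lock:
            self._objects[key] = value
            waiters = self._waiters.pop(key, [])
        for cb in waiters:
            # Same rule: never run user callbacks on the put() call stack.
            self._loop.call_soon_threadsafe(cb, value)
```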
Gives clangd the ability to fully index Ray's C++ code after you run `bazel run :refresh_compile_commands`. This gives you LSP support plus clang-tidy linting in the IDE to catch bad C++ practices, based on the rules already set up in .clang-tidy (https://github.com/ray-project/ray/blob/master/.clang-tidy). A lot of these don't seem to be followed and seem contrary to some of the existing code's style, but some are critical for not giving up performance on small things. Trimming the list down to the more important lints could be helpful. Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
…8447) Adds dependency injection to create a custom `DeploymentHandle` type in `build_app`. This will be used to create a special type of handle for local testing mode. This also allowed me to remove the mocking/patching happening in the unit tests. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
https://github.com/ray-project/ray/pull/48693/files#diff-81d7373cd5567e997c002b244c491e6b3498d206b12c093f4dc4d30e9b5848af added a test that uses tensorflow. We currently need to skip all tensorflow-related tests for python 3.12 since we don't support tensorflow for python 3.12. Also this is a test for the deprecation of tensorflow ;) Test: - CI Signed-off-by: can <can@anyscale.com>
…tensors > 2Gb) (#48629) Enabling V2 Arrow Tensor extension type by default (allowing tensors > 2Gb) --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…Error (#48636) In a recent investigation, we found that when `ray._private.internal_api.free()` is called from a task at the same time as a Raylet is gracefully shutting down, the task might fail with an application-level broken pipe IOError. This resulted in job failure without any task retries. However, since the broken pipe happens because of the unhealthiness of the local Raylet, the error should be a system-level error and should be retried automatically. Updated changes in commit [01f5f11](01f5f11): when an IOError is received in `CoreWorker::Delete`, throw a system error exception so that the task can be retried. Why not add the exception check in the `free_objects` function? It is better to put the logic in `CoreWorker::Delete` because it also covers other languages; `CoreWorker::Delete` is intended to be callable from all languages and is not called in other Ray-internal code paths. Why not crash the worker when an IOError is encountered in `WriteMessage`? `QuickExit()` exits the process without executing any of the worker's shutdown logic, and calling it directly during task execution could leak resources. Additionally, `WriteMessage` is also called during graceful shutdown, when the local Raylet may legitimately be unreachable; in that scenario we shouldn't exit early but let the shutdown logic finish. The behavior of graceful vs. force shutdown is also not clear in the code and may need some effort to clarify; a TODO is added in this PR. Updated changes in commit [2029d36](2029d36): when an IOError is received in `free_objects()`, throw a system error exception so that the task can be retried. Changes in commit [9d57b29](9d57b29): today, the internal `free` API deletes objects from the local Raylet object store by writing a message through a socket. When the write fails because the local Raylet has terminated, there is already logic to quick-exit the task; however, the existing termination check didn't cover the case where the local Raylet is a zombie process and an IOError happens during the write. This fix updates the check criteria and fails the task when the Raylet process is terminated or the write returns an IOError. Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
When calling `xxx.map_groups(..., batch_format="...")`, we may invoke the sort function and create empty blocks, which still use pyarrow by default. Then, when we invoke another sort call on top of it, we hit `AttributeError: 'DataFrame' object has no attribute 'num_rows'` since we use the first block's type even though the blocks may have different types. See more details in #46748. Closes #46748. Signed-off-by: Xingyu Long <xingyulong97@gmail.com> Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
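A hedged illustration of the kind of pipeline described (not necessarily the exact reproducer from #46748):

```python
import ray

ds = ray.data.from_items([{"g": i % 2, "v": i} for i in range(10)])

# map_groups() sorts by the group key internally and may create empty
# blocks; mixing a pandas batch_format with pyarrow-backed empty blocks
# previously broke a later sort with
# "'DataFrame' object has no attribute 'num_rows'".
out = (
    ds.groupby("g")
      .map_groups(lambda df: df, batch_format="pandas")
      .sort("v")
)
print(out.take(3))
```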
Created by release automation bot. Update with commit e393a71 Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…es (#48478) When writing blocks to Parquet, there might be blocks with fields that differ ONLY in nullability; by default this is rejected, since some blocks have a different schema than the ParquetWriter. However, we can allow it by tweaking the schema. This PR goes through all blocks before writing them to Parquet and merges schemas that differ only in the nullability of their fields. It also casts each table to the newly merged schema so that the write can happen. Closes #48102. Signed-off-by: rickyx <rickyx@anyscale.com>
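The mechanism can be sketched with plain PyArrow, assuming (as the PR describes) that schemas differing only in nullability are unified and each block is cast before writing:

```python
import pyarrow as pa

nullable = pa.schema([pa.field("x", pa.int64(), nullable=True)])
required = pa.schema([pa.field("x", pa.int64(), nullable=False)])

block_a = pa.Table.from_arrays([pa.array([1, 2], type=pa.int64())], schema=nullable)
block_b = pa.Table.from_arrays([pa.array([3, 4], type=pa.int64())], schema=required)

# Merge schemas that differ only in nullability, then cast every block to
# the merged schema so a single ParquetWriter can accept them all.
merged = pa.unify_schemas([block_a.schema, block_b.schema])
blocks = [t.cast(merged) for t in (block_a, block_b)]
print(merged)
```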
Signed-off-by: dentiny <dentinyhao@gmail.com>
ray.util.state.get_actor currently has a type annotation specifying that the function returns an Optional[Dict], but it actually returns [ActorState](https://github.com/ray-project/ray/blob/3141dfe4031cc715515b365278cd1d6b8955154e/python/ray/util/state/common.py#L416) (as the docstring specifies). This pull request simply fixes that type annotation. Signed-off-by: Miguel Teixeira <miguel.teixeira@poli.ufrj.br>
…48746) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
We currently report `iter_total_blocked_seconds` and `iter_user_seconds` as **Gauge** metrics while we track them as counters, i.e.: for each iteration we have a timer that sums locally into an aggregated value (the total blocked seconds), and when the iteration ends or the iterator is GC'ed, the gauge value is reset to 0. This confuses users, since a counter value (total time blocked on a dataset) should never drop back to 0. Signed-off-by: rickyx <rickyx@anyscale.com>
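For reference, a cumulative metric exposed through Ray's public metrics API looks roughly like this; the dashboard change itself touches Ray Data's internal metrics plumbing, so treat this only as an analogy:

```python
from ray.util.metrics import Counter

# A counter only ever increases, so a rate over it stays meaningful and the
# value never "drops back to 0" when an iterator goes away.
iter_blocked_seconds = Counter(
    "iter_total_blocked_seconds",
    description="Cumulative seconds the iterator spent blocked waiting on data.",
)
iter_blocked_seconds.inc(0.8)
```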
Signed-off-by: dayshah <dhyey2019@gmail.com>
Some C++ improvements to our codebase. --------- Signed-off-by: dentiny <dentinyhao@gmail.com>
…coloring (#48473) Signed-off-by: dayshah <dhyey2019@gmail.com>
Adds a sentinel value to make sorting possible. Fixes #42142. Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
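The PR text doesn't show the sentinel itself, but the general trick is a value that compares as smaller than everything else so rows with missing fields stay sortable; a minimal sketch:

```python
import functools

@functools.total_ordering
class _NullSentinel:
    """Stands in for missing values so heterogeneous rows stay sortable."""

    def __eq__(self, other):
        return isinstance(other, _NullSentinel)

    def __lt__(self, other):
        # Sorts before every non-sentinel value.
        return not isinstance(other, _NullSentinel)

NULL_SENTINEL = _NullSentinel()

rows = [{"v": 3}, {"v": None}, {"v": 1}]
rows.sort(key=lambda r: NULL_SENTINEL if r["v"] is None else r["v"])
print(rows)  # the None row sorts first
```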
… batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)
Signed-off-by: Aziz Belaweid <40893766+azayz@users.noreply.github.com>
Signed-off-by: ltbringer <amresh.venugopal@wise.com>
…ard (#48745) Currently, there are cases where the `Rows Outputted` value on the Ray Job page's `Ray Data Overview` section says "0" even after the dataset execution completes. The root cause is that we clear iteration/execution metrics after the dataset completes. This was previously used to "reset" the metrics to 0 after dataset completion so that the last emitted value would not persist on the dashboard after the job finishes. Now that we display rates on the dashboard, this hack is no longer needed, and we can skip the metrics clearing. Fixed result: <img width="1860" alt="Screenshot at Nov 14 12-11-24" src="https://github.com/user-attachments/assets/35061b3f-9359-412b-8ab2-f4bcce412994"> Closes #44635. Signed-off-by: Scott Lee <sjl@anyscale.com>
Fixed typo. Signed-off-by: mohitjain2504 <87856435+mohitjain2504@users.noreply.github.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Gene Der Su <gdsu@ucdavis.edu>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Too ancient, and there are no tests. This should remove the outdated paramiko package from the Ray image. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
I randomly found a lancedb issue, lancedb/lancedb#1480, which discloses a high-severity CVE. Since Ray, like lancedb, has only one use case for the `retry` package, I took the same approach as lancedb/lancedb#1749, which also names all variables better, with units and default values. Signed-off-by: dentiny <dentinyhao@gmail.com>
Nightly is inherently unstable, and latest is just the last stable release; neither of them should block releases. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Fix `NCCL_BFLOAT16` typo in `TORCH_NCCL_DTYPE_MAP`.
Also uses 1.3 syntax and heredocs for multiline commands. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Support reading from a Hudi table into a Ray dataset. Signed-off-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
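Presumably exposed through a `ray.data.read_hudi`-style reader; a hedged usage sketch (the URI and any extra parameters are placeholders, not confirmed by the PR text):

```python
import ray

# Point the reader at a Hudi table's base path; storage options or column
# selection, if supported, would be passed as keyword arguments.
ds = ray.data.read_hudi("s3://my-bucket/warehouse/my_hudi_table")
print(ds.schema())
print(ds.take(5))
```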
…s download (#48756) Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
See Commits and Changes for more details.
Created by pull[bot]