forked from ray-project/ray
upstream changes #6
Merged
Signed-off-by: pdmurray <peynmurray@gmail.com>
This PR adds log rotation for Ray Serve, letting it inherit the rotation parameters (max_bytes, backup_count) from Ray Core. This brings a more consistent logging experience to Ray, as opposed to having the serve/ folder grow forever while the other logs rotate.
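As a rough illustration of what rotation with these two parameters looks like (a sketch, not Serve's actual implementation; the function name and format string are hypothetical), Python's standard `RotatingFileHandler` already exposes both knobs:

```python
import logging
from logging.handlers import RotatingFileHandler

def build_serve_file_handler(log_path: str, max_bytes: int,
                             backup_count: int) -> RotatingFileHandler:
    # Roll the file over once it reaches max_bytes, keeping backup_count
    # rotated copies (serve.log, serve.log.1, ..., serve.log.<backup_count>).
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes,
                                  backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler
```

With max_bytes=0 (the stdlib default) the file would never roll over, which is exactly the unbounded-growth behavior this PR removes.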
This PR adds additional information to the driver task event, namely the driver task type and its running/finished timestamps. This allows users (e.g., the dashboard) to inspect driver tasks more easily. This PR also exposes the exclude_driver flag in the state API, allowing requests through HTTP and ListApiOptions to get driver tasks, while the default behavior of the state API is still to exclude them. This PR also filters out any tasks without task_info to prevent missing-data issues.
If a deployment is repeatedly failing, perform exponential backoff so that we do not try to restart the replica at a very fast rate. Closes #31121
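The backoff described here amounts to a capped doubling of the restart delay. A minimal sketch (the function name and constants are illustrative, not the actual Serve controller code):

```python
def restart_backoff_s(num_consecutive_failures: int,
                      base_s: float = 1.0,
                      factor: float = 2.0,
                      max_s: float = 64.0) -> float:
    # Delay before the next restart attempt: 1s, 2s, 4s, ... capped at max_s,
    # so a crash-looping replica is not restarted at a very fast rate.
    return min(base_s * (factor ** num_consecutive_failures), max_s)
```

A successful start would reset num_consecutive_failures to zero, returning the deployment to immediate restarts.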
Co-authored-by: Cade Daniel <edacih@gmail.com> Closes #31880
- Adds back the metrics page
- Adds a button to visit the new dashboard and one to go back
- Adds buttons for leaving feedback and viewing docs
- Adds color to the status badges of the tasks and placement groups tables
- Adds an alert when Grafana is not running
- Fixes the copy button icon
- Separates the metrics page into sections (both new IA and old IA)
…es to specify it (#31959) This PR clarifies where RunConfig can be specified. When multiple configs are specified in different locations (in the Tuner and Trainer), this PR also logs which RunConfig is actually used. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
581cd4e moved some test files, breaking a link from the documentation. cc @iycheng 3343c76 changed the MapBatches string representation, breaking a docstring test. cc @peytondmurray Signed-off-by: Kai Fricke <kai@anyscale.com>
…cution state, and task submitters. (#31986)
…d management (#31979) Before this PR, stalls in the consumer thread would fully block the control loop. This provides backpressure, but at the cost of performance. This PR fully decouples the consumer thread from the control loop thread, allowing execution to proceed so long as there is sufficient object_store_memory budget remaining. It also adds a progress bar for the output queue, showing the number of output bundles consumed and the number of queued bundles for output:
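The decoupling described above can be thought of as a simple admission check on the control loop (names are hypothetical; the real implementation tracks memory through the streaming executor's resource accounting):

```python
def can_schedule_more(inflight_bytes: int,
                      queued_output_bytes: int,
                      object_store_memory_budget: int) -> bool:
    # Execution proceeds as long as the bytes held by running tasks plus the
    # bytes queued for the (possibly stalled) consumer fit in the budget.
    # A slow consumer lets queued_output_bytes grow, eventually exhausting the
    # budget and applying backpressure without blocking the control loop.
    return inflight_bytes + queued_output_bytes < object_store_memory_budget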
…1993) Remove legacy memory monitor from worker submission code path, as that was already disabled by default in Ray 2.2
The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections? _Originally posted by @ericl in #27573 (review) Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes. Related issue number Addresses #31741
…tring (#31840) Signed-off-by: rickyyx <rickyx@anyscale.com> This PR introduces a flag, RAY_task_events_send_batch_size, that controls the number of task events sent to GCS in a batch. With the default setting, each core worker sends 10K task events per second to GCS, where GCS can handle 10K task events in ~50 milliseconds. This PR also adjusts the worker-side buffer limit to 1M with the new batching setting. The PR adds some debug information as well.
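The batching behavior amounts to chunking the worker-side buffer before each send. A sketch (the real sender also handles retries and buffer eviction, which are omitted here):

```python
from typing import Iterable, List

def batch_events(events: Iterable[dict], batch_size: int) -> List[List[dict]]:
    # Group buffered task events into batches of at most batch_size before
    # sending each batch to GCS, instead of one RPC per event.
    batches, current = [], []
    for ev in events:
        current.append(ev)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush the final partial batch
    return batches
```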
…e_node` release test (#31904) The release test read_parquet_benchmark_single_node fails, due to using Python 3.7 and not having the pickle5 package installed. A similar issue is discussed in #26225. We found that the test failure is contained to the portion which tests a Dataset with a filter expression (the error is related to pickling with this filter expression). Therefore, we will temporarily disable this portion of the test, while keeping the rest of the release test (which I verified passes on the same cluster). We can come back to this in the future and fix the case with filter. Example of release test successfully running with the filter case removed. Signed-off-by: Scott Lee <sjl@anyscale.com>
…imizer (#31985) Signed-off-by: amogkam <amogkamsetty@yahoo.com> The following operations call map_batches directly: add_column, drop_columns, select_columns, random_sample. In this PR we add e2e tests for these examples with the new optimizer. In a future PR, we should refactor so that these operations do not call into map_batches and instead have their own logical operator.
…`MapOperator` actor pool. (#31987) This PR adds support for autoscaling to the actor pool implementation of `MapOperator` (this PR is stacked on top of #31986). The same autoscaling policy as the legacy `ActorPoolStrategy` is maintained, as well as providing more aggressive and sensible downscaling via: * If there are more idle actors than running/pending actors, scale down. * Once we're done submitting tasks, cancel pending actors and kill idle actors. In addition to autoscaling, `max_tasks_in_flight` capping is also implemented.
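The downscaling policy described above is simple enough to state directly (a hypothetical helper, not the actual `MapOperator` code):

```python
def should_scale_down(num_idle: int, num_running: int, num_pending: int,
                      done_submitting: bool) -> bool:
    # Scale down when idle actors outnumber running/pending ones, or when all
    # tasks have been submitted and any actor is sitting idle anyway.
    if done_submitting and num_idle > 0:
        return True
    return num_idle > num_running + num_pending
```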
<img width="1731" alt="Screen Shot 2023-01-24 at 1 01 25 AM" src="https://user-images.githubusercontent.com/18510752/214250430-9bac7b06-56fb-44b3-a044-3eaf726d1469.png"> This PR adds the cluster utilization page in the landing view Co-authored-by: Alan Guo <aguo@anyscale.com>
This PR adds a logical operator for randomize_block_order(). The change includes:
- Introduce AbstractAllToAll for all logical operators converted to AllToAllOperator
- RandomizeBlocks logical operator for randomize_block_order()
- _internal/planner to move logic for Planner here and have a generated function for randomize_blocks. This can be used later to create MapOperator/AllToAllOperator.
Add code owner to GCS module.
Signed-off-by: Cheng Su <scnju13@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com> This PR adds the timeline to the Ray dashboard using the new task backend. It implements the task events -> chrome tracing conversion logic; most of the code is copied from existing code. TODO: add unit tests (although we already have one, it is a pretty weak test). It creates a timeline endpoint that can 1. download the JSON file (to download and upload manually) and 2. return the JSON array buffer (to load onto Perfetto directly). It also creates a subsection with three features: 1. a download button, 2. an open-Perfetto button, 3. an instruction accordion.
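The task-event-to-chrome-tracing conversion mentioned here boils down to emitting Trace Event Format records. A minimal sketch of one such record (the field mapping is illustrative, not the dashboard's exact schema):

```python
def to_chrome_trace_event(task_name: str, start_ms: float, end_ms: float,
                          node_id: str, worker_id: int) -> dict:
    # A 'complete' event (ph='X') carries both start and duration, so each
    # finished task becomes a single bar in Perfetto / chrome://tracing.
    return {
        "name": task_name,
        "ph": "X",
        "ts": start_ms * 1000.0,             # Trace Event Format uses microseconds
        "dur": (end_ms - start_ms) * 1000.0,
        "pid": node_id,                      # group bars by node
        "tid": worker_id,                    # one lane per worker
    }
```

A list of these dicts serialized as a JSON array is directly loadable by Perfetto, which is what the endpoint's array-buffer mode returns.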
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
… (where it's simple to add). (#32475)
In #28149, RayActorError is constructed with a str as cause, but this is not an accepted type. This leads to hitting the assertion in the else case, assert isinstance(cause, ActorDiedErrorContext), on L283.
Signed-off-by: Pratik <pratikrajput1199@gmail.com>
…ainers (#32471) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…rate file (#32457) Experiment state management is currently convoluted. We keep track of many duplicate variables, e.g. local/remote checkpoint dirs and syncers. The resume/syncing logic also takes up a lot of space in the trial runner. Saving and restoring experiment state is orthogonal to the actual trial lifecycle logic, thus it makes sense to separate this out. In the same go, I've removed a lot of duplicated state and simplified some APIs that will also make it easier to test the experiment state component separately. Signed-off-by: Kai Fricke <kai@anyscale.com>
An identical error message is returned in multiple cases when something goes wrong while pinging the api/version endpoint. This PR adds more information to the error message in the case where the endpoint returns 404, to help with debugging.
…added to an operator (#32482) This PR ensures that the object store utilization for a bundle is still tracked when it's queued internally by an operator, e.g. MapOperator queueing bundles for the sake of bundling up to a minimum bundle size, or due to workers not yet being ready for dispatch.
[tune/train] Remove duplicated keys in tune/train results:
- timestamp
- result_timestamp defaults to None
- fix test
- fix progress_reporter test
- .get(, None)
- fix test
- fix test_gpu
- WORKER_
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
…y_tasks (#32438) Signed-off-by: rickyyx <rickyx@anyscale.com> We are calculating actor creation task submission time, which is less useful for this test.
Following our tune package restructure (https://github.com/ray-project/ray/pulls?q=is%3Apr+in%3Atitle+%5Btune%2Fstructure%5D), we have now had three releases (2.0-2.3) where we logged a warning. For 2.4, we should raise an error instead. For 2.5, we can remove the old files/packages. Signed-off-by: Kai Fricke <kai@anyscale.com>
…torch code example (#32058) The example under the Ray AI Runtime/Example section directly used native PyTorch datasets for data loading. It's good to clarify that the current approach is for simplicity; the recommended approach is to use Ray Datasets. Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MacBook-Pro.local>
This PR always preserves order for the bulk executor. We may revisit this in the future, at which point we'd update all of the tests that rely on order preservation.
This PR fixes the `Stopper` doctests that are erroring. Previously, it used a `tune.Trainable` as its trainable, which would error on fit since its methods are not implemented. Also, it was missing some imports. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
#32457 refactored the experiment checkpoint management but introduced a bug where state is not correctly restored anymore. This was caught by a unit test error. This PR resolves the bug and makes sure the test passes. Signed-off-by: Kai Fricke <kai@anyscale.com>
Similar to #31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
One source of flakiness in test_dataset.py is timeouts. This splits out the torch tests from this big test file. #32067
This PR is to add logical operator for group-by aggregate. The change includes: * `Aggregate`: the logical operator for aggregate * `generate_aggregate_fn`: the generated function for aggregate operator * `SortAggregateTaskSpec`: the task spec for doing sort-based aggregate, mostly refactored from [_GroupbyOp](https://github.com/ray-project/ray/blob/master/python/ray/data/grouped_dataset.py#L35).
#32486 introduced two test failures after hard-deprecating a structure refactor. This PR fixes these two stale imports. Signed-off-by: Kai Fricke <coding@kaifricke.com>
By default, autosummary only shows one line for each class member instead of the entire docstring. Ideally the fix would be to autosummary class members as well, but that generates too many doc pages and causes the doc build to time out. For now, default to showing the docstrings of class members on the class pages, with an explicit opt-in to autosummary class members. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
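In Sphinx terms, the resulting setup roughly corresponds to a conf.py fragment like the following (a sketch using real Sphinx options, not Ray's exact doc config):

```python
# conf.py (fragment)
extensions = ["sphinx.ext.autodoc", "sphinx.ext.autosummary"]

# Generate stub pages from autosummary directives.
autosummary_generate = True

# Default: render full docstrings for class members on the class page itself,
# instead of autosummarizing every member into its own page (which blows up
# the page count and the doc build time).
autodoc_default_options = {
    "members": True,
    "show-inheritance": True,
}
```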
… output to match docs (#32409) Un-revert #31166. This PR cleans up a few usability issues around Ray clusters: - Makes some cleanups to the ray start log output to match the new documentation on Ray clusters. Mainly, de-emphasize Ray Client and recommend jobs instead. - Add an opt-in flag for enabling multi-node clusters for OSX and Windows. Previously, it was possible to start a multi-node cluster, but then any Ray programs would fail mysteriously after connecting to the cluster. Now, it will warn the user with an error message if the opt-in flag is not set. - Document multi-node support for OSX and Windows. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
…32531) Signed-off-by: amogkam <amogkamsetty@yahoo.com>