forked from ray-project/ray
upstream changes #6
Merged
Signed-off-by: pdmurray <peynmurray@gmail.com>
This PR adds log rotation for Ray Serve, letting it inherit the rotation parameters (max_bytes, backup_count) from Ray Core. This brings a more consistent logging experience to Ray, as opposed to having the serve/ folder grow forever while the other logs rotate.
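As a rough illustration of what rotation with these two parameters looks like (a sketch, not Serve's actual implementation; the function name and format string are hypothetical), Python's standard `RotatingFileHandler` already exposes both knobs:

```python
import logging
from logging.handlers import RotatingFileHandler

def build_serve_file_handler(log_path: str, max_bytes: int,
                             backup_count: int) -> RotatingFileHandler:
    # Roll the file over once it reaches max_bytes, keeping backup_count
    # rotated copies (serve.log, serve.log.1, ..., serve.log.<backup_count>).
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes,
                                  backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler
```

With max_bytes=0 (the stdlib default) the file would never roll over, which is exactly the unbounded-growth behavior this PR removes.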
This PR adds additional information to the driver task event, namely the driver task type and its running/finished timestamps. This allows users (e.g., the dashboard) to inspect driver tasks more easily. This PR also exposes the exclude_driver flag in the state API, allowing requests through HTTP and ListApiOptions to get driver tasks, while the default behavior of the state API is still to exclude them. This PR also filters out any tasks without task_info to prevent missing-data issues.
If a deployment is repeatedly failing, perform exponential backoff so that we do not try to restart the replica at a very fast rate. Closes #31121
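The backoff described here amounts to a capped doubling of the restart delay. A minimal sketch (the function name and constants are illustrative, not the actual Serve controller code):

```python
def restart_backoff_s(num_consecutive_failures: int,
                      base_s: float = 1.0,
                      factor: float = 2.0,
                      max_s: float = 64.0) -> float:
    # Delay before the next restart attempt: 1s, 2s, 4s, ... capped at max_s,
    # so a crash-looping replica is not restarted at a very fast rate.
    return min(base_s * (factor ** num_consecutive_failures), max_s)
```

A successful start would reset num_consecutive_failures to zero, returning the deployment to immediate restarts.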
Co-authored-by: Cade Daniel <edacih@gmail.com> Closes #31880
- Adds back the metrics page
- Adds a button to visit the new dashboard and one to go back
- Adds buttons for leaving feedback and viewing docs
- Adds color to the status badges of the tasks and placement groups tables
- Adds an alert when Grafana is not running
- Fixes the copy button icon
- Separates the metrics page into sections (both new IA and old IA)
…es to specify it (#31959) This PR clarifies where RunConfig can be specified. When multiple configs are specified in different locations (in the Tuner and Trainer), this PR also logs which RunConfig is actually used. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
581cd4e moved some test files, breaking a link from the documentation. cc @iycheng 3343c76 changed the MapBatches string representation, breaking a docstring test. cc @peytondmurray Signed-off-by: Kai Fricke <kai@anyscale.com>
…cution state, and task submitters. (#31986)
…d management (#31979) Before this PR, stalls in the consumer thread would fully block the control loop. This provides backpressure, but at the cost of performance. This PR fully decouples the consumer thread from the control loop thread, allowing execution to proceed so long as there is sufficient object_store_memory budget remaining. It also adds a progress bar for the output queue, showing the number of output bundles consumed and the number of queued bundles for output:
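The decoupling described above can be thought of as a simple admission check on the control loop (names are hypothetical; the real implementation tracks memory through the streaming executor's resource accounting):

```python
def can_schedule_more(inflight_bytes: int,
                      queued_output_bytes: int,
                      object_store_memory_budget: int) -> bool:
    # Execution proceeds as long as the bytes held by running tasks plus the
    # bytes queued for the (possibly stalled) consumer fit in the budget.
    # A slow consumer lets queued_output_bytes grow, eventually exhausting the
    # budget and applying backpressure without blocking the control loop.
    return inflight_bytes + queued_output_bytes < object_store_memory_budget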
…1993) Remove legacy memory monitor from worker submission code path, as that was already disabled by default in Ray 2.2
The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections? _Originally posted by @ericl in #27573 (review) Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes. Related issue number Addresses #31741
…tring (#31840) Signed-off-by: rickyyx <rickyx@anyscale.com> This PR introduces a flag, RAY_task_events_send_batch_size, that controls the number of task events sent to GCS in a batch. With the default setting, each core worker sends 10K task events per second to GCS, where GCS can handle 10K task events in ~50 milliseconds. This PR also adjusts the worker-side buffer limit to 1M with the new batching setting. The PR adds some debug information as well.
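The batching behavior amounts to chunking the worker-side buffer before each send. A sketch (the real sender also handles retries and buffer eviction, which are omitted here):

```python
from typing import Iterable, List

def batch_events(events: Iterable[dict], batch_size: int) -> List[List[dict]]:
    # Group buffered task events into batches of at most batch_size before
    # sending each batch to GCS, instead of one RPC per event.
    batches, current = [], []
    for ev in events:
        current.append(ev)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush the final partial batch
    return batches
```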
…e_node` release test (#31904) The release test read_parquet_benchmark_single_node fails, due to using Python 3.7 and not having the pickle5 package installed. A similar issue is discussed in #26225. We found that the test failure is contained to the portion which tests a Dataset with a filter expression (the error is related to pickling with this filter expression). Therefore, we will temporarily disable this portion of the test, while keeping the rest of the release test (which I verified passes on the same cluster). We can come back to this in the future and fix the case with filter. Example of release test successfully running with the filter case removed. Signed-off-by: Scott Lee <sjl@anyscale.com>
…imizer (#31985) Signed-off-by: amogkam <amogkamsetty@yahoo.com> The following operations call map_batches directly: add_column, drop_columns, select_columns, random_sample. In this PR we add e2e tests for these examples with the new optimizer. In a future PR, we should refactor so that these operations do not call into map_batches and instead have their own logical operator.
…`MapOperator` actor pool. (#31987) This PR adds support for autoscaling to the actor pool implementation of `MapOperator` (this PR is stacked on top of #31986). The same autoscaling policy as the legacy `ActorPoolStrategy` is maintained, as well as providing more aggressive and sensible downscaling via: * If there are more idle actors than running/pending actors, scale down. * Once we're done submitting tasks, cancel pending actors and kill idle actors. In addition to autoscaling, `max_tasks_in_flight` capping is also implemented.
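The downscaling policy described above is simple enough to state directly (a hypothetical helper, not the actual `MapOperator` code):

```python
def should_scale_down(num_idle: int, num_running: int, num_pending: int,
                      done_submitting: bool) -> bool:
    # Scale down when idle actors outnumber running/pending ones, or when all
    # tasks have been submitted and any actor is sitting idle anyway.
    if done_submitting and num_idle > 0:
        return True
    return num_idle > num_running + num_pending
```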
<img width="1731" alt="Screen Shot 2023-01-24 at 1 01 25 AM" src="https://user-images.githubusercontent.com/18510752/214250430-9bac7b06-56fb-44b3-a044-3eaf726d1469.png"> This PR adds the cluster utilization page in the landing view Co-authored-by: Alan Guo <aguo@anyscale.com>
This PR adds a logical operator for randomize_block_order(). The change includes:
- Introduce AbstractAllToAll for all logical operators converted to AllToAllOperator
- RandomizeBlocks logical operator for randomize_block_order()
- _internal/planner to move logic for Planner here and have a generated function for randomize_blocks. This can be used later to create MapOperator/AllToAllOperator.
Add code owner to GCS module.
Signed-off-by: Cheng Su <scnju13@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com> This PR adds the timeline to the Ray dashboard using the new task backend. It implements the task events -> chrome tracing conversion logic; most of the code is copied from existing code. TODO: add unit tests (although we already have one, it is a pretty weak test). It creates a timeline endpoint that can 1. download the JSON file (to download and upload manually) and 2. return the JSON array buffer (to load onto Perfetto directly). It also creates a subsection with three features: 1. a download button, 2. an open-Perfetto button, 3. an instruction accordion.
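The task-event-to-chrome-tracing conversion mentioned here boils down to emitting Trace Event Format records. A minimal sketch of one such record (the field mapping is illustrative, not the dashboard's exact schema):

```python
def to_chrome_trace_event(task_name: str, start_ms: float, end_ms: float,
                          node_id: str, worker_id: int) -> dict:
    # A 'complete' event (ph='X') carries both start and duration, so each
    # finished task becomes a single bar in Perfetto / chrome://tracing.
    return {
        "name": task_name,
        "ph": "X",
        "ts": start_ms * 1000.0,             # Trace Event Format uses microseconds
        "dur": (end_ms - start_ms) * 1000.0,
        "pid": node_id,                      # group bars by node
        "tid": worker_id,                    # one lane per worker
    }
```

A list of these dicts serialized as a JSON array is directly loadable by Perfetto, which is what the endpoint's array-buffer mode returns.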
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
… (where it's simple to add). (#32475)
In #28149, RayActorError is constructed with a str as cause, but this is not an accepted type. This leads to hitting the assertion in the else case, assert isinstance(cause, ActorDiedErrorContext), on L283.
Signed-off-by: Pratik <pratikrajput1199@gmail.com>
…ainers (#32471) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…rate file (#32457) Experiment state management is currently convoluted. We keep track of many duplicate variables, e.g. local/remote checkpoint dirs and syncers. The resume/syncing logic also takes up a lot of space in the trial runner. Saving and restoring experiment state is orthogonal to the actual trial lifecycle logic, thus it makes sense to separate this out. In the same go, I've removed a lot of duplicated state and simplified some APIs that will also make it easier to test the experiment state component separately. Signed-off-by: Kai Fricke <kai@anyscale.com>
An identical error message is returned in multiple cases when something goes wrong while pinging the api/version endpoint. This PR adds more information to the error message in the case where the endpoint returns 404, to help with debugging.
…added to an operator (#32482) This PR ensures that the object store utilization for a bundle is still tracked when it's queued internally by an operator, e.g. MapOperator queueing bundles for the sake of bundling up to a minimum bundle size, or due to workers not yet being ready for dispatch.
[tune/train] Remove duplicated keys in tune/train results:
- timestamp
- result_timestamp defaults to None
- fix test
- fix progress_reporter test
- .get(, None)
- fix test
- fix test_gpu
- WORKER_
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
…y_tasks (#32438) Signed-off-by: rickyyx <rickyx@anyscale.com> We are calculating actor creation task submission time, which is less useful for this test.
Following our tune package restructure (https://github.com/ray-project/ray/pulls?q=is%3Apr+in%3Atitle+%5Btune%2Fstructure%5D), we have now had three releases (2.0-2.3) where we logged a warning. For 2.4, we should raise an error instead. For 2.5, we can remove the old files/packages. Signed-off-by: Kai Fricke <kai@anyscale.com>
…torch code example (#32058) The example under the Ray AI Runtime/Example section directly used native PyTorch datasets for data loading. It's good to clarify that the current approach is for simplicity; the recommended approach is to use Ray Datasets. Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MacBook-Pro.local>
This PR always preserves order for the bulk executor. We may revisit this in the future, at which point we'd update all of the tests that rely on order preservation.
This PR fixes the `Stopper` doctests that are erroring. Previously, it used a `tune.Trainable` as its trainable, which would error on fit since its methods are not implemented. Also, it was missing some imports. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
#32457 refactored the experiment checkpoint management but introduced a bug where state is not correctly restored anymore. This was caught by a unit test error. This PR resolves the bug and makes sure the test passes. Signed-off-by: Kai Fricke <kai@anyscale.com>
Similar to #31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
One source of flakiness in test_dataset.py is timeouts. This splits out the torch tests from this big test file. #32067
This PR is to add logical operator for group-by aggregate. The change includes: * `Aggregate`: the logical operator for aggregate * `generate_aggregate_fn`: the generated function for aggregate operator * `SortAggregateTaskSpec`: the task spec for doing sort-based aggregate, mostly refactored from [_GroupbyOp](https://github.com/ray-project/ray/blob/master/python/ray/data/grouped_dataset.py#L35).
#32486 introduced two test failures after hard-deprecating a structure refactor. This PR fixes these two stale imports. Signed-off-by: Kai Fricke <coding@kaifricke.com>
By default, autosummary only shows one line for each class member instead of the entire docstring. Ideally the fix would be to autosummary class members as well, but that generates too many doc pages and causes the doc build to time out. For now, default to showing the docstrings of class members on the class pages, with an explicit opt-in to autosummary class members. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
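In Sphinx terms, the resulting setup roughly corresponds to a conf.py fragment like the following (a sketch using real Sphinx options, not Ray's exact doc config):

```python
# conf.py (fragment)
extensions = ["sphinx.ext.autodoc", "sphinx.ext.autosummary"]

# Generate stub pages from autosummary directives.
autosummary_generate = True

# Default: render full docstrings for class members on the class page itself,
# instead of autosummarizing every member into its own page (which blows up
# the page count and the doc build time).
autodoc_default_options = {
    "members": True,
    "show-inheritance": True,
}
```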
… output to match docs (#32409) Un-revert #31166. This PR cleans up a few usability issues around Ray clusters: - Makes some cleanups to the ray start log output to match the new documentation on Ray clusters. Mainly, de-emphasize Ray Client and recommend jobs instead. - Add an opt-in flag for enabling multi-node clusters for OSX and Windows. Previously, it was possible to start a multi-node cluster, but then any Ray programs would fail mysteriously after connecting to the cluster. Now, it will warn the user with an error message if the opt-in flag is not set. - Document multi-node support for OSX and Windows. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
…32531) Signed-off-by: amogkam <amogkamsetty@yahoo.com>