upstream changes #6

Signed-off-by: pdmurray <peynmurray@gmail.com>

This PR adds log rotation for Ray Serve, letting it inherit rotation parameters (max_bytes, backup_count) from Ray Core. This brings a more consistent logging experience to Ray, as opposed to having the serve/ folder grow forever while the other logs rotate.
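A minimal sketch of size-based rotation, using Python's standard `logging.handlers.RotatingFileHandler`. The arguments mirror the `max_bytes` and `backup_count` parameters mentioned above. The helper name `make_rotating_logger`, the logger name, and the file path are hypothetical, and this is not Ray Serve's actual implementation.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_rotating_logger(path, max_bytes, backup_count):
    """Return a logger whose file rolls over before it would exceed max_bytes.

    Hypothetical helper for illustration only.
    """
    logger = logging.getLogger("rotation_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "serve.log")
demo_logger = make_rotating_logger(log_path, max_bytes=200, backup_count=2)
for i in range(100):
    demo_logger.info("request %d handled", i)

# Only backup_count rotated files are kept next to the active log file.
rotated_files = sorted(f for f in os.listdir(log_dir) if f.startswith("serve.log."))
```

With rotation in place the directory holds a bounded number of files, regardless of how many messages are written.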

This PR adds additional information to the driver task event, namely the driver task type and its running/finished timestamps. This allows users (e.g. the dashboard) to inspect driver tasks more easily. This PR also exposes the exclude_driver flag to the state API, allowing requests through https and ListApiOptions to get driver tasks, while the default behaviour of the state API still excludes them. This PR also filters out any tasks w/o task_info to prevent missing-data issues.

If a deployment is repeatedly failing, perform exponential backoff so as not to restart the replica repeatedly at a very fast rate. Related issue number: Closes #31121
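As an illustration of the backoff idea, here is a small sketch. The function name `backoff_delay` and all constants (base delay, growth factor, cap) are assumptions chosen for the example, not the values Ray Serve uses.

```python
def backoff_delay(num_failures, base_s=1.0, factor=2.0, max_s=64.0):
    """Seconds to wait before the next restart attempt.

    All defaults are assumed values for illustration.
    """
    if num_failures <= 0:
        return 0.0  # no failures yet: restart immediately
    # The delay grows geometrically with consecutive failures, up to a cap.
    return min(max_s, base_s * factor ** (num_failures - 1))


delays = [backoff_delay(n) for n in range(9)]
```

The cap keeps a long-failing deployment from waiting arbitrarily long between attempts.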

Co-authored-by: Cade Daniel <edacih@gmail.com> Closes #31880

* Adds back the metrics page
* Adds button to visit new dashboard and to go back
* Adds buttons for leaving feedback and viewing docs
* Add color to status badges of tasks and placement groups table
* Add alert when grafana is not running
* Fix copy button icon
* Separate metrics page into sections (both new IA and old IA)

…es to specify it (#31959) This PR clarifies where RunConfig can be specified. Also, when multiple configs are specified in different locations (in the Tuner and Trainer), this PR also logs information about which RunConfig is actually used. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

@peytondmurray

581cd4e moved some test files, breaking a link from the documentation. cc @iycheng 3343c76 changed the MapBatches string representation, breaking a docstring test. cc @peytondmurray Signed-off-by: Kai Fricke <kai@anyscale.com>

…cution state, and task submitters. (#31986)

…d management (#31979) Before this PR, stalls in the consumer thread would fully block the control loop. This provides backpressure, but at the cost of performance. This PR fully decouples the consumer thread from the control loop thread, allowing execution to proceed so long as there is sufficient object_store_memory budget remaining. It also adds a progress bar for the output queue, showing the number of output bundles consumed and the number of queued bundles for output:

…2010) #31669 changed the `Trial.__dict__` by moving `local_dir` to `_local_dir`, which resulted in an error in our tune cloud tests. This PR updates the signature of the `TrialStub` class to resolve the issue. Signed-off-by: Kai Fricke <kai@anyscale.com>

…1993) Remove legacy memory monitor from worker submission code path, as that was already disabled by default in Ray 2.2

@ericl

The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections? _Originally posted by @ericl in #27573 (review) Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>

The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes. Related issue number Addresses #31741

…tring (#31840) Signed-off-by: rickyyx <rickyx@anyscale.com> This PR introduces a flag RAY_task_events_send_batch_size that controls the number of task events sent to GCS in a batch. With the default setting, each core worker will send 10K task events per second to GCS, where GCS could handle 10K task events in ~50 milliseconds. This PR also adjusts the worker-side buffer limit to 1M with the new batching setting. The PR adds some debug information as well.

…e_node` release test (#31904) The release test read_parquet_benchmark_single_node fails, due to using Python 3.7 and not having the pickle5 package installed. A similar issue is discussed in #26225. We found that the test failure is contained to the portion which tests a Dataset with a filter expression (the error is related to pickling with this filter expression). Therefore, we will temporarily disable this portion of the test, while keeping the rest of the release test (which I verified passes on the same cluster). We can come back to this in the future and fix the case with filter. Example of release test successfully running with the filter case removed. Signed-off-by: Scott Lee <sjl@anyscale.com>

…imizer (#31985) Signed-off-by: amogkam <amogkamsetty@yahoo.com> The following operations call map_batches directly: add_column, drop_columns, select_columns, random_sample. In this PR we add e2e tests for these examples with the new optimizer. In a future PR, we should refactor so that these operations do not call into map_batches and instead have their own logical operator.

…info on failure (#32014) It appears the root cause of flaky failures described in #31981 is suppressed because we're not logging exceptions in `exponential_backoff_retry`. Signed-off-by: Cade Daniel <cade@anyscale.com>

… node manager. (#31917)" (#31995) This reverts commit a32b9b1.

…`MapOperator` actor pool. (#31987) This PR adds support for autoscaling to the actor pool implementation of `MapOperator` (this PR is stacked on top of #31986). The same autoscaling policy as the legacy `ActorPoolStrategy` is maintained, as well as providing more aggressive and sensible downscaling via: * If there are more idle actors than running/pending actors, scale down. * Once we're done submitting tasks, cancel pending actors and kill idle actors. In addition to autoscaling, `max_tasks_in_flight` capping is also implemented.
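The downscaling rules and the in-flight cap above can be sketched as follows. This is an illustration under assumptions: "running/pending" is read as the sum of both counts, one idle actor is released per scaling step, and the names `PoolState`, `actors_to_release`, and `can_submit` are hypothetical rather than taken from the `MapOperator` code.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    idle: int     # actors alive with no task in flight
    running: int  # actors currently executing tasks
    pending: int  # actors requested but not yet started


def should_scale_down(pool: PoolState) -> bool:
    # Assumed reading: more idle actors than running and pending combined.
    return pool.idle > pool.running + pool.pending


def actors_to_release(pool: PoolState, done_submitting: bool) -> dict:
    if done_submitting:
        # No more work will arrive: cancel every pending actor, kill every idle one.
        return {"cancel_pending": pool.pending, "kill_idle": pool.idle}
    if should_scale_down(pool):
        # Assumption: shrink gradually, one idle actor per step.
        return {"cancel_pending": 0, "kill_idle": 1}
    return {"cancel_pending": 0, "kill_idle": 0}


def can_submit(tasks_in_flight: int, max_tasks_in_flight: int) -> bool:
    # Cap the number of tasks queued on a single actor.
    return tasks_in_flight < max_tasks_in_flight
```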

<img width="1731" alt="Screen Shot 2023-01-24 at 1 01 25 AM" src="https://user-images.githubusercontent.com/18510752/214250430-9bac7b06-56fb-44b3-a044-3eaf726d1469.png"> This PR adds the cluster utilization page in the landing view Co-authored-by: Alan Guo <aguo@anyscale.com>

This PR adds a logical operator for randomize_block_order(). The change includes:
* Introduce AbstractAllToAll for all logical operators converted to AllToAllOperator
* RandomizeBlocks logical operator for randomize_block_order()
* _internal/planner to move logic for Planner here and have generated function for randomize_blocks. This can be used later to create MapOperator/AllToAllOperator.

Add code owner to GCS module.

Signed-off-by: Cheng Su <scnju13@gmail.com>

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>

Signed-off-by: SangBin Cho <rkooo567@gmail.com> This PR implements the timeline in the Ray dashboard using the new task backend.
* Implement the task events -> chrome tracing logic. Most of the code is copied from existing code. TODO: add unit tests (although we already have one, it is a pretty weak test).
* Create a timeline endpoint that can (1) download the json file (to download & upload manually) and (2) return the json array buffer (to load onto perfetto directly).
* Create a subsection that has 3 features: (1) Download button, (2) Open perfetto button, (3) Instruction accordion.
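To illustrate the task events -> chrome tracing step, here is a rough sketch that emits "complete" (`"ph": "X"`) events in the Chrome Trace Event format. The input field names (`name`, `start_ms`, `end_ms`, `node`, `worker`) are hypothetical and do not reflect Ray's actual task event schema.

```python
import json


def task_events_to_trace(task_events):
    """Convert simplified task events into Chrome trace 'complete' events."""
    trace = []
    for ev in task_events:
        trace.append({
            "name": ev["name"],
            "cat": "task",
            "ph": "X",  # complete event: carries a start time and a duration
            "ts": ev["start_ms"] * 1000,  # trace timestamps are in microseconds
            "dur": (ev["end_ms"] - ev["start_ms"]) * 1000,
            "pid": ev["node"],    # one "process" row per node (assumed mapping)
            "tid": ev["worker"],  # one "thread" row per worker (assumed mapping)
        })
    return trace


events = [{"name": "f", "start_ms": 10, "end_ms": 25, "node": 1, "worker": 7}]
trace_json = json.dumps(task_events_to_trace(events))
```

The resulting JSON array is the kind of payload a trace viewer can load directly.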

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

…d debugstring (#31840)" (#32024) This reverts commit 5d1f2e4.

This PR adds seealso notes to help users distinguish between map, flat_map, and map_batches. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Eric Liang <ekhliang@gmail.com>

Without this patch, several of the help texts are missing whitespace. For example, `--dashboard-host` appears as follows: --dashboard-host TEXT the host to bind the dashboard server to, either localhost (127.0.0.1) or 0.0.0.0 (available from all interfaces). By default, thisis localhost. This patch adds the correct trailing whitespace so there are spaces. Signed-off-by: Luke Hsiao <luke.hsiao@numbersstation.ai>

…igurable. (#31960)

…rs (#31991) * trying out a new configuration pattern for trainer runner and rl trainers Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

#32036)

… a MultiAgent env. (#31480)

…2002) This PR adds a DrainAndKillNode endpoint to the monitor service. It has the exact same semantics as the GcsNodeManager::HandleDrainNode. --------- Co-authored-by: Alex <alex@anyscale.com>

Checkpointable actor is already removed in #10333

…moved by node manager."" (#32019) This reverts commit 51c5eda. Reverts #31995 Skip the windows test. Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

Mixins don't work well with reuse_actors because the init is only called on construction. In the case of mlflow, this means that reused actors will try to overwrite state from the trials that previously ran on them. This is incorrect behavior and errors on the mlflow server side. Thus, we should default to not reuse actors for mixins. Signed-off-by: Kai Fricke <kai@anyscale.com>

Signed-off-by: Eric Liang <ekhliang@gmail.com>

Fixes: - Properly wire max tasks per actor to pool - Account for internal queue size in scheduling algorithm - Small improvements to progress bar UX

#31337 has become flaky again due to a low timeout. This PR follows #31338 and increases the timeout.

We currently have no canonical way to await actors. Users can define their own _is-ready_ methods, schedule a future, and await these, but this has to be done for every actor class separately. This does not match other patterns - e.g. we have `actor.__ray_terminate__.remote()` for actor termination and `placement_group.ready()` for placement group ready futures. This PR adds a new `__ray_ready__` magic actor method that just returns `True`. It can be used to await actors becoming ready (newly scheduled actors), and actors having processed all of their other enqueued tasks. Signed-off-by: Kai Fricke <kai@anyscale.com>

The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, #31945 and #32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.

Reduce the timeout for many nodes actor test given that a test should finish within 1h. It can save some cost for problematic runs.

This PR is a quick fix to remove the non-useful comment introduced in #31526, probably during debugging. Replace the comment with a meaningful one.

Signed-off-by: Eric Liang <ekhliang@gmail.com> Combine tasks and actors sections Move object store memory back up to the logical section (it's one of the most useful metrics, it shouldn't be buried) Improve titles

Currently, the dropdown menu "Resources" in the Ray documentation contains a link called "Training." This link points to the [same site](https://www.anyscale.com/events) as "Events." However, we want this to direct to the repository of [technical training content](https://github.com/ray-project/ray-educational-materials). Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

…ult tolerance (#31949) This PR adds the documentation and sample config files for deploying Ray to K8S without using KubeRay. As KubeRay CRDs need cluster-scoped permissions, this PR helps those users who do not have cluster-scoped permissions to install Ray Cluster in their K8S.

Signed-off-by: Alan Guo <aguo@anyscale.com>

This progress bar automatically shows progress by groupings. Things that belong to the same parent are all put in a group. If a group has multiple children with the same name, those are merged together into a virtual group. These virtual groups have different visual treatment because a virtual group should not add an additional level of nesting.
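The grouping rule can be sketched roughly as below. The task record shape (`id`, `name`, `parent`) and the function name `group_children` are assumptions for illustration, not the dashboard's actual data model.

```python
from collections import defaultdict


def group_children(tasks):
    """Group tasks by parent, merging same-name siblings into a virtual group.

    tasks: list of dicts with 'id', 'name', and 'parent' (None for roots).
    """
    by_parent = defaultdict(lambda: defaultdict(list))
    for task in tasks:
        by_parent[task["parent"]][task["name"]].append(task["id"])

    groups = {}
    for parent, by_name in by_parent.items():
        groups[parent] = [
            # A group is "virtual" when it merges several same-name children.
            {"name": name, "ids": ids, "virtual": len(ids) > 1}
            for name, ids in by_name.items()
        ]
    return groups


tasks = [
    {"id": 1, "name": "train", "parent": None},
    {"id": 2, "name": "map", "parent": 1},
    {"id": 3, "name": "map", "parent": 1},
    {"id": 4, "name": "reduce", "parent": 1},
]
groups = group_children(tasks)
```

A renderer can use the `virtual` flag to draw merged groups without adding a nesting level.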

… execute commands on databricks notebook for a long time (#31962) Databricks Runtime provides an API, dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution(), that returns the elapsed milliseconds since the last Databricks notebook code execution. This PR calls this interface to monitor notebook activity and shut down the Ray cluster on timeout. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

…` and `ArrowVariableShapedTensorArray` (#31817) Add support for creating ArrowTensorArrays and ArrowVariableShapedTensorArrays with string typed columns. Signed-off-by: Scott Lee <sjl@anyscale.com>

…apes` (but `reduce_retracing` instead). (#29214) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

…e w/ pettingzoo. (#31820)

… foundation's gymnasium" (from "OpenAI gym"). (#32061)

…32101) This PR is to fix master with resolving the conflict between #32080 and #32081, i.e. - Pass TaskContext in random_shuffle.py:generate_random_shuffle_fn() - Add AllToAllTransformFn and rename TransformFn to MapTransformFn - Update the function return type in generate_map_xxx_fn(). Signed-off-by: Cheng Su <scnju13@gmail.com>

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [air] Add test for remote_storage with real hdfs backend.
* typo
* typo
* try a different syntax.
* change `install-hdfs.sh` permission.
* -hdfs in air tests. update ssh-keygen command. fix a few typos.
* test_env=
* cat hdfs_env
* move `PATH` as well to a separate file.
* setting env vars in test only.
* fix import
* fix
* address comments.
* nit
* fix fixture
* address comments
* address comments

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

…xfail (#32072) * Marking RLLib release tests as unstable if xfail

This PR adds logical operator for `repartition()`. Only implement shuffle repartition (`repartition.py:generate_repartition_fn()`). Non-shuffle repartition is left as TODO, as the corresponding code in [fast_repartition.py](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/fast_repartition.py) involves `BlockList`, `ExecutionPlan` and `Dataset.split()`, so it needs a deeper refactoring and code change.

This PR exposes the MultiGet operation to the InternalKVInterface. The MultiGet operation is already supported in the two backends (InMemory and Redis), so this PR is just plumbing. This change is needed to support getting multiple keys from the Internal KV in a single RPC.

…sorArray` and `ArrowVariableShapedTensorArray` (#31817)" (#32123) This reverts commit 1fdf24e.

This adds an option to the AIR DatasetConfig for a preprocessor that gets reapplied on each epoch. Currently the implementation uses DatasetPipeline to ensure that the extra preprocessing step is overlapped with training. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

…scaling (#32085) The previous way pending_nodes was calculated was prone to race conditions, instead, let's just always publish it in the main thread with other metrics. Closes #31982 --------- Co-authored-by: Alex <alex@anyscale.com>

…r lookup (#32087) In #30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This led to a performance regression first identified in #31337. After manual profiling, the regression seems to come from `RayTrialExecutor._count_staged_resources`. We have 1000 staged trials, and this function is called on every step, executing a linear scan through all trials. This PR fixes this performance bottleneck by keeping the resource counter as state instead of dynamically recreating it every time. This is simple, as we can just add/subtract the resources whenever we add/remove from the `RayTrialExecutor._staged_trials` set. Manual testing confirmed this improves the runtime of `tune_scalability_result_throughput_cluster` from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor. Signed-off-by: Kai Fricke <kai@anyscale.com>
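A simplified sketch of the incremental-counter idea follows. The class `StagedTrials` and its methods are hypothetical stand-ins, not the real `RayTrialExecutor` code; the point is only that lookups no longer scan every staged trial.

```python
from collections import Counter


class StagedTrials:
    """Tracks staged trials and keeps per-resource counts up to date."""

    def __init__(self):
        self._staged = {}         # trial_id -> hashable resource key
        self._counts = Counter()  # resource key -> number of staged trials

    @staticmethod
    def _key(resources):
        # Dicts are not hashable, so use a sorted tuple of items as the key.
        return tuple(sorted(resources.items()))

    def add(self, trial_id, resources):
        if trial_id in self._staged:
            return
        key = self._key(resources)
        self._staged[trial_id] = key
        self._counts[key] += 1  # O(1) update instead of a rescan

    def remove(self, trial_id):
        key = self._staged.pop(trial_id, None)
        if key is not None:
            self._counts[key] -= 1
            if self._counts[key] == 0:
                del self._counts[key]

    def count(self, resources):
        # O(1) lookup; a missing key counts as zero.
        return self._counts[self._key(resources)]


staged = StagedTrials()
for i in range(1000):
    staged.add(i, {"CPU": 1})
staged.add("gpu-trial", {"CPU": 1, "GPU": 1})
staged.remove(0)
```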

…RLTrainers (#31991)" (#32130) Reverts #31991 This PR seems to have broken CI. Screenshot 2023-01-31 at 1 39 09 PM The error is https://buildkite.com/ray-project/oss-ci-build-branch/builds/2099#01860972-e02e-47c4-8f86-8be28ea18d92/3786-3992 AttributeError: '_TFStub' object has no attribute 'Tensor'

. So instead of averaging out, we should do sum(gpu_utilization) / sum(num_gpus) to cap the max percentage to 100%.

With the recent update of the nightly tests, update the data here. In the nightly tests, we use 2k nodes (2 CPUs per node) and 20k actors, but if better nodes are used, we can run more than 40k actors. https://buildkite.com/ray-project/release-tests-branch/builds/1321#018604d7-86a3-4fad-ac6c-803db73821d3

Signed-off-by: Alan Guo <aguo@anyscale.com> fix lint #31750

…on planner. (#32095) This PR adds operation fusion to the new execution planner.

* Remove empty parser.add_argument() in test file * remove --framework=torch * fix BUILD * use training_iteration as stopping criterion Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [release] minor fix to pytorch_pbt_failure test when using gpu. (#32070) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Datasets depends on ray.air for several key features (tensor extensions, Arrow transformations, data batch conversions), and not running the Datasets test suite in PR builds on ray.air changes has caused breaks to go undetected. This PR changes this so when files under python/ray/air change, we trigger the Datasets test suite in CI. Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>

…n` for remote URI (#32110) At least two users reported encountering ImportError( "You must `pip install smart_open` and " "`pip install boto3` to fetch URIs in s3 " "bucket. " and trying to fix it by specifying them in the pip field of runtime_env, which won't work because the runtime_env setup code doesn't run inside the runtime_env. This PR clarifies the error message to say that they must be preinstalled on the cluster, and adds a note to the docs.

…32126)

…d. (#31664) Signed-off-by: SangBin Cho <rkooo567@gmail.com>

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Why are these changes needed? Fail the task if it is the last task of the group, per the new (group by parent) worker killing policy Related issue number #32149 32078 Co-authored-by: Clarence Ng <clarence@anyscale.com>

…me/ray to Ray Dockerfile." #32026 Signed-off-by: kaihsun <kaihsun@anyscale.com>

#31956 Upgrade to a version of gRPC that fixes GHSA-cfmr-vrgj-vqwv in Zlib. 1.46.6 has this patch: grpc/grpc#31845

Why are these changes needed? Add a new protobuf for JobInfo from the Ray Job API Augment the existing GCS GetAllJobInfo endpoint to return this information, if available (not all GCS jobs were submitted via the Ray Job API; these jobs won't have this extra JobInfo.) Related issue number Closes #29621

…tring (remerging #31840) (#32057) Remerging #31840

This is the initial prototype of integrating ray status into the frontend. I think we could've returned structured data from the backend, but I decided to parse the ray status output in the frontend for quick implementation (so that we can support it from Ray 2.3).

) Signed-off-by: SangBin Cho <rkooo567@gmail.com> ## Why are these changes needed? This PR unpins the version of open telemetry as it is too strict for an experimental tracing feature. ## Related issue number Closes #32051

* Revert "Revert "[RLlib] Reparameterize the construction of TrainerRunner and RLTrainers (#31991)" (#32130)" This reverts commit d15ccfc. * added bool evaluation to tf stub so that if tf returns false Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

This PR takes over #28179

Why are these changes needed? Today, with the default scheduling policy, Ray tries to pack tasks onto nodes until the resource utilization is beyond a certain threshold, and spreads tasks afterwards. This slows down scheduling for embarrassingly parallel jobs:

* We only move on to another node once the current node's resources are sufficiently utilized.
* For each node, the overhead of accepting a new job and starting new workers is not negligible.
* The overall scheduling speed doesn't scale with the number of nodes.

This PR is one proposal to address the problem: instead of sticking to one node, we randomly choose one node from the top-k nodes for the default scheduling, where the nodes are sorted by their resource utilization in reverse order. Intuitively, this allows us to kick off worker startup on multiple nodes in parallel with the scheduling.

Benchmark result:

* baseline: 10 parallelism, top 1, 25 tasks/second
* 10 parallelism, top 6, 30 tasks/second
* 64 parallelism, top 6, 126
* 100 parallelism, top 6, 150
* 1000 parallelism, top 6, 374.8676886257549
* 10 concurrent, top 12, 176
* 64 concurrent, top 12, 182.59477988042443 tasks/s
* 128 concurrent, top 12, 245.9862948998163
* 256 concurrent, top 12, 298…
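The top-k selection described above can be sketched as follows. The sketch assumes the caller has already ranked the nodes best-first according to the scheduling policy; the function name `pick_from_top_k` and the node ids are hypothetical, and the real scheduler lives in Ray's C++ core.

```python
import random


def pick_from_top_k(ranked_nodes, k, rng=random):
    """Pick a node uniformly at random among the k best-ranked candidates.

    ranked_nodes: node ids ordered best-first (ranking is the caller's job).
    """
    if not ranked_nodes:
        raise ValueError("no feasible nodes")
    top_k = ranked_nodes[:max(1, k)]
    return rng.choice(top_k)


ranked = ["node-a", "node-b", "node-c", "node-d", "node-e"]
rng = random.Random(0)  # seeded for a reproducible demo

# k=1 always picks the single best node (the original behaviour).
always_first = {pick_from_top_k(ranked, 1, rng) for _ in range(50)}
# k=3 spreads consecutive picks across the three best nodes.
chosen = {pick_from_top_k(ranked, 3, rng) for _ in range(50)}
```

Spreading picks across several nodes lets worker startup begin on those nodes in parallel instead of one node at a time.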

This PR adds an actor detail page. In addition to the detail page, it also:
* Adds pg id to task/actor
* Adds profiling links to job detail & job row & actor detail

This PR is to add logical operator for `sort()`, the change includes: * `Sort` logical operator * `SortTaskSpec` to copy from `sort.py` * `generate_sort_fn` is generated function for sort

Signed-off-by: Simran Mhatre <simran@anyscale.com>

The test failed asan because some data is not cleaned when it exits. Increase the threshold to mitigate it. Tested locally and for 500 runs, only 3 failed.

Right now we show an Actor error if the actor is killed due to OOM. This PR changes it so it surfaces an OOM error. It does not support actor / actor task OOM retry, as the goal of this PR is to improve observability by setting the death cause of the actor to OOM. Related issue number #29736 Signed-off-by: Aviv Haber <aviv@anyscale.com> Signed-off-by: Clarence Ng <clarence@anyscale.com>

…kpoint (#31957) Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

…inable` (#32059) * Add trainable and deprecate overwrite_trainable Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

…2145)" (#32165) This reverts commit 12d7d7d.

…ver dies (#32127) Signed-off-by: Clarence Ng <clarence.wyng@gmail.com> Infeasible requests are not cleaned up when the driver exits. This PR cleans up infeasible requests created by the driver when it exits. It does not apply to worker exit (follow-up), nor to infeasible tasks submitted to a different raylet (follow-up).

Signed-off-by: SangBin Cho <rkooo567@gmail.com> Add job id to the task state API call. This helps us avoid including tasks from other jobs (improving the experience when we have 10K+ tasks in the cluster). Add resource requirement to the pg table.

… task id for parent's task id in state API (#32157) Right now, if a new thread (or an async actor's event-loop thread) runs some Ray code (e.g. submitting a task, calling runtime context), the thread will have a WorkerThreadContext that has a random task id. This causes issues in the state API since the task tree will have the wrong structure, i.e. some tasks might have a parent_task_id that doesn't match any existing task. For a normal single-threaded task/actor, we will use the main thread's task id (correct behavior). For unusual cases (threaded/async actors), we will use the actor creation task's task id. This means that in the advanced visualization, all the remote tasks created from actor tasks will be under the constructor of threaded/async actors.

…lingPolicy (#32016) This PR changes usages of the `node:<ip>` custom resource as determined by querying [file:(air|tune|train).*\.py node:](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/ray-project/ray%24+file:%28air%7Ctune%7Ctrain%29.*%5C.py+node:). This is being used for: - Collocating tasks (`_force_on_current_node`). - Syncing files to specific IP addresses. - Syncing files to _all_ other nodes. Signed-off-by: Matthew Deng <matt@anyscale.com>

In #31933 we fix an Atari ROM dependency that by default uses a torrent to download ROMs. The tests in this PR also break occasionally due to the same reason. I moved the ROM dependency to S3 to increase reliability. I actually think we can remove the ROM dependency from these app configs since I don't see any RL test using them. But I think that is too much risk for this PR, since it will likely end up as a cherry pick to 2.3.

…of nodes in the cluster (#31934) Why are these changes needed? This PR takes over #26373. Currently, the initial scheduling delay for a simple f.remote() loop is approximately worker startup time (~1s) * number of nodes. There are three reasons for this:
1. Drivers do not share physical worker processes, so each raylet must start new worker processes when a new driver starts. Each raylet starts the workers when the driver first sends a lease (resource) request to that raylet.
2. The scheduling policy from #14790 prefers to pack tasks on fewer nodes up to 50% CPU utilization before spreading tasks for load-balancing.
3. The maximum number of concurrent lease requests is 10, meaning that the driver must wait for workers to start on the first 10 nodes that it contacts before sending lease requests to the next set of nodes.

Because of (2), the first 10 nodes contacted are usually not unique, especially when each node has many cores. This PR changes (3), allowing us to dynamically adjust max_pending_lease_requests based on the number of nodes in the cluster. Without this PR, the top-k scheduling algorithm is bottlenecked by the speed of sending lease requests across the cluster.

This PR fixes the filter logic so that it always uses `yield` instead of `return`. Otherwise it just reads the first block and exits. Add a unit test, and verify that the unit test fails before this PR. Also change all map-like functions to reuse the same output buffer.
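The class of bug is easy to reproduce in isolation. The two functions below are simplified stand-ins (blocks are plain lists and the names are hypothetical), not the actual Datasets code.

```python
def filter_blocks_buggy(blocks, predicate):
    for block in blocks:
        # Bug: `return` leaves the loop after the first block.
        return [row for row in block if predicate(row)]


def filter_blocks_fixed(blocks, predicate):
    for block in blocks:
        # `yield` emits one filtered block and then continues with the next.
        yield [row for row in block if predicate(row)]


def is_even(x):
    return x % 2 == 0


blocks = [[1, 2, 3], [4, 5, 6], [7, 8]]
buggy_result = filter_blocks_buggy(blocks, is_even)
fixed_result = list(filter_blocks_fixed(blocks, is_even))
```

Only the generator version visits every block; the buggy version silently drops everything after the first one.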

) make https://github.com/orgs/ray-project/teams/ray-core/members become the code-owner on most of core code paths

…ath (#32082)" (#32176) This reverts commit 223a9a6.

Why are these changes needed? We have the wrong unit translation right now when recording tasks' failed status if the owning job finishes. This results in negative durations for such tasks. Signed-off-by: rickyyx <rickyx@anyscale.com>

We currently resolve futures one-by-one in Ray Tune, and query Ray core for the ready status of futures multiple times. Instead, we can also cache ready events and yield them if cached elements exist. This can improve performance: In tune_scalability_result_cluster_throughput this improved performance by ~2-3%. We will always re-query Ray if we expect a resource to be ready. Signed-off-by: Kai Fricke <kai@anyscale.com>

Fixes a bad import causing an AIR benchmark release test to fail. Release test run: https://buildkite.com/ray-project/release-tests-pr/builds/27298 Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

We currently run into a syncing bottleneck when running many short-running trials in a multi-node cluster, see #32121. After some investigation, there are three major bottlenecks:
1. All of the 100 trials trigger 2 sync processes each. This is because we trigger a sync for both the result (`SyncerCallback.on_trial_result`) and for the trial completion (`SyncerCallback.on_trial_complete`).
2. We wait synchronously for the sync processes to finish on trial completion.
3. The packing and unpacking interferes with the actual training processes on the local node, drastically increasing trial runtime for those trials colocated with the driver script.

This PR mitigates 1) and 2) to unblock the coming release. For 3), we may have to re-architect the current packing logic that uses multiple pack actors and unpack tasks that can impact training performance.

For 1), we introduce a **minimum training time + iteration threshold** for the syncing process. By default, we only trigger the first sync after at least 2 results were received _or_ 10 training seconds passed. The logic here is that this will only affect experiments where we have short-running trials that report one result. In that case, we only need the `on_trial_complete` trigger at the end of training. Other experiments are unaffected and there's not much lost if we don't sync results from the first iteration that took less than 10 seconds to run.

For 2), we cache sync process removal on trial completion. This means we do not wait until the sync process has finished, but we keep the process around so we can await syncing at the end of the experiment. Periodically we clean up sync processes that were flagged for removal. Signed-off-by: Kai Fricke <kai@anyscale.com>
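The first-sync threshold for 1) boils down to a simple predicate. The function and parameter names below are hypothetical; only the default values (2 results, 10 seconds) come from the description above.

```python
def should_trigger_first_sync(num_results, train_seconds,
                              min_results=2, min_time_s=10.0):
    """True once a trial has reported enough results or trained long enough."""
    return num_results >= min_results or train_seconds >= min_time_s


# A short trial that reports a single quick result does not sync on that
# result; it relies on the completion trigger instead.
short_trial = should_trigger_first_sync(num_results=1, train_seconds=3.0)
# A slow first iteration still syncs, because the time threshold is met.
slow_trial = should_trigger_first_sync(num_results=1, train_seconds=12.0)
# Any trial with two or more results syncs as before.
multi_result = should_trigger_first_sync(num_results=2, train_seconds=1.0)
```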

…iment exists at a path/uri (#32003) This PR adds a utility to check if a given path (either local or remote) exists and can be restored from. It includes some simple validation that this is the root of the experiment directory (can't restore from the trial level directory). Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Signed-off-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

This has caused flaky test failures which are false positives.

…` and `ArrowVariableShapedTensorArray` (#32143) Add support for creating ArrowTensorArrays and ArrowVariableShapedTensorArray with string typed columns. The previous PR #31817 had CI test failures which were not run at PR-review time. This PR replicates the functionality of the previous PR, and additionally addresses the test failures (which only occur for Arrow 8.0+). Signed-off-by: Scott Lee <sjl@anyscale.com>

* Add links between progress bar and task table and actor table
* Add links from task table to logs and to view stack trace
* Fix horizontal scroll of table view
* Fix logs link going to old IA instead of new IA
* Add beta label

…ng messages (#32162) See follow-up comments in #31962 Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

…e code-path (#32082)" (#32176)" (#32190) This reverts commit 4d526c5.

Signed-off-by: Ram Rachum <ram@rachum.com>

Signed-off-by: Avnish <avnishnarayan@gmail.com>

…n a trainer runner and releasing resources (#32109) Signed-off-by: avnish <avnish@anyscale.com>

* RLlib's example test suite should run on no-gpu instances, so we should exclude the gpu tag Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

…#30611) pyarrow.fs.FileSystem.from_uri(uri) works if uri is of the form hdfs://name_server/user_folder/..., but fails if uri is of the form hdfs:///user_folder. Certain Ray Tune modules make it impossible to always supply the uri in the hdfs://name_server/user_folder/... format. If fsspec is available, we don't have this issue, so we give fsspec higher priority. Signed-off-by: yud <yud@uber.com>

…test"" (#32177) * Revert "Revert "[core] Increase the threshold for pubsub integration test (#32145)" (#32165)" This reverts commit 83e1a2a. Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

[core] release test for nested air (tune) oom #31768 Signed-off-by: Clarence Ng <clarence@anyscale.com>

Signed-off-by: David Xia <dxia@spotify.com>

…outside Ray (#31373) Answer a common user question by emphasizing in the docs that runtime envs are only active for Ray processes, so you shouldn't expect to be able to install a runtime env and then log into the cluster and start importing the packages outside Ray.

…32181) #32063 fixed some issues with the long_running_serve_failure release test and then marked it stable. The test ran successfully afterwards (see test run), but the CI failed to access logs from the cluster and reported the test as errored. The logs were inaccessible on the cluster due to an issue with the cluster setup. Since this test can run without persisting logs, this change drops the logging requirement for this test. Related issue number Closes #32169

Signed-off-by: jianoaix <iamjianxiao@gmail.com>

… ray (#32255) Signed-off-by: Alan Guo <aguo@anyscale.com> This lets users with their own Grafana setups have multiple dashboards, one per Ray instance. Without this change, every dashboard would have the same uid, and the dashboards would replace each other in the Grafana DB.

Signed-off-by: Alan Guo <aguo@anyscale.com> This is no longer necessary after #31577

Signed-off-by: Clarence Ng <clarence@anyscale.com> Remove redundant mock classes. We only need one mock class for the interface, which covers all the sub-interfaces. The mock for the sub-interface is unused.

…g proper error surfacing (#32269) There is a small typo in the tensorflow_benchmark.py script that does not properly catch when a vanilla TF run failed three times. Because of this, we would previously record a training time of 0.0 for vanilla TF, which skews the calculated average and suggests that vanilla TF outperformed Ray Train. Instead, we should have raised an error message to surface the problem. Signed-off-by: Kai Fricke <kai@anyscale.com>
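
The failure mode described above can be illustrated with a minimal sketch. This is not the actual `tensorflow_benchmark.py` code; `run_with_retries` and `max_attempts` are hypothetical names. The idea is that exhausting all attempts raises an error instead of silently reporting a time of 0.0.

```python
import time
from typing import Callable, Optional


def run_with_retries(run_once: Callable[[], None], max_attempts: int = 3) -> float:
    """Return the wall-clock time of the first successful run.

    Raises instead of returning 0.0 when every attempt fails, so a failed
    baseline cannot be mistaken for a very fast one.
    """
    last_error: Optional[BaseException] = None
    for _ in range(max_attempts):
        start = time.monotonic()
        try:
            run_once()
        except Exception as exc:  # assumption: any exception counts as a failed run
            last_error = exc
            continue
        return time.monotonic() - start
    raise RuntimeError(f"Run failed {max_attempts} times") from last_error
```

With this shape, an average over several runs can only include timings from runs that actually completed.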

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

The "Getting Started" page is long. It contains large code snippets and potentially irrelevant information. This PR revises the page for readability and brevity. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

Fixes a pyarrow issue where the syncing deadlocks when there are more files in a directory than available CPU cores. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>

In #32255, I added a new env var to customize the Grafana dashboard uid, but I forgot to use this var in the overview page. I also made the "View in Grafana" button take the user directly to the dashboard instead of the Grafana homepage. Signed-off-by: Alan Guo aguo@anyscale.com

To keep up with the CUDA versions needed for PyTorch 2.0, this PR adds a CUDA 11.8 image. Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>

Add __repr__() for ResultGrid class and prettify __repr__() of Result class. Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter>

…it (#32281) When kicking off release tests from Buildkite, it's easy to mistakenly enter a commit in both the Buildkite dialog and our own dialog. In the first case, Buildkite checks out the repository at that specific commit, so a test that is not contained in that commit can't be run for it. This PR provides a better error message in that case. Signed-off-by: Kai Fricke <kai@anyscale.com>

…s a child. (#32259) Signed-off-by: SangBin Cho <rkooo567@gmail.com> ray.cancel is only supported for tasks, not actor tasks (https://docs.ray.io/en/master/ray-core/package-ref.html#ray-cancel). This is by design, because canceling actor tasks could easily corrupt actor state. When ray.cancel is called, we set recursive=True, which means all child tasks are also canceled. However, if the task has a child actor task, this crashes the worker with `WorkerCrashedError: task_spec.cc:200: Check failed: sched_cls_id_ > 0` because we don't handle this case properly. To fix the issue, we check whether each child task is an actor task. This PR also improves the error message when recursive cancellation fails. Note that because ray.cancel is non-blocking, we couldn't include the error message in ray.get(canceled_task).
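
As a rough illustration of the check described above, here is a pure-Python model of recursive cancellation that skips actor tasks. It does not use Ray; the `Task` class and `cancel_recursive` function are hypothetical, and the choice not to descend into an actor task's own children is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    is_actor_task: bool = False
    children: List["Task"] = field(default_factory=list)
    cancelled: bool = False


def cancel_recursive(task: Task) -> List[str]:
    """Cancel a task and its non-actor descendants.

    Returns the names of actor tasks that were skipped, so the caller can
    report them instead of crashing.
    """
    if task.is_actor_task:
        # Actor tasks are never cancelled; report and stop here (assumption).
        return [task.name]
    task.cancelled = True
    skipped: List[str] = []
    for child in task.children:
        skipped.extend(cancel_recursive(child))
    return skipped
```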

… available (#32286)

Signed-off-by: Balaji Veeramani <balaji@anyscale.com> #31989 broke the 📖 Documentation job. This PR fixes the doctest failure.

The dtype parameter of DLPredictor._predict_pandas and DLPredictor._predict_numpy is None by default, but the type hint suggests dtype is non-None. This PR fixes the type hint by labeling the parameter as Optional. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

) Closes #31779

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

…31927) Signed-off-by: Jun Gong <gongjunoliver@hotmail.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>

…#32299) This PR removes references to Ray Client in Tune and Train examples. It also removes outdated references of needing `ray.init("auto")` being used to connect to an existing cluster vs. `ray.init()` creating a new local cluster. The latest `ray.init()` docstring explains that: > This method handles two cases; either a Ray cluster already exists and we just attach this driver to it or we start all of the processes associated with a Ray cluster and attach to the newly started cluster. New version of this PR: #31712 Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

) Improve handle_result (result alert logic) for release tests when the fetched result is empty due to infra issues, for example if the job server on the cluster is down (we rely on it to get files back to the Buildkite runners). Without this, the error code indicates an application error, which is misleading. See an example here: https://buildkite.com/ray-project/release-tests-branch/builds/1318#0185fc29-1d4c-483a-999b-ede500781c7a Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

…2262) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* add test cases and make nesteddict also support empty elements Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

Tracks the total number of tasks created by leveraging the gcs_task_manager.

This is not caused by a leak. The root cause is that we allocate more requests at startup. This PR fixes it by keeping the number of calls constant.

Remove some debugging messages I accidentally merged in #32106.

…32290)

TorchPredictor doesn't work with TorchVision detection models because they return List[Dict[str, torch.Tensor]] instead of torch.Tensor. This PR adds a TorchDetectionPredictor so users don't have to extend TorchPredictor themselves. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* make only one hidden layer possible * move setting out output dims to setup() Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Eric Liang ekhliang@gmail.com Why are these changes needed? Preserving order decreases performance, so it is turned off by default.

This is a corner case where the buffer could be 0, and a comment from the previous PR needs to be fixed.

- Add an index page to list all the APIs. (https://ray--32307.org.readthedocs.build/en/32307/serve/api/index.html)
- With this change, when you search for a specific Python API, e.g. `ray.serve.run`, the search result shows the core API link page. (Previously, users couldn't get the correct search result because we put all APIs on one page.)

<img width="604" alt="image" src="https://user-images.githubusercontent.com/6515354/217628692-720b9344-061d-44de-bc77-ee0c0ef27276.png">

* Modifications to gpu resource logic in rl_trainer
  - Add support for gpu with local mode for tf trainers
  - Remove `_make_distributed_module`
  - Add support for `local_gpu_id`, which is the id of the gpu to use during local mode training with gpu
  - Refactor tf function tracing logic to include the call to strategy.run
  - Change tf function logic to prevent unnecessary retracing
  - Add a warning to not do gpu or distributed training in tf without turning on eager tracing

Signed-off-by: avnish <avnish@anyscale.com>

This diagram is currently only placed on the key concepts page. However, when I search for ray jobs, I usually only end up on the job overview page and couldn't find this diagram. This diagram will be very helpful to people who need an overview of ray jobs which this page is intended for.

…new node (#32303) We have a release test named long_running_node_failures which intermittently fails because a node failed to start up. I couldn't debug it despite having all of the Ray logs. I created this PR to add a bit more information (the node socket that should have started up) in the hopes that this enables us to identify the issue next time it happens. Failure in long_running_node_failures: #32180

* [release] update if xgboost test suite require result or not. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * format Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Revert "format" This reverts commit 3140401. * Revert "[release] update if xgboost test suite require result or not." This reverts commit 03ca1c0. * change to default alert. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * remove tests from xgboost_tests alerts. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> --------- Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

…ised (#32329)

This PR updates the memory formatting to show usage and total in independent, friendly units. This should make it easier to tell when a small amount of memory is being used that would otherwise be rounded to 0, which is often confusing for downscaling.

```
======== Autoscaler status: 2020-12-28 01:02:03 ========
Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Usage:
 0/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00GiB/8.00GiB memory
 0B/16.00GiB object_store_memory

Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()
```

and

```
======== Autoscaler status: 2020-12-28 01:02:03 ========
Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Usage:
 0/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00GiB/8.00GiB memory
 3.14GiB/16.00GiB object_store_memory

Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()
```

are some examples of what the updated output may look like. Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Alex <alex@anyscale.com>
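
One way to get "independent, friendly units" is to format each side of the ratio on its own. The sketch below is an illustration only, not the autoscaler's actual formatting code; `format_memory` and `format_usage` are hypothetical names, and the two-decimal binary units are chosen to match the sample output above.

```python
def format_memory(num_bytes: float) -> str:
    """Render a byte count in the largest binary unit that keeps it >= 1."""
    units = [("TiB", 2**40), ("GiB", 2**30), ("MiB", 2**20), ("KiB", 2**10)]
    for suffix, size in units:
        if num_bytes >= size:
            return f"{num_bytes / size:.2f}{suffix}"
    return f"{int(num_bytes)}B"


def format_usage(used: float, total: float) -> str:
    # Each side picks its own unit, so a small usage is not rounded to zero
    # just because the total is measured in GiB.
    return f"{format_memory(used)}/{format_memory(total)}"
```

For example, 5 MiB used out of 16 GiB renders as `5.00MiB/16.00GiB` rather than `0.00GiB/16.00GiB`.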

…t` output to match docs (#31166) This PR cleans up a few usability issues around Ray clusters:
- Makes some cleanups to the ray start log output to match the new documentation on Ray clusters. Mainly, de-emphasize Ray Client and recommend jobs instead.
- Add an opt-in flag for enabling multi-node clusters for OSX and Windows. Previously, it was possible to start a multi-node cluster, but then any Ray programs would fail mysteriously after connecting to the cluster. Now, it will warn the user with an error message if the opt-in flag is not set.
- Document multi-node support for OSX and Windows.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

…2278) This implements a very simple version of locality-aware task assignment. The locality assignment problem is complex, but here we will start by just preferentially assigning tasks to actors if the first block of the bundle is local. We will record perf metrics on the locality hit/miss rate. This feature is flag protected (on by default).

Actor locality on:

```
MapBatches(Model): 0 active, 0 queued, 0 actors [987 locality hits, 13 misses]: 100%|█████████| 1000/1000 [01:01<00:00, 16.28it/s]
Average throughput 16.072036005250155 GiB/s
```

Actor locality off:

```
MapBatches(Model): 0 active, 0 queued, 0 actors [locality off]: 100%|███████████████████████████| 1000/1000 [03:01<00:00, 5.50it/s]
Average throughput 5.471759229068149 GiB/s
```
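
The selection rule described above (prefer an actor that is local to the bundle's first block, and count hits and misses) can be sketched as follows. This is a simplified model, not the Ray Data actor pool; `LocalityStats`, `pick_actor`, and the actor-id-to-node-id mapping are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LocalityStats:
    hits: int = 0
    misses: int = 0


def pick_actor(
    first_block_node: Optional[str],
    idle_actors: Dict[str, str],  # hypothetical layout: actor id -> node id
    stats: LocalityStats,
) -> Optional[str]:
    """Prefer an idle actor on the node that holds the bundle's first block."""
    if not idle_actors:
        return None
    for actor_id, node_id in idle_actors.items():
        if node_id == first_block_node:
            stats.hits += 1
            return actor_id
    # No co-located actor is idle: fall back to any idle actor and count a miss.
    stats.misses += 1
    return next(iter(idle_actors))
```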

* Temporary fix to the leela chess example * Remove leela chess from the release test framework, move it to tuned examples Signed-off-by: avnish <avnish@anyscale.com>

Signed-off-by: rickyyx <rickyx@anyscale.com> This PR aims to improve the performance of the task backend with 3 changes:
- Delay conversion of protobuf. We found that the protobuf conversion, especially from TaskSpecification to the TaskInfoEntry needed for task metadata, was slow and was in the critical path of task execution and submission. This PR delays the generation of rpc::TaskEvents until just before sending in the flush thread. During task execution, it simply generates an in-memory TaskEvent entry with lower overhead.
- Fixed the circular buffer that's used as the underlying data structure for the buffered events. This prevents constant resizing when the buffer fills up or is flushed, which is costly.
- Adjust the niceness of the flushing thread so it has a lower priority than the worker thread.
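
The first two changes can be modeled together in a short sketch: a buffer that never resizes, storing cheap in-memory events and converting them only at flush time. This is an illustration under assumptions, not the C++ task event buffer; `TaskEventBuffer` and its `convert` callback are hypothetical, and dropping the oldest entry when full is an assumption of this sketch.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class TaskEventBuffer:
    """Fixed-capacity buffer that defers expensive conversion to flush time."""

    def __init__(self, capacity: int, convert: Callable[[Any], Any]):
        # deque(maxlen=...) never grows; the oldest entry is dropped when full.
        self._events: Deque[Any] = deque(maxlen=capacity)
        self._convert = convert

    def record(self, event: Any) -> None:
        # Hot path: store the cheap in-memory event only.
        self._events.append(event)

    def flush(self) -> List[Any]:
        # Flush path: convert everything buffered so far, then clear.
        converted = [self._convert(e) for e in self._events]
        self._events.clear()
        return converted
```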

Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Cade Daniel <cade@anyscale.com>

For the gRPC callback API, the lifecycle differs between the server and client side. The server has to call Finish for gRPC to consider the call dead, and Finish can only be called once. The client destructs itself when it receives the signal from the server or when the connection is broken for some reason. There are two issues here in the Ray syncer:
- The server might call Finish twice, because it has both OnWriteDone and OnReadDone. The fix is to call Finish when an error happens and guarantee that it is only called once.
- The client might destruct itself, because nothing was added to control that. The fix is to add AddHold/RemoveHold in the code to control this explicitly, just like on the server side.

Testing is tricky, but it can be caught by nightly tests.
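
The "call Finish exactly once" guarantee is a general pattern that can be shown independently of gRPC. The sketch below is a language-neutral illustration in Python rather than the C++ syncer code; `FinishOnce` is a hypothetical helper.

```python
import threading
from typing import Callable


class FinishOnce:
    """Wrap a finish callback so that concurrent callers invoke it at most once."""

    def __init__(self, finish: Callable[[], None]):
        self._finish = finish
        self._lock = threading.Lock()
        self._finished = False

    def __call__(self) -> bool:
        # Only the first caller flips the flag; later callers return False.
        with self._lock:
            if self._finished:
                return False
            self._finished = True
        self._finish()
        return True
```

Both the read-done and write-done error paths can then call the same wrapper without coordinating with each other.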

Actor fault tolerance page is a better place for actor checkpointing. Also make the code example testable. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

Signed-off-by: SangBin Cho <rkooo567@gmail.com> There are 2 issues:
- The duration should be recorded in microseconds. I mistakenly recorded it as 10*microseconds, which made the duration incorrect.
- The metadata event should be recorded only once. I mistakenly recorded it for every task, which blows up the timeline file size.

This PR fixes both issues and adds relevant tests. I also created a dataclass for Chrome tracing events for better schema tracking.
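
A dataclass for Chrome tracing events might look roughly like the sketch below. The field names (`name`, `ph`, `pid`, `tid`, `ts`, `dur`, `args`) follow the Chrome Trace Event Format, but `ChromeTracingEvent` and `build_timeline` are hypothetical and are not the classes added by this PR. The sketch shows both fixes: durations are converted to microseconds exactly once, and the metadata event is emitted once rather than per task.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ChromeTracingEvent:
    name: str
    ph: str  # "X" for a complete event, "M" for a metadata event
    pid: int
    tid: int
    ts: float = 0.0  # start time in microseconds
    dur: float = 0.0  # duration in microseconds
    args: Dict[str, Any] = field(default_factory=dict)


def build_timeline(tasks: List[Dict[str, Any]], pid: int = 1, tid: int = 1) -> str:
    # Emit the metadata event once, not once per task.
    events = [ChromeTracingEvent("process_name", "M", pid, tid, args={"name": "worker"})]
    for task in tasks:
        start_us = task["start_s"] * 1e6
        dur_us = (task["end_s"] - task["start_s"]) * 1e6
        events.append(ChromeTracingEvent(task["name"], "X", pid, tid, start_us, dur_us))
    return json.dumps([asdict(e) for e in events])
```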

…metric recording on counters. (#32355) Signed-off-by: rickyyx <rickyx@anyscale.com> When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously had mutexes protecting some of the counter states, so the main io service/thread would get blocked when trying to acquire locks to print debug states, record metrics, and add telemetry data.

    Global stats: 196276 total (5 active)
    Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s
    Execution time: mean = 295.864 us, total = 58.071 s
    Event stats:
    ....
    GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s
    GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s
    ....

This PR introduces a thread-safe wrapper on CounterMap, so that modifying and reading the various debug counters has minimal lock contention. It also merges the count by task type for telemetry into the counter map, so we no longer need to acquire locks in various places. With access to the counters now thread-safe, we can also remove the mutex locks on GcsTaskManagerStorage, since it is only accessed from its dedicated io thread.
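
The idea of a thread-safe counter wrapper can be sketched in a few lines. This is an illustration only, not the C++ CounterMap wrapper from this PR; `ThreadSafeCounterMap` and its method names are hypothetical. The lock is held only for the duration of a single dictionary operation, which keeps contention low.

```python
import threading
from collections import defaultdict
from typing import Dict, Hashable


class ThreadSafeCounterMap:
    """Counters guarded by one short-lived lock per operation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[Hashable, int] = defaultdict(int)

    def increment(self, key: Hashable, amount: int = 1) -> None:
        with self._lock:
            self._counts[key] += amount

    def get(self, key: Hashable) -> int:
        with self._lock:
            return self._counts.get(key, 0)

    def snapshot(self) -> Dict[Hashable, int]:
        # Copy under the lock so readers can format debug output lock-free.
        with self._lock:
            return dict(self._counts)
```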

Ports over previous rule to move RandomizeBlockOrder to the end of a DAG into the new execution backend as an optimizer rule. Closes #31894 Signed-off-by: amogkam <amogkamsetty@yahoo.com>

…evice (#31753) When DatasetIterator is used with Ray Train, automatically move the torch tensors returned by iter_torch_batches to the correct device. Signed-off-by: amogkam <amogkamsetty@yahoo.com>

This implements the abstractions introduced in #31236. Changes: - We move to a static callback definition to better match other existing APIs - We split the RayEventManager into an RayActorManager (for actors) and a RayEventManager (for futures) - Instead of awaiting an arbitrary number of results, we have a `next()` method to await exactly one event, as this is the only thing needed for Train/Tune - We simplified the APIs and reduced the number of concepts. This PR comes with two end-to-end example flows for Ray Train- and Ray Tune-like flows. Signed-off-by: Kai Fricke <kai@anyscale.com>

Implement asynchronous update function along with a small test to see that it converges to the same results as the synchronous update Signed-off-by: avnish <avnish@anyscale.com>

…ray start` output to match docs (#31166)" (#32403) This reverts commit 90f8511.

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com> The 3x nightly dask test is failing because the group-by-owner OOM killer policy was enabled. This switches the test to use the previous policy.

HuggingFacePredictor's use_gpu was set in the wrong method, causing it to not really work correctly. This PR fixes that. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* Modifications to gpu resource logic in rl_trainer
  - Add support for gpu with local mode for tf trainers
  - Remove `_make_distributed_module`
  - Add support for `local_gpu_id`, which is the id of the gpu to use during local mode training with gpu
  - Refactor tf function tracing logic to include the call to strategy.run
  - Change tf function logic to prevent unnecessary retracing
  - Add a warning to not do gpu or distributed training in tf without turning on eager tracing

Signed-off-by: avnish <avnish@anyscale.com>

…tain timings (#31464) Restore fails if the object is still being created, so with certain timings the pull hangs.

…e retry test. (#32242) * [Tune] Improve logging, unify requeue logic, improve trial restore retry test. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * fix unit test. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * lint Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * fix test_tuner_restore Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> --------- Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

…GetAllJobInfo endpoint (#32388) The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen. This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR. Related issue number Closes #32213
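
The core of the fix is to stop assuming a one-to-one mapping between drivers and submission ids. A minimal sketch of grouping instead of asserting equality follows; it is not the GCS implementation, and `group_jobs_by_submission` and the dictionary layout of a job entry are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_jobs_by_submission(jobs: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each submission_id to all driver job ids that share it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for job in jobs:
        submission_id = job.get("submission_id")
        if submission_id is not None:
            # Several drivers may share one submission_id, e.g. when the
            # entrypoint command calls ray.init() more than once.
            grouped[submission_id].append(job["job_id"])
    return dict(grouped)
```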

…32411) This is to fix the Dataset.__repr__ issue in #32410, after we introduce function name in #31526. We should only make operator/stage name to be camel case. Signed-off-by: Cheng Su <scnju13@gmail.com>

…it tests (#32342) Every X seconds, when we record metrics, we check all pending updates from counter_map. If there are pending updates, we invoke the registered callback for the relevant updates, which records metrics. Currently, we have 3 counter_maps: regular (containing all data), get, and wait. For the get and wait counter_maps, although there are updates, we don't register callbacks (they are used to calculate the correct RUNNING / GET / WAIT counts). So normally, this is what happens:
1. A task gets into the RUNNING state. counter_map is updated and a callback is added.
2. Get is called, and the get counter_map is updated. The callback is not updated (by design).

If metrics are recorded after step 2, the callback from the regular counter_map is invoked and we record correct metrics. If metrics are recorded after step 1, the RUNNING state is recorded, but since we don't register callbacks for the get counter_map, the relevant updates are not recorded the next time metrics are recorded. The flakiness comes from the latter case. This fixes the issue by making a "no-op update" to the regular counter_map (e.g., Increment(0)). This triggers counter_map to invoke a callback again, which correctly updates the get & wait status. I could also refactor the code to not use the get & wait counter_maps, but this approach is much easier, so I decided to go with it. This PR also fixes the slow stats report issue.
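
The "no-op update" trick can be modeled with a small counter map that fires callbacks only for keys marked as pending. This is a simplified Python model, not the C++ counter_map; the `CounterMap` class, `on_change` callback, and `flush` method here are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


class CounterMap:
    """Counters whose callback fires only for keys with pending updates."""

    def __init__(self, on_change: Callable[[str, int], None]):
        self._counts: Dict[str, int] = defaultdict(int)
        self._pending: Set[str] = set()
        self._on_change = on_change

    def increment(self, key: str, amount: int = 1) -> None:
        self._counts[key] += amount
        # Even increment(key, 0) marks the key as pending, which forces the
        # callback to run again on the next flush.
        self._pending.add(key)

    def flush(self) -> None:
        for key in sorted(self._pending):
            self._on_change(key, self._counts[key])
        self._pending.clear()
```

A flush with no pending keys does nothing, which is why an update made only to a separate map without callbacks would otherwise go unreported.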

We are dropping data at 10K by default. This increases the buffer size for now, until we figure out a way to store bursty task submissions.

… `TorchVisionTransform` (#32383) Transforms like RandomHorizontalFlip expect Torch tensors as input, but if you're applying the transform per-epoch, then you can't use ToTensor. To fix the problem, this PR updates TorchVisionPreprocessor to convert ndarray inputs to Torch tensors. You can't use ToTensor to convert the ndarrays to Torch tensors because then you'd be applying ToTensor twice, and your images would get scaled incorrectly. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

- Add triage label to enhancement and doc issues as well - Don't auto close issues pending triage Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

Why are these changes needed? Deprecating ray client related docs.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

…t (1st part) (#32394) This is to update Ray Data documentation and code example to reflect lazy execution by default. This covers the rest of documentation other than #32387 . Signed-off-by: Cheng Su <scnju13@gmail.com>

We believe this has minimal impact on performance, so we are reverting the unnecessary code. Signed-off-by: rickyyx <rickyx@anyscale.com>

…del. (#32387) This PR updates the docs for a portion of the feature guides, the FAQ, the examples, and the docstrings for the Dataset, GroupedDataset, and read APIs, to reflect the new lazy-by-default execution semantics.

Fix release blocker issue: #32203 Ran 6 times and all of them passed. Signed-off-by: jianoaix <iamjianxiao@gmail.com>

At the moment, autoscaler commands fail (and head node set up fails) if the user doesn't have a .bashrc. This seems like an unnecessary requirement for startup. There's also a completely pointless true &&, which looks like an artifact from someone's refactor.

## Why are these changes needed?
Currently, the worker leaks when the task references a global import such as tensorflow. A couple of issues led to this bug:
- When the worker finishes executing, it does not clean up all its borrowed references.
- The reference counting code treats a borrowed reference as something it owns.
- If the worker thinks it owns references, it will not exit.
- The worker pool will not force-exit an idle worker, even if the job is dead, if the worker refuses to exit due to the aforementioned object ownership.

This PR implements logic in the worker pool to force-kill an idle worker whose job has exited.
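
The selection rule for the force-kill can be sketched as a simple filter. This is a toy model rather than the raylet worker pool; the `Worker` dataclass and `workers_to_force_kill` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Worker:
    worker_id: str
    job_id: str
    idle: bool


def workers_to_force_kill(workers: Iterable[Worker], finished_jobs: Set[str]) -> List[str]:
    """Select idle workers whose job has exited.

    These are killed regardless of whether the worker believes it still owns
    object references, since nothing can use those references any more.
    """
    return [w.worker_id for w in workers if w.idle and w.job_id in finished_jobs]
```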

This PR makes the ray.data.from_*() APIs lazy.

Signed-off-by: Cheng Su <scnju13@gmail.com>

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

…EL_DEFAULTS (#31821)

… (where it's simple to add). (#32475)

In #28149 RayActorError is called with a str as cause, but this is not an accepted type. This leads to hitting the assertion error in the else case: assert isinstance(cause, ActorDiedErrorContext) on L283.

Signed-off-by: Pratik <pratikrajput1199@gmail.com>

…ainers (#32471) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

…rate file (#32457) Experiment state management is currently convoluted. We keep track of many duplicate variables, e.g. local/remote checkpoint dirs and syncers. The resume/syncing logic also takes up a lot of space in the trial runner. Saving and restoring experiment state is orthogonal to the actual trial lifecycle logic, thus it makes sense to separate this out. In the same go, I've removed a lot of duplicated state and simplified some APIs that will also make it easier to test the experiment state component separately. Signed-off-by: Kai Fricke <kai@anyscale.com>

An identical error message is returned in multiple cases if something goes wrong when pinging the api/version endpoint. This PR adds more information to the error message in case where the endpoint returns 404 in order to help with debugging.

…added to an operator (#32482) This PR ensures that the object store utilization for a bundle is still tracked when it's queued internally by an operator, e.g. MapOperator queueing bundles for the sake of bundling up to a minimum bundle size, or due to workers not yet being ready for dispatch.

* [tune/train] remove duplicated keys in tune/train results. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * timestamp Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * result_timestamp defaults to None Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * fix test Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * fix progress_reporter test. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * .get(, None) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * fix test Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * fix test_gpu Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * WORKER_ Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> --------- Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

…y_tasks (#32438) Signed-off-by: rickyyx <rickyx@anyscale.com> We are calculating actor creation task submission time, which is less useful for this test.

Following our tune package restructure (https://github.com/ray-project/ray/pulls?q=is%3Apr+in%3Atitle+%5Btune%2Fstructure%5D), we have now had 3 releases where we logged a warning (2.0-2.3). For 2.4, we should raise an error instead. For 2.5, we can remove the old files/packages. Signed-off-by: Kai Fricke <kai@anyscale.com>

…torch code example (#32058) The example under Ray AI Runtime/Example section directly used native PyTorch datasets for data loading. It's good to clarify that the current approach is for simplicity, the more recommended approach is to use the Ray dataset. Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MacBook-Pro.local>

This PR always preserves order for the bulk executor. We may revisit this in the future, at which point we'd update all of the tests that rely on order preservation.

This PR fixes the `Stopper` doctests that are erroring. Previously, it used a `tune.Trainable` as its trainable, which would error on fit since its methods are not implemented. Also, it was missing some imports. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

…port (#32447)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

#32457 refactored the experiment checkpoint management but introduced a bug where state is not correctly restored anymore. This was caught by a unit test error. This PR resolves the bug and makes sure the test passes. Signed-off-by: Kai Fricke <kai@anyscale.com>

Similar to #31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

One source of flakiness in test_dataset.py is the timeout. This splits the torch tests out of this big test file. #32067

Follow up to #32015.

This PR is to add logical operator for group-by aggregate. The change includes: * `Aggregate`: the logical operator for aggregate * `generate_aggregate_fn`: the generated function for aggregate operator * `SortAggregateTaskSpec`: the task spec for doing sort-based aggregate, mostly refactored from [_GroupbyOp](https://github.com/ray-project/ray/blob/master/python/ray/data/grouped_dataset.py#L35).
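
For readers unfamiliar with sort-based aggregation, the general technique is to sort rows by the group key and then aggregate each run of equal keys. The sketch below shows that idea on plain Python dictionaries; it is not the `SortAggregateTaskSpec` implementation, and `sort_aggregate` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Callable, Dict, List, Tuple


def sort_aggregate(
    rows: List[Dict[str, Any]],
    key: str,
    agg: Callable[[List[Dict[str, Any]]], Any],
) -> List[Tuple[Any, Any]]:
    """Sort rows by `key`, then aggregate each run of rows sharing that key."""
    ordered = sorted(rows, key=itemgetter(key))
    # groupby only merges adjacent equal keys, which is why the sort comes first.
    return [(k, agg(list(group))) for k, group in groupby(ordered, key=itemgetter(key))]
```

Usage: `sort_aggregate(rows, "k", lambda g: sum(r["v"] for r in g))` returns one `(key, sum)` pair per distinct key, ordered by key.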

#32486 introduced two test failures after hard-deprecating a structure refactor. This PR fixes these two stale imports. Signed-off-by: Kai Fricke <coding@kaifricke.com>

This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class. This PR is a followup to #31204 and #32311, which made the same changes for Ray Data/Tune docs. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

By default, autosummary only shows one line for each class member instead of the entire docstring. Ideally the fix would be to autosummary class members as well, but that generates too many doc pages and causes the doc build to time out. For now, default to showing the docstrings of class members in the class pages, with an explicit opt-in to autosummary class members. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

… output to match docs (#32409) Un-revert #31166. This PR cleans up a few usability issues around Ray clusters: - Makes some cleanups to the ray start log output to match the new documentation on Ray clusters. Mainly, de-emphasize Ray Client and recommend jobs instead. - Add an opt-in flag for enabling multi-node clusters for OSX and Windows. Previously, it was possible to start a multi-node cluster, but then any Ray programs would fail mysteriously after connecting to the cluster. Now, it will warn the user with an error message if the opt-in flag is not set. - Document multi-node support for OSX and Windows. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

…32531) Signed-off-by: amogkam <amogkamsetty@yahoo.com>

… data import (#32447)" (#32533) This reverts commit bc01288.

) This PR fixes trainable actor reuse to update the remote trial directory that it's writing checkpoints to. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>


Commits on Jan 27, 2023

Commits on Jan 28, 2023

Commits on Jan 29, 2023

Commits on Jan 30, 2023

Commits on Jan 31, 2023

Commits on Feb 1, 2023

Commits on Feb 2, 2023

Commits on Feb 3, 2023

Commits on Feb 4, 2023

Commits on Feb 6, 2023

Commits on Feb 7, 2023

Commits on Feb 8, 2023

Commits on Feb 9, 2023

Commits on Feb 10, 2023

Commits on Feb 11, 2023

Commits on Feb 13, 2023

Commits on Feb 14, 2023