upstream changes #6
Commits on Jan 27, 2023
-
Add informative progress bar names to map_batches (#31526)
Signed-off-by: pdmurray <peynmurray@gmail.com>
-
Enable Log Rotation on Serve (#31844)
This PR adds log rotation for Ray Serve, letting it inherit the rotation parameters (max_bytes, backup_count) from Ray Core. This brings a more consistent logging experience to Ray, instead of having the serve/ folder grow forever while the other logs rotate.
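The rotation behavior described here (a size limit plus a bounded number of backups) can be sketched with Python's stdlib rotating handler; this is a plain-stdlib illustration of what max_bytes/backup_count control, not Serve's actual logger setup:

```python
import logging
import logging.handlers
import os
import tempfile

# Size-based log rotation: when the file would exceed maxBytes, it is
# renamed to serve.log.1 (keeping up to backupCount backups) and a fresh
# file is started. Ray's max_bytes/backup_count map onto these arguments.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "serve.log")
handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1024, backupCount=3
)
logger = logging.getLogger("serve_rotation_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

for i in range(200):
    logger.info("request %d handled", i)  # enough volume to force rotation

backups = sorted(f for f in os.listdir(log_dir) if f.startswith("serve.log."))
print(backups)  # at most backupCount rotated files are kept
```

With rotation, total disk usage is bounded by roughly (backupCount + 1) x maxBytes, which is the point of the change.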
-
[core][state] Handle driver tasks (#31832)
This PR adds additional information to the driver task event, namely the driver task type and its running/finished timestamps. This allows users (i.e. the dashboard) to inspect driver tasks more easily. This PR also exposes the exclude_driver flag to the state API, allowing requests through HTTP and ListApiOptions to get driver tasks, while the default behaviour of the state API is still to exclude them. This PR also filters out any tasks without task_info to prevent missing-data issues.
-
[serve] Add exponential backoff when retrying replicas (#31436)
If a deployment is repeatedly failing, perform exponential backoff so that we do not try to restart the replica at a very fast rate. Closes #31121
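The delay schedule behind exponential backoff can be sketched in a few lines; the constants below are illustrative, not Serve's actual defaults:

```python
# Illustrative exponential backoff for replica restarts: the wait time
# doubles with each consecutive failure, capped at a maximum delay.
# base_delay_s, factor, and max_delay_s are hypothetical values here.
def backoff_delay_s(num_consecutive_failures: int,
                    base_delay_s: float = 1.0,
                    factor: float = 2.0,
                    max_delay_s: float = 64.0) -> float:
    return min(base_delay_s * factor ** num_consecutive_failures, max_delay_s)

print([backoff_delay_s(n) for n in range(8)])
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 64.0]
```

The cap matters: without max_delay_s, a replica that fails for a long time would eventually wait arbitrarily long before its next restart attempt.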
-
[RLlib] Fixed the autorom dependency issue (#31933)
Co-authored-by: Cade Daniel <edacih@gmail.com> Closes #31880
-
Polish the Dashboard new IA part 2 (#31946)
Adds back the metrics page; adds buttons to visit the new dashboard and to go back; adds buttons for leaving feedback and viewing docs; adds color to the status badges of the tasks and placement groups tables; adds an alert when Grafana is not running; fixes the copy button icon; separates the metrics page into sections (both new IA and old IA).
-
[Tune] Clarify which `RunConfig` is used when there are multiple places to specify it (#31959)
This PR clarifies where RunConfig can be specified. When multiple configs are specified in different locations (in the Tuner and the Trainer), it also logs which RunConfig is actually used. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
-
[docs] Fix linkcheck error and map batches docstring test (#31996)
581cd4e moved some test files, breaking a link from the documentation. cc @iycheng 3343c76 changed the MapBatches string representation, breaking a docstring test. cc @peytondmurray Signed-off-by: Kai Fricke <kai@anyscale.com>
-
[Datasets] [Autoscaling Actor Pool - 1/2] Refactor `MapOperator`, execution state, and task submitters. (#31986)
-
[data] [streaming] [12/n] Improve output backpressure reporting and management (#31979)
Before this PR, stalls in the consumer thread would fully block the control loop. This provides backpressure, but at the cost of performance. This PR fully decouples the consumer thread from the control loop thread, allowing execution to proceed as long as there is sufficient object_store_memory budget remaining. It also adds a progress bar for the output queue, showing the number of output bundles consumed and the number of bundles queued for output.
-
[tune] Fix tune_cloud_* tests for new Trial constructor arguments (#3…
-
[core] Remove legacy memory monitor from task submission codepath (#31993)
Remove the legacy memory monitor from the worker submission code path, as it was already disabled by default in Ray 2.2.
-
[docs] Revamp Ray core fault tolerance guide (#27573)
The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections? _Originally posted by @ericl in #27573 (review) Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
-
[Serve] [release test] Add max_retries and max_restarts (#32011)
The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes. Related issue number Addresses #31741
Commits on Jan 28, 2023
-
[core][state] Adjust worker side reporting with batches && add debugstring (#31840)
Signed-off-by: rickyyx <rickyx@anyscale.com> This PR introduces a flag RAY_task_events_send_batch_size that controls the number of task events sent to GCS in a batch. With the default setting, each core worker sends 10K task events per second to GCS, where GCS can handle 10K task events in ~50 milliseconds. This PR also adjusts the worker side buffer limit to 1M with the new batching setting, and adds some debug information as well.
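The batching described here (flush buffered events in chunks of at most batch_size per send) can be sketched as follows; the function and event shape are hypothetical, not Ray's internal reporting code:

```python
# Hypothetical sketch of batched event reporting: buffered task events
# are flushed to the server in chunks of at most batch_size per send,
# so one flush never produces an oversized RPC.
from typing import Iterable, List

def to_batches(events: List[dict], batch_size: int) -> Iterable[List[dict]]:
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]

buffered = [{"task_id": i} for i in range(25)]
sizes = [len(batch) for batch in to_batches(buffered, batch_size=10)]
print(sizes)  # → [10, 10, 5]
```

Tuning the batch size trades RPC count against per-RPC processing time, which is why the PR exposes it as a flag.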
-
[Dataset] Exclude breaking test case in `read_parquet_benchmark_single_node` release test (#31904)
The release test read_parquet_benchmark_single_node fails due to using Python 3.7 and not having the pickle5 package installed. A similar issue is discussed in #26225. We found that the test failure is contained to the portion which tests a Dataset with a filter expression (the error is related to pickling this filter expression). Therefore, we will temporarily disable this portion of the test while keeping the rest of the release test (which I verified passes on the same cluster). We can come back to this in the future and fix the case with filter. Example of release test successfully running with the filter case removed. Signed-off-by: Scott Lee <sjl@anyscale.com>
-
[Data] Add tests for remainder of map_batches operations with new optimizer (#31985)
Signed-off-by: amogkam <amogkamsetty@yahoo.com> The following operations call map_batches directly: add_column, drop_columns, select_columns, random_sample. In this PR we add e2e tests for these operations with the new optimizer. In a future PR, we should refactor so that these operations do not call into map_batches and instead have their own logical operators.
-
[ci/release] Change exponential_backoff_retry to use warn instead of …
- 51c5eda
-
[Datasets] [Autoscaling Actor Pool - 2/2] Add autoscaling support to `MapOperator` actor pool. (#31987)
This PR adds support for autoscaling to the actor pool implementation of `MapOperator` (stacked on top of #31986). The same autoscaling policy as the legacy `ActorPoolStrategy` is maintained, while providing more aggressive and sensible downscaling: if there are more idle actors than running/pending actors, scale down; once we're done submitting tasks, cancel pending actors and kill idle actors. In addition to autoscaling, `max_tasks_in_flight` capping is also implemented.
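One reading of the downscaling rule above can be sketched as a pure function; this is a simplified illustration of the stated policy, not the actual actor pool implementation, which tracks more state:

```python
# Simplified sketch of the downscaling rule: scale down when idle
# actors outnumber the running and pending actors combined. Purely
# illustrative of the policy described in the commit message.
def should_scale_down(num_idle: int, num_running: int, num_pending: int) -> bool:
    return num_idle > num_running + num_pending

print(should_scale_down(num_idle=5, num_running=2, num_pending=1))  # → True
print(should_scale_down(num_idle=2, num_running=2, num_pending=1))  # → False
```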
-
[Dashboard] Add cluster utilization graph (#31896)
This PR adds the cluster utilization page in the landing view. Co-authored-by: Alan Guo <aguo@anyscale.com>
-
[Datasets] Add logical operator for randomize_block_order() (#31977)
This PR adds a logical operator for randomize_block_order(). The change includes: introducing AbstractAllToAll for all logical operators converted to AllToAllOperator; a RandomizeBlocks logical operator for randomize_block_order(); and _internal/planner, to hold the Planner logic and the generated function for randomize_blocks. This can be used later to create MapOperator/AllToAllOperator.
-
- b58bb93
-
[core] Add code owner to GCS module. (#32018)
-
Refactor block_fn out of map-like logical operators (#32021)
Signed-off-by: Cheng Su <scnju13@gmail.com>
-
[train][docs] fix doc search issues, examples gallery & filter (#31635)
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
-
[Dashboard] Timeline implemented by a new task backend (#31856)
Signed-off-by: SangBin Cho <rkooo567@gmail.com> This PR adds the timeline to the Ray dashboard using the new task backend. It implements the task events -> Chrome tracing logic (most of the code is copied from existing code; TODO: add unit tests, as the one we already have is pretty weak). It creates a timeline endpoint that can 1. download the JSON file (to download & upload manually) and 2. return the JSON array buffer (to load onto Perfetto directly), and creates a subsection with 3 features: 1. a download button, 2. an open-Perfetto button, 3. an instruction accordion.
-
[RLlib] Separate PPO torch regression test, and make it longer (#31892)
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
- c889349
- 80d13d1
Commits on Jan 29, 2023
-
[Datasets] [Docs] Add `seealso` to map-related methods (#30579)
This PR adds seealso notes to help users distinguish between map, flat_map, and map_batches. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
-
[RLlib] Give more time to impala release tests (#31910)
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
-
[docs] remove archive link (#32030)
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Commits on Jan 30, 2023
-
Fix whitespace in help message for ray cli (#31905)
Without this patch, several of the help texts are missing whitespace. For example, `--dashboard-host` appears as follows: --dashboard-host TEXT the host to bind the dashboard server to, either localhost (127.0.0.1) or 0.0.0.0 (available from all interfaces). By default, thisis localhost. This patch adds the missing trailing whitespace so the words are separated correctly. Signed-off-by: Luke Hsiao <luke.hsiao@numbersstation.ai>
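This class of bug comes from Python's implicit concatenation of adjacent string literals, which joins fragments with no separator unless each fragment carries its own trailing space. The strings below are illustrative fragments, not the actual ray CLI source:

```python
# Adjacent string literals are concatenated with no separator, so a
# help message split across source lines needs an explicit trailing
# space on each fragment. "broken" reproduces the "thisis" bug.
broken = ("the host to bind the dashboard server to. By default, this"
          "is localhost.")
fixed = ("the host to bind the dashboard server to. By default, this "
         "is localhost.")
print(broken)  # → ...By default, thisis localhost.
print(fixed)   # → ...By default, this is localhost.
```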
- 3fc2aac
-
[RLlib] Reparameterize the construction of TrainerRunner and RLTrainers (#31991)
Trying out a new configuration pattern for the trainer runner and RL trainers. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
- d26b55b
-
[RLlib] Contribution of LeelaChessZero algorithm for playing chess in a MultiAgent env. (#31480)
-
[2/n] Stabilize GCS/Autoscaler interface: Drain and Kill Node API (#32002)
This PR adds a DrainAndKillNode endpoint to the monitor service. It has the exact same semantics as GcsNodeManager::HandleDrainNode. Co-authored-by: Alex <alex@anyscale.com>
-
[Core] Remove dead actor checkpoint code (#32045)
The checkpointable actor was already removed in #10333.
-
Revert "Revert "[core] Fix gcs health check manager crash when node is re…
-
[tune] Do not default to reuse_actors=True when mixins are used (#31999)
Mixins don't work well with reuse_actors because the init is only called on construction. In the case of mlflow, this means that reused actors will try to overwrite state from the trials that previously ran on them. This is incorrect behavior and errors on the mlflow server side. Thus, we should default to not reuse actors for mixins. Signed-off-by: Kai Fricke <kai@anyscale.com>
-
[metrics] Switch metric view to 5 min by default (#32065)
Signed-off-by: Eric Liang <ekhliang@gmail.com>
-
[data] [streaming] Fixes to autoscaling actor pool streaming op (#32023)
Fixes: properly wire max tasks per actor to the pool; account for internal queue size in the scheduling algorithm; small improvements to progress bar UX.
-
[CI] Increase target time for `test_result_throughput_cluster` (#32062)
-
[core] Add generic `__ray_ready__` method to Actor classes (#31997)
We currently have no canonical way to await actors. Users can define their own _is-ready_ methods, schedule a future, and await these, but this has to be done for every actor class separately. This does not match other patterns - e.g. we have `actor.__ray_terminate__.remote()` for actor termination and `placement_group.ready()` for placement group ready futures. This PR adds a new `__ray_ready__` magic actor method that just returns `True`. It can be used to await actors becoming ready (newly scheduled actors), and actors having processed all of their other enqueued tasks. Signed-off-by: Kai Fricke <kai@anyscale.com>
-
[Serve] Mark `long_running_serve_failure` test as `stable` (#32063)
The long_running_serve_failure release test was marked as unstable due to recent failures. Recently, #31945 and #32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations so it doesn't run forever, and it marks the test as stable.
-
[core] Reduce the timeout for many nodes actor tests. (#32066)
Reduce the timeout for the many-nodes actor tests, given that a test should finish within 1h. This can save some cost on problematic runs.
- fefd5e3
Commits on Jan 31, 2023
-
[Datasets] Remove the non-useful comment in `map_batches()` (#32020)
This PR is a quick fix to remove the non-useful comment introduced in #31526, probably during debugging, and replace it with a meaningful one.
-
simplify metrics page (#32089)
Signed-off-by: Eric Liang <ekhliang@gmail.com> Combine the tasks and actors sections; move object store memory back up to the logical section (it's one of the most useful metrics and shouldn't be buried); improve titles.
-
[docs] Update top-navigation.js (#32075)
Currently, the dropdown menu "Resources" in the Ray documentation contains a link called "Training." This link points to the [same site](https://www.anyscale.com/events) as "Events." However, we want this to direct to the repository of [technical training content](https://github.com/ray-project/ray-educational-materials). Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
-
[docs] deploying static ray cluster to K8S with external Redis for fault tolerance (#31949)
This PR adds the documentation and sample config files for deploying Ray to K8S without using KubeRay. As KubeRay CRDs need cluster-scoped permissions, this PR helps users who do not have cluster-scoped permissions install a Ray cluster on their K8S clusters.
-
fix frontend tests after #32089 (#32097)
Signed-off-by: Alan Guo <aguo@anyscale.com>
- 8a0e453
- 06197a5
-
Advanced Progress Bar (#31750)
This progress bar automatically shows progress by groupings. Things that belong to the same parent are all put in a group. If a group has multiple children with the same name, those are merged together into a virtual group. These virtual groups have different visual treatment because a virtual group should not add an additional level of nesting.
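The grouping rule described above (group by parent, merge same-named siblings into one virtual group) can be sketched with plain dicts; the task shape below is hypothetical, not the dashboard's actual data model:

```python
# Hypothetical sketch of the progress-bar grouping rule: tasks sharing
# a parent form a group, and same-named children within that group are
# merged into a single "virtual group" (no extra nesting level).
from collections import defaultdict

tasks = [
    {"id": 1, "parent": None, "name": "train"},
    {"id": 2, "parent": 1, "name": "shard"},
    {"id": 3, "parent": 1, "name": "shard"},
    {"id": 4, "parent": 1, "name": "eval"},
]

groups = defaultdict(lambda: defaultdict(list))
for task in tasks:
    groups[task["parent"]][task["name"]].append(task["id"])

# The two "shard" children of parent 1 collapse into one virtual group.
print(dict(groups[1]))  # → {'shard': [2, 3], 'eval': [4]}
```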
-
[spark] Automatically shut down ray on spark cluster if user does not execute commands on databricks notebook for a long time (#31962)
Databricks Runtime provides an API, dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution(), that returns the elapsed milliseconds since the last databricks notebook code execution. This PR calls this interface to monitor notebook activity and shut down the Ray cluster on timeout. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
-
[Datasets] Add support for string tensor columns in `ArrowTensorArray` and `ArrowVariableShapedTensorArray` (#31817)
Add support for creating ArrowTensorArrays and ArrowVariableShapedTensorArrays with string-typed columns. Signed-off-by: Scott Lee <sjl@anyscale.com>
-
[RLlib] Upgrade tf eager code to no longer use `experimental_relax_shapes` (but `reduce_retracing` instead). (#29214)
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
-
[RLlib] Change Waterworld v3 to v4 and reinstate indep. MARL test case w/ pettingzoo. (#31820)
-
[RLlib; docs] Change links and references in code and docs to "Farama foundation's gymnasium" (from "OpenAI gym"). (#32061)
-
[Datasets] Fix to pass TaskContext in generate_random_shuffle_fn() (#32101)
This PR fixes master by resolving the conflict between #32080 and #32081: pass TaskContext in random_shuffle.py:generate_random_shuffle_fn(); add AllToAllTransformFn and rename TransformFn to MapTransformFn; update the function return type in generate_map_xxx_fn(). Signed-off-by: Cheng Su <scnju13@gmail.com>
-
[release] minor fix to pytorch_pbt_failure test when using gpu. (#32070)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
-
[air] Add test for remote_storage with real hdfs backend. (#31940)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
-
[RLlib] [Ray 2.3 release] Marking RLlib release tests as unstable if xfail (#32072)
-
[Datasets] Add logical operator for repartition() (#32102)
This PR adds logical operator for `repartition()`. Only implement shuffle repartition (`repartition.py:generate_repartition_fn()`). Non-shuffle repartition is left as TODO, as the corresponding code in [fast_repartition.py](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/fast_repartition.py) involves `BlockList`, `ExecutionPlan` and `Dataset.split()`, so it needs a deeper refactoring and code change.
-
[Core] Expose Internal KV MultiGet operation (#32096)
This PR exposes the MultiGet operation to the InternalKVInterface. The MultiGet operation is already supported in the two backends (InMemory and Redis), so this PR is just plumbing. This change is needed to support getting multiple keys from the Internal KV in a single RPC.
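The plumbing described here (one call fetching several keys instead of N round trips) can be sketched with a toy key-value store; this interface is hypothetical and is not Ray's actual InternalKVInterface:

```python
# Sketch of a MultiGet operation on a toy KV store: returning several
# keys in one logical RPC replaces N separate Get round trips.
from typing import Dict, List, Optional

class InMemoryKV:
    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def multi_get(self, keys: List[bytes]) -> Dict[bytes, bytes]:
        # Only keys that are present appear in the result.
        return {k: self._data[k] for k in keys if k in self._data}

kv = InMemoryKV()
kv.put(b"a", b"1")
kv.put(b"b", b"2")
print(kv.multi_get([b"a", b"b", b"missing"]))  # → {b'a': b'1', b'b': b'2'}
```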
- e3001e9
-
[AIR] Add option for per-epoch preprocessor (#31739)
This adds an option to the AIR DatasetConfig for a preprocessor that gets reapplied on each epoch. Currently the implementation uses DatasetPipeline to ensure that the extra preprocessing step is overlapped with training. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
-
[observability][autoscaler] Ensure pending nodes is reset to 0 after scaling (#32085)
The previous way pending_nodes was calculated was prone to race conditions; instead, let's just always publish it in the main thread with the other metrics. Closes #31982 Co-authored-by: Alex <alex@anyscale.com>
-
[tune/execution] Update staged resources in a fixed counter for faster lookup (#32087)
In #30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This led to a performance regression first identified in #31337. After manual profiling, the regression seems to come from `RayTrialExecutor._count_staged_resources`. We have 1000 staged trials, and this function is called on every step, executing a linear scan through all trials. This PR fixes this performance bottleneck by keeping state in the resource counter instead of dynamically recreating it every time. This is simple, as we can just add/subtract the resources whenever we add/remove from the `RayTrialExecutor._staged_trials` set. Manual testing confirmed this improves the runtime of `tune_scalability_result_throughput_cluster` from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor. Signed-off-by: Kai Fricke <kai@anyscale.com>
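The fix described above, replacing a per-step linear scan with a counter updated on stage/unstage, can be sketched as follows; the class and resource dict shape are illustrative, not Tune's actual `RayTrialExecutor` code:

```python
# Sketch of the fixed-counter approach: instead of recomputing staged
# resources by scanning all trials on every step (O(n) per lookup),
# keep a running counter updated on stage/unstage (O(1) per lookup).
from collections import Counter

class StagedResources:
    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def stage(self, resources: dict) -> None:
        for name, amount in resources.items():
            self._counts[name] += amount

    def unstage(self, resources: dict) -> None:
        for name, amount in resources.items():
            self._counts[name] -= amount

    def total(self, name: str) -> float:
        return self._counts[name]

staged = StagedResources()
for _ in range(1000):
    staged.stage({"CPU": 1, "GPU": 0.5})
staged.unstage({"CPU": 1, "GPU": 0.5})
print(staged.total("CPU"), staged.total("GPU"))  # → 999 499.5
```

The invariant to maintain is that every stage is matched by exactly one unstage; with that, the counter always equals what the linear scan would have computed.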
-
Revert "[RLlib] Reparameterize the construction of TrainerRunner and RLTrainers (#31991)" (#32130)
Reverts #31991, which seems to have broken CI. The error is https://buildkite.com/ray-project/oss-ci-build-branch/builds/2099#01860972-e02e-47c4-8f86-8be28ea18d92/3786-3992: AttributeError: '_TFStub' object has no attribute 'Tensor'
-
[Dashboard] Better gpu utilization (#32125)
Instead of averaging out, we should do sum(gpu_utilization) / sum(num_gpus) to cap the max percentage at 100%.
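The aggregation above can be sketched as follows; the per-node record shape is hypothetical, but the arithmetic is the one the commit describes (total utilization over total GPU count, which stays within 0-100):

```python
# Sketch of the sum/sum aggregation: each node reports the sum of its
# per-GPU utilization percentages plus its GPU count, and the cluster
# figure divides total utilization by total GPUs. The node dict shape
# here is hypothetical, not the dashboard's actual reporting format.
def cluster_gpu_utilization(nodes):
    total_util = sum(node["gpu_utilization_sum"] for node in nodes)
    total_gpus = sum(node["num_gpus"] for node in nodes)
    return total_util / total_gpus if total_gpus else 0.0

nodes = [
    {"gpu_utilization_sum": 100.0, "num_gpus": 1},  # 1 GPU fully busy
    {"gpu_utilization_sum": 100.0, "num_gpus": 4},  # 4 GPUs at 25% each
]
print(cluster_gpu_utilization(nodes))  # → 40.0
```

Averaging the two node averages (100% and 25%) would give 62.5% and over-weight the small node; sum/sum weights every GPU equally and can never exceed 100%.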
-
[core] Update the scalability envelope (#32131)
With the recent update of the nightly tests, update the data here. In the nightly tests, we use 2k nodes (2 CPUs per node) and 20k actors, but if better nodes are used, we can run more than 40k actors. https://buildkite.com/ray-project/release-tests-branch/builds/1321#018604d7-86a3-4fad-ac6c-803db73821d3
Commit f28428e
Fix docs lint for advanced progress bar (#32124)
Signed-off-by: Alan Guo <aguo@anyscale.com> Fix lint for #31750
Commit b4221c9
[Datasets] [Operator Fusion - 1/2] Add operator fusion to new executi…
…on planner. (#32095) This PR adds operator fusion to the new execution planner.
Commit 2137945
[RLlib] Fix waterworld example and test (#32117)
* Remove empty parser.add_argument() in test file * remove --framework=torch * fix BUILD * use training_iteration as stopping criterion Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Commit 12ff13d
[RLlib] Error out if action_dict is empty in MultiAgentEnv. (#32129)
* [release] minor fix to pytorch_pbt_failure test when using gpu. (#32070) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Commit 3b1e21f
Commits on Feb 1, 2023
-
[CI] [Datasets] Run Datasets test suites on AIR changes (#32118)
Datasets depends on ray.air for several key features (tensor extensions, Arrow transformations, data batch conversions), and not running the Datasets test suite in PR builds on ray.air changes has let breakages go undetected. This PR changes this so that when files under python/ray/air change, we trigger the Datasets test suite in CI. Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>
Commit 1454e63
[runtime env] Clarify error message about where to install `smart_ope…
…n` for remote URI (#32110) At least two users reported encountering the error `ImportError: You must pip install smart_open and pip install boto3 to fetch URIs in s3 bucket` and trying to fix it by specifying these packages in the `pip` field of `runtime_env`, which won't work because the runtime_env setup code doesn't run inside the runtime_env. This PR clarifies the error message to say that they must be preinstalled on the cluster, and adds a note to the docs.
Commit 909c220
Commit 6ec71d7
Commit be6b598
[Doc] Update the doc to mention dynamic resource update is not allowe…
…d. (#31664) Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Commit 5c11090
[ci] disable hdfs test for compat tests. (#32148)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Commit 13d0982
[core][oom] enable group by parent policy by default (#31976)
Why are these changes needed? Fail the task if it is the last task of the group, per the new (group-by-parent) worker killing policy. Related issue numbers: #32149, #32078. Co-authored-by: Clarence Ng <clarence@anyscale.com>
Commit dff4f0a
Revert "[Docker] (Kubeflow integration) Add chmod --recursive 777 /ho…
…me/ray to Ray Dockerfile." #32026 Signed-off-by: kaihsun <kaihsun@anyscale.com>
Commit df05cd9
[Core] update grpc to 1.46.6 (#32054)
#31956 Upgrade to a version of gRPC that addresses GHSA-cfmr-vrgj-vqwv in zlib. 1.46.6 has this patch: grpc/grpc#31845
Commit 47bb652
[Core] Join Ray Jobs API `JobInfo` with GCS `JobTableData` (#31046)
Why are these changes needed? Add a new protobuf for JobInfo from the Ray Jobs API. Augment the existing GCS GetAllJobInfo endpoint to return this information, if available (not all GCS jobs were submitted via the Ray Jobs API; those jobs won't have this extra JobInfo). Related issue number: Closes #29621
Commit b2c5e63
Commit d74e4c4
[Dashboard] Support ray status output to the dashboard job page (#32040)
This is the initial prototype of integrating ray status into the frontend. We could have returned structured data from the backend, but I decided to parse the ray status output on the frontend for a quick implementation (so that we can support it from Ray 2.3).
Commit 77ac9c2
[Observability] Unpin open telemetry version for tracing feature (#32120)
Signed-off-by: SangBin Cho <rkooo567@gmail.com> Why are these changes needed? This PR unpins the version of OpenTelemetry, as the pin is too strict for an experimental tracing feature. Related issue number: Closes #32051
Commit 5dd1406
Commit fb1e0b0
[Core] Pick node from top k by default. (#31868)
This PR takes over #28179. Why are these changes needed? Today, with the default scheduling policy, Ray will try to pack tasks onto nodes until resource utilization is beyond a certain threshold, and spread tasks afterwards. This slows down scheduling for embarrassingly parallel jobs: we only move on to another node once the current node's resources are sufficiently utilized; for each node, the overhead of accepting a new job and starting new workers is not negligible; and the overall scheduling speed doesn't scale with the number of nodes. This PR is one proposal to address the problem: instead of sticking to one node, we randomly choose one node from the top-k nodes for default scheduling, where nodes are sorted by their resource utilization in reverse order. Intuitively, this lets us kick off worker startup on multiple nodes in parallel with scheduling. Benchmark results: baseline: 10 parallelism, top 1, 25 tasks/s; 10 parallelism, top 6, 30 tasks/s; 64 parallelism, top 6, 126 tasks/s; 100 parallelism, top 6, 150 tasks/s; 1000 parallelism, top 6, ~374.9 tasks/s; 10 concurrent, top 12, 176 tasks/s; 64 concurrent, top 12, ~182.6 tasks/s; 128 concurrent, top 12, ~246 tasks/s; 256 concurrent, top 12, 298…
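A minimal sketch of the top-k idea, picking uniformly at random among the k most-utilized nodes that still have free capacity; this is an illustration, not Ray's actual C++ scheduler code, and the utilization representation is assumed:

```python
import random

def pick_node_top_k(node_utilization, k):
    """Pick a node at random among the top-k nodes, where nodes are sorted
    by resource utilization in descending order (packing bias) and fully
    utilized nodes are excluded.

    node_utilization: dict of node_id -> utilization in [0.0, 1.0].
    Returns a node id, or None if no node has free capacity.
    """
    candidates = [
        node for node, util in sorted(
            node_utilization.items(), key=lambda kv: kv[1], reverse=True
        )
        if util < 1.0  # skip nodes that are already full
    ]
    top_k = candidates[:k]
    return random.choice(top_k) if top_k else None
```

With k=1 this degrades to the old "pack one node at a time" behavior; a larger k lets worker startup proceed on several nodes in parallel.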
Commit cf7bc27
[Dashboard] Support actor detail (#32103)
This PR adds the actor detail page. Besides the detail page, it also adds the pg id to tasks/actors and adds profiling links to the job detail page, job rows, and the actor detail page.
Commit d4b0a20
[Datasets] Add logical operator for sort() (#32133)
This PR adds the logical operator for `sort()`. The change includes: * the `Sort` logical operator * `SortTaskSpec`, copied from `sort.py` * `generate_sort_fn`, the generated function for sort
Commit 75419d3
Signed-off-by: Simran Mhatre <simran@anyscale.com>
Commit b8221bb
[core] Increase the threshold for pubsub integration test (#32145)
The test failed under ASan because some data is not cleaned up when it exits. Increase the threshold to mitigate it. Tested locally: out of 500 runs, only 3 failed.
Commit 12d7d7d
[core] surface OOM error when actor is killed due to OOM (#32107)
Right now we show an actor error if the actor is killed due to OOM. This PR changes it to surface an OOM error instead. It does not support actor / actor task OOM retry, as the goal of this PR is to improve observability by setting the actor's death cause to OOM. Related issue number: #29736 Signed-off-by: Aviv Haber <aviv@anyscale.com> Signed-off-by: Clarence Ng <clarence@anyscale.com>
Commit 174f157
[Tune] Save and restore stateful callbacks as part of experiment chec…
…kpoint (#31957) Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Commit 890e034
[Tune] Rename `overwrite_trainable` argument in Tuner restore to `trainable` (#32059)
* Add trainable and deprecate overwrite_trainable Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Commit 59f72cf
Commit 83e1a2a
[core] clean up infeasible tasks submitted by the driver when the dri…
…ver dies (#32127) Signed-off-by: Clarence Ng <clarence.wyng@gmail.com> Infeasible requests are not cleaned up when the driver exits. This PR cleans up infeasible requests created by the driver when it exits. It does not apply to worker exit (follow-up), nor to infeasible tasks submitted to a different raylet (follow-up).
Commit aad24bd
Signed-off-by: SangBin Cho <rkooo567@gmail.com> Add the job id to the task state API call. This helps us avoid including tasks from other jobs (improving the experience when there are 10K+ tasks in the cluster). Add resource requirements to the pg table.
Commit eb660ce
[core][state][dashboard] Use main thread's task id or actor creation…
… task id for parent's task id in state API (#32157) Right now, if a new thread (or an async actor's event-loop thread) runs some Ray code (e.g., submitting a task, calling the runtime context), the thread will have a WorkerThreadContext with a random task id. This causes issues in the state API, since the task tree will have the wrong structure, i.e. some tasks might have a parent_task_id that doesn't match any existing task. For a normal single-threaded task/actor, we will use the main thread's task id (correct behavior). For unusual cases (threaded/async actors), we will use the actor creation task's task id. This means that in the advanced visualization, all remote tasks created from actor tasks will appear under the constructor of threaded/async actors.
Commit 10c46dc
[air][tune] replace node:<ip> custom resource with NodeAffinitySchedu…
…lingPolicy (#32016) This PR changes usages of the `node:<ip>` custom resource as determined by querying [file:(air|tune|train).*\.py node:](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/ray-project/ray%24+file:%28air%7Ctune%7Ctrain%29.*%5C.py+node:). This is being used for: - Collocating tasks (`_force_on_current_node`). - Syncing files to specific IP addresses. - Syncing files to _all_ other nodes. Signed-off-by: Matthew Deng <matt@anyscale.com>
Commit 666e2d9
[Ray release] Moving Atari ROM dependencies to S3 (#32150)
In #31933 we fixed an Atari ROM dependency that by default uses a torrent to download ROMs. The tests in this PR also break occasionally for the same reason. I moved the ROM dependency to S3 to increase reliability. I actually think we can remove the ROM dependency from these app configs, since I don't see any RL test using them, but that is too much risk for this PR, since it will likely end up as a cherry-pick to 2.3.
Commit 24d0376
[Core] automatically pick max_pending_lease_requests based on number …
…of nodes in the cluster (#31934) Why are these changes needed? This PR takes over #26373. Currently, the initial scheduling delay for a simple f.remote() loop is approximately worker startup time (~1s) * number of nodes. There are three reasons for this: 1. Drivers do not share physical worker processes, so each raylet must start new worker processes when a new driver starts. Each raylet starts the workers when the driver first sends a lease (resource) request to that raylet. 2. #14790 prefers to pack tasks on fewer nodes up to 50% CPU utilization before spreading tasks for load balancing. 3. The maximum number of concurrent lease requests is 10, meaning that the driver must wait for workers to start on the first 10 nodes it contacts before sending lease requests to the next set of nodes. Because of (2), the first 10 nodes contacted are usually not unique, especially when each node has many cores. This PR changes (3), allowing us to dynamically adjust max_pending_lease_requests based on the number of nodes in the cluster. Without this PR, the top-k scheduling algorithm is bottlenecked by the speed of sending lease requests across the cluster.
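The scaling idea can be sketched roughly like this; the exact rule in the PR may differ, so treat the formula as an assumption for illustration:

```python
def max_pending_lease_requests(num_nodes, floor=10):
    """Hypothetical heuristic: instead of a fixed cap of 10 concurrent
    lease requests, scale the cap with cluster size so the driver can
    contact many raylets (and trigger worker startup) in parallel."""
    return max(floor, num_nodes)
```

On a 3-node cluster this keeps the old cap of 10; on a 200-node cluster the driver may have up to 200 lease requests in flight, so lease-request dispatch no longer bottlenecks top-k scheduling.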
Commit ff16730
[Datasets] Fix filter logic and reuse output buffer (#32160)
This PR fixes the filter logic: it should always `yield` instead of `return`, otherwise it will just read the first block and exit. Adds a unit test, verified to fail without this fix. Also changes all map-like functions to reuse the same output buffer.
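The class of bug described here, returning from inside a generator-style loop instead of yielding, can be shown in plain Python (the function names are illustrative, not the Datasets internals):

```python
def filter_blocks_buggy(blocks, predicate):
    """Intended to filter every block, but `return` exits the loop
    after the first block: all remaining blocks are silently dropped."""
    for block in blocks:
        filtered = [row for row in block if predicate(row)]
        return filtered  # BUG: exits after the first block

def filter_blocks_fixed(blocks, predicate):
    """`yield` turns this into a generator that emits one filtered
    block per input block, so every block is processed."""
    for block in blocks:
        yield [row for row in block if predicate(row)]
```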
Commit e9269ab
[Core] add ray-core as code-owner for most of the core code-path (#32082)
Make https://github.com/orgs/ray-project/teams/ray-core/members the code owner on most core code paths.
Commit 223a9a6
Commit 4d526c5
[core][state] Fix task failed time when job finishes (#32161)
Why are these changes needed? We currently have the wrong unit translation when recording tasks' failed status if the owning job finishes. This results in a negative duration for such tasks. Signed-off-by: rickyyx <rickyx@anyscale.com>
Commit f49b1b2
[tune/execution][rfc] Cache ready futures in RayTrialExecutor (#32093)
We currently resolve futures one by one in Ray Tune, and query Ray Core for the ready status of a future multiple times. Instead, we can cache ready events and yield them if cached elements exist. This can improve performance: in tune_scalability_result_cluster_throughput, this improved performance by ~2-3%. We will always re-query Ray if we expect a resource to be ready. Signed-off-by: Kai Fricke <kai@anyscale.com>
Commit 6e39b2e
[Release] Fix bad import in AIR benchmark (#32175)
Fixes a bad import causing an AIR benchmark release test to fail. Release test run: https://buildkite.com/ray-project/release-tests-pr/builds/27298 Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Commit 6d39879
[tune] Sync less often and only wait at end of experiment (#32155)
We currently run into a syncing bottleneck when running many short-running trials in a multi-node cluster, see #32121. After some investigation, there are three major bottlenecks: 1. All of the 100 trials trigger 2 sync processes each. This is because we trigger a sync both for the result (`SyncerCallback.on_trial_result`) and for the trial completion (`SyncerCallback.on_trial_complete`). 2. We wait synchronously for the sync processes to finish on trial completion. 3. The packing and unpacking interferes with the actual training processes on the local node, drastically increasing trial runtime for those trials colocated with the driver script. This PR mitigates 1) and 2) to unblock the coming release. For 3), we may have to re-architect the current packing logic that uses multiple pack actors and unpack tasks that can impact training performance. For 1), we introduce a **minimum training time + iteration threshold** for the syncing process. Per default, we only trigger the first sync after at least 2 results were received _or_ 10 training seconds have passed. The logic here is that this only affects experiments with short-running trials that report one result; in that case, we only need the `on_trial_complete` trigger at the end of training. Other experiments are unaffected, and there's not much lost if we don't sync results from a first iteration that took less than 10 seconds to run. For 2), we cache sync process removal on trial completion. This means we do not wait until the sync process finishes, but keep the process around so we can await syncing at the end of the experiment. Periodically, we clean up sync processes that were flagged for removal. Signed-off-by: Kai Fricke <kai@anyscale.com>
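The first-sync threshold logic described above can be sketched as follows. The class name and defaults mirror the description but are illustrative, not Tune's actual API:

```python
import time

class SyncThrottle:
    """Allow the first sync only after min_iter results were received
    OR min_time_s of training time has passed; afterwards, always sync."""

    def __init__(self, min_iter=2, min_time_s=10.0):
        self.min_iter = min_iter
        self.min_time_s = min_time_s
        self.num_results = 0
        self.start_time = time.monotonic()
        self.synced_once = False

    def on_result(self):
        self.num_results += 1

    def should_sync(self):
        if self.synced_once:
            return True  # threshold only gates the *first* sync
        elapsed = time.monotonic() - self.start_time
        if self.num_results >= self.min_iter or elapsed >= self.min_time_s:
            self.synced_once = True
            return True
        return False
```

A trial that reports a single result in under 10 seconds thus never triggers a result-driven sync, and only the completion-time sync runs.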
Commit 1f53e60
[Tune] Add `Tuner.can_restore(path)` utility for checking if an experiment exists at a path/uri (#32003)
This PR adds a utility to check whether a given path (either local or remote) exists and can be restored from. It includes some simple validation that this is the root of the experiment directory (you can't restore from the trial-level directory). Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Signed-off-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Commit d6de1ce
Commits on Feb 2, 2023
-
[ci][job] Move test_cli_integration to large test (#32171)
This has caused flaky test failures which are false positives.
Commit a954ab7
[Datasets] Add support for string tensor columns in `ArrowTensorArray…
…` and `ArrowVariableShapedTensorArray` (#32143) Add support for creating ArrowTensorArrays and ArrowVariableShapedTensorArray with string typed columns. The previous PR #31817 had CI test failures which were not run at PR-review time. This PR replicates the functionality of the previous PR, and additionally addresses the test failures (which only occur for Arrow 8.0+). Signed-off-by: Scott Lee <sjl@anyscale.com>
Commit 74266a2
Add links between the progress bar and the task and actor tables. Add links from the task table to logs and to viewing stack traces. Fix the horizontal scroll of the table view. Fix the logs link going to the old IA instead of the new IA. Add a beta label.
Commit 5091217
[spark] Refine some text in Ray on Spark exception messages and warni…
Commit ed83715
Commit ada5db7
[RLlib] Fix typehint for `explore` argument. (#30734)
Signed-off-by: Ram Rachum <ram@rachum.com>
Commit 29cd2fa
[RLlib] Add tags option to actor manager (#31803)
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Commit a53907c
[RLlib] Optimize the trainer runner test, add method for shutting dow…
…n a trainer runner and releasing resources (#32109) Signed-off-by: avnish <avnish@anyscale.com>
Commit fdfef1f
[RLlib] Exclude gpu tag from Examples test suite in RLlib (#32141)
* RLlib's example test suite should run on no-gpu instances, so we should exclude the gpu tag Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Commit b81f0cd
Commits on Feb 3, 2023
-
[air] avoid inconsistency of create filesystem from uri for hdfs case (…
…#30611) pyarrow.fs.FileSystem.from_uri(uri) works if the uri is of the form hdfs://name_server/user_folder/..., but fails if the uri is of the form hdfs:///user_folder. Certain Ray Tune modules make it impossible to always supply the uri in the hdfs://name_server/user_folder/... format. With fsspec available we don't have this issue, so we give fsspec a higher priority. Signed-off-by: yud <yud@uber.com>
Commit b31343a
Revert "Revert "[core] Increase the threshold for pubsub integration …
Commit 6f97a83
[core] release test for nested air (tune) oom (#31768)
Signed-off-by: Clarence Ng <clarence@anyscale.com>
Commit 370a574
Commits on Feb 4, 2023
-
[Docs] Fix typo in Huggingface example notebook (#32218)
Signed-off-by: David Xia <dxia@spotify.com>
Commit 8b55e2d
Commit 37c0f76
Commit 715e1b2
Commit 5503bcd
Commits on Feb 6, 2023
-
[Doc] [runtime env] Address common question about importing packages …
…outside Ray (#31373) Answer a common user question by emphasizing in the docs that runtime envs are only active for Ray processes, so you shouldn't expect to be able to install a runtime env and then log into the cluster and start importing the packages outside Ray.
Commit 276559e
[Serve] Remove logging requirement for `long_running_serve_failure` (#32181)
#32063 fixed some issues with the long_running_serve_failure release test and then marked it stable. The test ran successfully afterwards (see test run), but CI failed to access logs from the cluster and reported the test as errored. The logs were inaccessible due to an issue with the cluster setup. Since this test can run without persisting logs, this change drops the logging requirement for this test. Related issue number: Closes #32169
Commit 2314775
[Datasets] Deflake the test_dataset.py (#32200)
Signed-off-by: jianoaix <iamjianxiao@gmail.com>
Commit 095960c
Commits on Feb 7, 2023
-
Commit e71e3a7
Allow overriding the UID of the default grafana dashboard exported by…
… ray (#32255) Signed-off-by: Alan Guo <aguo@anyscale.com> This lets users with their own Grafana setups have multiple dashboards, one per Ray instance. Without this change, each dashboard would have the same uid, and they would replace each other in the Grafana DB.
Commit f3ae74e
Remove metrics-based progress-bar endpoints (#31702)
Signed-off-by: Alan Guo <aguo@anyscale.com> This is no longer necessary after #31577
Commit 8030e51
clean up raylet client mocks (#32216)
Signed-off-by: Clarence Ng <clarence@anyscale.com> Remove redundant mock classes. We just need one mock class for the interface that covers all the sub-interfaces; the mocks for the sub-interfaces are unused.
Commit eec9791
Commit 7432367
[air/benchmarks] Fix typo in tensorflow_benchmark.py script preventin…
…g proper error surfacing (#32269) There is a small typo in the tensorflow_benchmark.py script that prevents properly catching when a vanilla TF run failed three times. Because of this, we would previously record a training time of 0.0 for vanilla TF, which skews the calculated average and suggests that vanilla TF outperformed Ray Train. Instead, we should have raised an error to surface the problem. Signed-off-by: Kai Fricke <kai@anyscale.com>
Commit c83111a
[RLlib] Chaining Models in RLModules (#31469)
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Commit 027965b
[Data] Revise "Getting Started" page (#31989)
The "Getting Started" page is long. It contains large code snippets and potentially irrelevant information. This PR revises the page for readability and brevity. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Commit 2efee15
[Tune] Add `use_threads=False` in pyarrow syncing (#32256)
Fixes a pyarrow issue where syncing deadlocks when there are more files in a directory than available CPU cores. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>
Commit 773f7bf
Fix overview page to work with the new DASHBOARD_UID env var (#32279)
In #32255, I added a new env var to customize the Grafana dashboard uid, but forgot to use this var on the overview page. I also made the "View in Grafana" button take the user directly to the dashboard instead of the Grafana homepage. Signed-off-by: Alan Guo <aguo@anyscale.com>
Commit ce5a21a
[build_base] [Docker] Add cuda 11.8 images (#32247)
To keep up with the CUDA versions needed for PyTorch 2.0, this PR adds a CUDA 11.8 image. Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>
Commit 9995599
[Tune] Add repr for ResultGrid class (#31941)
Add `__repr__()` for the ResultGrid class and prettify `__repr__()` of the Result class. Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter>
Commit cf95514
[ci/release] Improve error message when kicking off tests from a commit (#32281)
When kicking off release tests from Buildkite, it's easy to mistakenly insert a commit in both the Buildkite dialog and our own dialog. In the first case, the repository is checked out at the specific commit, so a test not contained in that commit can't be run for it. This PR provides a better error message in that case. Signed-off-by: Kai Fricke <kai@anyscale.com>
(commit 37580d7)
[Core] Fix recursive cancellation crashing the worker when an actor task is a child (#32259)
Signed-off-by: SangBin Cho <rkooo567@gmail.com> ray.cancel is only supported for tasks, not actor tasks (https://docs.ray.io/en/master/ray-core/package-ref.html#ray-cancel). This is an intentional design choice because canceling actor tasks could easily corrupt actor state. When ray.cancel is called, recursive=True is set, meaning all children tasks are also canceled. However, if the task has a child "actor task", the worker crashes with WorkerCrashedError: task_spec.cc:200: Check failed: sched_cls_id_ > 0, because that case wasn't handled properly. To fix the issue, we check whether each child task is an actor task. This PR also improves the error message when recursive cancellation fails. Note that because ray.cancel is non-blocking, the error message couldn't be included in ray.get(canceled_task).
(commit 00db336)
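The gist of the fix, sketched with plain dicts standing in for Ray's internal task records: recursive cancellation descends into all children but skips any child that is an actor task.

```python
def cancel_recursively(task: dict, cancelled: list) -> None:
    """task: {"name": str, "type": "task" | "actor_task", "children": [...]}

    Toy model only: cancel children first, then the task itself, but
    never cancel an actor task, since that could corrupt actor state.
    """
    for child in task.get("children", []):
        cancel_recursively(child, cancelled)
    if task["type"] == "actor_task":
        return  # actor tasks are not cancellable
    cancelled.append(task["name"])
```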
Commits on Feb 8, 2023
(commit 51efd2f)
[Datasets] Fix book-documentation (#32293)
Signed-off-by: Balaji Veeramani <balaji@anyscale.com> #31989 broke the 📖 Documentation job. This PR fixes the doctest failure.
(commit 3fa36d9)
[AIR] Fix `dtype` type hint in `DLPredictor` methods (#32198)
The dtype parameter of DLPredictor._predict_pandas and DLPredictor._predict_numpy defaults to None, but the type hint suggests dtype is non-None. This PR fixes the type hint by labeling the parameter as Optional. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
(commit 5e1def0)
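The shape of the fix, with an illustrative signature rather than DLPredictor's actual one: a parameter whose default is None must be annotated `Optional`.

```python
from typing import List, Optional


def predict(data: List[float], dtype: Optional[type] = None) -> List:
    """Cast each element to dtype when one is given; pass through otherwise.

    The Optional annotation matches the None default, which is the
    whole point of the type-hint fix.
    """
    if dtype is None:
        return data
    return [dtype(x) for x in data]
```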
(commit 3f43969)
[RLlib] PPO torch RLTrainer (#31801)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
(commit 1f77e04)
[Tune] Replace reference values in a config dict with placeholders (#31927)
Signed-off-by: Jun Gong <gongjunoliver@hotmail.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
(commit befad81)
(commit aa504ae)
(commit cefd3c4)
[Tune] Remove Ray Client references from Tune and Train docs/examples (#32299)
This PR removes references to Ray Client in Tune and Train examples. It also removes outdated references to `ray.init("auto")` being needed to connect to an existing cluster vs. `ray.init()` creating a new local cluster. The latest `ray.init()` docstring explains: "This method handles two cases; either a Ray cluster already exists and we just attach this driver to it or we start all of the processes associated with a Ray cluster and attach to the newly started cluster." New version of this PR: #31712 Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
(commit e84fcb1)
[release] Improve handle_result in case of empty fetched result (#32055)
Improve handle_result (the result alert logic) for release tests when the fetched result is empty due to infra issues, for example if the job server on the cluster (which we rely on to get files back to Buildkite runners) is down. Without this, the error code indicates an application error, which is misleading. See an example here: https://buildkite.com/ray-project/release-tests-branch/builds/1318#0185fc29-1d4c-483a-999b-ede500781c7a Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
(commit bae61d9)
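A minimal sketch of the alerting rule described above; the error-code names are illustrative, not the release tooling's actual constants. An empty fetched result is surfaced as an infrastructure error rather than an application failure:

```python
INFRA_ERROR, APP_ERROR, OK = "infra_error", "app_error", "ok"


def handle_result(fetched_result: dict) -> str:
    """Classify a release-test run from its fetched result dict."""
    if not fetched_result:
        # Nothing came back, e.g. the cluster's job server was down;
        # we can't judge the application, so report an infra error.
        return INFRA_ERROR
    return OK if fetched_result.get("passed") else APP_ERROR
```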
[RLlib] Move minibatching into RLTrainer instead of TrainerRunner (#32262)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
(commit 585f8aa)
[RLlib] Support empty leafs with NestedDict (#32136)
Add test cases and make NestedDict also support empty elements. Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
(commit 59c62e4)
[RLlib] Forward fix for failing PPO Torch RLTrainer test (#32308)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
(commit b85eb52)
[Doc] Add tips for writing fault-tolerant Ray applications (#32191)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
(commit d256508)
[Telemetry] track num tasks created (#32106)
Tracks the total number of tasks created by leveraging the gcs_task_manager.
(commit 56606ae)
[core] Fix the GCS memory usage high issue
It's not a leak: the root cause is that we allocate more requests at startup. This PR fixes it by making the number of calls constant.
(commit cf1bc83)
[telemetry] remove extra print (#32322)
Removes a debugging message accidentally merged in #32106.
(commit cb5129c)
(commit 468e606)
[AIR] Add `TorchDetectionPredictor` (#32199)
TorchPredictor doesn't work with TorchVision detection models because they return List[Dict[str, torch.Tensor]] instead of torch.Tensor. This PR adds a TorchDetectionPredictor so users don't have to extend TorchPredictor themselves. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
(commit 53260af)
[RLlib] Make one hidden layer config possible for TorchMLP (#32310)
Make a single hidden layer possible; move setting of output dims to setup(). Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
(commit 0466bd3)
[data] [streaming] No preserve order by default (#32300)
Signed-off-by: Eric Liang ekhliang@gmail.com Preserving order decreases performance, so it is now off by default.
(commit f05eeb4)
[core] Fix comments and a corner case in #32302 (#32323)
Handles a corner case where the buffer could be 0, and fixes a comment from the previous PR.
(commit 3bb73d3)
[Serve][Doc] Refactor the Ray Serve API doc (#32307)
- Add an index page listing all the APIs (https://ray--32307.org.readthedocs.build/en/32307/serve/api/index.html).
- With this change, searching for a specific Python API such as `ray.serve.run` shows the core API page in the results. (Previously users couldn't get the correct search result because all APIs were on one page.)
(commit 22bc1e9)
Commits on Feb 9, 2023
[RLlib] Modifications to gpu resource logic in rl_trainer (#32149)
- Add support for GPU with local mode for tf trainers
- Remove `_make_distributed_module`
- Add support for `local_gpu_id`, the id of the GPU to use during local-mode training with GPU
- Refactor tf function tracing logic to include the call to strategy.run
- Change tf function logic to prevent unnecessary retracing
- Add a warning not to do GPU or distributed training in tf without turning on eager tracing
Signed-off-by: avnish <avnish@anyscale.com>
(commit b73f3eb)
[Doc] add job overview diagram (#32050)
This diagram was previously only on the key concepts page. However, searching for Ray jobs usually lands on the job overview page, where the diagram couldn't be found. It is very helpful to people who need an overview of Ray jobs, which is what that page is intended for.
(commit 6cfb541)
(commit b011d56)
[core] Improving failure message when ray processes fail to start on new node (#32303)
We have a release test named long_running_node_failures which intermittently fails because a node failed to start up. I couldn't debug it despite having all of the Ray logs. This PR adds a bit more information (the node socket that should have started up) in the hopes that this enables us to identify the issue next time it happens. Failure in long_running_node_failures: #32180
(commit 63d922b)
[release] update if xgboost test suite requires result or not (#32340)
Change to the default alert and remove tests from xgboost_tests alerts; intermediate formatting changes in this commit were reverted. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
(commit 5c1c888)
(commit 5f0f95a)
(commit 67d1515)
(commit b2e7699)
[autoscaler][observability] Better memory formatting (#32337)
This PR updates the memory formatting to show usage and total in independent, friendly units. This should make it easier to tell when a small amount of memory is in use that would otherwise be rounded to 0, which is often confusing for downscaling. Example of the updated output:

```
======== Autoscaler status: 2020-12-28 01:02:03 ========
Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Usage:
 0/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00GiB/8.00GiB memory
 0B/16.00GiB object_store_memory

Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()
```

A second example in the PR is identical except that it shows `3.14GiB/16.00GiB object_store_memory`. Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Alex <alex@anyscale.com>
(commit d653f73)
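A minimal sketch of the unit formatting described in #32337, assuming a simple largest-binary-unit rule; the autoscaler's real helper may differ in edge cases:

```python
def format_mem(num_bytes: float) -> str:
    """Render a byte count in its largest binary unit, e.g. '2.00GiB'."""
    if num_bytes < 1024:
        return f"{int(num_bytes)}B"
    for unit in ("KiB", "MiB", "GiB", "TiB"):
        num_bytes /= 1024
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.2f}{unit}"
```

A usage line like `2.00GiB/8.00GiB memory` then comes from formatting usage and total independently: `f"{format_mem(used)}/{format_mem(total)} memory"`.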
[core] Add opt-in flag for Windows and OSX clusters, update `ray start` output to match docs (#31166)
This PR cleans up a few usability issues around Ray clusters:
- Makes some cleanups to the ray start log output to match the new documentation on Ray clusters; mainly, de-emphasize Ray Client and recommend jobs instead.
- Adds an opt-in flag for enabling multi-node clusters on OSX and Windows. Previously, it was possible to start a multi-node cluster, but any Ray programs would then fail mysteriously after connecting to it. Now the user is warned with an error message if the opt-in flag is not set.
- Documents multi-node support for OSX and Windows.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
(commit 90f8511)
[data] [streaming] Implement locality-aware actor task assignment (#32278)
This implements a very simple version of locality-aware task assignment. The locality assignment problem is complex, but here we start by preferentially assigning tasks to actors when the first block of the bundle is local, and we record perf metrics on the locality hit/miss rate. The feature is flag-protected (on by default). Actor locality on:
```
MapBatches(Model): 0 active, 0 queued, 0 actors [987 locality hits, 13 misses]: 100%|█████████| 1000/1000 [01:01<00:00, 16.28it/s]
Average throughput 16.072036005250155 GiB/s
```
Actor locality off:
```
MapBatches(Model): 0 active, 0 queued, 0 actors [locality off]: 100%|███████████████████████████| 1000/1000 [03:01<00:00, 5.50it/s]
Average throughput 5.471759229068149 GiB/s
```
(commit 0e56dff)
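A toy version of the heuristic, with plain dicts standing in for the scheduler's state: prefer an actor whose node already holds the bundle's first block, fall back to any actor otherwise, and count hits and misses the way the progress bar reports them.

```python
def pick_actor(first_block_node: str, actors: dict, stats: dict) -> str:
    """actors: mapping actor_id -> node_id. Returns the chosen actor_id.

    Prefers an actor colocated with the bundle's first block and records
    a locality hit; otherwise records a miss and picks any actor.
    """
    for actor_id, node_id in actors.items():
        if node_id == first_block_node:
            stats["locality_hits"] += 1
            return actor_id
    stats["locality_misses"] += 1
    return next(iter(actors))
```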
[RLlib] Remove leela chess from release tests (#32325)
- Temporary fix to the leela chess example
- Remove leela chess from the release test framework; move it to tuned examples
Signed-off-by: avnish <avnish@anyscale.com>
(commit f80badc)
[core][state] Task backend improve performance (#32251)
Signed-off-by: rickyyx <rickyx@anyscale.com> This PR improves performance of the task backend with 3 changes:
- Delay protobuf conversion. The conversion, especially from TaskSpecification to the TaskInfoEntry needed for task metadata, has been slow and was in the critical path of task execution and submission. This PR delays generation of rpc::TaskEvents until just before sending in the flush thread; during task execution, it simply generates an in-memory TaskEvent entry with lower overhead.
- Fix the circular buffer used as the underlying data structure for buffered events. This prevents constant resizing when the buffer fills up or is flushed, which is costly.
- Adjust the niceness of the flushing thread so it has a lower priority than the worker thread.
(commit 69a14e7)
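The buffering change can be sketched with a bounded ring buffer: at capacity the oldest events are dropped rather than the buffer resized, and a flush drains it without reallocating. This is a conceptual model in Python, not the C++ implementation:

```python
from collections import deque


class EventBuffer:
    """Fixed-capacity ring buffer for task events (illustrative)."""

    def __init__(self, capacity: int):
        self._buf = deque(maxlen=capacity)  # deque evicts oldest at capacity
        self.dropped = 0

    def add(self, event) -> None:
        if len(self._buf) == self._buf.maxlen:
            self.dropped += 1  # oldest event is about to be evicted
        self._buf.append(event)

    def flush(self) -> list:
        out = list(self._buf)
        self._buf.clear()
        return out
```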
[docs] Fix wording of Many model training guidance (#32319)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Cade Daniel <cade@anyscale.com>
(commit 8bf1d03)
[core] Fix gRPC callback API destruction issues (#32151)
For the gRPC callback API, the lifecycle differs between server and client. The server has to call Finish for the call to be considered dead by gRPC, and this can only be called once. The client destructs itself when it receives the signal from the server or the connection is broken for some reason. There are two issues in the ray syncer:
- The server might call Finish twice because it has OnWriteDone/OnReadDone. The fix: when an error happens we call Finish and guarantee that it's only called once.
- The client might destruct itself because nothing on the client side controlled that. The fix: add AddHole/RemoveHole in the code to explicitly control it, just like the server side.
Testing is tricky, but this can be caught by nightly tests.
(commit fc81af1)
[Doc] Move actor checkpointing to actor fault tolerance page (#32153)
The actor fault tolerance page is a better place for actor checkpointing. Also makes the code example testable. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
(commit 741b7a0)
[Core/Observability] Fix the timeline bugs (#32287)
Signed-off-by: SangBin Cho <rkooo567@gmail.com> There are 2 issues:
- The duration should be recorded in microseconds; it was mistakenly recorded as 10x microseconds, making durations incorrect.
- The metadata event should be recorded only once; it was mistakenly recorded for every task, which blows up the timeline file size.
This PR fixes both issues and adds relevant tests. It also adds a dataclass for chrome tracing events for better schema tracking.
(commit 188c411)
[core][state] Task Backend - reduce lock contention on debug stats / metric recording on counters (#32355)
Signed-off-by: rickyyx <rickyx@anyscale.com> When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously had mutexes protecting some of the counter states, so the main io service/thread would block while acquiring locks to print debug state, record metrics, and add telemetry data:
```
Global stats: 196276 total (5 active)
Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s
Execution time: mean = 295.864 us, total = 58.071 s
Event stats:
....
GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s
GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s
....
```
This PR introduces a thread-safe wrapper on CounterMap so that modifying and reading the various debug counters has minimal lock contention. It also merges the count-by-task-type telemetry into the counter map, so locks no longer need to be acquired in various places. With counter access now thread-safe, the mutex locks on GcsTaskManagerStorage could also be removed, since it is only accessed from its dedicated io thread.
(commit 2bbe8c1)
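The thread-safe wrapper idea, modeled in Python with a single lock guarding each increment and each snapshot; the real wrapper is C++ and may use finer-grained synchronization:

```python
import threading
from collections import Counter


class ThreadSafeCounterMap:
    """Counter map safe to update from one thread and read from another."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def increment(self, key, n: int = 1) -> None:
        with self._lock:
            self._counts[key] += n

    def snapshot(self) -> dict:
        # Copy under the lock so readers (debug stats, metrics) never
        # observe a half-updated state.
        with self._lock:
            return dict(self._counts)
```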
Commits on Feb 10, 2023
[Data] Add rule for `ReorderRandomizeBlockOrder` (#32254)
Ports the previous rule that moves RandomizeBlockOrder to the end of a DAG into the new execution backend as an optimizer rule. Closes #31894 Signed-off-by: amogkam <amogkamsetty@yahoo.com>
(commit b4ad23a)
[AIR] Automatically move `DatasetIterator` torch tensors to correct device (#31753)
When DatasetIterator is used with Ray Train, the torch tensors returned by iter_torch_batches are automatically moved to the correct device. Signed-off-by: amogkam <amogkamsetty@yahoo.com>
(commit 4420120)
[air/execution] Event manager part 2: Implementation (#31811)
This implements the abstractions introduced in #31236. Changes:
- Move to a static callback definition to better match other existing APIs
- Split the RayEventManager into a RayActorManager (for actors) and a RayEventManager (for futures)
- Instead of awaiting an arbitrary number of results, provide a `next()` method to await exactly one event, as this is the only thing needed for Train/Tune
- Simplify the APIs and reduce the number of concepts
This PR comes with two end-to-end example flows for Ray Train- and Ray Tune-like flows. Signed-off-by: Kai Fricke <kai@anyscale.com>
(commit 492ff7e)
[RLlib] Async trainer manager (#32282)
Implements an asynchronous update function, along with a small test showing that it converges to the same results as the synchronous update. Signed-off-by: avnish <avnish@anyscale.com>
(commit c9cf2ef)
(commit d807ce0)
[core][oom] Use retriable lifo policy for dask 3x nightly test (#32361)
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com> The 3x nightly dask test is failing due to enabling of the group-by-owner oom killer policy; this switches the test to use the previous policy.
(commit 73b52e0)
[Train] Fix `use_gpu` with `HuggingFacePredictor` (#32333)
HuggingFacePredictor's use_gpu was set in the wrong method, so it didn't work correctly. This PR fixes that. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
(commit a1938c3)
[RLlib] Clean up RLModule (#32328)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
(commit 841a4fb)
[RLlib] Cleanup RLTrainer (#32345)
The commit message repeats the gpu resource logic changes of #32149:
- Add support for GPU with local mode for tf trainers
- Remove `_make_distributed_module`
- Add support for `local_gpu_id`, the id of the GPU to use during local-mode training with GPU
- Refactor tf function tracing logic to include the call to strategy.run
- Change tf function logic to prevent unnecessary retracing
- Add a warning not to do GPU or distributed training in tf without turning on eager tracing
Signed-off-by: avnish <avnish@anyscale.com>
(commit 60fa8fe)
[Bug Fix][Object Store] race condition: Pull Manager will hang in certain timings (#31464)
Restore fails if the object is still being created, so in certain timings the pull will hang.
(commit 9cbf406)
[Tune] Improve logging, unify trial retry logic, improve trial restore retry test (#32242)
Improves logging, unifies the trial requeue logic, and improves the trial restore retry test (with follow-up fixes to the unit tests and lint). Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
(commit d9a17f2)
[Job API] Handle multiple drivers with same job submission id in GCS GetAllJobInfo endpoint (#32388)
The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen. This PR updates the endpoint to handle the above possibility and adds a unit test which fails without the fix. Closes #32213
(commit 35e106a)
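The fixed aggregation can be sketched as grouping rather than assuming a one-to-one mapping; several drivers (e.g. several ray.init() calls in one entrypoint) may share a submission_id. The tuples below are illustrative stand-ins for the GCS job table rows:

```python
from collections import defaultdict


def group_jobs_by_submission(jobs) -> dict:
    """jobs: iterable of (job_id, submission_id); submission_id may repeat.

    Returns submission_id -> list of driver job_ids, instead of
    RAY_CHECK-ing that the counts match one-to-one.
    """
    grouped = defaultdict(list)
    for job_id, submission_id in jobs:
        grouped[submission_id].append(job_id)
    return dict(grouped)
```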
[Datasets] Not change `map_batches()` UDF name in `Dataset.__repr__` (#…
(commit d8639ab)
[Metrics] Fix flaky test_task_metrics + fix slow report issue from unit tests (#32342)
Every X seconds, when we record metrics, we check all pending updates from counter_map; if there are pending updates, we invoke the registered callbacks for the relevant updates, which record metrics. Currently we have 3 counter_maps: a regular one (containing all data) plus get and wait counter_maps. For the get and wait counter_maps, although there are updates, we don't register callbacks (by design); they are used to calculate correct RUNNING / GET / WAIT counts. So normally this is what happens:
1. A task enters the RUNNING state; the regular counter_map is updated and a callback is added.
2. Get is called and the get counter_map is updated; no callback is registered (by design).
If metrics are recorded after 2, the callback from the regular counter_map is invoked and we record correct metrics. If metrics are recorded after 1, the RUNNING state is recorded, but since there are no callbacks for the get counter_map, the relevant updates are not recorded when the next metrics are recorded. The flakiness comes from this latter case. The fix is a "no-op update" to the regular counter_map (e.g., Increment(0)), which triggers the counter_map to invoke a callback again and correctly update the get & wait status. The code could also be refactored to not use the get & wait counter_maps, but this approach is much easier. This PR also fixes the slow stats report issue.
(commit b7e671d)
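A toy model of the no-op-update trick (not Ray's actual CounterMap class): incrementing by 0 still marks the map dirty, so its callback fires on the next flush and picks up the GET/WAIT change that has no callback of its own.

```python
class CounterMap:
    """Minimal counter with a pending-updates flag and one callback."""

    def __init__(self, callback=None):
        self.value = 0
        self.dirty = False
        self.callback = callback

    def increment(self, n: int) -> None:
        self.value += n
        self.dirty = True  # even n == 0 marks pending updates

    def flush(self, sink: list) -> None:
        # Invoked periodically by the metrics recorder.
        if self.dirty and self.callback:
            self.callback(self.value, sink)
        self.dirty = False
```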
[core][state] State API scale losing data (#32408)
We were dropping data at the default limit of 10K; this increases the buffer size while we figure out a way to store bursty task submissions.
(commit db9cfa6)
(commit 613f4b0)
[AIR] Allow users to pass `Callable[[torch.Tensor], torch.Tensor]` to `TorchVisionTransform` (#32383)
Transforms like RandomHorizontalFlip expect Torch tensors as input, but if you're applying the transform per-epoch, then you can't use ToTensor. To fix the problem, this PR updates TorchVisionPreprocessor to convert ndarray inputs to Torch tensors. You can't use ToTensor for the conversion because then you'd be applying ToTensor twice, and your images would get scaled incorrectly. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
(commit faeb2cc)
Add triage label to enhancement and doc issues as well (#32352)
- Add triage label to enhancement and doc issues as well
- Don't auto-close issues pending triage
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
(commit 299d8f0)
[docs] removing docs referring ray client (#32209)
Deprecates Ray Client related docs.
(commit 6879184)
[Doc] Document the top-k default scheduling strategy (#32331)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
(commit 16a7683)
[Datasets] Update Ray Data documentation for lazy execution by defaul…
(commit 08a8c65)
[ci][core] Do not set flushing thread niceness for task backend (#32439)
We believe this has minimal impact on performance, so we're reverting the unnecessary code. Signed-off-by: rickyyx <rickyx@anyscale.com>
(commit bc2de90)
[Datasets] [Docs] Update docs to reflect lazy-by-default execution model. (#32387)
This PR updates the docs for a portion of the feature guides, the FAQ, the examples, and the docstrings for the Dataset, GroupedDataset, and read APIs, to reflect the new lazy-by-default execution semantics.
Commit: ed640b6 -
Use retriable_lifo policy for shuffle 1tb nightly test (#32417)
Fixes release blocker issue #32203. Ran 6 times and all runs passed. Signed-off-by: jianoaix <iamjianxiao@gmail.com>
Commit: dade595 -
Commit: 2874e47 -
[Autoscaler] Make ~/.bashrc optional in autoscaler commands (#32393)
At the moment, autoscaler commands fail (and head node setup fails) if the user doesn't have a .bashrc. This seems like an unnecessary requirement for startup. There's also a completely pointless `true &&`, which looks like an artifact from someone's refactor.
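The usual fix for this class of problem is to guard the sourcing step so a missing rc file is not an error. A generic sketch (not the exact command the autoscaler emits; `maybe_source_rc` is illustrative):

```shell
# Guarded sourcing: succeed whether or not the rc file exists.
maybe_source_rc() {
    if [ -f "$1/.bashrc" ]; then
        . "$1/.bashrc"
    fi
    return 0
}

demo=$(mktemp -d)                  # an empty dir: no .bashrc present
maybe_source_rc "$demo" && echo "startup ok"
```

The `return 0` matters: without it, the function's exit status would be that of the last `[ -f … ]` test, which fails when the file is absent.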
Commit: 37086a5 -
[core] Force kill worker whose job has exited (#32217)
## Why are these changes needed? The worker currently leaks when the task references some global import like tensorflow. A couple of issues led to this bug:
- when the worker finishes executing, it does not clean up all its borrowed references
- the reference counting code treats a borrowed reference as something the worker owns
- if the worker thinks it owns references, it will not exit
- the worker pool will not force-exit an idle worker, even if the job is dead, if the worker refuses to exit due to the aforementioned object ownership
This PR implements the logic in the worker pool to force kill an idle worker whose job has exited.
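The decision the worker pool makes can be sketched roughly as follows (the `Worker` class and `should_force_kill` are illustrative names, not Ray's internals):

```python
from dataclasses import dataclass

@dataclass
class Worker:
    idle: bool
    job_alive: bool
    owns_objects: bool  # may be a stale belief caused by leaked borrowed refs

def should_force_kill(w: Worker) -> bool:
    # Old behavior: a worker claiming object ownership was never killed,
    # so leaked borrowed references kept dead-job workers alive forever.
    # New behavior: an idle worker whose job has exited is killed regardless.
    if not w.idle:
        return False
    if not w.job_alive:
        return True  # force kill even if the worker thinks it owns objects
    return not w.owns_objects  # normal idle-cleanup path

leaked = Worker(idle=True, job_alive=False, owns_objects=True)
print(should_force_kill(leaked))  # True under the new policy
```

The key change is that the job-liveness check now takes precedence over the (possibly incorrect) ownership claim.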
Commit: 704fd4a -
[Datasets] Make ray.data.from_* APIs lazy. (#32390)
This PR makes the ray.data.from_*() APIs lazy.
Commit: 9a04119
Commits on Feb 11, 2023
-
Fix doc test for dataset.py (#32458)
Signed-off-by: Cheng Su <scnju13@gmail.com>
Commit: b3b0336 -
[RLlib] Shared encoder MARL unittest and example (#32460)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Commit: 80e982b
Commits on Feb 13, 2023
-
Commit: 4c52789 -
[RLlib] Add sample timer to all algorithms' training_step() methods (where it's simple to add). (#32475)
Commit: cacc982 -
[ActorInit] Fix Bug in Actor creation (#32277)
In #28149 RayActorError is called with a str as cause, but this is not an accepted type. This leads to hitting the assertion error in the else case: assert isinstance(cause, ActorDiedErrorContext) on L283.
Commit: 2e9b834 -
Fix typo in README.md (#32466)
Signed-off-by: Pratik <pratikrajput1199@gmail.com>
Commit: 997e95e -
[RLlib] Added test version of BC algorithm based on RLModules and RLTrainers (#32471)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Commit: 4ffa7fd -
[tune] Move experiment state/checkpoint/resume management into a separate file (#32457)
Experiment state management is currently convoluted. We keep track of many duplicate variables, e.g. local/remote checkpoint dirs and syncers. The resume/syncing logic also takes up a lot of space in the trial runner. Saving and restoring experiment state is orthogonal to the actual trial lifecycle logic, thus it makes sense to separate this out. In the same go, I've removed a lot of duplicated state and simplified some APIs that will also make it easier to test the experiment state component separately. Signed-off-by: Kai Fricke <kai@anyscale.com>
Commit: 7e662dd -
[Jobs] Improve error message in case of 404 (#31120)
An identical error message is returned in multiple cases when something goes wrong while pinging the api/version endpoint. This PR adds more information to the error message in the case where the endpoint returns 404, in order to help with debugging.
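The idea can be sketched as follows (the function name and the exact wording are illustrative, not the actual Job SDK code):

```python
def version_check_error(status: int, url: str) -> str:
    # Generic failure message, shared by all error cases.
    msg = f"Request to {url} failed with status {status}."
    if status == 404:
        # Add targeted guidance only for the 404 case, to aid debugging.
        msg += (
            " The 404 suggests the address points at a server that is not"
            " the Ray dashboard; check the host, port, and any proxy path"
            " prefix."
        )
    return msg

print(version_check_error(404, "http://localhost:8265/api/version"))
```

Branching on the status code lets the common path stay terse while the ambiguous 404 case gets actionable detail.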
Commit: 6de3cbe -
[Datasets] Track bundles object store utilization as soon as they're added to an operator (#32482)
This PR ensures that the object store utilization for a bundle is still tracked when it's queued internally by an operator, e.g. MapOperator queueing bundles for the sake of bundling up to a minimum bundle size, or due to workers not yet being ready for dispatch.
Commit: 80f2161 -
[tune/train] clean up tune/train result output (#32234)
* [tune/train] remove duplicated keys in tune/train results.
* timestamp
* result_timestamp defaults to None
* fix test
* fix progress_reporter test.
* .get(, None)
* fix test
* fix test_gpu
* WORKER_
--------- Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Commit: e71c63f -
[ci][core] Calculate actor creation time properly for stress_test_many_tasks (#32438)
We are calculating actor creation task submission time, which is less useful for this test. Signed-off-by: rickyyx <rickyx@anyscale.com>
Commit: e56665e -
[tune] Structure refactor: Raise on import of old modules (#32486)
Following our tune package restructure (https://github.com/ray-project/ray/pulls?q=is%3Apr+in%3Atitle+%5Btune%2Fstructure%5D), we have now had 3 releases (2.0-2.3) in which we logged a warning. For 2.4, we should raise an error instead. For 2.5, we can remove the old files/packages. Signed-off-by: Kai Fricke <kai@anyscale.com>
Commit: 2cee078 -
[Doc] Add data ingestion clarification for AIR converting existing pytorch code example (#32058)
The example under the Ray AI Runtime/Example section directly used native PyTorch datasets for data loading. It's good to clarify that the current approach is for simplicity; the more recommended approach is to use Ray Datasets. Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MacBook-Pro.local>
Commit: 91940e3
Commits on Feb 14, 2023
-
[Datasets] Always preserve order for the BulkExecutor. (#32437)
This PR always preserves order for the bulk executor. We may revisit this in the future, at which point we'd update all of the tests that rely on order preservation.
Commit: 71dfd20 -
[Tune] Fix docstring failures (#32484)
This PR fixes the `Stopper` doctests that are erroring. Previously, it used a `tune.Trainable` as its trainable, which would error on fit since its methods are not implemented. Also, it was missing some imports. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Commit: 421b527 -
Commit: bc01288 -
[RLlib] Allow MARLModule customization from algorithm config (#32473)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Commit: a447cbb -
[tune] Fix resuming from cloud storage (+ test) (#32504)
#32457 refactored the experiment checkpoint management but introduced a bug where state is not correctly restored anymore. This was caught by a unit test error. This PR resolves the bug and makes sure the test passes. Signed-off-by: Kai Fricke <kai@anyscale.com>
Commit: efc432b -
[Doc] Restructure core API docs (#32236)
Similar to #31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Commit: 99d00ad -
Deflake test_dataset.py: split torch tests (#32487)
Some of the flakiness of test_dataset.py is due to timeouts. This splits the torch tests out of this big test file. #32067
Commit: b89457a -
Commit: f0d96c5 -
Commit: 3414797 -
[Datasets] Add logical operator for aggregate (#32462)
This PR is to add logical operator for group-by aggregate. The change includes: * `Aggregate`: the logical operator for aggregate * `generate_aggregate_fn`: the generated function for aggregate operator * `SortAggregateTaskSpec`: the task spec for doing sort-based aggregate, mostly refactored from [_GroupbyOp](https://github.com/ray-project/ray/blob/master/python/ray/data/grouped_dataset.py#L35).
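The core of a sort-based groupby aggregate (the pattern `SortAggregateTaskSpec` organizes across blocks) can be sketched on a single in-memory list; all names here are illustrative, not the Ray Data internals:

```python
from itertools import groupby

def sort_aggregate(rows, key, agg):
    # Sort by the grouping key so equal keys become adjacent,
    # then reduce each contiguous run with the aggregation function.
    rows = sorted(rows, key=lambda r: r[key])
    return {k: agg(list(g)) for k, g in groupby(rows, key=lambda r: r[key])}

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 5}, {"k": "a", "v": 3}]
result = sort_aggregate(rows, "k", lambda g: sum(r["v"] for r in g))
print(result)  # {'a': 4, 'b': 5}
```

Sorting first is what makes the aggregation a single linear pass; the distributed version additionally range-partitions the sorted blocks so each reducer sees complete key runs.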
Commit: 66c0533 -
[tune] Fix two tests after structure refactor deprecation (#32517)
#32486 introduced two test failures after hard-deprecating the structure refactor. This PR fixes these two stale imports. Signed-off-by: Kai Fricke <coding@kaifricke.com>
Commit: d092b12 -
Commit: d87d86f -
Fix autosummary to show docstring of class members (#32520)
By default, autosummary only shows one line for each class member instead of the entire docstring. Ideally the fix would be to autosummary class members as well, but that generates too many doc pages and causes doc build timeouts. For now, default to showing the docstrings of class members on the class pages, with an explicit opt-in to autosummary class members. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Commit: 19ca00b -
[core] Add opt-in flag for Windows and OSX clusters, update ray start…
… output to match docs (#32409) Un-revert #31166. This PR cleans up a few usability issues around Ray clusters: - Makes some cleanups to the ray start log output to match the new documentation on Ray clusters. Mainly, de-emphasize Ray Client and recommend jobs instead. - Add an opt-in flag for enabling multi-node clusters for OSX and Windows. Previously, it was possible to start a multi-node cluster, but then any Ray programs would fail mysteriously after connecting to the cluster. Now, it will warn the user with an error message if the opt-in flag is not set. - Document multi-node support for OSX and Windows. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Commit: bf5e721 -
[Data] Update DatasetPipeline.to_tf API to match with Dataset.to_tf (#32531)
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Commit: 9dcb369 -
Commit: b12c0d1 -
[Tune] Update trainable remote_checkpoint_dir upon actor reuse (#32420)
Commit: e8f1cf6 -
Commit: b9f7e19