upstream changes #6

Merged
merged 267 commits into from
Feb 14, 2023
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Jan 27, 2023

  1. Add informative progress bar names to map_batches (#31526)

    Signed-off-by: pdmurray <peynmurray@gmail.com>
    peytondmurray authored Jan 27, 2023
    Commit 3343c76
  2. Enable Log Rotation on Serve (#31844)

    This PR adds log rotation for Ray Serve, letting it inherit rotation parameters (max_bytes, backup_count) from Ray Core. This brings a more consistent logging experience to Ray, as opposed to having the serve/ folder grow forever while the other logs rotate.
    andreapiso authored Jan 27, 2023
    Commit 7b2299b
  3. [core][state] Handle driver tasks (#31832)

    This PR adds information to the driver task event, namely the driver task type and its running/finished timestamps. This allows users (i.e. the dashboard) to inspect driver tasks more easily.
    This PR also exposes the exclude_driver flag to the state API, allowing requests through https and ListApiOptions to get driver tasks, while the state API's default behaviour still excludes them.
    This PR also filters out any tasks without task_info to prevent missing-data issues.
    rickyyx authored Jan 27, 2023
    Commit ed72ca8
  4. [serve] Add exponential backoff when retrying replicas (#31436)

    If a deployment is repeatedly failing, perform exponential backoff so that the replica is not restarted over and over at a very fast rate.
    
    Related issue number
    Closes #31121
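
A minimal sketch of the exponential backoff idea described above. This is not the Serve implementation: the function name, base delay, growth factor, and cap are all assumptions chosen for illustration.

```python
def backoff_delays(base_s=1.0, factor=2.0, max_s=64.0, attempts=8):
    """Yield the wait (in seconds) before each consecutive restart attempt.

    All parameter names and defaults are hypothetical, not Serve's values.
    """
    delay = base_s
    for _ in range(attempts):
        # Cap the delay so a long failure streak does not wait forever.
        yield min(delay, max_s)
        delay *= factor


delays = list(backoff_delays())
print(delays)
```

Each failed restart doubles the wait until the cap is reached, so a persistently failing replica is retried less and less often.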
    zcin authored Jan 27, 2023
    Commit 3f1a880
  5. [RLlib] Fixed the autorom dependency issue (#31933)

    Co-authored-by: Cade Daniel <edacih@gmail.com>
    Closes #31880
    kouroshHakha authored Jan 27, 2023
    Commit 76d7467
  6. Polish the Dashboard new IA part 2 (#31946)

    Adds back the metrics page
    Adds button to visit new dashboard and to go back
    Adds buttons for leaving feedback and viewing docs
    Add color to status badges of tasks and placement groups table
    Add alert when grafana is not running
    Fix copy button icon
    Separate metrics page into sections (both new IA and old IA)
    alanwguo authored Jan 27, 2023
    Commit 15af485
  7. [Tune] Clarify which RunConfig is used when there are multiple plac…

    …es to specify it (#31959)
    
    This PR clarifies where RunConfig can be specified. When multiple configs are specified in different locations (in the Tuner and Trainer), this PR also logs information about which RunConfig is actually used.
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    justinvyu authored Jan 27, 2023
    Commit eab29ca
  8. [docs] Fix linkcheck error and map batches docstring test (#31996)

    581cd4e moved some test files, breaking a link from the documentation. cc @iycheng
    
    3343c76 changed the MapBatches string representation, breaking a docstring test. cc @peytondmurray
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Jan 27, 2023
    Commit 1b20ae9
  9. [Datasets] [Autoscaling Actor Pool - 1/2] Refactor MapOperator, exe…

    …cution state, and task submitters. (#31986)
    clarkzinzow authored Jan 27, 2023
    Commit 02ca4c9
  10. [data] [streaming] [12/n]--- Improve output backpressure reporting an…

    …d management (#31979)
    
    Before this PR, stalls in the consumer thread would fully block the control loop. This provides backpressure, but at the cost of performance.
    
    This PR fully decouples the consumer thread from the control loop thread, allowing execution to proceed so long as there is sufficient object_store_memory budget remaining. It also adds a progress bar for the output queue, showing the number of output bundles consumed and the number of queued bundles for output.
    ericl authored Jan 27, 2023
    Commit ffbd87a
  11. [tune] Fix tune_cloud_* tests fow new Trial constructor arguments (#3…

    …2010)
    
    #31669 changed the `Trial.__dict__` by moving `local_dir` to `_local_dir`, which resulted in an error in our tune cloud tests. This PR updates the signature of the `TrialStub` class to resolve the issue.
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Jan 27, 2023
    Commit 25a7df6
  12. [core] remove legacy memory monitor from task submission codepath (#3…

    …1993)
    
    Remove legacy memory monitor from worker submission code path, as that was already disabled by default in Ray 2.2
    clarng authored Jan 27, 2023
    Commit e64b44b
  13. [docs] Revamp Ray core fault tolerance guide (#27573)

    The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections?
    
    _Originally posted by @ericl in #27573 (review)
    
    Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
    Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
    3 people authored Jan 27, 2023
    Commit 2a7dd31
  14. [Serve] [release test] Add max_retries and max_restarts (#32011)

    The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes.
    
    Related issue number
    Addresses #31741
    shrekris-anyscale authored Jan 27, 2023
    Commit dd36360

Commits on Jan 28, 2023

  1. [core][state] Adjust worker side reporting with batches && add debugs…

    …tring (#31840)
    
    Signed-off-by: rickyyx <rickyx@anyscale.com>
    
    This PR introduces a flag RAY_task_events_send_batch_size that controls the number of task events sent to GCS in a batch. With the default setting, each core worker will send 10K task events per second to GCS, where GCS could handle 10K task events in ~50 milliseconds.
    
    This PR also adjusts the worker-side buffer limit to 1M with the new batching setting.
    
    The PR adds some debug information as well.
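
The interaction between a send batch size and a bounded worker-side buffer can be sketched as follows. This is an illustration only, not the core worker's code: the class name, the drop-oldest overflow policy, and the tiny limits are assumptions.

```python
from collections import deque


class TaskEventBuffer:
    """Hypothetical bounded buffer that hands out events in fixed-size batches."""

    def __init__(self, send_batch_size, max_buffered):
        self.send_batch_size = send_batch_size
        self.max_buffered = max_buffered
        self._events = deque()
        self.dropped = 0

    def add(self, event):
        # Assumption: when the buffer is full, the oldest event is dropped.
        if len(self._events) >= self.max_buffered:
            self._events.popleft()
            self.dropped += 1
        self._events.append(event)

    def next_batch(self):
        # Send at most send_batch_size events per flush.
        n = min(self.send_batch_size, len(self._events))
        return [self._events.popleft() for _ in range(n)]


buf = TaskEventBuffer(send_batch_size=3, max_buffered=5)
for i in range(7):
    buf.add(i)
first = buf.next_batch()
second = buf.next_batch()
print(first, second, buf.dropped)
```

A smaller batch size bounds how much work the receiver does per flush, while the buffer limit bounds worker memory when the receiver falls behind.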
    rickyyx authored Jan 28, 2023
    Commit 5d1f2e4
  2. [Dataset] Exclude breaking test case in `read_parquet_benchmark_singl…

    …e_node` release test (#31904)
    
    The release test read_parquet_benchmark_single_node fails because it runs on Python 3.7 without the pickle5 package installed. A similar issue is discussed in #26225. We found that the test failure is confined to the portion that tests a Dataset with a filter expression (the error is related to pickling with this filter expression).
    
    Therefore, we will temporarily disable this portion of the test, while keeping the rest of the release test (which I verified passes on the same cluster). We can come back to this in the future and fix the case with filter. Example of release test successfully running with the filter case removed.
    
    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee authored Jan 28, 2023
    Commit 675c6a0
  3. [Data] Add tests for remainder of map_batches operations with new opt…

    …imizer (#31985)
    
    Signed-off-by: amogkam <amogkamsetty@yahoo.com>
    
    The following operations call map_batches directly: add_column, drop_columns, select_columns, random_sample.
    
    In this PR we add e2e tests for these operations with the new optimizer. In a future PR, we should refactor so that these operations do not call into map_batches and instead have their own logical operators.
    amogkam authored Jan 28, 2023
    Commit 00416d2
  4. [ci/release] Change exponential_backoff_retry to use warn instead of …

    …info on failure (#32014)
    
    It appears the root cause of flaky failures described in #31981 is suppressed because we're not logging exceptions in `exponential_backoff_retry`.
    
    Signed-off-by: Cade Daniel <cade@anyscale.com>
    cadedaniel authored Jan 28, 2023
    Commit b5899d4
  5. Revert "[core] Fix gcs healthch manager crash when node is removed by…

    … node manager. (#31917)" (#31995)
    
    This reverts commit a32b9b1.
    krfricke authored Jan 28, 2023
    Commit 51c5eda
  6. [Datasets] [Autoscaling Actor Pool - 2/2] Add autoscaling support to …

    …`MapOperator` actor pool. (#31987)
    
    This PR adds support for autoscaling to the actor pool implementation of `MapOperator` (this PR is stacked on top of #31986).
    
    The same autoscaling policy as the legacy `ActorPoolStrategy` is maintained, with more aggressive and sensible downscaling added via:
    * If there are more idle actors than running/pending actors, scale down.
    * Once we're done submitting tasks, cancel pending actors and kill idle actors.
    
    In addition to autoscaling, `max_tasks_in_flight` capping is also implemented.
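
The two downscaling rules listed above can be sketched as a small decision function. This is a hypothetical illustration, not the `MapOperator` actor pool code: the `PoolState` type, the reading of "running/pending" as the sum of both counts, and killing one idle actor per step are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of an actor pool."""
    idle: int
    running: int
    pending: int


def downscale_actions(state, done_submitting):
    """Return (idle_actors_to_kill, pending_actors_to_cancel)."""
    if done_submitting:
        # Rule 2: no more work will arrive, so release everything unused.
        return state.idle, state.pending
    if state.idle > state.running + state.pending:
        # Rule 1: more idle than busy actors, shrink by one (assumed step size).
        return 1, 0
    return 0, 0


print(downscale_actions(PoolState(idle=3, running=1, pending=1), False))
print(downscale_actions(PoolState(idle=2, running=5, pending=3), True))
```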
    clarkzinzow authored Jan 28, 2023
    Commit 22177cb
  7. [Dashboard] Add cluster utilization graph (#31896)

    This PR adds the cluster utilization page to the landing view.
    
    Co-authored-by: Alan Guo <aguo@anyscale.com>
    rkooo567 and alanwguo authored Jan 28, 2023
    Commit ef28b5a
  8. [Datasets] Add logical operator for randomize_block_order() (#31977)

    This PR adds a logical operator for randomize_block_order(). The changes include:
    
    - Introduce AbstractAllToAll for all logical operators converted to AllToAllOperator.
    - Add the RandomizeBlocks logical operator for randomize_block_order().
    - Add _internal/planner, move the Planner logic there, and add a generated function for randomize_blocks. This can be used later to create MapOperator/AllToAllOperator.
    c21 authored Jan 28, 2023
    Commit e44a7d0
  9. Commit b58bb93
  10. [core] Add code owner to GCS module. (#32018)

    Add code owner to GCS module.
    fishbone authored Jan 28, 2023
    Commit 09f45ad
  11. Refactor block_fn out of map-like logical operators (#32021)

    Signed-off-by: Cheng Su <scnju13@gmail.com>
    c21 authored Jan 28, 2023
    Commit 8e188db
  12. [train][docs] fix doc search issues, examples gallery & filter (#31635)

    Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
    Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
    3 people authored Jan 28, 2023
    Commit cc6d30a
  13. [Dashboard] Timeline implemented by a new task backend (#31856)

    Signed-off-by: SangBin Cho <rkooo567@gmail.com>
    
    This PR implements the timeline in the Ray dashboard using the new task backend.
    
    - Implement the task events -> Chrome tracing logic. Most of the code is copied from existing code. TODO: add unit tests (we already have one, but it is a pretty weak test).
    - Create a timeline endpoint that can (1) download the JSON file (to download and upload manually) or (2) return the JSON array buffer (to load into Perfetto directly).
    - Create a subsection that has 3 features: (1) a download button, (2) an open-Perfetto button, and (3) an instruction accordion.
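
As a rough illustration of the task events -> Chrome tracing step, the sketch below turns task records into "complete" (`"ph": "X"`) events of the Chrome Trace Event Format, which Perfetto can load. The input field names (`name`, `start_ms`, `end_ms`, `node`, `worker`) are hypothetical and do not reflect Ray's actual task event schema.

```python
import json


def to_chrome_trace(task_events):
    """Convert hypothetical task records into Chrome trace 'complete' events."""
    trace = []
    for ev in task_events:
        trace.append({
            "name": ev["name"],
            "cat": "task",
            "ph": "X",                    # complete event: start time plus duration
            "ts": ev["start_ms"] * 1000,  # the trace format expects microseconds
            "dur": (ev["end_ms"] - ev["start_ms"]) * 1000,
            "pid": ev["node"],            # rendered as a process row
            "tid": ev["worker"],          # rendered as a thread row
        })
    return trace


events = [
    {"name": "f", "start_ms": 10, "end_ms": 13, "node": 1, "worker": 7},
    {"name": "g", "start_ms": 12, "end_ms": 20, "node": 1, "worker": 8},
]
trace = to_chrome_trace(events)
payload = json.dumps(trace)  # JSON array that a trace viewer can open
print(payload)
```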
    rkooo567 authored Jan 28, 2023
    Commit f9fa0b2
  14. [RLlib] Separate PPO torch regression test, and make it longer (#31892)

    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    ArturNiederfahrenhorst authored Jan 28, 2023
    Commit 20bfcdd
  15. Revert "[core][state] Adjust worker side reporting with batches && ad…

    …d debugstring (#31840)" (#32024)
    
    This reverts commit 5d1f2e4.
    krfricke authored Jan 28, 2023
    Commit c889349
  16. Commit 80d13d1

Commits on Jan 29, 2023

  1. [Datasets] [Docs] Add seealso to map-related methods (#30579)

    This PR adds seealso notes to help users distinguish between map, flat_map, and map_batches.
    
    Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
    bveeramani authored Jan 29, 2023
    Commit 112a265
  2. [RLlib] Give more time to impala release tests (#31910)

    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    ArturNiederfahrenhorst authored Jan 29, 2023
    Commit 1929bb1
  3. [docs] remove archive link (#32030)

    Signed-off-by: Eric Liang <ekhliang@gmail.com>
    ericl authored Jan 29, 2023
    Commit 6708b31

Commits on Jan 30, 2023

  1. Fix whitespace in help message for ray cli (#31905)

    Without this patch, several of the help texts are missing whitespace. For
    example, `--dashboard-host` appears as follows:
    
      --dashboard-host TEXT           the host to bind the dashboard server to,
                                      either localhost (127.0.0.1) or 0.0.0.0
                                      (available from all interfaces). By default,
                                      thisis localhost.
    
    This patch adds the missing trailing whitespace so the words are separated.
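
This class of bug typically comes from Python's implicit concatenation of adjacent string literals, which inserts nothing between the pieces. The sketch below reproduces the effect; the exact place where the Ray source splits the string is an assumption.

```python
# Adjacent string literals are joined with no separator.
broken = (
    "(available from all interfaces). By default, this"
    "is localhost."
)

# A trailing space on the first literal restores the word boundary.
fixed = (
    "(available from all interfaces). By default, this "
    "is localhost."
)

print(broken)
print(fixed)
```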
    
    Signed-off-by: Luke Hsiao <luke.hsiao@numbersstation.ai>
    lukehsiao authored Jan 30, 2023
    Commit cce092b
  2. Commit 3fc2aac
  3. [RLlib] Reparameterize the construction of TrainerRunner and RLTraine…

    …rs (#31991)
    
    * trying out a new configuration pattern for trainer runner and rl trainers
    
    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Jan 30, 2023
    Commit d390df8
  4. Commit d26b55b
  5. Commit 56b7911
  6. [2/n] Stabilize GCS/Autoscaler interface: Drain and Kill Node API (#3…

    …2002)
    
    This PR adds a DrainAndKillNode endpoint to the monitor service. It has the exact same semantics as the GcsNodeManager::HandleDrainNode.
    
    
    ---------
    
    Co-authored-by: Alex <alex@anyscale.com>
    Alex Wu and Alex authored Jan 30, 2023
    Commit e331f6e
  7. [Core] Remove dead actor checkpoint code (#32045)

    The checkpointable actor was already removed in #10333.
    jjyao authored Jan 30, 2023
    Commit 907e968
  8. Revert "Revert "[core] Fix gcs healthch manager crash when node is re…

    …moved by node manager."" (#32019)
    
    This reverts commit 51c5eda.
    
    Reverts #31995
    
    Skip the windows test.
    
    Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
    fishbone authored Jan 30, 2023
    Commit 664c844
  9. [tune] Do not default to reuse_actors=True when mixins are used (#31999)

    Mixins don't work well with reuse_actors because the init is only called on construction. In the case of mlflow, this means that reused actors will try to overwrite state from the trials that previously ran on them. This is incorrect behavior and causes errors on the mlflow server side.
    
    Thus, we should default to not reuse actors for mixins.
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Jan 30, 2023
    Commit cc5baaa
  10. [metrics] Switch metric view to 5 min by default #32065

    Signed-off-by: Eric Liang <ekhliang@gmail.com>
    ericl authored Jan 30, 2023
    Commit 43a0d8f
  11. [data] [streaming] Fixes to autoscaling actor pool streaming op (#32023)

    Fixes:
    - Properly wire max tasks per actor to pool
    - Account for internal queue size in scheduling algorithm
    - Small improvements to progress bar UX
    ericl authored Jan 30, 2023
    Commit 96440cf
  12. [CI] Increase target time for test_result_throughput_cluster (#32062)

    #31337 has become flaky again due to a low timeout. This PR follows #31338 and increases the timeout.
    cadedaniel authored Jan 30, 2023
    Commit baac0a6
  13. [core] Add generic __ray_ready__ method to Actor classes (#31997)

    We currently have no canonical way to await actors. Users can define their own _is-ready_ methods, schedule a future, and await these, but this has to be done for every actor class separately.
    
    This does not match other patterns - e.g. we have `actor.__ray_terminate__.remote()` for actor termination and `placement_group.ready()` for placement group ready futures.
    
    This PR adds a new `__ray_ready__` magic actor method that just returns `True`. It can be used to await actors becoming ready (newly scheduled actors), and actors having processed all of their other enqueued tasks.
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Jan 30, 2023
    Commit fe729aa
  14. [Serve] Mark long_running_serve_failure test as stable (#32063)

    The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, #31945 and #32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.
    shrekris-anyscale authored Jan 30, 2023
    Commit b350f8d
  15. [core] Reduce the timeout for many nodes actor tests. (#32066)

    Reduce the timeout for the many-nodes actor test, given that a test should finish within 1h.
    This can save some cost for problematic runs.
    fishbone authored Jan 30, 2023
    Commit fb96935
  16. Fix unit test (#32084)

    alanwguo authored Jan 30, 2023
    Commit fefd5e3

Commits on Jan 31, 2023

  1. [Datasets] Remove the non-useful comment in map_batches() (#32020)

    This PR is a quick fix to remove the non-useful comment introduced in #31526, probably during debugging.
    
    Replace the comment with a meaningful one.
    c21 authored Jan 31, 2023
    Commit 34e2cd5
  2. simplify metrics pgae (#32089)

    Signed-off-by: Eric Liang <ekhliang@gmail.com>
    
    - Combine tasks and actors sections
    - Move object store memory back up to the logical section (it's one of the most useful metrics, it shouldn't be buried)
    - Improve titles
    ericl authored Jan 31, 2023
    Commit 755b56f
  3. [docs] Update top-navigation.js (#32075)

    Currently, the dropdown menu "Resources" in the Ray documentation contains a link called "Training." This link points to the [same site](https://www.anyscale.com/events) as "Events." However, we want this to direct to the repository of [technical training content](https://github.com/ray-project/ray-educational-materials).
    
    Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
    emmyscode and angelinalg authored Jan 31, 2023
    Commit f325ced
  4. [docs] deploying static ray cluster to K8S with external Redis for fa…

    …ult tolerance (#31949)
    
    This PR adds the documentation and sample config files for deploying Ray to K8S without using KubeRay. As KubeRay CRDs need cluster-scoped permissions, this PR helps those users who do not have cluster-scoped permissions to install Ray Cluster in their K8S.
    YQ-Wang authored Jan 31, 2023
    Commit dc974cb
  5. fix frontend tests after #32089 (#32097)

    Signed-off-by: Alan Guo <aguo@anyscale.com>
    alanwguo authored Jan 31, 2023
    Commit b477f4b
  6. Commit 8a0e453
  7. Commit 06197a5
  8. Advanced Progress Bar (#31750)

    This progress bar automatically shows progress by groupings.
    
    Things that belong to the same parent are all put in a group.
    If a group has multiple children with the same name, those are merged together into a virtual group.
    
    These virtual groups have different visual treatment because a virtual group should not add an additional level of nesting.
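
The grouping rule described above can be sketched as follows. This is a hypothetical model, not the dashboard's implementation (which lives in the frontend): the dict shape and the `virtual` flag are assumptions.

```python
from collections import defaultdict


def group_children(children):
    """Merge same-named children of one parent into virtual groups."""
    by_name = defaultdict(list)
    for child in children:
        by_name[child["name"]].append(child)

    groups = []
    for name, members in by_name.items():
        groups.append({
            "name": name,
            # Only names that occur more than once form a virtual group.
            "virtual": len(members) > 1,
            "count": len(members),
        })
    return groups


children = [{"name": "map"}, {"name": "map"}, {"name": "reduce"}]
groups = group_children(children)
print(groups)
```

A renderer could then skip the extra indentation level for entries where `virtual` is true, matching the distinct visual treatment mentioned above.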
    alanwguo authored Jan 31, 2023
    Commit d91d2d6
  9. [spark] Automatically shut down ray on spark cluster if user does not…

    … execute commands on databricks notebook for a long time (#31962)
    
    Databricks Runtime provides an API, dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution(), that returns the milliseconds elapsed since the last Databricks notebook code execution.
    This PR calls this API to monitor notebook activity and shuts down the Ray cluster on timeout.
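
A minimal sketch of such an idle monitor, assuming the idle-time query and the shutdown action are passed in as callables. The function name, polling interval, and injectable `sleep` are hypothetical; in a real notebook `get_idle_ms` would wrap the Databricks API named above.

```python
import time


def monitor_idle(get_idle_ms, timeout_ms, shutdown, poll_s=1.0, sleep=time.sleep):
    """Poll the idle time and call shutdown() once it reaches timeout_ms."""
    while get_idle_ms() < timeout_ms:
        sleep(poll_s)
    shutdown()


# Demo with fake readings so the sketch runs without Databricks or Ray.
readings = iter([1_000, 5_000, 12_000])
calls = []
monitor_idle(
    get_idle_ms=lambda: next(readings),
    timeout_ms=10_000,
    shutdown=lambda: calls.append("shutdown"),
    sleep=lambda seconds: calls.append("sleep"),
)
print(calls)
```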
    
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    WeichenXu123 authored Jan 31, 2023
    Commit 3a1709f
  10. [Datasets] Add support for string tensor columns in `ArrowTensorArray…

    …` and `ArrowVariableShapedTensorArray` (#31817)
    
    Add support for creating ArrowTensorArrays and ArrowVariableShapedTensorArrays with string typed columns.
    
    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee authored Jan 31, 2023
    Commit 1fdf24e
  11. [RLlib] Upgrade tf eager code to no longer use `experimental_relax_sh…

    …apes` (but `reduce_retracing` instead). (#29214)
    
    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    ArturNiederfahrenhorst authored Jan 31, 2023
    Commit 78b8c24
  12. Commit 61c411f
  13. [RLlib; docs] Change links and references in code and docs to "Farama…

    … foundation's gymnasium" (from "OpenAI gym"). (#32061)
    avnishn authored Jan 31, 2023
    Commit f2b6a6b
  14. [Datasets] Fix to pass TaskContext in generate_random_shuffle_fn() (#…

    …32101)
    
    This PR fixes master by resolving the conflict between #32080 and #32081, i.e.:
    
    - Pass TaskContext in random_shuffle.py:generate_random_shuffle_fn()
    - Add AllToAllTransformFn and rename TransformFn to MapTransformFn
    - Update the function return type in generate_map_xxx_fn().
    
    Signed-off-by: Cheng Su <scnju13@gmail.com>
    c21 authored Jan 31, 2023
    Commit b7746b2
  15. [release] minor fix to pytorch_pbt_failure test when using gpu. (#32070)

    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    xwjiang2010 authored Jan 31, 2023
    Commit 293fe2c
  16. [air] Add test for remote_storage with real hdfs backend. (#31940)

    * [air] Add test for remote_storage with real hdfs backend.
    * typo
    * typo
    * try a different syntax.
    * change `install-hdfs.sh` permission.
    * -hdfs in air tests; update ssh-keygen command; fix a few typos.
    * test_env=
    * cat hdfs_env
    * move `PATH` as well to a separate file.
    * setting env vars in test only.
    * fix import
    * fix
    * address comments.
    * nit
    * fix fixture
    * address comments
    * address comments
    
    ---------
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    xwjiang2010 authored Jan 31, 2023
    Commit 5cf61f0
  17. [RLlib] [Ray 2.3 release] Marking RLLib release tests as unstable if …

    …xfail (#32072)
    
    * Marking RLLib release tests as unstable if xfail
    cadedaniel authored Jan 31, 2023
    Commit 65d904f
  18. [Datasets] Add logical operator for repartition() (#32102)

    This PR adds a logical operator for `repartition()`. Only shuffle repartition (`repartition.py:generate_repartition_fn()`) is implemented.
    
    Non-shuffle repartition is left as TODO, as the corresponding code in [fast_repartition.py](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/fast_repartition.py) involves `BlockList`, `ExecutionPlan` and `Dataset.split()`, so it needs a deeper refactoring and code change.
    c21 authored Jan 31, 2023
    Commit 44a1398
  19. [Core] Expose Internal KV MultiGet operation (#32096)

    This PR exposes the MultiGet operation to the InternalKVInterface. The MultiGet operation is already supported in the two backends (InMemory and Redis), so this PR is just plumbing.
    
    This change is needed to support getting multiple keys from the Internal KV in a single RPC.
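
A minimal sketch of why a multi-key read saves round trips. The `InMemoryKV` class, its method names, and the `calls` counter are hypothetical and only illustrate the idea; they are not Ray's InternalKVInterface.

```python
from typing import Dict, Iterable, Optional


class InMemoryKV:
    """Toy key-value store; class and method names are hypothetical."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}
        self.calls = 0  # counts simulated round trips

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        self.calls += 1
        return self._data.get(key)

    def multi_get(self, keys: Iterable[bytes]) -> Dict[bytes, bytes]:
        # One call returns every key that exists; missing keys are omitted.
        self.calls += 1
        return {k: self._data[k] for k in keys if k in self._data}


kv = InMemoryKV()
kv.put(b"a", b"1")
kv.put(b"b", b"2")
found = kv.multi_get([b"a", b"b", b"missing"])
```

Fetching the same three keys with `get` would cost three simulated calls instead of one.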
    architkulkarni authored Jan 31, 2023
    dae13bf
  20. Revert "[Datasets] Add support for string tensor columns in `ArrowTen…

    …sorArray` and `ArrowVariableShapedTensorArray` (#31817)" (#32123)
    
    This reverts commit 1fdf24e.
    scottjlee authored Jan 31, 2023
    e3001e9
  21. [AIR] Add option for per-epoch preprocessor (#31739)

    This adds an option to the AIR DatasetConfig for a preprocessor that gets reapplied on each epoch. Currently the implementation uses DatasetPipeline to ensure that the extra preprocessing step is overlapped with training.
    
    Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
    stephanie-wang authored Jan 31, 2023
    ae167f0
  22. [observability][autoscaler] Ensure pending nodes is reset to 0 after …

    …scaling (#32085)
    
    The previous way pending_nodes was calculated was prone to race conditions. Instead, let's always publish it in the main thread along with the other metrics.
    
    Closes #31982
    
    ---------
    
    Co-authored-by: Alex <alex@anyscale.com>
    Alex Wu and Alex authored Jan 31, 2023
    7573d49
  23. [tune/execution] Update staged resources in a fixed counter for faste…

    …r lookup (#32087)
    
    In #30016 we migrated Ray Tune to use a new resource management interface. In the same PR, we simplified the resource consolidation logic. This led to a performance regression first identified in #31337.
    
    After manual profiling, the regression seems to come from `RayTrialExecutor._count_staged_resources`. We have 1000 staged trials, and this function is called on every step, executing a linear scan through all trials.
    
    This PR fixes this performance bottleneck by keeping state of the resource counter instead of dynamically recreating it every time. This is simple as we can just add/subtract the resources whenever we add/remove from the `RayTrialExecutor._staged_trials` set.
    
    Manual testing confirmed this improves the runtime of `tune_scalability_result_throughput_cluster` from ~132 seconds to ~122 seconds, bringing it back to the same level as before the refactor.
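
The bookkeeping idea above can be sketched as follows. This is an illustration under assumptions, not the code in `RayTrialExecutor`: the `StagedTrials` class, its method names, and the resource dictionaries are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


class StagedTrials:
    """Keeps a running resource total instead of rescanning all trials."""

    def __init__(self) -> None:
        self._staged: Set[str] = set()
        self._resources: Dict[str, Dict[str, float]] = {}
        self._counter: Counter = Counter()

    def add(self, trial_id: str, resources: Dict[str, float]) -> None:
        if trial_id in self._staged:
            return
        self._staged.add(trial_id)
        self._resources[trial_id] = dict(resources)
        # Cost depends on the size of `resources`, not on the number of trials.
        self._counter.update(resources)

    def remove(self, trial_id: str) -> None:
        if trial_id not in self._staged:
            return
        self._staged.remove(trial_id)
        self._counter.subtract(self._resources.pop(trial_id))

    def count(self) -> Dict[str, float]:
        # Drop zero entries so the result matches a fresh recount.
        return {k: v for k, v in self._counter.items() if v != 0}
```

Reading the total is then independent of how many trials are staged, which is the property the linear scan lacked.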
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Jan 31, 2023
    10d52f7
  24. Revert "[RLlib] Reparameterize the construction of TrainerRunner and …

    …RLTrainers (#31991)" (#32130)
    
    Reverts #31991
    
    This PR seems to have broken CI.
    
    The error is https://buildkite.com/ray-project/oss-ci-build-branch/builds/2099#01860972-e02e-47c4-8f86-8be28ea18d92/3786-3992
    AttributeError: '_TFStub' object has no attribute 'Tensor'
    architkulkarni authored Jan 31, 2023
    d15ccfc
  25. [Dashboard] Better gpu utilization (#32125)

    Instead of averaging, we should compute `sum(gpu_utilization) / sum(num_gpus)` so that the maximum percentage is capped at 100%.
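
A small sketch of that ratio. The input layout (one list of per-GPU percentages per node) and the function name are assumptions made for illustration; the dashboard's actual data model is not shown in this commit message.

```python
from typing import List


def cluster_gpu_utilization(per_gpu_percent: List[List[float]]) -> float:
    """per_gpu_percent[i] holds the utilization (0-100) of each GPU on node i."""
    total_util = sum(sum(node) for node in per_gpu_percent)
    total_gpus = sum(len(node) for node in per_gpu_percent)
    if total_gpus == 0:
        return 0.0
    # Dividing by the GPU count keeps the result within 0-100.
    return total_util / total_gpus


util = cluster_gpu_utilization([[100.0, 100.0], [50.0]])
```

Here the three GPUs contribute 250 percentage points in total, so the result is 250 / 3.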
    rkooo567 authored Jan 31, 2023
    a0b8499
  26. [core] Update the scalability envelop (#32131)

    With the recent update of the nightly tests, update the data here.
    
    In the nightly tests, we use 2k nodes (2 CPUs per node) and 20k actors, but with better nodes we can run more than 40k actors.
    
    https://buildkite.com/ray-project/release-tests-branch/builds/1321#018604d7-86a3-4fad-ac6c-803db73821d3
    fishbone authored Jan 31, 2023
    f28428e
  27. Fix docs lint for advanced progress bar (#32124)

    Signed-off-by: Alan Guo <aguo@anyscale.com>
    
    fix lint #31750
    alanwguo authored Jan 31, 2023
    b4221c9
  28. [Datasets] [Operator Fusion - 1/2] Add operator fusion to new executi…

    …on planner. (#32095)
    
    This PR adds operation fusion to the new execution planner.
    clarkzinzow authored Jan 31, 2023
    2137945
  29. [RLlib] Fix waterworld example and test (#32117)

    * Remove empty parser.add_argument() in test file
    * remove --framework=torch
    * fix BUILD
    * use training_iteration as stopping criterion
    
    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    ArturNiederfahrenhorst authored Jan 31, 2023
    12ff13d
  30. [RLlib] Error out if action_dict is empty in MultiAgentEnv. (#32129)

    * [release] minor fix to pytorch_pbt_failure test when using gpu. (#32070)
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    kouroshHakha authored Jan 31, 2023
    3b1e21f

Commits on Feb 1, 2023

  1. [CI] [Datasets] Run Datasets test suites on AIR changes (#32118)

    Datasets depends on ray.air for several key features (tensor extensions, Arrow transformations, data batch conversions), and not running the Datasets test suite in PR builds on ray.air changes has caused breaks to go undetected. This PR changes this so when files under python/ray/air change, we trigger the Datasets test suite in CI.
    
    Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>
    clarkzinzow authored Feb 1, 2023
    1454e63
  2. [runtime env] Clarify error message about where to install `smart_ope…

    …n` for remote URI (#32110)
    
    At least two users reported encountering
    
    ImportError(
        "You must `pip install smart_open` and "
        "`pip install boto3` to fetch URIs in s3 "
        "bucket. "
    and trying to fix it by specifying them in the pip field of runtime_env, which won't work because the runtime_env setup code doesn't run inside the runtime_env. This PR clarifies the error message to say that they must be preinstalled on the cluster, and adds a note to the docs.
    architkulkarni authored Feb 1, 2023
    909c220
  3. 6ec71d7
  4. be6b598
  5. [Doc] Update the doc to mention dynamic resource update is not allowe…

    …d. (#31664)
    
    Signed-off-by: SangBin Cho <rkooo567@gmail.com>
    rkooo567 authored Feb 1, 2023
    5c11090
  6. [ci] disable hdfs test for compat tests. (#32148)

    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    xwjiang2010 authored Feb 1, 2023
    13d0982
  7. [core][oom] enable group by parent policy by default (#31976)

    Why are these changes needed?
    Fail the task if it is the last task of the group, per the new (group by parent) worker killing policy
    
    Related issue number
    #32149 32078
    
    
    Co-authored-by: Clarence Ng <clarence@anyscale.com>
    clarng and clarence-wu authored Feb 1, 2023
    dff4f0a
  8. Revert "[Docker] (Kubeflow integration) Add chmod --recursive 777 /ho…

    …me/ray to Ray Dockerfile." #32026
    
    Signed-off-by: kaihsun <kaihsun@anyscale.com>
    kevin85421 authored Feb 1, 2023
    df05cd9
  9. [Core] update grpc to 1.46.6 (#32054)

    #31956
    
    Upgrade to a version of gRPC that patches GHSA-cfmr-vrgj-vqwv in Zlib.
    1.46.6 has this patch: grpc/grpc#31845
    scv119 authored Feb 1, 2023
    47bb652
  10. [Core] Join Ray Jobs API JobInfo with GCS JobTableData (#31046)

    Why are these changes needed?
    * Add a new protobuf for JobInfo from the Ray Job API.
    * Augment the existing GCS GetAllJobInfo endpoint to return this information, if available (not all GCS jobs were submitted via the Ray Job API; these jobs won't have this extra JobInfo).
    
    Related issue number
    Closes #29621
    architkulkarni authored Feb 1, 2023
    b2c5e63
  11. d74e4c4
  12. [Dashboard] Support ray status output to the dashboard job page (#32040)

    This is the initial prototype of integrating ray status to the frontend.
    
    I think we could've returned structured data from the backend, but I decided to parse the ray status output in the frontend for a quick implementation (so that we can support it from Ray 2.3).
    rkooo567 authored Feb 1, 2023
    77ac9c2
  13. [Observability] Unpin open telemetry version for tracing feature (#32120)
    
    Signed-off-by: SangBin Cho <rkooo567@gmail.com>
    
    ## Why are these changes needed?
    
    This PR unpins the version of open telemetry as it is too strict for an experimental tracing feature.  
    
    ## Related issue number
    
    Closes #32051
    
    rkooo567 authored Feb 1, 2023
    5dd1406
  14. [RLlib] Fix revert of trainer runner (#32146)

    * Revert "Revert "[RLlib] Reparameterize the construction of TrainerRunner and RLTrainers (#31991)" (#32130)"
    
    This reverts commit d15ccfc.
    
    * added bool evaluation to the tf stub so that `if tf` evaluates to false
    
    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 1, 2023
    fb1e0b0
  15. [Core] Pick node from top k by default. (#31868)

    This PR takes over #28179
    
    Why are these changes needed?
    Today, with the default scheduling policy, Ray tries to pack tasks onto nodes until resource utilization exceeds a certain threshold, and spreads tasks afterwards.
    This slows down scheduling for embarrassingly parallel jobs:
    
    * we only move on to another node once the current node's resources are sufficiently utilized;
    * for each node, the overhead of accepting a new job and starting new workers is not negligible;
    * the overall scheduling speed doesn't scale with the number of nodes.
    
    This PR is one proposal to address the problem: instead of sticking to one node, we randomly choose one node from the top-k nodes for default scheduling, where nodes are sorted by their resource utilization in reverse order.
    
    Intuitively, this allows us to kick off worker startup on multiple nodes in parallel with scheduling.
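
The top-k idea can be sketched as below. This is an illustration only: `pick_node`, its parameters, and the utilization dictionary are hypothetical, and because the text does not pin down what "reverse order" means, the sort direction is exposed as a flag rather than assumed.

```python
import random
from typing import Dict, Optional


def pick_node(
    utilization: Dict[str, float],
    k: int,
    rng: Optional[random.Random] = None,
    descending: bool = True,
) -> str:
    """Pick a random node among the k best-ranked nodes."""
    if not utilization:
        raise ValueError("no nodes to schedule on")
    rng = rng or random.Random()
    # Rank node ids by utilization; the direction is an assumption.
    ranked = sorted(utilization, key=utilization.get, reverse=descending)
    top_k = ranked[: max(1, k)]
    return rng.choice(top_k)
```

With `k=1` this degenerates to always choosing the single best-ranked node, which corresponds to the "top 1" baseline in the benchmark list.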
    
    benchmark result:
    
    baseline: 10 parallelism, top 1, 25 tasks/second
    10 parallelism, top 6, 30 tasks/second
    64 parallelism, top 6, 126
    100 parallelism, top 6, 150
    1000 parallelism, top 6, 374.8676886257549
    10 concurrent, top 12, 176
    64 concurrent, top 12, 182.59477988042443 tasks/s
    128 concurrent, top 12, 245.9862948998163
    256 concurrent, top 12, 298…
    scv119 authored Feb 1, 2023
    cf7bc27
  16. [Dashboard] Support actor detail (#32103)

    This PR adds an actor detail page.
    
    Other than the detail page, it also:
    
    * Adds pg id to task/actor
    * Adds profiling links to job detail, job row, and actor detail
    rkooo567 authored Feb 1, 2023
    d4b0a20
  17. [Datasets] Add logical operator for sort() (#32133)

    This PR adds a logical operator for `sort()`. The change includes:
    * `Sort` logical operator
    * `SortTaskSpec`, copied from `sort.py`
    * `generate_sort_fn`, which generates the function for sort
    c21 authored Feb 1, 2023
    75419d3
  18. Update index.md (#32053)

    Signed-off-by: Simran Mhatre <simran@anyscale.com>
    simran-2797 authored Feb 1, 2023
    b8221bb
  19. [core] Increase the threshold for pubsub integration test (#32145)

    The test failed under ASAN because some data is not cleaned up when it exits. Increase the threshold to mitigate this. Tested locally: out of 500 runs, only 3 failed.
    fishbone authored Feb 1, 2023
    12d7d7d
  20. [core] surface OOM error when actor is killed due to OOM (#32107)

    Right now we show an Actor error if the actor is killed due to OOM. This PR changes it so that it surfaces an OOM error.
    
    It does not support actor / actor task OOM retry, as the goal of this PR is to improve observability by setting the death cause of the actor to OOM.
    
    Related issue number
    #29736
    Signed-off-by: Aviv Haber <aviv@anyscale.com>
    Signed-off-by: Clarence Ng <clarence@anyscale.com>
    clarng authored Feb 1, 2023
    174f157
  21. [Tune] Save and restore stateful callbacks as part of experiment chec…

    …kpoint (#31957)
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    justinvyu authored Feb 1, 2023
    890e034
  22. [Tune] Rename overwrite_trainable argument in Tuner restore to `tra…

    …inable` (#32059)
    
    * Add trainable and deprecate overwrite_trainable
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    justinvyu authored Feb 1, 2023
    59f72cf
  23. 83e1a2a
  24. [core] clean up infeasible tasks submitted by the driver when the dri…

    …ver dies (#32127)
    
    Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
    
    Infeasible requests are not cleaned up when the driver exits. This PR cleans up infeasible requests created by the driver when it exits.
    
    * Does not apply to worker exit (follow-up).
    * Also does not apply to infeasible tasks submitted to a different raylet (follow-up).
    clarng authored Feb 1, 2023
    aad24bd
  25. Done (#32104)

    Signed-off-by: SangBin Cho <rkooo567@gmail.com>
    
    Add job id to the task state API call. This helps us avoid including tasks from other jobs (improving the experience when the cluster has 10K+ tasks).
    Add resource requirement to the pg table.
    rkooo567 authored Feb 1, 2023
    eb660ce
  26. [core][state][dashboard] Use main threads's task id or actor creation…

    … task id for parent's task id in state API (#32157)
    
    Right now, if a new thread (or async actor's event loop executing thread) runs some ray code (e.g. submitting a task, calling runtime context), the thread will have a WorkerThreadContext that has a random task id.
    
    This causes issues in the state API since the task tree will have the wrong structure, i.e. some tasks might have a parent_task_id that doesn't match any existing task. With this change:
    
    * For normal single-threaded tasks/actors, we use the main thread's task id (correct behavior).
    * For unusual cases (threaded/async actors), we use the actor creation task's task id. This means that, in the advanced visualization, all remote tasks created from actor tasks will appear under the constructor of the threaded/async actor.
    rickyyx authored Feb 1, 2023
    10c46dc
  27. [air][tune] replace node:<ip> custom resource with NodeAffinitySchedu…

    …lingPolicy (#32016)
    
    This PR changes usages of the `node:<ip>` custom resource as determined by querying [file:(air|tune|train).*\.py node:](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/ray-project/ray%24+file:%28air%7Ctune%7Ctrain%29.*%5C.py+node:). 
    
    This is being used for:
    - Collocating tasks (`_force_on_current_node`).
    - Syncing files to specific IP addresses.
    - Syncing files to _all_ other nodes.
    
    Signed-off-by: Matthew Deng <matt@anyscale.com>
    matthewdeng authored Feb 1, 2023
    666e2d9
  28. [Ray release] Moving Atari ROM dependencies to S3 (#32150)

    In #31933 we fix an Atari ROM dependency that by default uses a torrent to download ROMs. The tests in this PR also break occasionally due to the same reason.
    
    I moved the ROM dependency to S3 to increase reliability. I actually think we can remove the ROM dependency from these app configs since I don't see any RL test using them. But I think that is too much risk for this PR, since it will likely end up as a cherry pick to 2.3.
    cadedaniel authored Feb 1, 2023
    24d0376
  29. [Core] automatically pick max_pending_lease_requests based on number …

    …of nodes in the cluster (#31934)
    
    Why are these changes needed?
    This PR takes over #26373
    
    Currently, the initial scheduling delay for a simple f.remote() loop is approximately worker startup time (~1s) * number of nodes. There are three reasons for this:
    
    1. Drivers do not share physical worker processes, so each raylet must start new worker processes when a new driver starts. Each raylet starts the workers when the driver first sends a lease (resource) request to that raylet.
    2. The scheduling policy from #14790 prefers to pack tasks on fewer nodes up to 50% CPU utilization before spreading tasks for load-balancing.
    3. The maximum number of concurrent lease requests is 10, meaning that the driver must wait for workers to start on the first 10 nodes that it contacts before sending lease requests to the next set of nodes. Because of (2), the first 10 nodes contacted are usually not unique, especially when each node has many cores.
    
    This PR changes (3), allowing us to dynamically adjust `max_pending_lease_requests` based on the number of nodes in the cluster.
    Without this PR, the top-k scheduling algorithm is bottlenecked by the speed of sending lease requests across the cluster.
    scv119 authored Feb 1, 2023
    ff16730
  30. [Datasets] Fix filter logic and reuse output buffer (#32160)

    This PR fixes the filter logic so that it always uses `yield` instead of `return`; otherwise it reads only the first block and exits. It also adds a unit test, which was verified to fail before this PR.
    
    Also change all map-like functions to reuse the same output buffer.
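
The `yield` versus `return` pitfall can be reproduced in a few lines. This is a generic illustration, not the Datasets code: the `Block` alias and both function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Block = List[int]


def filter_first_block_only(
    blocks: Iterable[Block], keep: Callable[[int], bool]
) -> Block:
    # Pitfall: `return` leaves the loop after the first block.
    for block in blocks:
        return [row for row in block if keep(row)]
    return []


def filter_all_blocks(
    blocks: Iterable[Block], keep: Callable[[int], bool]
) -> Iterator[Block]:
    # `yield` emits one filtered block per input block and keeps iterating.
    for block in blocks:
        yield [row for row in block if keep(row)]
```

Only the generator version visits every input block.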
    c21 authored Feb 1, 2023
    e9269ab
  31. 223a9a6
  32. 4d526c5
  33. [core][state] Fix task failed time when job finishes (#32161)

    Signed-off-by: rickyyx rickyx@anyscale.com
    
    Why are these changes needed?
    We have the wrong unit translation right now when recording tasks' failed status if the owning job finishes.
    This results in negative duration of such tasks.
    Signed-off-by: rickyyx <rickyx@anyscale.com>
    rickyyx authored Feb 1, 2023
    f49b1b2
  34. [tune/execution][rfc] Cache ready futures in RayTrialExecutor (#32093)

    We currently resolve futures one-by-one in Ray Tune, and query Ray core for the ready status of futures multiple times. Instead, we can cache ready events and yield them if cached elements exist. This can improve performance: in tune_scalability_result_cluster_throughput it improved performance by ~2-3%.
    
    We will always re-query Ray if we expect a resource to be ready.
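
A sketch of the caching idea using the standard library's `concurrent.futures` in place of Ray futures. The `ReadyFutureCache` class and its method names are hypothetical, and the special case of re-querying when a resource is expected to be ready is not modelled here.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, Future, wait
from typing import Deque, Optional, Set


class ReadyFutureCache:
    """Resolves futures in batches; names here are hypothetical."""

    def __init__(self) -> None:
        self._pending: Set[Future] = set()
        self._ready: Deque[Future] = deque()
        self.wait_calls = 0  # how often the underlying wait was queried

    def track(self, future: Future) -> None:
        self._pending.add(future)

    def next_ready(self, timeout: Optional[float] = None) -> Optional[Future]:
        # Serve from the cache first; only query when it is empty.
        if not self._ready and self._pending:
            self.wait_calls += 1
            done, self._pending = wait(
                self._pending, timeout=timeout, return_when=FIRST_COMPLETED
            )
            self._ready.extend(done)
        return self._ready.popleft() if self._ready else None
```

When several futures are already finished, a single query fills the cache and later calls return without querying again.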
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 1, 2023
    6e39b2e
  35. [Release] Fix bad import in AIR benchmark (#32175)

    Fixes a bad import causing an AIR benchmark release test to fail.
    
    Release test run: https://buildkite.com/ray-project/release-tests-pr/builds/27298
    
    Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
    Yard1 authored Feb 1, 2023
    6d39879
  36. [tune] Sync less often and only wait at end of experiment (#32155)

    We currently run into a syncing bottleneck when running many short-running trials in a multi-node cluster, see #32121.
    
    After some investigation, there are three major bottlenecks:
    
    1. All of the 100 trials trigger 2 sync processes each. This is because we trigger a sync for both the result (`SyncerCallback.on_trial_result`) and for the trial completion (`SyncerCallback.on_trial_complete`).
    2. We wait synchronously for the sync processes to finish on trial completion
    3. The packing and unpacking interferes with the actual training processes on the local node, drastically increasing trial runtime for those trials colocated with the driver script
    
    This PR mitigates 1) and 2) to unblock the coming release. For 3), we may have to re-architect the current packing logic, which uses multiple pack actors and unpack tasks that can impact training performance.
    
    For 1), we introduce a **minimum training time + iteration threshold** for the syncing process. By default, we only trigger the first sync after at least 2 results were received _or_ 10 training seconds have passed. The logic here is that this will only affect experiments where we have short-running trials that report one result. In that case, we only need the `on_trial_complete` trigger at the end of training. Other experiments are unaffected, and not much is lost if we don't sync results from a first iteration that took less than 10 seconds to run.
    
    For 2), we cache sync process removal on trial completion. This means we do not wait until the sync process has finished, but we keep the process around so we can await syncing at the end of the experiment. Periodically, we clean up sync processes that were flagged for removal.
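
The minimum-time-or-iteration rule described for 1) can be sketched as a single predicate. The function and parameter names are hypothetical; only the default values (2 results, 10 seconds) come from the description above.

```python
def should_trigger_first_sync(
    num_results: int,
    training_time_s: float,
    min_iter_threshold: int = 2,
    min_time_s_threshold: float = 10.0,
) -> bool:
    """True once either threshold from the description is reached."""
    return (
        num_results >= min_iter_threshold
        or training_time_s >= min_time_s_threshold
    )
```

A trial that reports a single result after a few seconds never passes this check, so it would sync only through the completion trigger.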
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 1, 2023
    1f53e60
  37. [Tune] Add Tuner.can_restore(path) utility for checking if an exper…

    …iment exists at a path/uri (#32003)
    
    This PR adds a utility to check if a given path (either local or remote) exists and can be restored from. It includes some simple validation that this is the root of the experiment directory (can't restore from the trial level directory).
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    Signed-off-by: Justin Yu <justinvyu@anyscale.com>
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
    justinvyu and Yard1 authored Feb 1, 2023
    d6de1ce

Commits on Feb 2, 2023

  1. [ci][job] Move test_cli_integration to large test (#32171)

    This has caused flaky test failures which are false positives.
    rickyyx authored Feb 2, 2023
    a954ab7
  2. [Datasets] Add support for string tensor columns in `ArrowTensorArray…

    …` and `ArrowVariableShapedTensorArray` (#32143)
    
    Add support for creating ArrowTensorArrays and ArrowVariableShapedTensorArray with string typed columns. The previous PR #31817 had CI test failures which were not run at PR-review time. This PR replicates the functionality of the previous PR, and additionally addresses the test failures (which only occur for Arrow 8.0+).
    
    Signed-off-by: Scott Lee <sjl@anyscale.com>
    scottjlee authored Feb 2, 2023
    74266a2
  3. IA polish for demo (#32158)

    * Add links between progress bar and task table and actor table
    * Add links from task table to logs and to view stack trace
    * Fix logs link going to old IA instead of new IA
    * Fix horizontal scroll of table view
    * Add beta label
    alanwguo authored Feb 2, 2023
    5091217
  4. [spark] Refine some text in Ray on Spark exception messages and warni…

    …ng messages (#32162)
    
    See follow-up comments in #31962
    
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    WeichenXu123 authored Feb 2, 2023
    ed83715
  5. Revert "Revert "[Core] add ray-core as code-owner for most of the cor…

    …e code-path (#32082)" (#32176)" (#32190)
    
    This reverts commit 4d526c5.
    scv119 authored Feb 2, 2023
    ada5db7
  6. [RLlib] Fix typehint for explore argument. (#30734)

    Signed-off-by: Ram Rachum <ram@rachum.com>
    cool-RR authored Feb 2, 2023
    29cd2fa
  7. [RLlib] Add tags option to actor manager (#31803)

    Signed-off-by: Avnish <avnishnarayan@gmail.com>
    avnishn authored Feb 2, 2023
    a53907c
  8. [RLlib] Optimize the trainer runner test, add method for shutting dow…

    …n a trainer runner and releasing resources (#32109)
    
    Signed-off-by: avnish <avnish@anyscale.com>
    avnishn authored Feb 2, 2023
    fdfef1f
  9. [RLlib] Exclude gpu tag from Examples test suite in RLlib (#32141)

    * RLlib's example test suite should run on no-gpu instances, so we should exclude the gpu tag
    
    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 2, 2023
    b81f0cd

Commits on Feb 3, 2023

  1. [air] avoid inconsistency of create filesystem from uri for hdfs case (#30611)
    
    `pyarrow.fs.FileSystem.from_uri(uri)` works if the uri has the form `hdfs://name_server/user_folder/...`, but fails if the uri has the form `hdfs:///user_folder`. Certain Ray Tune modules make it impossible to always supply the uri in the `hdfs://name_server/user_folder/...` format. If fsspec is available, we don't have this issue, so we give fsspec a higher priority.
    
    Signed-off-by: yud <yud@uber.com>
    yuduber authored Feb 3, 2023
    b31343a
  2. Revert "Revert "[core] Increase the threshold for pubsub integration …

    …test"" (#32177)
    
    * Revert "Revert "[core] Increase the threshold for pubsub integration test (#32145)" (#32165)"
    
    This reverts commit 83e1a2a.
    
    Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
    fishbone authored Feb 3, 2023
    6f97a83 View commit details
  3. [core] release test for nested air (tune) oom (#31768)

    [core] release test for nested air (tune) oom #31768
    
    Signed-off-by: Clarence Ng <clarence@anyscale.com>
    clarng authored Feb 3, 2023
    370a574 View commit details

Commits on Feb 4, 2023

  1. [Docs] Fix typo in Huggingface example notebook (#32218)

    Signed-off-by: David Xia <dxia@spotify.com>
    davidxia authored Feb 4, 2023
    8b55e2d View commit details
  2. 37c0f76 View commit details
  3. 715e1b2 View commit details
  4. 5503bcd View commit details

Commits on Feb 6, 2023

  1. [Doc] [runtime env] Address common question about importing packages outside Ray (#31373)
    
    Answer a common user question by emphasizing in the docs that runtime envs are only active for Ray processes, so you shouldn't expect to be able to install a runtime env and then log into the cluster and start importing the packages outside Ray.
    architkulkarni authored Feb 6, 2023
    276559e View commit details
  2. [Serve] Remove logging requirement for long_running_serve_failure (#32181)
    
    #32063 fixed some issues with the long_running_serve_failure release test and then marked it stable. The test ran successfully afterwards (see test run), but the CI failed to access logs from the cluster and reported the test as errored. The logs were inaccessible on the cluster due to an issue with the cluster setup.
    
    Since this test can run without persisting logs, this change drops the logging requirement for this test.
    
    Related issue number
    Closes #32169
    shrekris-anyscale authored Feb 6, 2023
    2314775 View commit details
  3. [Datasets] Deflake the test_dataset.py (#32200)

    Signed-off-by: jianoaix <iamjianxiao@gmail.com>
    jianoaix authored Feb 6, 2023
    095960c View commit details

Commits on Feb 7, 2023

  1. e71e3a7 View commit details
  2. Allow overriding the UID of the default grafana dashboard exported by ray (#32255)
    
    Signed-off-by: Alan Guo <aguo@anyscale.com>
    
    This lets users with their own grafana setups have multiple dashboards, one per ray instance. Without this change, every dashboard would have the same uid and they would replace each other in the grafana DB.
    alanwguo authored Feb 7, 2023
    f3ae74e View commit details
  3. Remove metrics-based progress-bar endpoints (#31702)

    Signed-off-by: Alan Guo <aguo@anyscale.com>
    
    This is no longer necessary after #31577
    alanwguo authored Feb 7, 2023
    8030e51 View commit details
  4. clean up raylet client mocks (#32216)

    Signed-off-by: Clarence Ng <clarence@anyscale.com>
    
    Remove redundant mock classes. We only need one mock class for the interface, which covers all the sub-interfaces. The mocks for the sub-interfaces are unused.
    clarng authored Feb 7, 2023
    eec9791 View commit details
  5. 7432367 View commit details
  6. [air/benchmarks] Fix typo in tensorflow_benchmark.py script preventing proper error surfacing (#32269)
    
    There is a small typo in the tensorflow_benchmark.py script that does not properly catch when a vanilla TF run failed three times. Because of this, we would previously record a training time of 0.0 for vanilla TF, which skews the calculated average and suggests that vanilla TF outperformed Ray Train. Instead, we should have raised an error message to surface the problem.
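
    The intended behavior (retry a few times, then raise instead of recording 0.0) can be sketched as below. This is an illustration only, not the code in tensorflow_benchmark.py; `run_with_retries` and its parameters are hypothetical names.

```python
from typing import Callable


def run_with_retries(run_once: Callable[[], float], max_attempts: int = 3) -> float:
    """Return the training time of the first successful run.

    Raises RuntimeError if every attempt fails, rather than returning 0.0,
    so that a failed baseline can never skew the computed average.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return run_once()
        except Exception as exc:  # a failed benchmark run
            last_error = exc
    raise RuntimeError(f"Benchmark failed {max_attempts} times") from last_error
```

    Raising here means the comparison is aborted visibly instead of reporting that the failed run was faster.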
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 7, 2023
    c83111a View commit details
  7. [RLlib] Chaining Models in RLModules (#31469)

    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    ArturNiederfahrenhorst authored Feb 7, 2023
    027965b View commit details
  8. [Data] Revise "Getting Started" page (#31989)

    The "Getting Started" page is long. It contains large code snippets and potentially irrelevant information. This PR revises the page for readability and brevity.
    
    Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
    bveeramani authored Feb 7, 2023
    2efee15 View commit details
  9. [Tune] Add use_threads=False in pyarrow syncing (#32256)

    Fixes a pyarrow issue where the syncing deadlocks when there are more files in a directory than available CPU cores.
    
    Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    Co-authored-by: Kai Fricke <kai@anyscale.com>
    Yard1 and Kai Fricke authored Feb 7, 2023
    773f7bf View commit details
  10. Fix overview page to work with the new DASHBOARD_UID env var (#32279)

    In #32255, I added a new env var to customize the grafana dashboard uid, but forgot to use it in the overview page.
    I also made the "View in Grafana" button take the user directly to the dashboard instead of the Grafana homepage.
    
    Signed-off-by: Alan Guo aguo@anyscale.com
    alanwguo authored Feb 7, 2023
    ce5a21a View commit details
  11. [build_base] [Docker] Add cuda 11.8 images (#32247)

    To keep up with the CUDA versions needed for PyTorch 2.0, this PR adds a CUDA 11.8 image.
    
    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    Co-authored-by: Kai Fricke <kai@anyscale.com>
    ArturNiederfahrenhorst and Kai Fricke authored Feb 7, 2023
    9995599 View commit details
  12. [Tune] Add repr for ResultGrid class (#31941)

    Add __repr__() for ResultGrid class and prettify __repr__() of Result class.
    
    Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter>
    Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter>
    woshiyyya and Yunxuan Xiao authored Feb 7, 2023
    cf95514 View commit details
  13. [ci/release] Improve error message when kicking off tests from a commit (#32281)
    
    When kicking off release tests from Buildkite, it's easy to mistakenly enter a commit in both the Buildkite dialog and our own dialog. In the first case, Buildkite checks out the repository at that specific commit, so a test that is not contained in that commit can't be run for it.
    
    This PR will provide a better error message in that case.
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 7, 2023
    37580d7 View commit details
  14. [Core] Fix recursive cancelation crashes the worker when actor task is a child. (#32259)
    
    Signed-off-by: SangBin Cho <rkooo567@gmail.com>
    
    ray.cancel is only supported for tasks, not actor tasks (https://docs.ray.io/en/master/ray-core/package-ref.html#ray-cancel). Note that it is an intended design because canceling actor tasks could corrupt the actor states easily.
    
    When ray.cancel is called, we set recursive=True, which means all children's tasks will also be canceled. However, when this happens, if the task has a child "actor task", it crashes the worker with WorkerCrashedError: task_spec.cc:200: Check failed: sched_cls_id_ > 0 because we don't handle this case properly.
    
    To fix the issue, we check whether the child tasks are actor tasks. This PR also improves the error message when recursive cancellation fails. Note that because ray.cancel is non-blocking, we couldn't include the error message in ray.get(canceled_task).
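
    The check can be pictured with a small model of a task tree. This is a sketch under stated assumptions, not Ray's implementation: `Task` and `cancel_recursive` are hypothetical, and the choice not to descend below a skipped actor task is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    is_actor_task: bool = False
    children: List["Task"] = field(default_factory=list)
    cancelled: bool = False


def cancel_recursive(task: Task, skipped: List[str]) -> None:
    """Cancel `task` and its descendants, skipping actor tasks."""
    if task.is_actor_task:
        # Actor tasks are not cancellable; record them instead of crashing.
        # (Assumption: we do not descend further below an actor task.)
        skipped.append(task.name)
        return
    task.cancelled = True
    for child in task.children:
        cancel_recursive(child, skipped)


root = Task("parent", children=[Task("child"), Task("actor_call", is_actor_task=True)])
skipped: List[str] = []
cancel_recursive(root, skipped)
print(skipped)
```

    Collecting the skipped names is one way to build the improved error message mentioned above.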
    rkooo567 authored Feb 7, 2023
    00db336 View commit details

Commits on Feb 8, 2023

  1. 51efd2f View commit details
  2. [Datasets] Fix book-documentation (#32293)

    Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
    
    #31989 broke the 📖 Documentation job. This PR fixes the doctest failure.
    bveeramani authored Feb 8, 2023
    3fa36d9 View commit details
  3. [AIR] Fix dtype type hint in DLPredictor methods (#32198)

    The dtype parameter of DLPredictor._predict_pandas and DLPredictor._predict_numpy is None by default, but the type hint suggests dtype is non-None. This PR fixes the type hint by labeling the parameter as Optional.
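
    The shape of the fix, shown on a hypothetical stand-alone function rather than the actual DLPredictor methods (`predict` and its body are assumptions made for illustration):

```python
from typing import Optional


def predict(data: list, dtype: Optional[type] = None) -> list:
    """Cast each element to `dtype` if one is given; otherwise return a copy."""
    # Optional[...] tells type checkers that None is a valid argument,
    # which matches the None default.
    if dtype is None:
        return list(data)
    return [dtype(x) for x in data]


print(predict([1, 2]))
print(predict([1, 2], dtype=float))
```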
    
    Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
    bveeramani authored Feb 8, 2023
    5e1def0 View commit details
  4. 3f43969 View commit details
  5. [RLlib] PPO torch RLTrainer (#31801)

    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 8, 2023
    1f77e04 View commit details
  6. [Tune] Replace reference values in a config dict with placeholders (#31927)
    
    Signed-off-by: Jun Gong <gongjunoliver@hotmail.com>
    Co-authored-by: Justin Yu <justinvyu@anyscale.com>
    Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
    3 people authored Feb 8, 2023
    befad81 View commit details
  7. aa504ae View commit details
  8. cefd3c4 View commit details
  9. [Tune] Remove Ray Client references from Tune and Train docs/examples (#32299)
    
    This PR removes references to Ray Client in Tune and Train examples. It also removes outdated statements that `ray.init("auto")` is needed to connect to an existing cluster, whereas `ray.init()` creates a new local cluster.
    
    The latest `ray.init()` docstring explains that:
    
    > This method handles two cases; either a Ray cluster already exists and we just attach this driver to it or we start all of the processes associated with a Ray cluster and attach to the newly started cluster.
    
    New version of this PR: #31712
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    justinvyu authored Feb 8, 2023
    e84fcb1 View commit details
  10. [release] Improve handle_result in case of empty fetched result. (#32055)
    
    Improve handle_result (the result alert logic) for release tests when the fetched result is empty due to infra issues, for example if the job server on the cluster is down (we rely on it to get files back to the buildkite runners).
    
    Without this, the error code indicates application error, which is misleading.
    See an example here: https://buildkite.com/ray-project/release-tests-branch/builds/1318#0185fc29-1d4c-483a-999b-ede500781c7a
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    xwjiang2010 authored Feb 8, 2023
    bae61d9 View commit details
  11. [RLlib] Move minibatching into RLTrainer instead of TrainerRunner (#32262)
    
    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 8, 2023
    585f8aa View commit details
  12. [RLlib] Support empty leafs with NestedDict (#32136)

    * add test cases and make nesteddict also support empty elements
    
    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    ArturNiederfahrenhorst authored Feb 8, 2023
    59c62e4 View commit details
  13. [RLlib] Forward fix for failing PPO Torch RLTrainer test (#32308)

    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 8, 2023
    b85eb52 View commit details
  14. [Doc] Add tips of writing fault tolerant Ray applications (#32191)

    Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
    jjyao authored Feb 8, 2023
    d256508 View commit details
  15. [Telemetry] track num tasks created (#32106)

    Tracks the total number of tasks created by leveraging the gcs_task_manager.
    scv119 authored Feb 8, 2023
    56606ae View commit details
  16. [core] Fix the GCS memory usage high issue

    This is not caused by a leak. The root cause is that we allocate more requests at startup. This PR fixes it by making the number of calls constant.
    fishbone authored Feb 8, 2023
    cf1bc83 View commit details
  17. [telemetry] remove extra print #32322

    Removes a debugging message I accidentally merged in #32106.
    scv119 authored Feb 8, 2023
    cb5129c View commit details
  18. 468e606 View commit details
  19. [AIR] Add TorchDetectionPredictor (#32199)

    TorchPredictor doesn't work with TorchVision detection models because they return List[Dict[str, torch.Tensor]] instead of torch.Tensor. This PR adds a TorchDetectionPredictor so users don't have to extend TorchPredictor themselves.
    
    Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
    bveeramani authored Feb 8, 2023
    53260af View commit details
  20. [RLlib] Make one hidden layer config possible for TorchMLP (#32310)

    * make only one hidden layer possible
    * move setting out output dims to setup()
    
    Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
    ArturNiederfahrenhorst authored Feb 8, 2023
    0466bd3 View commit details
  21. [data] [streaming] No preserve order by default (#32300)

    Signed-off-by: Eric Liang ekhliang@gmail.com
    
    Why are these changes needed?
    Preserve order decreases performance; set it off by default.
    ericl authored Feb 8, 2023
    f05eeb4 View commit details
  22. [core] Fix comments and a corner case in #32302 (#32323)

    Fixes a corner case where the buffer could be 0, and a comment from the previous PR.
    fishbone authored Feb 8, 2023
    3bb73d3 View commit details
  23. [Serve][Doc] Refactor the Ray Serve API doc (#32307)

    - Add an index page to list all the APIs. (https://ray--32307.org.readthedocs.build/en/32307/serve/api/index.html)
    - With this change, searching for a specific Python API, e.g. `ray.serve.run`, shows the link to that API's page. (Previously, users couldn't get the correct search result because all APIs were on one page.)
    sihanwang41 authored Feb 8, 2023
    22bc1e9 View commit details

Commits on Feb 9, 2023

  1. [RLlib] Modifications to gpu resource logic in rl_trainer (#32149)

    * Modifications to gpu resource logic in rl_trainer
    
    - Add support for gpu with tf trainers in local mode
    - remove `_make_distributed_module`
    - add support for `local_gpu_id` which is the id of the gpu to use during local mode training with gpu
    - refactor tf function tracing logic to include the call to strategy.run
    - change tf function logic to prevent unnecessary retracing
    - add warning to not do gpu or distributed training in tf without turning on eager tracing.
    
    Signed-off-by: avnish <avnish@anyscale.com>
    avnishn authored Feb 9, 2023
    b73f3eb View commit details
  2. [Doc] add job overview diagram (#32050)

    This diagram is currently only placed on the key concepts page. However, when I search for ray jobs, I usually only end up on the job overview page and couldn't find this diagram. This diagram will be very helpful to people who need an overview of ray jobs which this page is intended for.
    scottsun94 authored Feb 9, 2023
    6cfb541 View commit details
  3. b011d56 View commit details
  4. [core] Improving failure message when ray processes fail to start on new node (#32303)
    
    We have a release test named long_running_node_failures which intermittently fails because a node failed to start up. I couldn't debug it despite having all of the Ray logs. I created this PR to add a bit more information (the node socket that should have started up) in the hopes that this enables us to identify the issue next time it happens.
    
    Failure in long_running_node_failures: #32180
    cadedaniel authored Feb 9, 2023
    63d922b View commit details
  5. [release] update if xgboost test suite requires result or not. (#32340)

    * [release] update if xgboost test suite require result or not.
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    
    * format
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    
    * Revert "format"
    
    This reverts commit 3140401.
    
    * Revert "[release] update if xgboost test suite require result or not."
    
    This reverts commit 03ca1c0.
    
    * change to default alert.
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    
    * remove tests from xgboost_tests alerts.
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    
    ---------
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    xwjiang2010 authored Feb 9, 2023
    5c1c888 View commit details
  6. 5f0f95a View commit details
  7. 67d1515 View commit details
  8. b2e7699 View commit details
  9. [autoscaler][observability] Better memory formatting (#32337)

    This PR updates the memory formatting to show usage and total in independent, friendly units. This should make it easier to tell when a small amount of memory is in use that would otherwise be rounded to 0, which is often confusing for downscaling.
    
    ```
    ======== Autoscaler status: 2020-12-28 01:02:03 ========
    Node status
    --------------------------------------------------------
    Healthy:
     2 p3.2xlarge
     20 m4.4xlarge
    Pending:
     m4.4xlarge, 2 launching
     1.2.3.4: m4.4xlarge, waiting-for-ssh
     1.2.3.5: m4.4xlarge, waiting-for-ssh
    Recent failures:
     p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)
    
    Resources
    --------------------------------------------------------
    Usage:
     0/2 AcceleratorType:V100
     530.0/544.0 CPU
     2/2 GPU
     2.00GiB/8.00GiB memory
     0B/16.00GiB object_store_memory
    
    Demands:
     {'CPU': 1}: 150+ pending tasks/actors
     {'CPU': 4} * 5 (PACK): 420+ pending placement groups
     {'CPU': 16}: 100+ from request_resources()
    ```
    
    and 
    
    ```
    ======== Autoscaler status: 2020-12-28 01:02:03 ========
    Node status
    --------------------------------------------------------
    Healthy:
     2 p3.2xlarge
     20 m4.4xlarge
    Pending:
     m4.4xlarge, 2 launching
     1.2.3.4: m4.4xlarge, waiting-for-ssh
     1.2.3.5: m4.4xlarge, waiting-for-ssh
    Recent failures:
     p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)
    
    Resources
    --------------------------------------------------------
    Usage:
     0/2 AcceleratorType:V100
     530.0/544.0 CPU
     2/2 GPU
     2.00GiB/8.00GiB memory
     3.14GiB/16.00GiB object_store_memory
    
    Demands:
     {'CPU': 1}: 150+ pending tasks/actors
     {'CPU': 4} * 5 (PACK): 420+ pending placement groups
     {'CPU': 16}: 100+ from request_resources()
    ```
    
    are some examples of what the updated output may look like.
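
    The formatting rule can be sketched as follows. This is a minimal illustration, not the autoscaler's code: `format_memory` and `format_usage` are hypothetical names, and the two-decimal binary-unit format is inferred from the sample output above.

```python
def format_memory(num_bytes: float) -> str:
    """Format a byte count using the largest binary unit that keeps the value >= 1."""
    if num_bytes == 0:
        return "0B"
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            break
        value /= 1024
    if unit == "B":
        return f"{int(value)}B"
    return f"{value:.2f}{unit}"


def format_usage(used: float, total: float) -> str:
    """Format used and total independently, so a small `used` is not rounded to 0."""
    return f"{format_memory(used)}/{format_memory(total)}"


print(format_usage(2 * 2**30, 8 * 2**30))
print(format_usage(0, 16 * 2**30))
print(format_usage(5 * 2**20, 8 * 2**30))
```

    Because each side picks its own unit, 5 MiB in use on an 8 GiB node stays visible instead of collapsing to zero.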
    
    Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
    Co-authored-by: Alex <alex@anyscale.com>
    3 people authored Feb 9, 2023
    d653f73 View commit details
  10. [core] Add opt-in flag for Windows and OSX clusters, update `ray start` output to match docs (#31166)
    
    This PR cleans up a few usability issues around Ray clusters:
    
    - Makes some cleanups to the ray start log output to match the new documentation on Ray clusters. Mainly, de-emphasize Ray Client and recommend jobs instead.
    - Add an opt-in flag for enabling multi-node clusters for OSX and Windows. Previously, it was possible to start a multi-node cluster, but then any Ray programs would fail mysteriously after connecting to the cluster. Now, it will warn the user with an error message if the opt-in flag is not set.
    - Document multi-node support for OSX and Windows.
    
    Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
    Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
    stephanie-wang and architkulkarni authored Feb 9, 2023
    90f8511 View commit details
  11. [data] [streaming] Implement locality-aware actor task assignment (#32278)
    
    This implements a very simple version of locality-aware task assignment. The locality assignment problem is complex, but here we will start by just preferentially assigning tasks to actors if the first block of the bundle is local. We will record perf metrics on the locality hit/miss rate.
    
    This feature is flag protected (on by default).
    
    Actor locality on: 
    ```
    MapBatches(Model): 0 active, 0 queued, 0 actors [987 locality hits, 13 misses]:
    100%|█████████| 1000/1000 [01:01<00:00, 16.28it/s]
    
    Average throughput 16.072036005250155 GiB/s
    ```
    
    Actor locality off:
    ```
    MapBatches(Model): 0 active, 0 queued, 0 actors [locality off]:
    100%|███████████████████████████| 1000/1000 [03:01<00:00,  5.50it/s]
    
    Average throughput 5.471759229068149 GiB/s
    ```
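
    The assignment rule described in this commit (prefer an actor on the node holding the bundle's first block, and count hits and misses) can be sketched as below. This is a toy model, not the Ray Data implementation; `Actor`, `LocalityAwarePool`, and the string node names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Actor:
    name: str
    node: str
    busy: bool = False


class LocalityAwarePool:
    def __init__(self, actors: List[Actor]):
        self.actors = actors
        self.hits = 0
        self.misses = 0

    def pick_actor(self, first_block_node: str) -> Optional[Actor]:
        """Prefer an idle actor on the node holding the bundle's first block."""
        idle = [a for a in self.actors if not a.busy]
        if not idle:
            return None  # nothing free; the caller would queue the task
        local = [a for a in idle if a.node == first_block_node]
        if local:
            self.hits += 1
            chosen = local[0]
        else:
            # No idle actor is co-located with the block: fall back to any idle one.
            self.misses += 1
            chosen = idle[0]
        chosen.busy = True
        return chosen


pool = LocalityAwarePool([Actor("a", "node1"), Actor("b", "node2")])
first = pool.pick_actor("node2")
second = pool.pick_actor("node2")
print(first.name, second.name, pool.hits, pool.misses)
```
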
    ericl authored Feb 9, 2023
    0e56dff View commit details
  12. [RLlib] Remove leela chess from release tests (#32325)

    * Temporary fix to the leela chess example
    * Remove leela chess from the release test framework, move it to tuned examples
    
    Signed-off-by: avnish <avnish@anyscale.com>
    avnishn authored Feb 9, 2023
    f80badc View commit details
  13. [core][state] Task backend improve performance (#32251)

    Signed-off-by: rickyyx <rickyx@anyscale.com>
    
    This PR aims to improve performance of the task backend with 3 changes:
    
    - Delay conversion of protobuf. We found that the protobuf conversion, especially from TaskSpecification to the TaskInfoEntry needed for the task metadata, was slow and was on the critical path of task execution and submission. This PR delays the generation of rpc::TaskEvents until just before sending in the flush thread. During task execution, it simply generates an in-memory TaskEvent entry with lower overhead.
    - Fixed the circular buffer that's used as the underlying data structure for the buffered events. This prevents constant resizing when the buffer gets filled up or flushed, which is costly.
    - Adjust the niceness of the flushing thread, so it has a lower priority than the worker thread.
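
    The first two changes can be illustrated with a small Python model. It is a sketch only (the real code is C++), and it assumes the buffer has a fixed capacity; `TaskEvent`, `to_wire_format`, and `TaskEventBuffer` are hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass
class TaskEvent:
    """Cheap in-memory record created on the task's critical path."""
    task_id: str
    state: str


def to_wire_format(event: TaskEvent) -> Dict[str, str]:
    """Stand-in for the expensive protobuf conversion."""
    return {"task_id": event.task_id, "state": event.state}


class TaskEventBuffer:
    def __init__(self, capacity: int):
        # A deque with maxlen acts as a fixed-capacity ring buffer:
        # once full, appending drops the oldest entry instead of growing.
        self._events: Deque[TaskEvent] = deque(maxlen=capacity)

    def record(self, task_id: str, state: str) -> None:
        # Critical path: only build the lightweight in-memory entry.
        self._events.append(TaskEvent(task_id, state))

    def flush(self) -> List[Dict[str, str]]:
        """Convert and drain buffered events; meant to run off the critical path."""
        batch = [to_wire_format(e) for e in self._events]
        self._events.clear()
        return batch


buf = TaskEventBuffer(capacity=2)
for task_id in ("a", "b", "c"):
    buf.record(task_id, "RUNNING")
batch = buf.flush()
print(batch)
```
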
    rickyyx authored Feb 9, 2023
    69a14e7 View commit details
  14. [docs]Fix wording of Many model training guidance (#32319)

    Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
    Co-authored-by: Cade Daniel <cade@anyscale.com>
    3 people authored Feb 9, 2023
    8bf1d03 View commit details
  15. [core] Fix gRPC callback API destruction issues. (#32151)

    For the gRPC callback API, the lifecycle differs between the server and client side.
    
    - For the server, it has to call Finish for gRPC to consider the call dead, and this can only be called once.
    - For the client, it destructs itself if it receives the signal from the server or the connection is broken for some reason.
    
    There are two issues here in ray syncer:
    
    - The server might call Finish twice because the server has OnWriteDone/OnReadDone. The fix is that when an error happens, we call Finish and guarantee that it's only called once.
    - The client might destruct itself, because the client didn't have anything added to control that. The fix is to add AddHold/RemoveHold in the code to explicitly control that, just like on the server side.
    
    Testing is tricky, but it can be caught by nightly tests.
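
    The "Finish at most once" guarantee can be sketched in Python as a guarded call. This only models the idea; the real fix lives in C++ gRPC reactor code, and `ServerCall` with its methods is a hypothetical stand-in.

```python
import threading


class ServerCall:
    """Hypothetical stand-in for a server-side call with two completion callbacks."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._finished = False
        self.finish_count = 0

    def _finish_once(self) -> None:
        with self._lock:
            if self._finished:
                return  # a previous callback already finished the call
            self._finished = True
        self.finish_count += 1  # the real code would call Finish() here

    def on_read_done(self, ok: bool) -> None:
        if not ok:
            self._finish_once()

    def on_write_done(self, ok: bool) -> None:
        if not ok:
            self._finish_once()


call = ServerCall()
call.on_read_done(ok=False)
call.on_write_done(ok=False)
print(call.finish_count)
```
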
    fishbone authored Feb 9, 2023
    fc81af1 View commit details
  16. [Doc] Move actor checkpointing to actor fault tolerance page (#32153)

    Actor fault tolerance page is a better place for actor checkpointing. Also make the code example testable.
    
    Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
    jjyao authored Feb 9, 2023
    741b7a0 View commit details
  17. [Core/Observability] Fix the timeline bugs (#32287)

    Signed-off-by: SangBin Cho <rkooo567@gmail.com>
    
    There are 2 issues.
    
    - The duration should be recorded in microseconds. I mistakenly recorded it as 10*microseconds, which made the duration incorrect.
    - The metadata event should be recorded only once. I mistakenly recorded it for every task, which blows up the timeline file size.
    
    This PR fixes both issues and adds relevant tests.
    
    I also created a dataclass for chrome tracing events for a better schema tracking.
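
    A dataclass for Chrome tracing events, together with both fixes, might look like the sketch below. The field names follow the Chrome trace event format (`ph`, `ts`, `dur`, with times in microseconds); `ChromeTracingEvent`, `build_timeline`, and the input dictionaries are hypothetical and not Ray's actual schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ChromeTracingEvent:
    name: str
    ph: str           # "X" = complete event, "M" = metadata event
    pid: int
    tid: int
    ts: float = 0.0   # start time in microseconds
    dur: float = 0.0  # duration in microseconds
    args: Dict[str, Any] = field(default_factory=dict)


def seconds_to_us(seconds: float) -> float:
    return seconds * 1_000_000


def build_timeline(tasks: List[Dict[str, Any]], pid: int = 1, tid: int = 1) -> List[Dict[str, Any]]:
    # Emit the metadata event once per process, not once per task.
    events = [ChromeTracingEvent("process_name", "M", pid, tid, args={"name": "worker"})]
    for task in tasks:
        events.append(
            ChromeTracingEvent(
                name=task["name"],
                ph="X",
                pid=pid,
                tid=tid,
                ts=seconds_to_us(task["start_s"]),
                dur=seconds_to_us(task["end_s"] - task["start_s"]),
            )
        )
    return [asdict(e) for e in events]


timeline = build_timeline(
    [
        {"name": "f", "start_s": 1.0, "end_s": 1.5},
        {"name": "g", "start_s": 2.0, "end_s": 2.25},
    ]
)
print(len(timeline))
```
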
    rkooo567 authored Feb 9, 2023
    188c411 View commit details
  18. [core][state] Task Backend - reduce lock contention on debug stats / metric recording on counters. (#32355)
    
    Signed-off-by: rickyyx <rickyx@anyscale.com>
    
    When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously had mutexes protecting some of the counter states, so the main io service/thread would get blocked when trying to acquire locks to print debug states, record metrics, and add telemetry data.
    
    Global stats: 196276 total (5 active)
    Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s
    Execution time:  mean = 295.864 us, total = 58.071 s
    Event stats:
    ....
            GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s
            GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s
    ....
    
    This PR:
    
    - Introduces a thread-safe wrapper on CounterMap, so that modifying and reading the various debug counters has minimal lock contention. Also merges the count by task type for telemetry into the counter map, so we no longer need to acquire locks at various places.
    - With access to counters now thread-safe, removes the mutex locks on the GcsTaskManagerStorage, since it's now thread-safe (only accessed from its dedicated io thread).
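
    The idea of a thread-safe counter wrapper can be sketched in Python as follows. This is an analogy to the C++ change rather than its implementation; `ThreadSafeCounterMap` and the key name used in the usage snippet are hypothetical.

```python
import threading
from collections import Counter
from typing import Dict, Hashable


class ThreadSafeCounterMap:
    """Counter map whose lock is held only for the duration of one small operation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def increment(self, key: Hashable, amount: int = 1) -> None:
        with self._lock:
            self._counts[key] += amount

    def get(self, key: Hashable) -> int:
        with self._lock:
            return self._counts[key]  # Counter returns 0 for missing keys

    def snapshot(self) -> Dict[Hashable, int]:
        # Copy under the lock so readers (debug dump, metrics) never hold it for long.
        with self._lock:
            return dict(self._counts)


counters = ThreadSafeCounterMap()


def work() -> None:
    for _ in range(1000):
        counters.increment("NORMAL_TASK")


threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counters.snapshot())
```
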
    rickyyx authored Feb 9, 2023
    2bbe8c1 View commit details

Commits on Feb 10, 2023

  1. [Data] Add rule for ReorderRandomizeBlockOrder (#32254)

    Ports over previous rule to move RandomizeBlockOrder to the end of a DAG into the new execution backend as an optimizer rule.
    
    Closes #31894
    
    Signed-off-by: amogkam <amogkamsetty@yahoo.com>
    amogkam authored Feb 10, 2023
    b4ad23a
  2. [AIR] Automatically move DatasetIterator torch tensors to correct device (#31753)
    
    When DatasetIterator is used with Ray Train, automatically move the torch tensors returned by iter_torch_batches to the correct device.
    
    Signed-off-by: amogkam <amogkamsetty@yahoo.com>
    amogkam authored Feb 10, 2023
    4420120
  3. [air/execution] Event manager part 2: Implementation (#31811)

    This implements the abstractions introduced in #31236.
    
    Changes:
    - We move to a static callback definition to better match other existing APIs
    - We split the RayEventManager into a RayActorManager (for actors) and a RayEventManager (for futures)
    - Instead of awaiting an arbitrary number of results, we have a `next()` method to await exactly one event, as this is the only thing needed for Train/Tune
    - We simplified the APIs and reduced the number of concepts.
    
    This PR comes with two end-to-end example flows for Ray Train- and Ray Tune-like flows.
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 10, 2023
    492ff7e
  4. [RLlib] Async trainer manager (#32282)

    Implement an asynchronous update function, along with a small
    test checking that it converges to the same results as the synchronous
    update
    
    Signed-off-by: avnish <avnish@anyscale.com>
    avnishn authored Feb 10, 2023
    c9cf2ef
  5. Revert "[core] Add opt-in flag for Windows and OSX clusters, update `ray start` output to match docs (#31166)" (#32403)
    
    This reverts commit 90f8511.
    scv119 authored Feb 10, 2023
    d807ce0
  6. [core][oom] Use retriable lifo policy for dask 3x nightly test (#32361)

    Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
    
    The 3x nightly dask test is failing because the group-by-owner OOM killer policy was enabled.
    
    This switches the test back to the previous policy.
    clarng authored Feb 10, 2023
    73b52e0
  7. [Train] Fix use_gpu with HuggingFacePredictor (#32333)

    HuggingFacePredictor's use_gpu was set in the wrong method, so it did not work correctly. This PR fixes that.
    
    Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
    Yard1 authored Feb 10, 2023
    a1938c3
  8. [RLlib] Clean up RLModule (#32328)

    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 10, 2023
    841a4fb
  9. [RLlib] Cleanup RLTrainer (#32345)

    * Modifications to gpu resource logic in rl_trainer
    
    - Add GPU support for tf trainers in local mode
    - remove `_make_distributed_module`
    - add support for `local_gpu_id` which is the id of the gpu to use
      during local mode training with gpu
    - refactor tf function tracing logic to include the call to strategy.run
    - change tf function logic to prevent unnecessary retracing
    - add a warning not to do gpu or distributed training in tf without
      turning on eager tracing.
    
    Signed-off-by: avnish <avnish@anyscale.com>
    kouroshHakha authored Feb 10, 2023
    60fa8fe
  10. [Bug Fix][Object Store] race condition: Pull Manager will hang in certain timings (#31464)
    
    Restore fails if the object is still being created, so with certain timings the pull hangs.
    Catch-Bull authored Feb 10, 2023
    9cbf406
  11. [Tune] Improve logging, unify trial retry logic, improve trial restore retry test. (#32242)
    
    * [Tune] Improve logging, unify requeue logic, improve trial restore retry test.
    * fix unit test.
    * lint
    * fix test_tuner_restore
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    xwjiang2010 authored Feb 10, 2023
    d9a17f2
  12. [Job API] Handle multiple drivers with same job submission id in GCS GetAllJobInfo endpoint (#32388)
    
    The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.
    
    This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.
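
    The many-to-one relationship between drivers and submission ids can be sketched as below. This is a simplified Python illustration, not the GCS implementation: the function name and the shape of `job_table` (a dict from job id to an optional submission id) are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_drivers_by_submission_id(
    job_table: Dict[str, Optional[str]],
) -> Dict[str, List[str]]:
    """Group job ids (drivers) by submission id.

    Several drivers may share one submission id, so each submission id maps
    to a list instead of assuming exactly one job per submission id.
    Drivers without a submission id are skipped.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for job_id, submission_id in job_table.items():
        if submission_id is not None:
            grouped[submission_id].append(job_id)
    return dict(grouped)
```

    With this shape, two `ray.init()` calls in one entrypoint simply produce two job ids under the same submission id, and no equality check between the number of jobs and the number of submission ids is needed.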
    
    Related issue number
    Closes #32213
    architkulkarni authored Feb 10, 2023
    35e106a
  13. [Datasets] Not change map_batches() UDF name in Dataset.__repr__ (#32411)
    
    This fixes the Dataset.__repr__ issue in #32410, which appeared after function names were introduced in #31526. Only the operator/stage name should be converted to camel case.
    
    Signed-off-by: Cheng Su <scnju13@gmail.com>
    c21 authored Feb 10, 2023
    d8639ab
  14. [Metrics] Fix flaky test_task_metrics + fix slow report issue from unit tests (#32342)
    
    Every X seconds, when we record metrics, we check all pending updates from counter_map. If there are pending updates, we invoke the registered callback for the relevant updates, which records metrics.
    
    Currently, we have 3 counter_maps: the regular one (containing all data), plus the get and wait counter_maps. For the get and wait counter_maps, although there are updates, we don't register callbacks (they are used to calculate correct RUNNING / GET / WAIT counts).
    
    So normally, this is what happens:
    
    1. A task gets into RUNNING state. The regular counter_map is updated and a callback is added.
    2. Get is called, and the get counter_map is updated. The callback is not updated (by design).
    3. If metrics are recorded after step 2, the callback from the regular counter_map is invoked and we record correct metrics.
    
    If metrics are recorded after step 1, the RUNNING state is recorded. But since we don't register callbacks for the get counter_map, the relevant updates are not recorded the next time metrics are recorded.
    
    Flakiness comes from the latter case.
    
    This fixes the issue by making a "no-op update" to the regular counter_map (e.g., Increment(0)). This triggers counter_map to invoke the callback again, which correctly updates the get & wait status.
    
    I could also refactor the code to not use the get & wait counter_maps, but this approach is much easier, so I decided to go with it.
    
    This PR also fixes the slow stats report issue.
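
    The race and the "no-op update" fix can be reproduced in a small model. The sketch below is a Python approximation under stated assumptions, not the real C++ counter_map: the class `CounterMap`, its method names, and the `simulate` helper are made up for this illustration. Only keys with pending updates trigger the callback on flush, and an increment of 0 still marks a key as pending.

```python
class CounterMap:
    """Hypothetical counter map that reports only keys with pending updates."""

    def __init__(self):
        self._counts = {}
        self._pending = set()
        self._on_change = None

    def set_on_change_callback(self, callback):
        self._on_change = callback

    def increment(self, key, amount=1):
        self._counts[key] = self._counts.get(key, 0) + amount
        # Even an increment of 0 marks the key as pending.
        self._pending.add(key)

    def get(self, key):
        return self._counts.get(key, 0)

    def flush_on_change_callbacks(self):
        # Invoke the callback once per pending key, then clear the pending set.
        if self._on_change is not None:
            for key in sorted(self._pending):
                self._on_change(key)
        self._pending.clear()


def simulate(use_noop_update):
    """Record metrics after step 1, then again after Get is called."""
    regular = CounterMap()
    get_counter = CounterMap()  # no callback registered, by design
    reported = {}

    def record(key):
        reported["GET"] = get_counter.get(key)
        reported["RUNNING"] = regular.get(key) - get_counter.get(key)

    regular.set_on_change_callback(record)

    regular.increment("task")            # step 1: task is RUNNING
    regular.flush_on_change_callbacks()  # metrics recorded after step 1
    get_counter.increment("task")        # step 2: Get is called
    if use_noop_update:
        regular.increment("task", 0)     # no-op update re-arms the callback
    regular.flush_on_change_callbacks()  # next metrics recording
    return reported
```

    Without the no-op update, the second flush has nothing pending on the regular map, so the reported values stay at RUNNING=1, GET=0. With it, the callback runs again and reports RUNNING=0, GET=1.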
    rkooo567 authored Feb 10, 2023
    b7e671d
  15. [core][state] State API scale losing data (#32408)

    We drop data at 10K by default. This increases the buffer size for now, until we figure out a way to store bursty task submissions.
    rickyyx authored Feb 10, 2023
    db9cfa6
  16. 613f4b0
  17. [AIR] Allow users to pass Callable[[torch.Tensor], torch.Tensor] to `TorchVisionTransform` (#32383)
    
    Transforms like RandomHorizontalFlip expect Torch tensors as input, but if you're applying the transform per-epoch, then you can't use ToTensor. To fix the problem, this PR updates TorchVisionPreprocessor to convert ndarray inputs to Torch tensors.
    
    You can't use ToTensor to convert the ndarrays to Torch tensors because then you'd be applying ToTensor twice, and your images would get scaled incorrectly.
    
    Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
    bveeramani authored Feb 10, 2023
    faeb2cc
  18. Add triage label to enhancement and doc issues as well (#32352)

    - Add triage label to enhancement and doc issues as well
    - Don't auto close issues pending triage
    
    Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
    jjyao authored Feb 10, 2023
    299d8f0
  19. [docs] removing docs referring ray client. (#32209)

    Why are these changes needed?
    Deprecating ray client related docs.
    scv119 authored Feb 10, 2023
    6879184
  20. [Doc] Document the top-k default scheduling strategy (#32331)

    Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
    jjyao authored Feb 10, 2023
    16a7683
  21. [Datasets] Update Ray Data documentation for lazy execution by default (1st part) (#32394)
    
    This updates the Ray Data documentation and code examples to reflect lazy execution by default. It covers the rest of the documentation not covered by #32387.
    
    Signed-off-by: Cheng Su <scnju13@gmail.com>
    c21 authored Feb 10, 2023
    08a8c65
  22. [ci][core] Do not set flushing thread niceness for task backend #32439

    We believe this has minimal impact on performance, so this reverts the unnecessary code.
    Signed-off-by: rickyyx <rickyx@anyscale.com>
    rickyyx authored Feb 10, 2023
    bc2de90
  23. [Datasets] [Docs] Update docs to reflect lazy-by-default execution model. (#32387)
    
    This PR updates the docs for a portion of the feature guides, the FAQ, the examples, and the docstrings for the Dataset, GroupedDataset, and read APIs, to reflect the new lazy-by-default execution semantics.
    clarkzinzow authored Feb 10, 2023
    ed640b6
  24. Use retriable_lifo policy for shuffle 1tb nightly test (#32417)

    Fix release blocker issue: #32203
    
    Ran 6 times and all of them passed.
    
    Signed-off-by: jianoaix <iamjianxiao@gmail.com>
    jianoaix authored Feb 10, 2023
    dade595
  25. 2874e47
  26. [Autoscaler] Make ~/.bashrc optional in autoscaler commands (#32393)

    At the moment, autoscaler commands fail (and head node set up fails) if the user doesn't have a .bashrc. This seems like an unnecessary requirement for startup.
    
    There's also a completely pointless true &&, which looks like an artifact from someone's refactor.
    ckw017 authored Feb 10, 2023
    37086a5
  27. [core] Force kill worker whose job has exited (#32217)

    ## Why are these changes needed?
    
    Currently, the worker leaks when the task references some global import like tensorflow. A couple of issues led to this bug:
    
    - when the worker finishes executing, it does not clean up all its borrowed references
    - the reference counting code treats a borrowed reference as something it owns
    - if the worker thinks it owns references, it will not exit
    - the worker pool will not force exit an idle worker, even if the job is dead, if the worker refuses to exit due to the aforementioned object ownership
    
    This PR implements the logic in the worker pool to force kill an idle worker whose job has exited.
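
    The force-kill decision can be sketched as follows. This is a simplified Python model, not the actual worker pool code: the `Worker` dataclass, its fields, and `kill_idle_workers` are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Worker:
    """Hypothetical idle worker record (illustrative fields only)."""
    worker_id: str
    job_id: str
    owns_references: bool = False  # worker believes it owns objects
    alive: bool = True


def kill_idle_workers(idle_workers: List[Worker], finished_jobs: Set[str]) -> List[Worker]:
    """Force kill idle workers whose job has exited; return the workers kept.

    A worker of a finished job is killed even when it claims to own
    references, since nothing can use those objects once the job is dead.
    """
    remaining = []
    for worker in idle_workers:
        if worker.job_id in finished_jobs:
            worker.alive = False  # force kill, ignoring owns_references
        else:
            remaining.append(worker)
    return remaining
```

    The key design choice is that job liveness overrides the worker's own view of object ownership, which is what prevents the leak described above.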
    clarng authored Feb 10, 2023
    704fd4a
  28. [Datasets] Make ray.data.from_* APIs lazy. (#32390)

    This PR makes the ray.data.from_*() APIs lazy.
    clarkzinzow authored Feb 10, 2023
    9a04119

Commits on Feb 11, 2023

  1. Fix doc test for dataset.py (#32458)

    Signed-off-by: Cheng Su <scnju13@gmail.com>
    c21 authored Feb 11, 2023
    b3b0336
  2. [RLlib] Shared encoder MARL unittest and example (#32460)

    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 11, 2023
    80e982b

Commits on Feb 13, 2023

  1. 4c52789
  2. cacc982
  3. [ActorInit] Fix Bug in Actor creation (#32277)

    In #28149 RayActorError is called with a str as cause, but this is not an accepted type. This leads to hitting the assertion error in the else case: assert isinstance(cause, ActorDiedErrorContext) on L283.
    ijrsvt authored Feb 13, 2023
    2e9b834
  4. Fix typo in README.md (#32466)

    Signed-off-by: Pratik <pratikrajput1199@gmail.com>
    prrajput1199 authored Feb 13, 2023
    997e95e
  5. [RLlib] Added test version of BC algorithm based on RLModules and RLTrainers (#32471)
    
    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 13, 2023
    4ffa7fd
  6. [tune] Move experiment state/checkpoint/resume management into a separate file (#32457)
    
    Experiment state management is currently convoluted.
    We keep track of many duplicate variables, e.g. local/remote checkpoint dirs and syncers.
    The resume/syncing logic also takes up a lot of space in the trial runner.
    
    Saving and restoring experiment state is orthogonal to the actual trial lifecycle logic, thus it makes sense to separate this out. In the same go, I've removed a lot of duplicated state and simplified some APIs that will also make it easier to test the experiment state component separately.
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 13, 2023
    7e662dd
  7. [Jobs] Improve error message in case of 404 (#31120)

    An identical error message is returned in multiple cases if something goes wrong when pinging the api/version endpoint. This PR adds more information to the error message in the case where the endpoint returns 404, to help with debugging.
    architkulkarni authored Feb 13, 2023
    6de3cbe
  8. [Datasets] Track bundles object store utilization as soon as they're added to an operator (#32482)
    
    This PR ensures that the object store utilization for a bundle is still tracked when it's queued internally by an operator, e.g. MapOperator queueing bundles for the sake of bundling up to a minimum bundle size, or due to workers not yet being ready for dispatch.
    clarkzinzow authored Feb 13, 2023
    80f2161
  9. [tune/train] clean up tune/train result output (#32234)

    * [tune/train] remove duplicated keys in tune/train results.
    * timestamp
    * result_timestamp defaults to None
    * fix test
    * fix progress_reporter test.
    * .get(, None)
    * fix test
    * fix test_gpu
    * WORKER_
    
    Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
    xwjiang2010 authored Feb 13, 2023
    e71c63f
  10. [ci][core] Calculate actor creation time properly for stress_test_many_tasks (#32438)
    
    Signed-off-by: rickyyx <rickyx@anyscale.com>
    
    We are calculating actor creation task submission time, which is less useful for this test.
    rickyyx authored Feb 13, 2023
    e56665e
  11. [tune] Structure refactor: Raise on import of old modules (#32486)

    Following our tune package restructure (https://github.com/ray-project/ray/pulls?q=is%3Apr+in%3Atitle+%5Btune%2Fstructure%5D), we have now had 3 releases where we logged a warning (2.0-2.3). For 2.4, we should raise an error instead. For 2.5, we can remove the old files/packages.
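
    A generic sketch of the "raise on import" pattern: the old module is reduced to a stub whose only statement raises, so any stale import fails immediately with a pointer to the new location. The module names below are hypothetical, and the use of `ImportError` is an assumption for this example; the exception type actually used in the PR is not stated here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module names, used only for this illustration.
OLD_MODULE = "old_tune_module"
NEW_MODULE = "new_package.new_tune_module"

# Source of the stub that replaces the old module: it raises on import.
STUB_SOURCE = (
    "raise ImportError(\n"
    f"    'Module `{OLD_MODULE}` has moved to `{NEW_MODULE}`. '\n"
    "    'Please update your imports.'\n"
    ")\n"
)


def import_old_module():
    """Write the stub to a temporary directory, import it, return the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f"{OLD_MODULE}.py").write_text(STUB_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(OLD_MODULE)
        except ImportError as exc:
            return str(exc)
        finally:
            sys.path.remove(tmp)
    return None
```

    Compared to a logged warning, a hard failure at import time cannot be missed, which is the point of the intermediate release before the old files are deleted.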
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 13, 2023
    2cee078
  12. [Doc] Add data ingestion clarification for AIR converting existing pytorch code example (#32058)
    
    The example under the Ray AI Runtime/Example section directly uses native PyTorch datasets for data loading. It's good to clarify that the current approach is for simplicity; the recommended approach is to use a Ray dataset.
    
    Signed-off-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter>
    Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
    Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MBP.local.meter>
    Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
    Co-authored-by: Yunxuan Xiao <yunxuanx@Yunxuans-MacBook-Pro.local>
    4 people authored Feb 13, 2023
    91940e3

Commits on Feb 14, 2023

  1. [Datasets] Always preserve order for the BulkExecutor. (#32437)

    This PR always preserves order for the bulk executor. We may revisit this in the future, at which point we'd update all of the tests that rely on order preservation.
    
    clarkzinzow authored Feb 14, 2023
    71dfd20
  2. [Tune] Fix docstring failures (#32484)

    This PR fixes the `Stopper` doctests that are erroring. Previously, it used a `tune.Trainable` as its trainable, which would error on fit since its methods are not implemented. Also, it was missing some imports.
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    justinvyu authored Feb 14, 2023
    421b527
  3. bc01288
  4. [RLlib] Allow MARLModule customization from algorithm config (#32473)

    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
    kouroshHakha authored Feb 14, 2023
    a447cbb
  5. [tune] Fix resuming from cloud storage (+ test) (#32504)

    #32457 refactored the experiment checkpoint management but introduced a bug where state is not correctly restored anymore. This was caught by a unit test error. This PR resolves the bug and makes sure the test passes.
    
    Signed-off-by: Kai Fricke <kai@anyscale.com>
    krfricke authored Feb 14, 2023
    efc432b
  6. [Doc] Restructure core API docs (#32236)

    Similar to #31204, refactor the core api reference for better layout and view.
    
    Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
    jjyao authored Feb 14, 2023
    99d00ad
  7. Deflake test_dataset.py: split torch tests (#32487)

    One source of flakiness in test_dataset.py is the timeout. This splits the torch tests out of this big test file.
    
    #32067
    jianoaix authored Feb 14, 2023
    b89457a
  8. f0d96c5
  9. 3414797
  10. [Datasets] Add logical operator for aggregate (#32462)

    This PR adds a logical operator for group-by aggregate. The changes include:
    * `Aggregate`: the logical operator for aggregate
    * `generate_aggregate_fn`: the generated function for aggregate operator
    * `SortAggregateTaskSpec`: the task spec for doing sort-based aggregate, mostly refactored from [_GroupbyOp](https://github.com/ray-project/ray/blob/master/python/ray/data/grouped_dataset.py#L35).
    c21 authored Feb 14, 2023
    66c0533
  11. [tune] Fix two tests after structure refactor deprecation (#32517)

    #32486 introduced two test failures after hard-deprecating a structure refactor. This PR fixes these two stale imports.
    
    Signed-off-by: Kai Fricke <coding@kaifricke.com>
    krfricke authored Feb 14, 2023
    d092b12
  12. [AIR][Train][Doc] Restructure API reference (#32360)

    This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class.
    
    This PR is a followup to #31204 and #32311, which made the same changes for Ray Data/Tune docs.
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    justinvyu authored Feb 14, 2023
    d87d86f
  13. Fix autosummary to show docstring of class members (#32520)

    By default, autosummary only shows one line for each class member instead of the entire docstring. Ideally, the fix would be to autosummary class members as well, but that generates too many doc pages and causes the doc build to time out. For now, default to showing the docstrings of class members in the class pages, with an explicit opt-in to autosummary class members.
    
    Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
    jjyao authored Feb 14, 2023
    19ca00b
  14. [core] Add opt-in flag for Windows and OSX clusters, update ray start output to match docs (#32409)
    
    Un-revert #31166.
    
    This PR cleans up a few usability issues around Ray clusters:
    
    - Makes some cleanups to the ray start log output to match the new documentation on Ray clusters. Mainly, de-emphasize Ray Client and recommend jobs instead.
    - Add an opt-in flag for enabling multi-node clusters for OSX and Windows. Previously, it was possible to start a multi-node cluster, but then any Ray programs would fail mysteriously after connecting to the cluster. Now, it will warn the user with an error message if the opt-in flag is not set.
    - Document multi-node support for OSX and Windows.
    
    Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
    Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
    stephanie-wang and architkulkarni authored Feb 14, 2023
    bf5e721
  15. [Data] Update DatasetPipeline.to_tf API to match with Dataset.to_tf (#32531)
    
    Signed-off-by: amogkam <amogkamsetty@yahoo.com>
    amogkam authored Feb 14, 2023
    9dcb369
  16. Revert "[data] Fix pandas import failures by moving it to a top-level data import (#32447)" (#32533)
    
    This reverts commit bc01288.
    cadedaniel authored Feb 14, 2023
    b12c0d1
  17. [Tune] Update trainable remote_checkpoint_dir upon actor reuse (#32420)
    
    This PR fixes trainable actor reuse to update the remote trial directory that it's writing checkpoints to.
    
    Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
    justinvyu authored Feb 14, 2023
    e8f1cf6
  18. b9f7e19