[Core] SkyPilot Client-Server Architecture #4660

Status: Open. Michaelvll wants to merge 26 commits into base: master

Michaelvll (Collaborator) commented Feb 6, 2025

This PR rearchitects SkyPilot into a client-server architecture, separating the backend from the frontend so that the backend can be deployed separately.

What is new?

  1. A way to deploy SkyPilot for an organization, giving a single-pane-of-glass view of its resources.
  2. Multi-tenant SkyPilot, allowing multiple users to collaborate (see everyone's resources, share resources, etc.)
  3. Asynchronous CLI/Python SDK
  4. A new set of RESTful APIs and API server-related interfaces

Architecture

[Image: architecture diagram]

Behaviors

Local API server (individual users)

Without a remote API server deployed, SkyPilot behaves as before: a local API server is started automatically.


Remote API server (multi-user organizations)

A user can now deploy a remote API server and point the local client at it. Multiple clients can then connect to the same API server, which gives a single-pane-of-glass view of the resources used by multiple users/clients.
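As a rough illustration of the local-versus-remote behavior, a client could pick its API server endpoint along these lines. This is a sketch, not the code in this PR: the 127.0.0.1 default is taken from the commit log below, while the environment variable name, the port, and the helper names `resolve_endpoint` and `is_local` are assumptions made for the example.

```python
import os
from urllib.parse import urlparse

# Assumed names: the env var and the port are illustrative, not confirmed by this PR.
ENDPOINT_ENV_VAR = "SKYPILOT_API_SERVER_ENDPOINT"
DEFAULT_LOCAL_ENDPOINT = "http://127.0.0.1:46580"


def resolve_endpoint(env=None) -> str:
    """Return the configured remote endpoint, or fall back to the local default."""
    env = os.environ if env is None else env
    return env.get(ENDPOINT_ENV_VAR) or DEFAULT_LOCAL_ENDPOINT


def is_local(endpoint: str) -> bool:
    """True when the endpoint is on this machine, so a client could auto-start a server."""
    return urlparse(endpoint).hostname in ("127.0.0.1", "localhost")


# No endpoint configured: the client talks to a local API server.
local = resolve_endpoint({})
# Endpoint configured: the client talks to a shared remote API server.
remote = resolve_endpoint({ENDPOINT_ENV_VAR: "https://sky.example.com"})
print(local, is_local(local))
print(remote, is_local(remote))
```

Passing the environment as a plain dict keeps the two cases easy to exercise without touching the real process environment.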


Disruptive Changes

  1. The SkyPilot Python SDK is now asynchronous by default: each call returns a future, a request_id, which you can wait on with sky.get(request_id) or stream with sky.stream_and_get(request_id).
  2. Storage objects no longer upload files eagerly at creation; the upload is deferred until a cluster/job is launched.
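The request-ID pattern from item 1 can be pictured with a small self-contained sketch. It deliberately does not import the SkyPilot SDK: `submit`, `get`, and `fake_launch` are hypothetical stand-ins, and a thread pool plays the role of the API server. It only shows the shape of "call returns an ID immediately, result is fetched later".

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict

# Hypothetical in-process stand-in for an API server; not SkyPilot code.
_executor = ThreadPoolExecutor(max_workers=2)
_requests: Dict[str, Future] = {}


def submit(fn: Callable[..., Any], *args: Any) -> str:
    """Schedule work and return immediately with a request ID."""
    request_id = uuid.uuid4().hex
    _requests[request_id] = _executor.submit(fn, *args)
    return request_id


def get(request_id: str) -> Any:
    """Block until the request finishes, then return its result."""
    return _requests[request_id].result()


def fake_launch(cluster_name: str) -> str:
    # Placeholder for a long-running operation such as a cluster launch.
    return f"{cluster_name} is UP"


request_id = submit(fake_launch, "dev")  # returns right away with an ID
result = get(request_id)                 # waits for completion
print(result)
```

With the real SDK the analogous flow, per the description above, would be to keep the request_id returned by an SDK call and pass it to sky.get(request_id) or sky.stream_and_get(request_id).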

For more user-facing details, see the API server docs hosted here: https://docs.skypilot.co/en/client-server/reference/api-server/api-server.html

For more developer-facing details, see the readme:

Tests

Local API Server

Remote API Server

An API server deployed on a GKE cluster, following: https://docs.skypilot.co/en/client-server/reference/api-server/api-server-admin-deploy.html

  • pytest tests/test_smoke.py --aws
  • pytest tests/test_smoke.py --gcp
  • pytest tests/test_smoke.py --kubernetes
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

TODO

  • Add more tests above

Acknowledgement

We thank all our users for providing valuable feedback during the alpha/beta testing, and all contributors who made this PR possible: @romilbhardwaj @concretevitamin @cg505 @KeplerC @zpoint @yika-luo @weih1121.

Squashed commit of the following:

commit 7889757299c0260fa4fc6e8c2d3b06484783aae7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 21:36:58 2025 +0000

    change to local cache for the k8s client for newly added credential

commit bf660981504544aeb15650938d754094d00faf22
Merge: 19f13d372 3e9bd9d0f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 21:30:59 2025 +0000

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi

commit 19f13d372b3fa7d590adcb490cc03ff7d2fa6f30
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 19:48:17 2025 +0000

    Fix the unit test

commit 72ea226c4b73c6e0eac0f4d55063a6507107c711
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 17:42:51 2025 +0000

    fix variables

commit 0dcc06b6bde5f2d24df1c7a8e31b703d54705e77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 07:57:26 2025 +0000

    remove unecessary endpoint

commit d54c74c59cac76ee56acfd339328144eb60d5929
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 07:49:56 2025 +0000

    refactor serve endpoint for less API calls

commit 912069e6f715d10f7d44c4acdd45d57c25de6689
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 07:27:56 2025 +0000

    Fix output

commit f8630cdba402882112fdc6821c586d897ee766ea
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Feb 6 02:36:39 2025 +0000

    fix bump seconds

commit 38e66d87fc56d03e5194d9ff18f58ff2194d680a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 23:11:11 2025 +0000

    Fix merging issue

commit 0e7ace80f84ee4845f006e63610f9f80c1b5928c
Merge: 99b81b6ff e7d94e956
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 23:00:03 2025 +0000

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi

commit 99b81b6ff810c595f4e9c5fbd4c8936692521fef
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 22:53:15 2025 +0000

    mypy

commit 0cadaf976bd3f4296250992d299128282e16f9f6
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 22:35:31 2025 +0000

    Fix serve status

commit 8d7849f3579e4a500efcc55c8a7e8023faf776e7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 22:28:22 2025 +0000

    Avoid clone-disk-from tests

commit f0026726c1f6cfe66658e6e4820816e3d7591af9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 22:14:09 2025 +0000

    Avoid detach setup

commit 9b8c0cbe61ede6b846bbc054f68653f6d1cdff62
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 01:09:04 2025 +0000

    Address comments

commit cfaa62e573f0345ad4ba70c1b33ba1163ba809df
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 00:56:22 2025 +0000

    pylint

commit 52bbdb7619c0a0334024d5a7e0eff02124e456e0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Feb 5 00:55:03 2025 +0000

    Address comments

commit 25a0c9255c1a166e0855f56ca0d53751b1528723
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Feb 4 22:41:21 2025 +0000

    docstr

commit 1bad3ebef6bafbd0eeba28b03f8bc6e310ff558a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Feb 4 22:13:42 2025 +0000

    Fix raise for resource parsing

commit 81d7d79070b1ebc711a3c0034b40778df4608d77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Feb 4 22:11:35 2025 +0000

    Fix env var for CPU/memory limit in pod

commit 11cff6b7370f148e74da0c56729dbe094d3d9f77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Feb 4 21:46:12 2025 +0000

    rename to LONG and SHORT

commit 75797766c9830d58d8c1c0123c2782dee8417aae
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Feb 4 21:03:02 2025 +0000

    Use 127.0.0.1 by default for API server endpoint

commit 41e53331ee5ae0b1119b8b24d950ff6bd13e5a37
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Feb 3 23:44:01 2025 +0000

    update logging for status refresh

commit 574fb6f0b6bf843384ee6f0070b40e540e06b87a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Feb 3 15:41:41 2025 -0800

    [Usage/SKY-1403] Fix the usage collection for API server (#180)

    * Fix usage for server side

    * minor update

    * Fix run id

    * fix client entrypoint cmd

    * rename function

    * directly use reset for usage messages and refactor a bit

    * minor comment update

    * disable usage for internal status refresh

commit 09801da1f23c125e539caa20b94689832b7b0c08
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Feb 3 04:46:40 2025 +0000

    Fix merge conflict for jobs launch

commit 7b86c84f4543ac092c5fb56474c32459b33c8d34
Merge: 52d499e2f 269dfb192
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Feb 3 04:27:46 2025 +0000

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi

commit 52d499e2f9c3bf3f75674bd6d160224b5f50729d
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Feb 2 08:09:49 2025 +0000

    format

commit af6a346342109c31b71ad7d793275de407c29ab8
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Feb 2 08:09:34 2025 +0000

    address comments

commit 983bf1f392783e81477ccac1f90d9c5b6b77e16e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 18:33:52 2025 +0000

    fix argument

commit 100a5d43154f4cf2ada5f154f40fc361dd35a9ec
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 18:33:15 2025 +0000

    Fix invocation

commit 66540e28ec8282fbaab263b440b4acd4ae4b36eb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 18:32:26 2025 +0000

    fix invocation

commit d62c79c64656723af85d9052282e0fc7e024729e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 18:27:25 2025 +0000

    make `need_confirmation` internal

commit 3f573afd2253410c320b3c134946c89b3d58e7cc
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 18:01:51 2025 +0000

    base exception

commit efdb5911fa0da3f4f5701f27bb26ac5b5395b252
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 06:40:40 2025 +0000

    format

commit f6964d398b6f2298bab23e2c82707cef6295c659
Merge: d7dd77d61 ac9f159db
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 06:39:33 2025 +0000

    Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi

commit d7dd77d61b5579a698015e43b67718f7a923d916
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 06:39:24 2025 +0000

    Address comment

commit ac9f159db4b9adb3a9822d30b0fb71e9119da3c7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 31 22:35:09 2025 -0800

    Update sky/data/storage.py

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

commit 935c8b94c14275ffa3646312a50e82589a8fd1f7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 05:43:59 2025 +0000

    Fix SKY-1383

commit f4f7653a2232e159ee4cfe55f0e9cc60baec0524
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Feb 1 02:13:49 2025 +0000

    Fix unit tests

commit 4347b3d18acd013972132d878187952298a81ffb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 31 23:57:14 2025 +0000

    use response 500 for failed requests

commit b6fea92010749e7593387ba9813e740363633751
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 31 23:05:46 2025 +0000

    Add TODO

commit 26938445c9436c72c50d35f8540e9ee54e06ca2c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 31 20:32:18 2025 +0000

    address comments

commit b6d5437665cb5dd84fc574b4452a3ea24297b034
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 31 17:46:27 2025 +0000

    fix tests

commit 432b586954c84be86ac36feea26eb5f892ad794e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 31 08:32:52 2025 +0000

    Address comments

commit 8ae605c993067d8b15fcd7eb91dbaeada750a8d6
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 31 04:10:31 2025 +0000

    address comments

commit 80b34b5a35f8cd4db081ddd528ee2a506835987a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 30 19:46:26 2025 +0000

    Address comments

commit 4e80ff8390715cfdf55a784be2bc7b6e8266f67a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Jan 25 19:51:56 2025 +0000

    Address comments

    commit 307140e45ba602b88f0d44da6ff232bbd45f22bb
    Author: Zhanghao Wu <zhanghao.wu@outlook.com>
    Date:   Sat Jan 25 18:39:16 2025 +0000

        Address comments

        Fix cancel body

        Fix

        Fix

        remove redis types

commit 252786fe73d47a5df725b29b4a2a1fa7d58a4aaa
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 18:35:19 2025 -0800

    [SKY-1374] Fix HTTPS keys (#179)

    * upload https keys

    * alternstive output

commit cb45ae3ab21a3ec1c3fa2567d080723378ef7703
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 23:23:05 2025 +0000

    Fix autodown

commit a471c092c4918b035a51f88b728e844f262ce643
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 23:04:04 2025 +0000

    Fix folder creation for log download

commit 4c2a438b24b349359e38ef37e042b6fa21bd5444
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 21:48:52 2025 +0000

    format

commit 32d8ffa197cdcd669b3c888c0967808c62799712
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 21:43:01 2025 +0000

    Fix cluster job downloading

commit 895f277e7cb80a2a43772499e7233f15b7130fbf
Merge: 70e7ebf61 1146cab36
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 20:35:04 2025 +0000

    Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi

commit 70e7ebf61cd36aecdd4d35f169b63df0d40ad9aa
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 20:23:03 2025 +0000

    Naming and comments

commit 1146cab3694a34bf9cb66a4e244c069250afb47d
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 11:48:19 2025 -0800

    [SKY-1368] Support empty folder in file mounts and symlinks (#178)

    * upload symlink in file upload

    * format

    * remnant

    * fix relative

    * check relative path

    * Fix relative path in symlink target

    * Add unittest for zip and unzip

    * Add back symlink

    * fix circle link

commit a8bba42de13ea88b9e4fc9bfb2f121abb2663cfb
Author: zpoint <zp0int@qq.com>
Date:   Sat Jan 25 02:51:29 2025 +0800

    get rid of autouse on pytest (#176)

    * get rid of autouse

    * manual fixture

    * prevent reload

    * remove the prevent_reload

    * Avoid reload cli

    * format

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

commit 099b8be94f1e40ed84f734a0bd44f7ba9941587f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 09:52:54 2025 -0800

    [SKY-1366] Fix storage reconstruct (#177)

    * Fix storage reconstruct

    * Error out instead

    * Add action

    * Fix exception type

    * Fix logging

commit 62071d56bff05187cf2127429122ba6691ff05a1
Author: Hong <hong@assemblesys.com>
Date:   Fri Jan 24 20:31:28 2025 +0800

    [SKY-1200][UX] Valid sky server available while login (#134)

commit 4179a6c2f0c771cec2f5d758304f959937b04e04
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 07:57:18 2025 +0000

    move lock under logs folder

commit 5a66dae35ecc1efebbacb2a1d5fef9652cbcb8e1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 07:22:14 2025 +0000

    Fix config override error handling

commit ed641a16b5f7bf126ec0e6c8cab3b0ae836aa404
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 02:08:52 2025 +0000

    Expose CLOUD_REGISTRY

commit 60e211f27c8b2b82f601eb75b99afac9795d6e62
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 02:02:03 2025 +0000

    format

commit 9fcce2c977da60f875c26cc4140e87fc293f98c1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 02:00:37 2025 +0000

    [SKY-1367] Fix validate exception serialization and deserialization

commit a00748b33e3d062f4bc9afaee631f05a18d85b45
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 00:57:09 2025 +0000

    Fix deepspeed setup

commit c81a20a5844d752287f951b65566d5d1e4780b69
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 00:39:53 2025 +0000

    remove api server deployment

commit c2a4f49a2adf10099b87f33c5de2d9956ef69d00
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 24 00:34:34 2025 +0000

    Fix dimming

commit 5ff29f469d68e114fc2a7305b89bde4cb62d4ba2
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 23 16:32:26 2025 -0800

    merge issue

commit 3994e17be460f15bbbb2834e78a29b566b5a3539
Merge: befbf5114 97b8e8f12
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 23 22:35:21 2025 +0000

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi

commit befbf51143a9e1735298a585b0ac84250a79d23b
Author: zpoint <zp0int@qq.com>
Date:   Thu Jan 23 14:56:33 2025 +0800

    SKY-1034 [Tests] Fix tests with no need for credentials (#51)

    * add restapi for tests

    * fix storage

    * add TODO

    * uncomment CI test

    * pass api test and unit test

    * fix dryrun bug

    * fix test_config

    * fix test_jobs

    * fix for jobs and serve

    * bug fix

    * fix test jobs and serve

    * support enable_all_clouds on server side

    * comment out fail test cases

    * bug fix and reformat

    * temp comment

    * pass tests

    * test config uncomment

    * restructure fixture for faster test

    * restore change by cursor

    * test test_list_accelerators

    * bug fix

    * fix test cli

    * fix all

    * test CI

    * bug fix

    * rename file

    * Update sky/cli.py

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * test_config pass final test

    * test_config pass final test

    * use patch and import strategy

    * format

    * bug fix

    * local test behaviour

    * fix reload issue

    * yapf

    * remove test controller util

    * fix test cases

    * resolve PR comment

    * resolve merge conflict

    * fix k8s test after merging restapi

    * resolve comment

    * resolve PR comment

    * debug

    * debug

    * bug fix

    * api change

    * fix PR comment

    * change stream method

    * fix

    * cli fix

    * Update tests/conftest.py

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * Update tests/test_cli.py

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * Update tests/test_cli.py

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * fix

    * resolve PR comment

    * restore class

    * support stream mock

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

commit 72d85e942b91905f41807765d2ad141eae7438ce
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Jan 22 22:41:28 2025 -0800

    [SKY-1357] Fix jobs dashboard on Kubernetes (#175)

    * Add port forward command support

    * fix port forward command

    * format

    * Add comment

commit 0c39a8cca617b736e8c675bef08c1a8082167de8
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Jan 22 18:53:58 2025 -0800

    [SKY-1363] `sky down sky-jobs...` request fail to cancel on-going request for `sky jobs launch` (#174)

    * Cancel jobs request when downing

    * Add column for api status

    * Add comment

    * minor

commit d582284dc4ddb2cad5282ca03829144acb275644
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Jan 22 18:51:05 2025 -0800

    [SKY-1323] Add timestamps for server logs (#173)

    * Add timestamps for server log

    * comments

    * format

commit fffc1535530f52c888e52c9176318918220e1758
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Wed Jan 22 15:58:22 2025 -0800

    [SKY-1223] Docs for multi-k8s support, exec based kubeconfig converter  (#164)

    * Add docs and kubeconfig converter

    * lint

    * update docs

    * update docs

commit 3ab843ffbb46bc87b4bcc75c113c618142a1ca77
Author: zpoint <zp0int@qq.com>
Date:   Thu Jan 23 06:04:30 2025 +0800

    [SKY-1009] [Robust] Make streaming requests (sky logs , sky jobs logs , etc) synchronous (#155)

    * tail logs

    * circular import

    * print end

    * bug fix

    * fix

    * core function

    * resolve PR comment

    * mypy

    * bug fix

    * type

    * resolve PR comment

    * restore change

    * resolve merge conflict

    * Architecture: Add changes with dicussed offline

    * minor movement

    * format

    * format

    * Fix import

    * fix imports

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

commit 3e75a0f4d3756fb8c469e94088c360d5fdcbe4a4
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Wed Jan 22 13:07:39 2025 -0800

    [SKY-1329] Better logging when API server fails to start (#167)

    * Better logging when API server fails to start

    * lint

    * comments

    * lint

    * lint

commit 179ed3391ee8c5a7e554b9470965cca82a751432
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Jan 22 11:17:25 2025 -0800

    [SKY-1337] Refactor client common and fix jobs logs sync down (#165)

    * refactor and support log sync down for jobs

    * Refactor for file uploads

    * Fix log path

    * Fix jobs logs downloading

    * fix smoke test

    * fix smoke test

    * format

    * Add debugging

    * Add debug

    * Reuse request body

    * Remove debug lines

    * get the latest

    * fix jobs logs tests

    * fix grep JobStatus

commit a90928e11a872cbabb983fb766fa52485861ee66
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Jan 22 11:01:35 2025 -0800

    [SKY-1347] Cancel pending requests for `sky down` (#172)

    Cancel pending requests for `sky down` as well

commit 9167fed281428f8a64ecd39b834fa028d4864c94
Author: zpoint <zp0int@qq.com>
Date:   Wed Jan 22 18:01:08 2025 +0800

    add aiofiles support for pre-commit-config (#171)

    add aiofiles support

commit 8301271cb1504ad76abcf2ebf4b48d1979af71de
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Tue Jan 21 22:16:21 2025 -0800

    [SKY-1326] Minor logging fixes for requests (#169)

    Minor logging

commit 1865ecba8b01a03e436f4b5e65beb8031cac4258
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Tue Jan 21 18:57:55 2025 -0800

    [SKY-1269] Allow tasks to run in custom namespaces (#170)

    * incluster namespace selection

    * comment

    * lint

    * lint

commit c4edb603464a00a715b41df7bfbdfa7c1a30bd4a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Jan 21 16:31:47 2025 -0800

    [SKY-1115] Fix smoke tests with remote API server (no local credential)  and local API server (#161)

    * Run cloud storage checks in tests on a cluster

    * More fixes

    * add more fix

    * format

    * format

    * Fix tests that requires local credentials

    * refactor the cmd runner

    * use new cluster

    * Add handling for GCP cloud commands

    * Wait for the cloud cmd cluster to be up

    * fix storage deletion

    * fix api server endpoint

    * fix env

    * fix None for env

    * fix intermediate storage

    * increase timeout

    * skip bench

    * fix queue with docker

    * format

    * more robust termination

    * Fix error handling for controller

    * fix API call

    * format

    * format

    * fix gcp related tests

    * Additional fixes

    * fix clean up

    * change zone

    * avoid test if on k8s

    * longer wait for cloud cmd cluster

    * longer wait

    * longer wait

    * longer

    * avoid tailing ?

    * wait for running

    * Fix storage mounts k8s test

    * Fix msg

    * Fix managed jobs output

    * fix dependency installation

    * Add additional fix

    * longer wait time for running

    * longer wait

    * slgihtly longer for clean up the resources

    * fix output

    * Fix cluster name

    * fix cluster names

commit 020a20efa43be95e36a3222d7e7abaf2269d2d7f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Jan 19 14:24:31 2025 -0800

    [SKY-1348] Fix console mess up (#166)

    * Fix jobs logs

    * Add comments

commit 2bae2b5a6c3d52f589dca55b7faa0eb2a68cc5c5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Jan 19 11:52:51 2025 -0800

    [SKY-1339] Upload files by chunks (#163)

    * fixes

    * wip fixes

    * wip

    * minor fix

    * Fix bug for bad zip

    * fix chunk upload

    * update return value

    * fix multi thread issue

    * avoid io

    * Add cleanup for stale chunks

    * format

    * lint

    * Add comments

    * ux

    * fix cleanup

commit 189c686bf2a12071a21407085ef9fd258a30d800
Merge: 61f44a053 2354b818b
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Jan 19 19:21:07 2025 +0000

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi

commit 61f44a053c6778ccf5c6b28abc92006cee997f03
Merge: e87afbc6c 6b23582d9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 17 23:46:28 2025 +0000

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi

commit e87afbc6c2687ae864c944d8137a294f32094410
Merge: b0ea95f03 9e1b4ddc5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 17 07:15:09 2025 +0000

    Merge branch 'master' of github.com:assemble-org/skypilot into restapi

commit b0ea95f030f24c3a7f9936233d1c838eb953b7d1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 17 06:21:19 2025 +0000

    Fix job controller merging issue

commit 37f6fce6ad9b1723c32fc9157701adaf8e34c330
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Thu Jan 16 20:29:22 2025 -0800

    [SKY-1321] Fix concurrent requests to launch local API server (#162)

    * Fix concurrent requests to launch local API server

    * lint

commit 2b537b9b349fb848ae9d4a6da84f330be8fcdd8c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 17 03:42:59 2025 +0000

    format and unittest fix

commit bd65f74dce7588ae62efd58bc5d5021576213326
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 17 03:27:58 2025 +0000

    Fix cannonicalize accelerator implementation from master merging

commit a33a894b3e2df6a238a0fab727dcce01c95e0f5c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 17 01:37:31 2025 +0000

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi

commit 50671940daf8eb9713879a88480f230531e2cc7e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 16 15:21:41 2025 -0800

    [SKY-980] Refactor jobs queue and service status to avoid mp pool (#141)

    * Refactor jobs queue and service status to avoid mp pool

    * add comment

    * fix API cancel

commit 7c614835447302964ae767aad8b8d283e7198411
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Thu Jan 16 11:35:51 2025 -0800

    [Docs] Updates based on onboarding feedback (#120)

    * Update docs

    * Add python version

commit 73a20398855a0a63ccb9306af2ae40446e18f29d
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Wed Jan 15 22:58:28 2025 -0800

    [SKY-373] Fix `sky local` for client server (#61)

    * Make sky local up work

    * lint

    * lint

    * Update checks

    * lint

    * Merge with restapi

    * revert log changes

    * fix

    * Rebase done

    * Comments and refactor

    * fixes

    * Change to __enter__ and add todo

    * lint

    * lint

    * fix package path

    * remove debug

    * Fix GPU detection

    * Logging fixes

    * Lint

commit 006ce5c2aaa4f8bfbf9c061195cf9dc84d3e54ec
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Jan 15 19:55:59 2025 -0800

    [SKY-915/SKY-1292] Clean up ssh entries and fix api logs exception handling (#157)

    * Remove ssh config for clusters not exist

    * Fix log path non-exist issue

    * Update sky/cli.py

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Elaborate comments

    ---------

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

commit 0a0cb59e43e84441812622ee50f2dd042660e787
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 16 00:46:35 2025 +0000

    Add back changes from #78

commit e2d3eac431d6c443244e8b5eba28b1f8a2c07180
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Wed Jan 15 15:50:36 2025 -0800

    [SKY-1045] Support GCP service account auth (#160)

    Update GCP service account docs

commit efd893df50e9b9e91dc988a9f47d8c5c7fd7a7f7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Jan 15 22:32:47 2025 +0000

    Fix SKY-1303

commit 36e638f677de165dff5bfe6d278f29633f1db49a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Jan 14 19:28:44 2025 -0800

    [SKY-1288] Fix python SDK docs (#156)

    * Fix docs for SDK

    * Split into items

    * Add comments for API endpoint env var

    * format

    * fix docs

    * Add docs for serve

commit 5b686e34d0843d7f399584302e5758a621263dcd
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Jan 14 17:31:23 2025 -0800

    [SKY-1217] Fix log download test (#159)

    * Fix log download test

    * Add comment

commit 4757b78d389188ec89810e82e6186c5c648f4ab7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Jan 14 16:34:05 2025 -0800

    [SKY-1095] Fix serve down issue with storage deletion (#158)

    Fix serve down issue with storage deletion

commit c8bb92c7c5fca19939242fd10f64d9cdc9c7327f
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Tue Jan 14 14:14:09 2025 -0800

    [SKY-1233] Handle non-sky exceptions and airflow updates (#151)

    * Add exception wrapping and update airflow

    * Add git branch

    * Update airflow and docs

    * Update airflow and docs

    * lint

commit 0b4530f38e35cbfea513992ce9c189698abc4365
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Tue Jan 14 12:51:53 2025 -0800

    [SKY-1258] Fix sky api start error message when using remote API server (#152)

    * Fix sky api start logic

    * lint

    * lint

    * comments

    * simplify logging logic

    * Add docs and comments

    * Add docs and comments

commit 6dd380e7e316bc7c0051f955ba5f326c4e0d454c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Jan 14 11:14:08 2025 -0800

    [SKY-1291] Fix CRLF output (#153)

    * Fix CRLF output

    * fix last line

    * update comment

commit 79f49ab29e9f901cca143a4aec366fd946d2aa55
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Jan 13 20:40:24 2025 -0800

    [SKY-1108] Better logging for skypilot config override (#150)

    Better logging for skypilot config override

commit c28d7b49f6670893098ff63bf552ccea7c978dd7
Author: zpoint <zp0int@qq.com>
Date:   Mon Jan 13 14:01:37 2025 +0800

    type hint fix for tail logs (#149)

    type hint

commit 24d270b64c6fa480e0eb50583b198401e000902f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Jan 12 22:00:46 2025 -0800

    [SKY-1129] Avoid lru_cache across requests (#137)

    * Add reload for caches

    * Add comments

    * Fix docstr

    * Simplify

    * Add TODO

commit 270edfb8156e32cb395cae55bd37c44e13f2db1f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Jan 12 21:53:26 2025 -0800

    [UX] Prompt when launching on other's cluster (#146)

    * Prompt when launching on other's cluster

    * add import

    * format

    * Better logging

    * Only show hash when user

commit 37f0c823a1b86ddb38ce377e80863baa4d74ad6c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sun Jan 12 21:52:34 2025 -0800

    [UX] Add terminal completion and fix progress bar with `\r` (#148)

    * complete console

    * Add terminal completion

    * format

    * format

    * Fix \r

    * format

commit 1e3e5281e159735533a7eac9f98c21f45f21eab6
Author: zpoint <zp0int@qq.com>
Date:   Sun Jan 12 11:23:38 2025 +0800

    [SKY-1287]sky exec test echo hi doesn't show task logs (hi) (#147)

    * fix merge conflict overwritten

    * mypy

    * resolve comment

    * mypy

    * return type

commit a3db3e2e61f25c0518b6c868a3f1afaacf1b885c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 10 18:53:27 2025 -0800

    [SKY-1256] Avoid buffer for streaming hints (#145)

    * Avoid buffer for streaming hints

    * format

    * Better log streamer

    * format

    * format

    * format

    * remove cache

    * fix error handling

commit 786ae1d778889f89ba170d6e3db25774dec211ca
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Jan 11 02:42:49 2025 +0000

    Avoid refresh cluster status when cluster yaml is None

commit 6d31f31c9b867c0515f6c9ea3080ae63101d66f0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 10 13:50:11 2025 -0800

    [SKY-1278/SKY-1134] Add versioning for API server (#138)

    * Add versioning for API server

    * Add commit and version in the health call

    * refactor a bit

    * format

    * fix docstr

    * renaming

commit 212f72fe663e27a3f7f8e83da13e6e795d0831b9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 10 19:57:48 2025 +0000

    format

commit 5fbe71ea17551dfb22f1021db44dd81a98c6dfe2
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 10 19:56:22 2025 +0000

    format

commit e8a1f4736536020255430e9d1d61654881a0a7dc
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 10 19:55:18 2025 +0000

    Rename to API server

commit 55c648700419d9c2181206438fe2e30aaac483d8
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Fri Jan 10 10:32:56 2025 -0800

    Update health endpoint in docs (#144)

    Update health endpoint

commit 9ec1ed69f64019d23dc5760aec76e6c7d01bc66d
Author: zpoint <zp0int@qq.com>
Date:   Fri Jan 10 17:04:42 2025 +0800

    [SKY-988] [Robust] Refactor tail_logs to make it a uvicorn async call instead of an async executor call (#130)

    * sync log call

    * terminate worker

    * use sky abort

    * merge restapi

    * revert logging order

    * doc update

commit e1eed3bf0ddd987b26aa094a2d1407f127ab28d2
Author: Hong <hong@assemblesys.com>
Date:   Fri Jan 10 14:46:21 2025 +0800

    [UX] set `RUNNING` request color to green  (#142)

commit 961782a0d9b48939dea9ba2abcf5a53662d55052
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 9 22:34:18 2025 -0800

    [SKY-1273] Rename API server related CLIs (#139)

    * Rename API server related CLIs

    * Consolidate server logs

    * format

    * Add api login python API

    * format

    * Add comments and TODOs

    * Add TODO for endpoint env var

commit cfaf37e1e5204b42c3bfcac6fcbb7f301d5cfa18
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 9 18:16:47 2025 -0800

    Add user hash for existing cluster (#140)

    * Add user hash for existing cluster

    * format

    * format

commit c96710df70bf89fd20c373e04213bde54d9157be
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Fri Jan 10 01:05:37 2025 +0000

    Fix check multiple clouds

commit 466c4d6a767e1a82631b6f2794ca296e82904049
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 9 10:48:17 2025 -0800

    Clean up client file mounts during API server restarts (#135)

    * Refactor client server

    * add docstr

    * Fix types

    * Refactor for jobs/server

    * use type alias

    * Add todo

    * rename `check_health`

    * update docstr

    * Add comments

    * Update docstr

    * Adding docstr

    * Move

    * Add doc

    * style

    * format

    * fix redis

    * renames

    * naming

    * format

    * format

    * clean up clients file mounts as well

    * fix merging naming issue

    * format

    * format

commit e6cf03f050c2086382e21f3fa2a68fb0d0ade32b
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Jan 9 10:20:33 2025 -0800

    [SKY-1210] Fix or triage TODOs in restapi branch (#128)

    * Fix or triage TODOs in restapi branch

    * merge with restapi

    * comments

commit f73a871431a1a85584761f14b13e2e95e4d1a0d9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Jan 9 00:41:27 2025 -0800

    Address comments for restapi branch (#133)

    * Refactor client server

    * add docstr

    * Fix types

    * Refactor for jobs/server

    * use type alias

    * Add todo

    * rename `check_health`

    * update docstr

    * Add comments

    * Update docstr

    * Adding docstr

    * Move

    * Add doc

    * style

    * format

    * fix redis

    * renames

    * naming

    * format

    * format

    * Add comment

    * Avoid dumping to files

    * format

    * Update sky/server/requests/executor.py

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Add todo for legacy

    * use queue.Queue

    ---------

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

commit 8fa28098edb6a85631c9a95107857d5f3701a832
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Jan 7 19:34:12 2025 +0000

    remove poetry

commit fe9d7385b459754cf493eee3345281f2f97fc4a7
Merge: 344151261 6cf98a3cf
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Jan 7 18:39:54 2025 +0000

    Merge branch 'master' of github.com:assemble-org/skypilot into restapi

commit 34415126138d67254c16f830c875a9f9077f76d4
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Jan 6 23:45:19 2025 -0800

    Merge master (#131)

    * [perf] use uv for venv creation and pip install (#4414)

    * Revert "remove `uv` from runtime setup due to azure installation issue (#4401)"

    This reverts commit 0b20d568ee1af454bfec3e50ff62d239f976e52d.

    * on azure, use --prerelease=allow to install azure-cli

    * use uv venv --seed

    * fix backwards compatibility

    * really fix backwards compatibility

    * use uv to set up controller dependencies

    * fix python 3.8

    * lint

    * add missing file

    * update comment

    * split out azure-cli dep

    * fix lint for dependencies

    * use runpy.run_path rather than modifying sys.path

    * fix cloud dependency installation commands

    * lint

    * Update sky/utils/controller_utils.py

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * [Minor] README updates. (#4436)

    * [Minor] README touches.

    * update

    * update

    * make --fast robust against credential or wheel updates (#4289)

    * add config_dict['config_hash'] output to write_cluster_config

    * fix docstring for write_cluster_config

    This used to be true, but since #2943, 'ray' is the only provisioner.
    Add other keys that are now present instead.

    * when using --fast, check if config_hash matches, and if not, provision

    * mock hashing method in unit test

    This is needed since some files in the fake file mounts don't actually exist,
    like the wheel path.

    * check config hash within provision with lock held

    * address other PR review comments

    * rename to skip_if_no_cluster_updates

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * add assert details

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * address PR comments and update docstrings

    * fix test

    * update docstrings

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * address PR comments

    * fix lint and tests

    * Update sky/backends/cloud_vm_ray_backend.py

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * refactor skip_if_no_cluster_update var

    * clarify comment

    * format exception

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * [k8s] Add resource limits only if they exist (#4440)

    Add limits only if they exist

    * [robustness] cover some potential resource leakage cases (#4443)

    * if a newly-created cluster is missing from the cloud, wait before deleting

    Addresses #4431.

    * confirm cluster actually terminates before deleting from the db

    * avoid deleting cluster data outside the primary provision loop

    * tweaks

    * Apply suggestions from code review

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * use usage_intervals for new cluster detection

    get_cluster_duration will include the total duration of the cluster since its
    initial launch, while launched_at may be reset by sky launch on an existing
    cluster. So this is a more accurate method to check.

    * fix terminating/stopping state for Lambda and Paperspace

    * Revert "use usage_intervals for new cluster detection"

    This reverts commit aa6d2e9f8462c4e68196e9a6420c6781c9ff116b.

    * check cloud.STATUS_VERSION before calling query_instances

    * avoid try/catch when querying instances

    * update comments

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * smoke tests support storage mount only (#4446)

    * smoke tests support storage mount only

    * fix verify command

    * rename to only_mount

    * [Feature] support spot pod on RunPod (#4447)

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * resolve comments

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    ---------

    Co-authored-by: hwei <hwei@covariant.ai>

    * use lazy import for runpod (#4451)

    Fixes runpod import issues introduced in #4447.

    * [k8s] Fix show-gpus when running with incluster auth (#4452)

    * Add limits only if they exist

    * Fix incluster auth handling

    * Not mutate azure dep list at runtime (#4457)

    * add 1, 2, 4 size H100's to GCP (#4456)

    * add 1, 2, 4 size H100's to GCP

    * update

    * Support buildkite CICD and restructure smoke tests (#4396)

    * event based smoke test

    * more event based smoke test

    * more test cases

    * more test cases with managed jobs

    * bug fix

    * bump up seconds

    * merge master and resolve conflict

    * more test case

    * support test_managed_jobs_pipeline_failed_setup

    * support test_managed_jobs_recovery_aws

    * managed job status

    * bug fix

    * test managed job cancel

    * test_managed_jobs_storage

    * more test cases

    * resolve pr comment

    * private member function

    * bug fix

    * restructure

    * fix import

    * buildkite config

    * fix stdout problem

    * update pipeline test

    * test again

    * smoke test for buildkite

    * remove unsupported cloud for now

    * merge branch 'reliable_smoke_test_more'

    * bug fix

    * bug fix

    * bug fix

    * test pipeline pre merge

    * build test

    * test again

    * trigger test

    * bug fix

    * generate pipeline

    * robust generate pipeline

    * refactor pipeline

    * remove runpod

    * hot fix to pass smoke test

    * random order

    * allow parameter

    * bug fix

    * bug fix

    * exclude lambda cloud

    * dynamic generate pipeline

    * fix pre-commit

    * format

    * support SUPPRESS_SENSITIVE_LOG

    * support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log

    * support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log

    * add backward_compatibility_tests to pipeline

    * pip install uv for backward compatibility test

    * import style

    * generate all cloud

    * resolve PR comment

    * update comment

    * naming fix

    * grammar correction

    * resolve PR comment

    * fix import

    * fix import

    * support gcp on pre merge test

    * no gcp test case for pre merge

    * [k8s] Make node termination robust (#4469)

    * Add limits only if they exist

    * retry deletion

    * lint

    * lint

    * comments

    * lint

    * [Catalog] Bump catalog schema version (#4470)

    * Bump catalog schema version

    * trigger CI

    * [core] skip provider.availability_zone in the cluster config hash (#4463)

    skip provider.availability_zone in the cluster config hash

    * remove sky jobs launch --fast (#4467)

    * remove sky jobs launch --fast

    The --fast behavior is now always enabled. This was unsafe before but since
    #4289 it should be safe.

    We will remove the flag before 0.8.0 so that it never touches a stable version.

    sky launch still has the --fast flag. This flag is unsafe because it could cause
    setup to be skipped even though it should be re-run. In the managed jobs case,
    this is not an issue because we fully control the setup and know it will not
    change.

    * fix lint

    * [docs] Change urls to docs.skypilot.co, add 404 page (#4413)

    * Add 404 page, change to docs.skypilot.co

    * lint

    * [UX] Fix unnecessary OCI logging (#4476)

    Sync PR: fix-oci-logging-master

    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

    * [Example] PyTorch distributed training with minGPT (#4464)

    * Add example for distributed pytorch

    * update

    * Update examples/distributed-pytorch/README.md

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Update examples/distributed-pytorch/README.md

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Update examples/distributed-pytorch/README.md

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Update examples/distributed-pytorch/README.md

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Update examples/distributed-pytorch/README.md

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Update examples/distributed-pytorch/README.md

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Update examples/distributed-pytorch/README.md

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Fix

    ---------

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

    * Add tests for Azure spot instance (#4475)

    * verify azure spot instance

    * string style

    * echo

    * echo vm detail

    * bug fix

    * remove comment

    * rename pre-merge test to quicktest-core (#4486)

    * rename to test core

    * rename file

    * [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu (#4337)

    * [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu

    Signed-off-by: nkwangleiGIT <nkwanglei@126.com>

    * fix format issue

    Signed-off-by: nkwangleiGIT <nkwanglei@126.com>

    ---------

    Signed-off-by: nkwangleiGIT <nkwanglei@126.com>

    * [k8s] Fix IPv6 ssh support (#4497)

    * Add limits only if they exist

    * Fix ipv6 support

    * Fix ipv6 support

    * [Serve] Add and adopt least load policy as default policy. (#4439)

    * [Serve] Add and adopt least load policy as default policy.

    * Docs & smoke tests

    * error message for different lb policy

    * add minimal example

    * fix

    * [Docs] Update logo in docs (#4500)

    * WIP updating Elisa logo; issues with light/dark modes

    * Fix SVG in navbar rendering by hardcoding SVG + defining text color in css

    * Update readme images

    * newline

    ---------

    Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

    * Replace `len()` Zero Checks with Pythonic Empty Sequence Checks (#4298)

    * style: mainly replace len() comparisons with 0/1 with pythonic empty sequence checks

    * chore: more typings

    * use `df.empty` for dataframe

    * fix: more `df.empty`

    * format

    * revert partially

    * style: add back comments

    * style: format

    * refactor: `dict[str, str]`

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    ---------

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * [Docs] Fix logo file path (#4504)

    * Add limits only if they exist

    * rename

    * [Storage] Show logs for storage mount (#4387)

    * commit for logging change

    * logger for storage

    * grammar

    * fix format

    * better comment

    * resolve copilot review

    * resolve PR comment

    * remove unused var

    * Update sky/data/data_utils.py

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * resolve PR comment

    * update comment for get_run_timestamp

    * rename backend_util.get_run_timestamp to sky_logging.get_run_timestamp

    ---------

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * [Examples] Update Ollama setup commands (#4510)

    wip

    * [OCI] Support OCI Object Storage  (#4501)

    * OCI Object Storage Support

    * example yaml update

    * example update

    * add more example yaml

    * Support RClone-RPM pkg

    * Add smoke test

    * ver

    * smoke test

    * Resolve dependency conflict between oci-cli and runpod

    * Use latest RClone version (v1.68.2)

    * minor optimize

    * Address review comments

    * typo

    * test

    * sync code with repo

    * Address review comments & more testing.

    * address one more comment

    * [Jobs] Allowing to specify intermediate bucket for file upload (#4257)

    * debug

    * support workdir_bucket_name config on yaml file

    * change the match statement to if else due to mypy limit

    * pass mypy

    * yapf format fix

    * reformat

    * remove debug line

    * all dir to same bucket

    * private member function

    * fix mypy

    * support sub dir config to separate to different directory

    * rename and add smoke test

    * bucketname

    * support sub dir mount

    * private member for _bucket_sub_path and smoke test fix

    * support copy mount for sub dir

    * support gcs, s3 delete folder

    * doc

    * r2 remove_objects_from_sub_path

    * support azure remove directory and cos remove

    * doc string for remove_objects_from_sub_path

    * fix sky jobs subdir issue

    * test case update

    * rename to _bucket_sub_path

    * change the config schema

    * setter

    * bug fix and test update

    * delete bucket depends on user config or sky generated

    * add test case

    * smoke test bug fix

    * robust smoke test

    * fix comment

    * bug fix

    * set the storage manually

    * better structure

    * fix mypy

    * Update docs/source/reference/config.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * Update docs/source/reference/config.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * limit creation for bucket and delete sub dir only

    * resolve comment

    * Update docs/source/reference/config.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * Update sky/utils/controller_utils.py

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * resolve PR comment

    * bug fix

    * bug fix

    * fix test case

    * bug fix

    * fix

    * fix test case

    * bug fix

    * support is_sky_managed param in config

    * pass param intermediate_bucket_is_sky_managed

    * resolve PR comment

    * Update sky/utils/controller_utils.py

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * hide bucket creation log

    * reset green color

    * rename is_sky_managed to _is_sky_managed

    * bug fix

    * retrieve _is_sky_managed from stores

    * propagate the log

    ---------

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * [Core] Deprecate LocalDockerBackend (#4516)

    Deprecate local docker backend

    * [docs] Add newer examples for AI tutorial and distributed training (#4509)

    * Update tutorial and distributed training examples.

    * Add examples link

    * add rdvz

    * [k8s] Fix L40 detection for nvidia GFD labels (#4511)

    Fix L40 detection

    * [docs] Support OCI Object Storage (#4513)

    * Support OCI Object Storage

    * Add oci bucket for file_mount

    * [Docs] Disable Kapa AI (#4518)

    Disable kapa

    * [DigitalOcean] droplet integration (#3832)

    * init digital ocean droplet integration

    * abbreviate cloud name

    * switch to pydo

    * adjust polling logic and mount block storage to instance

    * filter by paginated

    * lint

    * sky launch, start, stop functional

    * fix credential file mounts, autodown works now

    * set gpu droplet image

    * cleanup

    * remove more tests

    * atomically destroy instance and block storage simultaneously

    * install docker

    * disable spot test

    * fix ip address bug for multinode

    * lint

    * patch ssh from job/serve controller

    * switch to EA slugs

    * do adaptor

    * lint

    * Update sky/clouds/do.py

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * Update sky/clouds/do.py

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * comment template

    * comment patch

    * add h100 test case

    * comment on instance name length

    * Update sky/clouds/do.py

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * Update sky/clouds/service_catalog/do_catalog.py

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * comment on max node char len

    * comment on weird azure import

    * comment acc price is included in instance price

    * fix return type

    * switch with do_utils

    * remove broad except

    * Update sky/provision/do/instance.py

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * Update sky/provision/do/instance.py

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * remove azure

    * comment on non_terminated_only

    * add open port debug message

    * wrap start instance api

    * use f-string

    * wrap stop

    * wrap instance down

    * assert credentials and check against all contexts

    * assert client is None

    * remove pending instances during instance restart

    * wrap rename

    * rename ssh key var

    * fix tags

    * add tags for block device

    * f strings for errors

    * support image ids

    * update do tests

    * only store head instance id

    * rename image slugs

    * add digital ocean alias

    * wait for docker to be available

    * update requirements and tests

    * increase docker timeout

    * lint

    * move tests

    * lint

    * patch test

    * lint

    * typo fix

    * fix typo

    * patch tests

    * fix tests

    * no_mark spot test

    * handle 2cpu serve tests

    * lint

    * lint

    * use logger.debug

    * fix none cred path

    * lint

    * handle get_cred path

    * pylint

    * patch for DO test_optimizer_dryruns.py

    * revert optimizer dryrun

    ---------

    Co-authored-by: Tian Xia <cblmemo@gmail.com>
    Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>

    * [Docs] Refactor pod_config docs (#4427)

    * refactor pod_config docs

    * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst

    Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

    * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst

    Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

    ---------

    Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

    * [OCI] Set default image to ubuntu LTS 22.04 (#4517)

    * set default gpu image to skypilot:gpu-ubuntu-2204

    * add example

    * remove comment line

    * set cpu default image to 2204

    * update change history

    * [OCI] 1. Support specify OS with custom image id. 2. Corner case fix (#4524)

    * Support specify os type with custom image id.

    * trim space

    * nit

    * comment

    * Update intermediate bucket related doc (#4521)

    * doc

    * Update docs/source/examples/managed-jobs.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * Update docs/source/examples/managed-jobs.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * Update docs/source/examples/managed-jobs.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * Update docs/source/examples/managed-jobs.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * Update docs/source/examples/managed-jobs.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * Update docs/source/examples/managed-jobs.rst

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * add tip

    * minor changes

    ---------

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * [aws] cache user identity by 'aws configure list' (#4507)

    * [aws] cache user identity by 'aws configure list'

    Signed-off-by: Aylei <rayingecho@gmail.com>

    * refine get_user_identities docstring

    Signed-off-by: Aylei <rayingecho@gmail.com>

    * address review comments

    Signed-off-by: Aylei <rayingecho@gmail.com>

    ---------

    Signed-off-by: Aylei <rayingecho@gmail.com>

    * [k8s] Add validation for pod_config #4206 (#4466)

    * [k8s] Add validation for pod_config #4206

    Check pod_config when run 'sky check k8s' by using k8s api

    * update: check pod_config when launch

    check merged pod_config during launch using k8s api

    * fix test

    * ignore check failed when test with dryrun

    if there is no kube config in env, ignore ValueError when launch
    with dryrun. For now, we don't support check schema offline.

    * use deserialize api to check pod_config schema

    * test

    * create another api_client with no kubeconfig

    * test

    * update error message

    * update test

    * test

    * test

    * Update sky/backends/backend_utils.py

    ---------

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * [core] fix wheel timestamp check (#4488)

    Previously, we were only taking the max timestamp of all the subdirectories of
    the given directory. So the timestamp could be incorrect if only a file changed,
    and no directory changed. This fixes the issue by looking at all directories and
    files given by os.walk().
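
    A minimal sketch of the idea described above, not the actual SkyPilot
    code: take the newest modification time over every directory and file
    that os.walk() yields. The function name `latest_mtime` is hypothetical
    and used only for illustration.

```python
import os


def latest_mtime(root: str) -> float:
    """Return the newest mtime under `root`, counting files and directories.

    Hypothetical helper for illustration; not SkyPilot's implementation.
    """
    latest = os.stat(root).st_mtime
    for dirpath, dirnames, filenames in os.walk(root):
        # Checking only `dirnames` would miss an in-place edit to a file,
        # because rewriting a file's contents does not update the mtime of
        # its parent directory.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # Do not follow symlinks, so dangling links do not raise.
            mtime = os.stat(path, follow_symlinks=False).st_mtime
            latest = max(latest, mtime)
    return latest
```

    Comparing this value against the timestamp of a previously built wheel
    is one way to decide whether a rebuild is needed.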

    * [docs] Add image_id doc in task YAML for OCI (#4526)

    * Add image_id doc for OCI

    * nit

    * Update docs/source/reference/yaml-spec.rst

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    ---------

    Co-authored-by: Tian Xia <cblmemo@gmail.com>

    * [UX] warning before launching jobs/serve when using a reauth required credentials (#4479)

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * wip

    * Update sky/backends/cloud_vm_ray_backend.py

    Minor fix

    * Update sky/clouds/aws.py

    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * wip

    * minor changes

    * wip

    ---------

    Co-authored-by: hong <hong@hongdeMacBook-Pro.local>
    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

    * [GCP] Activate service account for storage and controller (#4529)

    * Activate service account for storage

    * disable logging if not using service account

    * Activate for controller as well.

    * revert controller activate

    * Add comments

    * format

    * fix smoke

    * [OCI] Support reuse existing VCN for SkyServe (#4530)

    * Support reuse existing VCN for SkyServe

    * fix

    * remove unused import

    * format

    * [docs] OCI: advanced configuration & add vcn_ocid (#4531)

    * Add vcn_ocid configuration

    * Update config.rst

    * fix merge issues WIP

    * fix merging issues

    * fix imports

    * fix stores

    ---------

    Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
    Signed-off-by: Aylei <rayingecho@gmail.com>
    Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
    Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
    Co-authored-by: zpoint <zp0int@qq.com>
    Co-authored-by: Hong <weih1121@qq.com>
    Co-authored-by: hwei <hwei@covariant.ai>
    Co-authored-by: Yika <yikaluo@assemblesys.com>
    Co-authored-by: Seth Kimmel <seth.kimmel3@gmail.com>
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
    Co-authored-by: Lei <nkwanglei@126.com>
    Co-authored-by: Tian Xia <cblmemo@gmail.com>
    Co-authored-by: Andy Lee <andylizf@outlook.com>
    Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
    Co-authored-by: Hysun He <hysunhe@foxmail.com>
    Co-authored-by: Andrew Aikawa <asai@berkeley.edu>
    Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>
    Co-authored-by: Aylei <rayingecho@gmail.com>
    Co-authored-by: Chester Li <chaoleili2@gmail.com>
    Co-authored-by: hong <hong@hongdeMacBook-Pro.local>

commit 4d9c1d5f83a7ea2799800976f98bef8435552902
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Jan 6 13:06:15 2025 -0800

    [Client] Create ssh config folder (#105)

    Create ssh config folder

commit 397a6d9c80c1d63127699be654e1c445de8bb4db
Author: Yika <yikaluo@assemblesys.com>
Date:   Fri Jan 3 14:51:31 2025 -0800

    [UX][SKY-1085] Show more readable process command name in `htop` (#119)

    * [UX] Process naming

    * comment

commit c289b36e11bf5fe39bcf5dfbaae22feec7cc027d
Author: Yika <yikaluo@assemblesys.com>
Date:   Fri Jan 3 14:40:47 2025 -0800

    [UX][SKY-1081] More consistent logs for file syncing (#118)

    [UX][SKY-1081] Better logs for file syncing

commit f29a6de440408172bf5c7e054b50ee6742d0557b
Author: Yika <yikaluo@assemblesys.com>
Date:   Fri Jan 3 14:04:42 2025 -0800

    [SKY-1071] Fix sky api server_logs for remote API server (#129)

    * [SKY-1071] Fix sky api server_logs for remote API server

    * comments

commit ae30752777f7bc7778feda5b51f760337d066615
Author: Yika <yikaluo@assemblesys.com>
Date:   Mon Dec 23 11:20:31 2024 -0800

    [UX][SKY-1092] Colored request status for sky api ls (#116)

    * [UX][SKY-1092] Colored request status for sky api ls

    * address comments

    * fix type

    * nit

commit baf3fa340f625acb18e56d28cda9bcc99fc83ff5
Author: Yika <yikaluo@assemblesys.com>
Date:   Mon Dec 23 11:18:27 2024 -0800

    [UX][SKY-1086] Polish worker logs in API server (#117)

    * [UX][SKY-1086] Polish worker logs in API server

    * address comments

    * revert accident commit

commit a9f5f38c92e23a61fcb365b4fb62dd3959219c0c
Author: Yika <yikaluo@assemblesys.com>
Date:   Wed Dec 18 10:17:16 2024 -0800

    [SKY-1128] Fix sky storage cloud type mismatch issue (#109)

    Fix sky storage cloud type mismatch issue

commit 6318b38159c6754e5e950c08c4058176d88aea5c
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Tue Dec 17 16:57:17 2024 -0800

    [Docs] Quick updates to docs (#106)

    Updates

commit 591c52e8757eb182392a51cf583807c6fbe7e41e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Dec 17 16:22:01 2024 -0800

    Reload kubernetes utils to avoid caching issue (#102)

    * Reload kubernetes utils to avoid caching issue

    * format

commit d75fcd849c2dfef833a5afb3dfae35010afc1e06
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Tue Dec 17 16:01:27 2024 -0800

    [SKY-1107] Docs for deployment (#92)

    * Update admin docs

    * Updates

    * Update docs

    * Update docs

    * Add storage docs

    * Updates

commit 65554949e0f7743e964abf9c228c0d501087a5f5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Tue Dec 17 12:49:03 2024 -0800

    [Tests] Fix smoke test for `--kubernetes` and backward compatibility test on AWS (#80)

    * Remove stop test for k8s

    * Simplify the test k8s creation commands

    * use 5+

    * fix

    * fix

    * expanduser

    * add tpu mark

    * minore large cpu

    * correctly quote

    * format

    * get old jobs as well

    * fix launch on existing

    * fix

    * fix back

    * fix back compat

    * minor fix

    * fix

    * Fix server side validation

    * remove debug

    * rename back to cloudvmray

    * rename back to cloudvmray

    * fix aws cli on kubernetes pod

    * format

    * use uv

    * fix status -r in backward compat

    * Fix path

    * format

    * less \t

    * try less gap for waiting

    * Add a timeout for curl to avoid long waiting

    * Add max time as well

    * shorter sleep

commit 8af911e4ad36a8f9da77f5186b18e716503a05a4
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Tue Dec 17 00:03:37 2024 -0800

    [Sky-1123] Fix proxy command for SSH (#94)

    Fix proxy command

commit ed724c52a458fbac9d81fa0134f0ff1612b0c531
Author: Yika <yikaluo@assemblesys.com>
Date:   Mon Dec 16 18:11:31 2024 -0800

    [SKY-1110] Fix sky server controller name issue (#91)

    * Remove controller name reference on client side

    * format

    * add comments

    * rename

commit 670af89b9677b75a4c08f409ccc653c07a608654
Author: Yika <yikaluo@assemblesys.com>
Date:   Mon Dec 16 18:10:12 2024 -0800

    Fix validating relative path (./) (#90)

    * Not validate file_mounts for sky exec

    * fix sky launch on file_mount relative path

    * address comments

    * validate on both side

    * nit

    * smoke test

    * comments

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    * format

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

commit 8f07b85738dc174a0656ef5115a9b1e7145783a3
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Mon Dec 16 01:22:43 2024 -0800

    Fix config reload (#79)

    * reload the config

    * rename

    * elaborate

    * Fix makedir

    * minor

    * longer retry for deploy

    * Refactor

    * fix unit test

    * fix

    * format

    * fix docstr

    * revert

    * Fix loading issue

    * Add unit test

    * format

    * fix condition

    * Fix controller hash

    * revert to remove

    * rename

    * format

    * rename

    * comment

    * Avoid skip env var

    * comment

    * Add more tests

    * try test_config

    * format

    * avoid test config

    * fix folder path

    * use uuid

    * use uuid for download as well

    * format

commit 389a32b3f0410dce4bd95b74c7e982f442f4d644
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Sat Dec 14 17:49:41 2024 -0800

    [SKY-1073] Fix cross-device linking error with `sky jobs launch` (#78)

    * hardlink in the same block device

    * lint

    * lint

    * Update sky/skylet/constants.py

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

    ---------

    Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

commit 1a45d90e5b7ea75d2c86c2bba377a7d563af5f85
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Sat Dec 14 16:09:56 2024 -0800

    [Docs] Client server docs (SKY-1101) (#77)

    * Add autobuild

    * better logging

    * Add docs for API server

    * update note for `sky down` and `sky top`

    * format

    * Updates

    * Update

    * updates

    * fix typos

    * Fix

    * comment

    ---------

    Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

commit 07a4ce71fe41a4b51ef4cade461c88648cb5cecf
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date:   Fri Dec 13 16:06:53 2024 -0800

    [SKY-1074] Fix storage related tests for remote API server (#74)

    * Make test_storage_mounting.yaml.j2 cloud-specific

    * Add local marker to skip tests on remote API server

    * Update todo

    * lint

commit 1556c3301e1d06614da8a0040c6bd55608679a4b
Author: Yika <yikaluo@assemblesys.com>
Date:   Fri Dec 13 15:53:13 2024 -0800

    [SKY-1080] Controller backward compatibility (#72)

    * Controller backward compatibility

    * format

    * test

    * test2

    * polish

    * format

    * revert hostname

    * nit

    * better naming

    * unit test

    * more unit test

    * format

    * address comments

    * constant

commit 3eb370d639713b8892bfb7fccecd8cf3e5fcff75
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Dec 12 21:16:16 2024 -0800

    [SKY-1084] k8 ssh proxy uses sky python (#75)

    k8 ssh proxy uses sky python

commit 9f56b0fc31c6492f930805acf89c8e4ab4cc2609
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Dec 12 21:15:50 2024 -0800

    [SKY-1094] Disable sky bench (#73)

    Disable sky bench

commit 352320e2ba3e5c086fe9fd8ea1aeac716e67e371
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Dec 12 20:28:43 2024 -0800

    [SKY-1104] fix sky down issue (#76)

    fix sky down issue

commit 0383c82457401875a428aed0f9dd0bb6d6764904
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Thu Dec 12 17:09:40 2024 -0800

    Fix websocket proxy path and fixes SKY-1076 (#71)

    * Fix websocket proxy path

    * Add cluster config

    * format

commit 9fbb7fba1e656bb403ddfa17056382ecca7819c1
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Dec 12 16:56:40 2024 -0800

    UX fixes from alpha test (#70)

    * UX fixes from alpha test

    * decoder encoder

    * address comments

commit f8273f983dd4d78f6cbedc0e8f066c741afe26de
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Dec 12 16:45:21 2024 -0800

    [UX] sky api abort -a and -u (#69)

    * sky api abort -a and -u

    * fix -u

    * comments

    * address comments

    * nit

commit 7b7abaa82a89d63f45a92d8e46af203bc42e991b
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Dec 12 13:40:05 2024 -0800

    [UX] sky api ls feature polish (#67)

    * sky api ls feature polish

    * address comments

    * naming

    * nit

    * user ID

    * --all-status

commit 92b137f838a636d4568a6040a73919da440576e5
Author: Yika <yikaluo@assemblesys.com>
Date:   Thu Dec 12 09:45:11 2024 -0800

    Fix sky storage leaking issue (#65)

    * Fix cloud storage leak

    * nit

    * format

    * comments

    * comments

commit 839db19606ae46e75e354f127faa9d5ceb96e7f0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date:   Wed Dec 11 21:37:46 2024 -0800

    Merge branch 'master' of github.com:skypilot-org/skypilot into restapi (#62)

    * [perf…
@Michaelvll changed the title from "[Core] Rearchitect SkyPilot to be Client-Server" to "[Core] SkyPilot Client-Server Architecture" on Feb 6, 2025
@Michaelvll
Collaborator Author

/smoke-test --aws

sky/backends/cloud_vm_ray_backend.py (outdated)
.github/workflows/pytest.yml (outdated)
@@ -120,6 +125,7 @@ def launch(
'remote_user_config_path': remote_user_config_path,
'modified_catalogs':
service_catalog_common.get_modified_catalog_file_mounts(),
'dashboard_setup_cmd': managed_job_constants.DASHBOARD_SETUP_CMD,
Collaborator

do we use this??

Collaborator Author

@Michaelvll Michaelvll Feb 7, 2025

Good catch! There were some merge conflicts. I have updated it now.

sky/jobs/server/core.py (outdated)
@zpoint
Collaborator

zpoint commented Feb 7, 2025

/smoke-test --aws -k test_env_check

@zpoint
Collaborator

zpoint commented Feb 7, 2025

/smoke-test --aws -k test_env_check

@zpoint
Collaborator

zpoint commented Feb 7, 2025

/smoke-test --managed-jobs

sky/backends/cloud_vm_ray_backend.py (outdated)
Comment on lines +268 to +269
# become a job on controller which messes up the job IDs (we assume the
# job ID in controller's job queue is consistent with managed job IDs).
Collaborator Author

@cg505, is this still the case? Do we still rely on the Ray job ID and the managed job ID being the same?

Collaborator

Not sure. I think no but there might still be some place (e.g. in log streaming code) that assumes it. Probably best to leave this in.

Comment on lines 274 to 281
if controller_status in [
        job_lib.JobStatus.FAILED_SETUP,
        job_lib.JobStatus.FAILED_DRIVER, job_lib.JobStatus.FAILED
]:
    # We should fail the case where the controller status is
    # failed, as it is likely due to the job for submitting the
    # managed job to scheduler failed.
    logger.error('Failed to submit the managed job to scheduler.')
Collaborator Author

@cg505, I added this so that jobs that fail before the scheduler is triggered for them are marked as FAILED_CONTROLLER. Wdyt?

Collaborator

A bit concerned here, since the job_lib status could be failed but the managed job was successfully submitted, then something else failed after. In that case we could race with the scheduler and get into some weird undefined behavior.
If we want to add this check, we should also verify that the schedule_state is INACTIVE (submit to scheduler failed), and we need to check schedule_state after we check the job_id status. Otherwise, if the job is WAITING or some other schedule_state, we need to make sure to stop the controller process.
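A minimal sketch of the check order described above, not the PR's actual code. `JobStatus` and `ScheduleState` are stand-in enums limited to members named in this thread, and `classify_failed_submission` with its two getter callables is a hypothetical name introduced only for illustration.

```python
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    """Stand-in for job_lib.JobStatus (only the members used here)."""
    SUCCEEDED = 'SUCCEEDED'
    FAILED_SETUP = 'FAILED_SETUP'
    FAILED_DRIVER = 'FAILED_DRIVER'
    FAILED = 'FAILED'


class ScheduleState(Enum):
    """Stand-in for the managed job schedule_state."""
    INACTIVE = 'INACTIVE'
    WAITING = 'WAITING'
    DONE = 'DONE'


_FAILED_STATUSES = frozenset(
    {JobStatus.FAILED_SETUP, JobStatus.FAILED_DRIVER, JobStatus.FAILED})


def classify_failed_submission(
        get_job_status: Callable[[], JobStatus],
        get_schedule_state: Callable[[], ScheduleState]) -> str:
    """Decide what to do about a possibly failed submission job."""
    status = get_job_status()
    if status not in _FAILED_STATUSES:
        return 'not_failed'
    # Read schedule_state only AFTER the job status is known to be failed.
    # Reading it earlier could observe a stale INACTIVE value.
    state = get_schedule_state()
    if state is ScheduleState.INACTIVE:
        # Submission to the scheduler never happened.
        return 'submit_failed'
    # The job reached the scheduler: the controller process must be stopped.
    return 'stop_controller'
```

The ordering matters under the assumption that a submission job in a terminal state can no longer move the managed job out of INACTIVE, so a schedule_state read taken afterwards is stable.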

Collaborator Author

I see. I can add an additional condition that checks whether the schedule_state is INACTIVE. I was assuming the job_lib status would always be SUCCEEDED if the managed job was successfully submitted.

Collaborator

There is also a question of what we want to do if the job_lib status is some FAILED state but the job was successfully submitted (schedule_state is not INACTIVE). This seems like a problem, so FAILED_CONTROLLER is reasonable. We just have to be careful about how to do it:

  • set job to DONE
  • get controller process PID
  • if not None, kill controller process and children
  • clean up cluster

or something like that
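The steps listed above could be sketched roughly as follows. Every callable passed in (`set_schedule_state_done`, `get_controller_pid`, `kill_process_tree`, `cleanup_cluster`) is a hypothetical placeholder rather than an existing SkyPilot function. The sketch only pins down the order of operations.

```python
from typing import Callable, List, Optional


def fail_controller_job(
        job_id: int,
        set_schedule_state_done: Callable[[int], None],
        get_controller_pid: Callable[[int], Optional[int]],
        kill_process_tree: Callable[[int], None],
        cleanup_cluster: Callable[[int], None]) -> List[str]:
    """Run the proposed cleanup in order and return the steps performed."""
    performed: List[str] = []

    # 1. Mark the job DONE first so the scheduler stops considering it.
    set_schedule_state_done(job_id)
    performed.append('set_done')

    # 2. Look up the controller process, if one was ever started.
    pid = get_controller_pid(job_id)

    # 3. Kill the controller process and its children only if it exists.
    if pid is not None:
        kill_process_tree(pid)
        performed.append('kill_controller')

    # 4. Clean up the cluster regardless of whether a controller was found.
    cleanup_cluster(job_id)
    performed.append('cleanup_cluster')
    return performed
```

Passing the operations in as callables keeps the sketch independent of how the real state and process helpers are named.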

Collaborator Author

I am not sure in what case the job_lib status could be failed when the job has been submitted to the scheduler, since the job controller's run section seems to include only the submission to the scheduler?

Collaborator

e.g. some ray issue out of our control

Collaborator

Or even in the submission, set_waiting succeeds but maybe_schedule_next_jobs crashes for some reason

with filelock.FileLock(_get_lock_path()):
    state.scheduler_set_waiting(job_id, dag_yaml_path)
    maybe_schedule_next_jobs()

job_lib.JobStatus.FAILED_DRIVER,
job_lib.JobStatus.FAILED
]:
# We should fail the case where the controller status is
Collaborator

we need to double check the schedule_state after job_lib.get_status to avoid a race

Collaborator Author

Hmm, good point! I tried adding the schedule state check here again, and the logic became quite complicated. I have now reverted it to check only the FAILED_SETUP case, for which checking the job_lib state alone should be safe. PTAL.


5 participants