[Core] SkyPilot Client-Server Architecture #4660
base: master
Squashed commit of the following: commit 7889757299c0260fa4fc6e8c2d3b06484783aae7 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 21:36:58 2025 +0000 change to local cache for the k8s client for newly added credential commit bf660981504544aeb15650938d754094d00faf22 Merge: 19f13d372 3e9bd9d0f Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 21:30:59 2025 +0000 Merge branch 'master' of github.com:skypilot-org/skypilot into restapi commit 19f13d372b3fa7d590adcb490cc03ff7d2fa6f30 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 19:48:17 2025 +0000 Fix the unit test commit 72ea226c4b73c6e0eac0f4d55063a6507107c711 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 17:42:51 2025 +0000 fix variables commit 0dcc06b6bde5f2d24df1c7a8e31b703d54705e77 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 07:57:26 2025 +0000 remove unecessary endpoint commit d54c74c59cac76ee56acfd339328144eb60d5929 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 07:49:56 2025 +0000 refactor serve endpoint for less API calls commit 912069e6f715d10f7d44c4acdd45d57c25de6689 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 07:27:56 2025 +0000 Fix output commit f8630cdba402882112fdc6821c586d897ee766ea Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Feb 6 02:36:39 2025 +0000 fix bump seconds commit 38e66d87fc56d03e5194d9ff18f58ff2194d680a Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 23:11:11 2025 +0000 Fix merging issue commit 0e7ace80f84ee4845f006e63610f9f80c1b5928c Merge: 99b81b6ff e7d94e956 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 23:00:03 2025 +0000 Merge branch 'master' of github.com:skypilot-org/skypilot into restapi commit 99b81b6ff810c595f4e9c5fbd4c8936692521fef Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 22:53:15 2025 +0000 mypy commit 0cadaf976bd3f4296250992d299128282e16f9f6 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 22:35:31 
2025 +0000 Fix serve status commit 8d7849f3579e4a500efcc55c8a7e8023faf776e7 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 22:28:22 2025 +0000 Avoid clone-disk-from tests commit f0026726c1f6cfe66658e6e4820816e3d7591af9 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 22:14:09 2025 +0000 Avoid detach setup commit 9b8c0cbe61ede6b846bbc054f68653f6d1cdff62 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 01:09:04 2025 +0000 Address comments commit cfaa62e573f0345ad4ba70c1b33ba1163ba809df Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 00:56:22 2025 +0000 pylint commit 52bbdb7619c0a0334024d5a7e0eff02124e456e0 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Feb 5 00:55:03 2025 +0000 Address comments commit 25a0c9255c1a166e0855f56ca0d53751b1528723 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Feb 4 22:41:21 2025 +0000 docstr commit 1bad3ebef6bafbd0eeba28b03f8bc6e310ff558a Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Feb 4 22:13:42 2025 +0000 Fix raise for resource parsing commit 81d7d79070b1ebc711a3c0034b40778df4608d77 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Feb 4 22:11:35 2025 +0000 Fix env var for CPU/memory limit in pod commit 11cff6b7370f148e74da0c56729dbe094d3d9f77 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Feb 4 21:46:12 2025 +0000 rename to LONG and SHORT commit 75797766c9830d58d8c1c0123c2782dee8417aae Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Feb 4 21:03:02 2025 +0000 Use 127.0.0.1 by default for API server endpoint commit 41e53331ee5ae0b1119b8b24d950ff6bd13e5a37 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Feb 3 23:44:01 2025 +0000 update logging for status refresh commit 574fb6f0b6bf843384ee6f0070b40e540e06b87a Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Feb 3 15:41:41 2025 -0800 [Usage/SKY-1403] Fix the usage collection for API server (#180) * Fix usage for server side * minor update * Fix run id * fix client 
entrypoint cmd * rename function * directly use reset for usage messages and refactor a bit * minor comment update * disable usage for internal status refresh commit 09801da1f23c125e539caa20b94689832b7b0c08 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Feb 3 04:46:40 2025 +0000 Fix merge conflict for jobs launch commit 7b86c84f4543ac092c5fb56474c32459b33c8d34 Merge: 52d499e2f 269dfb192 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Feb 3 04:27:46 2025 +0000 Merge branch 'master' of github.com:skypilot-org/skypilot into restapi commit 52d499e2f9c3bf3f75674bd6d160224b5f50729d Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Feb 2 08:09:49 2025 +0000 format commit af6a346342109c31b71ad7d793275de407c29ab8 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Feb 2 08:09:34 2025 +0000 address comments commit 983bf1f392783e81477ccac1f90d9c5b6b77e16e Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 18:33:52 2025 +0000 fix argument commit 100a5d43154f4cf2ada5f154f40fc361dd35a9ec Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 18:33:15 2025 +0000 Fix invocation commit 66540e28ec8282fbaab263b440b4acd4ae4b36eb Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 18:32:26 2025 +0000 fix invocation commit d62c79c64656723af85d9052282e0fc7e024729e Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 18:27:25 2025 +0000 make `need_confirmation` internal commit 3f573afd2253410c320b3c134946c89b3d58e7cc Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 18:01:51 2025 +0000 base exception commit efdb5911fa0da3f4f5701f27bb26ac5b5395b252 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 06:40:40 2025 +0000 format commit f6964d398b6f2298bab23e2c82707cef6295c659 Merge: d7dd77d61 ac9f159db Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 06:39:33 2025 +0000 Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi commit d7dd77d61b5579a698015e43b67718f7a923d916 Author: 
Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 06:39:24 2025 +0000 Address comment commit ac9f159db4b9adb3a9822d30b0fb71e9119da3c7 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 31 22:35:09 2025 -0800 Update sky/data/storage.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> commit 935c8b94c14275ffa3646312a50e82589a8fd1f7 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 05:43:59 2025 +0000 Fix SKY-1383 commit f4f7653a2232e159ee4cfe55f0e9cc60baec0524 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Feb 1 02:13:49 2025 +0000 Fix unit tests commit 4347b3d18acd013972132d878187952298a81ffb Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 31 23:57:14 2025 +0000 use response 500 for failed requests commit b6fea92010749e7593387ba9813e740363633751 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 31 23:05:46 2025 +0000 Add TODO commit 26938445c9436c72c50d35f8540e9ee54e06ca2c Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 31 20:32:18 2025 +0000 address comments commit b6d5437665cb5dd84fc574b4452a3ea24297b034 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 31 17:46:27 2025 +0000 fix tests commit 432b586954c84be86ac36feea26eb5f892ad794e Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 31 08:32:52 2025 +0000 Address comments commit 8ae605c993067d8b15fcd7eb91dbaeada750a8d6 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 31 04:10:31 2025 +0000 address comments commit 80b34b5a35f8cd4db081ddd528ee2a506835987a Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 30 19:46:26 2025 +0000 Address comments commit 4e80ff8390715cfdf55a784be2bc7b6e8266f67a Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Jan 25 19:51:56 2025 +0000 Address comments commit 307140e45ba602b88f0d44da6ff232bbd45f22bb Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Jan 25 18:39:16 2025 +0000 Address comments Fix cancel body Fix Fix remove redis types commit 
252786fe73d47a5df725b29b4a2a1fa7d58a4aaa Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 18:35:19 2025 -0800 [SKY-1374] Fix HTTPS keys (#179) * upload https keys * alternstive output commit cb45ae3ab21a3ec1c3fa2567d080723378ef7703 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 23:23:05 2025 +0000 Fix autodown commit a471c092c4918b035a51f88b728e844f262ce643 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 23:04:04 2025 +0000 Fix folder creation for log download commit 4c2a438b24b349359e38ef37e042b6fa21bd5444 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 21:48:52 2025 +0000 format commit 32d8ffa197cdcd669b3c888c0967808c62799712 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 21:43:01 2025 +0000 Fix cluster job downloading commit 895f277e7cb80a2a43772499e7233f15b7130fbf Merge: 70e7ebf61 1146cab36 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 20:35:04 2025 +0000 Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi commit 70e7ebf61cd36aecdd4d35f169b63df0d40ad9aa Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 20:23:03 2025 +0000 Naming and comments commit 1146cab3694a34bf9cb66a4e244c069250afb47d Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 11:48:19 2025 -0800 [SKY-1368] Support empty folder in file mounts and symlinks (#178) * upload symlink in file upload * format * remnant * fix relative * check relative path * Fix relative path in symlink target * Add unittest for zip and unzip * Add back symlink * fix circle link commit a8bba42de13ea88b9e4fc9bfb2f121abb2663cfb Author: zpoint <zp0int@qq.com> Date: Sat Jan 25 02:51:29 2025 +0800 get rid of autouse on pytest (#176) * get rid of autouse * manual fixture * prevent reload * remove the prevent_reload * Avoid reload cli * format --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> commit 099b8be94f1e40ed84f734a0bd44f7ba9941587f Author: Zhanghao Wu 
<zhanghao.wu@outlook.com> Date: Fri Jan 24 09:52:54 2025 -0800 [SKY-1366] Fix storage reconstruct (#177) * Fix storage reconstruct * Error out instead * Add action * Fix exception type * Fix logging commit 62071d56bff05187cf2127429122ba6691ff05a1 Author: Hong <hong@assemblesys.com> Date: Fri Jan 24 20:31:28 2025 +0800 [SKY-1200][UX] Valid sky server available while login (#134) commit 4179a6c2f0c771cec2f5d758304f959937b04e04 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 07:57:18 2025 +0000 move lock under logs folder commit 5a66dae35ecc1efebbacb2a1d5fef9652cbcb8e1 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 07:22:14 2025 +0000 Fix config override error handling commit ed641a16b5f7bf126ec0e6c8cab3b0ae836aa404 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 02:08:52 2025 +0000 Expose CLOUD_REGISTRY commit 60e211f27c8b2b82f601eb75b99afac9795d6e62 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 02:02:03 2025 +0000 format commit 9fcce2c977da60f875c26cc4140e87fc293f98c1 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 02:00:37 2025 +0000 [SKY-1367] Fix validate exception serialization and deserialization commit a00748b33e3d062f4bc9afaee631f05a18d85b45 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 00:57:09 2025 +0000 Fix deepspeed setup commit c81a20a5844d752287f951b65566d5d1e4780b69 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 00:39:53 2025 +0000 remove api server deployment commit c2a4f49a2adf10099b87f33c5de2d9956ef69d00 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 24 00:34:34 2025 +0000 Fix dimming commit 5ff29f469d68e114fc2a7305b89bde4cb62d4ba2 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 23 16:32:26 2025 -0800 merge issue commit 3994e17be460f15bbbb2834e78a29b566b5a3539 Merge: befbf5114 97b8e8f12 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 23 22:35:21 2025 +0000 Merge branch 'master' of 
github.com:skypilot-org/skypilot into restapi commit befbf51143a9e1735298a585b0ac84250a79d23b Author: zpoint <zp0int@qq.com> Date: Thu Jan 23 14:56:33 2025 +0800 SKY-1034 [Tests] Fix tests with no need for credentials (#51) * add restapi for tests * fix storage * add TODO * uncomment CI test * pass api test and unit test * fix dryrun bug * fix test_config * fix test_jobs * fix for jobs and serve * bug fix * fix test jobs and serve * support enable_all_clouds on server side * comment out fail test cases * bug fix and reformat * temp comment * pass tests * test config uncomment * restructure fixture for faster test * restore change by cursor * test test_list_accelerators * bug fix * fix test cli * fix all * test CI * bug fix * rename file * Update sky/cli.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * test_config pass final test * test_config pass final test * use patch and import strategy * format * bug fix * local test behaviour * fix reload issue * yapf * remove test controller util * fix test cases * resolve PR comment * resolve merge conflict * fix k8s test after merging restapi * resolve comment * resolve PR comment * debug * debug * bug fix * api change * fix PR comment * change stream method * fix * cli fix * Update tests/conftest.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * Update tests/test_cli.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * Update tests/test_cli.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * fix * resolve PR comment * restore class * support stream mock --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> commit 72d85e942b91905f41807765d2ad141eae7438ce Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Jan 22 22:41:28 2025 -0800 [SKY-1357] Fix jobs dashboard on Kubernetes (#175) * Add port forward command support * fix port forward command * format * Add comment commit 0c39a8cca617b736e8c675bef08c1a8082167de8 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Jan 22 
18:53:58 2025 -0800 [SKY-1363] `sky down sky-jobs...` request fail to cancel on-going request for `sky jobs launch` (#174) * Cancel jobs request when downing * Add column for api status * Add comment * minor commit d582284dc4ddb2cad5282ca03829144acb275644 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Jan 22 18:51:05 2025 -0800 [SKY-1323] Add timestamps for server logs (#173) * Add timestamps for server log * comments * format commit fffc1535530f52c888e52c9176318918220e1758 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Wed Jan 22 15:58:22 2025 -0800 [SKY-1223] Docs for multi-k8s support, exec based kubeconfig converter (#164) * Add docs and kubeconfig converter * lint * update docs * update docs commit 3ab843ffbb46bc87b4bcc75c113c618142a1ca77 Author: zpoint <zp0int@qq.com> Date: Thu Jan 23 06:04:30 2025 +0800 [SKY-1009] [Robust] Make streaming requests (sky logs , sky jobs logs , etc) synchronous (#155) * tail logs * circular import * print end * bug fix * fix * core function * resolve PR comment * mypy * bug fix * type * resolve PR comment * restore change * resolve merge conflict * Architecture: Add changes with dicussed offline * minor movement * format * format * Fix import * fix imports --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> commit 3e75a0f4d3756fb8c469e94088c360d5fdcbe4a4 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Wed Jan 22 13:07:39 2025 -0800 [SKY-1329] Better logging when API server fails to start (#167) * Better logging when API server fails to start * lint * comments * lint * lint commit 179ed3391ee8c5a7e554b9470965cca82a751432 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Jan 22 11:17:25 2025 -0800 [SKY-1337] Refactor client common and fix jobs logs sync down (#165) * refactor and support log sync down for jobs * Refactor for file uploads * Fix log path * Fix jobs logs downloading * fix smoke test * fix smoke test * format * Add debugging * Add debug * Reuse request body * Remove 
debug lines * get the latest * fix jobs logs tests * fix grep JobStatus commit a90928e11a872cbabb983fb766fa52485861ee66 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Jan 22 11:01:35 2025 -0800 [SKY-1347] Cancel pending requests for `sky down` (#172) Cancel pending requests for `sky down` as well commit 9167fed281428f8a64ecd39b834fa028d4864c94 Author: zpoint <zp0int@qq.com> Date: Wed Jan 22 18:01:08 2025 +0800 add aiofiles support for pre-commit-config (#171) add aiofiles support commit 8301271cb1504ad76abcf2ebf4b48d1979af71de Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Tue Jan 21 22:16:21 2025 -0800 [SKY-1326] Minor logging fixes for requests (#169) Minor logging commit 1865ecba8b01a03e436f4b5e65beb8031cac4258 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Tue Jan 21 18:57:55 2025 -0800 [SKY-1269] Allow tasks to run in custom namespaces (#170) * incluster namespace selection * comment * lint * lint commit c4edb603464a00a715b41df7bfbdfa7c1a30bd4a Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Jan 21 16:31:47 2025 -0800 [SKY-1115] Fix smoke tests with remote API server (no local credential) and local API server (#161) * Run cloud storage checks in tests on a cluster * More fixes * add more fix * format * format * Fix tests that requires local credentials * refactor the cmd runner * use new cluster * Add handling for GCP cloud commands * Wait for the cloud cmd cluster to be up * fix storage deletion * fix api server endpoint * fix env * fix None for env * fix intermediate storage * increase timeout * skip bench * fix queue with docker * format * more robust termination * Fix error handling for controller * fix API call * format * format * fix gcp related tests * Additional fixes * fix clean up * change zone * avoid test if on k8s * longer wait for cloud cmd cluster * longer wait * longer wait * longer * avoid tailing ? 
* wait for running * Fix storage mounts k8s test * Fix msg * Fix managed jobs output * fix dependency installation * Add additional fix * longer wait time for running * longer wait * slgihtly longer for clean up the resources * fix output * Fix cluster name * fix cluster names commit 020a20efa43be95e36a3222d7e7abaf2269d2d7f Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Jan 19 14:24:31 2025 -0800 [SKY-1348] Fix console mess up (#166) * Fix jobs logs * Add comments commit 2bae2b5a6c3d52f589dca55b7faa0eb2a68cc5c5 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Jan 19 11:52:51 2025 -0800 [SKY-1339] Upload files by chunks (#163) * fixes * wip fixes * wip * minor fix * Fix bug for bad zip * fix chunk upload * update return value * fix multi thread issue * avoid io * Add cleanup for stale chunks * format * lint * Add comments * ux * fix cleanup commit 189c686bf2a12071a21407085ef9fd258a30d800 Merge: 61f44a053 2354b818b Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Jan 19 19:21:07 2025 +0000 Merge branch 'master' of github.com:skypilot-org/skypilot into restapi commit 61f44a053c6778ccf5c6b28abc92006cee997f03 Merge: e87afbc6c 6b23582d9 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 17 23:46:28 2025 +0000 Merge branch 'master' of github.com:skypilot-org/skypilot into restapi commit e87afbc6c2687ae864c944d8137a294f32094410 Merge: b0ea95f03 9e1b4ddc5 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 17 07:15:09 2025 +0000 Merge branch 'master' of github.com:assemble-org/skypilot into restapi commit b0ea95f030f24c3a7f9936233d1c838eb953b7d1 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 17 06:21:19 2025 +0000 Fix job controller merging issue commit 37f6fce6ad9b1723c32fc9157701adaf8e34c330 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Thu Jan 16 20:29:22 2025 -0800 [SKY-1321] Fix concurrent requests to launch local API server (#162) * Fix concurrent requests to launch local API server * lint commit 
2b537b9b349fb848ae9d4a6da84f330be8fcdd8c Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 17 03:42:59 2025 +0000 format and unittest fix commit bd65f74dce7588ae62efd58bc5d5021576213326 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 17 03:27:58 2025 +0000 Fix cannonicalize accelerator implementation from master merging commit a33a894b3e2df6a238a0fab727dcce01c95e0f5c Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 17 01:37:31 2025 +0000 Merge branch 'master' of github.com:skypilot-org/skypilot into restapi commit 50671940daf8eb9713879a88480f230531e2cc7e Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 16 15:21:41 2025 -0800 [SKY-980] Refactor jobs queue and service status to avoid mp pool (#141) * Refactor jobs queue and service status to avoid mp pool * add comment * fix API cancel commit 7c614835447302964ae767aad8b8d283e7198411 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Thu Jan 16 11:35:51 2025 -0800 [Docs] Updates based on onboarding feedback (#120) * Update docs * Add python version commit 73a20398855a0a63ccb9306af2ae40446e18f29d Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Wed Jan 15 22:58:28 2025 -0800 [SKY-373] Fix `sky local` for client server (#61) * Make sky local up work * lint * lint * Update checks * lint * Merge with restapi * revert log changes * fix * Rebase done * Comments and refactor * fixes * Change to __enter__ and add todo * lint * lint * fix package path * remove debug * Fix GPU detection * Logging fixes * Lint commit 006ce5c2aaa4f8bfbf9c061195cf9dc84d3e54ec Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Jan 15 19:55:59 2025 -0800 [SKY-915/SKY-1292] Clean up ssh entries and fix api logs exception handling (#157) * Remove ssh config for clusters not exist * Fix log path non-exist issue * Update sky/cli.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Elaborate comments --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> 
commit 0a0cb59e43e84441812622ee50f2dd042660e787 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 16 00:46:35 2025 +0000 Add back changes from #78 commit e2d3eac431d6c443244e8b5eba28b1f8a2c07180 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Wed Jan 15 15:50:36 2025 -0800 [SKY-1045] Support GCP service account auth (#160) Update GCP service account docs commit efd893df50e9b9e91dc988a9f47d8c5c7fd7a7f7 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Jan 15 22:32:47 2025 +0000 Fix SKY-1303 commit 36e638f677de165dff5bfe6d278f29633f1db49a Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Jan 14 19:28:44 2025 -0800 [SKY-1288] Fix python SDK docs (#156) * Fix docs for SDK * Split into items * Add comments for API endpoint env var * format * fix docs * Add docs for serve commit 5b686e34d0843d7f399584302e5758a621263dcd Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Jan 14 17:31:23 2025 -0800 [SKY-1217] Fix log download test (#159) * Fix log download test * Add comment commit 4757b78d389188ec89810e82e6186c5c648f4ab7 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Jan 14 16:34:05 2025 -0800 [SKY-1095] Fix serve down issue with storage deletion (#158) Fix serve down issue with storage deletion commit c8bb92c7c5fca19939242fd10f64d9cdc9c7327f Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Tue Jan 14 14:14:09 2025 -0800 [SKY-1233] Handle non-sky exceptions and airflow updates (#151) * Add exception wrapping and update airflow * Add git branch * Update airflow and docs * Update airflow and docs * lint commit 0b4530f38e35cbfea513992ce9c189698abc4365 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Tue Jan 14 12:51:53 2025 -0800 [SKY-1258] Fix sky api start error message when using remote API server (#152) * Fix sky api start logic * lint * lint * comments * simplify logging logic * Add docs and comments * Add docs and comments commit 6dd380e7e316bc7c0051f955ba5f326c4e0d454c Author: Zhanghao Wu 
<zhanghao.wu@outlook.com> Date: Tue Jan 14 11:14:08 2025 -0800 [SKY-1291] Fix CRLF output (#153) * Fix CRLF output * fix last line * update comment commit 79f49ab29e9f901cca143a4aec366fd946d2aa55 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Jan 13 20:40:24 2025 -0800 [SKY-1108] Better logging for skypilot config override (#150) Better logging for skypilot config override commit c28d7b49f6670893098ff63bf552ccea7c978dd7 Author: zpoint <zp0int@qq.com> Date: Mon Jan 13 14:01:37 2025 +0800 type hint fix for tail logs (#149) type hint commit 24d270b64c6fa480e0eb50583b198401e000902f Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Jan 12 22:00:46 2025 -0800 [SKY-1129] Avoid lru_cache across requests (#137) * Add reload for caches * Add comments * Fix docstr * Simplify * Add TODO commit 270edfb8156e32cb395cae55bd37c44e13f2db1f Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Jan 12 21:53:26 2025 -0800 [UX] Prompt when launching on other's cluster (#146) * Prompt when launching on other's cluster * add import * format * Better logging * Only show hash when user commit 37f0c823a1b86ddb38ce377e80863baa4d74ad6c Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sun Jan 12 21:52:34 2025 -0800 [UX] Add terminal completion and fix progress bar with `\r` (#148) * complete console * Add terminal completion * format * format * Fix \r * format commit 1e3e5281e159735533a7eac9f98c21f45f21eab6 Author: zpoint <zp0int@qq.com> Date: Sun Jan 12 11:23:38 2025 +0800 [SKY-1287]sky exec test echo hi doesn't show task logs (hi) (#147) * fix merge conflict overwritten * mypy * resolve comment * mypy * return type commit a3db3e2e61f25c0518b6c868a3f1afaacf1b885c Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 10 18:53:27 2025 -0800 [SKY-1256] Avoid buffer for streaming hints (#145) * Avoid buffer for streaming hints * format * Better log streamer * format * format * format * remove cache * fix error handling commit 786ae1d778889f89ba170d6e3db25774dec211ca 
Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Jan 11 02:42:49 2025 +0000 Avoid refresh cluster status when cluster yaml is None commit 6d31f31c9b867c0515f6c9ea3080ae63101d66f0 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 10 13:50:11 2025 -0800 [SKY-1278/SKY-1134] Add versioning for API server (#138) * Add versioning for API server * Add commit and version in the health call * refactor a bit * format * fix docstr * renaming commit 212f72fe663e27a3f7f8e83da13e6e795d0831b9 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 10 19:57:48 2025 +0000 format commit 5fbe71ea17551dfb22f1021db44dd81a98c6dfe2 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 10 19:56:22 2025 +0000 format commit e8a1f4736536020255430e9d1d61654881a0a7dc Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 10 19:55:18 2025 +0000 Rename to API server commit 55c648700419d9c2181206438fe2e30aaac483d8 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Fri Jan 10 10:32:56 2025 -0800 Update health endpoint in docs (#144) Update health endpoint commit 9ec1ed69f64019d23dc5760aec76e6c7d01bc66d Author: zpoint <zp0int@qq.com> Date: Fri Jan 10 17:04:42 2025 +0800 [SKY-988] [Robust] Refactor tail_logs to make it a uvicorn async call instead of an async executor call (#130) * sync log call * terminate worker * use sky abort * merge restapi * revert logging order * doc update commit e1eed3bf0ddd987b26aa094a2d1407f127ab28d2 Author: Hong <hong@assemblesys.com> Date: Fri Jan 10 14:46:21 2025 +0800 [UX] set `RUNNING` request color to green (#142) commit 961782a0d9b48939dea9ba2abcf5a53662d55052 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 9 22:34:18 2025 -0800 [SKY-1273] Rename API server related CLIs (#139) * Rename API server related CLIs * Consolidate server logs * format * Add api login python API * format * Add comments and TODOs * Add TODO for endpoint env var commit cfaf37e1e5204b42c3bfcac6fcbb7f301d5cfa18 Author: Zhanghao Wu 
<zhanghao.wu@outlook.com> Date: Thu Jan 9 18:16:47 2025 -0800 Add user hash for existing cluster (#140) * Add user hash for existing cluster * format * format commit c96710df70bf89fd20c373e04213bde54d9157be Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Fri Jan 10 01:05:37 2025 +0000 Fix check multiple clouds commit 466c4d6a767e1a82631b6f2794ca296e82904049 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 9 10:48:17 2025 -0800 Clean up client file mounts during API server restarts (#135) * Refactor client server * add docstr * Fix types * Refactor for jobs/server * use type alias * Add todo * rename `check_health` * update docstr * Add comments * Update docstr * Adding docstr * Move * Add doc * style * format * fix redis * renames * naming * format * format * clean up clients file mounts as well * fix merging naming issue * format * format commit e6cf03f050c2086382e21f3fa2a68fb0d0ade32b Author: Yika <yikaluo@assemblesys.com> Date: Thu Jan 9 10:20:33 2025 -0800 [SKY-1210] Fix or triage TODOs in restapi branch (#128) * Fix or triage TODOs in restapi branch * merge with restapi * comments commit f73a871431a1a85584761f14b13e2e95e4d1a0d9 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Jan 9 00:41:27 2025 -0800 Address comments for restapi branch (#133) * Refactor client server * add docstr * Fix types * Refactor for jobs/server * use type alias * Add todo * rename `check_health` * update docstr * Add comments * Update docstr * Adding docstr * Move * Add doc * style * format * fix redis * renames * naming * format * format * Add comment * Avoid dumping to files * format * Update sky/server/requests/executor.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Add todo for legacy * use queue.Queeu --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> commit 8fa28098edb6a85631c9a95107857d5f3701a832 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Jan 7 19:34:12 2025 +0000 remove poetry commit 
fe9d7385b459754cf493eee3345281f2f97fc4a7 Merge: 344151261 6cf98a3cf Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Jan 7 18:39:54 2025 +0000 Merge branch 'master' of github.com:assemble-org/skypilot into restapi commit 34415126138d67254c16f830c875a9f9077f76d4 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Jan 6 23:45:19 2025 -0800 Merge master (#131) * [perf] use uv for venv creation and pip install (#4414) * Revert "remove `uv` from runtime setup due to azure installation issue (#4401)" This reverts commit 0b20d568ee1af454bfec3e50ff62d239f976e52d. * on azure, use --prerelease=allow to install azure-cli * use uv venv --seed * fix backwards compatibility * really fix backwards compatibility * use uv to set up controller dependencies * fix python 3.8 * lint * add missing file * update comment * split out azure-cli dep * fix lint for dependencies * use runpy.run_path rather than modifying sys.path * fix cloud dependency installation commands * lint * Update sky/utils/controller_utils.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * [Minor] README updates. (#4436) * [Minor] README touches. * update * update * make --fast robust against credential or wheel updates (#4289) * add config_dict['config_hash'] output to write_cluster_config * fix docstring for write_cluster_config This used to be true, but since #2943, 'ray' is the only provisioner. Add other keys that are now present instead. * when using --fast, check if config_hash matches, and if not, provision * mock hashing method in unit test This is needed since some files in the fake file mounts don't actually exist, like the wheel path. 
* check config hash within provision with lock held * address other PR review comments * rename to skip_if_no_cluster_updates Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * add assert details Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * address PR comments and update docstrings * fix test * update docstrings Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * address PR comments * fix lint and tests * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * refactor skip_if_no_cluster_update var * clarify comment * format exception --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * [k8s] Add resource limits only if they exist (#4440) Add limits only if they exist * [robustness] cover some potential resource leakage cases (#4443) * if a newly-created cluster is missing from the cloud, wait before deleting Addresses #4431. * confirm cluster actually terminates before deleting from the db * avoid deleting cluster data outside the primary provision loop * tweaks * Apply suggestions from code review Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * use usage_intervals for new cluster detection get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check. * fix terminating/stopping state for Lambda and Paperspace * Revert "use usage_intervals for new cluster detection" This reverts commit aa6d2e9f8462c4e68196e9a6420c6781c9ff116b. 
* check cloud.STATUS_VERSION before calling query_instances * avoid try/catch when querying instances * update comments --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * smoke tests support storage mount only (#4446) * smoke tests support storage mount only * fix verify command * rename to only_mount * [Feature] support spot pod on RunPod (#4447) * wip * wip * wip * wip * wip * wip * resolve comments * wip * wip * wip * wip * wip * wip --------- Co-authored-by: hwei <hwei@covariant.ai> * use lazy import for runpod (#4451) Fixes runpod import issues introduced in #4447. * [k8s] Fix show-gpus when running with incluster auth (#4452) * Add limits only if they exist * Fix incluster auth handling * Not mutate azure dep list at runtime (#4457) * add 1, 2, 4 size H100's to GCP (#4456) * add 1, 2, 4 size H100's to GCP * update * Support buildkite CICD and restructure smoke tests (#4396) * event based smoke test * more event based smoke test * more test cases * more test cases with managed jobs * bug fix * bump up seconds * merge master and resolve conflict * more test case * support test_managed_jobs_pipeline_failed_setup * support test_managed_jobs_recovery_aws * manged job status * bug fix * test managed job cancel * test_managed_jobs_storage * more test cases * resolve pr comment * private member function * bug fix * restructure * fix import * buildkite config * fix stdout problem * update pipeline test * test again * smoke test for buildkite * remove unsupport cloud for now * merge branch 'reliable_smoke_test_more' * bug fix * bug fix * bug fix * test pipeline pre merge * build test * test again * trigger test * bug fix * generate pipeline * robust generate pipeline * refactor pipeline * remove runpod * hot fix to pass smoke test * random order * allow parameter * bug fix * bug fix * exclude lambda cloud * dynamic generate pipeline * fix pre-commit * format * support SUPPRESS_SENSITIVE_LOG * support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug 
log * support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log * add backward_compatibility_tests to pipeline * pip install uv for backward compatibility test * import style * generate all cloud * resolve PR comment * update comment * naming fix * grammar correction * resolve PR comment * fix import * fix import * support gcp on pre merge test * no gcp test case for pre merge * [k8s] Make node termination robust (#4469) * Add limits only if they exist * retry deletion * lint * lint * comments * lint * [Catalog] Bump catalog schema version (#4470) * Bump catalog schema version * trigger CI * [core] skip provider.availability_zone in the cluster config hash (#4463) skip provider.availability_zone in the cluster config hash * remove sky jobs launch --fast (#4467) * remove sky jobs launch --fast The --fast behavior is now always enabled. This was unsafe before but since \#4289 it should be safe. We will remove the flag before 0.8.0 so that it never touches a stable version. sky launch still has the --fast flag. This flag is unsafe because it could cause setup to be skipped even though it should be re-run. In the managed jobs case, this is not an issue because we fully control the setup and know it will not change. 
* fix lint * [docs] Change urls to docs.skypilot.co, add 404 page (#4413) * Add 404 page, change to docs.skypilot.co * lint * [UX] Fix unnecessary OCI logging (#4476) Sync PR: fix-oci-logging-master Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * [Example] PyTorch distributed training with minGPT (#4464) * Add example for distributed pytorch * update * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Fix --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Add tests for Azure spot instance (#4475) * verify azure spot instance * string style * echo * echo vm detail * bug fix * remove comment * rename pre-merge test to quicktest-core (#4486) * rename to test core * rename file * [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu (#4337) * [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu Signed-off-by: nkwangleiGIT <nkwanglei@126.com> * fix format issue Signed-off-by: nkwangleiGIT <nkwanglei@126.com> --------- Signed-off-by: nkwangleiGIT <nkwanglei@126.com> * [k8s] Fix IPv6 ssh support (#4497) * Add limits only if they exist * Fix ipv6 support * Fix ipv6 support * [Serve] Add and adopt least load policy as default poicy. 
(#4439) * [Serve] Add and adopt least load policy as default poicy. * Docs & smoke tests * error message for different lb policy * add minimal example * fix * [Docs] Update logo in docs (#4500) * WIP updating Elisa logo; issues with light/dark modes * Fix SVG in navbar rendering by hardcoding SVG + defining text color in css * Update readme images * newline --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * Replace `len()` Zero Checks with Pythonic Empty Sequence Checks (#4298) * style: mainly replace len() comparisons with 0/1 with pythonic empty sequence checks * chore: more typings * use `df.empty` for dataframe * fix: more `df.empty` * format * revert partially * style: add back comments * style: format * refactor: `dict[str, str]` Co-authored-by: Tian Xia <cblmemo@gmail.com> --------- Co-authored-by: Tian Xia <cblmemo@gmail.com> * [Docs] Fix logo file path (#4504) * Add limits only if they exist * rename * [Storage] Show logs for storage mount (#4387) * commit for logging change * logger for storage * grammar * fix format * better comment * resolve copilot review * resolve PR comment * remove unuse var * Update sky/data/data_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * resolve PR comment * update comment for get_run_timestamp * rename backend_util.get_run_timestamp to sky_logging.get_run_timestamp --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * [Examples] Update Ollama setup commands (#4510) wip * [OCI] Support OCI Object Storage (#4501) * OCI Object Storage Support * example yaml update * example update * add more example yaml * Support RClone-RPM pkg * Add smoke test * ver * smoke test * Resolve dependancy conflict between oci-cli and runpod * Use latest RClone version (v1.68.2) * minor optimize * Address review comments * typo * test * sync code with repo * Address review comments & more testing. 
* address one more comment * [Jobs] Allowing to specify intermediate bucket for file upload (#4257) * debug * support workdir_bucket_name config on yaml file * change the match statement to if else due to mypy limit * pass mypy * yapf format fix * reformat * remove debug line * all dir to same bucket * private member function * fix mypy * support sub dir config to separate to different directory * rename and add smoke test * bucketname * support sub dir mount * private member for _bucket_sub_path and smoke test fix * support copy mount for sub dir * support gcs, s3 delete folder * doc * r2 remove_objects_from_sub_path * support azure remove directory and cos remove * doc string for remove_objects_from_sub_path * fix sky jobs subdir issue * test case update * rename to _bucket_sub_path * change the config schema * setter * bug fix and test update * delete bucket depends on user config or sky generated * add test case * smoke test bug fix * robust smoke test * fix comment * bug fix * set the storage manually * better structure * fix mypy * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * limit creation for bucket and delete sub dir only * resolve comment * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update sky/utils/controller_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * resolve PR comment * bug fix * bug fix * fix test case * bug fix * fix * fix test case * bug fix * support is_sky_managed param in config * pass param intermediate_bucket_is_sky_managed * resolve PR comment * Update sky/utils/controller_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * hide bucket creation log * reset green color * rename is_sky_managed to _is_sky_managed * bug fix * retrieve _is_sky_managed from stores * propogate the log --------- 
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * [Core] Deprecate LocalDockerBackend (#4516) Deprecate local docker backend * [docs] Add newer examples for AI tutorial and distributed training (#4509) * Update tutorial and distributed training examples. * Add examples link * add rdvz * [k8s] Fix L40 detection for nvidia GFD labels (#4511) Fix L40 detection * [docs] Support OCI Object Storage (#4513) * Support OCI Object Storage * Add oci bucket for file_mount * [Docs] Disable Kapa AI (#4518) Disable kapa * [DigitalOcean] droplet integration (#3832) * init digital ocean droplet integration * abbreviate cloud name * switch to pydo * adjust polling logic and mount block storage to instance * filter by paginated * lint * sky launch, start, stop functional * fix credential file mounts, autodown works now * set gpu droplet image * cleanup * remove more tests * atomically destroy instance and block storage simulatenously * install docker * disable spot test * fix ip address bug for multinode * lint * patch ssh from job/serve controller * switch to EA slugs * do adaptor * lint * Update sky/clouds/do.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * Update sky/clouds/do.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * comment template * comment patch * add h100 test case * comment on instance name length * Update sky/clouds/do.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * Update sky/clouds/service_catalog/do_catalog.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * comment on max node char len * comment on weird azure import * comment acc price is included in instance price * fix return type * switch with do_utils * remove broad except * Update sky/provision/do/instance.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * Update sky/provision/do/instance.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * remove azure * comment on non_terminated_only * add open port debug message * wrap start instance api * use f-string * wrap stop * wrap instance down * assert 
credentials and check against all contexts * assert client is None * remove pending instances during instance restart * wrap rename * rename ssh key var * fix tags * add tags for block device * f strings for errors * support image ids * update do tests * only store head instance id * rename image slugs * add digital ocean alias * wait for docker to be available * update requirements and tests * increase docker timeout * lint * move tests * lint * patch test * lint * typo fix * fix typo * patch tests * fix tests * no_mark spot test * handle 2cpu serve tests * lint * lint * use logger.debug * fix none cred path * lint * handle get_cred path * pylint * patch for DO test_optimizer_dryruns.py * revert optimizer dryrun --------- Co-authored-by: Tian Xia <cblmemo@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal> * [Docs] Refactor pod_config docs (#4427) * refactor pod_config docs * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * [OCI] Set default image to ubuntu LTS 22.04 (#4517) * set default gpu image to skypilot:gpu-ubuntu-2204 * add example * remove comment line * set cpu default image to 2204 * update change history * [OCI] 1. Support specify OS with custom image id. 2. Corner case fix (#4524) * Support specify os type with custom image id. 
* trim space * nit * comment * Update intermediate bucket related doc (#4521) * doc * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * add tip * minor changes --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * [aws] cache user identity by 'aws configure list' (#4507) * [aws] cache user identity by 'aws configure list' Signed-off-by: Aylei <rayingecho@gmail.com> * refine get_user_identities docstring Signed-off-by: Aylei <rayingecho@gmail.com> * address review comments Signed-off-by: Aylei <rayingecho@gmail.com> --------- Signed-off-by: Aylei <rayingecho@gmail.com> * [k8s] Add validation for pod_config #4206 (#4466) * [k8s] Add validation for pod_config #4206 Check pod_config when run 'sky check k8s' by using k8s api * update: check pod_config when launch check merged pod_config during launch using k8s api * fix test * ignore check failed when test with dryrun if there is no kube config in env, ignore ValueError when launch with dryrun. For now, we don't support check schema offline. 
* use deserialize api to check pod_config schema * test * create another api_client with no kubeconfig * test * update error message * update test * test * test * Update sky/backends/backend_utils.py --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * [core] fix wheel timestamp check (#4488) Previously, we were only taking the max timestamp of all the subdirectories of the given directory. So the timestamp could be incorrect if only a file changed, and no directory changed. This fixes the issue by looking at all directories and files given by os.walk(). * [docs] Add image_id doc in task YAML for OCI (#4526) * Add image_id doc for OCI * nit * Update docs/source/reference/yaml-spec.rst Co-authored-by: Tian Xia <cblmemo@gmail.com> --------- Co-authored-by: Tian Xia <cblmemo@gmail.com> * [UX] warning before launching jobs/serve when using a reauth required credentials (#4479) * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * Update sky/backends/cloud_vm_ray_backend.py Minor fix * Update sky/clouds/aws.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * wip * minor changes * wip --------- Co-authored-by: hong <hong@hongdeMacBook-Pro.local> Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * [GCP] Activate service account for storage and controller (#4529) * Activate service account for storage * disable logging if not using service account * Activate for controller as well. 
* revert controller activate * Add comments * format * fix smoke * [OCI] Support reuse existing VCN for SkyServe (#4530) * Support reuse existing VCN for SkyServe * fix * remove unused import * format * [docs] OCI: advanced configuration & add vcn_ocid (#4531) * Add vcn_ocid configuration * Update config.rst * fix merge issues WIP * fix merging issues * fix imports * fix stores --------- Signed-off-by: nkwangleiGIT <nkwanglei@126.com> Signed-off-by: Aylei <rayingecho@gmail.com> Co-authored-by: Christopher Cooper <cooperc@assemblesys.com> Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Co-authored-by: zpoint <zp0int@qq.com> Co-authored-by: Hong <weih1121@qq.com> Co-authored-by: hwei <hwei@covariant.ai> Co-authored-by: Yika <yikaluo@assemblesys.com> Co-authored-by: Seth Kimmel <seth.kimmel3@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Lei <nkwanglei@126.com> Co-authored-by: Tian Xia <cblmemo@gmail.com> Co-authored-by: Andy Lee <andylizf@outlook.com> Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> Co-authored-by: Hysun He <hysunhe@foxmail.com> Co-authored-by: Andrew Aikawa <asai@berkeley.edu> Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal> Co-authored-by: Aylei <rayingecho@gmail.com> Co-authored-by: Chester Li <chaoleili2@gmail.com> Co-authored-by: hong <hong@hongdeMacBook-Pro.local> commit 4d9c1d5f83a7ea2799800976f98bef8435552902 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Jan 6 13:06:15 2025 -0800 [Client] Create ssh config folder (#105) Create ssh config folder commit 397a6d9c80c1d63127699be654e1c445de8bb4db Author: Yika <yikaluo@assemblesys.com> Date: Fri Jan 3 14:51:31 2025 -0800 [UX][SKY-1085] Show more readable process command name in `htop` (#119) * [UX] Process naming * comment commit c289b36e11bf5fe39bcf5dfbaae22feec7cc027d Author: Yika <yikaluo@assemblesys.com> Date: Fri Jan 3 14:40:47 2025 
-0800 [UX][SKY-1081] More consistent logs for file syncing (#118) [UX][SKY-1081] Better logs for file syncing commit f29a6de440408172bf5c7e054b50ee6742d0557b Author: Yika <yikaluo@assemblesys.com> Date: Fri Jan 3 14:04:42 2025 -0800 [SKY-1071] Fix sky api server_logs for remote API server (#129) * [SKY-1071] Fix sky api server_logs for remote API server * comments commit ae30752777f7bc7778feda5b51f760337d066615 Author: Yika <yikaluo@assemblesys.com> Date: Mon Dec 23 11:20:31 2024 -0800 [UX][SKY-1092] Colored request status for sky api ls (#116) * [UX][SKY-1092] Colored request status for sky api ls * address comments * fix type * nit commit baf3fa340f625acb18e56d28cda9bcc99fc83ff5 Author: Yika <yikaluo@assemblesys.com> Date: Mon Dec 23 11:18:27 2024 -0800 [UX][SKY-1086] Polish worker logs in API server (#117) * [UX][SKY-1086] Polish worker logs in API server * address comments * revert accident commit commit a9f5f38c92e23a61fcb365b4fb62dd3959219c0c Author: Yika <yikaluo@assemblesys.com> Date: Wed Dec 18 10:17:16 2024 -0800 [SKY-1128] Fix sky storage cloud type mismatch issue (#109) Fix sky storage cloud type mismatch issue commit 6318b38159c6754e5e950c08c4058176d88aea5c Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Tue Dec 17 16:57:17 2024 -0800 [Docs] Quick updates to docs (#106) Updates commit 591c52e8757eb182392a51cf583807c6fbe7e41e Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Dec 17 16:22:01 2024 -0800 Reload kubernetes utils to avoid caching issue (#102) * Reload kubernetes utils to avoid caching issue * format commit d75fcd849c2dfef833a5afb3dfae35010afc1e06 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Tue Dec 17 16:01:27 2024 -0800 [SKY-1107] Docs for deployment (#92) * Update admin docs * Updates * Update docs * Update docs * Add storage docs * Updates commit 65554949e0f7743e964abf9c228c0d501087a5f5 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Tue Dec 17 12:49:03 2024 -0800 [Tests] Fix smoke test for 
`--kubernetes` and backward compatibility test on AWS (#80) * Remove stop test for k8s * Simplify the test k8s creation commands * use 5+ * fix * fix * expanduser * add tpu mark * minore large cpu * correctly quote * format * get old jobs as well * fix launch on existing * fix * fix back * fix back compat * minor fix * fix * Fix server side validation * remove debug * rename back to cloudvmray * rename back to cloudvmray * fix aws cli on kubernetes pod * format * use uv * fix status -r in backward compat * Fix path * format * less \t * try less gap for waiting * Add a timeout for curl to avoid long waiting * Add max time as well * shorter sleep commit 8af911e4ad36a8f9da77f5186b18e716503a05a4 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Tue Dec 17 00:03:37 2024 -0800 [Sky-1123] Fix proxy command for SSH (#94) Fix proxy command commit ed724c52a458fbac9d81fa0134f0ff1612b0c531 Author: Yika <yikaluo@assemblesys.com> Date: Mon Dec 16 18:11:31 2024 -0800 [SKY-1110] Fix sky server controller name issue (#91) * Remove controller name reference on client side * format * add comments * rename commit 670af89b9677b75a4c08f409ccc653c07a608654 Author: Yika <yikaluo@assemblesys.com> Date: Mon Dec 16 18:10:12 2024 -0800 Fix validating relative path (./) (#90) * Not validate file_mounts for sky exec * fix sky launch on file_mount relative path * address comments * validate on both side * nit * smoke test * comments Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * format --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> commit 8f07b85738dc174a0656ef5115a9b1e7145783a3 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Mon Dec 16 01:22:43 2024 -0800 Fix config reload (#79) * reload the config * rename * elaborate * Fix makedir * minor * longer retry for deploy * Refactor * fix unit test * fix * format * fix docstr * revert * Fix loading issue * Add unit test * format * fix condition * Fix controller hash * revert to remove * rename * format * rename 
* comment * Avoid skip env var * comment * Add more tests * try test_config * format * avoid test config * fix folder path * use uuid * use uuid for download as well * format commit 389a32b3f0410dce4bd95b74c7e982f442f4d644 Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Sat Dec 14 17:49:41 2024 -0800 [SKY-1073] Fix cross-device linking error with `sky jobs launch` (#78) * hardlink in the same block device * lint * lint * Update sky/skylet/constants.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> commit 1a45d90e5b7ea75d2c86c2bba377a7d563af5f85 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Sat Dec 14 16:09:56 2024 -0800 [Docs] Client server docs (SKY-1101) (#77) * Add autobuild * better logging * Add docs for API server * update note for `sky down` and `sky top` * format * Updates * Update * updates * fix typos * Fix * comment --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> commit 07a4ce71fe41a4b51ef4cade461c88648cb5cecf Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> Date: Fri Dec 13 16:06:53 2024 -0800 [SKY-1074] Fix storage related tests for remote API server (#74) * Make test_storage_mounting.yaml.j2 cloud-specific * Add local marker to skip tests on remote API server * Update todo * lint commit 1556c3301e1d06614da8a0040c6bd55608679a4b Author: Yika <yikaluo@assemblesys.com> Date: Fri Dec 13 15:53:13 2024 -0800 [SKY-1080] Controller backward compatibility (#72) * Controller backward compatibility * format * test * test2 * polish * format * revert hostname * nit * better naming * unit test * more unit test * format * address comments * constant commit 3eb370d639713b8892bfb7fccecd8cf3e5fcff75 Author: Yika <yikaluo@assemblesys.com> Date: Thu Dec 12 21:16:16 2024 -0800 [SKY-1084] k8 ssh proxy uses sky python (#75) k8 ssh proxy uses sky python commit 9f56b0fc31c6492f930805acf89c8e4ab4cc2609 Author: Yika <yikaluo@assemblesys.com> Date: Thu Dec 12 21:15:50 2024 
-0800 [SKY-1094] Disable sky bench (#73) Disable sky bench commit 352320e2ba3e5c086fe9fd8ea1aeac716e67e371 Author: Yika <yikaluo@assemblesys.com> Date: Thu Dec 12 20:28:43 2024 -0800 [SKY-1104] fix sky down issue (#76) fix sky down issue commit 0383c82457401875a428aed0f9dd0bb6d6764904 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Thu Dec 12 17:09:40 2024 -0800 Fix websocket proxy path and fixes SKY-1076 (#71) * Fix websocket proxy path * Add cluster config * format commit 9fbb7fba1e656bb403ddfa17056382ecca7819c1 Author: Yika <yikaluo@assemblesys.com> Date: Thu Dec 12 16:56:40 2024 -0800 UX fixes from alpha test (#70) * UX fixes from alpha test * decoder encoder * address comments commit f8273f983dd4d78f6cbedc0e8f066c741afe26de Author: Yika <yikaluo@assemblesys.com> Date: Thu Dec 12 16:45:21 2024 -0800 [UX] sky api abort -a and -u (#69) * sky api abort -a and -u * fix -u * comments * address comments * nit commit 7b7abaa82a89d63f45a92d8e46af203bc42e991b Author: Yika <yikaluo@assemblesys.com> Date: Thu Dec 12 13:40:05 2024 -0800 [UX] sky api ls feature polish (#67) * sky api ls feature polish * address comments * naming * nit * user ID * --all-status commit 92b137f838a636d4568a6040a73919da440576e5 Author: Yika <yikaluo@assemblesys.com> Date: Thu Dec 12 09:45:11 2024 -0800 Fix sky storage leaking issue (#65) * Fix cloud storage leak * nit * format * comments * comments commit 839db19606ae46e75e354f127faa9d5ceb96e7f0 Author: Zhanghao Wu <zhanghao.wu@outlook.com> Date: Wed Dec 11 21:37:46 2024 -0800 Merge branch 'master' of github.com:skypilot-org/skypilot into restapi (#62) * [perf…
/smoke-test --aws
@@ -120,6 +125,7 @@ def launch(
'remote_user_config_path': remote_user_config_path,
'modified_catalogs':
    service_catalog_common.get_modified_catalog_file_mounts(),
'dashboard_setup_cmd': managed_job_constants.DASHBOARD_SETUP_CMD,
Do we use this?
Good catch! There were some merge conflicts. I have updated it now.
/smoke-test --aws -k test_env_check
Clean up endpoint docs
/smoke-test --aws -k test_env_check
/smoke-test --managed-jobs
# become a job on controller which messes up the job IDs (we assume the
# job ID in controller's job queue is consistent with managed job IDs).
@cg505, is this still the case? Do we still rely on the same ray job ID and managed job ID?
Not sure. I think no but there might still be some place (e.g. in log streaming code) that assumes it. Probably best to leave this in.
sky/jobs/utils.py
Outdated
if controller_status in [
        job_lib.JobStatus.FAILED_SETUP,
        job_lib.JobStatus.FAILED_DRIVER, job_lib.JobStatus.FAILED
]:
    # We should fail the case where the controller status is
    # failed, as it is likely due to the job for submitting the
    # managed job to scheduler failed.
    logger.error('Failed to submit the managed job to scheduler.')
@cg505, I added this so that jobs that fail before the scheduler is triggered for them are marked as FAILED_CONTROLLER. What do you think?
A bit concerned here, since the job_lib status could be failed but the managed job was successfully submitted, then something else failed after. In that case we could race with the scheduler and get into some weird undefined behavior.
If we want to add this check, we should also verify that the schedule_state is INACTIVE (submit to scheduler failed), and we need to check schedule_state after we check the job_id status. Otherwise, if the job is WAITING or some other schedule_state, we need to make sure to stop the controller process.
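The ordering suggested above can be sketched roughly as follows. This is a minimal illustration, not the code in `sky/jobs/utils.py`: the enums are simplified stand-ins, and `classify_submission` and its two callables are hypothetical names introduced only for this example. The idea is that the job status is read first and `schedule_state` second, so that the schedule state is observed after the submission job has already reached a terminal status.

```python
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    """Simplified stand-in for job_lib.JobStatus (assumption)."""
    SUCCEEDED = 'SUCCEEDED'
    FAILED_SETUP = 'FAILED_SETUP'
    FAILED_DRIVER = 'FAILED_DRIVER'
    FAILED = 'FAILED'


class ScheduleState(Enum):
    """Simplified stand-in for the managed job schedule state (assumption)."""
    INACTIVE = 'INACTIVE'
    WAITING = 'WAITING'
    ALIVE = 'ALIVE'
    DONE = 'DONE'


FAILED_STATUSES = {
    JobStatus.FAILED_SETUP, JobStatus.FAILED_DRIVER, JobStatus.FAILED
}


def classify_submission(
        get_job_status: Callable[[], JobStatus],
        get_schedule_state: Callable[[], ScheduleState]) -> str:
    """Hypothetical helper: decide how to treat a managed job submission."""
    # Step 1: read the status of the job that submits to the scheduler.
    status = get_job_status()
    if status not in FAILED_STATUSES:
        return 'ok'
    # Step 2: only now read schedule_state, after the failed status was seen.
    state = get_schedule_state()
    if state is ScheduleState.INACTIVE:
        # The submission never reached the scheduler.
        return 'submit_failed'
    # The job was submitted and something failed afterwards, so the
    # controller process has to be stopped before marking the job failed.
    return 'needs_controller_cleanup'
```

Passing the two reads in as callables keeps the ordering explicit and makes the three outcomes easy to exercise with fakes.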
I see. I can add an additional condition checking that the `schedule_state` is `INACTIVE`. I was assuming the job_lib status would always be `SUCCEEDED` if the managed job was successfully submitted.
There is also a question of what we want to do if the job_lib status is some FAILED state but the job was successfully submitted (schedule_state is not INACTIVE). This seems like a problem, so FAILED_CONTROLLER is reasonable. We just have to be careful about how to do it:
- set job to DONE
- get controller process PID
- if not None, kill controller process and children
- clean up cluster
or something like that
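The steps listed above could look roughly like the sketch below. It is an illustration only: `fail_controller` and the four injected callables are hypothetical names, not SkyPilot functions, and the real state updates, process handling, and cluster teardown are left to whatever the callables do.

```python
from typing import Callable, List, Optional


def fail_controller(job_id: int,
                    set_done: Callable[[int], None],
                    get_controller_pid: Callable[[int], Optional[int]],
                    kill_process_tree: Callable[[int], None],
                    cleanup_cluster: Callable[[int], None]) -> List[str]:
    """Hypothetical sketch of the cleanup order discussed in the thread.

    Returns the names of the steps that were executed, in order.
    """
    steps: List[str] = []
    # 1. Mark the job as DONE first, so the scheduler no longer acts on it.
    set_done(job_id)
    steps.append('set_done')
    # 2. Look up the controller process, if one was ever started.
    pid = get_controller_pid(job_id)
    steps.append('get_pid')
    # 3. Kill the controller process and its children, if it exists.
    if pid is not None:
        kill_process_tree(pid)
        steps.append('kill')
    # 4. Clean up the cluster that the job may have launched.
    cleanup_cluster(job_id)
    steps.append('cleanup')
    return steps
```

Setting the job to DONE before killing the process follows the order given in the comment; whether that order is sufficient to avoid racing with the scheduler is exactly the open question in this thread.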
I am not sure in what case the job_lib status could be failed when the job has been submitted to the scheduler, since the job controller's run section seems to include only the submission to the scheduler?
e.g. some ray issue out of our control
Or even in the submission, `set_waiting` succeeds but `maybe_schedule_next_jobs` crashes for some reason
skypilot/sky/jobs/scheduler.py
Lines 186 to 188 in 54fe787
with filelock.FileLock(_get_lock_path()):
    state.scheduler_set_waiting(job_id, dag_yaml_path)
    maybe_schedule_next_jobs()
sky/jobs/utils.py
Outdated
        job_lib.JobStatus.FAILED_DRIVER,
        job_lib.JobStatus.FAILED
]:
    # We should fail the case where the controller status is
we need to double check the schedule_state after job_lib.get_status to avoid a race
Hmm, good point! I tried adding the schedule state check here again, and the logic became quite complicated. I have now reverted it to only check the `FAILED_SETUP` case, for which checking only the job_lib state should be safe. PTAL.
This PR rearchitects SkyPilot into a client-server architecture, separating the backend from the frontend so that the backend can be deployed separately.
What is new?
Architecture
Behaviors
Local API server (individual users)
Without a remote API server deployed, SkyPilot behaves as before: a local API server is started automatically.
Remote API server (multi-user organizations)
A user can now deploy a remote API server and point the local client at it. Multiple clients can connect to the same API server, which gives a single pane of glass over the resources used by multiple users/clients.
Disruptive Changes
- SDK calls now return a `request_id` that you can wait on or stream with `sky.get(request_id)` and `sky.stream_and_get(request_id)`
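To make the request-ID pattern concrete, here is a toy model in plain Python. It is not SkyPilot's implementation and does not call the SkyPilot SDK: `ToyApiServer`, `submit`, and `get` are hypothetical names invented for this sketch, and only `sky.get` and `sky.stream_and_get` come from the description above. The sketch shows the general shape: a call returns an ID immediately, and the caller blocks on the result only when asking for it by ID.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict


class ToyApiServer:
    """Toy stand-in for a server that runs requests asynchronously."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._requests: Dict[str, Future] = {}

    def submit(self, fn: Callable[..., Any], *args: Any) -> str:
        """Schedule fn(*args) and return a request ID without waiting."""
        request_id = uuid.uuid4().hex
        self._requests[request_id] = self._pool.submit(fn, *args)
        return request_id

    def get(self, request_id: str) -> Any:
        """Block until the request finishes and return its result."""
        return self._requests[request_id].result()

    def shutdown(self) -> None:
        """Wait for pending requests and release the worker threads."""
        self._pool.shutdown()


if __name__ == '__main__':
    server = ToyApiServer()
    # The call returns immediately with an ID; the work runs in the pool.
    request_id = server.submit(lambda name: f'launched {name}', 'my-cluster')
    # Waiting on the ID plays the role that sky.get(request_id) plays above.
    print(server.get(request_id))
    server.shutdown()
```

Existing scripts that used the return value of an SDK call directly would, under this pattern, need an extra step to fetch the result by ID, which is presumably why the change is listed as disruptive.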
For more user facing details, see API server docs hosted here: https://docs.skypilot.co/en/client-server/reference/api-server/api-server.html
For more developer facing details, see the readme:
Tests
Local API Server
- `pytest tests/test_smoke.py --aws`. Fails on issue #4672 ("[Test] Azure storage test fails probably due to wrong handling of the storage account in the test"), which should be fine for now.
- `pytest tests/test_smoke.py --gcp`
- `pytest tests/test_smoke.py --kubernetes`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`
Remote API Server
An API server running on a GKE cluster, deployed following: https://docs.skypilot.co/en/client-server/reference/api-server/api-server-admin-deploy.html
- `pytest tests/test_smoke.py --aws`
- `pytest tests/test_smoke.py --gcp`
- `pytest tests/test_smoke.py --kubernetes`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`
TODO
Acknowledgement
We thank all our users for providing valuable feedback during the alpha/beta testing, and all contributors who made this PR possible: @romilbhardwaj @concretevitamin @cg505 @KeplerC @zpoint @yika-luo @weih1121.