
[Azure] Support fractional A10 instance types #3877

Merged

merged 29 commits into master from support-fractional-a10 on Oct 26, 2024

Conversation

cblmemo
Collaborator

@cblmemo cblmemo commented Aug 26, 2024

Closes #3708

This PR supports fractional A10 instance types, both via instance_type=xxx and via accelerators=A10:{0.25,0.5,0.75}.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
# Launch cluster with
$ sky launch --instance-type Standard_NV6ads_A10_v5 -c sky-59bd-memory
# or
$ sky launch --gpus A10:0.25 -c sky-59bd-memory
# then
$ sky launch --gpus A10:0.24 -c sky-59bd-memory
sky.exceptions.ResourcesMismatchError: Task requested resources with fractional accelerator counts. For fractional counts, the required count must match the existing cluster. Got required accelerator A10:0.24 but the existing cluster has A10:0.25.
$ sky launch --gpus A10:1 -c sky-59bd-memory
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
  Requested:    {1x <Cloud>({'A10': 1})}
  Existing:     1x Azure(Standard_NV6ads_A10_v5, {'A10': 0.25})
To fix: specify a new cluster name, or down the existing cluster first: sky down sky-59bd-memory
$ sky launch --gpus A10:0.25 -c sky-59bd-memory nvidia-smi
Task from command: nvidia-smi
Running task on cluster sky-59bd-memory...
W 08-28 11:16:57 cloud_vm_ray_backend.py:1937] Trying to launch an A10 cluster on Azure. This may take ~20 minutes due to driver installation.
I 08-28 11:16:57 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/memory/sky_logs/sky-2024-08-28-11-16-56-005018/provision.log
I 08-28 11:16:58 provisioner.py:65] Launching on Azure eastus (all zones)
I 08-28 11:17:07 provisioner.py:450] Successfully provisioned or found existing instance.
I 08-28 11:17:19 provisioner.py:552] Successfully provisioned cluster: sky-59bd-memory
I 08-28 11:17:21 cloud_vm_ray_backend.py:3294] Job submitted with Job ID: 3
I 08-28 18:17:22 log_lib.py:412] Start streaming logs for job 3.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.8.0.4']
(sky-cmd, pid=15884) Wed Aug 28 18:17:23 2024       
(sky-cmd, pid=15884) +---------------------------------------------------------------------------------------+
(sky-cmd, pid=15884) | NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
(sky-cmd, pid=15884) |-----------------------------------------+----------------------+----------------------+
(sky-cmd, pid=15884) | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
(sky-cmd, pid=15884) | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
(sky-cmd, pid=15884) |                                         |                      |               MIG M. |
(sky-cmd, pid=15884) |=========================================+======================+======================|
(sky-cmd, pid=15884) |   0  NVIDIA A10-4Q                  On  | 00000002:00:00.0 Off |                    0 |
(sky-cmd, pid=15884) | N/A   N/A    P0              N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
(sky-cmd, pid=15884) |                                         |                      |             Disabled |
(sky-cmd, pid=15884) +-----------------------------------------+----------------------+----------------------+
(sky-cmd, pid=15884)                                                                                          
(sky-cmd, pid=15884) +---------------------------------------------------------------------------------------+
(sky-cmd, pid=15884) | Processes:                                                                            |
(sky-cmd, pid=15884) |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
(sky-cmd, pid=15884) |        ID   ID                                                             Usage      |
(sky-cmd, pid=15884) |=======================================================================================|
(sky-cmd, pid=15884) |  No running processes found                                                           |
(sky-cmd, pid=15884) +---------------------------------------------------------------------------------------+
INFO: Job finished (status: SUCCEEDED).
Shared connection to 23.101.130.81 closed.
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] Job ID: 3
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] To cancel the job:       sky cancel sky-59bd-memory 3
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] To stream job logs:      sky logs sky-59bd-memory 3
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] To view the job queue:   sky queue sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] 
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] Cluster name: sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To log into the head VM: ssh sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To submit a job:         sky exec sky-59bd-memory yaml_file
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To stop the cluster:     sky stop sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To teardown the cluster: sky down sky-59bd-memory
Clusters
NAME             LAUNCHED        RESOURCES                                        STATUS  AUTOSTOP  COMMAND                       
sky-59bd-memory  a few secs ago  1x Azure(Standard_NV6ads_A10_v5, {'A10': 0.25})  UP      -         sky launch -c sky-59bd-me... 
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh
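The strict-equality behavior exercised by the tests above can be sketched as a small standalone check (a hypothetical helper for illustration; the real logic lives in sky/resources.py):

```python
import math

def counts_fit(requested, existing) -> bool:
    """Illustrative sketch of the fractional-count matching rule.

    If either count is fractional, require an exact match to avoid
    semantic ambiguity; integer counts keep requested <= existing.
    """
    if isinstance(requested, float) or isinstance(existing, float):
        return math.isclose(requested, existing)
    return requested <= existing

assert counts_fit(0.25, 0.25)      # exact fractional match accepted
assert not counts_fit(0.24, 0.25)  # A10:0.24 on an A10:0.25 cluster rejected
assert not counts_fit(1, 0.25)     # A10:1 on an A10:0.25 cluster rejected
```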

Comment on lines 151 to 153
# Filter out instance types that only contain a fraction of a GPU.
df_filtered = _df.loc[
    ~_df['InstanceType'].isin(_FILTERED_A10_INSTANCE_TYPES)]
Collaborator

Instead of excluding the instances directly, can we print out some hints like the one when we specify sky launch --gpus L4:

Multiple AWS instances satisfy L4:1. The cheapest AWS(g6.xlarge, {'L4': 1}) is considered among:
I 08-27 06:09:54 optimizer.py:922] ['g6.xlarge', 'g6.2xlarge', 'g6.4xlarge', 'gr6.4xlarge', 'g6.8xlarge', 'gr6.8xlarge', 'g6.16xlarge'].

Collaborator Author

This hint is used to print instances with the same accelerator count. I'm wondering whether we should do the same for fractional GPUs.

@cblmemo cblmemo changed the title [Azure] Support fractional A10 instance types only from instance_type=xxx [Azure] Support fractional A10 instance types Aug 28, 2024
@cblmemo
Collaborator Author

cblmemo commented Aug 28, 2024

Added support for launching via --gpus A10:0.25, allowing only strictly equal counts for fractional GPU requirements. Also updated the tests in the PR description. PTAL!

Collaborator

@Michaelvll Michaelvll left a comment


Thanks for the update @cblmemo! Mostly looks good to me, with some slight issues.

Comment on lines 283 to 285
# Manually update the GPU count for fractional A10 instance types.
df_ret['AcceleratorCount'] = df_ret.apply(_upd_a10_gpu_count,
                                          axis='columns')
Collaborator

Could we say more in the comment for why we need to do it manually?

Collaborator Author

Good point! Added. PTAL

sky/clouds/service_catalog/scp_catalog.py (resolved)
sky/clouds/azure.py (resolved)
sky/resources.py Outdated
Comment on lines 1145 to 1154
if isinstance(self.accelerators[acc], float) or isinstance(
        other_accelerators[acc], float):
    # If the requested accelerator count is a float, we only
    # allow strictly equal counts since all of the float point
    # accelerator counts are less than 1 (e.g., 0.1, 0.5), and
    # we want to avoid semantic ambiguity (e.g. launching
    # with --gpus A10:0.25 on a A10:0.75 cluster).
    if not math.isclose(self.accelerators[acc],
                        other_accelerators[acc]):
        return False
Collaborator

We should allow the requested resources to be a float while the existing accelerators are an int, as long as the requested count is <= the existing count.

That said, the

Suggested change

-if isinstance(self.accelerators[acc], float) or isinstance(
-        other_accelerators[acc], float):
-    # If the requested accelerator count is a float, we only
-    # allow strictly equal counts since all of the float point
-    # accelerator counts are less than 1 (e.g., 0.1, 0.5), and
-    # we want to avoid semantic ambiguity (e.g. launching
-    # with --gpus A10:0.25 on a A10:0.75 cluster).
-    if not math.isclose(self.accelerators[acc],
-                        other_accelerators[acc]):
-        return False
+if isinstance(other_accelerators[acc], float) and not other_accelerators[acc].is_integer():
+    # If the requested accelerator count is a float, we only
+    # allow strictly equal counts since all of the float point
+    # accelerator counts are less than 1 (e.g., 0.1, 0.5), and
+    # we want to avoid semantic ambiguity (e.g. launching
+    # with --gpus A10:0.25 on a A10:0.75 cluster).
+    if not math.isclose(self.accelerators[acc],
+                        other_accelerators[acc]):
+        return False
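For illustration, here is a minimal standalone comparison of the two conditions on the contested case — a job requesting half an A10 on a whole-A10 cluster (variable names are hypothetical):

```python
requested, existing = 0.5, 1.0

# Original condition: strict matching whenever either side is a float.
original_triggers = isinstance(requested, float) or isinstance(existing, float)

# Suggested condition: strict matching only when the existing count is a
# non-integer float, so a 0.5 request on an A10:1 cluster falls through
# to the ordinary requested <= existing check.
suggested_triggers = (isinstance(existing, float)
                      and not existing.is_integer())

assert original_triggers is True
assert suggested_triggers is False
```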

Collaborator Author

Good point! Updated. Thanks!

Collaborator Author

Actually, on second thought, I think we should still keep the original isinstance(self.accelerators[acc], float) or isinstance(other_accelerators[acc], float) condition. Consider the following case: a user submits a job with --gpus A10:0.5 and the cluster has A10:1. The requirement 0.5 would be translated to 1, so the user could only run one A10:0.5 job instead of two, which is confusing. The original condition captures this case, but the updated one (isinstance(other_accelerators[acc], float) and not other_accelerators[acc].is_integer()) does not.

Collaborator

We did allow two A10:0.5 jobs to run on a single cluster with A10:1. Do you know when we changed this behavior, or did we ever change it before this PR?

Collaborator

We need to fix this before we merge the PR

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding the support @cblmemo! It looks good to me. Please run some tests to make sure the changes do not cause issues with other clouds and other accelerator types (considering we have changed a significant number of places).

Comment on lines 2678 to 2683
'Task requested resources with fractional '
'accelerator counts. For fractional '
'counts, the required count must match the '
'existing cluster. Got required accelerator'
f' {acc}:{self_count} but the existing '
f'cluster has {acc}:{existing_count}.')
Collaborator

Is this error message inaccurate? Our check is on the accelerator count of the existing cluster, not the task's requested resources?

Collaborator Author

Please see the above comments 🤔

sky/resources.py Outdated
Comment on lines 1145 to 1154
if isinstance(self.accelerators[acc], float) or isinstance(
        other_accelerators[acc], float):
    # If the requested accelerator count is a float, we only
    # allow strictly equal counts since all of the float point
    # accelerator counts are less than 1 (e.g., 0.1, 0.5), and
    # we want to avoid semantic ambiguity (e.g. launching
    # with --gpus A10:0.25 on a A10:0.75 cluster).
    if not math.isclose(self.accelerators[acc],
                        other_accelerators[acc]):
        return False
Collaborator

We need to fix this before we merge the PR

sky/resources.py Outdated (resolved)
sky/backends/cloud_vm_ray_backend.py Outdated (resolved)
@cblmemo
Collaborator Author

cblmemo commented Sep 11, 2024

Just identified another bug: for an A10:0.5 cluster, the previous implementation forced --gpus A10:0.5 on sky exec, which could actually run two jobs simultaneously since the Ray cluster has resources A10:1. Fixed by setting the GPU demand to its ceiling value (which is essentially 1) whenever we detect a fractional cluster.
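The fix described here might be sketched as follows (a hypothetical helper; the real change is in the backend):

```python
import math

def ray_gpu_demand(requested_count: float) -> int:
    # The remote Ray cluster registers a fractional GPU as one whole
    # custom resource (e.g. {'A10': 1}), so a job that owns the whole
    # fractional GPU must demand the ceiling of its count, i.e. 1.
    # Otherwise two A10:0.5 demands would fit into the A10:1 resource
    # even though the cluster only has half a physical GPU.
    return math.ceil(requested_count)

assert ray_gpu_demand(0.5) == 1  # only one A10:0.5 job runs at a time
assert ray_gpu_demand(2) == 2    # integer counts are unchanged
```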

cblmemo and others added 2 commits September 11, 2024 00:04
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
@cblmemo
Collaborator Author

cblmemo commented Sep 11, 2024

Actually, on second thought, I think we could even allow requesting --gpus A10:0.25 on an A10:0.5 cluster - we just need to convert it. The actual required number of GPUs can be calculated as {required_count} / {cluster_acc_count} * 1, since we set the remote Ray cluster's custom resources to A10:1. For this example it should require A10:0.5, so that two --gpus A10:0.25 jobs can run simultaneously on an A10:0.5 cluster.
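The conversion proposed above could look like this sketch (hypothetical helper; it assumes, as stated, that the remote Ray cluster's custom resource is registered as A10:1):

```python
def scaled_gpu_demand(required_count: float, cluster_acc_count: float) -> float:
    # Scale the request against the cluster's fractional capacity:
    # {required_count} / {cluster_acc_count} * 1, where the trailing
    # "* 1" is the single unit of the Ray custom resource.
    return required_count / cluster_acc_count * 1

# On an A10:0.5 cluster, an A10:0.25 job claims half of the Ray
# resource, so two such jobs can run simultaneously.
assert scaled_gpu_demand(0.25, 0.5) == 0.5
assert scaled_gpu_demand(0.5, 0.5) == 1.0
```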

@cblmemo
Collaborator Author

cblmemo commented Sep 11, 2024

Another TODO: I just found that there are 1/6 and 1/3 A10 instance types. We need to settle on a precision for displaying such decimals.
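One plausible convention (an assumption on my part, though it matches the A10:0.167 / A10:0.333 strings in the fuzzy-candidate output shown later in this thread) is to round to three decimal places for display:

```python
def display_acc_count(count: float) -> str:
    # Whole numbers print without a decimal point; fractions such as
    # 1/6 and 1/3 round to three decimal places for display.
    if float(count).is_integer():
        return str(int(count))
    return str(round(count, 3))

assert display_acc_count(1 / 6) == '0.167'
assert display_acc_count(1 / 3) == '0.333'
assert display_acc_count(1.0) == '1'
```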

@cblmemo
Collaborator Author

cblmemo commented Sep 12, 2024

All TODOs are done. After the smoke tests it should be ready to merge ;)

return int(value)
return float(value)

return {acc_name: _convert(acc_count)}


def get_instance_type_for_accelerator_impl(
Collaborator

Should we update the acc_count type here? Also, when comparing acc_count, should we accept every number within abs(df['AcceleratorCount'] - acc_count) <= 0.01? Otherwise, a user running sky launch --gpus A10:0.16 or sky launch --gpus A10:0.1666 would fail?
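The tolerance-based lookup suggested here could be sketched as follows (instance names and counts are illustrative placeholders, not the real catalog):

```python
# Hypothetical, simplified catalog rows: (instance type, A10 count).
CATALOG = [
    ('one-sixth-a10-vm', 0.167),
    ('one-third-a10-vm', 0.333),
    ('half-a10-vm', 0.5),
    ('whole-a10-vm', 1.0),
]

def find_instance_types(acc_count: float, tol: float = 0.01) -> list:
    # Accept any catalog entry within the tolerance, so requests like
    # A10:0.16 or A10:0.1666 resolve to the 0.167 entry instead of
    # failing outright.
    return [name for name, count in CATALOG if abs(count - acc_count) <= tol]

assert find_instance_types(0.16) == ['one-sixth-a10-vm']
assert find_instance_types(0.1666) == ['one-sixth-a10-vm']
assert find_instance_types(0.9) == []
```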

Collaborator Author

Should we update the acc_count type here?

Sorry, could you elaborate on this? Are you saying there is a better place to update the type?

Also, when we are comparing the acc_count should we make sure every number within math.abs(df['AcceleratorCount'] - acc_count) <= 0.01 should work.

For this, I'm slightly concerned about the case where the user runs:

sky launch -c a10-frac --gpus A10:0.16 # detected as 0.167 in catalog, then launch cluster with 0.167 gpu
sky exec a10-frac --gpus A10:0.16 sleep 100000 # user would think the cluster is full
sky exec a10-frac --gpus A10:0.007 sleep 100000 # however, this still running as the cluster is actually launched with 0.167 gpu.

To deal with the failure, we currently show all valid instance types as fuzzy candidates, and the user can then modify their accelerator count:

sky launch --gpus A10:0.16
I 09-12 22:34:50 optimizer.py:1301] No resource satisfying <Cloud>({'A10': 0.16}) on [AWS, GCP, Azure, RunPod].
I 09-12 22:34:50 optimizer.py:1305] Did you mean: ['A100-80GB-SXM:1', 'A100-80GB-SXM:2', 'A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100-80GB:8', 'A100:1', 'A100:16', 'A100:2', 'A100:4', 'A100:8', 'A10:0.167', 'A10:0.333', 'A10:0.5', 'A10:1', 'A10:2', 'A10G:1', 'A10G:4', 'A10G:8']
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: <Cloud>({'A10': 0.16}).

To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB-SXM:1', 'A100-80GB-SXM:2', 'A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100-80GB:8', 'A100:1', 'A100:16', 'A100:2', 'A100:4', 'A100:8', 'A10:0.167', 'A10:0.333', 'A10:0.5', 'A10:1', 'A10:2', 'A10G:1', 'A10G:4', 'A10G:8']

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Does that sound good to you?

Collaborator

Sry could you elaborate on this...? Are you saying there are a better place to update the type or what..?

acc_count is currently int for the type annotation below. Should we update that?

For this, I'm slightly concerned about the case when user:
sky launch -c a10-frac --gpus A10:0.16 # detected as 0.167 in catalog, then launch cluster with 0.167 gpu
sky exec a10-frac --gpus A10:0.16 sleep 100000 # user would think the cluster is full
sky exec a10-frac --gpus A10:0.007 sleep 100000 # however, this still running as the cluster is actually launched with 0.167 gpu.
To deal with the failing, we currently shows all valid instance type as fuzzy candidate and user could then modify their acc count:

This will only apply when a user is actually creating an instance with A10:0.16 via this function, right? When we are launching an instance, once we have returned the instance type, we can round the request up to the actual acc_count in the catalog?

Collaborator Author

@cblmemo cblmemo Oct 11, 2024

acc_count is currently int for the type annotation below. Should we update that?

Done!

This will only apply for the case when a user is actually creating an instance with A10:0.16 for this function right? When we are launching an instance, once we returned the instance type, we can round up the request to the actual acc_count in the catalog?

Good point! Done. PTAL!

@cblmemo
Collaborator Author

cblmemo commented Oct 1, 2024

@Michaelvll bump for this - will fix the conflict soon

@cblmemo
Collaborator Author

cblmemo commented Oct 10, 2024

bump for review @Michaelvll

sky/backends/cloud_vm_ray_backend.py Outdated (resolved)
Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cblmemo! LGTM. This should be good to go once the tests passed.

sky/clouds/service_catalog/common.py Outdated (resolved)
cblmemo and others added 3 commits October 25, 2024 13:14
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
@cblmemo
Collaborator Author

cblmemo commented Oct 25, 2024

Manual test passed. Running smoke test now!

@cblmemo
Collaborator Author

cblmemo commented Oct 26, 2024

Most of the smoke tests passed. A few still fail: #4192, an AWS bucket permission issue, and the TPU tests (due to quota constraints). These should not be related to this PR. Merging now

@cblmemo cblmemo added this pull request to the merge queue Oct 26, 2024
Merged via the queue into master with commit 647fcea Oct 26, 2024
20 checks passed
@cblmemo cblmemo deleted the support-fractional-a10 branch October 26, 2024 20:39
AlexCuadron pushed a commit to cblmemo/skypilot that referenced this pull request Nov 7, 2024
* fix

* change catalog to float gpu num

* support print float point gpu in sky launch. TODO: test if the ray deployment group works for fractional one

* fix unittest

* format

* patch ray resources to ceil value

* support launch from --gpus A10

* only allow strictly match fractional gpu counts

* address comment

* change back condition

* fix

* apply suggestions from code review

* fix

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* format

* fix display of fuzzy candidates

* fix precision issue

* fix num gpu required

* refactor in check_resources_fit_cluster

* change type annotation of acc_count

* enable fuzzy fp acc count

* fix k8s

* Update sky/clouds/service_catalog/common.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* fix integer gpus

* format

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
github-merge-queue bot pushed a commit that referenced this pull request Nov 11, 2024
…f options (#4061)

* user can select load balancing policies

* some fixes

* linting

* Fixes according to comments

* Linting

* Linting

* Fixed according to comments

* fix

* removed line from examples

* Reverted changes

* Reverted changes

* Fixed according to comments

* Linting

* Update sky/serve/load_balancer.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* [Catalog] Silently ignore TPU price not found. (#4134)

* [Catalog] Silently ignore TPU price not found.

* assert for non tpu v6e

* format

* [docs] Update GPUs used in docs (#4138)

* Change V100 to H100

* updates

* update

* [k8s] Fix GPU labeling for EKS (#4146)

Fix GPU labelling

* [k8s] Handle @ in context name (#4147)

Handle @ in context name

* [Docs] Typo in distributed jobs docs (#4149)

minor typo

* [Performance] Refactor Azure SDK usage (#4139)

* [Performance] Refactor Azure SDK usage

* lazy import and address comments

* address comments

* fixes

* fixes

* nits

* fixes

* Fix OCI import issue (#4178)

* Fix OCI import issue

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* edit comments

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* [k8s] Add retry for apparmor failures (#4176)

* Add retry for apparmor failures

* add comment

* [Docs] Update Managed Jobs page. (#4177)

* [Docs] Update Managed Jobs page.

* Lint

* Updates

* Minor: Jobs docs fix. (#4183)

* [Docs] Update Managed Jobs page.

* Lint

* Updates

* reword

* [UX] remove all uses of deprecated `sky jobs` (#4173)

* [UX] remove all uses of deprecated `sky jobs`

* Apply suggestions from code review

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* fix other mentions of "spot jobs"

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* [Azure] Support fractional A10 instance types (#3877)

* fix

* change catalog to float gpu num

* support print float point gpu in sky launch. TODO: test if the ray deployment group works for fractional one

* fix unittest

* format

* patch ray resources to ceil value

* support launch from --gpus A10

* only allow strictly match fractional gpu counts

* address comment

* change back condition

* fix

* apply suggestions from code review

* fix

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* format

* fix display of fuzzy candidates

* fix precision issue

* fix num gpu required

* refactor in check_resources_fit_cluster

* change type annotation of acc_count

* enable fuzzy fp acc count

* fix k8s

* Update sky/clouds/service_catalog/common.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* fix integer gpus

* format

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* [Jobs] Refactor: Extract task failure state update helper (#4185)

refactor: a unified exception handling utility

* [Core] Remove backward compatibility code for 0.6.0 & 0.7.0 (#4175)

* [Core] Remove backward compatibility code for 0.6.0

* remove backwards compatibility for 0.7.0 release

* Update sky/serve/serve_state.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

* remove more

* Revert "remove more"

This reverts commit 34c28e9.

* remove more but not instance tags

---------

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

* Remove outdated pylint disabling comments (#4196)

Update cloud_vm_ray_backend.py

* [test] update default clouds for smoke tests (#4182)

* [k8s] Show all kubernetes clusters in optimizer table (#4013)

* Show all kubernetes clusters in optimizer table

* format

* Add comment

* [Azure] Allow resource group specifiation for Azure instance provisioning (#3764)

* Allow resource group specifiation for Azure instance provisioning

* Add 'use_external_resource_group' under provider config

* nit

* attached resources deletion

* support deployment removal when terminating

* nit

* delete RoleAssignment when terminating

* update ARM config template

* nit

* nit

* delete role assignment with guid

* update role assignment removal logic

* Separate resource group region and VM, attached resources

* nit

* nit

* nit

* nit

* add error handling for deletion

* format

* deployment naming update

* test

* nit

* update deployment constant names

* update open_ports to wait for the nsg creation corresponding to the VM being provisioned

* format

* nit

* format

* update docstring

* add back deleted snippet

* format

* delete nic with retries

* error handle update

* [dev] restrict pylint to changed files (#4184)

* [dev] restrict pylint to changed files

* fix glob

* avoid use of xargs -d

* Update packer scripts (#4203)

* Update custom image packer script to exclude .sky and include python sys packages

* add comments

* Upgrade Azure SDK version requirement (#4204)

* [Jobs] Add option to specify `max_restarts_on_errors` (#4169)

* Add option to specify `max_retry_on_failure`

* fix recover counts

* fix log streaming

* fix docs

* fix

* fix

* fix

* fix

* fix default value

* Fix spinner

* Add unit test for default strategy

* fix test

* format

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* rename to restarts

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* update docs

* warning instead of error out

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

* rename

* add comment

* fix

* rename

* Update sky/execution.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

* Update sky/execution.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

* address comments

* format

* commit changes for docs

* Format

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

* [Core] Fix job race condition. (#4193)

* [Core] Fix job race condition.

* fix

* simplify url

* change to list_jobs

* upd ray comments

* only store jobs in ray_id_set

* [Core] Fix issue with the wrong path of setup logs (#4209)

* fix issue with a getting setup logs

* More conservative

* print error

* comment

* [Jobs] Fix jobs name (#4213)

* fix issue with a getting setup logs

* More conservative

* print error

* comment

* Fix job name

* [Performance] Speed up Azure A10 instance creation (#4205)

* Use date instead of timestamp in skypilot image names

* Speed up Azure A10 VM creation

* disable nouveau and use smaller instance

* address comments

* address comments

* add todo

* [Tests] Fix public bucket tests (#4216)

fix

* [Catalog] Add TPU V6e. (#4218)

* [Catalog] Add TPU V6e.

* swap if else branch

* [test] smoke test fixes for managed jobs (#4217)

* [test] don't wait for old pending jobs controller messages

`sky jobs queue` used to output a temporary "waiting" message while the managed
jobs controller was still being provisioned/starting. Since #3288 this is not
shown, and instead the queued jobs themselves will show PENDING/STARTING.

This also requires some changes to tests to permit the PENDING and STARTING
states for managed jobs.

* fix default aws region

* [test] wait for RECOVERING more quickly

Smoke tests were failing because some managed jobs were fulling recovering back
to the RUNNING state before the smoke test could catch the RECOVERING case (see
e.g. #4192 `test_managed_jobs_cancellation_gcp`). Change tests that manually
terminate a managed job instance, so that they will wait for the managed job to
change away from the RUNNING state, checking every 10s.

* address PR comments

* fix

* Add user toolkits to all sky custom images and fix PyTorch issue on A10 (#4219)

* Add user toolkits to all sky custom images

* address comments

* [Core] Support TPU v6 (#4220)

* init

* fix

* nit

* format

* add readme

* add inference example

* nit

* add multi-host training

* rephrase catalog doc

* Update examples/tpu/v6e/README.md

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* [Core] Make home address replacement more robust (#4227)

* Make home address replacement more robust

* format

* [UX] sky launch --fast (#4159)

* [UX] skip provisioning stages if cluster is already available

* add new --skip-setup flag and further limit stages to match sky exec

* rename flag to --fast

* add smoke test for sky launch --fast

* changes stages for --fast

* fix --fast help message

* add api test for fast param (outside CLI)

* lint

* explicitly specify stages

* [Docs] Tpu v6 docs (#4221)

* Update TPU v6 docs

* tpu v6 docs

* add TPU v6

* update

* Fix tpu docs

* fix indents

* restructure TPU doc

* Fix

* Fix

* fix

* Fix TPU

* fix docs

* Update docs/source/reference/tpu.rst

Co-authored-by: Tian Xia <cblmemo@gmail.com>

---------

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* [ux] add sky jobs launch --fast (#4231)

* [ux] add sky jobs launch --fast

This flag will make the jobs controller launch use `sky launch --fast`. There
are a few known situations where this can cause misbehavior in the jobs
controller:
- The SkyPilot wheel is outdated (due to changes in the SkyPilot code or a
  version upgrade).
- The user's cloud credentials have changed. In this case the new credentials
  will not be synced, and if there are new clouds available in `sky check`, the
  cloud dependencies may not be correctly installed.

However, this does speed up `jobs launch` _significantly_, so provide it as a
dangerous option. Soon we will add robustness checks to `sky launch --fast` that
will fix the above caveats, and we can remove this flag and just enable the
behavior by default.
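The wheel-staleness caveat suggests the shape of the robustness check mentioned as future work: compare a digest of the local wheel against the one recorded on the controller. This is purely a sketch; the function and parameter names are hypothetical.

```python
import hashlib
import pathlib


def wheel_unchanged(local_wheel: pathlib.Path, controller_hash: str) -> bool:
    """Return True if the local wheel matches the digest recorded on the
    controller; --fast would only be safe to skip re-sync when they match."""
    local_hash = hashlib.sha256(local_wheel.read_bytes()).hexdigest()
    return local_hash == controller_hash
```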

* Apply suggestions from code review

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* fix lint

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* [UX] Show 0.25 on controller queue (#4230)

* Show 0.25 on controller queue

* format

* [Storage] Avoid opt-in regions for S3 (#4239)

* S3 fix + timeout

* S3 fix + timeout

* lint

* Update K8s docker image build and the source artifact registry (#4224)

* Attempt at improving performance of k8s cluster launch

* remove conda env creation

* add multiple regions

* K8s sky launch pulls the new docker images

* Move k8s script

* use us region only

* typo

* Remove --system-site-packages when setting up sky cluster (#4168)

* Remove --system-site-packages when setting up sky cluster

* add comments

* [AWS/Azure] Avoid erroring out during image size check (#4244)

* Avoid erroring out during image size check

* Avoid error for azure

* lint

* [AWS] Disable additional auto update services for ubuntu image with cloud-init (#4252)

* Disable additional auto update services for ubuntu image

* simplify the commands

* [Dashboard] Add a simple status filter. (#4253)

* Disable more potential unattended upgrade sources for AWS (#4246)

* Fix AWS unattended upgrade issue

* more commands

* add retry and disable all unattended

* remove retry

* disable unattended upgrades and add retry in aws default image

* [docs]: OCI key_file path clarification (#4262)

* [docs]: OCI key_file path clarification

* Update installation.rst

* [k8s] Parallelize setup for faster multi-node provisioning (#4240)

* parallelize setup

* lint

* Add retries

* lint

* retry for get_remote_home_dir

* optimize privilege check

* parallelize termination

* increase num threads

* comments

* lint
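The per-node parallelization above amounts to fanning the setup work out over a thread pool instead of looping serially; a minimal sketch (function names are illustrative, the real code lives in the k8s provisioner):

```python
from concurrent.futures import ThreadPoolExecutor


def setup_nodes(nodes, setup_one, max_workers=32):
    """Run per-node setup concurrently.

    An exception raised by `setup_one` on any node propagates when the
    results are collected, so failures are not silently dropped.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(setup_one, nodes))
```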

* do not redirect stderr to /dev/null when submitting job (#4247)

* do not redirect stderr to /dev/null when submitting job

Should fix #4199.

* remove grep, add worker_maximum_startup_concurrency override

* [tests] Exclude runpod from smoke tests unless specified (#4238)

Add runpod

* Update comments pointing to Lambda's docs (#4272)

* [Core] Avoid PENDING job to be set to FAILED and speed up job scheduling (#4264)

* fix race condition for setting job status to FAILED during INIT

* Fix

* fix

* format

* Add smoke tests

* revert pending submit

* remove update entirely for the job schedule step

* wait for job 32 to finish

* fix smoke

* move and rename

* Add comment

* minor
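A sketch of how such a race can be avoided: make the FAILED transition conditional on the job still being in INIT, so a job the scheduler has already moved to PENDING is not clobbered. Table and status names here are illustrative, not SkyPilot's actual schema.

```python
import sqlite3


def mark_failed_if_init(conn: sqlite3.Connection, job_id: int) -> bool:
    """Atomically set FAILED only if the job is still INIT.

    The status check lives inside the UPDATE itself, so a concurrent
    transition to PENDING/RUNNING cannot be overwritten.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = 'FAILED' "
        "WHERE job_id = ? AND status = 'INIT'", (job_id,))
    return cur.rowcount == 1
```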

* Set minimum port number a Ray worker can listen on to 11002 (#4278)

Set worker minimum port number

* [docs] use k8s instead of kubernetes in the CLI (#4164)

* [docs] use k8s instead of kubernetes in the CLI

* fix docs build script for linux

* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* [jobs] autodown managed job clusters (#4267)

* [jobs] autodown managed job clusters

If all goes correctly, the managed job controller should tear down a managed job
cluster once the managed job completes. However, if the controller fails somehow
(e.g. crashes, is terminated, etc), we don't want to leak resources.

As a failsafe, set autodown on the job cluster. This is not foolproof, since the
skylet on the cluster can also crash, but it's likely to catch many cases.
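The failsafe amounts to adding autodown arguments when the controller launches the job cluster; a minimal sketch, where the idle timeout value and kwarg names are illustrative:

```python
def with_autodown_failsafe(launch_kwargs: dict,
                           idle_minutes: int = 70) -> dict:
    """Return launch kwargs with an autodown failsafe added.

    `down=True` terminates (rather than merely stops) the cluster after
    the idle timeout, so resources are reclaimed even if the controller
    crashes and never cleans up the cluster itself.
    """
    kwargs = dict(launch_kwargs)
    kwargs['idle_minutes_to_autostop'] = idle_minutes
    kwargs['down'] = True
    return kwargs
```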

* add comment about autodown duration

* add leading _

* [UX] Improve Formatting of Post Job Creation Logs (#4198)

* Update cloud_vm_ray_backend.py

* Update cloud_vm_ray_backend.py

* format

* Fix `stream_logs` Duplicate Job Handling and TypeError (#4274)

fix: multiple `job_id`

* Update sky/serve/load_balancer.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* feat(serve): Improve load balancing policy error message and display

1. Add available policies to schema validation
2. Show available policies in error message when invalid policy is specified
3. Display load balancing policy in service spec repr when explicitly set

* fix(serve): Update load balancing policy schema to match implemented policies

Only 'round_robin' is currently implemented in LoadBalancingPolicy class

* linting

* refactor(serve): Remove policy enum from schema

Move policy validation to code to avoid duplication and make it easier to maintain when adding new policies
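The validate-in-code approach can be sketched as below: one registry is the single source of truth, and the error message lists the available policies. The registry contents mirror the PR (only `round_robin` implemented), but the names are illustrative of, not copied from, `sky/serve/load_balancing_policies.py`.

```python
# Single source of truth for implemented policies.
AVAILABLE_POLICIES = {'round_robin'}


def validate_policy(name):
    """Validate in code instead of duplicating an enum in the JSON schema."""
    if name is None:
        return None  # Unset: the service spec falls back to the default.
    if name not in AVAILABLE_POLICIES:
        raise ValueError(f'Unknown load balancing policy {name!r}; '
                         f'available policies: {sorted(AVAILABLE_POLICIES)}')
    return name
```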

* fix

* linting

* Update sky/serve/service_spec.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Fix circular import in schemas.py by moving load_balancing_policies import inside function
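Moving an import inside a function is the standard way to break such a cycle: the module is bound at call time, after both modules have finished loading. A self-contained sketch, using stand-in modules rather than the real `schemas.py`/`load_balancing_policies`:

```python
import sys
import types

# Simulate the module that would complete the import cycle if imported at
# the top of schemas.py. `policies_mod` stands in for
# sky/serve/load_balancing_policies (names illustrative).
policies_mod = types.ModuleType('policies_mod')
policies_mod.AVAILABLE = ['round_robin']
sys.modules['policies_mod'] = policies_mod


def get_service_schema():
    # Deferred import: resolved when the function runs, not when this
    # module is first imported, so no cycle can form at load time.
    import policies_mod
    return {'load_balancing_policy': {'enum': policies_mod.AVAILABLE}}
```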

* linting

---------

Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Co-authored-by: Yika <yikaluo@assemblesys.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
Co-authored-by: landscapepainter <34902420+landscapepainter@users.noreply.github.com>
Co-authored-by: Hysun He <hysunhe@foxmail.com>
Co-authored-by: Cody Brownstein <105375373+cbrownstein-lambda@users.noreply.github.com>
Successfully merging this pull request may close these issues.

[Catalog] Special instance type on Azure only holds a fractional of GPU but tagged as one whole GPU in catalog