SkyPilot v0.6.0
SkyPilot v0.6.0: Jobs API, SkyServe on Kubernetes, Spot + On-demand mixing, Paperspace support and more!
We are excited to release SkyPilot v0.6.0! This release includes a number of new features:
- Managed Jobs for job execution and recovery
- SkyServe and Jobs on Kubernetes
- Mix on-demand and spot instances in SkyServe
- New cloud: Paperspace
Release Highlights
Managed Jobs
- The spot controller has been enhanced to support any job on on-demand or spot instances.
  - To use, run `sky jobs launch` instead of `sky spot launch`.
- The new job controller can automatically recover jobs from any spot preemptions or hardware failures, and also execute pipelines of jobs.
- The `sky jobs` API is identical to the `sky spot` API, but also supports on-demand instances.
SkyServe and Jobs on Kubernetes
- SkyPilot can now run SkyServe and Managed Job controllers on Kubernetes.
  - This means you can now run your SkyServe and Managed Jobs on your Kubernetes cluster!
- Simply run `sky jobs launch` or `sky serve up`, and SkyPilot will automatically deploy the controller on your Kubernetes cluster if available and run jobs on the cheapest available location.
Mix on-demand and spot instances in SkyServe
- SkyServe now supports a new intelligent policy for mixing spot and on-demand instances. Example.
- Uses on-demand instances to ensure availability and spot instances to save costs.
- Dynamically falls back to on-demand replicas when spot replicas are not available. Example.
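As a rough illustration of the two behaviors above, the sketch below splits a target replica count between spot and on-demand capacity. It is a toy model under stated assumptions, not SkyServe's actual policy: the function `plan_replicas` and its parameters (`base_ondemand`, `spot_available`, `dynamic_fallback`) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaPlan:
    spot: int
    ondemand: int


def plan_replicas(target: int, base_ondemand: int, spot_available: int,
                  dynamic_fallback: bool = True) -> ReplicaPlan:
    """Split `target` replicas between spot and on-demand (toy model).

    All names here are hypothetical; this is not SkyServe's implementation.
    """
    if min(target, base_ondemand, spot_available) < 0:
        raise ValueError("all counts must be non-negative")
    # Always-on on-demand replicas that guarantee baseline availability.
    ondemand = min(base_ondemand, target)
    # Everything else is requested as cheaper spot capacity.
    spot_wanted = target - ondemand
    spot = min(spot_wanted, spot_available)
    # Cover the spot shortfall with on-demand replicas when fallback is on.
    if dynamic_fallback:
        ondemand += spot_wanted - spot
    return ReplicaPlan(spot=spot, ondemand=ondemand)


if __name__ == "__main__":
    # Plenty of spot capacity: only the on-demand base is paid at full price.
    print(plan_replicas(target=4, base_ondemand=1, spot_available=3))
    # Spot shortage: the missing replicas fall back to on-demand.
    print(plan_replicas(target=4, base_ondemand=1, spot_available=1))
```

With four target replicas and one base on-demand replica, the plan uses three spot replicas when spot capacity is plentiful, and shifts two of them to on-demand when only one spot replica can be obtained.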
Paperspace support
- Newest cloud to join the Sky: Paperspace!
- Paperspace offers the latest GPUs including H100 and A100-80GB for AI training and inference.
- Simply add your Paperspace API key to `~/.paperspace/config.json` and run `sky check paperspace` to get started.
- Big thanks to @asaiacai for contributing Paperspace support!
More LLMs and Recipes
Deprecation Notes
The following features have been deprecated and will be removed in the next minor release:
- `sky spot` CLI: use `sky jobs` CLI instead.
- `core.spot_xxx` APIs: refactored to `jobs.xxx`.
- `qps_lower_threshold` and `auto_restart` in `service`: use `target_qps_per_replica` instead.
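For the CLI deprecation, existing scripts mostly need a mechanical rename. The helper below is a minimal sketch of that rewrite; `migrate_cli_line` is a hypothetical name, and the sketch assumes subcommands and flags carry over unchanged, as the release notes state for the `sky jobs` API. It does not touch Python API calls or `service` fields, which need manual changes.

```python
import re

# `sky spot` as whole words; any run of whitespace between them is preserved.
_SKY_SPOT = re.compile(r"\bsky(\s+)spot\b")


def migrate_cli_line(line: str) -> str:
    """Rewrite a deprecated `sky spot ...` invocation to `sky jobs ...`.

    Hypothetical helper for illustration; it is not part of SkyPilot.
    """
    return _SKY_SPOT.sub(r"sky\1jobs", line)


if __name__ == "__main__":
    for old in ("sky spot launch -n train task.yaml",
                "sky spot queue",
                "sky launch task.yaml"):
        print(migrate_cli_line(old))
```

Lines that do not invoke `sky spot` pass through unchanged, so the helper can be applied to a whole script line by line.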
Changelog
Managed Jobs
- Changes made to the local catalog at `~/.sky/catalog` are now reflected on the controller (#3289)
- The name of the spot job is now included in the `SKYPILOT_TASK_ID` environment variable (#3424)
- Legacy spot job APIs have been refactored from `core.spot_xxx` to `jobs.xxx` (#3417)
- Cloud for the controller is now chosen based on the resources of the replicas (#3363)
- Bug fixes (#3302, #3397, #3459, #3468, #3480)
SkyServe
New Features
- New intelligent policy for mixing spot and on-demand instances in SkyServe (#3194)
- SkyServe now uses a proxy instead of HTTP redirect responses for better performance (#3395)
- Readiness probe now supports headers: this is useful for authentication or other headers required for readiness checks (#3552)
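To show why headers matter for readiness checks, here is a minimal, generic sketch of a header-aware probe written with the Python standard library. It is not SkyServe's probe implementation: `probe_ready` and its parameters are hypothetical, and treating any 2xx response as "ready" is an assumption of this example.

```python
import urllib.error
import urllib.request
from typing import Mapping, Optional


def probe_ready(url: str, headers: Optional[Mapping[str, str]] = None,
                timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status.

    Hypothetical helper: endpoints that require authentication reject
    a probe sent without the right headers, so it would report False.
    """
    request = urllib.request.Request(url, headers=dict(headers or {}))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors (e.g. 401), refused connections, and timeouts.
        return False
```

A probe against an endpoint that checks an `Authorization` header would be called as `probe_ready(url, {"Authorization": "Bearer <token>"})`; without the header the same endpoint would answer 401 and the replica would look unhealthy.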
Enhancements
- Optimization: replicas are reused when only the `service` section is changed (#3214)
- Rolling updates are now the default behavior for SkyServe (#3249)
- Controller cloud is now chosen from replica resources if it is not already up (#3231)
- Bug fixes and API improvements (#3257, #3299, #3303, #3411, #3546)
Kubernetes
- Kubernetes clusters can now run SkyServe and Managed Jobs (#3377, #3524, #3521)
- `sky show-gpus` now shows real-time availability of GPUs in the cluster (#3499)
- Autoscaling Kubernetes clusters are now supported: SkyPilot can now wait for GKE node pools, Karpenter and other autoscalers to provision nodes (#3513, #3415)
- Use Kubernetes service accounts by specifying `remote_identity` in `~/.sky/config.yaml` (#3377, #3527)
- `sky local up` now also automatically installs the Nginx Ingress Controller (#3223)
- Support for specifying custom pod configurations with `pod_config` (#3244)
  - Use this to modify the pod configuration for your environment, e.g., attaching volumes, specifying imagePullSecrets, increasing /dev/shm size limit, setting `HTTP_PROXY` and more! See example `pod_config` here.
- Support for specifying custom metadata to all Kubernetes resources created by SkyPilot (#3333)
  - Useful for tracking resources created by SkyPilot in your Kubernetes cluster.
- Support for PodIP mode for exposing ports (#3445)
Enhancements
- GPU Isolation: SkyPilot no longer uses privileged containers and pods can no longer use GPUs not allocated to them (#3443)
- Ingress creation requests are now batched to minimize nginx reloads and ingress paths are namespaced (#3263, #3373)
- All SkyPilot pods are now labelled with `skypilot-user` to identify the owner of the pod (#3576)
- Special characters in environment variables are now correctly parsed (#3322)
- GPU labelling is now more robust (#3274)
- Bug fixes and quality of life improvements (#3266, #3392, #3439, #3509, #3524, #3525, #3532, #3563, #3578, #3374)
CLI & Core interfaces
New Features
- `resources` now supports a `labels` field to set labels (instance tags on AWS, labels on GCP and Kubernetes) on cloud resources (#3464, #3505)
- `sky check` now supports checking credentials for specific clouds, e.g. `sky check aws gcp` (#3229)
  - You can also restrict which clouds are checked by setting `allowed_clouds` in `~/.sky/config.yaml`. (#3556)
- `any_of` or `ordered` fields in `resources` can now have clouds that are not enabled (#3567)
- A new environment variable `SKYPILOT_CLUSTER_INFO`, containing cluster name, cloud, region and zone, is now available in all tasks (#3424)
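A task can read the new variable at runtime to tag logs or pick region-specific settings. The sketch below assumes the value is a JSON object with the keys `cluster_name`, `cloud`, `region` and `zone`; that encoding and those key names are assumptions of this example, not something these notes specify, and `read_cluster_info` is a hypothetical helper.

```python
import json
import os
from typing import Mapping, Optional


def read_cluster_info(
        environ: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Parse SKYPILOT_CLUSTER_INFO, or return None when it is not set.

    Assumes the variable holds a JSON object (an assumption of this sketch).
    """
    raw = environ.get("SKYPILOT_CLUSTER_INFO")
    if raw is None:
        return None
    return json.loads(raw)


if __name__ == "__main__":
    info = read_cluster_info()
    if info is None:
        print("Not running inside a SkyPilot task.")
    else:
        # Key names below are assumed, so fall back gracefully if absent.
        print("cloud:", info.get("cloud"), "region:", info.get("region"))
```

Passing the environment as a parameter keeps the helper easy to exercise outside a cluster, for example with a plain dictionary in a unit test.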
Enhancements
- Optimizer is up to 10x faster when multiple resources are specified (#3567)
- Autostop timer is now reset at the start of a new sky launch to avoid unexpected autostops (#3205)
- GCP GPUs now include `DEVICE_MEM` in `sky show-gpus` (#3375)
- Better sorting for `sky show-gpus` (#3492)
- Handling for usernames containing invalid characters (#3528)
- Null environment variables now raise an error (#3557)
Runtime & Backend
- SkyPilot now supports Python 3.11 (#3248)
- SkyPilot runtime is now isolated from any environment changes made by user code (#3575, #3326, #3339)
- Fix for jobs and services running longer than 12 days (#3460)
- Docker runtime fixes and enhancements, including fix for storage mounting in container (#3450, #3436, #3481, #3343)
- Bug fixes and optimizations (#3280, #3292, #3178, #3386, #3407, #3423, #3368, #3457, #3469, #3482, #3495, #3512, #3536, #3568)
Optimizations
- Lazy imports for 2x faster import times (#3394, #3463)
- Faster setup and job submission (#3523, #3484)
Cloud: GCP
Cloud: Azure
- Custom images are now supported on Azure. Simply specify `image_id` in the `resources` field. (#3362)
- 8x faster autostop for Azure (#3519)
- Fix GPUs not being detected in Azure (#3313)
- Provisioning fixes (#3483)
Cloud: AWS
- Fine-grained IAM roles: you can now specify IAM roles on a per-resource basis (#3488, #3514)
- SkyPilot can now be run in ECS containers by assuming `container-role` IAM roles (#3503)
- SkyPilot will not delete user-specified security groups (#3402)
Cloud: Fluidstack
- H100 and A100 NVLink support for Fluidstack (#3467)
- Opening ports is now supported for Fluidstack (#3294)
- Bug fixes (#3254, #3265)
Other Clouds
- Bug fixes for Lambda provisioning and termination (#3409, #3410)
- Multi-GPU fixes for RunPod (#3291)
- Cudo: handle missing project errors (#3438)
Thanks to all contributors!
New contributors: @MysteryManav, @JGSweets, @Harthgar, @mjkanji
Many thanks to everyone who contributed to this release!
Contributors: @Michaelvll, @romilbhardwaj, @concretevitamin, @cblmemo, @MaoZiming, @shethhriday29, @asaiacai, @JGSweets, @mjkanji, @MysteryManav, @landscapepainter, @Harthgar, @mjibril, @dtran24, @fozziethebeat, @JungleCatSW
Full Changelog: v0.5.0...v0.6.0