Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Provisioner] Support docker in Lambda Cloud and TPU #4115

Merged
merged 6 commits into from
Oct 20, 2024
Merged

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Oct 18, 2024

Support docker as runtime on Lambda Cloud and TPU. TPU part is inherit from #3819.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --image-id docker:continuumio/miniconda3:4.12.0 --cloud lambda
    • sky launch --image-id docker:continuumio/miniconda3:4.12.0 --cloud azure
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cblmemo cblmemo marked this pull request as ready for review October 18, 2024 04:31
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the support @cblmemo! Just want to make sure that id -nG $USER | grep -qw docker can be successfully run on other clouds as well. Can we test with AWS CPU/GPU, GCP CPU/GPU/TPU to make sure it does not break existing services?

Comment on lines 200 to 203
# SkyPilot: Add the current user to the docker group
# if not already added.
self._run('id -nG $USER | grep -qw docker || '
'sudo usermod -aG docker $USER')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep this command running within other commands, e.g., _check_docker_installed, to reduce the ssh connection?

Copy link
Collaborator

@Michaelvll Michaelvll Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, just curious, how does it compare with using sudo docker in all places docker is called. Asking because had the same issue before running docker on TPU VMs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets merge this version first to unblock our user ;) filed an issue #4121 to keep track of this.

@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 18, 2024

Thanks for adding the support @cblmemo! Just want to make sure that id -nG $USER | grep -qw docker can be successfully run on other clouds as well. Can we test with AWS CPU/GPU, GCP CPU/GPU/TPU to make sure it does not break existing services?

Do you mean testing them with docker? As the codepath changed by this PR won't be touched by non-docker clusters. I'm a little bit confused as we does not support docker with TPU for now?

@Michaelvll
Copy link
Collaborator

Michaelvll commented Oct 18, 2024

Thanks for adding the support @cblmemo! Just want to make sure that id -nG $USER | grep -qw docker can be successfully run on other clouds as well. Can we test with AWS CPU/GPU, GCP CPU/GPU/TPU to make sure it does not break existing services?

Do you mean testing them with docker? As the codepath changed by this PR won't be touched by non-docker clusters. I'm a little bit confused as we does not support docker with TPU for now?

Yes, testing with docker as runtime. Right, docker on TPU was not supported, but IIRC, it was mainly blocked by the exact issue you fixed, see here: #3819

Maybe we can pick that PR up once this is merged.

@cblmemo cblmemo changed the title [Provisioner] Support docker in Lambda Cloud [Provisioner] Support docker in Lambda Cloud and TPU Oct 18, 2024
@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 18, 2024

@Michaelvll Just tested all and it works well. On Lambda it failed to recognize the nvidia runtime for some reason, so i manually patched the run options. PTAL again!

def _configure_runtime(self, run_options: List[str]) -> List[str]:
if self.docker_config.get('disable_automatic_runtime_detection'):
return run_options
runtime_output = (self._run(f'{self.docker_cmd} ' +
'info -f "{{.Runtimes}}"'))
if 'nvidia-container-runtime' in runtime_output:
try:
self._run('nvidia-smi', log_err_when_fail=False)
return run_options + ['--runtime=nvidia', '--gpus all']
except Exception as e: # pylint: disable=broad-except
logger.debug(
'Nvidia Container Runtime is present in the docker image'
'specified, but no GPUs found on the cluster. It should '
'still if the cluster is expected to have no GPU.\n'
f' Details for nvidia-smi: {e}')
return run_options
return run_options

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update and adding TPU docker support @cblmemo! LGTM.

sky/utils/command_runner.py Show resolved Hide resolved
@Michaelvll Michaelvll mentioned this pull request Oct 20, 2024
6 tasks
@cblmemo cblmemo added this pull request to the merge queue Oct 20, 2024
Merged via the queue into master with commit 63e96f4 Oct 20, 2024
20 checks passed
@cblmemo cblmemo deleted the lambda-docker branch October 20, 2024 20:43
cblmemo added a commit that referenced this pull request Oct 21, 2024
…t Flow Definition (#4067)

* provide an example, edited from pipeline.yml

* more focus on dependencies for user dag lib

* more powerful user interface

* load and dump new yaml format

* fix

* fix: reversed logic in add_edge

* [docs] Unroll k8s internal load balancer docs (#4083)

unroll load balancer docs

* rename

* refactor due to reviewer's comments

* generate task.name if not given

* [docs] `sky status --kubernetes` docs (#4064)

* observability docs

* comments

* [UX] Show log after failure and fix the color issue with narrow window (#4084)

* fix narrow window and show log path during exception

* format

* format

* [k8s] `sky status --k8s` refactor (#4079)

* refactor

* lint

* refactor, dataclass

* refactor, dataclass

* refactor

* lint

* add comments for add_edge

* add `print_exception_no_traceback` when raise

* make `Dag.tasks` a property

* print dependencies for `__repr__`

* move `get_unique_task_name` to common_utils

* [Performance] Use new GCP custom images (#4027)

* [Performance] Use new custom image to create GCP GPU VMs

* update image tags for both CPU and GPU

* always generate .sky/python_path

---------

Co-authored-by: Yika Luo <yikaluo@Yikas-MacBook-Pro.local>

* [GCP] Add H100 mega (#4099)

* Add H100 mega support on GCP

* fix for some other regions

* format

* fix resource type

* fix catalog fetching

* [GCP] Add gVNIC support (#4095)

* add gvnic support through config.yaml

* lint

* docs

* [Lambda] Lambda Cloud SkyPilot provisioner (#3865)

* feat: lambda cloud new provisioner

* feat: address cblmemo reviews and other reviews + make multi-node work again

* fix: quotes

* fix: address some reviews

* chore: rm unused option

* chore: update typedef

* feat: use lists directly

* fix: formatting

* chore: address reviews

* fix: formatting

* chore: rm query ports since default impl per review

* feat: add back query ports

* fix: formatting

* chore: add newline at eof

* feat: try removing query ports again

* [Docs] GKE Nvidia Driver installation instructions update (#4106)

* docs

* docs

* docs

* [Performance] Use new AWS custom images (#4091)

* rename methods to use downstream/edge terminology

* [Performance] Add Packer image generation scripts for GCP and AWS (#4068)

* [Performance] Add Packer image generation scripts for GCP and AWS

* Add docker install and tests

* solve nvidia container issue

* Install cuDNN

* [Performance] Scripts to copy/delete AWS images for all regions and add cloud deps (#4073)

* [Performance] Add AWS script to copy images for all regions

* script to delete all AWS images across regions

* Add cloud dependencies to image

---------

Co-authored-by: Yika Luo <yikaluo@Yikas-MacBook-Pro.local>

* Disable AWS images.csv refreshing (#4116)

* [Docs] .skyignore doc (#4114)

* [Docs] .skyignore doc

* Correct typos

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* [Core] Raise error for none existing cluster when endpoint is called (#4117)

raise error for none existing cluster

* Refresh local aws images.csv when image not found (#4127)

Refresh local aws images.csv by pulling from github catalog when image tag not found

* [Docs] News revamps. (#4126)

* News revamps.

updates

updates

updates

updates

updates

updates

updates

updates

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* [Serve] Support manually terminating a replica and with purge option (#4032)

* define replica id param in cli

* create endpoint on controller

* call controller endpoint to scale down replica

* add classmethod decorator

* add handler methods for readability in cli

* update docstr and error msg, and inline in cli

* update log and return err msg

* add docstr, catch and reraise err, add stopped and nonexistent message

* inline constant to avoid circular import

* fix error statement and return encoded str

* add purge feature

* add purge replica usage in docstr

* use .get to handle unexpected packages

* fix: diff terminate replica when failed/purging or not

* fix: stay up to date for `is_controller_accessible`

* revert

* up to date with current APIs

* error handling

* when purged remove record in the main loop

* refactor due to reviewer's suggestions

* combine functions

* fix: terminate the healthy replica even with purge option

* remove abbr

* Update sky/serve/core.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/serve/core.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/serve/controller.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/serve/controller.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/cli.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* got services hint

* check if not yes in the outside if branch

* fix some output messages

* Update sky/serve/core.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* set conflict status code for already scheduled termination

* combine purge and normal terminating down branch together

* bump version

* global exception handler to render a json response with error messages

* fix: use responses.JSONResponse for dict serialize

* error messages for old controller

* fix: check version mismatch in generated code

* revert mistakenly change update_service

* refine already in terminating message

* fix: branch code workaround in cls.build

* wording

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* refactor due to reviewer's comments

* fix use ux_utils

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* add changelog as comments

* fix messages

* edit the message for mismatch error

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* no traceback when raising in `terminate_replica`

* messages decode

* Apply suggestions from code review

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* format

* forma

* Empty commit

---------

Co-authored-by: David Tran <davidtran@Davids-MacBook-Pro-2.local>
Co-authored-by: David Tran <david.tran@datadoghq.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>

* [Provisioner] Support docker in Lambda Cloud and TPU (#4115)

* [Provisioner] Support docker in Lambda Cloud

* fix permission issue

* merge with check docker installed

* add tpu support & test

* patch lambda cloud

* add comment

* Apply suggestions from code review

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* change wording all to up/downstream style

* Add unique suffix to task names, fallback to timestamp if unnamed

* Unify handling of single and multiple tasks without dependencies

* Refactor tasks initialization: use list comprehension and fail fast

* Fix remove task dependency description: upstream, not downstream

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Remove duplicated `self.edges`, use nx api instead

* [Serve] Add `ux_utils.print_exception_no_traceback()` for cleaner error output (#4111)

* add `ux_utils.print_exception_no_traceback()` for cleaner error output

* Empty commit

* remove unnecessary with block

* Partially revert: Remove unnecessary `ux_utils.print_exception_no_traceback()` wrappers (#4130)

fix unnecessary with block for returning

* Revert "Add unique suffix to task names, fallback to timestamp if unnamed"

Otherwise, users can not refer to the task by name in the DAG.

This reverts commit 8486352.

* comment the checking used as upstream logic

* [examples] Deepspeed fixes + k8s support (#4124)

deepspeed kubernetes fixes

* Empty commit

* [OCI] Support more OS types in addition to ubuntu (#4080)

* Bug fix for sky config file path resolution.

* format

* [OCI] Bug fix for image_id in Task YAML

* [OCI]: Support more OS types (esp. oraclelinux) in addition to ubuntu.

* format

* Disable system firewall

* Bug fix for validation of the Marketplace images

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* variable/function naming

* address review comments: not to change the service_catalog api. call oci_catalog directly for get os type for a image.

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* address review comments

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Apply suggestions from code review

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* fix: typing.cast

* add TODOs for future function migration

* remove dependencies wording to reduce ambiguity

* temporarily add github actions

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: yika-luo <yikaluo@assemblesys.com>
Co-authored-by: Yika Luo <yikaluo@Yikas-MacBook-Pro.local>
Co-authored-by: Kote Mushegiani <mushegiani@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: David Tran <davidtran@Davids-MacBook-Pro-2.local>
Co-authored-by: David Tran <david.tran@datadoghq.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Hysun He <hysunhe@foxmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants