
🚨🚨🚨Move to using tags rather than latest for docker images and consolidate image repos 🚨 🚨🚨 #2554

Merged
merged 14 commits into from
Mar 18, 2024

Conversation

muellerzr
Collaborator

@muellerzr muellerzr commented Mar 14, 2024

What does this PR do?

Rather than having huggingface/accelerate-gpu, huggingface/accelerate-cpu, plus a new repository for each new env we add (e.g. soon to be XLA and XLA_gpu), this PR centralizes all our docker images in huggingface/accelerate and uses the right tags to make discovery and management easier.

This PR also includes a README in /docker which mentions what images are available and how to run things.

The Pros:

  • Easier discoverability of images based on tags, rather than needing to know the full repo
  • Easier interoperability and debugging, as images now have nightly tags associated with the date they were run
  • Allows for easier fine-grained tagging and build management later on (such as DeepSpeed-specific builds, XLA-specific builds, etc., which can now be pushed as a separate tag instead of needing to make a ton of new repos)

The Cons:

  • Downstream users using huggingface/accelerate-cpu or huggingface/accelerate-gpu will not receive updated images.

Not sure if there's a good way to alert users to the deprecation inside Docker, so we'll just need to do so thoroughly in comms.

The conversion:

  • huggingface/accelerate-gpu:latest -> huggingface/accelerate:gpu-nightly
  • huggingface/accelerate-cpu:latest -> huggingface/accelerate:cpu-nightly
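In practice the migration is a one-line change to whatever pulls the image. A minimal sketch (the loop just prints the new references; the `docker pull` lines in the comments assume the conversion above):

```shell
# Migration sketch: only the image reference changes.
# Before: docker pull huggingface/accelerate-gpu:latest
# After:  docker pull huggingface/accelerate:gpu-nightly
# Print the new consolidated-repo references for each flavor:
for flavor in gpu cpu; do
  echo "huggingface/accelerate:${flavor}-nightly"
done
```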

New image tags

Nightlies are now built with versioning in mind, so that individuals (or CIs) can roll back easily. This helps us track down whether a new dependency got released, when it happened, and what was new in the build.

  • huggingface/accelerate:{gpu,cpu}-nightly
  • huggingface/accelerate:{gpu,cpu}-nightly-{YYYY}-{MM}-{DD}
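For CI pinning, the dated tags can be composed mechanically. A sketch, assuming hypothetical `FLAVOR`/`DATE` variables and a known-good build date:

```shell
# Pin a job to a dated nightly so it can be rolled back deterministically.
# Tag format: huggingface/accelerate:{gpu,cpu}-nightly-YYYY-MM-DD
FLAVOR="gpu"
DATE="2024-03-14"    # a known-good build date (hypothetical example)
IMAGE="huggingface/accelerate:${FLAVOR}-nightly-${DATE}"
echo "${IMAGE}"
# docker pull "${IMAGE}"    # uncomment to actually pull the pinned image
```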

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ydshieh @BenjaminBossan

@muellerzr
Collaborator Author

New repo is located here: https://hub.docker.com/r/huggingface/accelerate/tags

Currently looking at migrating the README.md over to the new repo.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@ydshieh ydshieh left a comment


LGTM, thanks.

Huggingface/accelerate-gpu:latest -> huggingface/accelerate:gpu-latest

I think we don't have any tags containing -latest anymore.

Regarding how long to keep the images, I still think it's better to talk to (or at least inform) the infra team. I'm not sure how large an accelerate image is, but transformers images are around ~10G. (And I don't know if we will run into problems when we have hundreds of such images 😆)

@muellerzr
Collaborator Author

I think we don't have any tags containing -latest anymore.

Thanks, yup, adjusted: it's -nightly now.

Regarding how long to keep the images, I still think it's better to talk to (or at least inform) the infra team. I'm not sure how large an accelerate image is, but transformers images are around ~10G. (And I don't know if we will run into problems when we have hundreds of such images 😆)

Definitely the next step. Accelerate images on Docker Hub are also around 10 GB.

@muellerzr muellerzr changed the title Move to using tags rather than latest for docker images and consolidate image repos 🚨🚨🚨Move to using tags rather than latest for docker images and consolidate image repos 🚨 🚨🚨 Mar 14, 2024
Member

@BenjaminBossan BenjaminBossan left a comment


The reasons for these changes make sense to me. As mentioned, the question about retention should probably be addressed.

Regarding the deprecation, there doesn't seem to be an official way: docker/hub-feedback#907. Not sure what can be done, really, except for mentioning it in the release notes.

@muellerzr
Collaborator Author

Not sure what can be done, really, except for mentioning it in the release notes.

@BenjaminBossan I more or less plan on doing the following:

  1. The official README in the repos will be updated with a prominent deprecation notice and instructions on what to migrate to
  2. We'll publish one more release of the images that prints a statement on entry to the container saying the image is deprecated and users should migrate (NVIDIA does this, so it stands to reason we can too)
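The warning in step 2 could look something like the following entrypoint sketch (hypothetical script, not the actual implementation): it prints the notice to stderr on every container start and then hands off to the original command, similar to what NVIDIA's deprecated CUDA images do.

```shell
#!/bin/sh
# Hypothetical entrypoint for the final deprecated-image release:
# warn on every container start, then exec the requested command.
echo "DEPRECATION WARNING: huggingface/accelerate-gpu is no longer updated." >&2
echo "Please switch to huggingface/accelerate:gpu-nightly." >&2
exec "$@"
```

Writing to stderr keeps the warning out of piped stdout, and `exec "$@"` preserves the container's original command and signal handling.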

Next steps from here are more fine-grained builds as needed internally and externally

  • A DeepSpeed specific build (tag: accelerate:cuda-deepspeed)
  • A TransformerEngine + MS-AMP specific build (tag: accelerate:cuda-fp8)

@muellerzr muellerzr merged commit 2ad42e7 into main Mar 18, 2024
25 checks passed
@muellerzr muellerzr deleted the use-accelerate-repo branch March 18, 2024 13:35