-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Re-)Enable Nightly + Past CI #22393
Conversation
The documentation is not available anymore as the PR was closed or merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments that might be helpful
@@ -67,35 +67,6 @@ jobs: | |||
push: true | |||
tags: huggingface/transformers-all-latest-gpu-push-ci | |||
|
|||
latest-with-torch-nightly-docker: | |||
name: "Nightly PyTorch + Stable TensorFlow" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved to a new file build-nightly-ci-docker-image
.
@@ -153,34 +124,6 @@ jobs: | |||
push: true | |||
tags: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci | |||
|
|||
nightly-torch-deepspeed-docker: | |||
name: "Nightly PyTorch + DeepSpeed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same above: this is moved to a new file build-nightly-ci-docker-image
.
workflow_call: | ||
push: | ||
branches: | ||
- build_nightly_ci_docker_image* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New workflow file to build nightly CI docker images.
Mainly to be triggered from self-nightly-past-ci-caller.yml
via workflow_call
event.
(we don't build images here in a daily basis for now - they are built only when nightly ci is triggered)
@@ -3,7 +3,7 @@ name: Build docker images (Past CI) | |||
on: | |||
push: | |||
branches: | |||
- past-ci-docker-image* | |||
- build_past_ci_docker_image* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to nightly CI docker image build above - but we don't need to use workflow_call
event to build past CI docker images each time the past CI is triggered.
Build once is enough - but sometimes we might want/need to update them manually (via push event).
framework_version: ${{ matrix.version }} | ||
run: | | ||
echo "base_image=$(python3 -c 'import os; from utils.past_ci_versions import past_versions_testing; base_image = past_versions_testing["pytorch"][os.environ["framework_version"]]["base_image"]; print(base_image)')" >> $GITHUB_OUTPUT | ||
- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
past CI docker image building is a bit special - different base docker images are required for different torch/tensorflow versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is renamed to self-nightly-past-ci-caller.yml
path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} | ||
|
||
run_all_tests_torch_cuda_extensions_gpu: | ||
name: Torch CUDA extension tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add DeepSpeed
job to past CI workflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main change to this file is about adding some deepspeed stuff.
# | ||
## install torch_tensorrt (fx path) | ||
#RUN git clone https://github.com/pytorch/TensorRT.git | ||
#RUN cd TensorRT/py && python3 setup.py install --fx-only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not bother by all these 3rd party libraries for now. We can iterate later.
utils/notification_service.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adopt necessary changes to make the nightly-past-ci
workflow could report correctly
push: | ||
branches: | ||
- run_nightly_ci* | ||
- run_past_ci* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will need to measure how long one round of workflow run will take, and add a schedule even here with proper interval.
The DS part looks good, @ydshieh I wonder if you want to continue testing torchdynamo at all. Users wanting to use it should be encouraged to move to torch>=2.0 instead, where it's built in. But a subject for a different PR I guess. |
From my side, it would be great if I don't have to deal with all the potential (installation/runtime) issues for such 3rd party libraries across with different torch versions (at least, not with previous torch versions). It's best to focus on the torch and torch+DeepSpeed testing results. |
oh, I meant not testing torchdynamo in general transformers-wide. For sure you don't need any unrelated packages installed to test deepspeed, other its own deps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks @ydshieh! Let's give it a test run and settle on a schedule.
sudo ls -l /usr/local/lib/ | ||
sudo ls -l /usr/share/ | ||
sudo du -sh /usr/local/lib/ | ||
sudo du -sh /usr/share/ | ||
sudo rm -rf /usr/local/lib/android | ||
sudo rm -rf /usr/share/dotnet | ||
sudo du -sh /usr/local/lib/ | ||
sudo du -sh /usr/share/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is quite a lot of cleanup, where is this coming from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disk full error occurred when I tried to install torch 2.0.0 (pre) while the official version is not out yet. It's probably no longer necessary after the official version is released.
- Not really a lot of cleanup: there are only
2
. Other commands are just to print information. - The only useful cleanup is
/usr/local/lib/android
which is about 12 GB. - (The disk full error is due to torch was trying to find/install several versions that would match some version requirements)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this important to keep? Every line of code in yml file is going to be copy/pasted across several others as work is done in these files/as new files pop up, so I'd be wary of adding non-necessary commands in the files :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check if it's still necessary after the official torch 2.0.0 was released - and remove this part if possible :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LysandreJik Unfortunately, we still have the disk full issue. (See this job run page)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this is already .github/workflows/build-docker-images.yml
on the main
branch. Here is just the same code for the new file .github/workflows/build-nightly-ci-docker-images.yml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, let's keep it then!
ARG BASE_DOCKER_IMAGE="nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04" | ||
ARG BASE_DOCKER_IMAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this change do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for past CI - different versions of torch
and tensorflow
CI require different base docker images.
It's doesn't make a lot of sense to define a default BASE_DOCKER_IMAGE
in the past CI docker file.
I move them to utils/past_ci_versions.py
. This information will be fetched in .github/workflows/build-past-ci-docker-images.yml
, and pass to the docker files via
build-args: |
REF=main
BASE_DOCKER_IMAGE=${{ steps.get-base-image.outputs.base_image }}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, cool!
Without TensorFlow Past CI - it takes 2.5 days to run the Nightly CI + PyTorch Past CI. The TensorFlow past CI will only run under push events. |
* Enable Nightly + Past CI * put schedule --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Enable Nightly + Past CI * put schedule --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
(Re-)Enable Nightly + Past CI
cc @stas00 : I don't think there is something (related to
DeepSpeed
) that really needs your review in this PR. But if you prefer, you can take a look the 2Dockerfile
files underdocker
(and more files if you want). Thank you.p.s. I launched a full run (without TensorFlow past version CIs) here