Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Re-)Enable Nightly + Past CI #22393

Merged
merged 5 commits into from
Mar 30, 2023
Merged

(Re-)Enable Nightly + Past CI #22393

merged 5 commits into from
Mar 30, 2023

Conversation

ydshieh
Copy link
Collaborator

@ydshieh ydshieh commented Mar 27, 2023

What does this PR do?

(Re-)Enable Nightly + Past CI

cc @stas00 : I don't think there is something (related to DeepSpeed) that really needs your review in this PR. But if you prefer, you can take a look the 2 Dockerfile files under docker (and more files if you want). Thank you.

p.s. I launched a full run (without TensorFlow past version CIs) here

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Mar 27, 2023

The documentation is not available anymore as the PR was closed or merged.

Copy link
Collaborator Author

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments that might be helpful

@@ -67,35 +67,6 @@ jobs:
push: true
tags: huggingface/transformers-all-latest-gpu-push-ci

latest-with-torch-nightly-docker:
name: "Nightly PyTorch + Stable TensorFlow"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved to a new file build-nightly-ci-docker-image.

@@ -153,34 +124,6 @@ jobs:
push: true
tags: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci

nightly-torch-deepspeed-docker:
name: "Nightly PyTorch + DeepSpeed"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same above: this is moved to a new file build-nightly-ci-docker-image.

workflow_call:
push:
branches:
- build_nightly_ci_docker_image*
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New workflow file to build nightly CI docker images.

Mainly to be triggered from self-nightly-past-ci-caller.yml via workflow_call event.

(we don't build images here in a daily basis for now - they are built only when nightly ci is triggered)

@@ -3,7 +3,7 @@ name: Build docker images (Past CI)
on:
push:
branches:
- past-ci-docker-image*
- build_past_ci_docker_image*
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to nightly CI docker image build above - but we don't need to use workflow_call event to build past CI docker images each time the past CI is triggered.

Build once is enough - but sometimes we might want/need to update them manually (via push event).

framework_version: ${{ matrix.version }}
run: |
echo "base_image=$(python3 -c 'import os; from utils.past_ci_versions import past_versions_testing; base_image = past_versions_testing["pytorch"][os.environ["framework_version"]]["base_image"]; print(base_image)')" >> $GITHUB_OUTPUT
-
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

past CI docker image building is a bit special - different base docker images are required for different torch/tensorflow versions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is renamed to self-nightly-past-ci-caller.yml

path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}

run_all_tests_torch_cuda_extensions_gpu:
name: Torch CUDA extension tests
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add DeepSpeed job to past CI workflow

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main change to this file is about adding some deepspeed stuff.

#
## install torch_tensorrt (fx path)
#RUN git clone https://github.com/pytorch/TensorRT.git
#RUN cd TensorRT/py && python3 setup.py install --fx-only
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not bother by all these 3rd party libraries for now. We can iterate later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopt necessary changes to make the nightly-past-ci workflow could report correctly

@ydshieh ydshieh marked this pull request as ready for review March 27, 2023 11:54
@ydshieh ydshieh requested a review from LysandreJik March 27, 2023 11:55
push:
branches:
- run_nightly_ci*
- run_past_ci*
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to measure how long one round of workflow run will take, and add a schedule even here with proper interval.

@stas00
Copy link
Contributor

stas00 commented Mar 27, 2023

The DS part looks good, @ydshieh

I wonder if you want to continue testing torchdynamo at all. Users wanting to use it should be encouraged to move to torch>=2.0 instead, where it's built in. But a subject for a different PR I guess.

@ydshieh
Copy link
Collaborator Author

ydshieh commented Mar 27, 2023

The DS part looks good, @ydshieh

I wonder if you want to continue testing torchdynamo at all. Users wanting to use it should be encouraged to move to torch>=2.0 instead, where it's built in. But a subject for a different PR I guess.

From my side, it would be great if I don't have to deal with all the potential (installation/runtime) issues for such 3rd party libraries across with different torch versions (at least, not with previous torch versions). It's best to focus on the torch and torch+DeepSpeed testing results.

@stas00
Copy link
Contributor

stas00 commented Mar 27, 2023

oh, I meant not testing torchdynamo in general transformers-wide. For sure you don't need any unrelated packages installed to test deepspeed, other its own deps.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @ydshieh! Let's give it a test run and settle on a schedule.

Comment on lines +20 to +27
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is quite a lot of cleanup, where is this coming from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disk full error occurred when I tried to install torch 2.0.0 (pre) while the official version is not out yet. It's probably no longer necessary after the official version is released.

  • Not really a lot of cleanup: there are only 2. Other commands are just to print information.
  • The only useful cleanup is /usr/local/lib/android which is about 12 GB.
  • (The disk full error is due to torch was trying to find/install several versions that would match some version requirements)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this important to keep? Every line of code in yml file is going to be copy/pasted across several others as work is done in these files/as new files pop up, so I'd be wary of adding non-necessary commands in the files :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check if it's still necessary after the official torch 2.0.0 was released - and remove this part if possible :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LysandreJik Unfortunately, we still have the disk full issue. (See this job run page)

Screenshot 2023-03-30 182830

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this is already .github/workflows/build-docker-images.yml on the main branch. Here is just the same code for the new file .github/workflows/build-nightly-ci-docker-images.yml.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let's keep it then!

ARG BASE_DOCKER_IMAGE="nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04"
ARG BASE_DOCKER_IMAGE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this change do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for past CI - different versions of torch and tensorflow CI require different base docker images.
It's doesn't make a lot of sense to define a default BASE_DOCKER_IMAGE in the past CI docker file.

I move them to utils/past_ci_versions.py. This information will be fetched in .github/workflows/build-past-ci-docker-images.yml, and pass to the docker files via

build-args: |
  REF=main
  BASE_DOCKER_IMAGE=${{ steps.get-base-image.outputs.base_image }}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, cool!

@ydshieh
Copy link
Collaborator Author

ydshieh commented Mar 30, 2023

Without TensorFlow Past CI - it takes 2.5 days to run the Nightly CI + PyTorch Past CI.
I put the schedule to trigger the workflow on Sunday and Thursday at 2 AM.

The TensorFlow past CI will only run under push events.

@ydshieh ydshieh merged commit 0fe6c6b into main Mar 30, 2023
@ydshieh ydshieh deleted the enable_backward_forward_ci branch March 30, 2023 19:06
raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023
* Enable Nightly + Past CI

* put schedule

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* Enable Nightly + Past CI

* put schedule

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants