(Re-)Enable Nightly + Past CI #22393

ydshieh · 2023-03-27T07:31:03Z

What does this PR do?

(Re-)Enable Nightly + Past CI

cc @stas00: I don't think there is anything (related to DeepSpeed) that really needs your review in this PR. But if you prefer, you can take a look at the two Dockerfiles under docker (and more files if you want). Thank you.

p.s. I launched a full run (without TensorFlow past version CIs) here

HuggingFaceDocBuilderDev · 2023-03-27T08:03:28Z

The documentation is not available anymore as the PR was closed or merged.

ydshieh

Left some comments that might be helpful

ydshieh · 2023-03-27T07:37:16Z

.github/workflows/build-docker-images.yml

@@ -67,35 +67,6 @@ jobs:
          push: true
          tags: huggingface/transformers-all-latest-gpu-push-ci

-  latest-with-torch-nightly-docker:
-    name: "Nightly PyTorch + Stable TensorFlow"


This is moved to a new file, build-nightly-ci-docker-images.yml.

ydshieh · 2023-03-27T07:37:36Z

.github/workflows/build-docker-images.yml

@@ -153,34 +124,6 @@ jobs:
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci

-  nightly-torch-deepspeed-docker:
-    name: "Nightly PyTorch + DeepSpeed"


Same as above: this is moved to the new file build-nightly-ci-docker-images.yml.

ydshieh · 2023-03-27T08:02:10Z

.github/workflows/build-nightly-ci-docker-images.yml

+  workflow_call:
+  push:
+    branches:
+      - build_nightly_ci_docker_image*


New workflow file to build nightly CI docker images.

Mainly to be triggered from self-nightly-past-ci-caller.yml via workflow_call event.

(We don't build images here on a daily basis for now; they are built only when the nightly CI is triggered.)

ydshieh · 2023-03-27T08:10:35Z

.github/workflows/build-past-ci-docker-images.yml

@@ -3,7 +3,7 @@ name: Build docker images (Past CI)
 on:
  push:
    branches:
-      - past-ci-docker-image*
+      - build_past_ci_docker_image*


Similar to the nightly CI docker image build above, but we don't need to use the workflow_call event to build past CI docker images each time the past CI is triggered.

Building once is enough, but sometimes we might want/need to update them manually (via a push event).
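To make the manual trigger concrete: the branch filter in this file ends in `*`, so pushing any branch whose name starts with `build_past_ci_docker_image` starts a rebuild. Below is a minimal Python sketch of that matching rule, assuming GitHub's documented behaviour that `*` matches any run of characters except `/`. The helper name `branch_matches` and the sample branch names are hypothetical, and only the `*` wildcard is modelled.

```python
import re


def branch_matches(pattern: str, branch: str) -> bool:
    """Return True if `branch` matches a filter pattern that uses only `*`.

    Assumption: `*` matches zero or more characters other than `/`.
    Other glob features (`**`, `?`, `!`, character ranges) are not handled.
    """
    # Escape the literal parts and join them with a "no slash" wildcard.
    regex = "[^/]*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, branch) is not None


# Branch filter taken from the diff above.
PATTERN = "build_past_ci_docker_image*"

if __name__ == "__main__":
    # Hypothetical branch names, for illustration only.
    for name in ["build_past_ci_docker_image", "build_past_ci_docker_image_torch", "main"]:
        print(name, branch_matches(PATTERN, name))
```

Under this rule the old prefix `past-ci-docker-image` no longer matches, so branches using the previous naming would not trigger a build.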

ydshieh · 2023-03-27T11:09:28Z

.github/workflows/build-past-ci-docker-images.yml

+          framework_version: ${{ matrix.version }}
+        run: |
+          echo "base_image=$(python3 -c 'import os; from utils.past_ci_versions import past_versions_testing; base_image = past_versions_testing["pytorch"][os.environ["framework_version"]]["base_image"]; print(base_image)')" >> $GITHUB_OUTPUT
+      -


Past CI docker image building is a bit special: different base docker images are required for different torch/tensorflow versions.
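The one-liner in the diff above can be hard to read, so here is a sketch of the same lookup written out in Python. It assumes `past_versions_testing` is a nested dict keyed by framework, then version, with a `"base_image"` entry, as the one-liner implies. The table contents, the helper names `get_base_image` and `write_step_output`, and the fallback version are hypothetical placeholders, not the real values from utils/past_ci_versions.py.

```python
import os

# Hypothetical stand-in for `past_versions_testing` in utils/past_ci_versions.py.
# Version keys and image names are placeholders, not the real configuration.
past_versions_testing = {
    "pytorch": {
        "1.13": {"base_image": "placeholder/cuda-base:torch-1.13"},
        "1.12": {"base_image": "placeholder/cuda-base:torch-1.12"},
    },
}


def get_base_image(framework: str, version: str, table=past_versions_testing) -> str:
    """Look up the base docker image configured for a framework version."""
    try:
        return table[framework][version]["base_image"]
    except KeyError as err:
        raise ValueError(f"no base image configured for {framework} {version}") from err


def write_step_output(name: str, value: str, path: str) -> None:
    """Append `name=value` to the file that GitHub Actions exposes as $GITHUB_OUTPUT."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


if __name__ == "__main__":
    # `framework_version` mirrors the env variable set in the workflow step.
    version = os.environ.get("framework_version", "1.13")
    image = get_base_image("pytorch", version)
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path:
        write_step_output("base_image", image, output_path)
    else:
        print(f"base_image={image}")
```

Raising on an unknown version makes a misconfigured matrix entry fail in the lookup step rather than later during the image build.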

ydshieh · 2023-03-27T11:47:54Z

.github/workflows/self-past-caller.yml

This file is renamed to self-nightly-past-ci-caller.yml

ydshieh · 2023-03-27T11:48:35Z

.github/workflows/self-past.yml

          path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}

+  run_all_tests_torch_cuda_extensions_gpu:
+    name: Torch CUDA extension tests


Add a DeepSpeed job to the past CI workflow.

ydshieh · 2023-03-27T11:49:36Z

docker/transformers-past-gpu/Dockerfile

The main change to this file adds DeepSpeed-related setup.

ydshieh · 2023-03-27T11:50:38Z

docker/transformers-pytorch-deepspeed-nightly-gpu/Dockerfile

+#
+## install torch_tensorrt (fx path)
+#RUN git clone https://github.com/pytorch/TensorRT.git
+#RUN cd TensorRT/py && python3 setup.py install --fx-only


Let's not bother with all these third-party libraries for now. We can iterate later.

ydshieh · 2023-03-27T11:52:51Z

utils/notification_service.py

Make the necessary changes so that the nightly-past-ci workflow can report correctly.

ydshieh · 2023-03-27T14:00:25Z

.github/workflows/self-nightly-past-ci-caller.yml

+  push:
+    branches:
+      - run_nightly_ci*
+      - run_past_ci*


I will need to measure how long one round of the workflow run takes, and add a schedule event here with a proper interval.

stas00 · 2023-03-27T16:19:54Z

The DS part looks good, @ydshieh

I wonder if you want to continue testing torchdynamo at all. Users wanting to use it should be encouraged to move to torch>=2.0 instead, where it's built in. But a subject for a different PR I guess.

ydshieh · 2023-03-27T16:26:22Z

> The DS part looks good, @ydshieh
>
> I wonder if you want to continue testing torchdynamo at all. Users wanting to use it should be encouraged to move to torch>=2.0 instead, where it's built in. But a subject for a different PR I guess.

From my side, it would be great if I didn't have to deal with all the potential (installation/runtime) issues for such third-party libraries across different torch versions (at least, not with previous torch versions). It's best to focus on the torch and torch+DeepSpeed testing results.

stas00 · 2023-03-27T16:40:31Z

Oh, I meant not testing torchdynamo in general, transformers-wide. For sure you don't need any unrelated packages installed to test deepspeed, other than its own deps.

LysandreJik

Great, thanks @ydshieh! Let's give it a test run and settle on a schedule.

LysandreJik · 2023-03-28T18:52:44Z

.github/workflows/build-nightly-ci-docker-images.yml

+          sudo ls -l /usr/local/lib/
+          sudo ls -l /usr/share/
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
+          sudo rm -rf /usr/local/lib/android
+          sudo rm -rf /usr/share/dotnet
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/


That is quite a lot of cleanup, where is this coming from?

ydshieh · 2023-03-29

A disk-full error occurred when I tried to install torch 2.0.0 (pre-release), before the official version was out. The cleanup is probably no longer necessary after the official release.

It's not really a lot of cleanup: there are only two removals. The other commands just print information.

The only useful cleanup is /usr/local/lib/android, which is about 12 GB.

(The disk-full error happened because the torch installation was trying to find/install several versions that would match some version requirements.)

LysandreJik · 2023-03-30

Is this important to keep? Every line of code in a yml file is going to be copy/pasted across several others as work is done in these files and as new files pop up, so I'd be wary of adding unnecessary commands to the files :)

ydshieh · 2023-03-30

I will check whether it's still necessary now that the official torch 2.0.0 is released, and remove this part if possible :-)

ydshieh · 2023-03-30

@LysandreJik Unfortunately, we still have the disk-full issue. (See this job run page)

ydshieh · 2023-03-30

BTW, this is already in .github/workflows/build-docker-images.yml on the main branch. Here it is just the same code for the new file .github/workflows/build-nightly-ci-docker-images.yml.

LysandreJik · 2023-03-30

Sounds good, let's keep it then!
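For readers who want to see what the cleanup step amounts to, here is a rough Python equivalent of the "measure, remove, measure again" pattern from the workflow snippet. It is only a sketch: the helper names `free_gib` and `remove_dirs` are hypothetical, and the demo runs against a throwaway temporary directory rather than the runner's real `/usr/local/lib/android` and `/usr/share/dotnet` paths (which the workflow removes with `sudo`).

```python
import shutil
import tempfile
from pathlib import Path


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def remove_dirs(paths) -> list:
    """Delete each directory that exists; return the ones actually removed."""
    removed = []
    for p in map(Path, paths):
        if p.is_dir():
            shutil.rmtree(p)
            removed.append(str(p))
    return removed


if __name__ == "__main__":
    # Demo on a temporary directory so that nothing real is deleted.
    with tempfile.TemporaryDirectory() as root:
        targets = [Path(root, "android"), Path(root, "dotnet")]
        targets[0].mkdir()  # only one of the two targets exists in this demo
        before = free_gib(root)
        removed = remove_dirs(targets)
        print(f"removed {removed}; free space {before:.1f} -> {free_gib(root):.1f} GiB")
```

Skipping paths that do not exist keeps the step harmless if a future runner image no longer ships those directories.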

LysandreJik · 2023-03-28T18:54:28Z

docker/transformers-past-gpu/Dockerfile

-ARG BASE_DOCKER_IMAGE="nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04"
+ARG BASE_DOCKER_IMAGE


What does this change do?

ydshieh · 2023-03-29

This is for past CI: CI for different versions of torch and tensorflow requires different base docker images.
It doesn't make a lot of sense to define a default BASE_DOCKER_IMAGE in the past CI Dockerfile.

I moved the base images to utils/past_ci_versions.py. This information is fetched in .github/workflows/build-past-ci-docker-images.yml and passed to the Dockerfile via

    build-args: |
      REF=main
      BASE_DOCKER_IMAGE=${{ steps.get-base-image.outputs.base_image }}

LysandreJik · 2023-03-30

Understood, cool!
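As an illustration of what passing the base image as a build argument looks like outside the workflow, here is a sketch that assembles an equivalent `docker build` command line in Python. The helper `docker_build_command`, the image tag, and the base image value are hypothetical. `--build-arg` and `--tag` are standard `docker build` options, and the build context directory is the one named in this thread.

```python
import shlex


def docker_build_command(context_dir: str, tag: str, base_image: str, ref: str = "main") -> list:
    """Return the argv for `docker build`, passing values via --build-arg."""
    if not base_image:
        # The Dockerfile no longer defines a default, so require an explicit value.
        raise ValueError("BASE_DOCKER_IMAGE must be provided")
    return [
        "docker", "build",
        "--build-arg", f"REF={ref}",
        "--build-arg", f"BASE_DOCKER_IMAGE={base_image}",
        "--tag", tag,
        context_dir,
    ]


if __name__ == "__main__":
    cmd = docker_build_command(
        context_dir="docker/transformers-past-gpu",
        tag="example/past-ci:demo",                     # hypothetical tag
        base_image="placeholder/cuda-base:torch-1.13",  # hypothetical base image
    )
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(cmd))
```

Failing early when no base image is given mirrors the intent of dropping the default: each past CI image has to state which base it is built on.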

ydshieh · 2023-03-30T12:11:27Z

Without the TensorFlow Past CI, it takes 2.5 days to run the Nightly CI + PyTorch Past CI.
I set the schedule to trigger the workflow on Sunday and Thursday at 2 AM.

The TensorFlow past CI will only run under push events.
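The choice of days can be checked with a little arithmetic: Sunday to Thursday is 4 days and Thursday to Sunday is 3 days, so both gaps are longer than the 2.5-day run. The Python sketch below verifies this. It assumes the schedule corresponds to a cron expression such as `0 2 * * 0,4` (the exact expression is not quoted in this thread; GitHub Actions evaluates cron schedules in UTC). The names `next_triggers` and `gaps` are hypothetical.

```python
from datetime import datetime, timedelta

RUN_DURATION = timedelta(days=2.5)  # run time quoted in the comment above
TRIGGER_WEEKDAYS = {6, 3}           # datetime.weekday(): Sunday=6, Thursday=3
TRIGGER_HOUR = 2                    # 2 AM


def next_triggers(start: datetime, count: int) -> list:
    """Return the next `count` trigger times strictly after `start`."""
    # Step through whole hours, beginning with the first full hour after `start`.
    t = start.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    found = []
    while len(found) < count:
        if t.weekday() in TRIGGER_WEEKDAYS and t.hour == TRIGGER_HOUR:
            found.append(t)
        t += timedelta(hours=1)
    return found


def gaps(triggers: list) -> list:
    """Time between consecutive triggers."""
    return [later - earlier for earlier, later in zip(triggers, triggers[1:])]


if __name__ == "__main__":
    runs = next_triggers(datetime(2023, 3, 30, 12, 0), 3)
    for gap in gaps(runs):
        print(gap, "longer than one run:", gap > RUN_DURATION)
```

With the shorter gap at 3 days, there is roughly half a day of slack before the next run starts, assuming the 2.5-day measurement holds.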

* Enable Nightly + Past CI
* put schedule

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

ydshieh added 2 commits March 27, 2023 08:35

Enable Nightly + Past CI

5d06322

Rename

54127c6

ydshieh added 2 commits March 27, 2023 10:08

update

5292dcd

update

91e73a8

ydshieh marked this pull request as ready for review March 27, 2023 11:54

ydshieh requested a review from LysandreJik March 27, 2023 11:55

LysandreJik approved these changes Mar 28, 2023

put schedule

0c1959c

ydshieh merged commit 0fe6c6b into main Mar 30, 2023

ydshieh deleted the enable_backward_forward_ci branch March 30, 2023 19:06

raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023

(Re-)Enable Nightly + Past CI (huggingface#22393)

7492a1e


This was referenced Apr 18, 2023

Fix Past CI not running against the latest main #22823

Merged

Install accelerate@main in PyTorch Past CI jobs. #22963

Merged

Fix DeepSpeed CI job link in Past CI #22967

Merged

ydshieh mentioned this pull request Jun 12, 2023

Add the number of model test failures to slack CI report #24207

Merged

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023

(Re-)Enable Nightly + Past CI (huggingface#22393)

d090fb9

