Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deepspeed test to amd scheduled CI #27633

Merged
merged 35 commits into from
Dec 11, 2023
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
1e8ce66
add deepspeed scheduled test for amd
echarlaix Nov 21, 2023
bf276ed
fix image
echarlaix Nov 21, 2023
2cfb53d
add dockerfile
echarlaix Nov 23, 2023
5a9a529
add comment
echarlaix Nov 23, 2023
af46e87
enable tests
echarlaix Nov 23, 2023
c29d249
trigger
echarlaix Nov 27, 2023
a0c3daf
remove trigger for this branch
echarlaix Nov 27, 2023
4cb9d6f
trigger
echarlaix Nov 28, 2023
a703349
change runner env to trigger the docker build image test
echarlaix Nov 28, 2023
a47ac2c
use new docker image
echarlaix Nov 28, 2023
233bd7f
remove test suffix from docker image tag
echarlaix Nov 28, 2023
971ba80
replace test docker image with original image
echarlaix Nov 28, 2023
da4774c
push new image
echarlaix Nov 29, 2023
cbe995f
Trigger
echarlaix Nov 30, 2023
e16c271
add back amd tests
echarlaix Nov 30, 2023
70c3580
fix typo
echarlaix Nov 30, 2023
090b88e
add amd tests back
echarlaix Nov 30, 2023
508ae29
fix
echarlaix Nov 30, 2023
09fee9e
comment until docker image build scheduled test fix
echarlaix Nov 30, 2023
407cfe9
remove deprecated deepspeed build option
echarlaix Nov 30, 2023
f846b80
upgrade torch
echarlaix Nov 30, 2023
785b63a
update docker & make tests pass
Dec 4, 2023
ba8cc9f
Merge branch 'main' into run_amd_scheduled_ci_caller_deepspeed_test
fxmarty Dec 4, 2023
f0f931e
Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile
fxmarty Dec 5, 2023
40398b9
fix
echarlaix Dec 5, 2023
3332cd2
tmp disable test
echarlaix Dec 5, 2023
9696cc4
precompile deepspeed to avoid timeout during tests
echarlaix Dec 5, 2023
84a7a33
fix comment
echarlaix Dec 5, 2023
fc6d890
Merge branch 'main' into run_amd_scheduled_ci_caller_deepspeed_test
echarlaix Dec 5, 2023
df00cff
trigger deepspeed tests with new image
echarlaix Dec 5, 2023
92c402d
comment tests
echarlaix Dec 5, 2023
fa82a9c
trigger
echarlaix Dec 6, 2023
ecb9239
add sklearn dependency to fix slow tests
echarlaix Dec 6, 2023
cfcc312
enable back other tests
echarlaix Dec 6, 2023
ae82b3f
final update
ydshieh Dec 7, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 36 additions & 0 deletions .github/workflows/build-docker-images.yml
Original file line number Diff line number Diff line change
Expand Up @@ -271,3 +271,39 @@ jobs:
REF=main
push: true
tags: huggingface/transformers-tensorflow-gpu

# latest-pytorch-deepspeed-amd:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also disabled as

# Need to be fixed with the help from Guillaume.
cc @ydshieh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ydshieh is there something we need to do here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another issue that we can deal with outside this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But here we have to build the image manually.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just added one manually so that we can verify the deepspeed tests : echarlaix/amd-deepspeed-test

# name: "PyTorch + DeepSpeed (AMD) [dev]"

# runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210]
# steps:
# - name: Set up Docker Buildx
# uses: docker/setup-buildx-action@v3
# - name: Check out code
# uses: actions/checkout@v3
# - name: Login to DockerHub
# uses: docker/login-action@v3
# with:
# username: ${{ secrets.DOCKERHUB_USERNAME }}
# password: ${{ secrets.DOCKERHUB_PASSWORD }}
# - name: Build and push
# uses: docker/build-push-action@v5
# with:
# context: ./docker/transformers-pytorch-deepspeed-amd-gpu
# build-args: |
# REF=main
# push: true
# tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }}
# # Push CI images still need to be re-built daily
# -
# name: Build and push (for Push CI) in a daily basis
# # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
# # The later case is useful for manual image building for debugging purpose. Use another tag in this case!
# if: inputs.image_postfix != '-push-ci'
# uses: docker/build-push-action@v5
# with:
# context: ./docker/transformers-pytorch-deepspeed-amd-gpu
# build-args: |
# REF=main
# push: true
# tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci
2 changes: 1 addition & 1 deletion .github/workflows/self-scheduled-amd-caller.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ on:
- cron: "17 2 * * *"
push:
branches:
- run_amd_scheduled_ci_caller*
- run_amd_scheduled_ci_caller__*
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove this modification before merging (added to disable all the other AMD scheduled tests)


jobs:
run_amd_ci_mi210:
Expand Down
60 changes: 58 additions & 2 deletions .github/workflows/self-scheduled-amd.yml
Original file line number Diff line number Diff line change
Expand Up @@ -356,6 +356,62 @@ jobs:
name: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu
path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu

run_tests_torch_deepspeed_gpu:
name: Torch ROCm deepspeed tests
strategy:
fail-fast: false
matrix:
machine_type: [single-gpu, multi-gpu]

runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
needs: setup
container:
image: huggingface/transformers-pytorch-deepspeed-amd-gpu
options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add

      # To avoid unknown test failures
      - name: Pre build DeepSpeed *again*
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
          DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

to be the same as in other workflow file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure to understand why we need to uninstall and reinstall deepspeed here, what issue does it solve ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly, it has been one year or more ago. I can try to find from the history if you would like to have the information.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in our case we don't need it at the moment, so does it work if we keep it that way ? If you want me to uninstall / reinstall it in the tests, I can directly update and use huggingface/transformers-pytorch-amd-gpu and just install deepspeed in the tests directly (to avoid this step)

- name: ROCM-SMI
run: |
rocm-smi
- name: ROCM-INFO
run: |
rocminfo | grep "Agent" -A 14
- name: Show ROCR environment
run: |
echo "ROCR: $ROCR_VISIBLE_DEVICES"

- name: Environment
working-directory: /transformers
run: |
python3 utils/print_env.py

- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze

- name: Run all tests on GPU
working-directory: /transformers
run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_torch_deepspeed_gpu tests/deepspeed tests/extended

- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu/failures_short.txt

- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v3
with:
name: ${{ matrix.machine_type }}_run_tests_torch_deepspeed_gpu_test_reports
path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu

run_extract_warnings:
name: Extract warnings in CI artifacts
runs-on: ubuntu-22.04
Expand All @@ -368,7 +424,7 @@ jobs:
run_tests_multi_gpu,
run_examples_gpu,
run_pipelines_torch_gpu,
# run_all_tests_torch_cuda_extensions_gpu
run_tests_torch_deepspeed_gpu
]
steps:
- name: Checkout transformers
Expand Down Expand Up @@ -417,7 +473,7 @@ jobs:
run_tests_multi_gpu,
run_examples_gpu,
run_pipelines_torch_gpu,
# run_all_tests_torch_cuda_extensions_gpu,
run_tests_torch_deepspeed_gpu,
run_extract_warnings
]
steps:
Expand Down
26 changes: 26 additions & 0 deletions docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.0.1'
ARG ROCM='5.7'

RUN apt update && \
apt install -y --no-install-recommends libaio-dev git && \
apt clean && \
rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --no-cache-dir --upgrade pip

RUN python3 -m pip uninstall -y apex

ARG REF=main
WORKDIR /
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to have

RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1

or whatever equivalent for deepspeed in ROCM if necessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure we can pre-compile deepspeed for this set of ops, was just wondering whether we can keep it in jit mode so that all the machine compatible ops can be dynamically build at runtime

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jit mode will make some tests slower and potentially timeout, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't observe this when testing but added just in case : 9696cc4

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @echarlaix . The most important is to do this again at CI time, as mentioned in the comment below.

(It may or may not relevant now, but I never checked again. I keep both to just avoid potential issue popping up)


# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will (always) use the torch 2.0.1 which is not what we want. I suggest to

  • either based on docker/transformers-pytorch-amd-gpu/Dockerfile on current main

  • or not even to build an AMD+deepespeed image: just build+install deepspeed at CI runtime

  • otherwise, based on this image, but need to install the

    • ARG PYTORCH='2.1.0'
    • ARG TORCH_VISION='0.16.0'
    • ARG TORCH_AUDIO='2.1.0'

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ydshieh that makes sense, just upgraded the torch version in f846b80.
I cannot use docker/transformers-pytorch-amd-gpu/Dockerfile at the moment as there is an incompatibility with ROCm and deepspeed for some reason (coming from the adam extension), but can modify it and change its parent image if needed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, try to build the image + run it to make sure we are still good.
(more importantly, check the versions of installed torch etc. do give the expected versions - we do get some surprise sometimes :-) )

Regarding where to build the image, let's talk.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can update huggingface/transformers-pytorch-deepspeed-amd-gpu and trigger the tests again so that we can check everything is working as expected before merging. When running the tests locally (with the updated image so with torch==2.1.0+rocm5.6), the failing tests looked like the same as for the current CI

Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ RUN python3 -m pip uninstall -y torch-tensorrt

# recompile apex
RUN python3 -m pip uninstall -y apex
RUN git clone https://github.com/NVIDIA/apex
# RUN git clone https://github.com/NVIDIA/apex
# `MAX_JOBS=1` disables parallel building to avoid cpu memory OOM when building image on GitHub Action (standard) runners
# TODO: check if there is alternative way to install latest apex
# RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
Expand Down
Loading