Add deepspeed test to amd scheduled CI #27633
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -5,7 +5,7 @@ on:
     - cron: "17 2 * * *"
   push:
     branches:
-      - run_amd_scheduled_ci_caller*
+      - run_amd_scheduled_ci_caller__*
Review comment: I will remove this modification before merging (it was added to disable all the other AMD scheduled tests).

 jobs:
   run_amd_ci_mi210:
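To make the effect of the narrower branch filter concrete, here is a minimal Python sketch. It approximates the GitHub Actions `*` filter with `fnmatch`, which is only a fair stand-in for branch names without a `/`. The two patterns come from the diff above; the branch names are hypothetical and chosen only for illustration.

```python
from fnmatch import fnmatchcase

# Patterns taken from the diff above.
OLD_PATTERN = "run_amd_scheduled_ci_caller*"
NEW_PATTERN = "run_amd_scheduled_ci_caller__*"


def triggers(pattern: str, branch: str) -> bool:
    """Approximate a push branch filter for slash-free branch names."""
    return fnmatchcase(branch, pattern)


# Hypothetical branch names, not taken from the PR.
branches = [
    "run_amd_scheduled_ci_caller",
    "run_amd_scheduled_ci_caller_deepspeed",
    "run_amd_scheduled_ci_caller__deepspeed",
]

for branch in branches:
    print(f"{branch}: old={triggers(OLD_PATTERN, branch)} new={triggers(NEW_PATTERN, branch)}")
```

Under this approximation, a branch with a single-underscore suffix matches the old pattern but not the new one, which is consistent with the stated goal of keeping the other AMD scheduled tests from running while this PR is tested.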
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -356,6 +356,62 @@ jobs:
           name: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu
           path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu

+  run_tests_torch_deepspeed_gpu:
+    name: Torch ROCm deepspeed tests
+    strategy:
+      fail-fast: false
+      matrix:
+        machine_type: [single-gpu, multi-gpu]
+
+    runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
+    needs: setup
+    container:
+      image: huggingface/transformers-pytorch-deepspeed-amd-gpu
+      options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+    steps:
+      - name: Update clone
+        working-directory: /transformers
+        run: git fetch && git checkout ${{ github.sha }}
+
+      - name: Reinstall transformers in edit mode (remove the one installed during docker image build)
+        working-directory: /transformers
+        run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
+
Review comment: maybe add [...] to be the same as in the other workflow file.
Reply: I am not sure I understand why we need to uninstall and reinstall deepspeed here. What issue does it solve?
Reply: I don't remember exactly; it was a year or more ago. I can try to find it in the history if you would like to have the information.
Reply: In our case we don't need it at the moment, so does it work if we keep it that way? If you want me to uninstall / reinstall it in the tests, I can directly update and use [...]
+      - name: ROCM-SMI
+        run: |
+          rocm-smi
+      - name: ROCM-INFO
+        run: |
+          rocminfo | grep "Agent" -A 14
+      - name: Show ROCR environment
+        run: |
+          echo "ROCR: $ROCR_VISIBLE_DEVICES"
+
+      - name: Environment
+        working-directory: /transformers
+        run: |
+          python3 utils/print_env.py
+
+      - name: Show installed libraries and their versions
+        working-directory: /transformers
+        run: pip freeze
+
+      - name: Run all tests on GPU
+        working-directory: /transformers
+        run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_torch_deepspeed_gpu tests/deepspeed tests/extended
+
+      - name: Failure short reports
+        if: ${{ failure() }}
+        continue-on-error: true
+        run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu/failures_short.txt
+
+      - name: Test suite reports artifacts
+        if: ${{ always() }}
+        uses: actions/upload-artifact@v3
+        with:
+          name: ${{ matrix.machine_type }}_run_tests_torch_deepspeed_gpu_test_reports
+          path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu
+
   run_extract_warnings:
     name: Extract warnings in CI artifacts
     runs-on: ubuntu-22.04
@@ -368,7 +424,7 @@ jobs:
       run_tests_multi_gpu,
       run_examples_gpu,
       run_pipelines_torch_gpu,
-      # run_all_tests_torch_cuda_extensions_gpu
+      run_tests_torch_deepspeed_gpu
     ]
     steps:
       - name: Checkout transformers
@@ -417,7 +473,7 @@ jobs:
       run_tests_multi_gpu,
       run_examples_gpu,
       run_pipelines_torch_gpu,
-      # run_all_tests_torch_cuda_extensions_gpu,
+      run_tests_torch_deepspeed_gpu,
       run_extract_warnings
     ]
     steps:
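As a rough illustration of what the `run_tests_torch_deepspeed_gpu` matrix in this workflow expands to, the sketch below builds the runner labels, pytest command, report directory, and artifact name for each `machine_type`. The string templates are copied from the diff; the helper name `expand_job` and the `gpu_flavor` value `"mi210"` are assumptions made for the example, since the real value comes from `inputs.gpu_flavor` in the caller workflow.

```python
# Values of matrix.machine_type, as listed in the diff.
MACHINE_TYPES = ["single-gpu", "multi-gpu"]


def expand_job(machine_type: str, gpu_flavor: str) -> dict:
    """Hypothetical helper: substitute the matrix and input values into the job templates."""
    report = f"{machine_type}_tests_torch_deepspeed_gpu"
    return {
        "runs_on": ["self-hosted", "docker-gpu", "amd-gpu", machine_type, gpu_flavor],
        "pytest_cmd": (
            f"python3 -m pytest -v --make-reports={report} "
            "tests/deepspeed tests/extended"
        ),
        "report_dir": f"/transformers/reports/{report}",
        "artifact": f"{machine_type}_run_tests_torch_deepspeed_gpu_test_reports",
    }


# "mi210" is an assumed value for inputs.gpu_flavor, used only for illustration.
jobs = [expand_job(machine_type, gpu_flavor="mi210") for machine_type in MACHINE_TYPES]

for job in jobs:
    print(job["runs_on"], job["report_dir"], job["artifact"])
```

Because `fail-fast` is `false`, each of the two expanded jobs runs to completion independently, and both report directories are uploaded by the `always()` artifact step.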
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@
+FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
+LABEL maintainer="Hugging Face"
+
+ARG DEBIAN_FRONTEND=noninteractive
+ARG PYTORCH='2.0.1'
+ARG ROCM='5.7'
+
+RUN apt update && \
+    apt install -y --no-install-recommends libaio-dev git && \
+    apt clean && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip
+
+RUN python3 -m pip uninstall -y apex
+
+ARG REF=main
+WORKDIR /
+RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
+RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
Review comment: I would suggest to have [...] or whatever equivalent for deepspeed in ROCM if necessary.
Reply: Sure, we can pre-compile deepspeed for this set of ops. I was just wondering whether we can keep it in JIT mode so that all the machine-compatible ops can be dynamically built at runtime.
Reply: JIT mode will make some tests slower and potentially time out, right?
Reply: [...]
Reply: Thanks @echarlaix. The most important thing is to do this again at CI time, as mentioned in the comment below. (It may or may not be relevant now, but I never checked again. I keep both just to avoid potential issues popping up.)
+
+# When installing in editable mode, `transformers` is not recognized as a package.
+# this line must be added in order for python to be aware of transformers.
+RUN cd transformers && python3 setup.py develop
+
+RUN python3 -c "from deepspeed.launcher.runner import main"
Review comment: This will (always) use the torch [...]
Reply: Thanks @ydshieh, that makes sense. I just upgraded the torch version in f846b80.
Reply: In this case, try to build the image and run it to make sure we are still good. Regarding where to build the image, let's talk.
Reply: Sure, I can update [...]
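The final `RUN python3 -c "from deepspeed.launcher.runner import main"` line in the Dockerfile acts as a build-time smoke test: if the import raises, the image build fails. A minimal sketch of the same idea as a reusable check is shown below. The helper name `can_import` is hypothetical, and unlike the Dockerfile line it only imports the module rather than the `main` attribute.

```python
import importlib


def can_import(dotted_name: str) -> bool:
    """Return True if the module imports cleanly, False if the import raises."""
    try:
        importlib.import_module(dotted_name)
    except Exception:
        # Any exception counts as a failure, just as it would abort the
        # `python3 -c` command during the image build.
        return False
    return True


# The result depends on whether deepspeed is installed in the current environment.
print("deepspeed launcher importable:", can_import("deepspeed.launcher.runner"))
```

Running a check like this after rebuilding the image with a different torch version is one cheap way to confirm the deepspeed install is still usable before launching the full test suite.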
Review comment: Also disabled as
transformers/.github/workflows/build-docker-images.yml
Line 211 in 510270a
Reply: @ydshieh is there something we need to do here?
Reply: This is another issue that we can deal with outside this PR.
Reply: But here we have to build the image manually.
Reply: I just added one manually so that we can verify the deepspeed tests: echarlaix/amd-deepspeed-test