[AMD][ROCm][CI] unit tests fixes or skip #5323

hongxiayang · 2024-06-06T19:38:09Z

This pull request has the following changes:

Fix or skip the failing distributed and quantization unit tests.

hliuca

I have used Hongxia's branch and built on MI300X, and the changes in build scripts fixed the issue. Thanks.

hongxiayang · 2024-06-07T20:44:25Z

> I have used Hongxia's branch and built on MI300X, and the changes in build scripts fixed the issue. Thanks.

Thank you @hliuca for verifying the change on MI300X.

hliuca

I have verified the build and things look good.

hongxiayang · 2024-06-07T21:41:08Z

@WoosukKwon As discussed in our last meeting, this is a critical fix. Please review it. Thank you very much!

WoosukKwon

@hongxiayang Huge thanks for the PR! Didn't know that the current docker file has such a critical bug.

Left minor comments on the code style. Please take a look!

docs/source/getting_started/amd-installation.rst

Dockerfile.rocm

WoosukKwon

Thanks for addressing the comment!

Dockerfile.rocm

WoosukKwon · 2024-06-10T17:40:10Z

@hongxiayang Also, please resolve the merge conflicts so that I can merge :)

hongxiayang · 2024-06-10T19:01:02Z

> @hongxiayang Also, please resolve the merge conflicts so that I can merge :)

Resolved the merge conflicts. Thanks!

WoosukKwon

LGTM! Thanks for the PR!

WoosukKwon · 2024-06-11T00:32:04Z

@hongxiayang The AMD CI failed. Could you please take a look?

hongxiayang · 2024-06-12T16:41:30Z

> @hongxiayang The AMD CI failed. Could you please take a look?

@WoosukKwon Fixed several CI failures.

WoosukKwon · 2024-06-12T18:56:27Z

tests/quantization/test_compressed_tensors.py

-def test_compressed_tensors_no_enforce_eager(vllm_runner):
-    model_path = "nm-testing/tinyllama-oneshot-w8a8-static-v2"
-    with vllm_runner(model_path) as llm:
-        sampling_params = SamplingParams()
-        output = llm.generate("Hello world!", sampling_params=sampling_params)
-        assert output
-
-
-def test_compressed_tensors_w8a8_dynanmic_per_token(vllm_runner):


@hongxiayang Is it safe to delete these tests?

Oh, I only added a decorator to skip them. I need to look into why they were deleted.

WoosukKwon · 2024-06-12T23:13:56Z

@hongxiayang Thanks for updating the PR! I will merge it once it passes the AMD CI.

mawong-amd · 2024-06-13T11:11:43Z

tests/distributed/test_basic_distributed_correctness.py

The changes here don't seem to have fixed the AMD distributed test. If they don't help, we should remove them.

It worked in my local environment. Again, we should revisit the unit tests whenever we introduce a new feature.

mawong-amd · 2024-06-13T11:15:53Z

tests/quantization/test_compressed_tensors.py

The quantization test errors are unrelated to ROCm. We should not unilaterally skip tests that are also broken in upstream: when these are fixed we will lose tests for no reason.

(1) I got this error when I ran this test: "ValueError: compressed-tensors quantization is currently not supported in ROCm.".
ROCm does not support it right now.
(2) We should revisit the test once ROCm supports it. Eventually we will need to borrow PyTorch's practice of bookkeeping all skipped tests and reviewing parity periodically.
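
For illustration only, here is a minimal sketch of a skip decorator that also keeps the bookkeeping mentioned in (2). It is not the `@skipIfRocm` used in this PR. The names `skip_if_rocm`, `is_rocm`, `ROCM_SKIPS`, and the sample test are hypothetical, and it assumes that a ROCm build of PyTorch reports a non-None `torch.version.hip`. It uses only the standard library, and pytest generally reports a raised `unittest.SkipTest` as a skip.

```python
import functools
import importlib.util
import unittest

# Hypothetical registry: test name -> reason it is skipped on ROCm.
# Reviewing it periodically shows which skips can be lifted.
ROCM_SKIPS: dict[str, str] = {}


def is_rocm() -> bool:
    """Return True if the installed PyTorch is a ROCm (HIP) build."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return getattr(torch.version, "hip", None) is not None


def skip_if_rocm(reason, detector=is_rocm):
    """Record the skip reason and skip the decorated test on ROCm."""
    def decorator(fn):
        # Bookkeeping happens at decoration time, on every platform.
        ROCM_SKIPS[fn.__qualname__] = reason

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if detector():
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@skip_if_rocm("compressed-tensors quantization is not supported on ROCm")
def test_compressed_tensors_smoke():
    # Placeholder body; a real test would load and run a model here.
    assert True
```

Because the reason is recorded even when the test runs, `ROCM_SKIPS` can be dumped on any platform to see which skips are still in place.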

hongxiayang · 2024-06-13T13:52:17Z

It seems that @skipIfRocm is broken in this CI, even though it works fine when I run it locally, and the expecttest package is inside the container.

root:/vllm-workspace/tests/quantization# pip show expecttest
Name: expecttest
Version: 0.1.6
Summary: 
Home-page: https://github.com/ezyang/expecttest
Author: Edward Z. Yang
Author-email: ezyang@mit.edu
License: MIT
Location: /opt/conda/envs/py_3.10/lib/python3.10/site-packages
Requires: 
Required-by:

and the test run:

root:/vllm-workspace/tests/quantization# pytest test_compressed_tensors.py
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.4.0
rootdir: /vllm-workspace
configfile: pyproject.toml
plugins: shard-0.1.2, xdoctest-1.1.0, xdist-3.3.1, flakefinder-1.1.0, cpp-2.3.0, rerunfailures-14.0, hypothesis-5.35.1, asyncio-0.23.7, anyio-4.4.0
asyncio: mode=strict
collected 3 items
Running 3 items in this shard

test_compressed_tensors.py sss                                           [100%]

============================== 3 skipped in 1.93s ==============================

The CI complains that the expecttest module :

WoosukKwon · 2024-06-13T17:35:57Z

@hongxiayang It seems there are 3 failed tests in the AMD CI. Is this expected?

hongxiayang · 2024-06-13T17:56:54Z

> @hongxiayang It seems there are 3 failed tests in the AMD CI. Is this expected?

@WoosukKwon I am looking into it. It seems each rebase brings up something new.
For the engine test failure related to test_stop_strings.py:
I consulted several people and got slightly different answers.
I am not completely certain whether, in all cases, we should expect the same output for the same prompt as on non-AMD GPUs.

See below:

tests/distributed/test_chunked_prefill_distributed.py

WoosukKwon · 2024-06-16T21:35:59Z

Now the AMD CI finally ran, but the tests failed because of the new Numpy 2.0 release 😭
@hongxiayang I'm really sorry for this, but could you update the PR to fix the error?

hongxiayang · 2024-06-17T20:59:57Z

> Now the AMD CI finally ran, but the tests failed because of the new Numpy 2.0 release 😭 @hongxiayang I'm really sorry for this, but could you update the PR to fix the error?

@WoosukKwon Thanks. I rebased my branch, which picked up the change requiring numpy < 2.0. Now the CI is failing on some SSL errors, for example:

WoosukKwon · 2024-06-17T22:02:33Z

@hongxiayang Just restarted the tests with SSL errors and it seems they work now. However, the PR still has failures in the AMD CI tests. Could you please take a look?

hongxiayang · 2024-06-18T02:01:04Z

@WoosukKwon Our customers are still waiting for this fix. The longer we wait, the more unit tests I may have to fix, since more people keep adding unit tests. To make the build fix easier to merge, I will split this PR into two or more smaller PRs. I hope this is OK with you. The first split PR will contain only the cmake fix. That will at least enable customers to build their own Docker image correctly on MI300X with ROCm 6.1 or 6.1.2. cc @hliuca

WoosukKwon · 2024-06-18T02:03:01Z

@hongxiayang Sounds good. Please go ahead. Or, do you think it makes sense to merge the current PR?

WoosukKwon · 2024-06-18T02:11:25Z

I think there are two types of error: 1. A Ray related error and 2. a correctness error. I think the correctness error could be because the new ROCm release might include changes in its libraries like hipBLAS, which in turn causes small numerical differences.
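
As a hedged illustration of how a test could tolerate such small numerical differences, the sketch below compares per-token log-probabilities within a tolerance instead of requiring identical output. The helper name `logprobs_close`, the tolerance values, and the sample numbers are all hypothetical; this is not how the existing tests compare outputs.

```python
import math


def logprobs_close(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return True if two log-probability sequences match within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol)
        for ref, cand in zip(reference, candidate)
    )


# Made-up values standing in for outputs from two different GPU backends.
reference_logprobs = [-0.1, -2.3, -0.75]
candidate_logprobs = [-0.10001, -2.3001, -0.75002]
print(logprobs_close(reference_logprobs, candidate_logprobs))
```

A tolerance check like this only helps when the drift stays small. If a tiny difference flips which token is chosen, the generated text diverges, and the test would need a different comparison strategy.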

hongxiayang · 2024-06-18T15:24:23Z

> I think there are two types of error: 1. A Ray related error and 2. a correctness error. I think the correctness error could be because the new ROCm release might include changes in its libraries like hipBLAS, which in turn causes small numerical differences.

Thanks. For (1), I investigated test_utils.py in tests/distributed. It may need to wait for a newer ROCm/PyTorch release.
For (2), the test_stop_strings.py failure, the investigation is still underway and there is no concrete conclusion yet.

hongxiayang · 2024-06-18T15:26:02Z

> @hongxiayang Sounds good. Please go ahead. Or, do you think it makes sense to merge the current PR?

Thanks. The first split PR is #5641.
I might split the unit test fixes into a separate PR.

hongxiayang · 2024-06-19T14:32:35Z

Since the cmake fix (#5641) has been merged, I am repurposing this PR to include only the changes that skip or fix the failing unit tests.

WoosukKwon · 2024-06-19T17:33:02Z

@hongxiayang Sounds great! Could you please update the PR to resolve the merge conflicts?

hongxiayang marked this pull request as ready for review June 6, 2024 20:03

WoosukKwon added the rocm label Jun 7, 2024

hliuca approved these changes Jun 7, 2024

WoosukKwon self-assigned this Jun 8, 2024

WoosukKwon reviewed Jun 8, 2024

docs/source/getting_started/amd-installation.rst Outdated

Dockerfile.rocm Outdated

Dockerfile.rocm Outdated

WoosukKwon approved these changes Jun 10, 2024

Dockerfile.rocm

WoosukKwon mentioned this pull request Jun 10, 2024

Bump version to v0.5.0 #5384

Merged

WoosukKwon approved these changes Jun 10, 2024

mawong-amd mentioned this pull request Jun 11, 2024

[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes #5422

Merged

WoosukKwon reviewed Jun 12, 2024

mawong-amd reviewed Jun 13, 2024

WoosukKwon closed this Jun 13, 2024

WoosukKwon reopened this Jun 13, 2024

WoosukKwon reviewed Jun 14, 2024

tests/distributed/test_chunked_prefill_distributed.py Outdated

hongxiayang added 3 commits June 17, 2024 13:52

Fixed the cmake build bug which generate garbage on mi300x and upgrade docker to support ROCm 6.1

a9425c0

cp those .so files for CI

3a56158

refactor Dockerfile based on review feedback

e82e882

hongxiayang added 10 commits June 17, 2024 13:55

fix the distributed test

e379f25

fix CI failure on documentation and quantization failures

5c45688

yapf isort fix

23f4ce8

added back the removed line done by mistake

c77b43f

add empty lines

09ad642

test failure fix for distributed when enforce-eager for now

3ab440b

revert skipIfROCm to see whether it fails again with upstream update

e535bce

ruff fix

f36d045

nit remove bool which was suggested by ruff

ffcff6b

rebase and misc

0142d1f

hongxiayang force-pushed the cmake_fix_with_6.1upgrade branch from 6267596 to 0142d1f Compare June 17, 2024 16:35

hongxiayang and others added 2 commits June 17, 2024 12:41

Merge branch 'main' into cmake_fix_with_6.1upgrade

c821b6b

format

76ce55c

hongxiayang mentioned this pull request Jun 18, 2024

[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices #5641

Merged

hongxiayang changed the title ~~[Bugfix][CI/Build][Upgrade][AMD][ROCm]Fixed the cmake build bug which generate garbage on mi300x and rocm6.1 upgrade~~ [Upgrade][AMD][ROCm] rocm6.1 upgrade Jun 19, 2024

hongxiayang changed the title ~~[Upgrade][AMD][ROCm] rocm6.1 upgrade~~ [AMD][ROCm][CI] unit tests fixes or skip Jun 19, 2024

hongxiayang closed this Jun 30, 2024
