[Bug]: Distribute Tests PR test fails #5544

bong-furiosa · 2024-06-14T15:26:21Z

Your current environment

vLLM version 0.5.0.post1

🐛 Describe the bug

Hello!

I would like to know whether tests/distributed/test_utils.py (merged in #5473) might be causing failures in the Distribute Tests step on BuildKite.

When I checked #5422 and #5412, I found that both PRs failed in the Distribute Tests step. The failure log is as follows:

[2024-06-14T00:24:15Z] Running 1 items in this shard: tests/distributed/test_utils.py::test_cuda_device_count_stateless
[2024-06-14T00:24:15Z]
[2024-06-14T00:24:30Z] distributed/test_utils.py::test_cuda_device_count_stateless 2024-06-14 00:24:30,636	INFO worker.py:1753 -- Started a local Ray instance.
[2024-06-14T00:24:33Z] FAILED
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] =================================== FAILURES ===================================
[2024-06-14T00:24:33Z] _______________________ test_cuda_device_count_stateless _______________________
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z]     def test_cuda_device_count_stateless():
[2024-06-14T00:24:33Z]         """Test that cuda_device_count_stateless changes return value if
[2024-06-14T00:24:33Z]         CUDA_VISIBLE_DEVICES is changed."""
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z]         actor = _CUDADeviceCountStatelessTestActor.options(num_gpus=2).remote()
[2024-06-14T00:24:33Z] >       assert ray.get(actor.get_cuda_visible_devices.remote()) == "0,1"
[2024-06-14T00:24:33Z] E       AssertionError: assert '1,0' == '0,1'
[2024-06-14T00:24:33Z] E         
[2024-06-14T00:24:33Z] E         - 0,1
[2024-06-14T00:24:33Z] E         + 1,0
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] distributed/test_utils.py:26: AssertionError
[2024-06-14T00:24:33Z] =========================== short test summary info ============================
[2024-06-14T00:24:33Z] FAILED distributed/test_utils.py::test_cuda_device_count_stateless - AssertionError: assert '1,0' == '0,1'
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z]   - 0,1
[2024-06-14T00:24:33Z]   + 1,0
[2024-06-14T00:24:33Z] ============================== 1 failed in 17.55s ==============================
[2024-06-14T00:24:36Z] 🚨 Error: The command exited with status 1
[2024-06-14T00:24:36Z] user command error: The plugin docker command hook exited with status 1

Since I am not an expert in the Ray framework, I am not sure how significant the difference between "0,1" and "1,0" is.
The fact that this simple Ray-based test produced "1,0" suggests that, under certain conditions, the devices can be reported in that order. It might therefore be reasonable for the assert line to accept "1,0" as well.

Would there be any issues if the assert line is modified as shown below?

# assert ray.get(actor.get_cuda_visible_devices.remote()) == "0,1"
assert ray.get(actor.get_cuda_visible_devices.remote()) in ["0,1", "1,0"]
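An order-insensitive comparison is another way to express the same idea. The sketch below assumes that only the set of visible devices matters to this test and that their order does not; `same_devices` is a hypothetical helper written for illustration and is not part of vLLM or Ray.

```python
from typing import List


def same_devices(actual: str, expected: str) -> bool:
    """Compare two CUDA_VISIBLE_DEVICES-style strings, ignoring order.

    Hypothetical helper for illustration only. It assumes the device
    order is irrelevant to what the test is checking.
    """

    def parse(value: str) -> List[str]:
        # Split on commas, drop surrounding whitespace and empty entries,
        # then sort so that "1,0" and "0,1" normalize to the same list.
        return sorted(part.strip() for part in value.split(",") if part.strip())

    return parse(actual) == parse(expected)


# The value seen in the failing CI run matches the expected devices.
assert same_devices("1,0", "0,1")
# A different set of devices is still rejected.
assert not same_devices("0", "0,1")
```

In the test, the value returned by `ray.get(actor.get_cuda_visible_devices.remote())` could be passed as `actual` with `"0,1"` as `expected`.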

I am concerned that this issue might be preventing the Distribute Tests step from checking PRs correctly, so I wanted to ask about it.
If this is not the cause of the test failures, I would greatly appreciate it if you could check the Distribute Tests logs and provide some hints on what might be causing the errors. 🙇

bong-furiosa added the bug Something isn't working label Jun 14, 2024

youkaichao mentioned this issue Jun 14, 2024

[mis] fix flaky test of test_cuda_device_count_stateless #5546

Merged

simon-mo closed this as completed in #5546 Jun 14, 2024
