I would like to know whether the tests/distributed/test_utils.py file (merged in #5473) might be causing errors during the Distribute Tests process on Buildkite.
When I checked #5422 and #5412, I found that both PRs failed during the Distribute Tests process. The reason for the failure is as follows:
[2024-06-14T00:24:15Z] Running 1 items in this shard: tests/distributed/test_utils.py::test_cuda_device_count_stateless
[2024-06-14T00:24:15Z]
[2024-06-14T00:24:30Z] distributed/test_utils.py::test_cuda_device_count_stateless 2024-06-14 00:24:30,636 INFO worker.py:1753 -- Started a local Ray instance.
[2024-06-14T00:24:33Z] FAILED
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] =================================== FAILURES ===================================
[2024-06-14T00:24:33Z] _______________________ test_cuda_device_count_stateless _______________________
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] def test_cuda_device_count_stateless():
[2024-06-14T00:24:33Z] """Test that cuda_device_count_stateless changes return value if
[2024-06-14T00:24:33Z] CUDA_VISIBLE_DEVICES is changed."""
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] actor = _CUDADeviceCountStatelessTestActor.options(num_gpus=2).remote()
[2024-06-14T00:24:33Z] > assert ray.get(actor.get_cuda_visible_devices.remote()) == "0,1"
[2024-06-14T00:24:33Z] E AssertionError: assert '1,0' == '0,1'
[2024-06-14T00:24:33Z] E
[2024-06-14T00:24:33Z] E - 0,1
[2024-06-14T00:24:33Z] E + 1,0
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] distributed/test_utils.py:26: AssertionError
[2024-06-14T00:24:33Z] =========================== short test summary info ============================
[2024-06-14T00:24:33Z] FAILED distributed/test_utils.py::test_cuda_device_count_stateless - AssertionError: assert '1,0' == '0,1'
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] - 0,1
[2024-06-14T00:24:33Z] + 1,0
[2024-06-14T00:24:33Z] ============================== 1 failed in 17.55s ==============================
[2024-06-14T00:24:36Z] 🚨 Error: The command exited with status 1
[2024-06-14T00:24:36Z] user command error: The plugin docker command hook exited with status 1
Since I am not an expert in the Ray framework, I am not sure how critical the difference between "0,1" and "1,0" is. The fact that simple test code using Ray produced "1,0" suggests that the result can be "1,0" under certain conditions. It might therefore be reasonable to conclude that the assert line should allow "1,0".
Would there be any issues if the assert line is modified as shown below?
# assert ray.get(actor.get_cuda_visible_devices.remote()) == "0,1"
assert ray.get(actor.get_cuda_visible_devices.remote()) in ["0,1", "1,0"]
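An alternative I could imagine, as a minimal sketch only: compare the device IDs without regard to order instead of listing every accepted string. `same_device_set` is a hypothetical helper of my own, not part of vLLM or Ray, and the sketch assumes the order of the IDs does not matter for this test, which I have not confirmed.

```python
def same_device_set(actual, expected):
    """Return True if two CUDA_VISIBLE_DEVICES-style strings name the
    same device IDs, ignoring order and surrounding whitespace.

    Hypothetical helper for illustration; assumes device order is
    irrelevant to the test being written.
    """
    def parse(value):
        # Split on commas, drop empty pieces, and sort so order is ignored.
        return sorted(part.strip() for part in value.split(",") if part.strip())

    return parse(actual) == parse(expected)


# The value seen in the failing CI run compared with the expected value.
print(same_device_set("1,0", "0,1"))
# A genuinely different device set should still be rejected.
print(same_device_set("0,2", "0,1"))
```

In the test, the assert line would then read `assert same_device_set(ray.get(actor.get_cuda_visible_devices.remote()), "0,1")`, which would also scale to more than two GPUs without enumerating every permutation.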
I am concerned that this issue might be preventing the Distribute Tests from reporting correct results, and would like to ask about it.
If this is not the cause of the test failures, I would greatly appreciate it if you could check the Distribute Tests logs and provide some hints on what might be causing the errors. 🙇