Ray Train test_jax_trainer::test_minimal_multihost Flaky Test Fix #56548
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Code Review
This pull request aims to fix a flaky test, test_jax_trainer.py::test_minimal_multihost, by moving the worker_runtime_env from the RunConfig of the JaxTrainer to the runtime_env of ray.init in the test fixtures. This change is sound, but I've found a critical issue in the implementation for the multi-host test fixture where runtime_env is incorrectly defined as a tuple instead of a dictionary. I've also included a suggestion to reduce code duplication for better maintainability.
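The tuple-versus-dictionary issue mentioned in the review can be illustrated in isolation. The PR's diff is not shown in this thread, so the snippet below is only a sketch of one common way such a bug arises in Python (a stray trailing comma); the variable names and the `require_dict` guard are hypothetical and are not part of Ray.

```python
# Hypothetical fixture values; the actual diff from the PR is not shown here.
runtime_env_tuple = {"pip": ["jax"]},  # trailing comma wraps the dict in a 1-tuple
runtime_env_dict = {"pip": ["jax"]}    # intended form: a plain dictionary


def require_dict(runtime_env):
    """Illustrative guard (not a Ray API): reject anything that is not a dict."""
    if not isinstance(runtime_env, dict):
        raise TypeError(
            f"runtime_env must be a dict, got {type(runtime_env).__name__}"
        )
    return runtime_env


print(type(runtime_env_tuple).__name__)
print(type(runtime_env_dict).__name__)
```

A guard like this in a test fixture would surface the mistake at fixture setup time rather than deep inside cluster initialization.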
…y-project#56548)

A fix that addresses the failing flaky test `test_jax_trainer.py::test_minimal_multihost`.
https://buildkite.com/ray-project/postmerge/builds/12941#01993f89-cc62-4e31-8de2-8b18f81ac177

Issue: The `test_minimal_multihost` test introduces a race condition by attempting to initialize a virtualenv directory twice at the same directory path during worker runtime environment setup. This test would not fail in a true multi-host environment, but the tests simulate a multi-host environment on a single machine. This might be a Ray Core issue resulting in `runtime_env` errors, but this PR will at least unblock the test so it is no longer flaky.

Fix: Move `worker_runtime_env` to the job level so that the `pip install jax` only happens once.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: zac <zac@anyscale.com>
Revisiting #56548, as the test continues to be flaky on CI.

**Solution**: The previous attempt to deflake this test still ran `pip install jax` via the `runtime_env` argument of `ray.init`, so the pip-install-related error persisted. This PR instead adds `jax` and `jaxlib` as dependencies of the CI train tests, avoiding the need to `pip install jax` via the `runtime_env`.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
A fix that addresses the failing flaky test `test_jax_trainer.py::test_minimal_multihost`.
https://buildkite.com/ray-project/postmerge/builds/12941#01993f89-cc62-4e31-8de2-8b18f81ac177

Issue:
The `test_minimal_multihost` test introduces a race condition by attempting to initialize a virtualenv directory twice at the same directory path during worker runtime environment setup. This test would not fail in a true multi-host environment, but the tests simulate a multi-host environment on a single machine. This might be a Ray Core issue resulting in `runtime_env` errors, but this PR will at least unblock the test so it is no longer flaky.

Fix:
Move `worker_runtime_env` to the job level so that the `pip install jax` only happens once.
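
The difference between per-worker and job-level setup can be sketched with a toy model. This is not Ray's runtime environment code: `create_env_dir`, `per_worker_setup`, and `job_level_setup` are hypothetical names, and the directory creation merely stands in for the virtualenv initialization described above. It assumes that the second attempt to create an existing directory fails, which is what `os.mkdir` does.

```python
import os
import tempfile
import threading


def create_env_dir(path, errors):
    """Hypothetical stand-in for environment setup (not Ray code)."""
    try:
        os.mkdir(path)  # raises if another caller already created the directory
    except FileExistsError as exc:
        errors.append(exc)


def per_worker_setup(num_workers=2):
    """Every simulated worker tries to create the same directory."""
    errors = []
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "virtualenv")
        threads = [
            threading.Thread(target=create_env_dir, args=(path, errors))
            for _ in range(num_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return len(errors)


def job_level_setup(num_workers=2):
    """The directory is created once; workers only read it afterwards."""
    errors = []
    seen = []
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "virtualenv")
        create_env_dir(path, errors)  # runs once, before any worker starts
        threads = [
            threading.Thread(target=lambda: seen.append(os.path.isdir(path)))
            for _ in range(num_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return len(errors)


print("per-worker setup errors:", per_worker_setup())
print("job-level setup errors:", job_level_setup())
```

In this toy model the collision is deterministic because `os.mkdir` is atomic; a real setup that checks for the directory and then creates it in separate steps would fail only intermittently, which is consistent with a test being flaky rather than always failing.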