@JasonLi1909 JasonLi1909 commented Sep 15, 2025

Fixes the flaky test test_jax_trainer.py::test_minimal_multihost, which failed in the following postmerge build:
https://buildkite.com/ray-project/postmerge/builds/12941#01993f89-cc62-4e31-8de2-8b18f81ac177

Issue:
test_minimal_multihost triggers a race condition: during worker runtime environment setup, a virtualenv directory is initialized twice at the same directory path. This would not fail in a true multi-host environment, but the test simulates a multi-host environment on a single machine. The root cause might be a Ray Core issue in runtime_env handling; this PR at least unblocks the test so that it is no longer flaky.
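The collision described above can be illustrated with a standard-library simulation. This is only a sketch under assumptions: it does not use Ray's runtime_env code, and the function names `create_env_dir` and `simulate_two_workers` are hypothetical. Two threads stand in for two workers on one machine that both try to create the same directory.

```python
import os
import tempfile
import threading


def create_env_dir(path, barrier, results, index):
    """Simulate one worker creating its virtualenv directory (hypothetical)."""
    barrier.wait()  # release both simulated workers at the same moment
    try:
        os.mkdir(path)  # only one caller can create a given directory
        results[index] = "created"
    except FileExistsError:
        results[index] = "collision"


def simulate_two_workers():
    """Run two simulated workers against the same target path."""
    results = [None, None]
    barrier = threading.Barrier(2)
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "virtualenv")
        threads = [
            threading.Thread(
                target=create_env_dir, args=(target, barrier, results, i)
            )
            for i in range(2)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results
```

Whichever thread runs first creates the directory and the other hits `FileExistsError`. On separate hosts each worker would have its own filesystem path, which is consistent with the observation that the failure only appears when multiple hosts are simulated on one machine.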

Fix:
Move worker_runtime_env to the job level so that pip install jax happens only once.
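A minimal sketch of the job-level approach, assuming the test fixture starts Ray through `ray.init`. The helper names `build_job_runtime_env` and `init_ray_for_test` are hypothetical and are not taken from the PR diff; `ray.init(runtime_env=...)` with a `pip` list is the existing Ray API being relied on.

```python
def build_job_runtime_env(pip_packages):
    """Build a runtime_env dict with a pip section (hypothetical helper)."""
    return {"pip": list(pip_packages)}


def init_ray_for_test(pip_packages=("jax",)):
    """Start Ray with the runtime_env attached to the whole job (hypothetical)."""
    # Imported lazily so build_job_runtime_env can be used without Ray installed.
    import ray

    return ray.init(runtime_env=build_job_runtime_env(pip_packages))
```

With the environment declared once at `ray.init`, the trainer's run configuration would no longer need its own worker_runtime_env, which is the intent stated in the fix.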

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner September 15, 2025 21:13
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a flaky test, test_jax_trainer.py::test_minimal_multihost, by moving the worker_runtime_env from the RunConfig of the JaxTrainer to the runtime_env of ray.init in the test fixtures. This change is sound, but I've found a critical issue in the implementation for the multi-host test fixture where runtime_env is incorrectly defined as a tuple instead of a dictionary. I've also included a suggestion to reduce code duplication for better maintainability.
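The review does not show the offending line, so the following is only an illustration of one common way a dictionary ends up as a tuple in Python: a stray trailing comma after the closing brace. The variable names and the `is_valid_runtime_env` check are hypothetical.

```python
# A trailing comma after the closing brace makes this a 1-tuple, not a dict.
runtime_env_wrong = {"pip": ["jax"]},

# Without the trailing comma the value is the intended dictionary.
runtime_env_right = {"pip": ["jax"]}


def is_valid_runtime_env(value):
    """Hypothetical guard: the plain form of a runtime_env must be a dict."""
    return isinstance(value, dict)
```

A guard like this in a fixture would surface the mistake at setup time rather than deep inside environment setup.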

@ray-gardener ray-gardener bot added the train (Ray Train Related Issue) label Sep 16, 2025
JasonLi1909 and others added 4 commits September 16, 2025 10:57
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 added the go (add ONLY when ready to merge, run all tests) label and removed the train (Ray Train Related Issue) label Sep 22, 2025
@matthewdeng matthewdeng enabled auto-merge (squash) September 22, 2025 21:59
@matthewdeng matthewdeng merged commit 0251d82 into ray-project:master Sep 22, 2025
8 checks passed
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…y-project#56548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: zac <zac@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Sep 24, 2025
…6548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
…y-project#56548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
…6548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
matthewdeng pushed a commit that referenced this pull request Oct 3, 2025
Revisiting #56548 because the test
continues to be flaky on CI.

**Solution**: The previous attempt to deflake this test still ran
`pip install jax` through the `runtime_env` argument of `ray.init`, so the
pip-install-related error persisted. This PR instead adds `jax` and `jaxlib`
as dependencies of the CI train tests, which avoids the need to
`pip install jax` through the runtime_env.

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
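Once `jax` and `jaxlib` are expected to be preinstalled in the CI image, a test module can check that they are importable instead of installing them at runtime. The sketch below is an assumption about how such a check could look, not code from the follow-up commit; `is_installed` and `missing_modules` are hypothetical helper names built on the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(module_names):
    """Return the subset of module names that cannot be found."""
    return [name for name in module_names if not is_installed(name)]
```

A fixture could call `missing_modules(["jax", "jaxlib"])` and fail fast with a clear message when the list is non-empty.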
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…y-project#56548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…y-project#56548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…y-project#56548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…y-project#56548)

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Revisiting ray-project#56548 as test
continues to be flaky on CI

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
