[dask] flaky unit test test_find_random_open_port #4458
Comments
@StrikerRUS will you look at the two options I provided in the description and let me know if you have a preference? I have a preference for option 1,
because I think that will make such failures even rarer but still be a defense against problems where |
I'm OK with any of the options you've provided above.
I recently thought about maybe defining a function that finds |
hey @jmoralez , sorry it took so long to get back to you. If you want to attempt that, I'd support it! I hadn't taken that on yet because it seemed fairly complex in exchange for eliminating a somewhat-rare collision.
LightGBM/python-package/lightgbm/dask.py Line 327 in fdc582e
Then, the code would have to run I might still do the option |
Haha, that's ok. I'll try this and let you know if I get it working. I agree on changing that test to allow some conflicts in the meantime.
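For illustration, "allowing some conflicts" could be sketched roughly as below. This is not the actual test code. The helper name count_conflicting_rounds and its run_once parameter are hypothetical, and the sketch assumes that each round returns a mapping from worker address to port, which is the shape client.run() returns in Dask.

```python
from typing import Callable, Dict


def count_conflicting_rounds(
    run_once: Callable[[], Dict[str, int]],
    n_rounds: int = 5,
) -> int:
    """Count the rounds in which at least two workers got the same port.

    run_once is a hypothetical callable that returns {worker_address: port},
    e.g. a wrapper around client.run(lgb.dask._find_random_open_port).
    """
    conflicts = 0
    for _ in range(n_rounds):
        worker_to_port = run_once()
        ports = list(worker_to_port.values())
        # a round conflicts if any port value appears more than once
        if len(set(ports)) < len(ports):
            conflicts += 1
    return conflicts
```

In a test with a live client, run_once could be something like lambda: client.run(lgb.dask._find_random_open_port), followed by assert count_conflicting_rounds(run_once) <= 1. The threshold of 1 conflict in 5 rounds comes from the option described in the issue; whether it is tight enough is a judgment call.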
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Description
The unit test
tests/python_package_test/test_dask.py::test_find_random_open_port
can occasionally fail at random. Specifically, lgb.dask._find_random_open_port()
is called once on each of the two workers in a Dask cluster, and if it finds the same port for both workers, the test fails.
I think what we're seeing is the problem mentioned in #4133 (comment).
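To make the failure mode concrete, here is a minimal sketch of one common way to find a "random open port". It assumes the approach of binding to port 0 and letting the operating system choose. The function names are hypothetical stand-ins, not the code in lightgbm.dask. Because the socket is closed before the function returns, nothing reserves the port, so two independent calls on the same machine can be handed the same number.

```python
import socket
from typing import List


def find_random_open_port() -> int:
    """Hypothetical stand-in: ask the OS for any free port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 means "let the OS pick a free port"
        port = s.getsockname()[1]
    # the socket is closed here, so the port is free again for other callers
    return port


def count_conflicts(ports: List[int]) -> int:
    """Number of entries that duplicate an earlier port in the list."""
    return len(ports) - len(set(ports))
```

Calling find_random_open_port() twice and passing both results to count_conflicts() usually gives 0, but nothing guarantees that, which matches the rare failures described here.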
I think that when we dealt with this problem before in #4133, it was patched within
lightgbm.dask
directly.
LightGBM/python-package/lightgbm/dask.py
Lines 345 to 367 in 0d1d12f
But I didn't consider the fact that this particular unit test might suffer from the same problem.
In other words, this unit test is saying such collisions should never happen, but in fact they DO happen and
lgb.dask
was changed to explicitly handle that case.
I think such failures should be very rare, but that we should still try to eliminate them or make them even rarer. I see two options:
1. Change the test to allow some conflicts (e.g. run client.run(lgb.dask._find_random_open_port) 5 times and check that no more than 1 conflict was found).
2. Remove test_find_random_open_port and trust other Dask unit tests (like test_network_params_not_required_but_respected_if_given and test_possibly_fix_worker_map) to do the work.
Reproducible example
This is hard to reproduce because it is random. Most recently, the unit tests on #4454 for one job failed with the following error.
Environment info
LightGBM version or commit hash: latest master (0d1d12f)
Command(s) you used to install LightGBM
This test most recently failed on the
CUDA Version / cuda 11.2.2 source (linux, gcc, Python 3.7)
CI job in #4454 (https://github.com/microsoft/LightGBM/pull/4454/checks?check_run_id=3012592901), but I think it could happen randomly in any Python job where that test is run.
Additional Comments
Opened from #4454 (comment).