[dask] flaky test_ranker test #3817
Just wondering: is there any particular reason for such a strict threshold in this test?
OY. Ok, this should be decreased; we've set … This might be kind of outrageous, but I'm wondering if maybe we should support a constructor argument … This would ensure split points are the same in each worker and would stabilize testing.
I'm submitting a PR to lower the threshold.
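For context, a minimal sketch of the kind of check being discussed, assuming the test asserts a Spearman correlation between predicted ranks and true relevance (the function and variable names below are made up for the sketch, not the actual test code):

```python
from scipy.stats import spearmanr

def check_rank_quality(dask_ranker, dX, dy, threshold=0.8):
    # predict on the distributed data and pull results back to the client
    rnkvec = dask_ranker.predict(dX).compute()
    y_true = dy.compute()

    # Spearman correlation measures how well the predicted ordering
    # matches the true relevance ordering
    corr = spearmanr(rnkvec, y_true).correlation
    assert corr > threshold, f"rank correlation {corr:.3f} is below {threshold}"
```

Lowering `threshold` trades test strictness for tolerance of the run-to-run variance the thread is describing.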
Can we somehow remove randomness from Dask data partitioning?
Line 226 in dask.py might do the trick. There are probably other solutions too. I'll try this out over the weekend. In the meantime, a PR is out to lower the threshold.
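The snippet from line 226 of dask.py isn't quoted above. As a hedged sketch, one way to take randomness out of where partitions land is to compute them into futures, wait for them to be placed, and then rebalance before training (`place_parts_evenly` is a hypothetical helper, not library code):

```python
from distributed import Client, wait

def place_parts_evenly(client: Client, parts):
    # turn each delayed partition into a future running on some worker
    futures = client.compute(parts)
    # block until every partition has actually landed on a worker
    wait(futures)
    # then ask the scheduler to spread the stored results evenly
    client.rebalance(futures)
    return futures
```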
Oh, thanks a lot for the info! Glad that there are possible workarounds.
One of the benefits of data-parallel learning in LightGBM is that you don't have to send all data to all workers, so you can train on a dataset that's larger than what any one machine involved can hold by itself. So I'm not comfortable with only testing the scenario where all workers hold all of the training data.

I can see that this failure is from the test where the chunk sizes are not all the same (`group = [5, 5, 5, 10, 10, 10, ...]`). That's the thing that is opening up the possibility of nondeterministic behavior. For other tests we have equal chunk sizes, so Dask usually ends up with a roughly even split.

I think we might be able to solve this for testing with two changes:
I think those two changes could make this specific test more reliable, and that those changes (which do introduce some time cost and could make the tests different from how users would normally work with the Dask module) are worth it. What do you think @ffineis?
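As a rough illustration of how uneven group sizes translate into uneven Dask chunks in a ranking test (the shapes and chunk boundaries below are made up for the sketch):

```python
import numpy as np
import dask.array as da

# rows per query group: three small groups and three large ones
group = np.array([5, 5, 5, 10, 10, 10])
X = np.random.random(size=(group.sum(), 4))

# chunk the rows so each chunk holds a whole number of groups; the unequal
# chunk sizes (15 vs. 30 rows) are what let workers end up with unequal shares
dX = da.from_array(X, chunks=((15, 30), (4,)))
dg = da.from_array(group, chunks=((3, 3),))
```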
Yeah, for sure. I hadn't even heard of client.rebalance before.
I want to note that I just did 300 test runs while investigating another issue (#3829 (comment)); this ranker test did not fail in any of them.
Nice! That's good. I do have updates on this that I've been meaning to push; will get to them tonight, tomorrow night at the latest. I ran some experiments using client.rebalance.
Hey, sorry for the delay re: client.rebalance-ing the test_ranker data. I ran some experiments: 100 tests with rebalancing and 100 without, and on each test I saved each worker's share of the data.
Here are the worker ratio distributions for the fixed group-size tests and the pre-specified group-size tests, rebalanced and non-rebalanced:

So the rebalancing really does help out in the case of the … The same is not true for the array input, in which case …

Here are the resulting distributions of Spearman rank correlation coefficients:

As we can see, the poor redistribution of data on the DataFrame input causes a smaller spearmanr correlation, but redistribution helps shift the correlation to the right for …

Anyway! This is all to say that we should leave the data distribution alone in the case of …, and I'd like to submit a PR adding the following to …
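A sketch of how the per-worker share could be measured with dask.distributed (this is not the actual experiment code; `worker_shares` is a hypothetical helper that counts stored keys as a rough proxy for each worker's portion of the rows):

```python
from distributed import Client, wait

def worker_shares(client: Client, futures):
    # make sure every partition has been scheduled somewhere first
    wait(futures)
    # has_what() maps each worker address to the keys it currently stores
    holdings = client.has_what()
    counts = {worker: len(keys) for worker, keys in holdings.items()}
    total = sum(counts.values()) or 1
    # fraction of stored keys per worker
    return {worker: n / total for worker, n in counts.items()}
```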
Thanks for the thorough research and explanation! It makes sense to me that the problem isn't as severe in the DataFrame case, because we put group information into the index and Dask can use that to cheaply form a plan to distribute the data. I'm ok with your proposal to use client.rebalance.

I think we should expect that these differences in the produced model wouldn't be as severe with real-world data sizes, and that the outputs just have higher variance because we're trying to use very small amounts of data and very few training iterations to keep the tests fast. So I'm not worried that this is something LightGBM's users will necessarily have to think a lot about.
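A rough sketch of the group-in-the-index idea described above (the column name, partition count, and helper are hypothetical, not the test's actual data-construction code):

```python
import pandas as pd
import dask.dataframe as dd

def to_group_indexed_ddf(df: pd.DataFrame, group_col: str = "g", npartitions: int = 3):
    # sorting by the group id and using it as the index gives Dask known
    # divisions, so planning how to split the frame is cheap and group-aligned
    indexed = df.sort_values(group_col).set_index(group_col)
    return dd.from_pandas(indexed, npartitions=npartitions)
```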
Fixed by #3892; will re-open if I see that error again.
One more case with almost the same value but for another job and Python version: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=9759&view=logs&j=02a2c3ba-81f8-54e3-0767-5d5adbb0daa9&t=720ee3fa-96d4-5b47-dbf4-01607b74ade2.
Alright, I'll look into this and try to patch it. Curious myself what seems to have brought this back to life after 2 months of peace! I might also consider changing the ranking-quality measure from Spearman correlation to NDCG@k; perhaps that statistic is less variable.
@jameslamb ah ok, awesome, thanks for the heads up. Yeah, I was going to suggest that this is probably related to the new testing parameters. It makes sense that a regularization technique could start causing performance drops at our small test-set sizes, when these tests are basically just checking whether models can overfit to the training data. I still think moving to NDCG as a rank-quality measure might be more appropriate, so I'll test it out locally and see if its variance is in fact lower than Spearman correlation's.
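For reference, a minimal sketch of the proposed alternative metric, NDCG@k, computed per query group with scikit-learn (the helper and its grouping logic are illustrative, not the test code):

```python
import numpy as np
from sklearn.metrics import ndcg_score

def mean_ndcg_at_k(y_true, y_score, group_sizes, k=5):
    scores, start = [], 0
    for size in group_sizes:
        sl = slice(start, start + size)
        # ndcg_score expects 2D inputs of shape (n_queries, n_docs),
        # so each query group is wrapped in an outer list
        scores.append(ndcg_score([y_true[sl]], [y_score[sl]], k=k))
        start += size
    return float(np.mean(scores))
```

Unlike Spearman correlation over the whole prediction vector, this averages a per-group statistic, which is one reason it might be less sensitive to how rows are split across workers.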
Just noting, since we're keeping track of it, that I just saw this again on #4188.
No reports since April 17. I guess we can close this. WDYT, guys? |
Agree! I haven't seen this since then. Let's close it.
This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.
Just seen in #3755.
LightGBM/tests/python_package_test/test_dask.py, line 480 (commit 477cbf3)
Full logs: