[tests][dask] Create an informative categorical feature #4113
Conversation
… n_features for regression
overall, looks great! I totally support this change, and it's ok to not touch the ranker tests for now. I'm glad it allowed for some simplifications, and will enable us to go deeper on testing the statistical behavior of the Dask module (#3896 (comment)).
I looked at the logs to see if making the models more complex had a big impact on total runtime of the tests. It doesn't seem like it!
See the results below for the linux_regular job. A 20 second difference is not problematic given the run time of this project's CI jobs. Just comparing one job isn't a precise estimate, but it's enough to give me confidence that this change won't have a huge impact on the test runtime.
latest build on master (link)
===== 388 passed, 7 skipped, 2 xfailed, 357 warnings in 696.97s (0:11:36) ======
this PR (link)
===== 388 passed, 7 skipped, 2 xfailed, 357 warnings in 720.08s (0:12:00) ======
assert_eq(s1, s2, atol=.01)
assert_eq(s1, s1_local, atol=.003)
assert_eq(s1, s2, atol=.01)
assert_eq(s1, s1_local, atol=0.003)
when you tested, did you try running these tests multiple times in a row? I'd like to know how close these tolerances are to the maximum difference that you saw. It's good to push these tolerances down to catch big differences between distributed training and non-distributed training, but not so close to the edge that the tests might fail sometimes due to randomness in how the chunks are distributed across the workers.
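A minimal sketch of that "run it several times and look at the worst case" idea, with random values standing in for the distributed (s1) and local (s2) scores the real test computes:
import numpy as np

# Sketch only: the noisy values below stand in for the scores produced by
# distributed and non-distributed training; the loop records the worst-case
# difference, which is what a candidate atol needs headroom above.
rng = np.random.default_rng(0)
max_abs_diff = 0.0
for _ in range(20):
    s1 = 0.95 + rng.normal(scale=0.002)  # stand-in: distributed score
    s2 = 0.95 + rng.normal(scale=0.002)  # stand-in: local score
    max_abs_diff = max(max_abs_diff, abs(s1 - s2))

print(max_abs_diff)  # pick atol comfortably above this maximum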
For the test above, the maximum absolute difference I observed was 0.0025, so 0.01 should be fine. For this one, I believe we can even remove the atol, since this checks the score for the same model, doesn't it?
great, thank you. Yes, I think you're right. s1 and s1_local should be identical since they're from the same model.
I played around with the checks for the predictions and the target and found that for atol=50 we could go to rtol=0.5 (the maximum observed was 0.25). But maybe we could instead check p1 vs p2? They're very similar, so maybe that'd be a better check and could go very low in the tolerances as well. Also, I switched the order in the comparison for this one because assert_eq(a, b) checks abs(a - b) <= atol + rtol * abs(b), so I think it makes more sense that b = y in this case. What do you think?
ok I'm a little confused by your notation, sorry. Are you referring to this change specifically?
To be honest, I don't understand what you mean. Can you give a small reproducible example that shows assert_eq() passing or failing based only on the order that you pass things in?
To me, it looks like the checks in that function are symmetric: https://github.com/dask/dask/blob/913a192fd3b35696883bdd4c4eeba2418b946b3a/dask/dataframe/utils.py#L841-L847.
Haha, I'm sorry. Yes, I meant those checks exactly. The asymmetric part is in https://github.com/dask/dask/blob/913a192fd3b35696883bdd4c4eeba2418b946b3a/dask/dataframe/utils.py#L847, which calls np.allclose (and, through it, np.isclose). Here's an example:
import numpy as np
a = 100
b = 120
np.allclose(a, b, rtol=0.1, atol=9)  # |100 - 120| = 20 <= 9 + 0.1 * |120| = 21 -> True
np.allclose(b, a, rtol=0.1, atol=9)  # |120 - 100| = 20 >  9 + 0.1 * |100| = 19 -> False
Here's with assert_eq:
import numpy as np
from dask.array.utils import assert_eq
a = np.array([100])
b = np.array([120])
assert_eq(a, b, rtol=0.1, atol=9) # passes
assert_eq(b, a, rtol=0.1, atol=9) # fails
🤯 I had no idea that these were not symmetric! Thanks for the example. I see now that there's a note about this in the np.isclose() docs:
Unlike the built-in math.isclose, the above equation is not symmetric in a and b -- it assumes b is the reference value -- so that isclose(a, b) might be different from isclose(b, a).
Ok, since this is checking that we get a "good enough" model result, I agree with the change to treat y (the actual value of the target) as the "reference value". Thanks for the attention to detail.
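To make that docs note concrete, here is a small sketch reusing the numbers from the example above: math.isclose gives the same answer in either argument order, while np.isclose treats its second argument as the reference value.
import math
import numpy as np

a, b = 100.0, 120.0

# math.isclose compares |a - b| against max(rel_tol * max(|a|, |b|), abs_tol),
# so swapping the arguments cannot change the result
math.isclose(a, b, rel_tol=0.1, abs_tol=9)  # False: 20 > max(0.1 * 120, 9) = 12
math.isclose(b, a, rel_tol=0.1, abs_tol=9)  # False: same check, same result

# np.isclose uses atol + rtol * |b|, i.e. the second argument is the reference
np.isclose(a, b, rtol=0.1, atol=9)          # True:  20 <= 9 + 0.1 * 120 = 21
np.isclose(b, a, rtol=0.1, atol=9)          # False: 20 >  9 + 0.1 * 100 = 19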
also now I know @jmoralez and @jose-moralez are the same person 😛
haha, yes. This is the account I use at work. I usually contribute in my free time using my personal account, but feel free to ping me on this one during work hours.
if not output.startswith('dataframe'):
    assert_eq(s1, s2, atol=.01)
    assert_eq(s1, s1_local, atol=.003)
    assert_eq(s1, s2, atol=.01)
Suggested change:
assert_eq(s1, s2, atol=.01)
assert_eq(s1, s2, atol=0.01)
There's a failing test in linux regular. I've seen this happen a couple of times now: the environment finds conflicts and then gives up.
Looks great to me, thanks very much for the suggestions.
yeah, I've seen this a few times too :/ We try not to use pinned dependencies for Python packages in this project's CI, to be sure the project is tested against new releases. Sometimes this can result in failures like that where …
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This modifies the dataframe-with-categorical output type to create only two categorical features and makes one of them informative for the task.
In order to have more consistent results I've modified the _create_data function to generate 1,000 samples instead of 100; the smaller size caused problems with categoricals and reduced the accuracy of distributed training. Also, for regression 100 features were being generated; I changed this to generate only 4, with 2 informative.
Adding more samples allows the tolerances in the array checks to be lower, but I had to increase the complexity of the models for test_regressor and test_classifier a bit.
I didn't change anything in _create_ranking_data because I haven't seen problems with those tests, but I could take a look as well.
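As a rough illustration only (this is not the actual _create_data helper from the LightGBM tests), a dataset along the lines described above could be built with scikit-learn's make_regression plus one binned informative categorical and one noise categorical:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

def make_example_data(n_samples=1_000, seed=42):
    # Hypothetical helper, mirroring the numbers described above:
    # 1,000 samples, 4 numeric features (2 informative), 2 categoricals.
    rng = np.random.default_rng(seed)
    # shuffle=False keeps the informative features in the first columns
    X, y = make_regression(
        n_samples=n_samples,
        n_features=4,
        n_informative=2,
        shuffle=False,
        random_state=seed,
    )
    df = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])
    # Informative categorical: quartile bins of a feature that drives y.
    df["cat_informative"] = pd.qcut(X[:, 0], q=4).codes
    # Uninformative categorical: pure noise.
    df["cat_noise"] = rng.integers(0, 4, size=n_samples)
    cat_cols = ["cat_informative", "cat_noise"]
    df[cat_cols] = df[cat_cols].astype("category")
    return df, y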