
[tests][dask] Create an informative categorical feature #4113

Merged 2 commits from tests-categoricals into microsoft:master on Mar 26, 2021

Conversation

jmoralez (Collaborator)

This modifies the dataframe-with-categorical output type to create only two categorical features and makes one of them informative for the task.

In order to have more consistent results, I've modified the _create_data function to generate 1,000 samples instead of 100; the smaller sample size caused problems with categoricals and reduced the accuracy of distributed training. Also, 100 features were being generated for regression; I changed this to generate only 4, with 2 of them informative.

Adding more samples allows the tolerances in the array checks to be lower, but I had to increase the complexity of the models a bit for test_regressor and test_classifier.

I didn't change anything in _create_ranking_data because I haven't seen problems with those tests, but I could take a look at that as well.
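For illustration, here is a rough sketch of the kind of data setup described above. Only a regression-like case is shown; the column names cat_noise and cat_signal, the binning used to make one categorical informative, and the exact parameters are assumptions for this sketch, not the actual code added in this PR.

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

def _create_data(n_samples=1_000, output='dataframe-with-categorical'):
    # 4 numeric features, 2 of them informative, as described above
    X, y = make_regression(n_samples=n_samples, n_features=4,
                           n_informative=2, random_state=42)
    if output == 'dataframe-with-categorical':
        df = pd.DataFrame(X, columns=[f'x{i}' for i in range(X.shape[1])])
        rng = np.random.default_rng(42)
        # one categorical feature that is pure noise (independent of the target)
        df['cat_noise'] = pd.Categorical(rng.integers(0, 5, size=n_samples))
        # one informative categorical feature, built by binning the target
        df['cat_signal'] = pd.Categorical(pd.qcut(y, q=5, labels=False))
        return df, y
    return X, y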

@jameslamb (Collaborator) left a comment


overall, looks great! I totally support this change, and it's ok to not touch the ranker tests for now. I'm glad it allowed for some simplifications, and will enable us to go deeper on testing the statistical behavior of the Dask module (#3896 (comment)).

I looked at the logs to see if making the models more complex had a big impact on total runtime of the tests. It doesn't seem like it!

See the results below for the linux_regular job. A 20-second difference is not problematic given the run time of this project's CI jobs. Comparing just one job isn't a precise estimate, but it's enough to give me confidence that this change won't have a huge impact on the test runtime.

latest build on master (link)

===== 388 passed, 7 skipped, 2 xfailed, 357 warnings in 696.97s (0:11:36) ======

this PR (link)

===== 388 passed, 7 skipped, 2 xfailed, 357 warnings in 720.08s (0:12:00) ======

assert_eq(s1, s2, atol=.01)
assert_eq(s1, s1_local, atol=.003)
assert_eq(s1, s2, atol=.01)
assert_eq(s1, s1_local, atol=0.003)
jameslamb (Collaborator)

when you tested, did you try running these tests multiple times in a row? I'd like to know how close these tolerances are to the maximum difference that you saw. It's good to push these tolerances down to catch big differences between distributed training and non-distributed training, but not so close to the edge that the tests might fail sometimes due to randomness in how the chunks are distributed across the workers.

jmoralez (Collaborator, Author)

For the test above, the maximum absolute difference I observed was 0.0025, so 0.01 should be fine. For this one, I believe we can even remove the atol, since this checks the score for the same model, doesn't it?

jameslamb (Collaborator)

great, thank you. Yes, I think you're right. s1 and s1_local should be identical since they're from the same model.

jmoralez (Collaborator, Author)

I played around with the checks for the predictions and the target and found that with atol=50 we could go down to rtol=0.5 (the maximum observed was 0.25). But maybe we could instead check p1 vs p2? They're very similar, so that might be a better check, and we could go very low on the tolerances as well. Also, I switched the order in the comparison for this one because assert_eq(a, b) checks abs(a - b) <= atol + rtol * abs(b), so I think it makes more sense that b = y in this case. What do you think?

jameslamb (Collaborator)

ok I'm a little confused by your notation, sorry. Are you referring to this change specifically?

[screenshot of the diff in question]

To be honest, I don't understand what you mean. Can you give a small reproducible example that shows assert_eq() passing or failing based only on the order that you pass things in?

To me, it looks like the checks in that function are symmetric: https://github.com/dask/dask/blob/913a192fd3b35696883bdd4c4eeba2418b946b3a/dask/dataframe/utils.py#L841-L847.

@jose-moralez Mar 26, 2021

Haha, I'm sorry. Yes, I meant those checks exactly. The asymmetric part is in https://github.com/dask/dask/blob/913a192fd3b35696883bdd4c4eeba2418b946b3a/dask/dataframe/utils.py#L847, which calls np.allclose, which in turn uses np.isclose.

Here's an example:

import numpy as np

a = 100
b = 120
np.allclose(a, b, rtol=0.1, atol=9)  # (20 - 9) / 0.1 = 110 <= 120 (True)
np.allclose(b, a, rtol=0.1, atol=9)  # (20 - 9) / 0.1 = 110 <= 100 (False)

Here's the same thing with assert_eq:

import numpy as np
from dask.array.utils import assert_eq

a = np.array([100])
b = np.array([120])

assert_eq(a, b, rtol=0.1, atol=9)  # passes
assert_eq(b, a, rtol=0.1, atol=9)  # fails

jameslamb (Collaborator)

🤯 I had no idea that these were not symmetric! Thanks for the example. I see now that there's a note about this in the np.isclose() docs:

Unlike the built-in math.isclose, the above equation is not symmetric in a and b -- it assumes b is the reference value -- so that isclose(a, b) might be different from isclose(b, a)

Ok, since this is checking that we get a "good enough" model result, I agree with the change to treat y (the actual value of the target) as the "reference value". Thanks for the attention to detail.
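As a concrete illustration of that convention (the numbers here are made up, not values from these tests; only the atol=50 / rtol=0.5 tolerances mentioned above are reused):

import numpy as np

y = np.array([200.0])   # actual target values: the reference
p1 = np.array([80.0])   # a hypothetical prediction that is off by 120

np.allclose(p1, y, rtol=0.5, atol=50)  # 120 <= 50 + 0.5 * 200 = 150 (True)
np.allclose(y, p1, rtol=0.5, atol=50)  # 120 <= 50 + 0.5 * 80 = 90 (False)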

jameslamb (Collaborator)

also now I know @jmoralez and @jose-moralez are the same person 😛

@jose-moralez

haha, yes. This is the account I use at work. I usually contribute in my free time using my personal account, but feel free to ping me on this one during work hours.

if not output.startswith('dataframe'):
assert_eq(s1, s2, atol=.01)
assert_eq(s1, s1_local, atol=.003)
assert_eq(s1, s2, atol=.01)
jameslamb (Collaborator)

Suggested change:
- assert_eq(s1, s2, atol=.01)
+ assert_eq(s1, s2, atol=0.01)

@jose-moralez

There's a failing test in the linux_regular job. I've seen this happen a couple of times now: the environment solver finds conflicts and then gives up.

@jameslamb self-requested a review on March 26, 2021 at 17:14
@jameslamb (Collaborator) left a comment

Looks great to me, thanks very much for the suggestions.

@jameslamb (Collaborator)

> There's a failing test in the linux_regular job. I've seen this happen a couple of times now: the environment solver finds conflicts and then gives up.

yeah, I've seen this a few times too :/

We try not to use pinned dependencies for Python packages in this project's CI, to be sure the project is tested against new releases. Sometimes this can result in failures like that, where conda thinks it cannot resolve an environment. Usually one rebuild fixes this, and I've just restarted that job.

@jameslamb merged commit 8cc6eef into microsoft:master on Mar 26, 2021
@jmoralez deleted the tests-categoricals branch on March 26, 2021 at 20:27
@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited the conversation to collaborators on Aug 23, 2023