-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ci] [python] reduce unnecessary data loading in tests #3486
Conversation
interesting, I see this in some tests:
Maybe there was not a default value in older versions of Python? Because today it has a default of 128: https://docs.python.org/3/library/functools.html#functools.lru_cache Anyway, I'm going to try switching these to just
UPDATE: nevermind, |
@jameslamb I think we can have |
@StrikerRUS can you explain why you think From that code snippet, it looks like it would be faster than setting a I understand why it would be bad in user code, since the cache can grow without limit, but for these unit tests where we completely control the set of unique combinations and know it to be small, I think we should have a preference for the faster option. |
@jameslamb |
oh I see. No I really just cared more about the caching than using Least Recently Used (LRU), since we have so few variations of kwargs for each dataset. I did forget though that |
@jameslamb Ah OK! I misunderstood you then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameslamb Please check my comments below.
import warnings | ||
warnings.warn("Could not import functools.lru_cache", RuntimeWarning) | ||
|
||
def lru_cache(maxsize=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please memoize this too
self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(*load_breast_cancer(return_X_y=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would be the purpose? That's the only call to load_breast_cancer()
in that module. So I think the caching would add a tiny bit of overhead for no benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it differ with other calls you've memoized? 5 calls of load_breast_cancer()
in test_plotting.py
is even more than 3 calls of the same function in test_basic.py
, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, but the method in which this call is performed (setUp
) is called before each test. So, actually we have 5 calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OOOOOOOOOOO haha ok. I haven't used unittest.TestCase
in a while, I forgot which one was a "run before every test" setup and which one was a "run exactly once, before any tests" one.
Ok yes I'll update this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in dfb0fd3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Actually I think we should refactor this to "run exactly once, before any tests" (setUpClass()
), but it is another issue, of course.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks a lot!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
I spent some time this weekend learning
snakeviz
, a Python profiler. I tried running it over the Python package tests here, and I think I found an opportunity to cut the test run time.I realized we're spending time loading and reloading the same datasets. This PR proposes reducing those calls by caching the results of
sklearn.datasets.load_*
calls. The datasets are really small so I don't think holding them all in memory will cause an issue.UPDATE: Ok from some experiments, the speedup from this seems really small on my laptop. But might be larger in CI environments, where we know that disk I/O is a lot slower (#2965 (comment)).
Results
These results were obtained on my laptop. The blockquote results come from one run of each test. Then I ran them all again 10 times to get a better estimate of the mean time.
test_basic.py
before (mean = 1.33s)
after (mean = 1.36s)
test_engine.py
before (mean = 12.78s)
after (mean = 11.84s)
test_sklearn.py
before(mean = 9.68s)
after (mean = 9.50s)
How to reproduce these tests