Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize use of n_jobs and reporting of computation time #1038

Merged
merged 16 commits into from
Apr 13, 2021
Merged

Conversation

Neeratyoy
Copy link
Contributor

Reference Issue

Addresses #895.

@Neeratyoy Neeratyoy requested a review from mfeurer March 24, 2021 13:55
doc/usage.rst Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
@Neeratyoy Neeratyoy requested a review from mfeurer March 29, 2021 22:39
@Neeratyoy Neeratyoy marked this pull request as ready for review March 30, 2021 14:10
Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I renamed the file so I could render it on my local machine. Also, I left yet another few comments :)

examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes.py Outdated Show resolved Hide resolved
@mfeurer
Copy link
Collaborator

mfeurer commented Apr 6, 2021

Hey, I think the example is really getting along well. Based on the current status I'm wondering three things:

  1. Should we have a summary section which summarizes the behavior of OpenML-Python, scikit-learn and joblib?
  2. Should we have a section which shows that the measured numbers will differ based on the backend?
  3. We currently don't have case 2 from Why is computation time not reported if n_jobs != 1 or != None? #895 (comment)

@Neeratyoy
Copy link
Contributor Author

  1. Should we have a summary section which summarizes the behavior of OpenML-Python, scikit-learn and joblib?

Do you mean their interaction and behaviour? It would be useful I guess, but the only concern or question is what (exactly) to summarize I guess.

  1. Should we have a section which shows that the measured numbers will differ based on the backend?

Though we could, from the OpenML API standpoint, we can also ignore it. If the duty of parallelization is delegated to scikit-learn or OpenML (as we show in the example), I feel it is likely that the user may not set the backend with a parallel_backend context.
Having said that, it also might be fair to show a 4th case in the example, giving an example of changing the backend.

On second thought, this might necessitate the mention of how the backends change in the entire stack of function calls going through OpenML to scikit-learn pipelines. This is obviously quite complicated.

  1. We currently don't have case 2 from #895 (comment)

In the latest SGDClassifier documentation, the parallelization happens through joblib which I thought we are already covering.
While the HistGradientBoosting, still appears to be experimental, so didn't quite think about including it..

@Neeratyoy Neeratyoy requested a review from mfeurer April 6, 2021 15:10
@mfeurer
Copy link
Collaborator

mfeurer commented Apr 6, 2021

Do you mean their interaction and behaviour? It would be useful I guess, but the only concern or question is what (exactly) to summarize I guess.

I think the main caveats to pay attention to, like the gist of #895

Though we could, from the OpenML API standpoint, we can also ignore it.

Probably I mixed up something. But an example on how this can be changed might be a good idea!

In the latest SGDClassifier documentation, the parallelization happens through joblib which I thought we are already covering.
While the HistGradientBoosting, still appears to be experimental, so didn't quite think about including it..

Yes, but that's for fitting multiple classifiers for a "one vs all" scheme. I was more thinking about something like the neural network where the number of used cores can't be set via the API.

examples/30_extended/fetch_runtimes_tutorial.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes_tutorial.py Outdated Show resolved Hide resolved
################################################################################
# Summmary
# *********
# OpenML records model runtimes for the CPU-clock and the wall-clock times. The above
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, should we explicitly say "the scikit-learn extension"? Everything we're writing here is exclusive about the scikit-learn extension, so it could be confusing otherwise.

examples/30_extended/fetch_runtimes_tutorial.py Outdated Show resolved Hide resolved
examples/30_extended/fetch_runtimes_tutorial.py Outdated Show resolved Hide resolved
@Neeratyoy Neeratyoy requested a review from mfeurer April 8, 2021 14:07
@mfeurer
Copy link
Collaborator

mfeurer commented Apr 9, 2021

I still can't annotate all my comments... anyway, linear SVM also releases the GIL: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/svm/_liblinear.pyx#L61

Maybe naive bayes doesn't release the GIL?

@Neeratyoy Neeratyoy requested a review from mfeurer April 12, 2021 18:08
* Minor reshuffling

* Update examples/30_extended/fetch_runtimes_tutorial.py

Co-authored-by: Neeratyoy Mallik <neeratyoy@gmail.com>

Co-authored-by: Neeratyoy Mallik <neeratyoy@gmail.com>
@mfeurer mfeurer merged commit 3a1dfbd into develop Apr 13, 2021
@mfeurer mfeurer deleted the fix_895 branch April 13, 2021 16:02
github-actions bot pushed a commit that referenced this pull request Apr 13, 2021
PGijsbers pushed a commit to Mirkazemi/openml-python that referenced this pull request Feb 23, 2023
)

* Unit test to test existence of refit time

* Measuring runtime always

* Removing redundant check in unit test

* Updating docs with runtimes

* Adding more utilities to new example

* Removing refit_time + fetching trace runtime in example

* rename example

* Reiterating with changes to example from @mfeurer suggestions

* Including refit time and other minor formatting

* Adding more cases + a concluding summary

* Cosmetic changes

* Adding 5th case with no release of GIL

* Removing debug code

* Runtime measurement example updates (openml#1052)

* Minor reshuffling

* Update examples/30_extended/fetch_runtimes_tutorial.py

Co-authored-by: Neeratyoy Mallik <neeratyoy@gmail.com>

Co-authored-by: Neeratyoy Mallik <neeratyoy@gmail.com>

Co-authored-by: Matthias Feurer <feurerm@informatik.uni-freiburg.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants