
Why is computation time not reported if n_jobs != 1 or != None? #895

Closed · NicolasHug opened this issue Jan 17, 2020 · 10 comments

@NicolasHug commented Jan 17, 2020
I'm running a big benchmark suite with RandomizedSearchCV(n_jobs=-1).

Unfortunately, computation time is reported only if n_jobs is None or 1.

I don't understand the reasoning in #229. Why isn't the interpretation left to the user?


As a side note: n_jobs=None can be overridden with a context manager:

from joblib import parallel_backend
from sklearn.model_selection import RandomizedSearchCV

# estimator, param_distributions, X, y are placeholders
with parallel_backend('loky', n_jobs=-1):
    RandomizedSearchCV(estimator, param_distributions, n_jobs=None).fit(X, y)

This is equivalent to just calling RandomizedSearchCV(n_jobs=-1).

With the latter, openml won't report computation time, but as far as I understand, the former will run just fine and report the computation time. So it seems that the check isn't properly enforced anyway.

CC @amueller

@amueller (Contributor) commented Feb 3, 2020

ping @janvanrijn @mfeurer ;)

@amueller (Contributor) commented Feb 3, 2020

Should we maybe additionally add a wallclock_time_millis_training, which can always be computed?

@amueller (Contributor) commented Feb 3, 2020

The reason it is wrong for n_jobs != 1 is that internally it uses process_time, which does not count any time spent in subprocesses, and it is not measuring wall-clock time.
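For illustration, here is a minimal sketch of the discrepancy (the `busy` helper is made up for the example):

```python
import time
from joblib import Parallel, delayed

def busy(n=10_000_000):
    # Burn some CPU so the workers have measurable work to do.
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    cpu0, wall0 = time.process_time(), time.perf_counter()
    Parallel(n_jobs=2)(delayed(busy)() for _ in range(4))
    # process_time only counts CPU time of this (parent) process; the work
    # done in the worker subprocesses is invisible to it.
    print("process_time:", time.process_time() - cpu0)   # close to zero
    print("wall clock  :", time.perf_counter() - wall0)  # real elapsed time
```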

@mfeurer (Collaborator) commented Feb 3, 2020

> ping @janvanrijn @mfeurer ;)

I'll come back to you after the ICML deadline.

@mfeurer (Collaborator) commented Feb 10, 2020

Thanks for raising this issue, it seems that there are indeed one or two problems here.

I believe the wallclock time is not reported when the number of cores is -1 because we can't figure out how many cores the run was actually executed on, so the number is of limited value. Admittedly, this is a very restrictive check that can be circumvented in plenty of ways (as you showed). Do you have any suggestions on how to improve on this?

> Should we maybe additionally add a wallclock_time_millis_training, which can always be computed?

That exists and is computed if n_jobs != -1.

To get the times of each base run you can check the optimization trace, which should have the time for each model fit. However, we currently don't seem to store the refit time correctly (or at all?), which to me seems like the biggest bug here.
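For reference, scikit-learn itself already tracks both quantities on a fitted search object, so the information is there to be stored; a quick illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50, 100]},
    n_iter=3,
    random_state=0,
).fit(X, y)

print(search.cv_results_["mean_fit_time"])  # per-candidate fit times (seconds)
print(search.refit_time_)                   # time to refit the best model (seconds)
```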

@NicolasHug (Author)

> Do you have any suggestions on how to improve on this?

I think you can use effective_n_jobs from joblib: https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L366
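For illustration, `effective_n_jobs` asks the active joblib backend to resolve an `n_jobs` value to a concrete worker count:

```python
from joblib import effective_n_jobs

print(effective_n_jobs(1))   # 1
print(effective_n_jobs(4))   # 4
print(effective_n_jobs(-1))  # resolves to the number of available CPUs
```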

@mfeurer (Collaborator) commented Jul 31, 2020

Yet another issue we have to think about is the recent use of OpenMP in scikit-learn, which might make it harder for us to get a useful estimate of the time used.
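For what it's worth, the threadpoolctl package (which scikit-learn itself relies on) can at least introspect and pin the native thread pools; a sketch, with `clf`, `X`, `y` as placeholders:

```python
from threadpoolctl import threadpool_info, threadpool_limits

# List the native (OpenMP/BLAS) thread pools loaded into this process:
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"])

# Pin them to one thread so that CPU time stays interpretable:
with threadpool_limits(limits=1):
    clf.fit(X, y)  # clf, X, y are placeholders
```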

@mfeurer (Collaborator) commented Feb 1, 2021

Sorry that this has stalled for so long, but now it's finally time to pick this up and finish it!

I think we basically have the following cases here which we need to consider:

  1. estimators that don't involve any parallelism, for example simple decision trees
  2. estimators that do parallelization inside themselves via BLAS or OpenMP, for example SGD or HistGradientBoosting
  3. estimators that do parallelization via joblib, for example RandomForest
  4. HPO algorithms that call an underlying algorithm multiple times via joblib

and IIRC we can measure the following things:

  1. CPU time for the whole run
  2. Wallclock time for the whole run
  3. CPU time for each individual run in HPO
  4. Wallclock time for each individual run in HPO

That means we can do the following things for cases 1-4:

  1. Easy, we can measure both CPU time and wallclock time
  2. Tricky. We can measure both CPU time and wallclock time, but for wallclock time we won't know how many CPUs were involved. If OpenML is run across several processes or machines (no idea whether this is realistic), we also no longer get reliable estimates of the process usage.
  3. Hard. We can easily measure CPU and wallclock time as long as n_jobs == 1. We can still measure wallclock time for any explicit n_jobs >= 1, as we'd know how many cores are used. With n_jobs == -1 we won't know how many cores are being used, but we could use effective_n_jobs to get an estimate. CPU time is not measurable once subprocesses are involved, as we don't have access to the CPU time of the individual jobs.
  4. Easier again. As each individual job measures its time itself, we can gather all the individual times at the end and add them up to obtain the total time taken (see the sketch after this list).
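A minimal sketch of the per-job measurement in case 4 (`timed_fit` is a hypothetical helper, not an OpenML API):

```python
import time

def timed_fit(estimator, X, y):
    """Fit and return the estimator plus its CPU and wall-clock seconds.

    process_time is meaningful here because it is called inside the worker
    process that actually performs the fit.
    """
    cpu0, wall0 = time.process_time(), time.perf_counter()
    estimator.fit(X, y)
    return estimator, time.process_time() - cpu0, time.perf_counter() - wall0

# Each joblib job runs timed_fit itself; afterwards the per-job times can
# simply be summed to obtain the total time taken across all jobs.
```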

As @NicolasHug pointed out, one can override the behavior via a context manager. Another caveat is that when using a server-worker system such as dask, one does not necessarily get all available CPUs, or the jobs might simply sit in a queue, making the wallclock time of the overall run completely useless.

Therefore, I propose to do the following:

  1. Document what we're doing
  2. Make the check robust against being circumvented via the context manager
  3. Implement storing the refit time for HPO, as asked for in Store runtime of instances of BaseSearchCV #248
  4. Figure out what to do with dask - can we somehow store which backend was used? (see the backend sketch after this list)
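For point 4, joblib does expose the active backend, which could be recorded alongside the run; note that `get_active_backend` lives in `joblib.parallel` and is not part of the documented public API, so treat this as a sketch of joblib's internals:

```python
from joblib import parallel_backend
from joblib.parallel import get_active_backend  # semi-private, may change

backend, n_jobs = get_active_backend()
print(type(backend).__name__, n_jobs)  # LokyBackend 1 by default

with parallel_backend('threading', n_jobs=2):
    backend, n_jobs = get_active_backend()
    print(type(backend).__name__, n_jobs)  # ThreadingBackend 2
```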

What do you think about this, @NicolasHug @amueller @PGijsbers?

@PGijsbers (Collaborator)

I'd be careful not to spend too much time on this, as it could become a very complicated, if not impossible, project on its own (we would have to account for different parallelization strategies and packages, and would also need to start capturing hardware information, etc.). However, making the proposed changes and then clearly documenting under which conditions what is measured, and how to interpret this data, still seems like a worthwhile change to me.

@mfeurer (Collaborator) commented Apr 13, 2021

We followed @NicolasHug's suggestion to simply log the CPU and wallclock times and leave it to the user to interpret them. To help with this, we added a lengthy example.

mfeurer closed this as completed on Apr 13, 2021