max_eval_time_mins parameter doesn't stop a long-running eval #508

Open
dnuffer opened this issue Jun 24, 2017 · 16 comments

@dnuffer commented Jun 24, 2017

During fit(), I noticed that some evaluations were running for over an hour even though max_eval_time_mins=5.
This is because the code is not actually stopping the thread doing the evaluation.

In the Interruptable_cross_val_score class, stop() does not actually interrupt the thread. Instead it calls self._stopevent.set(), which has no effect because nothing checks that event, and then waits for the thread to stop on its own.

Python doesn't have a good way to interrupt threads. See the discussion at https://stackoverflow.com/questions/323972/is-there-any-way-to-kill-a-thread-in-python

Given that _wrapped_cross_val_score() already runs in a separate process under joblib, one solution would be to make Interruptable_cross_val_score a daemon thread and remove the call to tmp_it.stop() in _wrapped_cross_val_score(). That way, once the timeout passes, the process can exit cleanly.
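
For illustration, a minimal sketch of that daemon-thread idea (the run_with_timeout helper and the plain cross_val_score call are stand-ins for illustration, not TPOT's actual internals):

import threading

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def run_with_timeout(estimator, X, y, timeout_secs):
    result = {}

    def run_eval():
        # Store the scores if the evaluation ever finishes.
        result['scores'] = cross_val_score(estimator, X, y, cv=3)

    worker = threading.Thread(target=run_eval)
    worker.daemon = True  # a daemon thread cannot keep the worker process alive
    worker.start()
    worker.join(timeout_secs)
    if worker.is_alive():
        # The evaluation overran its budget: abandon it. Because the thread is
        # a daemon, the surrounding (joblib) process is still free to exit.
        return None
    return result['scores']

if __name__ == '__main__':
    X, y = make_classification(n_samples=500, random_state=42)
    print(run_with_timeout(LogisticRegression(max_iter=1000), X, y, timeout_secs=60))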

@weixuanfu (Contributor)

Thank you for the suggestion. I will look into it. One of my branches uses the stopit module for the timeout function; I need to check whether it is better than the daemon-thread solution.

@rhiever (Contributor) commented Jun 27, 2017

@weixuanfu2016's PR that should fix this issue has been merged into the development branch. @dnuffer, can you try the dev branch and let us know if that corrects your issue?

@dnuffer (Author) commented Jul 2, 2017

I tried the development branch, and it didn't fix the issue. I think that's probably because the stopit module is a pure-Python solution and can only interrupt a thread once it runs some Python code; since the core of most ML training algorithms is written in non-Python code, stopit never gets a chance to interrupt the thread.

I also tried my earlier suggestion, but it doesn't work because a pool of processes is used and a process doesn't exit once an evaluation completes, which leaves the threads running.

I have successfully used the timeout feature in hyperopt-sklearn, and so I dug into how it works.
This is the code: https://github.com/hyperopt/hyperopt-sklearn/blob/master/hpsklearn/estimator.py
The trial_timeout variable controls how long each trial is allowed to run. fn_with_timeout() is where the action happens: each trial is run in a separate process (using multiprocessing.Process), with a Pipe used to communicate the result at the end of _cost_fn(). When a timeout happens, the child process is terminated, ensuring a certain and clean exit.
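
A rough sketch of that Process-plus-Pipe pattern (this is not the actual hyperopt-sklearn code; evaluate_with_timeout and the plain cross_val_score call are illustrative stand-ins):

import multiprocessing

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def _evaluate(conn, estimator, X, y):
    # Runs in the child process; send the result back through the pipe.
    scores = cross_val_score(estimator, X, y, cv=3)
    conn.send(scores.mean())
    conn.close()

def evaluate_with_timeout(estimator, X, y, trial_timeout):
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=_evaluate, args=(child_conn, estimator, X, y))
    proc.start()
    proc.join(trial_timeout)
    if proc.is_alive():
        # Timed out: terminating the child process gives a certain, clean stop
        # even while the training loop is deep inside non-Python (C/Fortran) code.
        proc.terminate()
        proc.join()
        return None
    return parent_conn.recv() if parent_conn.poll() else None

if __name__ == '__main__':
    X, y = make_classification(n_samples=500, random_state=42)
    print(evaluate_with_timeout(LogisticRegression(max_iter=1000), X, y, trial_timeout=60))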

@weixuanfu (Contributor)

Hmm, interesting. Thank you for these tests. I will look into it.

@CSNoyes commented Jul 5, 2017

I can report similar issues. Using the development branch also does not fix it.

@weixuanfu (Contributor)

I just posted PR #522, which uses the hyperopt-sklearn approach of killing the child process. Could you try that branch, using the command below, and let us know if it corrects your issue? @dnuffer @CSNoyes

pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu2016/tpot.git@timeout_pipe

@rhiever For this approach, I need to use the threading backend in joblib instead of multiprocessing. It may not be as efficient as before for parallel computing.

One drawback is that CTRL+C only works on Linux and Mac, not on Windows, so I added a warning message about it.
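
For context, a minimal, hedged illustration of switching joblib to the threading backend (this is not TPOT's actual code; the toy sqrt workload is only a placeholder):

import math

from joblib import Parallel, delayed, parallel_backend

# With the 'threading' backend, worker tasks share the main process, so each
# task is free to spawn and terminate its own child process for a timed evaluation.
with parallel_backend('threading', n_jobs=2):
    results = Parallel()(delayed(math.sqrt)(i) for i in range(10))

print(results)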

@CSNoyes commented Jul 5, 2017

@weixuanfu2016 Tested on OSX 10.12 and Ubuntu 14.04 with a high-dimensionality dataset (I think polynomial features was getting stuck); looks good so far. Will update if it creeps back in.

@dnuffer (Author) commented Jul 9, 2017

The timeout_pipe branch has fixed the issue for me.

@weixuanfu (Contributor) commented Jul 11, 2017

@dnuffer @CSNoyes Thank you for the feedback.

@rhiever and I took another look at this issue. We think it might be related to the start method used by multiprocessing. I also reproduced the freezing issue with n_jobs > 1 on macOS and Linux, but everything seems fine when n_jobs = 1.

@dnuffer @CSNoyes Could you please let me know the system environment, Python version, and n_jobs setting you were using when this issue occurred? Thanks.

@weixuanfu (Contributor) commented Jul 11, 2017

I tested the forkserver start method with n_jobs > 1, and it at least solved the freezing issue when using TPOT on very large datasets. I put a demo in a branch; it can only be tested on Linux and macOS with Python 3.4+.

@dnuffer (Author) commented Jul 11, 2017

I have been using Ubuntu 17.04 with Python 3.5.3, mostly with n_jobs=22, on a dataset with a dimensionality of ~17,000.

@weixuanfu (Contributor) commented Jul 20, 2017

@dnuffer @CSNoyes

Below is a demo that uses forkserver to work around this issue on Linux and macOS. I am still deciding whether we should put this solution into the codebase; it does not seem easy to use forkserver with joblib in interactive mode. Maybe we should add a friendly warning message about it and/or document the solution from the demo below, similar to the Q&A in the scikit-learn docs. Please let me know if the issue still exists with the demo below in your environment. Thanks.

import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    # Note: the sklearn/tpot imports need to happen inside __main__, otherwise a
    # RuntimeError ("context has already been set") is raised
    from sklearn.datasets import make_classification
    from tpot import TPOTClassifier

    # make a huge dataset
    X, y = make_classification(n_samples=50000, n_features=200,
                               n_informative=20, n_redundant=20,
                               n_classes=5, random_state=42)

    # max_eval_time_mins=0.04 gives roughly a 2-second limit for evaluating a single pipeline
    # works in Python 3.4+ on Linux and macOS
    tpot = TPOTClassifier(generations=5, population_size=50, offspring_size=100,
                          random_state=42, n_jobs=2, max_eval_time_mins=0.04,
                          verbosity=3)
    tpot.fit(X, y)

@jaksmid commented Oct 4, 2017

The PR branch worked for me on a big dataset (approx. 600 MB); the 0.8/0.9 branch freezes.
Could we reopen the discussion about a permanent fix? It seems the PR was closed without being merged.

@weixuanfu (Contributor)

@jaksmid did version 0.9 freeze with the forkserver start method? Or it may be a memory issue with a large n_jobs. Could you please provide more details about the issue in your environment?

The reason I closed that PR is that it did not save computation time with n_jobs > 1 in my tests.

@jaksmid commented Oct 4, 2017

Thanks @weixuanfu for the speedy response.

If I add the

import multiprocessing
multiprocessing.set_start_method('forkserver')

lines, it seems to work. Otherwise it utilises all cores to 100%; after some time the CPU consumption per core drops to zero with no observable progress. Memory pressure does not seem to be a problem.

Using Python 3.6.0 in a virtualenv on macOS Sierra.

Please let me know if you need further information.

@fabianmax

In my experiments, tpot still ignores the max_eval_time_mins=5 parameter for datasets of between 1,000 and 5,000 observations (5 to 25 columns). When fit() is called, tpot runs for an indefinitely long time (at least several hours).

While I am able to stop the process by using the early_stop parameter, I would really like to set a specific time period.

I am using tpot version 0.9.3, Python 3.6.2 and OSX 10.13.6; tpot runs in single-thread mode (n_jobs=1).

Please let me know if you need any further information.
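
For anyone landing here, a hedged sketch of the time-related settings mentioned in this thread (assuming TPOT 0.9.x parameter names, and not a confirmed fix for the bug above): max_eval_time_mins is the per-pipeline budget this issue is about, early_stop ends the run after a number of generations without improvement, and max_time_mins (which I believe TPOT also exposes) caps the total optimization time.

from tpot import TPOTClassifier

# Illustrative configuration only; parameter behaviour as documented for TPOT 0.9.x.
tpot = TPOTClassifier(generations=100, population_size=50,
                      max_time_mins=60,       # overall wall-clock budget for fit()
                      max_eval_time_mins=5,   # per-pipeline evaluation budget (this issue)
                      early_stop=5,           # stop after 5 generations without improvement
                      n_jobs=1, verbosity=2)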
