max_eval_time_mins parameter doesn't stop a long-running eval #508

Open
dnuffer opened this issue Jun 24, 2017 · 16 comments

@dnuffer commented Jun 24, 2017

During fit(), I noticed that some evaluations were running for over an hour even though max_eval_time_mins=5.
This is because the code is not actually stopping the thread doing the evaluation.

In the Interruptable_cross_val_score class, stop() does not actually interrupt the thread. Instead it calls self._stopevent.set(), which has no effect because nothing checks that event, and then waits for the thread to stop on its own.

Python doesn't have a good way to interrupt threads. See the discussion at https://stackoverflow.com/questions/323972/is-there-any-way-to-kill-a-thread-in-python

Given that _wrapped_cross_val_score() already runs in a separate process under joblib, one solution would be to make Interruptable_cross_val_score a daemon thread and remove the call to tmp_it.stop() in _wrapped_cross_val_score(). That way, once the timeout passes, the process can exit cleanly.
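
For illustration, a minimal sketch of that daemon-thread idea (the run_with_timeout helper and the plain cross_val_score call are stand-ins for illustration, not TPOT's actual internals):

import threading

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def run_with_timeout(estimator, X, y, timeout_secs):
    result = {}

    def run_eval():
        # Store the scores if the evaluation ever finishes.
        result['scores'] = cross_val_score(estimator, X, y, cv=3)

    worker = threading.Thread(target=run_eval)
    worker.daemon = True  # a daemon thread cannot keep the worker process alive
    worker.start()
    worker.join(timeout_secs)
    if worker.is_alive():
        # The evaluation overran its budget: abandon it. Because the thread is
        # a daemon, the surrounding (joblib) process is still free to exit.
        return None
    return result['scores']

if __name__ == '__main__':
    X, y = make_classification(n_samples=500, random_state=42)
    print(run_with_timeout(LogisticRegression(max_iter=1000), X, y, timeout_secs=60))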

@weixuanfu (Contributor)

Thank you for the suggestion. I will look into it. One of my branches uses the stopit module for the timeout function; I need to check whether it is better than the daemon-thread solution.

@rhiever (Contributor) commented Jun 27, 2017

@weixuanfu2016's PR that should fix this issue has been merged into the development branch. @dnuffer, can you try the dev branch and let us know if that corrects your issue?

@dnuffer (Author) commented Jul 2, 2017

I tried the development branch, and it didn't fix the issue. I think that's probably because the stopit module is a pure-Python solution and can only interrupt a thread once it runs some Python code; since the core of most ML training algorithms is written in non-Python code, stopit never gets a chance to interrupt the thread.

I also tried my earlier suggestion, but it doesn't work because a pool of processes is used and a process doesn't exit once an evaluation completes, which leaves the threads running.

I have successfully used the timeout feature in hyperopt-sklearn, and so I dug into how it works.
This is the code: https://github.com/hyperopt/hyperopt-sklearn/blob/master/hpsklearn/estimator.py
The trial_timeout variable controls how long each trial is allowed to run. fn_with_timeout() is where the action happens: each trial is run in a separate process (using multiprocessing.Process), with a Pipe used to communicate the result at the end of _cost_fn(). When a timeout happens, the child process is terminated, ensuring a certain and clean exit.
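
A rough sketch of that Process-plus-Pipe pattern (this is not the actual hyperopt-sklearn code; evaluate_with_timeout and the plain cross_val_score call are illustrative stand-ins):

import multiprocessing

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def _evaluate(conn, estimator, X, y):
    # Runs in the child process; send the result back through the pipe.
    scores = cross_val_score(estimator, X, y, cv=3)
    conn.send(scores.mean())
    conn.close()

def evaluate_with_timeout(estimator, X, y, trial_timeout):
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=_evaluate, args=(child_conn, estimator, X, y))
    proc.start()
    proc.join(trial_timeout)
    if proc.is_alive():
        # Timed out: terminating the child process gives a certain, clean stop
        # even while the training loop is deep inside non-Python (C/Fortran) code.
        proc.terminate()
        proc.join()
        return None
    return parent_conn.recv() if parent_conn.poll() else None

if __name__ == '__main__':
    X, y = make_classification(n_samples=500, random_state=42)
    print(evaluate_with_timeout(LogisticRegression(max_iter=1000), X, y, trial_timeout=60))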

@weixuanfu (Contributor)

Hmm, interesting. Thank you for these tests. I will look into it.

@CSNoyes commented Jul 5, 2017

I can report similar issues. Using the development branch also does not fix it.

@weixuanfu (Contributor)

I just posted PR #522, which uses the hyperopt-sklearn approach of killing the child process. Could you try that branch, using the command below, and let us know if it corrects your issue? @dnuffer @CSNoyes

pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu2016/tpot.git@timeout_pipe

@rhiever For this approach, I need to use the threading backend in joblib instead of multiprocessing. It may not be as efficient as before for parallel computing.

One drawback is that CTRL+C only works on Linux and Mac, not on Windows, so I added a warning message about it.
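
For context, a minimal, hedged illustration of switching joblib to the threading backend (this is not TPOT's actual code; the toy sqrt workload is only a placeholder):

import math

from joblib import Parallel, delayed, parallel_backend

# With the 'threading' backend, worker tasks share the main process, so each
# task is free to spawn and terminate its own child process for a timed evaluation.
with parallel_backend('threading', n_jobs=2):
    results = Parallel()(delayed(math.sqrt)(i) for i in range(10))

print(results)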

@CSNoyes commented Jul 5, 2017

@weixuanfu2016 Tested on OSX 10.12 and Ubuntu 14.04 with a high-dimensionality dataset (I think polynomial features was getting stuck); looks good so far. Will update if it creeps back in.

@dnuffer (Author) commented Jul 9, 2017

The timeout_pipe branch has fixed the issue for me.

@weixuanfu (Contributor) commented Jul 11, 2017

@dnuffer @CSNoyes Thank you for the feedback.

@rhiever and I took another look at this issue. We think it might be related to the start method used by multiprocessing. I also reproduced the freezing issue with n_jobs > 1 on macOS and Linux, but everything seems fine when n_jobs = 1.

@dnuffer @CSNoyes Could you please let me know the system environment, Python version, and n_jobs setting you were using when this issue occurred? Thanks.

@weixuanfu (Contributor) commented Jul 11, 2017

I tested the forkserver start method with n_jobs > 1, and it at least solved the freezing issue when using TPOT on very large datasets. I put a demo in a branch; it can only be tested on Linux and macOS with Python 3.4+.

@dnuffer (Author) commented Jul 11, 2017

I have been using Ubuntu 17.04 with Python 3.5.3, mostly with n_jobs=22, on a dataset with a dimensionality of ~17,000.

@weixuanfu (Contributor) commented Jul 20, 2017

@dnuffer @CSNoyes

Below is a demo that uses forkserver to work around this issue on Linux and macOS. I am still deciding whether we should put this solution into the codebase; it does not seem easy to use forkserver with joblib in interactive mode. Maybe we should add a friendly warning message about it and/or document the solution from the demo below, similar to the Q&A in the scikit-learn docs. Please let me know if the issue still exists with the demo below in your environment. Thanks.

import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    # Note: the sklearn/tpot imports need to happen inside __main__, otherwise a
    # RuntimeError ("context has already been set") is raised
    from sklearn.datasets import make_classification
    from tpot import TPOTClassifier

    # make a huge dataset
    X, y = make_classification(n_samples=50000, n_features=200,
                               n_informative=20, n_redundant=20,
                               n_classes=5, random_state=42)

    # max_eval_time_mins=0.04 gives roughly a 2-second limit for evaluating a single pipeline
    # works in Python 3.4+ on Linux and macOS
    tpot = TPOTClassifier(generations=5, population_size=50, offspring_size=100,
                          random_state=42, n_jobs=2, max_eval_time_mins=0.04,
                          verbosity=3)
    tpot.fit(X, y)

@jaksmid commented Oct 4, 2017

The PR branch worked for me on a big dataset (approx. 600 MB); the 0.8/0.9 branch freezes.
Could we reopen the discussion about a permanent fix? It seems the PR was closed without being merged.

@weixuanfu (Contributor)

@jaksmid did version 0.9 freeze with the forkserver start method? Or it may be a memory issue with a large n_jobs. Could you please provide more details about the issue in your environment?

The reason I closed that PR is that it did not save computation time with n_jobs > 1 in my tests.

@jaksmid commented Oct 4, 2017

Thanks @weixuanfu for the speedy response.

If I add the

import multiprocessing
multiprocessing.set_start_method('forkserver')

lines, it seems to work. Otherwise it utilises all cores to 100%; after some time the CPU consumption per core drops to zero with no observable progress. Memory pressure does not seem to be a problem.

Using Python 3.6.0 in a virtualenv on macOS Sierra.

Please let me know if you need further information.

@fabianmax

In my experiments, tpot still ignores the max_eval_time_mins=5 parameter for datasets of between 1,000 and 5,000 observations (5 to 25 columns). When fit() is called, tpot runs for an indefinitely long time (at least several hours).

While I am able to stop the process by using the early_stop parameter, I would really like to set a specific time period.

I am using tpot version 0.9.3, Python 3.6.2 and OSX 10.13.6; tpot runs in single-thread mode (n_jobs=1).

Please let me know if you need any further information.
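
For anyone landing here, a hedged sketch of the time-related settings mentioned in this thread (assuming TPOT 0.9.x parameter names, and not a confirmed fix for the bug above): max_eval_time_mins is the per-pipeline budget this issue is about, early_stop ends the run after a number of generations without improvement, and max_time_mins (which I believe TPOT also exposes) caps the total optimization time.

from tpot import TPOTClassifier

# Illustrative configuration only; parameter behaviour as documented for TPOT 0.9.x.
tpot = TPOTClassifier(generations=100, population_size=50,
                      max_time_mins=60,       # overall wall-clock budget for fit()
                      max_eval_time_mins=5,   # per-pipeline evaluation budget (this issue)
                      early_stop=5,           # stop after 5 generations without improvement
                      n_jobs=1, verbosity=2)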
