Add Consensus Operators #96
Conversation
…cleaning up comments
… Requires more thorough testing, finishing adaboost weighting
…our, simple tests.
Happy to see this PR come in! Were the consensus operators used in any of your tests? I'm currently running a big TPOT benchmark on the cluster, but I'll line this PR up for the next benchmark.
Yeah, I ran numerous small tests that ended up with consensus operators in the pipeline. They performed well, but it's tough to compare since some of the other runs ended up with (presumably) overfit simple pipelines with perfect accuracy.
Sounds promising! I look forward to benchmarking the code then. It may take a while to get to the benchmark, though. Just a heads up.
Looks like your tests are having some issues with Python 3. I think it's because you're using Python 2.
Whoops, that's what I get for being stuck in the 2.7 past.
Tch tch tch... join us in ze future! 👍
D'oh! Now it's failing the unit tests.
I'm out and about right now, but I wouldn't be surprised if I was testing with older tests. I'll test again when I get back.
With the commits above, I've made the necessary changes to run all the tests in tests.py, and tested functionality with some small examples, all within a Python 3.5 environment. Are there any tests I'm missing?
Integration tests (i.e., running TPOT on a fixed data set with a fixed RNG).
Randal S. Olson, Ph.D. E-mail: rso@randalolson.com | Twitter: @randal_olson
I'll do a set of runs on the MNIST data and report some stats on the performance and appearance of consensus operators compared to others.
Just a small update: The base TPOT benchmark should finish up by the weekend, after which point I'll be able to throw on "TPOT-Consensus" and give it a serious spin. Will keep you posted.
Okay, then I'll hold off on going deeper into this until the benchmarks finish and work on something else in the meantime. I think there's a lot more information encoded in the input features of the various DataFrames than in the guesses alone, but perhaps it's not worth the effort right now. Are there any other schemes we want to test, though? Threshold?
Benchmarks are queued now. Have 10 copies of TPOT-Consensus running against 90 different data sets. Analyzing the resulting best pipelines should give us a good sense of whether the consensus operators are usefully contributing or not.
Yes, that could be a good one.
BTW, if you want to take a stab at #105 in a separate branch in the meantime, that would be awesome. I think that's a huge issue to address on the research end right now.
I realized that I might have a different idea of thresholding than what you're talking about: I'm thinking of assigning a DataFrame a weight of 0 (eliminating its impact on the guesses) if it does not pass a (perhaps parameterized) accuracy threshold.
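For concreteness, the weight-zeroing idea could be sketched roughly like this (a hypothetical illustration only; `threshold_weights` and the per-DataFrame accuracy inputs are assumptions, not part of TPOT or this PR):

```python
import numpy as np

def threshold_weights(accuracies, threshold=0.6):
    """Zero out the weight of any input DataFrame whose accuracy falls
    below the (parameterized) threshold; weight the rest by accuracy.

    `accuracies` holds one training-set accuracy per input DataFrame.
    """
    acc = np.asarray(accuracies, dtype=float)
    weights = np.where(acc >= threshold, acc, 0.0)
    total = weights.sum()
    # If nothing passes the threshold, fall back to uniform weighting
    # rather than silencing every input.
    if total == 0:
        return np.full(len(weights), 1.0 / len(weights))
    return weights / total
```

For example, `threshold_weights([0.9, 0.55, 0.8], threshold=0.6)` drops the middle input entirely and splits the weight between the other two in proportion to their accuracies.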
Ah, yes. I usually think of threshold as "if X% of guesses are for one …"
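That reading of thresholding, where a consensus guess only counts if enough of the classifiers agree on it, might look something like the following (a hypothetical sketch, not code from this PR; `threshold_vote` and its parameters are made up for illustration):

```python
from collections import Counter

def threshold_vote(guesses, min_fraction=0.5, fallback=None):
    """Return the majority class only if at least `min_fraction` of the
    guesses agree on it; otherwise return `fallback` (e.g., deferring to
    the single best classifier's guess)."""
    top_class, count = Counter(guesses).most_common(1)[0]
    if count / len(guesses) >= min_fraction:
        return top_class
    return fallback
```

So `threshold_vote([1, 1, 0], min_fraction=0.6)` accepts the consensus, while `threshold_vote([1, 0, 2], min_fraction=0.5, fallback=-1)` rejects it because no class clears half the votes.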
Oh wait, I merged the upstream changes without thinking about the possible consequences for the benchmark tests; should I go ahead and revert the merge?
Well, I already have a copy of TPOT-Consensus on the HPCC, so it should be okay.
Another small update: HPCC is taking bloody forever to run these jobs. They're stuck in a queue behind some bigger jobs I had queued. Bad queue management system... sigh.
The jobs are finishing up today, so I should be able to analyze the results tomorrow morning and see how this turned out. Also looks like this branch has conflicts with the latest version of TPOT. Argh. Let's not bother cleaning up that merge until we see if this feature will allow for better pipelines. |
Agreed. It's not worth it to fix the merge if the results aren't looking good. But if they are (fingers crossed), at least this PR is only ~a week behind.
Welp... I'm sad to report that TPOT doesn't really seem to be evolving pipelines with the consensus operator. Only 1.5% of the pipelines from the benchmark even contained a consensus operator, and none of those really seemed to use them in a meaningful way. It's possible that Pareto optimization is disfavoring the larger pipelines that the consensus operators entail. If you want to roll back the GP selection process to simply maximize classification accuracy again, I can grab the latest from this fork and re-run the benchmark.

I should also note that a large portion (over half) of the runs didn't finish in time -- I only gave each run 8 hours to complete 100 generations -- so it's possible that consensus operators were being used there. That's still a bad sign, though, as it likely means that TPOT with the consensus operators is even slower than TPOT already is. Not good!

Perhaps a more promising path is to try to combine the population of pipelines into ensembles, as in #105. Really looking forward to hearing how that pans out.
That stinks, but negative results are useful results too, I suppose. I'll take a look at testing without Pareto optimization when I get the chance, but I agree that #105 is probably more promising.
What does this PR do?
Addresses #77. Adds three consensus pipeline operators (consensus_two, consensus_three, and consensus_four), along with corresponding export_utils code and a test.
Where should the reviewer start?
consensus_two and the weighting/combination functions defined above it.
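As a rough mental model for reviewers, the general shape of a weighted consensus over classifier guesses is sketched below. This is an illustration of the idea only and is not the PR's actual `consensus_two` implementation, which works on DataFrames and exposes parameterized weighting schemes:

```python
import numpy as np
from collections import defaultdict

def consensus(guess_sets, weights):
    """Combine per-classifier guesses into one guess per row via a
    weighted vote.

    `guess_sets` is a list of equal-length guess sequences (one per
    classifier); `weights` holds one weight per classifier, e.g. its
    training accuracy.
    """
    combined = []
    for row_guesses in zip(*guess_sets):
        # Accumulate each candidate class's total weight for this row,
        # then keep the class with the highest weighted support.
        scores = defaultdict(float)
        for guess, w in zip(row_guesses, weights):
            scores[guess] += w
        combined.append(max(scores, key=scores.get))
    return np.array(combined)
```

For instance, `consensus([[0, 1, 1], [0, 0, 1]], weights=[0.9, 0.6])` sides with the heavier-weighted first classifier on the row where the two disagree.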
How should this PR be tested?
Check whether the consensus operators contribute more to the overall fitness of the generated populations than _combine_dfs alone does. The export code could use more thorough testing as well.
Any background context you want to provide?
I originally had an additional weighting scheme I was trying to put into place, but implementing it was challenging, so I opted to remove it.
What are the relevant issues?
#77
Screenshots (if appropriate)
Questions:
I don't think so.
No, everything's implemented from scratch.