Add Consensus Operators #96
Conversation
…cleaning up comments
… Requires more thorough testing, finishing adaboost weighting
…our, simple tests.
Happy to see this PR come in! Were the consensus operators used in any of your tests? I'm currently running a big TPOT benchmark on the cluster, but I'll line this PR up for the next benchmark.
Yeah, I ran numerous small tests that ended up with consensus operators in the pipeline. They performed well, but it's tough to compare since some of the other runs ended up with (presumably) overfit simple pipelines with perfect accuracy.
Sounds promising! I look forward to benchmarking the code then. It may take a while to get to the benchmark, though. Just a heads up.
Looks like your tests are having some issues with Python 3. I think it's because you're using Python 2.
Whoops, that's what I get for being stuck in the 2.7 past.
Tch tch tch... join us in ze future! 👍
D'oh! Now it's failing the unit tests.
I'm out and about right now, but I wouldn't be surprised if I was testing with older tests. I'll test again when I get back.
With the commits above, I've made the necessary changes to run all the tests in tests.py, and tested functionality with some small examples, all within a Python 3.5 environment. Are there any tests I'm missing?
Integration tests (i.e., running TPOT on a fixed data set with a fixed RNG).
Randal S. Olson, Ph.D. E-mail: rso@randalolson.com | Twitter: @randal_olson
I'll do a set of runs on the MNIST data and report some stats on the performance and appearance of consensus operators compared to others.
Just a small update: The base TPOT benchmark should finish up by the weekend, after which point I'll be able to throw on "TPOT-Consensus" and give it a serious spin. Will keep you posted.
Okay, then I'll hold off on going deeper into this until the benchmarks finish and work on something else in the meantime. I think there's a lot more information encoded in the input features of the various DataFrames than in the guesses alone, but perhaps it's not worth the effort right now. Are there any other schemes we want to test, though? Threshold?
Benchmarks are queued now. Have 10 copies of TPOT-Consensus running against 90 different data sets. Analyzing the resulting best pipelines should give us a good sense of whether the consensus operators are usefully contributing or not.
Yes, that could be a good one.
BTW, if you want to take a stab at #105 in a separate branch in the meantime, that would be awesome. I think that's a huge issue to address on the research end right now.
I realized that I might have a different idea of thresholding than what you're talking about: I'm thinking of assigning a DataFrame a weight of 0 (eliminating its impact on the guesses) if it does not pass a (perhaps parameterized) accuracy threshold.
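For concreteness, the weight-zeroing idea could be sketched roughly like this (a hypothetical illustration only; `threshold_weights` and the per-DataFrame accuracy inputs are assumptions, not part of TPOT or this PR):

```python
import numpy as np

def threshold_weights(accuracies, threshold=0.6):
    """Zero out the weight of any input DataFrame whose accuracy falls
    below the (parameterized) threshold; weight the rest by accuracy.

    `accuracies` holds one training-set accuracy per input DataFrame.
    """
    acc = np.asarray(accuracies, dtype=float)
    weights = np.where(acc >= threshold, acc, 0.0)
    total = weights.sum()
    # If nothing passes the threshold, fall back to uniform weighting
    # rather than silencing every input.
    if total == 0:
        return np.full(len(weights), 1.0 / len(weights))
    return weights / total
```

For example, `threshold_weights([0.9, 0.55, 0.8], threshold=0.6)` drops the middle input entirely and splits the weight between the other two in proportion to their accuracies.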
Ah, yes. I usually think of threshold as "if X% of guesses are for one …"
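That reading of thresholding, where a consensus guess only counts if enough of the classifiers agree on it, might look something like the following (a hypothetical sketch, not code from this PR; `threshold_vote` and its parameters are made up for illustration):

```python
from collections import Counter

def threshold_vote(guesses, min_fraction=0.5, fallback=None):
    """Return the majority class only if at least `min_fraction` of the
    guesses agree on it; otherwise return `fallback` (e.g., deferring to
    the single best classifier's guess)."""
    top_class, count = Counter(guesses).most_common(1)[0]
    if count / len(guesses) >= min_fraction:
        return top_class
    return fallback
```

So `threshold_vote([1, 1, 0], min_fraction=0.6)` accepts the consensus, while `threshold_vote([1, 0, 2], min_fraction=0.5, fallback=-1)` rejects it because no class clears half the votes.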
Oh wait, I merged the upstream changes without thinking about the possible consequences for the benchmark tests; should I go ahead and revert the merge?
Well, I already have a copy of TPOT-Consensus on the HPCC, so it should be okay.
Another small update: HPCC is taking bloody forever to run these jobs. They're stuck in a queue behind some bigger jobs I had queued. Bad queue management system... sigh.
The jobs are finishing up today, so I should be able to analyze the results tomorrow morning and see how this turned out. Also looks like this branch has conflicts with the latest version of TPOT. Argh. Let's not bother cleaning up that merge until we see if this feature will allow for better pipelines. |
Agreed. It's not worth it to fix the merge if the results aren't looking good. But if they are (fingers crossed), at least this PR is only ~a week behind.
Welp... I'm sad to report that TPOT doesn't really seem to be evolving pipelines with the consensus operator. Only 1.5% of the pipelines from the benchmark even contained a consensus operator, and none of those really seemed to use them in a meaningful way. It's possible that Pareto optimization is disfavoring the larger pipelines that the consensus operators entail. If you want to roll back the GP selection process to simply maximize classification accuracy again, I can grab the latest from this fork and re-run the benchmark.

I should also note that a large portion (over half) of the runs didn't finish in time -- I only gave each run 8 hours to complete 100 generations -- so it's possible that consensus operators were being used there. That's still a bad sign, though, as it likely means that TPOT with the consensus operators is even slower than TPOT already is. Not good!

Perhaps a more promising path is to try to combine the population of pipelines into ensembles, as in #105. Really looking forward to hearing how that pans out.
That stinks, but negative results are useful results too, I suppose. I'll take a look at testing without Pareto optimization when I get the chance, but I agree that #105 is probably more promising.
What does this PR do?
Addresses #77. Adds three consensus pipeline operators (consensus_two, consensus_three, and consensus_four), along with corresponding export_utils code and a test.
Where should the reviewer start?
consensus_two and the weighting/combination functions defined above it.
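As a rough mental model for reviewers, the general shape of a weighted consensus over classifier guesses is sketched below. This is an illustration of the idea only and is not the PR's actual `consensus_two` implementation, which works on DataFrames and exposes parameterized weighting schemes:

```python
import numpy as np
from collections import defaultdict

def consensus(guess_sets, weights):
    """Combine per-classifier guesses into one guess per row via a
    weighted vote.

    `guess_sets` is a list of equal-length guess sequences (one per
    classifier); `weights` holds one weight per classifier, e.g. its
    training accuracy.
    """
    combined = []
    for row_guesses in zip(*guess_sets):
        # Accumulate each candidate class's total weight for this row,
        # then keep the class with the highest weighted support.
        scores = defaultdict(float)
        for guess, w in zip(row_guesses, weights):
            scores[guess] += w
        combined.append(max(scores, key=scores.get))
    return np.array(combined)
```

For instance, `consensus([[0, 1, 1], [0, 0, 1]], weights=[0.9, 0.6])` sides with the heavier-weighted first classifier on the row where the two disagree.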
How should this PR be tested?
Check whether the consensus operators contribute more to the overall fitness of the generated populations than _combine_dfs alone does. The export code could use more thorough testing as well.
Any background context you want to provide?
I originally had an additional weighting scheme I was trying to put into place, but implementing it was challenging, so I opted to remove it.
What are the relevant issues?
#77
Screenshots (if appropriate)
Questions:
I don't think so.
No, everything's implemented from scratch.