Closed
Since we don't maintain the column names any more, it seems that we could replace the pandas DataFrames in our pipeline structure with numpy matrices. We're always changing the data into numpy matrices anyway when passing them to the sklearn operations, so I'm not seeing the point of using pandas DataFrames any more.
This might make TPOT more memory efficient, as we won't introduce DataFrame overhead either.
To make this happen, we would need to:
- Store the train/test indices as internal `self` variables (in place of having a `group` column)
- Ensure that the `class` column is always the last entry in the matrix (in place of having a `class` column)
- Ensure that the latest `guess` column is always the second-to-last entry in the matrix (in place of having a `guess` column)
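
A rough sketch of what that layout could look like, just to make the three points above concrete. The class name `MatrixPipelineData` and its attributes are hypothetical and are not part of TPOT's current code; the only assumption is the column convention proposed here (features first, then `guess`, then `class`).

```python
import numpy as np


class MatrixPipelineData:
    """Hypothetical container illustrating the proposed layout (not TPOT's API)."""

    def __init__(self, features, classes, train_indices, test_indices):
        features = np.asarray(features, dtype=float)
        classes = np.asarray(classes, dtype=float).reshape(-1, 1)
        guesses = np.zeros_like(classes)  # placeholder until a model predicts

        # Column layout: [feature_0, ..., feature_n, guess, class]
        self.matrix = np.hstack([features, guesses, classes])

        # Train/test membership lives on the object instead of a `group` column.
        self.train_indices = np.asarray(train_indices)
        self.test_indices = np.asarray(test_indices)

    @property
    def features(self):
        # Everything except the last two columns.
        return self.matrix[:, :-2]

    @property
    def guess(self):
        # Second-to-last column.
        return self.matrix[:, -2]

    @property
    def classes(self):
        # Last column.
        return self.matrix[:, -1]

    def set_guess(self, new_guess):
        # Overwrite the latest guess in place; the class column is untouched.
        self.matrix[:, -2] = new_guess


data = MatrixPipelineData(
    features=[[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 4.0]],
    classes=[0, 1, 0, 1],
    train_indices=[0, 1, 2],
    test_indices=[3],
)

# Row selection by stored indices replaces filtering on a `group` column.
train_X = data.features[data.train_indices]
train_y = data.classes[data.train_indices]
```

One thing this makes visible: everything ends up in a single float array, so any non-numeric columns would need encoding before they go into the matrix.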
I believe that this would also make #29 much easier to implement.
Any downsides to this change that we can think of?