Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames #113

Since we no longer maintain the column names, it seems that we could replace the pandas DataFrames in our pipeline structure with numpy matrices. We already convert the data into numpy matrices whenever we pass it to the sklearn operations, so I don't see the point of using pandas DataFrames anymore.

This might also make TPOT more memory efficient, since we would no longer incur the DataFrame overhead.

To make this happen, we would need to:
- Store the train/test indices as internal `self` variables (in place of having a `group` column)
- Ensure that the `class` values are always stored in the last column of the matrix (in place of having a `class` column)
- Ensure that the latest `guess` values are always stored in the second-to-last column of the matrix (in place of having a `guess` column)
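
To make the proposed layout concrete, here is a rough sketch of what the three points above could look like. This is only an illustration, not existing TPOT code: the `MatrixPipelineData` class, its attribute names (`train_idx`, `test_idx`), and the `train_fraction` and `seed` parameters are all hypothetical.

```python
import numpy as np


class MatrixPipelineData:
    """Hypothetical container (not part of TPOT) for the proposed layout.

    Column layout of self.matrix: [feature columns..., guess, class]
    """

    def __init__(self, features, classes, train_fraction=0.75, seed=0):
        features = np.asarray(features, dtype=float)
        classes = np.asarray(classes, dtype=float)
        n_rows = features.shape[0]

        # The guess column starts out as zeros until a classifier fills it in.
        guesses = np.zeros(n_rows)
        self.matrix = np.column_stack([features, guesses, classes])

        # Train/test membership lives on the object instead of a `group` column.
        rng = np.random.RandomState(seed)
        shuffled = rng.permutation(n_rows)
        n_train = int(round(train_fraction * n_rows))
        self.train_idx = np.sort(shuffled[:n_train])
        self.test_idx = np.sort(shuffled[n_train:])

    @property
    def features(self):
        # Everything except the last two columns.
        return self.matrix[:, :-2]

    @property
    def guess(self):
        # Latest guess is always the second-to-last column.
        return self.matrix[:, -2]

    @property
    def classes(self):
        # Class is always the last column.
        return self.matrix[:, -1]

    def set_guess(self, new_guess):
        # Overwrite the guess column in place.
        self.matrix[:, -2] = new_guess


data = MatrixPipelineData(np.arange(16).reshape(8, 2), [0, 1, 0, 1, 0, 1, 0, 1])
data.set_guess(np.ones(8))
train_features = data.features[data.train_idx]
```

With this arrangement, any operator that needs the training features can slice with `data.features[data.train_idx]` and pass the result straight to sklearn, without a DataFrame-to-array conversion step.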

I believe that this would also make #29 much easier to implement.

Any downsides to this change that we can think of?

