Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames #113

@rhiever

Since we no longer maintain the column names, it seems that we could replace the pandas DataFrames in our pipeline structure with numpy matrices. We always convert the data into numpy matrices anyway when passing it to the sklearn operations, so I don't see the point of using pandas DataFrames anymore.

This might also make TPOT more memory efficient, since we would avoid the DataFrame overhead.

To make this happen, we would need to:

  • Store the train/test indices as internal instance variables (in place of having a group column)
  • Ensure that the class column is always the last entry in the matrix (in place of having a class column)
  • Ensure that the latest guess column is always the second-to-last entry in the matrix (in place of having a guess column)
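
A minimal sketch of the layout described in the list above, just to make the column conventions concrete. The class name `MatrixPipelineData` and all of its attributes are hypothetical and are not existing TPOT code; the only assumption is that every column can be stored as a float.

```python
import numpy as np


class MatrixPipelineData:
    """Hypothetical container illustrating the proposed matrix layout."""

    def __init__(self, features, classes, train_indices, test_indices):
        features = np.asarray(features, dtype=float)
        classes = np.asarray(classes, dtype=float).reshape(-1, 1)
        guesses = np.zeros_like(classes)  # placeholder for the latest guess

        # Column order: [feature columns..., guess, class]
        self.data = np.hstack([features, guesses, classes])

        # Row indices replace the old 'group' column.
        self.train_indices = np.asarray(train_indices)
        self.test_indices = np.asarray(test_indices)

    @property
    def features(self):
        # Everything except the last two columns.
        return self.data[:, :-2]

    @property
    def guess(self):
        # Second-to-last column holds the latest guess.
        return self.data[:, -2]

    @property
    def classes(self):
        # Last column holds the class.
        return self.data[:, -1]

    def train(self):
        # Integer-array indexing returns a copy of the training rows.
        return self.data[self.train_indices]

    def test(self):
        return self.data[self.test_indices]


d = MatrixPipelineData(
    features=[[1, 2], [3, 4], [5, 6]],
    classes=[0, 1, 0],
    train_indices=[0, 2],
    test_indices=[1],
)
```

Because `features`, `guess`, and `classes` are basic slices, they are views into `self.data`, so an operator could write a new guess in place with `d.guess[:] = predictions` without copying the matrix.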

I believe that this would also make #29 much easier to implement.

Any downsides to this change that we can think of?
