SLEP012 - DataArray #25


Merged · 5 commits · Feb 18, 2020
135 changes: 135 additions & 0 deletions slep012/proposal.rst
@@ -0,0 +1,135 @@
.. _slep_012:

==========
InputArray
==========

Member: InputArray or DataArray? The PR intro says the latter.

Member: Also just came up with InputArray. It's better than DataArray because it's not taken in the Python ecosystem yet, AFAIK (OpenCV has it in C++?), but I agree with @jnothman below in that we also use it for output - and actually the primary purpose of it is using it for output :-/

Member Author: MetaArray?

Member: Wouldn't that be an array of arrays? Or an array of metadata?
Member: I think you should start with the problem we are trying to solve, and then give a motivating example of how things will look once we've solved the problem. So start with motivation, I'd say.


:Author: Adrin Jalali
:Status: Draft
:Type: Standards Track
:Created: 2019-12-20

Motivation
**********

This proposal presents a solution for propagating feature names through
transformers, pipelines, and the column transformer. Ideally, we would have::

    df = pd.read_csv('tabular.csv')
    # transforming the data in an arbitrary way
    transformer0 = ColumnTransformer(...)
    # a pipeline preprocessing the data and then a classifier (or a regressor)
    clf = make_pipeline(transformer0, ..., SVC())

    # now we can investigate features at each stage of the pipeline
    clf[-1].input_feature_names_

The feature names are propagated throughout the pipeline and the user can
investigate them at each step of the pipeline.

This proposal suggests adding a new data structure, called ``InputArray``,
which augments the data array ``X`` with additional meta-data. In this proposal
we assume the feature names (and other potential meta-data) are attached to the
data when passed to an estimator. Alternative solutions are discussed later in
this document.

A main constraint of this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
transformer would not break. This SLEP focuses on *feature names* as the only
meta-data attached to the data. Support for other meta-data can be added later.

Member: This "backwards compatible" is a pretty loose statement that deserves some explication. Will the object be an instance of ndarray? Will operations over DataArray produce raw ndarrays in all cases, or the subtype in some? How do we handle sparse arrays?

Member: Do we need this backwards compatibility, or do we target a major release and require users to unwrap the data structure for some operations?

Backward/NumPy/Pandas Compatibility
***********************************

Since transformers currently return a ``numpy`` or a ``scipy`` array, backward
compatibility in this context means that operations which are valid on those
arrays should also be valid on the new data structure.

All operations are delegated to the *data* part of the container: the
meta-data is lost immediately after each operation, and operations result in a
plain ``numpy.ndarray``. This includes indexing and slicing; to avoid
performance degradation, ``__getitem__`` is not overloaded, and if the user
wishes to preserve the meta-data, they shall do so by explicitly calling a
method such as ``select()``. Operations between two ``InputArray`` objects
will not try to align rows and/or columns of the two given objects.
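
A minimal sketch of how these semantics could behave, assuming a
``numpy.ndarray`` subclass; the SLEP intentionally leaves the mechanism open,
so everything below other than ``select()`` and the drop-to-``ndarray``
behavior is illustrative::

    import numpy as np

    class InputArray(np.ndarray):

        def __new__(cls, data, feature_names=None):
            obj = np.asarray(data).view(cls)
            if feature_names is not None:
                feature_names = np.asarray(feature_names, dtype=object)
            obj.feature_names = feature_names
            return obj

        def __array_wrap__(self, out_arr, context=None, return_scalar=False):
            # any ufunc or arithmetic operation drops the meta-data and
            # returns a plain numpy.ndarray
            return np.asarray(out_arr)

        def __getitem__(self, key):
            # indexing and slicing are not overloaded: delegate to numpy
            # and return a plain numpy.ndarray
            return np.asarray(np.ndarray.__getitem__(self, key))

        def select(self, columns):
            # explicitly preserve the meta-data when selecting columns
            # (``columns`` are integer column indices here)
            idx = np.asarray(columns)
            data = np.asarray(self)[:, idx]
            names = getattr(self, 'feature_names', None)
            if names is not None:
                names = names[idx]
            return InputArray(data, names)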

Ideally, ``pandas`` compatibility would come as ``pd.DataFrame(inputarray)``,
for which ``pandas`` does not provide a clean API at the moment.
Alternatively, ``inputarray.todataframe()`` would return a
``pandas.DataFrame`` with the relevant meta-data attached.

Feature Names
*************

Feature names are an object ``ndarray`` of strings aligned with the columns.
They can be ``None``.
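
Using the sketch above (the ``feature_names`` attribute name is an
assumption, not settled by the SLEP), one would expect::

    >>> X = InputArray(np.random.rand(5, 2), feature_names=['age', 'height'])
    >>> X.feature_names
    array(['age', 'height'], dtype=object)
    >>> InputArray(np.random.rand(5, 2)).feature_names is None
    True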

Operations
**********
Member (comment on lines +67 to +68): How do transformers behave w.r.t. the new input format? We want them to return a DataArray if they are passed a DataArray, I suppose.


Estimators understand the ``InputArray`` and extract the feature names from
the given data before applying operations and transformations to the data.

All transformers return an ``InputArray`` with feature names attached to it.
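
For example, the expected behavior would be along these lines (a sketch of
the intended semantics, not a prescribed implementation; ``feature_names``
as above)::

    >>> X = InputArray(np.random.rand(100, 2), feature_names=['age', 'height'])
    >>> Xt = StandardScaler().fit_transform(X)
    >>> Xt.feature_names  # meta-data re-attached by the transformer
    array(['age', 'height'], dtype=object)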

Member: Then how do we ensure the backward compatibility criteria from above, namely: "code which expects a numpy.ndarray as the output of a transformer would not break"?

Member Author: Since any operation on the InputArray would result in a pure numpy array, the existing code won't break.

Member: But how is this enabled, technically?

Member: What I'm trying to say is, this should be explained in the SLEP.

Member Author: I thought we agreed on not including those details here :P (#25 (comment))

Member: Right, OK. As Andy noted, it should be mentioned in the SLEP that the details are intentionally left out. Though IMHO this is a slippery slope: leaving out details may mean ending up with a vacuous SLEP with no actual proposal.

Member Author: I don't think so. There are different ways of implementing this feature, and I don't think it's within the scope of the SLEP to define those ways. The SLEP is about the API, not how that API is achieved. We may move from one solution to another, without having to write a SLEP for it, as long as it doesn't substantially change the API.

The way feature names are generated is discussed in *SLEP007 - The Style of The
Feature Names*.

Sparse Arrays
*************

Ideally sparse arrays follow the same pattern, but since ``scipy.sparse`` does
not provide the kind of API provided by ``numpy``, we may need to find
compromises.

Factory Methods
***************
Member: Can you give some examples? I suppose these would look like, e.g.::

    DataArray.from_pandas(...)
    DataArray.from_xarray(...)
    DataArray(numpy_array, feature_names)


There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or
an ``sp.sparse`` matrix and a given set of feature names.

Member: What about conversion in the other direction?

An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
``todataframe()`` method.

``X`` being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
    >>> pd.DataFrame(X)  # only if pandas implements the API

Comment: Just my two cents: Spark already has toPandas(), and given that eventually there might be different "DataFrames" (e.g. Modin, Dask), wouldn't it make sense to have x.topandas() rather than a generic x.todataframe()?

And given ``X`` a ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
feature names, one can make the right ``InputArray`` using::

    >>> make_inputarray(X, feature_names)
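
A rough sketch of such a factory, reusing the illustrative ``InputArray``
class from the compatibility section above (the exact dispatch is an
assumption, not part of the SLEP)::

    import numpy as np
    import pandas as pd
    from scipy import sparse

    def make_inputarray(X, feature_names=None):
        if isinstance(X, pd.DataFrame):
            if feature_names is None:
                feature_names = np.asarray(X.columns, dtype=object)
            X = X.to_numpy()
        if sparse.issparse(X):
            # sparse support is an open question (see "Sparse Arrays" above)
            raise NotImplementedError("sparse input is not settled yet")
        return InputArray(X, feature_names)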

Alternative Solutions
*********************

Since we expect the feature names to be attached to the data given to an
estimator, there are a few potential approaches we can take:

- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
  as a ``pandas.DataFrame``, and if so, the transformer would output a
  ``pandas.DataFrame`` which also includes the [generated] feature names. This
  is not a feasible solution since ``pandas`` plans to move to a per-column
  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
  guaranteed memory copies.
- ``XArray``: we could accept a ``pandas.DataFrame``, and use
  ``xarray.DataArray`` as the output of transformers, including feature names.
  However, ``xarray`` has a hard dependency on ``pandas``, and uses
  ``pandas.Index`` to handle row labels and aligns rows when an operation
  between two ``xarray.DataArray`` objects is done (see the short example
  after this list), which can be time consuming, and is not the semantics
  expected in ``scikit-learn``; we only expect the number of rows to be
  equal, and that the rows always correspond to one another in the same order.
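
To illustrate the alignment issue, ``xarray`` aligns operands on their labels
rather than on their positions (a small example of ``xarray``'s default
behavior, not scikit-learn code)::

    >>> import numpy as np, xarray as xr
    >>> a = xr.DataArray(np.ones(2), dims='x', coords={'x': [0, 1]})
    >>> b = xr.DataArray(np.ones(2), dims='x', coords={'x': [1, 2]})
    >>> (a + b).coords['x'].values  # only the overlapping label survives
    array([1])
    >>> float((a + b).sum())
    2.0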

As a result, we need to have another data structure which we'll use to
transfer data-related information (such as feature names), which is
lightweight and doesn't interfere with existing user code.

Another alternative to the problem of passing meta-data around is to pass it
as a parameter to ``fit``. This would involve heavily modifying
meta-estimators, since they'd need to pass that information along, and
extract the relevant information from each estimator to pass to the next one.
Our prototype implementations showed significant challenges compared to when
the meta-data is attached to the data.
1 change: 1 addition & 0 deletions under_review.rst
@@ -9,3 +9,4 @@ SLEPs under review
:maxdepth: 1

slep007/proposal
slep012/proposal