SLEP012 - DataArray #25
.. _slep_012:

==========
InputArray
==========

:Author: Adrin Jalali
:Status: Draft
:Type: Standards Track
:Created: 2019-12-20

Motivation
**********

This proposal describes a solution for propagating feature names through
transformers, pipelines, and the column transformer. Ideally, we would
have::

    df = pd.read_csv('tabular.csv')
    # transforming the data in an arbitrary way
    transformer0 = ColumnTransformer(...)
    # a pipeline preprocessing the data and then a classifier (or a regressor)
    clf = make_pipeline(transformer0, ..., SVC())

    # now we can investigate features at each stage of the pipeline
    clf[-1].input_feature_names_

The feature names are propagated throughout the pipeline and the user can
investigate them at each step of the pipeline.

This proposal suggests adding a new data structure, called ``InputArray``,
which augments the data array ``X`` with additional meta-data. In this
proposal we assume the feature names (and other potential meta-data) are
attached to the data when passed to an estimator. Alternative solutions are
discussed later in this document.

A main constraint of this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
transformer would not break. This SLEP focuses on *feature names* as the
only meta-data attached to the data. Support for other meta-data can be
added later.

Backward/NumPy/Pandas Compatibility
***********************************

Since transformers currently return a ``numpy`` or a ``scipy`` array,
backward compatibility in this context means that the operations which are
valid on those arrays should also be valid on the new data structure.

All operations are delegated to the *data* part of the container, and the
meta-data is lost immediately after each operation: operations result in a
plain ``numpy.ndarray``. This includes indexing and slicing, *i.e.* to avoid
performance degradation, ``__getitem__`` is not overloaded; if the user
wishes to preserve the meta-data, they should do so by explicitly calling a
method such as ``select()``. Operations between two ``InputArray``\ s will
not try to align rows and/or columns of the two given objects.

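The behaviour described above could be sketched as a thin ``numpy.ndarray``
subclass. This is only an illustration of the semantics, not a proposed
implementation; the ``feature_names_`` attribute name and the ``select()``
signature are assumptions:

```python
import numpy as np

class InputArray(np.ndarray):
    """Hypothetical sketch: an ndarray that carries feature names."""

    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names_ = feature_names
        return obj

    def __array_wrap__(self, out_arr, context=None, return_scalar=False):
        # Operations deliberately return a plain ndarray: the
        # meta-data is dropped rather than propagated.
        return np.asarray(out_arr)

    def __getitem__(self, key):
        # Indexing/slicing is not overloaded either; fall back to
        # ndarray semantics and return plain data without names.
        return np.asarray(np.ndarray.__getitem__(self, key))

    def select(self, names):
        # Explicit, meta-data-preserving column selection.
        idx = [list(self.feature_names_).index(n) for n in names]
        return InputArray(np.asarray(self)[:, idx], feature_names=list(names))
```

With this sketch, ``X + 1`` and ``X[:, 0]`` yield plain ``numpy.ndarray``
objects, while ``X.select([...])`` is the one operation that keeps the names.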
``pandas`` compatibility would ideally come as ``pd.DataFrame(inputarray)``,
for which ``pandas`` does not provide a clean API at the moment.
Alternatively, ``inputarray.todataframe()`` would return a
``pandas.DataFrame`` with the relevant meta-data attached.

Feature Names
*************

Feature names are an object ``ndarray`` of strings, aligned with the columns
of ``X``. They can be ``None``.

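For illustration (the names and shapes here are made up), the alignment
constraint looks like this:

```python
import numpy as np

# Data with three columns and the matching object-dtype name array.
X = np.arange(12.0).reshape(4, 3)
feature_names = np.asarray(["age", "height", "weight"], dtype=object)

# One name per column of X.
assert feature_names.shape[0] == X.shape[1]
```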
Operations
**********

Estimators understand the ``InputArray`` and extract the feature names from
the given data before applying the operations and transformations on the
data.

All transformers return an ``InputArray`` with feature names attached to it.
The way feature names are generated is discussed in *SLEP007 - The Style of
The Feature Names*.

The implementation details of how estimators consume and produce
``InputArray`` objects are intentionally left out of this SLEP: the proposal
is about the API, not how that API is achieved, and the implementation may
change later without substantially changing the API.

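A toy transformer can illustrate the intended flow: names are extracted in
``fit`` and the surviving names are re-attached to the output of
``transform``. The ``NamedData`` stand-in and the transformer itself are
illustrative assumptions, not scikit-learn API:

```python
import numpy as np

class NamedData:
    """Minimal stand-in for the proposed InputArray: data plus names."""
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names_ = (None if feature_names is None
                               else np.asarray(feature_names, dtype=object))

class KeepEveryOtherColumn:
    """Toy transformer: keeps columns 0, 2, 4, ... of the input."""
    def fit(self, X):
        # Extract the feature names from the input, if present.
        self.input_feature_names_ = getattr(X, "feature_names_", None)
        return self

    def transform(self, X):
        data = X.data[:, ::2]
        names = None
        if self.input_feature_names_ is not None:
            # Re-attach the names of the surviving columns.
            names = self.input_feature_names_[::2]
        return NamedData(data, names)
```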
Sparse Arrays
*************

Ideally sparse arrays follow the same pattern, but since ``scipy.sparse``
does not provide the kind of API provided by ``numpy``, we may need to find
compromises.

Factory Methods
***************

There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame``, an ``xarray.DataArray``, or simply an ``np.ndarray``
or an ``sp.SparseMatrix`` together with a given set of feature names.

An ``InputArray`` can also be converted back to a ``pandas.DataFrame`` using
a ``todataframe()`` method.

``X`` being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
    >>> pd.DataFrame(X)  # only if pandas implements the API

And given ``X`` a ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
feature names, one can make the right ``InputArray`` using::

    >>> make_inputarray(X, feature_names)

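A minimal sketch of such a factory, dispatching on the input type, might
look as follows. The container class here is a placeholder, sparse inputs
are duck-typed via their ``toarray`` method, and none of this is a fixed
implementation:

```python
import numpy as np
import pandas as pd

class InputArray:
    """Placeholder container: data plus a feature_names_ attribute."""
    def __init__(self, data, feature_names=None):
        self.data = data
        self.feature_names_ = feature_names

def make_inputarray(X, feature_names=None):
    # DataFrames contribute their own column names unless names are
    # passed explicitly; plain and sparse arrays take the given names.
    if isinstance(X, pd.DataFrame):
        names = list(X.columns) if feature_names is None else feature_names
        return InputArray(X.to_numpy(), names)
    if isinstance(X, np.ndarray) or hasattr(X, "toarray"):
        return InputArray(X, feature_names)
    raise TypeError(f"cannot wrap {type(X).__name__}")
```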
Alternative Solutions
*********************

Since we expect the feature names to be attached to the data given to an
estimator, there are a few potential approaches we could take:

- ``pandas`` in, ``pandas`` out: this means we expect the user to give the
  data as a ``pandas.DataFrame``, and if so, the transformer would output a
  ``pandas.DataFrame`` which also includes the [generated] feature names.
  This is not a feasible solution since ``pandas`` plans to move to a per
  column representation, which means ``pd.DataFrame(np.asarray(df))`` would
  incur two guaranteed memory copies.
- ``xarray``: we could accept a ``pandas.DataFrame`` and use
  ``xarray.DataArray`` as the output of transformers, including feature
  names. However, ``xarray`` has a hard dependency on ``pandas``, and uses
  ``pandas.Index`` to handle row labels; it aligns rows when an operation
  between two ``xarray.DataArray`` objects is performed, which can be time
  consuming and is not the semantic expected in ``scikit-learn``: we only
  expect the number of rows to be equal, and that the rows always correspond
  to one another in the same order.

As a result, we need another data structure to carry data related
information (such as feature names), one which is lightweight and doesn't
interfere with existing user code.

Another alternative to the problem of passing meta-data around is to pass it
as a parameter to ``fit``. This would heavily involve modifying
meta-estimators, since they would need to pass that information along, and
extract the relevant information from the estimators to pass on to the next
estimator. Our prototype implementations showed significant challenges
compared to when the meta-data is attached to the data.

A second file in this PR adds the proposal to the index of SLEPs under
review::

    :maxdepth: 1

    slep007/proposal
    slep012/proposal

Review discussion on the name:

- InputArray or DataArray? The PR intro says the latter.
- Also just came up with ``InputArray``. It's better than ``DataArray``
  because it's not taken in the Python ecosystem yet, AFAIK (OpenCV has it
  in C++?), but I agree with @jnothman below in that we also use it for
  output; and actually the primary purpose of it is using it for output.
- MetaArray?
- Wouldn't that be an array of arrays? Or an array of metadata?