
SLEP012 - DataArray #25


Merged: 5 commits into scikit-learn:master from adrinjalali:slep012/DataArray on Feb 18, 2020

Conversation

adrinjalali
Member

This SLEP discusses a new data structure which would carry some meta-data alongside the data itself. The feature names are the first set of meta-data we would implement, but the set could grow from there.

Things we need to decide to continue the write up of this SLEP:

  • scope 1: do we want to have other meta-data here in the future? A NamedArray with only feature names may be another SLEP?
  • scope 2: should this SLEP stick to feature names, or should it discuss other meta-data?
  • implementation details: do we want to include implementation details here?
  • the name: It started as NamedArray, but we seem to lean towards including other meta-data than feature names. Therefore we probably should use another name. DataArray is a placeholder. Happy to change.

- ``XArray``: we could accept a ``pandas.DataFrame``, and use
``xarray.DataArray`` as the output of transformers, including feature names.
However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to
handle row labels and aligns rows when an operation between two
``xarray.DataArray`` is done.


Uses a pandas Index for labels, not Series?

Can you expand on why alignment is an issue here? I wouldn’t expect row labels to be used internally in an estimator, but I would expect them to be preserved through a transformation.

Member Author

I added some clarification, and this is what I remember from my discussions with @GaelVaroquaux

Member Author

Also, more than happy to have you comment on any part of this proposal \o/

@jnothman
Member

jnothman commented Dec 8, 2019

should this slep stick to feature names, or should it discuss other meta-data

I think it should give examples of other potential things we might include.

@jnothman
Member

jnothman commented Dec 8, 2019

FeatureArray?

InputArray
==========

This proposal suggests adding a new data structure, called ``InputArray``,
Member

But it's used for output from scikit-learn models, isn't it?

doesn't interfere with existing user code.

A main constraint of this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
transformer, would not break.
Member

This "backwards compatible" is a pretty loose statement that deserves some explication. Will the object be an instance of ndarray? Will operations over DataArray produce raw ndarrays in all cases, or the subtype in some?

How do we handle sparse arrays?

Member

Do we need this backwards compatibility, or do we target a major release and require users to unwrap the data structure for some operations?

***************

There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or
Member

What about conversion in the other direction?

Feature Names
*************

Feature names are an array of strings aligned with the columns. They can be
Member

An array of strings? I think you mean either an object array or a list (the latter being what get_feature_names currently returns)
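For reference, the distinction in code (numpy's default string arrays are fixed-width, which is easy to trip over with feature names):

import numpy as np

names = ["age", "height_cm"]
np.array(names)                # dtype='<U9': fixed-width unicode; longer assignments get truncated
np.array(names, dtype=object)  # object array holding Python strings
names                          # plain list, what get_feature_names returns today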

@NicolasHug left a comment

Thanks for opening the SLEP

All of the above applies to sparse arrays.

Factory Methods
***************
Member

Can you give some examples? I suppose these would look like e.g.

DataArray.from_pandas(...)
DataArray.from_xarray(...)
DataArray(numpy_array, feature_names)
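A minimal sketch of what such constructors could look like (everything here is a hypothetical guess, just fleshing out the calls above):

import numpy as np

class DataArray:
    def __init__(self, data, feature_names=None):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names) if feature_names is not None else None

    @classmethod
    def from_pandas(cls, df):
        # column labels become the feature names
        return cls(df.to_numpy(), feature_names=[str(c) for c in df.columns])

    @classmethod
    def from_xarray(cls, xda, feature_dim="feature"):
        # hypothetical: read names off the coordinate of the column dimension
        return cls(xda.values, feature_names=[str(v) for v in xda.coords[feature_dim].values])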

.. _slep_012:

==========
InputArray
Member

InputArray or DataArray? The PR intro says the latter

Member

Also just came up with InputArray. It's better than DataArray because it's not taken in the python ecosystem yet afaik (OpenCV has it in C++?) but I agree with @jnothman below in that we also use it for output - and actually the primary purpose of it is using it for output :-/

Member Author

MetaArray?

Member

Wouldn't that be an array of arrays? Or an array of metadata?

Comment on lines 10 to 11
feature names to be attached to the data given to an estimator, there are a few
approaches we can take:
Member

I think alternative solutions should go at the end in their own section

Member

agreed.

==========

This proposal suggests adding a new data structure, called ``InputArray``,
which wraps a data matrix with some added information about the data. This was
Member

Wraps how?

Maybe give an example of what the DataArray class would look like? It's not clear whether the plan is to go for inheritance or composition
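To make the two options concrete, rough sketches (the names and details here are illustrative guesses, not anything the SLEP specifies):

import numpy as np

# Option 1: inheritance -- an ndarray subclass carrying feature names
class DataArray(np.ndarray):
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        return obj

    def __array_finalize__(self, obj):
        # called on views/copies numpy creates; propagate the metadata
        self.feature_names = getattr(obj, "feature_names", None)

# Option 2: composition -- a thin wrapper around the raw array
class DataArrayWrapper:
    def __init__(self, data, feature_names=None):
        self.data = np.asarray(data)
        self.feature_names = feature_names

    def __array__(self, dtype=None):
        # np.asarray(wrapper) recovers the plain ndarray
        return self.data if dtype is None else self.data.astype(dtype)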

Member

Maybe augments the data array X with additional meta-data.

Member Author

For now I intentionally left the implementation details out. The idea is that the solution should satisfy the requirements of this proposal. I can put the details in, but that'll be a lot of details.

Member

Ok then mention that maybe? And make the requirements clear on what numpy compatibility means, or at least list the relevant questions?

Comment on lines +44 to +45
Operations
**********
Member

How do transformers behave w.r.t. the new input format? We want them to return a DataArray if they are passed a DataArray, I suppose.

However, ``xarray`` has a hard dependency on ``pandas``, and uses
``pandas.Index`` to handle row labels and aligns rows when an operation
between two ``xarray.DataArray`` is done, which can be time consuming, and is
not the semantic expected in ``scikit-learn``; we only expect the number of


Expanding on this "row labels" point: I don't think it should be mentioned at all since it should be a non-issue. No reasonable estimator should be using row labels in its implementation, since the row labels in the data passed to .fit can't be expected to match the row labels passed in .transform, .predict, .score, etc.

If I had to guess where auto-alignment could be an issue, it would be when the features passed to, say, .score differ from the inputs to .fit.

from pandas import DataFrame

class MinScaler:
    def fit(self, X: DataFrame):
        self.minima_ = X.min()  # a Series; its index holds the original feature names
        return self
    def transform(self, X: DataFrame):
        return X - self.minima_

When used incorrectly, you could get something like

In [4]: X = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [5]: minima = X.min()  # simulate fit

In [6]: minima
Out[6]:
A    1
B    4
dtype: int64

In [7]: X_new = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

In [8]: X_new - minima  # simulate transform. This should raise!
Out[8]:
     A    B   C
0  0.0  0.0 NaN
1  1.0  1.0 NaN
2  2.0  2.0 NaN

But I expect that input validation (that the feature names passed to score / transform match the feature names passed to fit) would be handled earlier.

Member

I agree that indices shouldn't be an issue as we could always just reindex in the beginning. I think the main reason for not using xarray was to keep 100% numpy-compatibility. If we don't require that, I think xarray is a viable alternative. I was convinced at some point that xarray is not a good idea but it's been so long that I can't reconstruct the reasoning. I think basically I didn't want a hard, incompatible switch.

Member Author

IIRC:

  • certain operations have different semantics in numpy vs. xarray.
  • xarray has a hard pandas dependency
  • row alignment does take time, and we have discussed that it's something we definitely don't want. @GaelVaroquaux may have a clearer argument on this one

As far as I remember, otherwise it was a good alternative.

@amueller
Member

amueller commented Dec 10, 2019

Scope 1) I would say we leave the route open to include feature props in the future.

DataArray is the name of the xarray class, so I'd say -1 on that.
FeatureArray seems better, but if we at some point in the future want to include sample weights or other sample props, it's going to be wrong.

We could do something non-informative like SklArray. InputArray? InfoArray? ugh...

Scope 2) I agree with @jnothman, you should give examples that we might care about in the future.

Implementation details: Yes, it should include implementation details and alternatives.

@amueller left a comment

I think this should be reorganized to start with a motivation. It should also discuss why this solution is better than what was proposed in the old slep.

Also it should actually describe the behavior we want. Where will it be accepted as input parameter? Probably everywhere? When will it be produced as output parameter? Always? When the input was a NamedArray? When the input was a NamedArray or a DataFrame?




==========
InputArray
==========
Member

I think you should start with the problem we are trying to solve, and then a motivating example of how things will look once we've solved it.
So start with motivation, I'd say.




All usual operations (including slicing through ``__getitem__``) return an
``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o
any modifications. This prevents any unwanted computational overhead as a
Member

I don't follow this argument.

any modifications. This prevents any unwanted computational overhead as a
result of migrating to this data structure.

The ``select()`` method will act like a ``__getitem__``, except that it
Member

Have we discussed this before? So this is similar to .loc in pandas?
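If the .loc analogy holds, the contrast would be roughly this (illustration only; the SLEP doesn't spell out select()'s exact semantics):

import pandas as pd

df = pd.DataFrame({"age": [1, 2], "height": [3, 4]})
df.to_numpy()[:, [1]]   # like the proposed __getitem__: bare ndarray, names gone
df.loc[:, ["height"]]   # like the proposed select(): labels preserved on the result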


@amueller
Member

amueller commented Dec 11, 2019

I want to record something that we discussed in person yesterday, which is enabling simple conversion to pandas dataframes. Ideally users should be able to do pd.DataFrame(our_thing) to get a dataframe (using something like the __array__ protocol for numpy). For example we'd like seaborn to work natively on our new data structure.
It's not entirely clear whether either of these goals is achievable in a non-hacky way right now, but I think they are important considerations.

A less elegant solution would be a toframe() method, but that probably wouldn't allow us to integrate with existing tools like seaborn.
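To make the asymmetry concrete (Wrapped is a made-up stand-in for our structure; behavior as of today's numpy/pandas):

import numpy as np
import pandas as pd

class Wrapped:
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None):
        # numpy's protocol: np.asarray(w) works out of the box
        return self._data if dtype is None else self._data.astype(dtype)

w = Wrapped([[1, 2], [3, 4]])
np.asarray(w)  # fine: numpy consults __array__
try:
    pd.DataFrame(w)  # pandas has no equivalent protocol today...
except ValueError:
    pd.DataFrame(np.asarray(w))  # ...so users must convert explicitly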

@amueller
Member

@jorisvandenbossche says nothing like the __array__ protocol exists for pandas.

@jorisvandenbossche
Member

For example we'd like seaborn to work natively on our new data structure.

Then seaborn would still need to convert inputs to DataFrames, and not assume the input already is one (I didn't check seaborn's internals). So even if there were such a pandas protocol, all downstream libraries that accept dataframes might need to be adapted before natively working on such a new data structure (not impossible of course, and one needs to start somewhere, but certainly an extra hurdle).

@amueller
Member

Indeed, the downstream libraries would need work. But this would be an enabling feature that allows downstream libraries to implement this conversion without knowing anything about the input format. Seaborn has some isinstance(data, pd.DataFrame) checks which would require adjustment. There are also some hasattr(x, 'shape') checks.

@alexgarel

Hi, I found this very interesting. I'm not sure my comment will be relevant, but what about simple integration of numpy structured arrays?

Also an integration with Pipelines would be interesting, like FeatureUnion (retaining feature names).
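For context, structured arrays do carry field names natively, though columns come back one field at a time and the memory layout is records rather than a 2-d matrix:

import numpy as np

X = np.zeros(3, dtype=[("age", "f8"), ("height", "f8")])
X.dtype.names  # ('age', 'height'): the feature names come for free
X["age"]       # 1-d array for that single field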

@jnothman
Member

jnothman commented Dec 17, 2019 via email

@lorentzenchr
Member

Dear core devs
I admire your hard work to get feature names in scikit-learn done. This is a huge step forward for usability. As this isn't the first proposal, you have already put a lot of thought into it. I don't know where to put these, but here are a few thoughts from a user:

1. Yet another data structure

There are already several data structures in Python, to name a few: numpy.ndarray, pandas.DataFrame, xarray.DataArray, pyarrow.Table, datatable.Frame (h2oai), ...
I worry about fragmentation and interoperability issues between data structures, as they form the very foundation of any process (ETL, ML, visualization, ...) built on top of them.

2. User interface and underlying data structure

Algorithms in scikit-learn are implemented on numpy.ndarray (homogeneous n-dimensional data). But usually, my data comes as heterogeneous tabular data (strings, categorical, floats). How deep should the integration between those two go? ColumnTransformer is already a great help.

3. Compare to/Learn from R

R settled on its data.frame structure (or data.table or tibble, which are interoperable). This enables data.frame-in and data.frame-out processes on which many libraries and functions rely, especially in the tidyverse. R formulae are another means to work directly on dataframes instead of matrices/arrays.

4. Standard use case

At least in my work, I usually start with a tabular data structure, something like this.

import pandas as pd
df = pd.read_parquet("myfile.parquet")

After some feature preprocessing/engineering, I need to pass them to ColumnTransformer, before I can use an estimator:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.X import XRegressor  # placeholder for some estimator
y = df['target']
X = df[['feature_a', 'feature_b', ...]]
col_trans = ColumnTransformer(
    [('feature_a_ohe', OneHotEncoder(), ['feature_a']), ...])
model = Pipeline(steps=[('col_trans', col_trans), ('regressor', XRegressor())])
model.fit(X, y)

After that, I'd like to inspect my model with respect to the original input features ('feature_a', 'feature_b', ...; the ones before the ColumnTransformer).

@amueller
Member

Thanks for your input @lorentzenchr.
These concerns are definitely on our mind. Let me briefly reply with my thoughts

  1. Yes this is a big concern, but we have not found a better solution.

  2. That's a good question. I think @jorisvandenbossche suggested working on pandas dataframes where possible, though this is a tiny fraction of sklearn (some of the preprocessing module, maybe feature selection?). I'd be in favor of this but the reach will necessarily be quite limited.

  3. R does a memory copy for every input and output as far as I know. We opted not to do that; it was one of the strongest arguments against using pandas. Maybe memory copies are worth it if the usability improvements justify them? Though if pandas implemented pandas-dev/pandas#30218 ("Feature request: Protocol for converting something to a pandas DataFrame"), we could integrate with other libraries easily without having to force a memory copy on our end.

  4. yes that's exactly what I want.

@amueller
Member

Do you want another review or do you want to address the comments first? I can also send a PR.

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 27, 2019 via email

@adrinjalali
Member Author

@amueller I tried to address the comments. I'm happy to take more comments, or for you to either change this PR directly or open a PR against my fork if you think it needs discussion and things are not clear.

I also updated the NamedArray PR with the latest prototype I had for sparse arrays (scikit-learn/scikit-learn#14315).

@TomAugspurger

TomAugspurger commented Dec 27, 2019

Can someone expand on the row alignment issue? I don’t see any case where row labels would be used by an estimator.

@TomAugspurger

Adrin attempted an explanation to my "row-label alignment" confusion. The
concern may not be internal to scikit-learn, which I think will disregard row
labels. Rather, it was for users who might receive this named array from a
.transform. If they're unfamiliar with pandas / xarray, they'll be surprised
by the alignment. In code,

>>> trn = StandardScaler()
>>> a = xr.DataArray(np.random.randn(3, 2), coords={"dim_0": ['a', 'b', 'c']})
>>> b = np.random.randn(3, 2)
>>> trn.fit(a)
>>> trn.transform(a) + trn.transform(b)  # ? + ?, does this align?
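For reference, between two labeled arrays xarray's arithmetic does align on the coordinates, silently dropping non-matching rows:

import numpy as np
import xarray as xr

a = xr.DataArray(np.ones((3, 2)), dims=("dim_0", "dim_1"),
                 coords={"dim_0": ["a", "b", "c"]})
b = xr.DataArray(np.ones((3, 2)), dims=("dim_0", "dim_1"),
                 coords={"dim_0": ["b", "c", "d"]})
(a + b).shape  # (2, 2): only the overlapping labels 'b' and 'c' survive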

This comes down to a fundamental question: what is the output type from
.transform? I worry that we're conflating two things: Some data structure for
scikit-learn to use internally for propagating feature names, and some data
structure for users to provide feature names to scikit-learn, and use after
getting a result back from scikit-learn (via transform).

My initial preferences are for

  1. Scikit-Learn to continue only using NumPy ndarrays / scipy sparse matrices internally. Arguments are converted to ndarrays / matrices explicitly.
  2. Estimator.transform to return an object with the same type as the input
    a. Feature names (.columns for a DataFrame) match estimator.output_features_names_ (may have the attribute name wrong)
    b. (possibly) Row labels, if any, match the input row labels

Applying those principles gives the following table:

| Fit Input | Transform Input | Transform Output |
|-----------|-----------------|------------------|
| ndarray   | ndarray         | ndarray          |
| ndarray   | DataArray       | DataArray        |
| ndarray   | DataFrame       | DataFrame        |
| DataArray | ndarray         | ndarray          |
| DataArray | DataArray       | DataArray        |
| DataArray | DataFrame       | DataFrame        |
| DataFrame | ndarray         | ndarray          |
| DataFrame | DataArray       | DataArray        |
| DataFrame | DataFrame       | DataFrame        |

Some objections to this approach are that

  1. Feature names are only available to people using xarray or pandas. IMO that's acceptable.
    Adding a new data container to the ecosystem should have a high bar. Scikit-Learn could
    define an interface for other containers to meet.
  2. A potential pandas refactor might make pd.DataFrame(np.asarray(df)) incur two memory
    copies. But that refactor is uncertain, and is some years off at a minimum.

@amueller
Member

amueller commented Jan 7, 2020

@TomAugspurger what does "internally" mean here? An estimator doesn't know whether it's in a pipeline or not, and so if you'd pass a dataframe into a pipeline, you'd convert between dataframe and numpy array (which is used within the estimator) between every step. Previously I think the agreement was not to pay this cost. I'm not 100% sure any more, though. That would only be a cost if there was this refactor, though.

@amueller
Member

amueller commented Jan 7, 2020

There's one more concern with your proposal: it's not backward compatible.
That could be solved by a global flag sklearn.set_config(preserve_frames=True) or something like that.

@TomAugspurger

what does "internally" mean here?

Essentially, whatever check_array(X) returns. The array / matrix scikit-learn actually works with inside .fit and .transform.

That may or may not be a copy for a homogeneous DataFrame today.

it's not backward compatible.

Right. If you're making a new InputArray you wouldn't have the legacy behavior concerns.

@amueller
Member

amueller commented Jan 7, 2020

@TomAugspurger ok, I think so far anything we discussed used ndarray/scipy sparse internally, and we were only discussing input/output format.

@amueller
Member

amueller commented Jan 7, 2020

Should we do a big list of the pros and cons of the three solutions we considered?
One is pandas-in, pandas-out, one is DataArray, and one is passing meta-data through the pipeline and meta-estimators. This discussion has really been pretty gnarly. This could go in the alternatives section of this SLEP or in a separate document.
We don't really have a framework for making a decision between three options (other than discussion).

@amueller
Member

amueller commented Jan 7, 2020

@adrinjalali @GaelVaroquaux can you maybe recap the arguments against pandas-in pandas-out? I used to be convinced by the memory copy argument, but honestly I think the usability advantage outweighs it, given that it's a potential future concern and that we make so many copies already (and that users can easily avoid them by converting to numpy if needed).

@adrinjalali
Member Author

there were/are some arguments against pandas/xarray, I'll try to recap:

  • pandas dependency (pandas, xarray)
  • row alignment (pandas, xarray)
  • different semantics for some operations between numpy and pandas/xarray. I don't remember which operations those were, but certain operations have the same name yet different semantics when you compare numpy and pandas/xarray
  • pandas-in, pandas-out would kinda imply supporting pandas, which in turn could mean understanding multi-indices (among other pandas features), which we probably don't wanna get into. We could be explicit and accept only a subset of pandas features, but that's also not easy to define.
  • API: both xarray and pandas have a very different API than what people are used to with numpy. My implementation using xarray had a bunch of utility functions to get around the obscure xarray API. This is not a big issue though, since people can still convert between the three libraries more or less.

On the plus side, I did at one point have a working implementation of the feature names with xarray. When it came down to it, it was the operation semantics, the pandas dependency, and row alignment that resulted in us working on a different data structure (IIRC).

@amueller
Member

Can you elaborate on row alignment and what the issue is?

And why are different semantics a problem? It means an incompatible change but when users provide pandas dataframes as inputs, I would suppose they are familiar with pandas semantics. One could argue that transformers currently changing the semantics from pandas to numpy is an issue ;) - certainly having a breaking change with a config flag is some annoyance, though.

pandas in pandas out would kinda imply supporting pandas which in turn could mean understanding multi-indices (among other pandas features).

I'm not sure what you mean. We are currently supporting pandas as input types quite explicitly.

The API point seems to be a repeat of the different semantics point, right?

The reason I didn't like xarray was the axis semantics which is not really natural for us.

I guess writing this down well requires a different list of pro/cons for xarray and pandas each.

The point that @TomAugspurger made was more about preserving data types than using one or the other. From a user perspective that makes sense, since we'd want users to work with the API that they are familiar with. Having a pandas user work with xarray might not be super familiar (though xarray users are probably familiar with pandas to some degree?).

If we would want to preserve both types, that would probably require some custom code, and more code for each additional type we'd want to support (if any). As @TomAugspurger alluded to, we could try to define an abstraction for that, but that would be pretty far future.

I'd be happy to just preserve type for pandas for now.

Meaning there's actually 5 solutions:
a) preserve type (from a white-list of types which could start with just pd.DataFrame)
b) always use pd.DataFrame
c) always use xarray.DataArray
d) use DataArray
e) pass information through in meta-estimators.

The semantic differences would be an argument against b) and c) but not against a) imho.

@TomAugspurger

Apologies for muddling the DataArray discussion with the "pandas (or xarray) in / pandas (or xarray) out" discussion. But I think they're related, since the need for a custom DataArray is lessened if pandas in / pandas out becomes the default behavior, and if feature names are limited to arrays that already have names.

row alignment (pandas, xarray)
which in turn could mean understanding multi-indices (among other pandas features),

Again, I really think that row-labels are a non-issue for scikit-learn internally :) And as for users, if they're passing in a DataFrame then they probably are familiar with pandas' alignment. The extent of scikit-learn's interaction with row labels would be something like

def transform(self, X, y=None):
    X, row_labels, input_type = check_array(X)
    # rest of transform, operating on an ndarray
    result = ...
    # some hypothetical function that recreates a DataFrame / DataArray,
    # preserving row labels, attaching new feature names.
    result = construct_result(result, row_labels, feature_names, input_type)
    return result

If the issue with multi-indices is for the columns, then I'd say scikit-learn shouldn't try to support those. I think / hope that you have a requirement that feature_names_in_ and feature_name_out_ be a sequence of strings.

@adrinjalali
Member Author

If both sample_weight and X are indexed, should sklearn try to align them in operations?
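Concretely, if we simply let pandas align them, a shuffled index would silently reorder the weights rather than raise:

import pandas as pd

X = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
sample_weight = pd.Series([0.1, 0.2, 0.3], index=[2, 1, 0])

X["a"] * sample_weight  # aligned on the index: weight 0.3 now pairs with row 0, not row 2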

@amueller
Member

@adrinjalali that's a good question. Right now, if X, y and sample_weight are indexed, we just drop the index. Given that we have explicit support for pandas in ColumnTransformer that could be considered a bug.

For columns we are enforcing that they must be exactly the same between fit and transform. I could see us taking a similar stance here, i.e. asserting that the index is the same for X, y, and sample_weight, and raising an error if not.

Given that we decided against aligning columns, I think it makes sense to also not align rows. This issue is already present in the code base right now, though.
I don't really follow why you think it relates to changing the output type.

``X`` being an ``InputArray``::

>>> np.array(X)
>>> X.todataframe()


just my two cents: Spark already has toPandas(), and given that eventually there might be different "DataFrames" (e.g. Modin, Dask), wouldn't it make sense to have x.topandas() rather than a generic x.todataframe()?

Estimators understand the ``InputArray`` and extract the feature names from the
given data before applying the operations and transformations on the data.

All transformers return an ``InputArray`` with feature names attached to it.
Member

Then how do we ensure the backward compatibility criteria from above, namely:

code which expects a numpy.ndarray as the output of a transformer, would not break

Member Author

Since any operation on the InputArray would result in a pure numpy array, the existing code won't break.
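One way this could work technically (my sketch, not a design the SLEP commits to) is an ndarray subclass that demotes every result to a plain ndarray:

import numpy as np

class InputArray(np.ndarray):
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        return obj

    def __array_wrap__(self, out_arr, context=None, return_scalar=False):
        # every ufunc result comes back as a plain ndarray, so existing
        # code that expects np.ndarray keeps working unchanged
        return np.asarray(out_arr)

X = InputArray(np.ones((3, 2)), feature_names=["a", "b"])
type(X + 1)  # numpy.ndarray, not InputArray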

Member

But how is this enabled, technically?


Member

what i'm trying to say is, this should be explained in the SLEP.

Member Author

I thought we agreed on not including those details here :P (#25 (comment))

Member

Right, OK

As Andy noted it should be mentioned in the SLEP that the details are intentionally left out.

Though IMHO this is a slippery slope: leaving out details may mean ending up with a vacuous SLEP with no actual proposal

Member Author

I don't think so. There are different ways of implementing this feature, and I don't think it's within the scope of the SLEP to define them. The SLEP is about the API, not how that API is achieved. We may move from one solution to another, w/o having to write a SLEP for it, as long as it doesn't substantially change the API.

@adrinjalali
Member Author

I kinda get the feeling that if:

  • we limit the index types on cols and ignore row indices of the input
  • only support feature names if the user has xarray and pandas installed

we can go back to using xarray as a return type for transformers if the user provides feature names through either an xarray or a pandas data structure?

We could have a global conf to enable/disable the behavior, and have it disabled by default for a few releases.

@amueller
Member

I agree with @adrinjalali, only I would actually use pandas, not xarray. I think having fewer dependencies and types is better. The main reason I had for xarray vs pandas was the zero-copy issue, and I've come to give that less weight (given that the future of pandas is unclear and that we copy a lot anyway).

Sorry for going back and forth on this. Main question: should we have a separate SLEP for using an existing "array" type (pandas is not really an array, I know)?
That will make voting and discussion harder and harder. I wonder if maybe discussing this on a call would be easier so we can sync up?
I would love to hear @GaelVaroquaux's, @jnothman's and @jorisvandenbossche's take.

@adrinjalali
Member Author

my gut feeling is that if we implement the machinery to have this optional, we could have it in the global config to be set as either xarray or pandas, and it shouldn't be too much work.

@adrinjalali
Member Author

Oh, and on the SLEP issue, I'd say we should work on a slep with an existing data type, and if we get the feeling that there's some consensus there, I'd withdraw this slep (I wish it was merged :D )

@amueller
Member

I'm happy to merge it as it is.
And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.

@amueller left a comment

I think it would be good to merge and then keep discussing

@adrinjalali
Member Author

And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.

I'd be happy to have a prototype implementation once the n_features_in_ is in. I'm almost tempted to base my PR on @NicolasHug 's implementation and start before it's merged.

@amueller
Member

maybe coordinate with @thomasjpfan who I think is working on a new SLEP.

@adrinjalali
Member Author

@amueller would you like to merge? :D

@amueller
Member

Sure. I think the proposal is reasonably complete, even though I don't expect a vote on it in the current form very soon.

@amueller amueller merged commit 37ab0c1 into scikit-learn:master Feb 18, 2020
@adrinjalali adrinjalali deleted the slep012/DataArray branch May 14, 2020 11:21
@amueller
Member

Can someone remind me where we had the discussion that concluded with making the output type globally constant instead of it depending on the input type?

@adrinjalali
Member Author

I don't remember where we had the discussion, but I thought we agreed that not knowing what the output is by looking at a piece of code is bad practice?

Also, that policy means the same code which before would take a pd.DataFrame and return an ndarray now would return a pd.DataFrame, which is confusing and not backward compatible (which we could argue is not essential for v1.0 ;) )

So I think I'd really prefer not to make the output type depend on the input type.
