
SLEP012 - DataArray #25


Merged: 5 commits into scikit-learn:master from adrinjalali:slep012/DataArray on Feb 18, 2020

Conversation

adrinjalali
Member

This SLEP discusses a new data structure which would carry some meta-data alongside the data itself. The feature names are the first set of meta-data we would implement, but the set could grow from there.

Things we need to decide to continue the write up of this SLEP:

  • scope 1: do we want to have other meta-data here in the future? A NamedArray with only feature names may be another SLEP?
  • scope 2: should this SLEP stick to feature names, or should it discuss other meta-data?
  • implementation details: do we want to include implementation details here?
  • the name: It started as NamedArray, but we seem to lean towards including other meta-data than feature names. Therefore we probably should use another name. DataArray is a placeholder. Happy to change.

- ``XArray``: we could accept a ``pandas.DataFrame``, and use
``xarray.DataArray`` as the output of transformers, including feature names.
However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to
handle row labels and aligns rows when an operation between two
``xarray.DataArray`` is done.


Uses a pandas Index for labels, not Series?

Can you expand on why alignment is an issue here? I wouldn’t expect row labels to be used internally in an estimator, but I would expect them to be preserved through a transformation.

Member Author

I added some clarification, and this is what I remember from my discussions with @GaelVaroquaux

Member Author

Also, more than happy to have you comment on any part of this proposal \o/

@jnothman
Member

jnothman commented Dec 8, 2019

should this slep stick to feature names, or should it discuss other meta-data

I think it should give examples of other potential things we might include.

@jnothman
Member

jnothman commented Dec 8, 2019

FeatureArray?

InputArray
==========

This proposal suggests adding a new data structure, called ``InputArray``,
Member

But it's used for output from scikit-learn models, isn't it?

doesn't interfere with existing user code.

A main constraint of this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
transformer, would not break.
Member

This "backwards compatible" is a pretty loose statement that deserves some explication. Will the object be an instance of ndarray? Will operations over DataArray produce raw ndarrays in all cases, or the subtype in some?

How do we handle sparse arrays?

Member

Do we need this backwards compatibility, or do we target a major release and require users to unwrap the data structure for some operations?

***************

There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or
Member

What about conversion in the other direction?

Feature Names
*************

Feature names are an array of strings aligned with the columns. They can be
Member

An array of strings? I think you mean either an object array or a list (the latter being what get_feature_names currently returns)
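For reference, the distinction in code (numpy's default string arrays are fixed-width, which is easy to trip over with feature names):

import numpy as np

names = ["age", "height_cm"]
np.array(names)                # dtype='<U9': fixed-width unicode; longer assignments get truncated
np.array(names, dtype=object)  # object array holding Python strings
names                          # plain list, what get_feature_names returns today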

@NicolasHug left a comment

Thanks for opening the SLEP

All of the above applies to sparse arrays.

Factory Methods
***************
Member

Can you give some examples? I suppose these would look like e.g.

DataArray.from_pandas(...)
DataArray.from_xarray(...)
DataArray(numpy_array, feature_names)
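A minimal sketch of what such constructors could look like (everything here is a hypothetical guess, just fleshing out the calls above):

import numpy as np

class DataArray:
    def __init__(self, data, feature_names=None):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names) if feature_names is not None else None

    @classmethod
    def from_pandas(cls, df):
        # column labels become the feature names
        return cls(df.to_numpy(), feature_names=[str(c) for c in df.columns])

    @classmethod
    def from_xarray(cls, xda, feature_dim="feature"):
        # hypothetical: read names off the coordinate of the column dimension
        return cls(xda.values, feature_names=[str(v) for v in xda.coords[feature_dim].values])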

.. _slep_012:

==========
InputArray
Member

InputArray or DataArray? The PR intro says the latter

Member

Also just came up with InputArray. It's better than DataArray because it's not taken in the python ecosystem yet afaik (OpenCV has it in C++?) but I agree with @jnothman below in that we also use it for output - and actually the primary purpose of it is using it for output :-/

Member Author

MetaArray?

Member

Wouldn't that be an array of arrays? Or an array of metadata?

Comment on lines 10 to 11
feature names to be attached to the data given to an estimator, there are a few
approaches we can take:
Member

I think alternative solutions should go at the end in their own section

Member

agreed.

==========

This proposal suggests adding a new data structure, called ``InputArray``,
which wraps a data matrix with some added information about the data. This was
Member

Wraps how?

Maybe give an example of what the DataArray class would look like? It's not clear whether the plan is to go for inheritance or composition
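To make the two options concrete, rough sketches (the names and details here are illustrative guesses, not anything the SLEP specifies):

import numpy as np

# Option 1: inheritance -- an ndarray subclass carrying feature names
class DataArray(np.ndarray):
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        return obj

    def __array_finalize__(self, obj):
        # called on views/copies numpy creates; propagate the metadata
        self.feature_names = getattr(obj, "feature_names", None)

# Option 2: composition -- a thin wrapper around the raw array
class DataArrayWrapper:
    def __init__(self, data, feature_names=None):
        self.data = np.asarray(data)
        self.feature_names = feature_names

    def __array__(self, dtype=None):
        # np.asarray(wrapper) recovers the plain ndarray
        return self.data if dtype is None else self.data.astype(dtype)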

Member

Maybe augments the data array X with additional meta-data.

Member Author

For now I intentionally left the implementation details out. The idea is that the solution should satisfy the requirements of this proposal. I can put the details in, but that'll be a lot of details.

Member

Ok then mention that maybe? And make the requirements clear on what numpy compatibility means, or at least list the relevant questions?

Comment on lines +44 to +45
Operations
**********
Member

How do transformers behave w.r.t. the new input format? We want them to return a DataArray if they are passed a DataArray, I suppose.

However, ``xarray`` has a hard dependency on ``pandas``, and uses
``pandas.Index`` to handle row labels and aligns rows when an operation
between two ``xarray.DataArray`` is done, which can be time consuming, and is
not the semantic expected in ``scikit-learn``; we only expect the number of


Expanding on this "row labels" point: I don't think it should be mentioned at all since it should be a non-issue. No reasonable estimator should be using row labels in its implementation, since the row labels in the data passed to .fit can't be expected to match the row labels passed in .transform, .predict, .score, etc.

If I had to guess where auto-alignment could be an issue, it would be when the features passed to, say, .score differ from the inputs to .fit.

from pandas import DataFrame

class MinScaler:
    def fit(self, X: DataFrame):
        self.minima_ = X.min()  # a Series; its index holds the original feature names
        return self
    def transform(self, X: DataFrame):
        return X - self.minima_

When used incorrectly, you could get something like

In [4]: X = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [5]: minima = X.min()  # simulate fit

In [6]: minima
Out[6]:
A    1
B    4
dtype: int64

In [7]: X_new = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

In [8]: X_new - minima  # simulate transform. This should raise!
Out[8]:
     A    B   C
0  0.0  0.0 NaN
1  1.0  1.0 NaN
2  2.0  2.0 NaN

But I expect that input validation (that the feature names passed to score / transform match the feature names passed to fit) would be handled earlier.

Member

I agree that indices shouldn't be an issue as we could always just reindex in the beginning. I think the main reason for not using xarray was to keep 100% numpy-compatibility. If we don't require that, I think xarray is a viable alternative. I was convinced at some point that xarray is not a good idea but it's been so long that I can't reconstruct the reasoning. I think basically I didn't want a hard, incompatible switch.

Member Author

IIRC:

  • certain operations have different semantics in numpy vs. xarray.
  • xarray has a hard pandas dependency
  • row alignment does take time, and we have discussed that it's something we definitely don't want. @GaelVaroquaux may have a clearer argument on this one

As far as I remember, otherwise it was a good alternative.

@amueller
Member

amueller commented Dec 10, 2019

Scope 1) I would say we leave the route open to include feature props in the future.

DataArray is the name of the xarray class, so I'd say -1 on that.
FeatureArray seems better, but if we at some point in the future want to include sample weights or other sample props, it's going to be wrong.

We could do something non-informative like SklArray. InputArray? InfoArray? ugh...

Scope 2) I agree with @jnothman, you should give examples that we might care about in the future.

Implementation details: Yes, it should include implementation details and alternatives.

@amueller left a comment

I think this should be reorganized to start with a motivation. It should also discuss why this solution is better than what was proposed in the old slep.

Also it should actually describe the behavior we want. Where will it be accepted as input parameter? Probably everywhere? When will it be produced as output parameter? Always? When the input was a NamedArray? When the input was a NamedArray or a DataFrame?




==========
InputArray
==========
Member

I think you should start with the problem we are trying to solve, and then a motivating example of how things will look once we've solved it.
So start with motivation, I'd say.




All usual operations (including slicing through ``__getitem__``) return an
``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o
any modifications. This prevents any unwanted computational overhead as a
Member

I don't follow this argument.

any modifications. This prevents any unwanted computational overhead as a
result of migrating to this data structure.

The ``select()`` method will act like a ``__getitem__``, except that it
Member

Have we discussed this before? So this is similar to .loc in pandas?
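If the .loc analogy holds, the contrast would be roughly this (illustration only; the SLEP doesn't spell out select()'s exact semantics):

import pandas as pd

df = pd.DataFrame({"age": [1, 2], "height": [3, 4]})
df.to_numpy()[:, [1]]   # like the proposed __getitem__: bare ndarray, names gone
df.loc[:, ["height"]]   # like the proposed select(): labels preserved on the result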


@amueller
Member

amueller commented Dec 11, 2019

I want to record something that we discussed in person yesterday, which is enabling simple conversion to pandas dataframes. Ideally users should be able to do pd.DataFrame(our_thing) to get a dataframe (using something like the __array__ protocol for numpy). For example we'd like seaborn to work natively on our new data structure.
It's not entirely clear whether either of these goals is achievable in a non-hacky way right now, but I think they are important considerations.

A less elegant solution would be a toframe() method, but that probably wouldn't allow us to integrate with existing tools like seaborn.
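To make the asymmetry concrete (Wrapped is a made-up stand-in for our structure; behavior as of today's numpy/pandas):

import numpy as np
import pandas as pd

class Wrapped:
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None):
        # numpy's protocol: np.asarray(w) works out of the box
        return self._data if dtype is None else self._data.astype(dtype)

w = Wrapped([[1, 2], [3, 4]])
np.asarray(w)  # fine: numpy consults __array__
try:
    pd.DataFrame(w)  # pandas has no equivalent protocol today...
except ValueError:
    pd.DataFrame(np.asarray(w))  # ...so users must convert explicitly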

@amueller
Member

@jorisvandenbossche says nothing like the __array__ protocol exists for pandas.

@jorisvandenbossche
Member

For example we'd like seaborn to work natively on our new data structure.

Then seaborn would still need to convert inputs to DataFrames, and not assume the input already is one (I didn't check seaborn's internals). So even if there were such a pandas protocol, all downstream libraries that accept dataframes might need to be adapted before natively working on such a new data structure (not impossible of course, and one needs to start somewhere, but certainly an extra hurdle).

@amueller
Member

Indeed, the downstream libraries would need work. But this would be an enabling feature that allows downstream libraries to implement this conversion without knowing anything about the input format. Seaborn has some isinstance(data, pd.DataFrame) checks which would require adjustment. There are also some hasattr(x, 'shape') checks.

@alexgarel

Hi, I found this very interesting. I'm not sure my comment will be relevant, but what about simple integration of numpy structured arrays?

Also an integration with Pipelines would be interesting, like FeatureUnion (retaining feature names).
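For context, structured arrays do carry field names natively, though columns come back one field at a time and the memory layout is records rather than a 2-d matrix:

import numpy as np

X = np.zeros(3, dtype=[("age", "f8"), ("height", "f8")])
X.dtype.names  # ('age', 'height'): the feature names come for free
X["age"]       # 1-d array for that single field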

@jnothman
Member

jnothman commented Dec 17, 2019 via email

@lorentzenchr
Member

Dear core devs
I admire your hard work to get feature names in scikit-learn done. This is a huge step forward for usability. As this isn't the first proposal, you have already put a lot of thought into it. I don't know where to put these, but here are a few thoughts from a user:

1. Yet another data structure

There are already several data structures in Python, to name a few: numpy.ndarray, pandas.DataFrame, xarray.DataArray, pyarrow.Table, datatable.Frame (h2oai), ...
I worry about fragmentation and interoperability issues between data structures, as they form the very foundation of any process (ETL, ML, visualization, ...) built on top of them.

2. User interface and underlying data structure

Algorithms in scikit-learn are implemented on numpy.ndarray (homogeneous n-dimensional data). But usually, my data comes as heterogeneous tabular data (strings, categorical, floats). How deep should the integration between those two go? ColumnTransformer is already a great help.

3. Compare to/Learn from R

R settled on its data.frame structure (or data.table or tibble, which are interoperable). This enables data.frame-in and data.frame-out processes on which many libraries and functions rely, especially in the tidyverse. R formulae are another means to work directly on dataframes instead of matrices/arrays.

4. Standard use case

At least in my work, I usually start with a tabular data structure, something like this.

import pandas as pd
df = pd.read_parquet("myfile.parquet")

After some feature preprocessing/engineering, I need to pass them to ColumnTransformer, before I can use an estimator:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.X import XRegressor  # placeholder for some estimator
y = df['target']
X = df[['feature_a', 'feature_b', ...]]
col_trans = ColumnTransformer(
    [('feature_a_ohe', OneHotEncoder(), ['feature_a']), ...])
model = Pipeline(steps=[('col_trans', col_trans), ('regressor', XRegressor())])
model.fit(X, y)

After that, I'd like to inspect my model with respect to the original input features ('feature_a', 'feature_b', ...; the ones before the ColumnTransformer).

@amueller
Member

Thanks for your input @lorentzenchr.
These concerns are definitely on our mind. Let me briefly reply with my thoughts

  1. Yes this is a big concern, but we have not found a better solution.

  2. That's a good question. I think @jorisvandenbossche suggested working on pandas dataframes where possible, though this is a tiny fraction of sklearn (some of the preprocessing module, maybe feature selection?). I'd be in favor of this but the reach will necessarily be quite limited.

  3. R does a memory copy for every input and output as far as I know. We opted not to do that; it was one of the strongest arguments against using pandas. Maybe memory copies are worth it if the usability improvements justify them? Though if pandas implemented pandas-dev/pandas#30218 ("Feature request: Protocol for converting something to a pandas DataFrame"), we could integrate with other libraries easily without having to force a memory copy on our end.

  4. yes that's exactly what I want.

@amueller
Member

Do you want another review or do you want to address the comments first? I can also send a PR.

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 27, 2019 via email

@adrinjalali
Member Author

@amueller I tried to address the comments. I'm happy to take more comments, or for you to either change this PR directly or open a PR against my fork if you think it needs discussion and things are not clear.

I also updated the NamedArray PR with the latest prototype I had for sparse arrays (scikit-learn/scikit-learn#14315).

@TomAugspurger

TomAugspurger commented Dec 27, 2019

Can someone expand on the row alignment issue? I don’t see any case where row labels would be used by an estimator.

@TomAugspurger

Adrin attempted an explanation to my "row-label alignment" confusion. The
concern may not be internal to scikit-learn, which I think will disregard row
labels. Rather, it was for users who might receive this named array from a
.transform. If they're unfamiliar with pandas / xarray, they'll be surprised
by the alignment. In code,

>>> trn = StandardScaler()
>>> a = xr.DataArray(np.random.randn(3, 2), coords={"dim_0": ['a', 'b', 'c']})
>>> b = np.random.randn(3, 2)
>>> trn.fit(a)
>>> trn.transform(a) + trn.transform(b)  # ? + ?, does this align?
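For reference, between two labeled arrays xarray's arithmetic does align on the coordinates, silently dropping non-matching rows:

import numpy as np
import xarray as xr

a = xr.DataArray(np.ones((3, 2)), dims=("dim_0", "dim_1"),
                 coords={"dim_0": ["a", "b", "c"]})
b = xr.DataArray(np.ones((3, 2)), dims=("dim_0", "dim_1"),
                 coords={"dim_0": ["b", "c", "d"]})
(a + b).shape  # (2, 2): only the overlapping labels 'b' and 'c' survive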

This comes down to a fundamental question: what is the output type from
.transform? I worry that we're conflating two things: Some data structure for
scikit-learn to use internally for propagating feature names, and some data
structure for users to provide feature names to scikit-learn, and use after
getting a result back from scikit-learn (via transform).

My initial preferences are for

  1. Scikit-Learn to continue only using NumPy ndarrays / scipy sparse matrices internally. Arguments are converted to ndarrays / matrices explicitly.
  2. Estimator.transform to return an object with the same type as the input
    a. Feature names (.columns for a DataFrame) match estimator.output_features_names_ (may have the attribute name wrong)
    b. (possibly) Row labels, if any, match the input row labels

Applying those principles gives the following table:

| Fit Input | Transform Input | Transform Output |
|-----------|-----------------|------------------|
| ndarray   | ndarray         | ndarray          |
| ndarray   | DataArray       | DataArray        |
| ndarray   | DataFrame       | DataFrame        |
| DataArray | ndarray         | ndarray          |
| DataArray | DataArray       | DataArray        |
| DataArray | DataFrame       | DataFrame        |
| DataFrame | ndarray         | ndarray          |
| DataFrame | DataArray       | DataArray        |
| DataFrame | DataFrame       | DataFrame        |

Some objections to this approach are that

  1. Feature names are only available to people using xarray or pandas. IMO that's acceptable.
    Adding a new data container to the ecosystem should have a high bar. Scikit-Learn could
    define an interface for other containers to meet.
  2. A potential pandas refactor might make pd.DataFrame(np.asarray(df)) incur two memory
    copies. But that refactor is uncertain, and is some years off at a minimum.

@amueller
Member

amueller commented Jan 7, 2020

@TomAugspurger what does "internally" mean here? An estimator doesn't know whether it's in a pipeline or not, and so if you'd pass a dataframe into a pipeline, you'd convert between dataframe and numpy array (which is used within the estimator) between every step. Previously I think the agreement was not to pay this cost. I'm not 100% sure any more, though. That would only be a cost if there was this refactor, though.

@amueller
Member

amueller commented Jan 7, 2020

There's one more concern with your proposal: it's not backward compatible.
That could be solved by a global flag sklearn.set_config(preserve_frames=True) or something like that.

@TomAugspurger

what does "internally" mean here?

Essentially, whatever check_array(X) returns. The array / matrix scikit-learn actually works with inside .fit and .transform.

That may or may not be a copy for a homogeneous DataFrame today.

it's not backward compatible.

Right. If you're making a new InputArray you wouldn't have the legacy behavior concerns.

@amueller
Member

amueller commented Jan 7, 2020

@TomAugspurger ok, I think so far anything we discussed used ndarray/scipy sparse internally, and we were only discussing input/output format.

@amueller
Member

amueller commented Jan 7, 2020

Should we do a big list of the pros and cons of the three solutions we considered?
One is pandas-in, pandas-out, one is DataArray, and one is passing meta-data through the pipeline and meta-estimators. This discussion has really been pretty gnarly. This could go in the alternatives section of this SLEP or in a separate document.
We don't really have a framework for making a decision between three options (other than discussion).

@amueller
Member

amueller commented Jan 7, 2020

@adrinjalali @GaelVaroquaux can you maybe recap the arguments against pandas-in pandas-out? I used to be convinced by the memory copy argument, but honestly I think the usability advantage outweighs it, given that it's a potential future concern and that we make so many copies already (and that users can easily avoid them by converting to numpy if needed).

@adrinjalali
Member Author

there were/are some arguments against pandas/xarray, I'll try to recap:

  • pandas dependency (pandas, xarray)
  • row alignment (pandas, xarray)
  • different semantics for some operations between numpy and pandas/xarray. I don't remember which operations those were, but certain operations have the same name yet different semantics when you compare numpy and pandas/xarray
  • pandas-in, pandas-out would kinda imply supporting pandas, which in turn could mean understanding multi-indices (among other pandas features), which we probably don't wanna get into. We could be explicit and accept only a subset of pandas features, but that's also not easy to define.
  • API: both xarray and pandas have a very different API than what people are used to with numpy. My implementation using xarray had a bunch of utility functions to get around the obscure xarray API. This is not a big issue though, since people can still convert between the three libraries more or less.

On the plus side, I did at one point have a working implementation of the feature names with xarray. When it came down to it, it was the operation semantics, the pandas dependency, and row alignment that resulted in us working on a different data structure (IIRC).

@amueller
Member

Can you elaborate on row alignment and what the issue is?

And why are different semantics a problem? It means an incompatible change but when users provide pandas dataframes as inputs, I would suppose they are familiar with pandas semantics. One could argue that transformers currently changing the semantics from pandas to numpy is an issue ;) - certainly having a breaking change with a config flag is some annoyance, though.

pandas in pandas out would kinda imply supporting pandas which in turn could mean understanding multi-indices (among other pandas features).

I'm not sure what you mean. We are currently supporting pandas as input types quite explicitly.

The API point seems to be a repeat of the different semantics point, right?

The reason I didn't like xarray was the axis semantics which is not really natural for us.

I guess writing this down well requires a different list of pro/cons for xarray and pandas each.

The point that @TomAugspurger made was more about preserving data types than using one or the other. From a user perspective that makes sense, since we'd want users to work with the API that they are familiar with. Having a pandas user work with xarray might not be super familiar (though xarray users are probably familiar with pandas to some degree?).

If we would want to preserve both types, that would probably require some custom code, and more code for each additional type we'd want to support (if any). As @TomAugspurger alluded to, we could try to define an abstraction for that, but that would be pretty far future.

I'd be happy to just preserve type for pandas for now.

Meaning there's actually 5 solutions:
a) preserve type (from a white-list of types which could start with just pd.DataFrame)
b) always use pd.DataFrame
c) always use xarray.DataArray
d) use DataArray
e) pass information through in meta-estimators.

The semantic differences would be an argument against b) and c) but not against a) imho.

@TomAugspurger

Apologies for muddling the DataArray discussion with the "pandas (or xarray) in / pandas (or xarray) out" discussion. But I think they're related, since the need for a custom DataArray is lessened if pandas in / pandas out becomes the default behavior, and if feature names are limited to arrays that already have names.

row alignment (pandas, xarray)
which in turn could mean understanding multi-indices (among other pandas features),

Again, I really think that row-labels are a non-issue for scikit-learn internally :) And as for users, if they're passing in a DataFrame then they probably are familiar with pandas' alignment. The extent of scikit-learn's interaction with row labels would be something like

def transform(self, X, y=None):
    X, row_labels, input_type = check_array(X)
    # rest of transform, operating on an ndarray
    result = ...
    # some hypothetical function that recreates a DataFrame / DataArray,
    # preserving row labels, attaching new feature names.
    result = construct_result(result, row_labels, feature_names, input_type)
    return result

If the issue with multi-indices is for the columns, then I'd say scikit-learn shouldn't try to support those. I think / hope that you have a requirement that feature_names_in_ and feature_name_out_ be a sequence of strings.

@adrinjalali
Member Author

If both sample_weight and X are indexed, should sklearn try to align them in operations?
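Concretely, if we simply let pandas align them, a shuffled index would silently reorder the weights rather than raise:

import pandas as pd

X = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
sample_weight = pd.Series([0.1, 0.2, 0.3], index=[2, 1, 0])

X["a"] * sample_weight  # aligned on the index: weight 0.3 now pairs with row 0, not row 2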

@amueller
Member

@adrinjalali that's a good question. Right now, if X, y and sample_weight are indexed, we just drop the index. Given that we have explicit support for pandas in ColumnTransformer that could be considered a bug.

For columns we are enforcing that they must be exactly the same between fit and transform. I could see us taking a similar stance here, i.e. asserting that the index is the same for X, y, and sample_weight, and raising an error if not.

Given that we decided against aligning columns, I think it makes sense to also not align rows. This issue is already present in the code base right now, though.
I don't really follow why you think it relates to changing the output type.

``X`` being an ``InputArray``::

>>> np.array(X)
>>> X.todataframe()


just my two cents: Spark already has toPandas(), and given that eventually there might be different "DataFrames" (e.g. Modin, Dask), wouldn't it make sense to have x.topandas() rather than a generic x.todataframe()?

Estimators understand the ``InputArray`` and extract the feature names from the
given data before applying the operations and transformations on the data.

All transformers return an ``InputArray`` with feature names attached to it.
Member

Then how do we ensure the backward compatibility criteria from above, namely:

code which expects a numpy.ndarray as the output of a transformer, would not break

Member Author

Since any operation on the InputArray would result in a pure numpy array, the existing code won't break.
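One way this could work technically (my sketch, not a design the SLEP commits to) is an ndarray subclass that demotes every result to a plain ndarray:

import numpy as np

class InputArray(np.ndarray):
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        return obj

    def __array_wrap__(self, out_arr, context=None, return_scalar=False):
        # every ufunc result comes back as a plain ndarray, so existing
        # code that expects np.ndarray keeps working unchanged
        return np.asarray(out_arr)

X = InputArray(np.ones((3, 2)), feature_names=["a", "b"])
type(X + 1)  # numpy.ndarray, not InputArray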

Member

But how is this enabled, technically?


Member

what i'm trying to say is, this should be explained in the SLEP.

Member Author

I thought we agreed on not including those details here :P (#25 (comment))

Member

Right, OK

As Andy noted it should be mentioned in the SLEP that the details are intentionally left out.

Though IMHO this is a slippery slope: leaving out details may mean ending up with a vacuous SLEP with no actual proposal

Member Author

I don't think so. There are different ways of implementing this feature, and I don't think it's within the scope of the SLEP to define them. The SLEP is about the API, not how that API is achieved. We may move from one solution to another, w/o having to write a SLEP for it, as long as it doesn't substantially change the API.

@adrinjalali
Member Author

I kinda get the feeling that if:

  • we limit the index types on cols and ignore row indices of the input
  • only support feature names if the user has xarray and pandas installed

we can go back to using xarray as a return type for transformers if the user provides feature names through either an xarray or a pandas data structure?

We could have a global conf to enable/disable the behavior, and have it disabled by default for a few releases.

@amueller
Member

I agree with @adrinjalali, only I would actually use pandas, not xarray. I think having fewer dependencies and types is better. The main reason I had for xarray vs pandas was the zero-copy issue, and I've come to give that less weight (given that the future of pandas is unclear and that we copy a lot anyway).

Sorry for going back and forth on this. Main question: should we have a separate SLEP for using an existing "array" type (pandas is not really an array, I know)?
That will make voting and discussion harder and harder. I wonder if maybe discussing this on a call would be easier so we can sync up?
I would love to hear @GaelVaroquaux's, @jnothman's and @jorisvandenbossche's take.

@adrinjalali
Member Author

my gut feeling is that if we implement the machinery to have this optional, we could have it in the global config to be set as either xarray or pandas, and it shouldn't be too much work.

@adrinjalali
Member Author

Oh, and on the SLEP issue, I'd say we should work on a slep with an existing data type, and if we get the feeling that there's some consensus there, I'd withdraw this slep (I wish it was merged :D )

@amueller
Member

I'm happy to merge it as it is.
And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.

@amueller left a comment

I think it would be good to merge and then keep discussing

@adrinjalali
Member Author

And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.

I'd be happy to have a prototype implementation once the n_features_in_ is in. I'm almost tempted to base my PR on @NicolasHug 's implementation and start before it's merged.

@amueller
Member

maybe coordinate with @thomasjpfan who I think is working on a new SLEP.

@adrinjalali
Member Author

@amueller would you like to merge? :D

@amueller
Member

Sure. I think the proposal is reasonably complete, even though I don't expect a vote on it in the current form very soon.

@amueller amueller merged commit 37ab0c1 into scikit-learn:master Feb 18, 2020
@adrinjalali adrinjalali deleted the slep012/DataArray branch May 14, 2020 11:21
@amueller
Member

Can someone remind me where we had the discussion that concluded with making the output type globally constant instead of it depending on the input type?

@adrinjalali
Member Author

I don't remember where we had the discussion, but I thought we agreed that not knowing what the output is by looking at a piece of code is bad practice?

Also, that policy means the same code which before would take a pd.DataFrame and return an ndarray now would return a pd.DataFrame, which is confusing and not backward compatible (which we could argue is not essential for v1.0 ;) )

So I think I'd really prefer not to make the output type depend on the input type.
