SLEP012 - DataArray #25
Conversation
slep012/proposal.rst
Outdated
- ``XArray``: we could accept a ``pandas.DataFrame``, and use
  ``xarray.DataArray`` as the output of transformers, including feature names.
  However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to
  handle row labels and aligns rows when an operation between two
Uses a pandas Index for labels, not Series?
Can you expand on why alignment is an issue here? I wouldn’t expect row labels to be used internally in an estimator, but I would expect them to be preserved through a transformation.
I added some clarification, and this is what I remember from my discussions with @GaelVaroquaux
Also, more than happy to have you comment on any part of this proposal \o/
I think it should give examples of other potential things we might include.
FeatureArray?
slep012/proposal.rst
Outdated
InputArray
==========

This proposal suggests adding a new data structure, called ``InputArray``,
But it's used for output from scikit-learn models isn't it?
doesn't interfere with existing user code.

A main constraint of this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
This "backwards compatible" is a pretty loose statement that deserves some explication. Will the object be an instance of ndarray? Will operations over DataArray produce raw ndarrays in all cases, or the subtype in some?
How do we handle sparse arrays?
Do we need this backwards compatibility, or do we target a major release and require users to unwrap the data structure for some operations??
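One way to read "backward compatible", sketched here purely as an illustration (this is not the SLEP's settled design, and the class name is made up): a thin numpy.ndarray subclass that carries feature names, so isinstance checks and existing numpy code keep working.

# Hypothetical sketch: the new structure *is* an ndarray, so code expecting
# an ndarray (isinstance checks, numpy functions) keeps working unchanged.
import numpy as np

class FeatureNamedArray(np.ndarray):
    """Minimal ndarray subclass carrying feature names (illustrative only)."""

    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        return obj

    def __array_finalize__(self, obj):
        # Propagate the attribute through views and slices; the real design
        # would have to decide what happens when columns are dropped.
        self.feature_names = getattr(obj, "feature_names", None)

X = FeatureNamedArray(np.arange(6).reshape(3, 2), feature_names=["a", "b"])
assert isinstance(X, np.ndarray)          # old code expecting an ndarray still works
assert list(X.mean(axis=0)) == [2.0, 3.0]

Sparse matrices are the obvious gap in this particular sketch, which is exactly the question raised above.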
***************

There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or
What about conversion in the other direction?
slep012/proposal.rst
Outdated
Feature Names
*************

Feature names are an array of strings aligned with the columns. They can be
An array of strings? I think you mean either an object array or a list (the latter being what get_feature_names currently returns)
Thanks for opening the SLEP
All of the above applies to sparse arrays.

Factory Methods
***************
Can you give some examples? I suppose these would look like e.g.
DataArray.from_pandas(...)
DataArray.from_xarray(...)
DataArray(numpy_array, feature_names)
.. _slep_012:

==========
InputArray
InputArray or DataArray? The PR intro says the latter
Also just came up with InputArray. It's better than DataArray because it's not taken in the python ecosystem yet afaik (OpenCV has it in C++?) but I agree with @jnothman below in that we also use it for output - and actually the primary purpose of it is using it for output :-/
MetaArray?
Wouldn't that be an array of arrays? Or an array of metadata?
slep012/proposal.rst
Outdated
feature names to be attached to the data given to an estimator, there are a few
approaches we can take:
I think alternative solutions should go at the end in their own section
agreed.
slep012/proposal.rst
Outdated
==========

This proposal suggests adding a new data structure, called ``InputArray``,
which wraps a data matrix with some added information about the data. This was
Wraps how?
Maybe give an example of what the DataArray class would look like? It's not clear whether the plan is to go for inheritance or composition
Maybe "augments the data array X with additional meta-data"?
For now I intentionally left the implementation details out. The idea is that the solution should satisfy the requirements of this proposal. I can put the details in, but that'll be a lot of details.
Ok then mention that maybe? And make the requirements clear on what numpy compatibility means, or at least list the relevant questions?
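To make the inheritance-vs-composition question above concrete, here is a hedged sketch of the composition option (names and behaviour are placeholders, not the SLEP's decision): a plain wrapper that only promises __array__, so np.asarray and most numpy functions see the underlying data, at the cost of failing isinstance(..., np.ndarray) checks.

# Hypothetical composition sketch: the wrapper is *not* an ndarray; it only
# exposes the underlying data through __array__ and __getitem__.
import numpy as np

class DataWrapper:
    def __init__(self, data, feature_names=None):
        self._data = np.asarray(data)
        self.feature_names = None if feature_names is None else list(feature_names)

    def __array__(self, dtype=None, copy=None):
        # np.asarray(wrapper) and most numpy functions get the raw ndarray;
        # `copy` is accepted for NumPy 2 compatibility and ignored here.
        return self._data if dtype is None else self._data.astype(dtype)

    def __getitem__(self, key):
        # Per the current draft, slicing returns a plain ndarray.
        return self._data[key]

X = DataWrapper([[1.0, 2.0], [3.0, 4.0]], feature_names=["a", "b"])
print(np.asarray(X).shape)        # (2, 2) -- numpy sees the underlying data
print(isinstance(X, np.ndarray))  # False -- the price of composition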
Operations
**********
How do transformers behave w.r.t. the new input format? We want them to return a DataArray if they are passed a DataArray, I suppose.
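A hedged sketch of that "named in, named out" behaviour, using stand-in names (_Named and NamedMinMax are made up for illustration; nothing here is the SLEP's actual API):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class _Named:
    """Stand-in for the proposed container: a data matrix plus feature names."""
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array__(self, dtype=None, copy=None):
        # `copy` accepted for NumPy 2 compatibility and ignored in this toy.
        return self.data if dtype is None else self.data.astype(dtype)

class NamedMinMax(TransformerMixin, BaseEstimator):
    """Toy min-max scaler that propagates feature names when the input has them."""

    def fit(self, X, y=None):
        data = np.asarray(X)
        self.min_ = data.min(axis=0)
        self.range_ = data.max(axis=0) - self.min_
        self.feature_names_in_ = getattr(X, "feature_names", None)
        return self

    def transform(self, X):
        out = (np.asarray(X) - self.min_) / self.range_
        if self.feature_names_in_ is not None:
            return _Named(out, self.feature_names_in_)  # named in -> named out
        return out                                      # ndarray in -> ndarray out

X = _Named([[0.0, 10.0], [5.0, 20.0]], feature_names=["a", "b"])
Xt = NamedMinMax().fit(X).transform(X)
print(Xt.feature_names)  # ['a', 'b']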
slep012/proposal.rst
Outdated
However, ``xarray`` has a hard dependency on ``pandas``, and uses
``pandas.Index`` to handle row labels and aligns rows when an operation
between two ``xarray.DataArray`` is done, which can be time consuming, and is
not the semantic expected in ``scikit-learn``; we only expect the number of
Expanding on this "row labels" point: I don't think it should be mentioned at all since it should be a non-issue. No reasonable estimator should be using row labels in its implementation, since the row labels in the data passed to .fit can't be expected to match the row labels passed to .transform, .predict, .score, etc.

If I had to guess where auto-alignment could be an issue, it would be when the features passed to, say, .score differ from the inputs to .fit:

from pandas import DataFrame

class MinScalar:
    def fit(self, X: DataFrame):
        self.minima_ = X.min()  # Series; the row labels are the original feature names.
        return self

    def transform(self, X: DataFrame):
        return X - self.minima_
When used incorrectly, you could get something like
In [4]: X = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
In [5]: minima = X.min() # simulate fit
In [6]: minima
Out[6]:
A 1
B 4
dtype: int64
In [7]: X_new = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
In [8]: X_new - minima # simulate transform. This should raise!
Out[8]:
A B C
0 0.0 0.0 NaN
1 1.0 1.0 NaN
2 2.0 2.0 NaN
But I expect that input validation (that the feature names passed to score / transform match the feature names passed to fit) would be handled earlier.
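For concreteness, a hedged sketch of that kind of early check (check_feature_names is a hypothetical helper, not an existing scikit-learn function), assuming the container exposes a feature_names attribute:

def check_feature_names(fitted_names, X):
    """Fail fast when X carries feature names that differ from those seen in fit."""
    seen = getattr(X, "feature_names", None)
    if fitted_names is None or seen is None:
        return  # plain arrays: nothing to validate
    if list(seen) != list(fitted_names):
        raise ValueError(
            f"Feature names at transform/score time {list(seen)} do not match "
            f"those seen during fit {list(fitted_names)}."
        )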
I agree that indices shouldn't be an issue as we could always just reindex in the beginning. I think the main reason for not using xarray was to keep 100% numpy-compatibility. If we don't require that, I think xarray is a viable alternative. I was convinced at some point that xarray is not a good idea but it's been so long that I can't reconstruct the reasoning. I think basically I didn't want a hard, incompatible switch.
IIRC:
- certain operations have different semantics in numpy vs. xarray.
- xarray has a hard pandas dependency
- row alignment does take time, and we have discussed that it's something we definitely don't want. @GaelVaroquaux may have a clearer argument on this one
As far as I remember, otherwise it was a good alternative.
Scope 1) I would say we leave the route open to include feature props in the future. DataArray is the name of the xarray class, so I'd say -1 on that. We could do something non-informative like […].

Scope 2) I agree with @jnothman, you should give examples that we might care about in the future.

Implementation details: Yes, it should include implementation details and alternatives.
I think this should be reorganized to start with a motivation. It should also discuss why this solution is better than what was proposed in the old slep.
Also it should actually describe the behavior we want. Where will it be accepted as input parameter? Probably everywhere? When will it be produced as output parameter? Always? When the input was a NamedArray? When the input was a NamedArray or a DataFrame?
==========
InputArray
==========
I think you should start with the problem we are trying to solve, and then a motivating example of how it will look once we have solved the problem.
So start with motivation, I'd say.
slep012/proposal.rst
Outdated
All usual operations (including slicing through ``__getitem__``) return an
``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o
any modifications. This prevents any unwanted computational overhead as a
I don't follow this argument.
slep012/proposal.rst
Outdated
any modifications. This prevents any unwanted computational overhead as a
result of migrating to this data structure.

The ``select()`` method will act like a ``__getitem__``, except that it
Have we discussed this before? So this is similar to .loc in pandas?
I want to record something that we discussed in person yesterday, which is enabling simple conversion to pandas dataframes. Ideally users should be able to do […]; a less elegant solution would be a […].
@jorisvandenbossche says nothing like the […]
Then seaborn would still need to convert inputs to DataFrames, and not assume it already is one (I didn't check seaborn's internals). So even if there is such a pandas protocol, all downstream libraries that accept dataframes might need to be adapted before natively working on such a new data structure (not impossible of course, and one needs to start with it at some point, but certainly an extra hurdle).
Indeed, the downstream libraries would need work. But this would be an enabling feature that allows downstream libraries to implement this conversion without knowing anything about the input format. Seaborn has some […]
Hi, I found this very interesting. I'm not sure my comment will be relevant, but what about simple integration of numpy structured arrays? Also an integration with Pipelines would be interesting, like FeatureUnion (retaining feature names).
Struct arrays would meet some but not all of our needs, yet are quite tricky to work with, e.g. very limited support for object dtype and impossible to offer backwards compatibility (we would be dealing with 1d struct arrays rather than 2d arrays).
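A small illustration of the 1d point with plain NumPy (just to show the shape mismatch, not a claim about the SLEP itself):

import numpy as np

# A structured array with two float fields is 1-D; the "columns" live in the dtype.
rec = np.array([(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)],
               dtype=[("feature_a", "f8"), ("feature_b", "f8")])
print(rec.shape)         # (3,)
print(rec["feature_a"])  # [1. 2. 3.]

# What estimators currently expect: a 2-D (n_samples, n_features) array.
dense = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
print(dense.shape)       # (3, 2)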
Dear core devs,

1. Yet another data structure

There are already several data structures in Python, to name a few: numpy.ndarray, pandas.DataFrame, xarray.DataArray, pyarrow.Table, datatable.Frame (h2oai), ...

2. User interface and underlying data structure

Algorithms in scikit-learn are implemented on numpy.ndarray (homogeneous n-dimensional data). But usually, my data comes as heterogeneous tabular data (strings, categorical, floats). How deep should the integration between those two go? ColumnTransformer is already a great help.

3. Compare to/Learn from R

R settled on its data.frame structure (or data.table or tibble, which are interoperable). This enables data.frame-in and data.frame-out processes on which many libraries and functions rely, especially in the tidyverse. R formulae are another means to work directly on dataframes instead of matrices/arrays.

4. Standard use case

At least in my work, I usually start with a tabular data structure, something like this:

import pandas as pd

df = pd.read_parquet("myfile.parquet")

After some feature preprocessing/engineering, I need to pass it through a ColumnTransformer before I can use an estimator:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.X import XRegressor  # placeholder for some estimator

y = df['target']
X = df[['feature_a', 'feature_b', ...]]
col_trans = ColumnTransformer(
    [('feature_a_ohe', OneHotEncoder(), ['feature_a']), ...])
model = Pipeline(steps=[('col_trans', col_trans), ('regressor', XRegressor())])
model.fit(X, y)

After that, I'd like to inspect my model with respect to the original input features ('feature_a', 'feature_b', the ones before the ColumnTransformer).
Thanks for your input @lorentzenchr.
Do you want another review or do you want to address the comments first? I can also send a PR.
• row alignment does take time, and we have discussed that it's something we definitely don't want. @GaelVaroquaux may have a clearer argument on this one

Not only does it take time, it is in practice a significant source of bugs for naive users who do not know that it is happening.
@amueller I tried to address the comments. I'm happy to take more comments, or for you to either change this PR directly or send a PR to my fork if you think it needs discussion and things are not clear. I also updated the NamedArray PR I had with the latest prototype I had for sparse arrays (scikit-learn/scikit-learn#14315).
Can someone expand on the row alignment issue? I don’t see any case where row labels would be used by an estimator.
Adrin attempted an explanation to my "row-label alignment" confusion. The example:

>>> trn = StandardScaler()
>>> a = xr.DataArray(np.random.randn(3, 2), coords={"dim_0": ['a', 'b', 'c']})
>>> b = np.random.randn(3, 2)
>>> trn.fit(a)
>>> trn.transform(a) + trn.transform(b)  # ? + ?, does this align?

This comes down to a fundamental question: what is the output type from […]? My initial preferences are for […].

Applying those principles gives the following table: […]

Some objections to this approach are that […].
@TomAugspurger what does "internally" mean here? An estimator doesn't know whether it's in a pipeline or not, and so if you'd pass a dataframe into a pipeline, you'd convert between dataframe and numpy array (which is used within the estimator) between every step. Previously I think the agreement was not to pay this cost. I'm not 100% sure any more, though.

That would only be a cost if there was this refactor, though.
There's one more concern with your proposal: it's not backward compatible.
Essentially, whatever […]. That may or may not be a copy for a homogeneous DataFrame today.
Right. If you're making a new InputArray you wouldn't have the legacy behavior concerns.
@TomAugspurger ok, I think so far anything we discussed used ndarray/scipy sparse internally, and we were only discussing input/output format.
Should we do a big list of the pros and cons of the three solutions we considered?
@adrinjalali @GaelVaroquaux can you maybe recap the arguments against pandas-in pandas-out? I used to be convinced by the memory copy argument but honestly I think the usage advantage outweighs that, given that it's a potential future concern and that we make so many copies already (and that they can be easily avoided by the user by converting to numpy if needed).
There were/are some arguments against pandas/xarray; I'll try to recap: […]

On the plus side, I did at one point have an implementation of the feature names with xarray and it was working. When it came down to it, it was the operation semantics, pandas dependency, and row alignment which resulted in us working on a different data structure (IIRC).
Can you elaborate on row alignment and what the issue is? And why are different semantics a problem? It means an incompatible change but when users provide pandas dataframes as inputs, I would suppose they are familiar with pandas semantics. One could argue that transformers currently changing the semantics from pandas to numpy is an issue ;) - certainly having a breaking change with a config flag is some annoyance, though.
I'm not sure what you mean. We are currently supporting pandas as input types quite explicitly. The API point seems to be a repeat of the different semantics point, right? The reason I didn't like xarray was the axis semantics, which is not really natural for us. I guess writing this down well requires a different list of pros/cons for xarray and pandas each.

The point that @TomAugspurger made was more about preserving data types than using one or the other. From a user perspective that makes sense, since we'd want users to work with the API that they are familiar with. Having a pandas user work with xarray might not be super familiar (though xarray users are probably familiar with pandas to some degree?). If we would want to preserve both types, that would probably require some custom code, and more code for each additional type we'd want to support (if any). As @TomAugspurger alluded to, we could try to define an abstraction for that, but that would be pretty far future. I'd be happy to just preserve type for pandas for now.

Meaning there's actually 5 solutions: […]

The semantic differences would be an argument against b) and c) but not against a) imho.
Apologies for muddling the DataArray discussion with the "pandas (or xarray) in / pandas (or xarray) out" discussion. But I think they're related, since the need for a custom DataArray is lessened if pandas in / pandas out becomes the default behavior, and if feature names are limited to arrays that already have names.
Again, I really think that row-labels are a non-issue for scikit-learn internally :) And as for users, if they're passing in a DataFrame then they probably are familiar with pandas' alignment. The extent of scikit-learn's interaction with row labels would be something like:

def transform(self, X, y=None):
    X, row_labels, input_type = check_array(X)
    # rest of transform, operating on an ndarray
    result = ...
    # some hypothetical function that recreates a DataFrame / DataArray,
    # preserving row labels, attaching new feature names.
    result = construct_result(result, row_labels, feature_name, input_type)
    return result

If the issue with multi-indices is for the columns, then I'd say scikit-learn shouldn't try to support those. I think / hope that you have a requirement that […].
If both sample_weight and X are indexed, should sklearn try to align them in operations?
@adrinjalali that's a good question. Right now, if […].

For columns we are enforcing that they must be exactly the same between fit and transform. I could see us taking a similar stance here, i.e. asserting that the index is the same for X and sample_weight.

Given that we decided against aligning columns, I think it makes sense to also not align rows. This issue is already present in the code base right now, though.
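To make the alignment concern concrete, a small pandas-only illustration (nothing scikit-learn specific, and the numbers are made up): with labeled data, an out-of-order sample_weight is silently realigned rather than treated positionally or rejected.

import pandas as pd

X = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
w = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])  # same labels, different order

# Label-based alignment: pandas silently reorders w to match X's index.
print((X["a"] * w).tolist())                        # [30.0, 40.0, 30.0]

# Positional semantics, as an estimator working on ndarrays would see it.
print((X["a"].to_numpy() * w.to_numpy()).tolist())  # [10.0, 40.0, 90.0]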
``X`` being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
Just my two cents: Spark already has toPandas(), and given that eventually there might be different "DataFrames" (e.g. Modin, Dask), wouldn't it make sense to have x.topandas() / x.topdataframe rather than a generic dataframe?
Estimators understand the ``InputArray`` and extract the feature names from the
given data before applying the operations and transformations on the data.

All transformers return an ``InputArray`` with feature names attached to it.
Then how do we ensure the backward compatibility criteria from above, namely: "code which expects a numpy.ndarray as the output of a transformer, would not break"?
Since any operation on the InputArray would result in a pure numpy array, the existing code won't break.
But how is this enabled, technically?
What I'm trying to say is, this should be explained in the SLEP.
I thought we agreed on not including those details here :P (#25 (comment))
Right, OK
As Andy noted it should be mentioned in the SLEP that the details are intentionally left out.
Though IMHO this is a slippery slope: leaving out details may mean ending up with a vacuous SLEP with no actual proposal
I don't think so. There are different ways of implementing this feature, and I don't think it's the scope of the slep to define those ways. The SLEP is about the API, not how that API is achieved. We may move from one solution to the other, w/o having to write a slep for it, as long as it doesn't substantially change the API.
I kinda get the feeling that if: […]

we can go back to using xarray. We could have a global conf to enable/disable the behavior, and have it disabled by default for a few releases.
I agree with @adrinjalali, only that I would actually use pandas, not xarray. I think having fewer dependencies and types is better. The main reason I had for xarray vs pandas was the zero-copy issue, and I've come to give that less weight (given that the future of pandas is unclear and that we copy a lot anyway). Sorry for going back and forth on this. Main question: should we have a separate slep for using an existing "array" type (pandas is not really an array, I know).
My gut feeling is that if we implement the machinery to make this optional, we could have a global config option to set it to either xarray or pandas, and it shouldn't be too much work.
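A hedged sketch of what such a switch could look like. scikit-learn's real set_config had no such option here; this is a stand-alone toy, and every name in it is made up:

# Toy global switch deciding how transformers wrap their output (hypothetical).
_CONFIG = {"array_out": "default"}  # "default" | "pandas" | "xarray"

def set_array_out(kind):
    if kind not in ("default", "pandas", "xarray"):
        raise ValueError(f"unknown array_out option: {kind!r}")
    _CONFIG["array_out"] = kind

def wrap_output(data, feature_names):
    """Wrap an ndarray according to the global setting (illustrative only)."""
    kind = _CONFIG["array_out"]
    if kind == "pandas":
        import pandas as pd
        return pd.DataFrame(data, columns=list(feature_names))
    if kind == "xarray":
        import xarray as xr
        return xr.DataArray(data, dims=("index", "columns"),
                            coords={"columns": list(feature_names)})
    return data  # default: keep returning plain ndarrays for a few releases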
Oh, and on the SLEP issue, I'd say we should work on a slep with an existing data type, and if we get the feeling that there's some consensus there, I'd withdraw this slep (I wish it was merged :D).
I'm happy to merge it as it is.
I think it would be good to merge and then keep discussing
I'd be happy to have a prototype implementation once the […]
Maybe coordinate with @thomasjpfan, who I think is working on a new SLEP.
@amueller would you like to merge? :D
Sure. I think the proposal is reasonably complete, even though I don't expect a vote on it in the current form very soon.
Can someone remind me where we had the discussion that concluded with making the output type globally constant instead of it depending on the input type?
I don't remember where we had the discussion, but I thought we agreed that not knowing what the output is by looking at a piece of code is bad practice? Also, that policy means the same code which before would take a […].

So I think I'd really prefer not to make the output type depend on the input type.
This SLEP discusses a new data structure which would carry some meta-data along with the data itself. The feature names are the first set of meta-data we would implement, but it could grow from that point.

Things we need to decide to continue the write-up of this SLEP:

- A NamedArray with only feature names may be another SLEP?
- NamedArray, but we seem to lean towards including other meta-data than feature names. Therefore we probably should use another name. DataArray is a placeholder. Happy to change.