set_column_by_name / set_column? #235
Comments
thanks @ivirshup ! I've got a PR open to add my general issue with
😄 yeah I realised it was pretty "ugly" whilst preparing and presenting the presentation - would
be less ugly in your opinion? |
I guess I want the behavior you are describing 😅. But I also want there to be a unique solution to Personally, I think of dataframes as e.g. confusing definition of union
In [1]: import pandas as pd
In [2]: pd.Index(["a", "a", "b"]).intersection(pd.Index(["a", "a", "b"]))
Out[2]: Index(['a', 'b'], dtype='object')
In [3]: pd.Index(["a", "a", "b"]).union(pd.Index(["a", "b", "a"]))
Out[3]: Index(['a', 'a', 'b'], dtype='object')
I think most libraries do "key insertion order", but I don't really care since I'll be accessing columns by name anyways. I sort of see an argument for saying
if col.name not in df.get_column_names():
    df = df.set_column(col)  # or whatever other than position
else:
    raise SomeError(f"{col.name} was taken, be more original")
I would actually like to rephrase from "the ugliest" to "the only ugly part". |
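A minimal sketch of the "fail if the name is taken" behaviour from the snippet above, using a plain dict as a stand-in for a dataframe. The names `add_column_strict` and `ColumnNameTakenError` are hypothetical and only for illustration; they are not part of the standard.

```python
class ColumnNameTakenError(ValueError):
    """Hypothetical error raised when a column name already exists."""


def add_column_strict(columns: dict, name: str, values: list) -> dict:
    """Return a new mapping with `name` appended; refuse to overwrite."""
    if name in columns:
        raise ColumnNameTakenError(f"{name} was taken, be more original")
    # Copy so the input mapping is left untouched, then append at the end.
    new_columns = dict(columns)
    new_columns[name] = values
    return new_columns


frame = {"a": [1, 2], "b": [3, 4]}
frame = add_column_strict(frame, "c", [5, 6])
```

With this shape there is exactly one outcome for a given call: either the column is appended or the call raises, so no position argument is needed.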
Thanks for your thoughts, much appreciated @kkraus14 has brought up the idea of unordered columns before - maybe we should do that? So then we'd have:
and corresponding plural versions. Then there'd be no need for And The name |
I think I brought up a question about whether we had explicitly decided that columns were ordered or not. From my perspective, it's a common pattern to iterate over the columns in a dataframe and returning column names as a set doesn't guarantee you a consistent iteration order as far as I'm aware, which is undesired behavior. My stance here is that columns should be both ordered and named. We could update the behavior of |
sounds good - you OK with the other renamings suggested? |
@MarcoGorelli your proposal sounds right, but I agree that A case where order really matters: visual inspection of the table. It's quite annoying when:
But you don't need that much order awareness to accomplish this, basically just insertion order and stable iteration order.
For I would also like to make a case for a |
Sure, let's keep columns ordered then, and default to inserting columns at the end. People can always reorder them with for
if col.name in df.get_column_names():
    df = df.set_column(col)
else:
    df = df.insert_column(col)
? But I also don't really care too much about this particular topic. I'm much more concerned about Ibis and polars-lazy not being compatible with some column-related parts of the API. If you had any thoughts / inputs on #229, that would be appreciated. Just for my understanding, what's your use case for the dataframe standard? |
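To make the "ordered, insert at the end" semantics concrete, here is a rough sketch with a plain Python dict standing in for a dataframe. The helper `set_or_insert` is a hypothetical name, not an API from the standard. It relies only on the fact that Python dicts preserve insertion order and that assigning to an existing key keeps that key's position.

```python
def set_or_insert(columns: dict, name: str, values: list) -> dict:
    """Overwrite `name` in place if it exists, otherwise append it at the end."""
    new_columns = dict(columns)  # copy, so the original mapping is not mutated
    new_columns[name] = values   # existing key keeps its position; new key goes last
    return new_columns


frame = {"a": [1, 2], "b": [3, 4]}
frame = set_or_insert(frame, "a", [10, 20])  # overwrite: "a" stays first
frame = set_or_insert(frame, "c", [5, 6])    # new name: appended at the end
column_order = list(frame)
```

Under these assumptions the branch on `col.name in df.get_column_names()` collapses into a single operation, and iteration order stays stable across overwrites.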
Context
I work on anndata. This library provides the It's sort of like a 2-D xarray.Dataset specialized for exploratory analysis of high dimensional data, where your data looks like something that could be passed into a scikit-learn transformer. It's not actually an xarray.Dataset because we need support for sparse arrays and dataframes (for 1d annotations along the observations or variables, e.g. "cell type"). We're currently expanding the types of arrays we can work with, sometimes with help from the array-api, but not as often as we'd like since we have sparse data. However, we currently only support pandas dataframes, and I'd like to change that. This initiative seems like a great way to do that.
What I'd like to be able to do with arbitrary dataframe types
The big operations we need are:
df = DataFrame()
store: h5py.Group | zarr.Group = ...
df = Dataframe.from_dict(
    {k: v.decode() for k, v in store.items()}
)
And if we can do these without having to specialize for each case, as we've been doing to have sparse array support. Eventually I would also like to support more dataframe types in our analysis libraries, but being able to handle them in our data container would need to happen first. |
Thanks @ivirshup
Could you expand on what you need please? Row labels are disallowed by the standard, would you be able to manage without? |
Yes. We just need positional indexing of rows. We can implement our own labels on top of that. |
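As a rough illustration of "our own labels on top of positional indexing": the sketch below keeps a label-to-position mapping next to the columns and resolves every lookup to an integer position first. `LabelledRows` is a hypothetical helper for this thread, not anndata's implementation, and a dict of lists stands in for a standard-compliant dataframe.

```python
class LabelledRows:
    """Hypothetical wrapper: row labels resolved to positions, then positional access."""

    def __init__(self, labels: list, columns: dict):
        # Map each label to its row position; labels must be unique for this to work.
        self._position = {label: i for i, label in enumerate(labels)}
        if len(self._position) != len(labels):
            raise ValueError("row labels must be unique")
        self._columns = columns

    def row(self, label) -> dict:
        """Look up a row by label using only positional indexing on the columns."""
        i = self._position[label]
        return {name: values[i] for name, values in self._columns.items()}


obs = LabelledRows(["cell_1", "cell_2"], {"cell_type": ["B", "T"]})
second = obs.row("cell_2")
```

The underlying frame never sees a label, so nothing beyond positional row access is required from it.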
ok that should be fine then, thanks! How about
df = Dataframe.from_dict(
    {k: v.decode() for k, v in store.items()}
)
, when you do The standard doesn't have a
df_standard = df.__dataframe_consortium_standard__()
namespace = df_standard.__dataframe_namespace__()
df = namespace.dataframe_from_dict(
    {k: v.decode().__column_consortium_standard__() for k, v in store.items()}
) |
Thanks for the work that you're doing here!
I would like to request the addition of DataFrame.set_item_by_name to the API.
From the demonstration at EuroSciPy last week, the ugliest part of the usage was adding columns by:
This seems like an obvious omission, but I haven't found any documentation saying why it hasn't been included. Seems like it was discussed in a call (#138 (comment))?