set_column_by_name/ set_column? #235

Closed
ivirshup opened this issue Aug 22, 2023 · 12 comments · Fixed by #269

@ivirshup

Thanks for the work that you're doing here!

I would like to request the addition of DataFrame.set_item_by_name to the API.

From the demonstration at euroscipy last week, the ugliest part of the usage was adding columns by:

for col in cols:
    df.insert(len(df.get_column_names()), col)

This seems like an obvious omission, but I haven't found any documentation saying why it hasn't been included. Seems like it was discussed in a call (#138 (comment))?

@ivirshup ivirshup changed the title set_column_by_name? set_column_by_name/ set_column? Aug 22, 2023
@MarcoGorelli
Contributor

thanks @ivirshup !

I've got a PR open to add DataFrame.update_column and DataFrame.update_columns - I think this is what you're after? #232

my general issue with set_column was that it tried to do two things:

  • add a column if not present (but where? not clear)
  • update a column

From the demonstration at euroscipy last week, the ugliest part of the usage was

😄 yeah I realised it was pretty "ugly" whilst preparing and giving the presentation - would

  • insert_column(loc, column)
  • update_column(column)

be less ugly in your opinion?
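To make the proposed split concrete, here's a minimal sketch of the two behaviours on a toy dict-backed frame. The names insert_column/update_column follow the suggestion above, but the signatures and semantics here are assumptions for illustration, not the standard's actual API:

```python
# Toy model of the proposed split: insert_column adds a new column at a
# position, update_column replaces an existing one. Names and semantics
# are sketched from this thread, not taken from the standard.

class ToyFrame:
    def __init__(self, columns=None):
        # a dict preserves insertion order, giving ordered, named columns
        self._columns = dict(columns or {})

    def insert_column(self, loc, name, values):
        if name in self._columns:
            raise ValueError(f"column {name!r} already exists")
        items = list(self._columns.items())
        items.insert(loc, (name, list(values)))
        return ToyFrame(dict(items))

    def update_column(self, name, values):
        if name not in self._columns:
            raise KeyError(f"no column named {name!r}")
        new = dict(self._columns)
        new[name] = list(values)  # overwrite keeps the column's position
        return ToyFrame(new)

    def get_column_names(self):
        return list(self._columns)


df = ToyFrame({"a": [1, 2], "b": [3, 4]})
df2 = df.insert_column(1, "c", [5, 6])
print(df2.get_column_names())  # ['a', 'c', 'b']
df3 = df2.update_column("c", [7, 8])
print(df3.get_column_names())  # ['a', 'c', 'b']  (order unchanged)
```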

@ivirshup
Author

my general issue with set_column was that it tried to do two things:

  • add a column if not present (but where? not clear)
  • update a column

I guess I want the behavior you are describing 😅. But I also want there to be a unique solution to DataFrame.get_column_by_name.

Personally, I think of dataframes as Mapping[str, Array] where every array has the same length. This model, which works well across the R and Python libraries I've used, would say "you are overwriting or inserting the column", and I don't think this has been confusing. Of course, many of these implementations have strange ways you can construct dataframes with non-unique column names, which leads to extremely confusing behavior:

e.g. confusing definition of union
In [1]: import pandas as pd

In [2]: pd.Index(["a", "a", "b"]).intersection(pd.Index(["a", "a", "b"]))
Out[2]: Index(['a', 'b'], dtype='object')

In [3]: pd.Index(["a", "a", "b"]).union(pd.Index(["a", "b", "a"]))
Out[3]: Index(['a', 'a', 'b'], dtype='object')
  • add a column if not present (but where? not clear)

I think most libraries do "key insertion order", but I don't really care since I'll be accessing columns by name anyways.
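For what it's worth, plain Python dicts (which guarantee insertion order since 3.7) already model exactly this: unique keys, overwrite-in-place, stable iteration order:

```python
# Columns as Mapping[str, Array]: a plain dict already gives unique
# names plus "key insertion order", which is all the ordering the
# model above requires.
columns = {}
columns["a"] = [1, 2, 3]
columns["b"] = [4, 5, 6]
columns["a"] = [7, 8, 9]  # overwrite: position of "a" is unchanged

print(list(columns))  # ['a', 'b'] - stable, insertion-ordered
```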


I sort of see an argument for saying overwrite is different than insert and these should be distinct. I don't really see an argument for set by position, get by name. I'm also fine with writing:

if col.name not in df.get_column_names():
    df = df.set_column(col)  # or whatever other than position
else:
    raise SomeError(f"{col.name} was taken, be more original")

From the demonstration at euroscipy last week, the ugliest part of the usage was

😄 yeah I realised it was pretty "ugly"

I would actually like to rephrase from "the ugliest" to "the only ugly part".

Well.... maybe the names could be shorter.

@MarcoGorelli
Contributor

Thanks for your thoughts, much appreciated

@kkraus14 has brought up the idea of unordered columns before - maybe we should do that? So then we'd have:

  • insert_column
  • update_column
  • get_column
  • drop_column

and corresponding plural versions.

Then there'd be no need for by_name because that's all they could be

And get_column_names would return a set.

The name get_columns_by_name has been a source of annoyance for me to be honest, I'd be glad if we could move away from it

@kkraus14
Collaborator

@kkraus14 has brought up the idea of unordered columns before - maybe we should do that?

I think I brought up a question about whether we had explicitly decided that columns were ordered or not. From my perspective, it's a common pattern to iterate over the columns in a dataframe, and returning column names as a set doesn't guarantee a consistent iteration order as far as I'm aware, which is undesired behavior.

My stance here is that columns should be both ordered and named. We could update the behavior of df.insert to default to inserting at the end of the dataframe?

@MarcoGorelli
Contributor

We could update the behavior of df.insert to default to inserting at the end of the dataframe?

sounds good - you OK with the other renamings suggested?

@ivirshup
Author

ivirshup commented Aug 23, 2023

@MarcoGorelli your proposal sounds right, but I agree that a set's unstable iteration order is undesirable.

A case where order really matters: visual inspection of the table. It's quite annoying when:

df.head()
# a | b | c
# ...
df.write(store)
df = store.read()
df.head()
# c | a | b
# ...

But you don't need that much order awareness to accomplish this, basically just insertion order and stable iteration order.


We could update the behavior of df.insert to default to inserting at the end of the dataframe?

For insert_column to have a default location, the argument order would have to change as well, no? Does position become a keyword argument?
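One possible shape for that, sketched on a bare list of names (purely hypothetical, not the standard's actual signature): make the position an optional keyword that defaults to appending at the end.

```python
# Hypothetical signature: `loc` becomes an optional keyword, defaulting
# to appending at the end. Sketch only, operating on a list of names.
def insert_column(names, name, *, loc=None):
    """Return a new ordered list of column names with `name` inserted."""
    new = list(names)
    if loc is None:
        loc = len(new)  # default: insert at the end
    new.insert(loc, name)
    return new

print(insert_column(["a", "b"], "c"))         # ['a', 'b', 'c']
print(insert_column(["a", "b"], "c", loc=0))  # ['c', 'a', 'b']
```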


I would also like to make a case for a set_column that overwrites or inserts, but maybe I'll leave that to a separate issue?

@MarcoGorelli
Contributor

MarcoGorelli commented Aug 23, 2023

Sure, let's keep columns ordered then, and default to inserting columns at the end. People can always reorder them with get_columns_by_name and get_column_names() anyway.

for set_column - can't you do:

if col.name in df.get_column_names():
    df = df.set_column(col)
else:
    df = df.insert_column(col)

?
If this is meant to be a developer-focused API, I'd suggest forcing such explicitness

But I also don't really care too much about this particular topic - I'm much more concerned about Ibis and polars-lazy not being compatible with some column-related parts of the API. If you had any thoughts / inputs on #229, that would be appreciated.

Just for my understanding, what's your use case for the dataframe standard?

@ivirshup
Author

ivirshup commented Aug 25, 2023

Just for my understanding, what's your use case for the dataframe standard?

Context

I work on anndata. This library provides the AnnData object, which most tools in the scverse ecosystem (focussed on bioinformatics, in particular single-cell data analysis) use to interchange data. It also has hdf5 and zarr based storage formats. This object contains arrays and dataframe objects and keeps them aligned along two dimensions: observations and variables.

It's sort of like a 2-D xarray.Dataset specialized for exploratory analysis of high-dimensional data, where your data looks like something that could be passed into a scikit-learn transformer. It's not actually an xarray.Dataset because we need support for sparse arrays and dataframes (for 1-D annotations along the observations or variables, e.g. "cell type"). We're currently expanding the types of arrays we can work with, sometimes with help from the array-api, but not as often as we'd like since we have sparse data.

However, we currently only support pandas dataframes, and I'd like to change that. This initiative seems like a great way to do that.

What I'd like to be able to do with arbitrary dataframe types

The big operations we need are:

  • Creation, e.g.
df = DataFrame()
store: h5py.Group | zarr.Group = ...

df = Dataframe.from_dict(
    {k: v.decode() for k, v in store.items()}
)
  • Row based indexing
  • Concatenation

Ideally we'd be able to do these without having to specialize for each dataframe type, as we've been doing for sparse array support.
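For concreteness, here's what those three operations look like in pandas, the currently supported backend (the data here is illustrative; the goal would be the same calls through a backend-agnostic layer):

```python
import pandas as pd

# The three operations listed above, shown with pandas, the currently
# supported backend. The data is made up for illustration.

# Creation from a dict of columns
df = pd.DataFrame({"gene": ["g1", "g2", "g3"], "count": [10, 0, 5]})

# Row-based (positional) indexing
subset = df.iloc[[0, 2]]

# Concatenation along rows
combined = pd.concat([df, df], ignore_index=True)

print(len(subset), len(combined))  # 2 6
```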


Eventually I would also like to support more dataframe types in our analysis libraries, but being able to handle them in our data container would need to happen first.

@MarcoGorelli
Contributor

Thanks @ivirshup

Row based indexing

Could you expand on what you need please? Row labels are disallowed by the standard, would you be able to manage without?

@ivirshup
Author

ivirshup commented Aug 25, 2023

Could you expand on what you need please? Row labels are disallowed by the standard, would you be able to manage without?

Yes. We just need positional indexing of rows. We can implement our own labels on top of that.
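A minimal sketch of what "labels on top of positional indexing" could look like, using plain lists and a dict (all names here are illustrative, not AnnData's actual implementation):

```python
# Sketch of implementing row labels on top of purely positional
# indexing: keep a label -> position mapping alongside the data.
labels = ["cell_1", "cell_2", "cell_3"]
position = {label: i for i, label in enumerate(labels)}

rows = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0]]  # stand-in for a dataframe

def rows_by_label(wanted):
    # translate labels to positions, then index positionally
    return [rows[position[label]] for label in wanted]

print(rows_by_label(["cell_3", "cell_1"]))  # [[3.0, 1.0], [1.0, 0.0]]
```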

@MarcoGorelli
Contributor

ok that should be fine then, thanks!

How about

df = Dataframe.from_dict(
    {k: v.decode() for k, v in store.items()}
)

When you do v.decode() there, what object is that?

The standard doesn't have a dataframe.from_dict method for creating a dataframe from a dict of columns, so you could do something like:

df_standard = df.__dataframe_consortium_standard__()
namespace = df_standard.__dataframe_namespace__()
df = namespace.dataframe_from_dict(
    {k: v.decode().__column_consortium_standard__() for k, v in store.items()}
)

@MarcoGorelli
Contributor

Hi @ivirshup - if you're still interested in this, you may be interested in Narwhals

I think it covers your needs (creation, indexing, concatenation) - I also noticed that someone has already built a library annsel on top of your AnnData library using Narwhals
