Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make checking var_names and obs_names for uniqueness optional #1112

Closed
niklasmueboe opened this issue Aug 29, 2023 · 5 comments · Fixed by #1507
Closed

Make checking var_names and obs_names for uniqueness optional #1112

niklasmueboe opened this issue Aug 29, 2023 · 5 comments · Fixed by #1507

Comments

@niklasmueboe
Copy link

niklasmueboe commented Aug 29, 2023

By default when creating a new AnnData instance in the _init_as_actual function both var_names and obs_names are checked for uniqueness. When working with AnnData objects with several million obs/var this can become a performance limiting factor taking multiple seconds or even minutes.

Making the uniqueness check optional would allow the user the option to skip it when instantiating a new AnnData object e.g. via ad.AnnData(X=X, ... , check_unique=False). This could be useful to improve performance especially when the input data is already guaranteed to have unique obs_names and var_names.

This was originally raised (with an example implementation) as a PR #1081

@ivirshup
Copy link
Member

ivirshup commented Aug 29, 2023

Thanks for the issue!

Why the current behavior is like it is

AnnData checks for uniqueness on instantiation for two reasons:

  1. During many operations, like label based indexing, pandas checks for uniqueness of the labels. Since the uniqueness check is cached, this means we are just getting this check out of the way somewhere where we can warn about it.
  2. Many label based operations have unintuitive behavior when indices are not unique. For example plot "gene" when two variables called "gene" are present. Sometimes operations will use the first result, some operations will throw an error.

I think that the unexpected behavior is bad enough that AnnData should take steps to make sure library developers and users don't run into it.

Alternatives

However, it could be handled differently. Some options that have been considered:

  • Allow labels with faster uniqueness checks (e.g. fixed length types like UUIDs). However, pandas has updated their uniqueness check recently and it's really fast. Actually kind of hard to beat: PERF: Indexes with arrow arrays covert the arrays to numpy object arrays, and that's very slow pandas-dev/pandas#50121 (comment)
  • Defer uniqueness checks to when label based indexing happens. This is fairly similar to your PR, but ideally we would do a little more handling of it, allowing us to warn users. This would probably need to be verified against usage in scanpy + other modules.
  • Allow users to promise that their indices are unique. This one is difficult since pandas does not have a public way for you to set is_unique on an index.

See also:

So, we're up for better solutions, but I would like to keep some level of safety. It would be nice to see benchmarks on possible speed improvements.

How much time are you seeing taken by the uniqueness check? And what do you think of this?

@niklasmueboe
Copy link
Author

In my case I sometimes have up to ~200M obs (var is irrlevant in comparison) and then uniqueness checks can take around 2 mins.

I wasn't aware of the fact that pandas does uniquenesss checks when label-indexing anyway, and deferring the uniqueness check than would not necessary be beneficial (at least in my use case), because the computation will be done sooner or later.

I think that as long as pandas does not offer a solution for this (i.e. some public API for uniqueness guarantees) finding a real solution might be tricky.

Sidenote:
The issue partially arises due to the fact that only string indices are allowed while natively a MultiIndex of integers would be more suitable for me (and much faster, although not perfect). But I guess getting MultiIndex support might be even harder?

@ivirshup
Copy link
Member

ivirshup commented Aug 30, 2023

Could you tell me a little more about what your data is? E.g. what are your observations?

But I guess getting MultiIndex support might be even harder?

I think so, but I've always had a hard time with MultiIndexes. Could you share a bit how you'd use them?

I would like to detach the axis labels from the obs and var dataframes (a little like xarray) and possibly allow more variety in index types. MultiIndexs though are particularly weird.

I think we'll start allowing non-string non-integer labels before that, but I'm not sure it will help performance much in this case.

MultiIndex of integers

Integer values as indices is particularly hard because it's ambiguous with positional indexing (as discussed in previous issues on this topic). I have been wondering about having a 'label-less' dimension, where only positional indexing is allowed.

But, either solution is liable to break a lot of downstream code.

@niklasmueboe
Copy link
Author

Integer values as indices is particularly hard because it's ambiguous with positional indexing (as discussed in previous issues on this topic). I have been wondering about having a 'label-less' dimension, where only positional indexing is allowed.

For MultiIndexes the disambiguation could be made by only allowing tuples for slicing i.e. adata[1, :] would be assumed to be positional indexing and adata[(1,), :] would be "label-based" MultiIndex of Integers. But I understand the problems. Especially because in pandas df.loc[1, :] is also allowed as label-based indexing.

@flying-sheep
Copy link
Member

Fixed in #1507

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants