Make checking var_names and obs_names for uniqueness optional #1112

niklasmueboe · 2023-08-29T12:13:11Z

By default, when a new AnnData instance is created in the _init_as_actual function, both var_names and obs_names are checked for uniqueness. For AnnData objects with several million obs/var this can become a performance bottleneck, taking multiple seconds or even minutes.

Making the uniqueness check optional would let the user skip it when instantiating a new AnnData object, e.g. via ad.AnnData(X=X, ... , check_unique=False). This would improve performance, especially when the input data is already guaranteed to have unique obs_names and var_names.
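
To get a rough feel for the cost on a given machine, the check can be timed on a plain pandas index. This is only a sketch: it assumes the cost is dominated by pandas' Index.is_unique on string labels, and time_uniqueness_check is a hypothetical helper name, not part of AnnData.

```python
import time

import numpy as np
import pandas as pd


def time_uniqueness_check(n_labels: int) -> tuple[bool, float]:
    """Build an index of n_labels distinct string labels and time is_unique.

    Hypothetical helper for illustration; not an AnnData function.
    """
    labels = pd.Index(np.arange(n_labels).astype(str))
    start = time.perf_counter()
    unique = labels.is_unique  # first access performs the actual check
    elapsed = time.perf_counter() - start
    return bool(unique), elapsed


unique, seconds = time_uniqueness_check(1_000_000)
print(f"unique={unique}, took {seconds:.3f}s")
```

Raising n_labels towards the real number of obs should show how the time scales; the absolute numbers depend on the pandas version and the label type.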

This was originally raised (with an example implementation) as PR #1081.


ivirshup · 2023-08-29T12:47:04Z

Thanks for the issue!

Why the current behavior is like it is

AnnData checks for uniqueness on instantiation for two reasons:

During many operations, like label based indexing, pandas checks for uniqueness of the labels. Since the uniqueness check is cached, this means we are just getting this check out of the way somewhere where we can warn about it.
Many label based operations have unintuitive behavior when indices are not unique. For example, plotting "gene" when two variables called "gene" are present. Some operations will use the first result, while others will throw an error.
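
A small pandas-only sketch of how duplicated labels change behavior. The series and its labels are made up for illustration, and this shows pandas itself rather than any AnnData code path.

```python
import pandas as pd

# Two entries share the label "gene" (made-up example data).
expr = pd.Series([1.0, 2.0, 3.0], index=["gene", "gene", "other"])

# A label lookup on a duplicated label returns every match, not a scalar.
print(expr.loc["gene"])   # a Series with two entries
print(expr.loc["other"])  # a single value

# Other operations refuse duplicated labels outright.
try:
    expr.reindex(["gene", "other"])
except ValueError as err:
    print("reindex failed:", err)
```

Code written against unique labels would get a scalar from the first lookup style and silently receive a Series here, which is the kind of surprise the check is meant to surface early.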

I think that the unexpected behavior is bad enough that AnnData should take steps to make sure library developers and users don't run into it.

Alternatives

However, it could be handled differently. Some options that have been considered:

Allow labels with faster uniqueness checks (e.g. fixed length types like UUIDs). However, pandas has updated their uniqueness check recently and it's really fast. Actually kind of hard to beat; see the comment on pandas-dev/pandas#50121 ("PERF: Indexes with arrow arrays covert the arrays to numpy object arrays, and that's very slow").
Defer uniqueness checks to when label based indexing happens. This is fairly similar to your PR, but ideally we would do a little more handling of it, allowing us to warn users. This would probably need to be verified against usage in scanpy + other modules.
Allow users to promise that their indices are unique. This one is difficult since pandas does not have a public way for you to set is_unique on an index.
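
The "defer the check" option could be sketched roughly as follows. DeferredUniqueLabels and its positions method are hypothetical names invented for this illustration; this is not how AnnData is implemented, only one way the idea of checking once, at first label-based access, and warning the user might look.

```python
import warnings

import numpy as np
import pandas as pd


class DeferredUniqueLabels:
    """Hypothetical wrapper: no uniqueness check at construction time.

    The check runs once, on the first label-based lookup, and warns
    instead of raising.
    """

    def __init__(self, labels):
        self._index = pd.Index(labels)
        self._checked = False

    def _ensure_checked(self) -> None:
        if self._checked:
            return
        if not self._index.is_unique:
            warnings.warn(
                "Labels are not unique; label-based lookups may be ambiguous.",
                UserWarning,
                stacklevel=3,
            )
        self._checked = True

    def positions(self, label) -> np.ndarray:
        """Return the integer positions of all entries equal to `label`."""
        self._ensure_checked()
        return np.flatnonzero(self._index == label)


labels = DeferredUniqueLabels(["a", "b", "a"])  # no check happens here
print(labels.positions("a"))  # the first lookup triggers the check and warning
```

Purely positional workflows never pay for the check in this design, but anything that eventually indexes by label still does, so it moves the cost rather than removing it.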

See also:

Feature request - var_names/obs_names as fixed-sized types (integer or bytes) #777

So, we're up for better solutions, but I would like to keep some level of safety. It would be nice to see benchmarks on possible speed improvements.

How much time are you seeing taken by the uniqueness check? And what do you think of this?

niklasmueboe · 2023-08-30T12:28:57Z

In my case I sometimes have up to ~200M obs (var is irrelevant in comparison), and then the uniqueness check can take around 2 minutes.

I wasn't aware that pandas does uniqueness checks when label-indexing anyway. Deferring the uniqueness check would then not necessarily be beneficial (at least in my use case), because the computation will be done sooner or later.

I think that as long as pandas does not offer a solution for this (i.e. some public API for uniqueness guarantees) finding a real solution might be tricky.

Sidenote:
The issue partially arises due to the fact that only string indices are allowed while natively a MultiIndex of integers would be more suitable for me (and much faster, although not perfect). But I guess getting MultiIndex support might be even harder?

ivirshup · 2023-08-30T12:39:43Z

Could you tell me a little more about what your data is? E.g. what are your observations?

But I guess getting MultiIndex support might be even harder?

I think so, but I've always had a hard time with MultiIndexes. Could you share a bit how you'd use them?

I would like to detach the axis labels from the obs and var dataframes (a little like xarray) and possibly allow more variety in index types. MultiIndexes, though, are particularly weird.

I think we'll start allowing non-string non-integer labels before that, but I'm not sure it will help performance much in this case.

MultiIndex of integers

Integer values as indices is particularly hard because it's ambiguous with positional indexing (as discussed in previous issues on this topic). I have been wondering about having a 'label-less' dimension, where only positional indexing is allowed.
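
The ambiguity is easy to see in plain pandas, which avoids it by having two separate accessors. The series below is made-up illustration data; the point is that a single adata[...] style accessor would have to pick one of the two meanings.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the entry whose label is 0
by_position = s.iloc[0]  # the entry at position 0

print(by_label, by_position)
```

With one indexing operator, the key 0 alone cannot say whether the label or the position is meant.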

But, either solution is liable to break a lot of downstream code.

niklasmueboe · 2023-09-04T06:49:43Z

Integer values as indices is particularly hard because it's ambiguous with positional indexing (as discussed in previous issues on this topic). I have been wondering about having a 'label-less' dimension, where only positional indexing is allowed.

For MultiIndexes, the disambiguation could be made by only allowing tuples for label-based slicing, i.e. adata[1, :] would be assumed to be positional indexing and adata[(1,), :] would be "label-based" indexing into a MultiIndex of integers. But I understand the problems, especially because in pandas df.loc[1, :] is also allowed as label-based indexing.

flying-sheep · 2024-06-07T07:25:15Z

Fixed in #1507

niklasmueboe added the enhancement label Aug 29, 2023

ivirshup added performance 🐌 topic: indexing labels Aug 29, 2023

ilan-gold mentioned this issue Jan 23, 2024

(feat): add methods for changing settings #1270

Merged

3 tasks

flying-sheep mentioned this issue Jun 7, 2024

(feat): add check_uniqueness setting #1507

Merged

3 tasks

flying-sheep closed this as completed Jun 7, 2024
