Make checking var_names and obs_names for uniqueness optional #1112
Comments
Thanks for the issue!

Why the current behavior is like it is

AnnData checks for uniqueness on instantiation for two reasons:

I think that the unexpected behavior is bad enough that AnnData should take steps to make sure library developers and users don't run into it.

Alternatives

However, it could be handled differently. Some options that have been considered:

See also:

So, we're up for better solutions, but I would like to keep some level of safety. It would be nice to see benchmarks on possible speed improvements. How much time are you seeing taken by the uniqueness check? And what do you think of this?
In my case I sometimes have up to ~200M

I wasn't aware that pandas does uniqueness checks when label-indexing anyway, so deferring the uniqueness check would then not necessarily be beneficial (at least in my use case), because the computation will be done sooner or later. I think that as long as pandas does not offer a solution for this (i.e. some public API for uniqueness guarantees), finding a real solution might be tricky.

Sidenote:
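On the benchmark question raised earlier in the thread, a rough way to get a number is to time `pd.Index.is_unique` on synthetic string labels. This is only a sketch: the helper name `time_uniqueness_check` and the label format are made up for illustration, and real timings depend on the label dtype, the pandas version and the machine.

```python
import time

import numpy as np
import pandas as pd


def time_uniqueness_check(n_labels: int) -> float:
    """Return the seconds spent checking n_labels synthetic string labels."""
    # Build a fresh Index on every call, since pandas may cache the result
    # of is_unique on the Index object and a second check would look free.
    labels = pd.Index(np.arange(n_labels).astype(str))
    start = time.perf_counter()
    unique = labels.is_unique
    elapsed = time.perf_counter() - start
    assert unique  # the synthetic labels are unique by construction
    return elapsed


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} labels: {time_uniqueness_check(n):.4f} s")
```

Scaling `n` up towards the sizes mentioned in this thread should show whether the check itself, rather than building the labels, is what dominates.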
Could you tell me a little more about what your data is? E.g. what are your observations?
I think so, but I've always had a hard time with MultiIndexes. Could you share a bit how you'd use them? I would like to detach the axis labels from the I think we'll start allowing non-string non-integer labels before that, but I'm not sure it will help performance much in this case.
Integer values as indices are particularly hard because they're ambiguous with positional indexing (as discussed in previous issues on this topic). I have been wondering about having a 'label-less' dimension, where only positional indexing is allowed. But either solution is liable to break a lot of downstream code.
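The ambiguity can be seen in plain pandas, where the same integer means different things depending on whether it is treated as a label or as a position. This is a minimal illustration using a `pd.Series`, not AnnData's own indexing code.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # label 0 sits at position 1
by_position = s.iloc[0]  # position 0 carries label 2

print(by_label, by_position)
```

pandas avoids the ambiguity by making the caller choose between `.loc` and `.iloc`. A single `adata[0]`-style accessor would have to guess.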
For MultiIndexes the disambiguation could be made by only allowing tuples for slicing, i.e.
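For reference, this is how tuple-based label lookup already works on a pandas `MultiIndex`. It is an illustration in plain pandas with made-up level names and values, not the snippet from the original comment and not an AnnData API.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("sample_a", 0), ("sample_a", 1), ("sample_b", 0)],
    names=["sample", "cell"],
)
s = pd.Series([10, 20, 30], index=index)

# A full tuple is always a label lookup across all levels.
by_label = s.loc[("sample_a", 1)]

# A bare integer through .iloc is always positional.
by_position = s.iloc[2]

print(by_label, by_position)
```

Under such a rule an integer could never be confused with a label, because labels would always be tuples.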
Fixed in #1507
By default, when creating a new AnnData instance, both `var_names` and `obs_names` are checked for uniqueness in the `_init_as_actual` function. When working with AnnData objects with several million `obs`/`var`, this can become a performance-limiting factor, taking multiple seconds or even minutes.

Making the uniqueness check optional would give the user the option to skip it when instantiating a new AnnData object, e.g. via `ad.AnnData(X=X, ... , check_unique=False)`. This could be useful to improve performance, especially when the input data is already guaranteed to have unique `obs_names` and `var_names`.

This was originally raised (with an example implementation) as PR #1081
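A minimal sketch of the shape such an opt-out could take, using a toy class rather than AnnData itself. `LabeledAxis` is hypothetical, `check_unique` is the parameter name proposed in this issue and not an existing AnnData argument, and raising a `ValueError` on duplicates is a choice made for this sketch only.

```python
import pandas as pd


class LabeledAxis:
    """Toy stand-in for an annotated axis; NOT AnnData's implementation."""

    def __init__(self, names, check_unique: bool = True):
        self.names = pd.Index(names)
        # The potentially expensive check only runs when requested.
        if check_unique and not self.names.is_unique:
            raise ValueError("Axis names are not unique.")


# Default: duplicates are detected at construction time.
checked = LabeledAxis(["cell_0", "cell_1"])

# Opt-out: the caller takes responsibility for uniqueness.
unchecked = LabeledAxis(["cell_0", "cell_0"], check_unique=False)
```

The trade-off is that with `check_unique=False` duplicate labels go unnoticed until a later label-based lookup runs into them.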