Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory consumption increase for anndata objects #363

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

fhausmann
Copy link

Fixes the increased memory consumption when creating multiple anndata objects.
Introducing weak circular references which doesn't interfere with the standard reference count based garbage collector.
Add one additional test for checking this behaviour for future changes.

Fixes #360.

@ivirshup
Copy link
Member

ivirshup commented May 5, 2020

Thanks for the PR!

One potential issue, what should this do?

from anndata import AnnData
import numpy as np

obsm = AnnData(np.ones((10, 10))).obsm
obsm["a"] = np.zeros((10, 5))
traceback as of bf3e6bc
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-82d6b89026db> in <module>
      1 obsm = AnnData(np.ones((10, 10))).obsm
----> 2 obsm["a"] = np.zeros((10, 5))

~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
    150 
    151     def __setitem__(self, key: str, value: V):
--> 152         value = self._validate_value(value, key)
    153         self._data[key] = value
    154 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
    214                 f"value.index does not match parent’s axis {self.axes[0]} names"
    215             )
--> 216         return super()._validate_value(val, key)
    217 
    218 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
     49         """Raises an error if value is invalid"""
     50         for i, axis in enumerate(self.axes):
---> 51             if self.parent.shape[axis] != val.shape[i]:
     52                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
     53                 raise ValueError(

AttributeError: 'NoneType' object has no attribute 'shape'

@fhausmann
Copy link
Author

fhausmann commented May 5, 2020

I think, this is the right/exected behaviour.
For example if you do: y = dict(x=5)['x'] the dictionary is evicted from memory and you cannot access it any more. The same happens in the code above. The anndata object is evicted from memory, the weakref in AlignedMapping gets None and only obsm is kept in memory.
So I think it's the expected behaviour expect that the error message is unclear.

For example the following code works as expected.

adata = AnnData(np.ones((10, 10)))
obsm =adata.obsm
obsm["a"] = np.zeros((10, 5))

However, if you want to keep the current behaviour, I would suggest to store the parent.shape in the AlignedMapping object and update it whenever accessed (and the parent is not None). This would make AlignedMapping self-contained.

What are your thoughts on this ?

@ivirshup
Copy link
Member

ivirshup commented May 5, 2020

I think it's confusing from an API standpoint, since to users the aligned mapping classes should just be like dicts. This is what stopped me from using weakref.proxys here in the first place.

However, if you want to keep the current behaviour, I would suggest to store the parent.shape in the AlignedMapping object and update it whenever accessed (and the parent is not None).

That could work! You'd also need to keep a reference to the indexes.

As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular. For example, something like:

    @property
    def obsm(self) -> Union[AxisArrays, AxisArraysView]:
        """\
        Multi-dimensional annotation of observations
        (mutable structured :class:`~numpy.ndarray`).

        Stores for each key a two or higher-dimensional :class:`~numpy.ndarray`
        of length `n_obs`.
        Is sliced with `data` and `obs` but behaves otherwise like a :term:`mapping`.
        """
        if self.is_view:
            obsm = AxisArrays(self._adata_ref, 0, self._obsm)
            return obsm._view(self, self.obs_names)
        else:
            return AxisArrays(self, 0, self._obsm)

@fhausmann
Copy link
Author

That could work! You'd also need to keep a reference to the indexes.

This is already present in self.dim_names, therefore I guess it's fine.

As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular.

Yes this would also work, but has the disadvantage to keep the anndata object in memory if only obsm/varm is accessible, which means you keep the potential memory expensive X array, even if you can't use it. Additionally, you would do these _validate_value each time you access obsm/varm, which could be expensive in terms of running time, but I'm not sure about this.

Therefore I tried the shape thing first. I'm happy to hear your comments!

@ivirshup
Copy link
Member

ivirshup commented May 6, 2020

but has the disadvantage to keep the anndata object in memory if only obsm/varm is accessible

A lot of the behavior in this packages assumes there will be a live instance of an AnnData object. I don't think these classes make much sense without a live parent object, since they largely exist to maintain copy-on-write semantics.

Additionally, you would do these _validate_value each time you access obsm/varm, which could be expensive in terms of running time, but I'm not sure about this.

I think this could be avoided. You would only need to validate items whenever they're added, and on AnnData instantiation. Otherwise, they should only be modified through the __getitem__ and __setitem__ methods.

@fhausmann
Copy link
Author

Ok, I can understand that this is a valid design assumption, however a very strong one in my eyes, because it's hard to maintain a living parent for all children without inducing circular references.

Do you think your idea above is enough to have less API / behaviour changes while resolving the issue ?
Then I'm happy to implement your idea, but otherwise I think I don't have enough time to think about and implement different ideas.

@ivirshup
Copy link
Member

ivirshup commented May 6, 2020

because it's hard to maintain a living parent for all children without inducing circular references.

Ah, but until now it didn't seem like there was any downside to having circular references 😄. Looks like we've just managed to hit an edge case of python's garbage collection (a small amount of large objects)

Do you think your idea above is enough to have less API / behaviour changes while resolving the issue ?

I'm happy for whatever solution. It's very possible there would be problems with my idea that I haven't considered yet. It's just what I had thought of before you opened this PR. And my bias is towards less code that deals with state.

@fhausmann
Copy link
Author

Looks like we've just managed to hit an edge case of python's garbage collection (a small amount of large objects)

Yes, it's definitely an edge case and a downside of the python garbage collection, which you hit. You lucky one 😄

And my bias is towards less code that deals with state.

Yes this is a downside of the new code. One could make a base class which handles the state, where all these objects inherit from, but still you have to manage them at least once with the current solution.

Comment on lines 85 to 88
@property
def parent_shape(self) -> Tuple[int, int]:
return self._parent.shape

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the shape tuple just be an attribute storing a tuple? Previously, I avoided this to reduce redundancy, but I think there's more complexity introduced by all the getters and setters. This would also remove the need for the AnnData.__del__ method.

I was checking to see if this might break inplace subsetting, but it looks like that was already a bit broken:

from anndata.tests.helpers import gen_adata

a = gen_adata((200, 100))
o1 = a.obsm
a._inplace_subset_obs(slice(50))
o2 = a.obsm
assert o1["array"].shape != o2["array"].shape
assert o1.parent is o2.parent

Copy link
Author

@fhausmann fhausmann May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced this for exactly this case (the anndata object changes shape), which then should be propagated to the AxisArrays. We can drop these getters, setters and the __del__ method for sure, when propagated at all functions inducing different shapes explicitly. However, I'm not sure, that's so easy to find them all (or it is only these _inplace_subset_* functions) ?

I guess also on master it's expected that o1 is changed ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or it is only these inplace_subset* functions

The shape of the instance should only change via the _inplace_subset_ methods.

We can drop these getters, setters and the del method for sure, when propagated at all functions inducing different shapes explicitly.

Which properties are you referring to here?

I guess also on master it's expected that o1 is changed ?

I think it would either make sense that it change, or it no longer referred to the parent of the wrong shape.

Don't feel obligated to fix things that were already broken on master in this PR. If it's not intimately related, it's probably better to open an issue and fix it in a separate PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which properties are you referring to here?

    @property
    def parent_shape(self) -> Tuple[int, int]:
        if self._parent:
            self._parent_shape = self._parent.shape
        return self._parent_shape

    @parent_shape.setter
    def parent_shape(self, shape: Tuple[int, int]):
        self._parent_shape = shape

for example. When we can ensure, that the parent does not change shape other than _inplace_subset_, like you said, we can drop these parent_shape property of AxisArrays and the __del__ method of anndata, store a tuple of the parent shape instead and update it in the _inplace_subset_ methods when necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code to drop the __del__ method and the parent_shape property.
We still need it once in the AlignedMapping class otherwise, a lot of subclasses has to be changed to.
parent_shape does not need to be updated in _inplace_subsets_ because they get newly created there.

Additionally I had a closer look to your solution above.

As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular.

However, then _obsm cannot be AxisArrays, otherwise you still have the circularity and if it's not an AxisArrays it breaks a lot of other functions and would require a lot of code changes to fix it or am I wrong ?

Copy link
Author

@fhausmann fhausmann Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great, thanks. I created a branch here: https://github.com/fhausmann/anndata/tree/memory_fix_lazy_obsm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see what's happening. We try to normalize passed indices to slice, int, array[int], or array[bool] types as soon as possible. The normalized indices are stored in view in the _oidx and _vidx attributes. AxisArray._view is only expecting to see those types as subset indices. This can be fixed by changing how the view is made in your obsm getter to:

return obsm._view(self, self._oidx)
# Instead of
return obsm._view(self, self.obs_names)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this, including some other small changes. However there is now another issue.
I had to introduce this https://github.com/fhausmann/anndata/blob/aa3e55b5e6251bc373a8ab94e8e11d7fc6d16dba/anndata/_core/aligned_mapping.py#L234-L237
To update the parent when obsm is modified.

However there still seem to be an issue pointed out by the following test: anndata/tests/test_base.py::test_setting_dim_index[obs]
If you create an anndata object containing obsm=dict('df'=pd.Dataframe) , copy it and create a view it looks like as all objects are referring to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True at https://github.com/theislab/anndata/blob/58886f09b2e387c6389a2de20ed0bc7d20d1b843/anndata/tests/test_base.py#L187

I think it can be fixed with creating a copy when modifying a value in _obsm. However this leads to infinite recursion when trying to create an anndata object from a view inplace.

Additionally I think, these changes are now (not only) for fixing the original issue, but changing the anndata architecture. Should we now create a new pull request for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the AxisArray shouldn't have it's own version of the values. That should just be a reference to the values held by the parent. That is, I would think the __init__ should just do something like self._data = parent._obsm.

If you create an anndata object containing obsm=dict('df'=pd.Dataframe), copy it and create a view it looks like as all objects are referring to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True at

Is the code for this like:

curr = orig[:, :].copy()
# or 
curr = orig.copy()[:, :]

Either way, I agree this doesn't look right. But if curr is a view, I'm not sure it should have values for ._obsm. It also looks like the .copy method isn't actually making a copy of the dataframe if this is true, so that would be another thing to look into.

Additionally I think, these changes are now (not only) for fixing the original issue, but changing the anndata architecture. Should we now create a new pull request for this?

I wouldn't worry too much about "architecture changes". A lot of the work I've done on this package has been to make changing the architecture easier. It's up to you how you'd like to organize this, but I often find starting a fresh PR/ branch helpful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we close this than ? Or do you think it could be worth considering if the lazy creation doesn't work out ?

fhausmann added 2 commits May 7, 2020 09:15
- __del__ from anndata because subssetting only occur in
 'inplace_subset_' methods where obsm, varm where also updated
- Therefore also the parent_shape setter and getter of AxisArrays
 was now unused
fhausmann added a commit to fhausmann/anndata that referenced this pull request Jun 30, 2020
@keithgmitchell
Copy link

Any update on this? Thanks.

@codecov
Copy link

codecov bot commented Sep 1, 2023

Codecov Report

Merging #363 (5fff6cf) into main (22f33bb) will decrease coverage by 2.02%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #363      +/-   ##
==========================================
- Coverage   84.90%   82.88%   -2.02%     
==========================================
  Files          36       36              
  Lines        5133     5189      +56     
==========================================
- Hits         4358     4301      -57     
- Misses        775      888     +113     
Flag Coverage Δ
gpu-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
anndata/_core/aligned_mapping.py 94.73% <100.00%> (+1.07%) ⬆️
anndata/_core/file_backing.py 92.47% <100.00%> (+1.00%) ⬆️

... and 7 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Anndata not properly garbage collected
3 participants