Anndata not properly garbage collected #360
I think I've noticed this before but haven't been able to figure out what's going on. If it were a simple memory leak where the memory use just kept increasing, that would make sense to me, and would obviously be a leak. But the memory sometimes gets collected. Do you have any idea what could be going on?
Thank you for your quick reply.
One point I would like to know: can your machine run out of memory due to objects which could be collected? Is it possible that collection of these objects is triggered based on need?
Yes, it can. Not in the single-process example I showed above, but when I do it with multiprocessing, my machine runs out of memory. See the code below for an example. This fills 32 GB of RAM quite fast with an anndata file that usually takes 1.65 GB in RAM.
There is a workaround: setting mp.Pool(3, maxtasksperchild=1) means that when a (sub)process exits, the memory is garbage collected as expected, but if the processes are reused, the anndata objects still accumulate.
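A hedged sketch of the kind of loop described above, not the original snippet; the file name, pool size, and task count are illustrative assumptions:

```python
# Hedged sketch, not the original snippet: repeatedly load the same .h5ad file
# in a small pool of worker processes. When workers are reused, the AnnData
# objects accumulate; maxtasksperchild=1 recycles each worker after one task,
# so the process exit frees the memory.
import multiprocessing as mp
import anndata

FILENAME = "data.h5ad"  # illustrative path

def load_only(filename):
    adata = anndata.read_h5ad(filename)
    return adata.shape  # return something small; the AnnData itself goes out of scope

if __name__ == "__main__":
    with mp.Pool(3, maxtasksperchild=1) as pool:  # the workaround described above
        print(pool.map(load_only, [FILENAME] * 30))
```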
The issue seems to be circular references, which cannot be resolved by the standard reference-counting garbage collector; they can only be resolved by the full garbage collector (gc.collect()). The culprits seem to be _parent in anndata._core.aligned_mapping.Layers, adata in anndata._core.file_backing.AnnDataFileManager, and probably some more, which you can show by:
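A hedged illustration of how those referrers can be surfaced with the standard gc module (the array size is arbitrary and only serves the example):

```python
# Illustrative only, not the original snippet: list the objects that still hold
# a reference to an AnnData instance, then see how much work the cycle
# collector has to do once the name is deleted.
import gc
import numpy as np
import anndata as ad

adata = ad.AnnData(np.zeros((1000, 1000)))

for referrer in gc.get_referrers(adata):
    print(type(referrer))  # internal helpers such as Layers show up via their __dict__

del adata
print("objects collected by gc.collect():", gc.collect())
```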
In total I get 6 circular and 1 actual reference for the anndata object.
Glad to see you opened a PR for this! Just a general point about this as an issue though. Unless this memory usage is being triggered by non-pathological code, I'm not sure it justifies huge changes in APIs or intended behavior to fix it. In practice, I think the gc runs frequently enough that these objects get collected. All the example code we have here is pretty weird, just allocating large amounts of memory and doing nothing with it. From what I can see, making a plot or running a PCA seems to reliably deallocate the memory. I'd definitely like to see this fixed, but I would like to be conservative about how it's fixed.
Thank you again for your quick responses and willingness to help.
I can understand that you don't want any changes in APIs or similar, and I'm happy to take your ideas so that we can avoid this.
Sure, these code snippets are very pathological, but we see this behaviour also in non-pathological situations/pipelines where we process or copy anndata objects a lot. However, that code is obviously too large to show here.
Additionally, I cannot agree that running PCA reliably deallocates the memory.
Those examples sound more reasonable. I just gave this case a try with your PR and the current master, but I wasn't able to see any consistent improvement in memory usage on your PR branch. I'm not completely sure what to make of all this, since I've had generally negative experiences with memory usage and multiprocessing. Here are the results I recorded.

The script I used is below. Data was generated with:

```python
from scipy import sparse
import scanpy as sc

(
    sc.AnnData(sparse.random(50000, 10000, format="csr"))
    .write_h5ad("test_sparse.h5ad", compression="lzf")
)
```

I was also using scanpy master for the efficient sparse PCA implementation.

```python
import os
os.environ["OMP_NUM_THREADS"] = "4"

import anndata
import scanpy as sc
import multiprocessing as mp
import gc

_ANNDATA_FILENAME = "./test_sparse.h5ad"

def do_pca(filename):
    data = anndata.read_h5ad(filename)
    data.layers["dense"] = data.X.toarray()  # To increase memory usage
    sc.tl.pca(data)
    return 0

with mp.Pool(2) as pool:
    results = pool.starmap_async(do_pca, [(_ANNDATA_FILENAME,) for _ in range(30)])
    print(results.get())
```

Of course, YMMV. Have you seen memory usage improvements for your workflows using your PR?
I figured this one out, and am now able to see improved memory usage. Basically accessing
Good to hear. As an aside, I've generally had much better experiences with resource handling through dask.
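For context, a hedged sketch of what scheduling the same tasks through dask could look like; it reuses do_pca and _ANNDATA_FILENAME from the benchmark script earlier in the thread and is an assumption about the setup, not the workflow actually being referred to:

```python
# Hedged sketch: run the PCA tasks on a small local dask cluster instead of a
# multiprocessing.Pool, so each worker process manages its own memory.
import dask
from dask.distributed import Client

if __name__ == "__main__":
    client = Client(n_workers=2)
    tasks = [dask.delayed(do_pca)(_ANNDATA_FILENAME) for _ in range(30)]
    print(dask.compute(*tasks))
    client.close()
```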
Now KeyOverloads only hold a proxy reference to the parent object. This prevents these objects from having to wait for the generational garbage collector, which was bad since it could keep large AnnData objects in memory. See scverse#360 for more info.
Thanks for the fix and the hint about dask!
I just ran into a similar issue with the current anndata version 0.7.5. Unfortunately I don't know enough about Python's garbage collection to fix it myself, but maybe a really trivial example might help you guys fix it:

```python
# The %mprun magic only works if do_stuff is defined somewhere else in a file
def do_stuff(adata):
    copy0 = adata.copy()
    del copy0
    copy1 = adata.copy()
    del copy1
    copy2 = adata.copy()
    del copy2
```

```python
# Requires the memory_profiler IPython extension (%load_ext memory_profiler)
import anndata as ad
import numpy as np

adata = ad.AnnData(np.ones((10000, 10000)))

# copy numpy array
%mprun -f do_stuff do_stuff(adata.X)
# copy anndata
%mprun -f do_stuff do_stuff(adata)
```

Output for the numpy part:
Output for the anndata part:
With some luck, partial garbage collection takes place randomly in some runs, but over time a net leak remains. Since it was mentioned that the issue would not occur in real life, a few words about my use case: I have an adata of a couple of GB and run a method for O(10) different parameter sets. Each call of the method should not change the original adata and therefore uses a local copy to do its thing. All (many) of these local copies accumulate in memory, eventually crashing the program. The
This works for me. I tried both adata = None then gc.collect() and del adata then gc.collect(), and only the latter one works.
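To make that workaround concrete, a hedged sketch of the copy-per-parameter-set pattern from the use case above; run_method and param_grid are placeholder names for the reporter's own code, not anndata API:

```python
# Hedged sketch of the explicit-cleanup workaround: delete the local copy and
# force a full collection after every parameter set.
import gc

for params in param_grid:          # placeholder: the O(10) parameter sets
    local = adata.copy()           # work on a copy so the original stays untouched
    run_method(local, **params)    # placeholder for the actual analysis step
    del local                      # rebinding to None alone was reported not to work
    gc.collect()                   # break AnnData's internal reference cycles now
```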
The issue seems to persist. Running a function that creates an AnnData object, runs PCA, Louvain & UMAP (with the scanpy interface), and returns the AnnData object: at each iteration the amount of used memory increases regardless. I would say this is typical usage, running a pipeline with multiple parameters in order to evaluate the differences.
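A hedged sketch of the kind of loop being described; the data size, neighbor settings, and parameter values are illustrative assumptions, not the reporter's actual pipeline:

```python
# Illustrative sketch: build an AnnData, run PCA, Louvain and UMAP via scanpy,
# and return the object, once per parameter value.
import numpy as np
import anndata as ad
import scanpy as sc

def run_pipeline(n_neighbors):
    adata = ad.AnnData(np.random.rand(5000, 2000).astype(np.float32))
    sc.pp.pca(adata)
    sc.pp.neighbors(adata, n_neighbors=n_neighbors)
    sc.tl.louvain(adata)   # requires the louvain package
    sc.tl.umap(adata)
    return adata

for k in (5, 10, 15, 30):
    result = run_pipeline(k)  # memory keeps growing even though result is rebound each time
```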
Feel free to check the cap-anndata solution.
Thanks, I will try to integrate it into the pipeline to replace AnnData - the interface should be compatible, I suppose? Although I'd mention that this is not an issue with overly large files, but simply the fact that AnnData seems to persist in memory when it shouldn't, accumulating gradually within the same process.
Yep
I don't think it's supported right now, but you could create an issue for it.
Description
When an anndata object is deleted in the current scope, the underlying memory is not (reliably) freed as it is for numpy and others. The memory is kept allocated until the process exits. This leads to huge memory consumption if several anndata files are read sequentially in the same process.
Version:
Python 3.7.6 and Python 3.8.2
anndata==0.7.1
scanpy==1.4.6
Code to Reproduce:
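A minimal sketch consistent with the description above, not the original reproduction code; the file name and loop count are assumptions:

```python
# Hedged sketch: read the same file repeatedly in one process and watch
# resident memory grow even though the previous object was deleted.
import gc
import anndata

FILENAME = "large_dataset.h5ad"  # illustrative path

for _ in range(10):
    adata = anndata.read_h5ad(FILENAME)
    del adata
    # Without an explicit gc.collect() the reference cycles inside AnnData can
    # keep the memory allocated; forcing a collection frees it.
    # gc.collect()
```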