
Anndata not properly garbage collected #360

Closed
fhausmann opened this issue Apr 29, 2020 · 19 comments · Fixed by #1119 · May be fixed by #363
Comments

@fhausmann

Description
When an anndata object is deleted in the current scope, the underlying memory is not (reliably) freed, as it is for numpy and other libraries. The memory stays allocated until the process exits. This leads to huge memory consumption when several anndata files are read sequentially in the same process.

Version:
Python 3.7.6 and Python 3.8.2
anndata==0.7.1
scanpy==1.4.6

Code to Reproduce:

import scanpy as sc
import numpy as np
import os
import tracemalloc

def display_top(snapshot, key_type='lineno'):
    # Sum the sizes of all traced allocations, ignoring import machinery.
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)
    total = sum(stat.size for stat in top_stats)
    return total

def trace_function(func, arg, n):
    # Call func(arg) n times and record the total traced memory after each call.
    total = np.zeros(n)
    data = func(arg).copy()  # warm-up call before tracing starts
    tracemalloc.start()
    for i in range(n):
        data = func(arg).copy()  # rebinding should free the previous object
        snapshot = tracemalloc.take_snapshot()
        total[i] = display_top(snapshot)
    tracemalloc.stop()
    return total

_NP_FILENAME = 'numpydata.npy'
_ANNDATA_FILENAME = 'testfile.h5ad'
_RUNS = 5
data = sc.read(_ANNDATA_FILENAME)
np.save(_NP_FILENAME, np.random.rand(*data.shape))
total_np = trace_function(np.load, _NP_FILENAME, _RUNS)
total = trace_function(sc.read, _ANNDATA_FILENAME, _RUNS)
display(total_np)
array([1.74600576e+09, 1.74602235e+09, 1.74602685e+09, 1.74602814e+09,
       1.74602672e+09])
display(total)
array([1.74998379e+09, 2.62655985e+09, 3.50312986e+09, 2.62657418e+09,
       3.50315218e+09])
import matplotlib.pyplot as plt

plt.plot(np.arange(_RUNS), total_np, label='numpy')
plt.plot(np.arange(_RUNS), total, label='anndata')
plt.legend()
plt.show()

[Plot: traced memory per run; numpy stays flat while anndata grows and fluctuates upward]

@ivirshup
Member

I think I've noticed this before but haven't been able to figure out what's going on. If it were a simple memory leak where the memory use just kept increasing, that would make sense to me, and would obviously be a leak. But the memory sometimes does get collected.

Do you have any idea what could be going on?

@fhausmann
Author

Thank you for your quick reply.
Yes, you're right, it's not a normal memory leak.
I did a small test today with pympler, and it looks like the complete object (not just parts of it) is kept in memory for some time. I have no idea why this is the case, but I'll let you know once I have investigated it in more detail.
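
Roughly how a summary like the one below can be produced with pympler's ClassTracker (a sketch only; the snapshot labels and loop count are assumptions, the original script is not shown here):

import anndata
from pympler.classtracker import ClassTracker

tracker = ClassTracker()
tracker.track_class(anndata.AnnData)         # follow every AnnData instance
tracker.create_snapshot('Before juggling with tracked obj...')

data = anndata.read_h5ad(_ANNDATA_FILENAME)  # _ANNDATA_FILENAME as defined above
tracker.create_snapshot('First init')

for i in range(10):
    data = anndata.read_h5ad(_ANNDATA_FILENAME)
    tracker.create_snapshot(f'{i}. iteration')

tracker.stats.print_summary()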

---- SUMMARY ------------------------------------------------------------------
Before juggling with tracked obj...      active      0     B      average   pct
  anndata._core.anndata.AnnData               0      0     B      0     B    0%
First init                               active      0     B      average   pct
  anndata._core.anndata.AnnData               1      1.65 GB      1.65 GB   73%
0. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               2      3.30 GB      1.65 GB  107%
1. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               3      4.95 GB      1.65 GB  127%
2. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               4      6.61 GB      1.65 GB  139%
3. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               5      8.26 GB      1.65 GB  148%
4. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               6      9.91 GB      1.65 GB  155%
5. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               7     11.56 GB      1.65 GB  160%
6. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               2      3.30 GB      1.65 GB  105%
7. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               3      4.95 GB      1.65 GB  125%
8. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               4      6.61 GB      1.65 GB  138%
9. iteration                             active      0     B      average   pct
  anndata._core.anndata.AnnData               5      8.26 GB      1.65 GB  147%
-------------------------------------------------------------------------------

@ivirshup
Member

ivirshup commented May 1, 2020

One point I would like to understand: can your machine actually run out of memory due to objects which could be collected? Or is collection of these objects triggered based on need?

@fhausmann
Author

Yes, it can. Not in the single-process example I showed above, but when I do it with multiprocessing, my machine runs out of memory. See the code below for an example. This fills 32 GB of RAM quite fast with an anndata file that usually takes 1.65 GB in RAM.

import anndata
import multiprocessing as mp

_RUNS = 3

def trace_function():
    data = anndata.read_h5ad(_ANNDATA_FILENAME)  # _ANNDATA_FILENAME as defined above
    for i in range(_RUNS):
        data = anndata.read_h5ad(_ANNDATA_FILENAME)
    return 0

with mp.Pool(3) as pool:
    results = pool.starmap_async(trace_function, [() for _ in range(100)])
    print(results.get())

There is a workaround: setting mp.Pool(3, maxtasksperchild=1). When a (sub)process exits, the memory is garbage collected as expected, but if the processes are reused, the anndata objects still accumulate.
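
A minimal sketch of that workaround, applied to the snippet above (trace_function and _ANNDATA_FILENAME as defined there):

import multiprocessing as mp

# maxtasksperchild=1 recycles each worker after a single task, so its
# memory is returned to the OS when the process exits.
with mp.Pool(3, maxtasksperchild=1) as pool:
    results = pool.starmap_async(trace_function, [() for _ in range(100)])
    print(results.get())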

@fhausmann fhausmann changed the title Anndata not properly garbage collected, memory leakage Anndata not properly garbage collected May 1, 2020
@fhausmann
Author

fhausmann commented May 4, 2020

The issue seems to be circular references, which cannot be resolved by the standard reference-counting garbage collector. They can only be resolved by the full (cyclic) garbage collector, i.e. gc.collect().
This explains the weird behaviour: in the single-process example the full garbage collector is triggered once allocations reach certain limits, but in the multi-process example that limit is never reached within any single process, while in total all RAM is allocated.
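
To illustrate the mechanism, here is a minimal example that has nothing to do with anndata itself. Note that CPython's cyclic collector is triggered by object allocation counts (see gc.get_threshold()), not by the number of bytes in use:

import gc

class Node:
    def __init__(self):
        self.parent = None

a, b = Node(), Node()
a.parent, b.parent = b, a  # create a reference cycle

del a, b             # refcounts never drop to zero, so nothing is freed yet
print(gc.collect())  # the cyclic collector finds and frees both objects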

The issue seems to be _parent in anndata._core.aligned_mapping.Layers, adata in anndata._core.file_backing.AnnDataFileManager, and probably some more, which you can see with:

import gc
gc.get_referrers(data)

In total I get 6 circular references and 1 actual reference to the anndata object.
The best solution, in my opinion, would be to store only weak references here (or drop them completely if possible), but weak references seem to prevent pickling of the object.
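
A minimal sketch of the weak-reference idea, using a hypothetical Parent/Child pair rather than anndata's actual classes; __getstate__/__setstate__ shows one way pickling could still work, by materializing the reference during serialization:

import weakref

class Parent:
    def __init__(self):
        self.child = Child(self)

class Child:
    def __init__(self, parent):
        # A weakref avoids the strong reference cycle Parent <-> Child.
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self):
        return self._parent_ref()  # None once the parent has been collected

    def __getstate__(self):
        # Weakrefs are not picklable; store a strong reference instead
        # (assumes the parent is still alive when pickling).
        state = self.__dict__.copy()
        state['_parent_ref'] = self.parent
        return state

    def __setstate__(self, state):
        parent = state.pop('_parent_ref')
        self.__dict__.update(state)
        self._parent_ref = weakref.ref(parent)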

@ivirshup
Member

ivirshup commented May 6, 2020

Glad to see you opened a PR for this!

Just a general point about this as an issue though. Unless this memory usage is being triggered by non-pathological code, I'm not sure it justifies huge changes in APIs or intended behavior to fix it.

In practice, I think the gc runs frequently enough that these objects get collected. All the example code we have here is pretty weird, just allocating large amounts of memory and doing nothing with it. From what I can see, making a plot or running a PCA seems to reliably deallocate the memory.

I'd definitely like to see this fixed, but I would like to be conservative about how it's fixed.

@fhausmann
Author

Thank you again for your quick responses and willingness to help.

Just a general point about this as an issue though. Unless this memory usage is being triggered by non-pathological code, I'm not sure it justifies huge changes in APIs or intended behavior to fix it.

I can understand that you don't want any changes in APIs or similar and I'm happy to take your ideas, such that we can avoid this.

In practice, I think the gc runs frequently enough that these objects get collected. All the example code we have here is pretty weird, just allocating large amounts of memory and doing nothing with it.

Sure, these code snippets are very pathological, but we also see this behaviour in non-pathological situations/pipelines where we process or copy anndata objects a lot. However, that code is obviously too large to show here.

From what I can see, making a plot or running a PCA seems to reliably deallocate the memory.

Additionally, I cannot agree that running PCA reliably deallocates the memory.
If you modify the above snippet as follows, it still gradually fills memory. And I think running 100 PCAs in a process pool is not that pathological.

import os
import anndata
import scanpy as sc
import multiprocessing as mp

def do_pca(filename):
    data = anndata.read_h5ad(filename)
    sc.tl.pca(data)
    data.write('tmpfile.h5ad')
    os.remove('tmpfile.h5ad')
    return 0

with mp.Pool(3) as pool:
    results = pool.starmap_async(do_pca, [(_ANNDATA_FILENAME,) for _ in range(100)])
    print(results.get())

@ivirshup
Member

ivirshup commented May 6, 2020

Those examples sound more reasonable. I just gave this case a try with your PR and the current master, but I wasn't able to see any consistent improvement in memory usage on the PR branch. I'm not completely sure what to make of this, since I've had generally negative experiences with memory usage and multiprocessing. Here are the results I recorded:

[Plots: memory usage over time on master and on the memfix PR branch]

The script I used is below.

Data was generated with:

from scipy import sparse
import scanpy as sc

(
    sc.AnnData(sparse.random(50000, 10000, format="csr"))
    .write_h5ad("test_sparse.h5ad", compression="lzf")
)

I was also using scanpy master for its efficient sparse PCA implementation. The script itself:

import os
os.environ["OMP_NUM_THREADS"] = "4"

import anndata
import scanpy as sc
import multiprocessing as mp
import gc

_ANNDATA_FILENAME = "./test_sparse.h5ad"

def do_pca(filename):
    data = anndata.read_h5ad(filename)
    data.layers["dense"] = data.X.toarray()  # To increase memory usage
    sc.tl.pca(data)
    return 0


with mp.Pool(2) as pool:
    results = pool.starmap_async(do_pca, [(_ANNDATA_FILENAME,) for _ in range(30)])
    print(results.get())

Of course, YMMV. Have you seen memory usage improvements for your workflows using your PR?

@fhausmann
Author

Yes, I got improved memory usage. I have now tested with the master branch of scanpy and got the same results as you. I guess it's not completely fixed, and PCA may store circular references somewhere in the anndata object? Strangely, I didn't see this behaviour with stable scanpy.

I went back to a toy example, replaced sc.tl.pca with time.sleep(1), and got the following results.
[Plot: memory usage over time with sc.tl.pca replaced by time.sleep(1)]

I haven't checked our pipeline with the PR yet; however, maxtasksperchild in multiprocessing.Pool solved the issue for us in several cases.

@ivirshup
Member

ivirshup commented May 7, 2020

I went back to a toy example, replaced sc.tl.pca with time.sleep(1), and got the following results.

I figured this one out, and am now able to see improved memory usage. Basically, accessing uns creates circular references, where one of those objects holds a reference to the anndata object. It's a very good use case for weakref, since one of the objects should only ever be referred to by the other. I'll make a quick PR for this, since it's a two-line change, and you can merge that into your branch.

however maxtasksperchild in multiprocessing.Pool was solving the issue for us in several cases.

Good to hear. As an aside, I've generally had much better experiences with resource handling through dask than multiprocessing. This is especially true if tasks rely on each other, since dask is good about optimizing for data locality.
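
For illustration, a minimal sketch of running the same per-file task through dask instead of multiprocessing.Pool (assumes dask.distributed is installed; do_pca and _ANNDATA_FILENAME as in the snippets above):

from dask.distributed import Client

if __name__ == '__main__':
    client = Client(n_workers=2)  # each worker runs in its own process
    futures = [
        client.submit(do_pca, _ANNDATA_FILENAME, pure=False)  # pure=False: re-run despite identical args
        for _ in range(30)
    ]
    print(client.gather(futures))
    client.close()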

ivirshup added a commit to ivirshup/anndata that referenced this issue May 7, 2020
Now KeyOverloads only have a proxy reference to the parent object. This prevents these objects from being managed by the generational garbage collector. That was bad, since it could keep large AnnData objects in memory. See scverse#360 for more info.
ivirshup added a commit that referenced this issue May 7, 2020
Now KeyOverloads only have a proxy reference to the parent object. This prevents these objects from being managed by the generational garbage collector. That was bad, since it could keep large AnnData objects in memory. See #360 for more info.
@fhausmann
Author

Thanks for the fix and the hint about dask!

@simonwm

simonwm commented Feb 4, 2021

I just ran into a similar issue with the current anndata version 0.7.5.

Unfortunately I don't know enough about Python's garbage collection to fix it myself, but maybe a really trivial example might help you fix it:

import numpy as np
import anndata as ad
# %load_ext memory_profiler  # needed for the %mprun magic

def do_stuff(adata): # %mprun magic works only if this is defined somewhere else in a file
    copy0 = adata.copy()
    del copy0
    copy1 = adata.copy()
    del copy1
    copy2 = adata.copy()
    del copy2

adata = ad.AnnData(np.ones((10000,10000)))

# copy numpy array
%mprun -f do_stuff do_stuff(adata.X)
# copy anndata
%mprun -f do_stuff do_stuff(adata)

Output for the numpy part:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   957    581.3 MiB    581.3 MiB           1   def do_stuff(adata):
   958    962.7 MiB    381.5 MiB           1       copy0 = adata.copy()
   959    581.3 MiB   -381.5 MiB           1       del copy0
   960    962.6 MiB    381.3 MiB           1       copy1 = adata.copy()
   961    581.3 MiB   -381.3 MiB           1       del copy1
   962    962.6 MiB    381.3 MiB           1       copy2 = adata.copy()
   963    581.3 MiB   -381.3 MiB           1       del copy2

Output for the anndata part:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   957    581.3 MiB    581.3 MiB           1   def do_stuff(adata):
   958    963.1 MiB    381.8 MiB           1       copy0 = adata.copy()
   959    963.1 MiB      0.0 MiB           1       del copy0
   960   1345.1 MiB    382.1 MiB           1       copy1 = adata.copy()
   961   1345.1 MiB      0.0 MiB           1       del copy1
   962   1727.0 MiB    381.8 MiB           1       copy2 = adata.copy()
   963   1727.0 MiB      0.0 MiB           1       del copy2

With some luck, some partial garbage collection takes place randomly in some runs, but over time a net leak remains.

Since it was mentioned that the issue would not occur in real life, a few words about my use case: I have an adata of a couple of GB and run a method for O(10) different parameter sets. Each call of the method must not change the original adata, and therefore uses a local copy to do its thing. All (many) of these local copies accumulate in memory, eventually crashing the program.

The gc.collect() call (thanks @fhausmann) fixes my real-life scenario, as does replacing .X in the copies with an empty sparse matrix, but having to do this every time one uses a temporary anndata is certainly not ideal...
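
A minimal sketch of the gc.collect() workaround for this kind of parameter sweep (run_method and parameter_sets are hypothetical placeholders):

import gc
import numpy as np
import anndata as ad

adata = ad.AnnData(np.ones((10000, 10000)))

for params in parameter_sets:  # hypothetical list of parameter sets
    tmp = adata.copy()         # work on a throwaway copy
    run_method(tmp, params)    # hypothetical method; must not change adata
    del tmp
    gc.collect()               # break the cycles so the copy is freed immediately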

@ivirshup ivirshup modified the milestones: 0.9.2, 0.9.3 Jul 24, 2023
@ivirshup ivirshup mentioned this issue Sep 4, 2023
3 tasks
@ivirshup ivirshup modified the milestones: 0.9.3, 0.10.1, 0.10.2, 0.10.3 Oct 6, 2023
@flying-sheep flying-sheep modified the milestones: 0.10.3, 0.10.4 Oct 31, 2023
@flying-sheep flying-sheep modified the milestones: 0.10.4, 0.10.5 Jan 4, 2024
@ivirshup ivirshup modified the milestones: 0.10.5, 0.10.6 Jan 25, 2024
@Szym29

Szym29 commented Feb 17, 2024

I am having the same issue in 0.9.1 when I loop over a few anndata objects. Even when I explicitly del the object at every loop iteration, the problem persists. Per @fhausmann, using gc.collect() solves the problem.

This works for me. I tried both adata = None followed by gc.collect() and del adata followed by gc.collect(), and only the latter works.

@ivirshup ivirshup modified the milestones: 0.10.6, 0.10.7 Mar 11, 2024
@flying-sheep flying-sheep modified the milestones: 0.10.7, 0.10.8 Apr 9, 2024
@ilan-gold ilan-gold self-assigned this Jun 3, 2024
@flying-sheep flying-sheep modified the milestones: 0.10.8, 0.10.9 Jun 18, 2024
@jeanmonet

The issue seems to persist in anndata 0.10.6.

I run a function that creates an AnnData object, runs PCA, Louvain & UMAP (via the scanpy interface), and returns the AnnData object.
I then delete the object with del and call gc.collect().

At each iteration the amount of used memory increases regardless.

I would say this is typical usage: running a pipeline with multiple parameter sets in order to evaluate the differences.
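
A sketch of the reported pattern (make_anndata, parameter_grid and evaluate are hypothetical placeholders; sc.tl.louvain also needs the louvain package installed):

import gc
import scanpy as sc

def run_pipeline(params):
    adata = make_anndata(params)  # hypothetical constructor
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.louvain(adata)
    sc.tl.umap(adata)
    return adata

for params in parameter_grid:     # hypothetical parameter sets
    adata = run_pipeline(params)
    evaluate(adata)               # hypothetical evaluation step
    del adata
    gc.collect()                  # reportedly does not stop the memory growth here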

@siberianisaev

Feel free to check out the cap-anndata solution.
It was specifically designed for minimal RAM consumption.

@jeanmonet

jeanmonet commented Jun 20, 2024

Feel free to check out the cap-anndata solution. It was specifically designed for minimal RAM consumption.

Thanks, I will try to integrate it into the pipeline to replace AnnData. The interface should be compatible, I suppose?
Quick question: does it support / integrate with multiomics (i.e. ATAC + RNA), like Muon?

Although I'd mention this is not an issue of overly large files, but simply that AnnData objects seem to persist in memory when they shouldn't, gradually accumulating within the same process.

@siberianisaev

siberianisaev commented Jun 20, 2024

@jeanmonet

Thanks, I will try to integrate it into the pipeline to replace AnnData. The interface should be compatible, I suppose?

Yep

Quick question: does it support / integrate with multiomics (i.e. ATAC + RNA), like Muon?

I don't think it's supported right now, but you could create an issue for it.
