-
-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The Gaussian filter is not behaving lazily as expected and is allocating RAM for the entire array #386
Comments
Is it possible this line causes a copy to a NumPy array? phase_filtered = xr.DataArray(phase_filtered, dims=phase.dims, coords=phase.coords, name='phase_filtered') IOW if that line and all subsequent lines are commented, what does memory usage look like? |
Xarray integrates seamlessly with dask arrays, and the example code correctly produces an xarray dask object. At this stage, RAM usage remains minimal:
And a portion of the array processes correctly without excessive memory usage:
But when the full computation starts, even when lazily saving the data to a NetCDF file, it requires RAM for the entire array. I suspect there’s no proper variable cleanup in the gaussian_filter function, and the garbage collector is unable to free memory while the computation is ongoing. For reference, in my xarray and dask code, I immediately remove unnecessary objects to avoid memory issues. For example:
At this point, the library seems impractical because it consumes the same amount of RAM as numpy/scipy functions, making it impossible to use on large arrays. There is no way to call it efficiently on a decent-sized array without running into memory issues. |
Would recommend using the Should add this needs the Python Graphviz library. In Conda this is installable via the |
@AlexeyPechnikov next to the suggestion by John, can you confirm that your example code uses the amount of memory you'd expect after replacing the line
with
Given that many components are interacting in your code, this test will tell whether the problem you report is specific to the dask_image function. |
It seems ineffective because the image is not readable: The data structure of the variable phase_filtered for 30k x 30k grid is: @m-albert Yes, it’s the same, as expected, because dask-image internally calls image.map_overlap. The issue seems to be deeper within the image.map_overlap processing. Interestingly, I haven’t encountered any problems with processing many small overlapping chunks, like 32x32 with a 16-pixel overlap, but it doesn’t perform well when processing a few large overlapping chunks. |
Albert's suggestion is a good one because Dask-image is a very light wrapper on top of Dask Is it possible to adjust the reproducer to use Dask alone? If so, we might want to raise this upstream. Possibly improvements are needed to |
Okay, so in the case of small filter sizes
Could you give us an idea about the chunk sizes for which the function starts to not perform well for you? Also, could you detail what you mean by "doesn't perform well"? Above you mention that the script uses 20GB, but if I understand you correctly it runs through. Keep in mind that a large memory footprint is not problematic if you have the RAM available (and not necessarily unexpected). Actually, processing n chunks in parallel will require at least n times the memory required for processing a single chunk (more in the case of overlap). You mention that you're having trouble with large chunksizes: It might actually be fully expected that processing "a few overlapping chunks" in parallel would require 20GB, which is comparable to the size of the full input array. If you want to actively limit the memory usage by dask, some options would be to
|
@m-albert You can see the array and its chunks. For small overlaps, processing a few 2048x2048 chunks in parallel should require just a fraction of the memory compared to the full array size. I can process it on macOS/Linux with unlimited swap, but it breaks in Docker environments with the default swap (1GB or less) and sometimes on free Google Colab, where swap is disabled. For reference, this command works without any issues and with minimal memory consumption (about 1-2GB, which is hard to even monitor):
However, this command is 4-5 times slower, and requires about 10 times more RAM:
For a single-pixel overlap, the difference is extreme. Even for a full chunk overlap, the difference can be up to 9 times: the original block plus 8 surrounding ones, but the latter shouldn’t need to be materialized for each task when dealing with a small overlap. From my point of view, the problem lies in the full materialization of all overlapping chunks without immediate resource cleanup. |
@AlexeyPechnikov Thanks for these tests, it seems we can narrow down the problem to the use of Here's an issue reporting exactly what you've detailed in your last comment: dask/dask#3671. Probably at this point it'll make sense to visualize the graph produced by |
@m-albert As you can see, this issue has persisted in Dask for years and remains unresolved. It seems like this may continue to be a problem indefinitely. I’m not sure why you need the visualization, but here it is: In my code, I use a workaround by processing only chunk coordinates and extracting and materializing only the required data:
However, in my recent tests, I’ve found that Dask’s blockwise can now be used effectively, yielding similar performance. How about developing a universal solution? Currently, dask-image doesn’t offer advantages over standard Dask wrappers for SciPy functions, but a solution for overlapping processing could make a big impact. By the way, I develop an open-source Python InSAR software for satellite interferometry processing that can handle terabytes of data, even on a common laptop like an Apple Air. You can find the Google Colab links and Docker image at https://github.com/AlexeyPechnikov/pygmtsar Even complex examples provided here work in a Docker container with just 4-8GB RAM and 0.5-1GB swap. However, the code is tricky because of Dask-related issues, and it would be great to resolve this hassle completely. |
Indeed it seems like the graphs produced by
I've also played around with this a bit. Consider the following graph visualization for a 2D map_overlap: import numpy as np
import dask.array as da
a = da.random.random((6, ) * 2, chunks=(2, ) * 2)
b = a.map_overlap(lambda x: x, depth=1)
b.visualize('map_overlap.png', color='order') The color represents the order in which the tasks would be computed by the single threaded scheduler. From the visualization it becomes apparent that each output chunk depends on several input chunks. It seems that the input slices for each output slice are calculated at the same time (look at the third layer from below), instead of e.g. just after loading the relevant input chunk and keeping a slice of it in memory. So it could be that input chunk caching is creating the large memory footprint. In the following code I used from dask.optimization import inline_functions
import dask.array
def inline_first_hlg_layer(ar):
"""
This function inlines the first high-level graph layer of a dask array.
"""
layer_keys = list(ar.dask.layers.keys())
no_dep_keys = [k for k, v in ar.dask.dependencies.items() if not len(v)]
lowest_layer_key = [k for k in no_dep_keys
if not max([lk in k and lk != k for lk in layer_keys])][0]
outputs = [k for k in ar.dask
if k[0].startswith(ar.name)]
fast_function = ar.dask[[k for k in ar.dask.keys()
if k[0].startswith(lowest_layer_key)][0]][0]
dsk_inlined = inline_functions(ar.dask,
outputs,
[fast_function],
dependencies=ar.dask.get_all_dependencies(),
)
res_ar = dask.array.Array(dsk_inlined, ar.name, chunks=ar.chunks, dtype=ar.dtype)
return res_ar
b_inlined = inline_first_hlg_layer(b)
b_inlined.visualize('map_overlap_inlined.png', color='order') This yields the following nicely parallelised dask graph: How does this modified graph structure affect memory usage? Let's have a look at a larger array (comparable to the code snippets above) and profile it: from dask.diagnostics import ResourceProfiler, CacheProfiler, ProgressBar
a = da.random.random((20000,) * 2, chunks=(2000,) * 2)
b = a.map_overlap(lambda x: x, depth=1)
b_inlined = inline_first_hlg_layer(b)
os.system('rm -rf zarr1.zarr')
with ResourceProfiler(dt=.25) as rprof_inlined, ProgressBar(), CacheProfiler() as cprof_inlined:
b_inlined.to_zarr('zarr1.zarr')
os.system('rm -rf zarr2.zarr')
with ResourceProfiler(dt=.25) as rprof, ProgressBar(), CacheProfiler() as cprof:
b.to_zarr('zarr2.zarr')
from dask.diagnostics import visualize
visualize([rprof_inlined, rprof]) It seems that creating a linear and parallel graph structure by inlining the first graph layer
In my tests, this works for the threaded single machine scheduler and the distributed scheduler with @AlexeyPechnikov Could you try this approach on your data? I.e. applying
Do you want to add some more detail on this? |
@m-albert Yeah, the graph looks much better, and map_overlap works in about the same time and with similar memory consumption as map_blocks. However, the inline_first_hlg_layer() function produces errors with my data. I’ve fixed it somewhat for a 30k x 30k array, but it still doesn’t work for a 50k x 50k array, so my fix is definitely inaccurate. Please check the issue.
I’m not confident with inline_functions; do you believe this approach is robust and can work stably in all cases? I initially saw it as more of a last-mile solution for a specific case, but it would be great if we could use it as a universal solution. |
It seems there are no additional advantages to coding as I’ve shown using dask.delayed and merging the blocks; da.blockwise provides the same performance with more compact code:
Years ago, memory consumption was not predictable for da.blockwise on a Dask distributed cluster in some cases. As a result, I ended up using dask.delayed with immediate deletion of unused variables like this:
It’s not elegant, but it is (and was) predictable. Whether on large workstations or common laptops, we could process huge datasets. Recently, I rechecked, and da.blockwise just works. I hope we don’t need to replace it with a lot of dummy code today. |
This probably changed when If there is a way to rewrite We certainly are heavy users of |
I found that dask-image does its job much better than dask-ml, for example, which crashes even on small rasters of about 1000x1000 pixels, while the wrapped SciPy functions work well (I reported the issue some years ago, but it seems developers were not interested). What I’m asking for is intended to provide easy tools to operate on terabyte datasets, even on an 8GB RAM host. It’s possible right now, but the code looks complicated, and users who want to use their own filtering or other custom operations (not included in my software) find it challenging. If dask-image could provide easy-to-use functions for multidimensional image operations, it would be a truly unique tool, competing with well-established big data processing platforms (which, in my opinion, are technically terrible but more usable without deep knowledge). |
@jakirkham Agreed that it'd be best to improve @AlexeyPechnikov The function was mainly meant for a quick test of inlining and I think isn't in general compatible with I think this is relevant to us and directly relates to our issue: https://docs.dask.org/en/stable/order.html. In summary, @TomAugspurger added optional inlining of arrays in One idea would be to implement the same for In terms of the implementation, this would probably require inlining here. |
I was just reading some more about this and found a large discussion thread on Actually, the high level graph layer for array overlap has been implemented by @GenevieveBuckley! If I understand the discussions correctly, this layer helped delaying the creation of the low-level graph ("materializing" it). The |
@m-albert You’re thinking in terms of Dask implementation… From my perspective, it’s more straightforward to think about an independent implementation and then fit the concept to Dask. There are actually just a few key points needed for an effective map_overlap implementation:
|
The test script below uses 20 GB of RAM for the full array size. Despite the small Gaussian kernel, which requires minimal chunk overlap, dask_image.ndfilters still fails to perform lazy processing. The same issue occurs across multiple operating systems (Linux Debian, Linux Ubuntu, macOS) and Python versions (3.10, 3.11), as well as different versions of dask_image.
The text was updated successfully, but these errors were encountered: