distributed.nanny.environ.MALLOC_TRIM_THRESHOLD_ is ineffective #5971

Closed · crusaderky opened this issue Mar 21, 2022 · 11 comments · Fixed by #6681
Labels: bug (Something is broken), memory

Comments

crusaderky (Collaborator) commented Mar 21, 2022

Ubuntu 21.10 x86/64
distributed 2022.3.0

The MALLOC_TRIM_THRESHOLD_ env variable seems to be effective at making memory deallocation more reactive.
However, the config key that sets it doesn't seem to do anything, which suggests that the variable is being set after the worker process has started, whereas it needs to be set before the process is spawned.

import dask.array
import distributed

client = distributed.Client(n_workers=1, memory_limit="2 GiB")

N = 7_000
S = 160 * 1024

a = dask.array.random.random(N * S // 8, chunks=S // 8)
a = a.persist()
distributed.wait(a)
del a

Result:
Managed: 0
Unmanaged: 1.16 GiB

import os
import dask.array
import dask.config
import distributed

os.environ["MALLOC_TRIM_THRESHOLD_"] = str(dask.config.get("distributed.nanny.environ.MALLOC_TRIM_THRESHOLD_"))
client = distributed.Client(n_workers=1, memory_limit="2 GiB")

N = 7_000
S = 160 * 1024

a = dask.array.random.random(N * S // 8, chunks=S // 8)
a = a.persist()
distributed.wait(a)
del a

Result:
Managed: 0
Unmanaged: 151 MiB

Production Workaround

Set the environment variable in the shell before starting dask-worker:

export MALLOC_TRIM_THRESHOLD_=65536
dask-worker <address>
crusaderky self-assigned this Mar 21, 2022

crusaderky (Collaborator, Author):

I'm uncertain how to solve this. The simple solution, changing os.environ in Nanny.__init__ instead of passing the variables down to Worker, would also mean polluting the nanny's own process environment. That's annoying for unit tests, but I'm not sure whether anybody cares in production?
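
For context, a minimal standalone sketch (not distributed code) of why setting os.environ in the parent before spawning works: a spawned child inherits the parent's environment as it is at spawn time, but the variable then also stays set in the parent unless it is explicitly restored.

import multiprocessing as mp
import os

def report():
    # The child sees whatever was in the parent's environment at spawn time.
    print("child sees:", os.environ.get("MALLOC_TRIM_THRESHOLD_"))

if __name__ == "__main__":
    os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"  # set before spawning
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=report)
    p.start()
    p.join()
    # ...but the variable remains set in the parent: the "pollution" mentioned above.
    print("parent still has:", os.environ.get("MALLOC_TRIM_THRESHOLD_"))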

The alternative is to have an intermediate process that sets the variables and then invokes Python again, but that is very expensive.

This issue also impacts the other two variables set by the config:

  OMP_NUM_THREADS: 1
  MKL_NUM_THREADS: 1

AFAIK, if for any reason numpy is imported in the worker process before the config is applied, these two variables will not be picked up.
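
A minimal standalone illustration of that ordering sensitivity (a sketch, not distributed's actual startup code):

import os

# These thread-pool variables are only honoured if they are already in the
# environment when numpy (and its BLAS/OpenMP backend) is first imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # imported only after the variables are set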

gjoseph92 (Collaborator):

if for any reason numpy is imported in the worker process before the config is applied

Quite possible: #5729

The simple solution, changing os.environ in Nanny.__init__ instead of passing the variables down to Worker, would also mean polluting the nanny's own process environment

To be fair, all the environment variables we're currently talking about setting (malloc_trim and num_threads) basically don't have an impact unless they're set before the interpreter starts. So for these specifically, setting them in the Nanny shouldn't actually change anything in practice. I still dislike the uncleanliness of setting them in the Nanny process, though.

The alternative is to have an intermediate process that sets the variables and then invokes Python again

Could also have a process-wide lock for Nanny, and set os.environ in the parent process while holding that lock, then reset it before releasing. You'd want to be clever and still allow subprocesses with the same env to be spawned in parallel, so dask-worker --nworkers=100 is still performant. Still not perfectly clean, but maybe an acceptable tradeoff between cleanliness and performance.
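
As a rough standalone sketch of that lock idea (names here are hypothetical, not part of distributed's API); note that this simple version serializes spawns rather than allowing identical environments in parallel:

import os
import threading
from contextlib import contextmanager

_spawn_environ_lock = threading.Lock()  # hypothetical process-wide lock

@contextmanager
def temporarily_set_environ(env):
    # Patch the parent's environment while a child is being spawned,
    # then restore the previous values before releasing the lock.
    with _spawn_environ_lock:
        old = {key: os.environ.get(key) for key in env}
        os.environ.update({key: str(value) for key, value in env.items()})
        try:
            yield
        finally:
            for key, value in old.items():
                if value is None:
                    os.environ.pop(key, None)
                else:
                    os.environ[key] = value

# Hypothetical usage:
# with temporarily_set_environ({"MALLOC_TRIM_THRESHOLD_": 65536}):
#     worker_process.start()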

fjetter (Member) commented Mar 23, 2022

setting them in the Nanny shouldn't actually change anything in practice.

They'll be set before the worker process starts, and the worker process is where it matters.

gjoseph92 (Collaborator):

I meant that having them set in the Nanny process won't really affect things for the Nanny itself. Guido and I don't like the poor hygiene of leaving them set on the Nanny, but I'm just noting that that poor hygiene shouldn't affect anything on the Nanny in practice, because the particular variables we're setting only have an effect at interpreter startup / NumPy import time.

crusaderky (Collaborator, Author):

I was worried about potential user-defined variables, not the three we set. But I'm leaning towards not over-engineering this just to cover purely hypothetical use cases.

dagibbs22 commented Dec 6, 2023

I'm running a Jupyter Lab notebook on Ubuntu 22.04.2 LTS where unmanaged memory isn't being released. Dask is running through Coiled. After about 30 seconds of running my notebook, unmanaged memory appears and stays high for my long-running tasks. I have dask and distributed 2023.11.0 from conda-forge. I'm trying to follow the workaround here but am having trouble with it.

When I include os.environ["MALLOC_TRIM_THRESHOLD_"] = str(dask.config.get("distributed.nanny.environ.MALLOC_TRIM_THRESHOLD_")) in my notebook import cell, I get KeyError: 'MALLOC_TRIM_THRESHOLD_'. How do I "set the environment variable in the shell before starting dask-worker", as mentioned above, with

export MALLOC_TRIM_THRESHOLD_=65536
dask-worker <address>

My imports are currently:

import os
import coiled
import dask
from dask.distributed import Client, LocalCluster
import dask.config
import distributed
dask.config.set({'distributed.nanny.environ.MALLOC_TRIM_THRESHOLD_': 5})

This seems to set the MALLOC_TRIM_THRESHOLD_ variable correctly; after I create my client, I check it with client.run(os.getenv, "MALLOC_TRIM_THRESHOLD_")

And get

{'tls://10.1.32.137:33565': '5',
 'tls://10.1.33.247:40209': '5',
 'tls://10.1.35.138:32803': '5',
 'tls://10.1.41.133:46629': '5',
 'tls://10.1.41.140:41265': '5',
 'tls://10.1.44.62:41999': '5',
 'tls://10.1.45.159:35541': '5',
 'tls://10.1.47.63:38407': '5'}

But unmanaged memory still increases after about 30 seconds and stays high. That eventually causes my model to fail.

How do I trim the unmanaged memory? Thanks very much.

crusaderky (Collaborator, Author):

@dagibbs22 the workarounds described above are very old. This issue was resolved in July 2022.
If you're unhappy with the default that dask sets, you can change it through the dask config:

import dask
import coiled
dask.config.set({"distributed.nanny.pre-spawn-environ.MALLOC_TRIM_THRESHOLD_": your_value_here})
cluster = coiled.Cluster(...)

That said, I'd honestly be surprised if tampering with this setting fixed your issue; if your unmanaged memory does disappear, I would love to see your code.

dagibbs22:

Thanks, @crusaderky. I could tell the issue had been resolved but couldn't tell what it was...

Adding dask.config.set({"distributed.nanny.pre-spawn-environ.MALLOC_TRIM_THRESHOLD_": 1}) didn't reduce my unmanaged memory, but it did keep the model running despite high unmanaged memory. That's not great either, but at least it demonstrates that I need to look somewhere else to trim my unmanaged memory. Why did you think (correctly) that changing MALLOC_TRIM_THRESHOLD_ wouldn't make my unmanaged memory disappear? Do you have other suggestions for how to keep unmanaged memory from accumulating?

dagibbs22 added a commit to wri/carbon-budget-Europe that referenced this issue Dec 8, 2023
… on the full time series. Even on 2012-2021, unmanaged memory increased over time, getting into orange and then red for all workers. Eventually, workers died, but somehow they restarted and finished the time series. The same thing happened with the full time series two times (workers in the red zone for memory and then dying) but I guess it just happened too many times and eventually the model died. So, adding dask.config.set({"distributed.nanny.pre-spawn-environ.MALLOC_TRIM_THRESHOLD_": 1}) based on my conversation at dask/distributed#5971 (comment) didn't actually reduce unmanaged memory but did make the model push through the accumulated unmanaged memory, at least one or two times. Of course, this isn't a viable solution overall. But it is good data; unmanaged memory accumulation isn't due to MALLOC_TRIM_THRESHOLD_.

crusaderky (Collaborator, Author):

@dagibbs22 there are many causes for unmanaged memory, listed here: https://distributed.dask.org/en/stable/worker-memory.html#using-the-dashboard-to-monitor-memory-usage

Is unmanaged memory persisting while there are no tasks running? If it goes away, it's heap memory and you have to reduce the size of your chunks/partitions.
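
For example (a hypothetical illustration, not your code), that would mean building the array with smaller chunks:

import dask.array as da

# Smaller chunks mean a smaller per-task working set, which is what shows
# up as transient "heap" unmanaged memory on each worker.
x = da.random.random((50_000, 50_000), chunks=(2_500, 2_500))  # ~48 MiB per chunk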

dagibbs22:

@crusaderky The old unmanaged memory gets up to about 6 GB in each worker when I run my notebook but drops to 2 GB per worker after the notebook finishes. Does 2 GB/worker count as "memory persisting while there are no tasks running"? Why would memory persist like that? Thanks.

crusaderky (Collaborator, Author):

Some of it will be logs. Dask workers store log information in deques for forensic analysis. You can shorten them through the dask config:

distributed:
    admin:
        log-length: 0
        low-level-log-length: 0
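
The same settings can also be applied from Python before the cluster is created, using the keys shown above:

import dask

# Equivalent to the YAML above; set this before creating the cluster so the
# values are in place when the workers start.
dask.config.set({
    "distributed.admin.log-length": 0,
    "distributed.admin.low-level-log-length": 0,
})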
