Unmanaged Memory Leak with Large Parquet Files (Dask + Modin) #8375
Comments
I'm not aware of a memory leak in pyarrow. If you have some links to the issues, that would be helpful. The logs above show that your workers are running out of memory, but there is not much more to go on. If modin is your interface, I suggest opening an issue there. The fact that something is scattering data is very uncommon for dask workloads and definitely unexpected when reading parquet data.
@fjetter Here are the things I've found about a memory leak in pyarrow:

Since modin is using dask as the base, I opened the ticket here (I also opened #8377, which is just dask + distributed, with much of the same detail, albeit a slightly different error). There's clearly a runaway spawned process, but I am unable to figure out what that spawned process is doing.
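One way to see what that spawned process is actually doing is to ask each worker for a process report. This is a rough diagnostic sketch rather than anything from the thread; it uses `Client.run` and `psutil` (both available in a distributed installation), and the scheduler address is a placeholder:

```python
# Ask every worker to report its own resident memory and any child
# processes it has spawned, to help pin down the runaway multiprocessing fork.
import psutil
from dask.distributed import Client

client = Client("tcp://<scheduler-ip>:8786")  # placeholder scheduler address

def process_report():
    proc = psutil.Process()
    children = [
        (child.pid, child.cmdline(), child.memory_info().rss)
        for child in proc.children(recursive=True)
    ]
    return {"worker_rss": proc.memory_info().rss, "children": children}

print(client.run(process_report))  # dict keyed by worker address
```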
As with any open source effort, often 90% of the challenge is identifying where the problem is. One challenge of what you're running into is that your problem could be in any of dask.distributed, dask-cloudprovider, modin, or arrow. My guess is that folks like @fjetter are hesitant to wade into a problem like this because more likely than not the issue is in one of the other projects (and those maintainers will be similarly hesitant). One approach would be to try to reproduce the problem with a toolset well understood by a single set of maintainers. In your case I would probably try to reproduce the issue using dask.dataframe or dask-expr (replacing Modin) and LocalCluster or coiled.Cluster (replacing dask-cloudprovider). With that combination the blame would fall squarely on the shoulders of someone like @fjetter and I'll bet he'd be more inclined to spend time understanding what went wrong. Alternatively, you might also find that the problem goes away using that set of tools, in which case you'd have a better sense of where to look.
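A rough sketch of that narrowed-down reproduction, assuming plain `dask.dataframe` on a `LocalCluster`; the parquet path, column name, and memory limit below are placeholders rather than details from the report:

```python
# Hypothetical reproduction with only dask + distributed: no modin,
# no dask-cloudprovider. Read the parquet files and force a shuffle.
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait

cluster = LocalCluster(n_workers=2, memory_limit="16GB")  # per-worker limit; adjust to the machine
client = Client(cluster)

df = dd.read_parquet("s3://<bucket>/<prefix>/*.parquet")  # the ~30 x 150 MB files

shuffled = df.shuffle(on="<some_column>")  # shuffle-heavy step like the failing workload
wait(shuffled.persist())                   # watch the dashboard for unmanaged memory growth
```

If unmanaged memory still climbs with this setup, the problem points at dask/distributed (or pyarrow underneath); if it doesn't, modin or dask-cloudprovider become the more likely suspects.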
@mrocklin FYI I was able to reproduce the bug here: #8377 (comment). I'll work on getting those large parquet files into a publicly accessible S3 bucket so anyone can test it out more easily.
Given this appears to be related to using Dask rather than deploying Dask, I would recommend trying to reproduce without any of the libraries like dask-cloudprovider, coiled, etc. and just making it as simple as possible. Maybe just spin up an EC2 instance and use `dask scheduler` and `dask worker` directly.
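A bare-bones version of that might look like the sketch below, with assumed addresses: start `dask scheduler` and `dask worker tcp://<scheduler-ip>:8786` by hand on the instance, then drive them from a plain client and re-run the read from the earlier sketch:

```python
# Hypothetical hand-wired setup: scheduler and worker started from the CLI
# on the EC2 instance, no deployment libraries involved.
from dask.distributed import Client

client = Client("tcp://<scheduler-ip>:8786")      # assumed scheduler address
print(client.scheduler_info()["workers"].keys())  # confirm the hand-started workers attached
```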
Describe the issue:
I am using modin on a dask cluster.
When loading 30 parquet files (each ~150 MB) onto a cluster of 2 workers (32 GB RAM, 500 GB disk, AWS EC2 `r5.xlarge`), there is a runaway process of unmanaged memory (specified below) which causes the dask process to be restarted, which causes the rest of the cluster to fail during a shuffle.

I've been getting whiffs from online bug reports that there's a memory leak in pyarrow and that dask can't make the same decisions about parquet as it can with CSV (which works flawlessly! 56 GB of CSV, 2 workers, no problem!), so ultimately I think that might be the issue, but I don't know a) how to find where that leak is in pyarrow, and b) what `dask`/`distributed` can do to handle it.

Logs showing how the worker handles the runaway memory:
The runaway process:
`/usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=11, pipe_handle=17) --multiprocessing-fork`
The ultimate error:
Minimal Complete Verifiable Example:
Anything else we need to know?:
This is operating in an AWS environment. I'm booting the EC2 instances (via boto3) and running `dask worker` and `dask scheduler` and linking them manually (well, through a script with boto3).

Like I mentioned above, I'm using modin on top of a dask cluster. (This also fails with regular dask on a dask cluster if the memory is too small and it is unable to spill properly — runaway memory from the same process as here, and then spilling never goes beyond 2-3 GB.)
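On the spilling behaviour mentioned here, one thing worth ruling out is the worker memory thresholds in distributed's configuration. A sketch of setting them explicitly is below; the keys are documented distributed settings, but the fractions are illustrative values, not recommendations:

```python
# Illustrative worker-memory thresholds; these must be in place before the
# workers start (e.g. via ~/.config/dask/distributed.yaml on each EC2 box).
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # fraction of the limit at which to start spilling
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})
```

Note that unmanaged memory (for example buffers held inside pyarrow) is not spillable: spilling only moves managed task results to disk, so spilling stalling at 2-3 GB while the process keeps growing is consistent with memory held outside dask's bookkeeping.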
Environment: