slice on scan_parquet allocates memory that cannot be released #3972
This is also because of row groups. We have a single row group, leading to a large allocation. This would improve if we wrote smaller row groups.
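A minimal sketch of the suggested fix (not from the original thread): pyarrow's `write_table` accepts a `row_group_size`, so an existing file can be rewritten with smaller row groups. For a file of this size you would normally write it with small row groups in the first place, and newer Polars versions expose a similar argument on `write_parquet`.

```python
import pyarrow.parquet as pq

# Rewrite an existing parquet file with smaller row groups, so that a slice
# only has to materialize the row groups it actually overlaps.
table = pq.read_table("tmp.parquet")
pq.write_table(table, "tmp_small_rg.parquet", row_group_size=1_000_000)
```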
As of Polars 0.13.54, the code above (which does not use row groups) runs successfully but still yields a large chunk of non-reclaimable memory.

**Using Row Groups**

Unfortunately, attempting to write in row groups makes this worse: the result is an OOM.

**The Setup: DataFrame of 225 GB (in RAM), written in 10 Row Groups**

**Creating the Parquet File with Row Groups**

Initially, I struggled to control the number of row groups created when writing the file. So, I chose instead to create a 225 GB DataFrame composed of 10 equal-sized chunks, so that each chunk would be written as its own row group. The sole purpose of the `id` column is to distinguish the chunks.

The final DataFrame that will be written to a parquet file is 225 GB (in RAM) with a total of 1,118,481,070 rows and 27 columns, composed of 10 equal-sized chunks. After writing the parquet file, I have pyarrow verify the number of row groups.

```python
import polars as pl
import pyarrow.parquet as pq
import math

mem_in_GB = 225
nbr_row_groups = 10
mem_per_row_group = mem_in_GB / nbr_row_groups

def mem_squash(file_size_GB: float, id: int = 1) -> pl.DataFrame:
    # Build a DataFrame of roughly `file_size_GB` gigabytes of integer data.
    nbr_uint64 = file_size_GB * (2**30) / 8
    nbr_cols = math.ceil(nbr_uint64 ** (0.15))
    nbr_rows = math.ceil(nbr_uint64 / nbr_cols)
    return (
        pl.DataFrame(
            data={
                "col_" + str(col_nbr): pl.arange(0, nbr_rows, eager=True)
                for col_nbr in range(nbr_cols - 1)
            })
        .with_column(pl.repeat(id, nbr_rows, eager=True, name="id"))
        .select([pl.col('id'), pl.exclude('id')])
    )

# Concatenate 10 equal-sized chunks without rechunking, so each chunk can be
# written as its own row group.
df = pl.concat(
    items=[
        mem_squash(mem_per_row_group, id_nbr)
        for id_nbr in range(0, nbr_row_groups)
    ],
    rechunk=False,
)
df
df.estimated_size() / (2**30)
df.write_parquet('tmp.parquet')
pq.read_metadata('tmp.parquet')
```
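As a quick check (a sketch added here, not part of the original comment), the chunk count of the in-memory frame and the row-group count of the written file can be compared directly; `num_row_groups` is part of pyarrow's `FileMetaData`.

```python
# Expect 10 chunks (one per mem_squash() call) and, on the version under test,
# a matching number of row groups in the written file.
print(df.n_chunks())
print(pq.read_metadata("tmp.parquet").num_row_groups)
```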
**Scanning the Parquet File**

After restarting the Python interpreter, attempting to read a slice of the first 100 records results in an OOM.

```python
import polars as pl

ldf = pl.scan_parquet('tmp.parquet', parallel="auto")
df = ldf.slice(0, 100).collect()
```
Reading using … gives the same result. As does using …. And ….

My machine has more than enough RAM to hold even two copies of the 225 GB DataFrame in RAM.

It's odd that reading the file with no row groups runs successfully (but leads to unreclaimable RAM), while using row groups causes an OOM despite sufficient available RAM, even when using …. Please let me know if there's another scenario you'd like me to try. Happy to help.
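For completeness, a sketch of sweeping the other `parallel` strategies mentioned above (my own addition; the option names follow current Polars, i.e. "auto", "columns", "row_groups", and "none", and may have differed slightly in 0.13.x).

```python
import polars as pl

# Hypothetical sweep over the scan_parquet parallel strategies; on the affected
# versions, each of these hit the same OOM when the file contains row groups.
for strategy in ["auto", "columns", "row_groups", "none"]:
    df = pl.scan_parquet("tmp.parquet", parallel=strategy).slice(0, 100).collect()
    print(strategy, df.shape)
```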
The unreclaimable RAM when we have no row groups is expected. Arrow buffers consist of the memory buffer plus two attributes, an offset and a length; slicing only adjusts those attributes, so the whole buffer stays alive.

The OOM with row groups surprises me. I still have to research that one.
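To illustrate the buffer-sharing point (my own sketch using pyarrow, not Polars internals): a zero-copy slice keeps the entire parent buffer alive even though only a handful of values are reachable.

```python
import pyarrow as pa

arr = pa.array(range(1_000_000))   # int64 array, ~8 MB data buffer
view = arr.slice(0, 100)           # zero-copy: records an offset and a length

# The slice points at the same underlying memory as its parent...
print(view.buffers()[1].address == arr.buffers()[1].address)  # True
# ...and that buffer is still the full ~8 MB, not 100 * 8 bytes.
print(view.buffers()[1].size)
```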
I believe this should help: #4046
I'm still confused about something. I'm not concerned with Eager mode -
Linux might only give memory back to the OS in certain cases. It's normal that your process memory is equal to your peak memory: the RAM is given to the Polars process, and the Polars allocator will hold on to it and is free to reuse it for the rest of the process. See a more thorough explanation of this here. Can you do a heaptrack run to get more insight?
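One hedged way to capture such a profile from Python (my own sketch; `repro.py` is a hypothetical script containing the scan/slice reproduction, and the exact output file name depends on the heaptrack version):

```python
import subprocess
import sys

# Run the reproduction under heaptrack; it writes a heaptrack.* profile in the
# current directory that can be inspected with heaptrack_gui or heaptrack_print.
subprocess.run(["heaptrack", sys.executable, "repro.py"], check=True)
```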
Edit: I don't think it matters. I believe we already passed the correct slice length.
@cbilot: one thing you could do to show whether or not there's a memory leak here is to run ….

Both polars and libarrow use jemalloc, which is generally better about releasing memory back to the operating system. libparquet (from pyarrow) seems to use glibc's malloc.

If you want to dig into this further, you can ask glibc and jemalloc to give you details about their allocations. You can also ask them to release memory that they would not otherwise. Here's some code to do that. It may be a bit system-specific, so you may need to adjust it. The polars library distributed by pip doesn't include symbols, so you'll need to install polars from source for this to work.

```python
import psutil
import subprocess
import sys
from cffi import FFI  # `pip3 install -U cffi`
import polars as pl  # Install from source so as to get symbols.
# Also need `nm` from `binutils` installed.

def get_so_path(so_name):
    # Find the loaded shared object whose path contains `so_name`.
    return [x for x in psutil.Process().memory_maps()
            if so_name in x.path][0].path

def get_so_offset(so_path, base_name, name):
    # Offset of `name` relative to `base_name`, taken from nm's symbol table.
    syms = {
        x[2]: int(x[0], 16) for x in [
            x.split(' ') for x in subprocess.check_output([
                "nm", "--defined-only", so_path]).decode("utf-8").split("\n")
        ] if len(x) == 3}
    return syms[name] - syms[base_name]

def get_so_fn(ffi, so_path, base_name, name, ftype):
    # Resolve a non-exported symbol by adding its nm offset to the runtime
    # address of an exported base symbol.
    ffi.cdef(f"void {base_name}();")
    lib = ffi.dlopen(so_path)
    fdiff = get_so_offset(so_path, base_name, name)
    baddr = int(ffi.cast("uintptr_t", ffi.addressof(lib, base_name)))
    return ffi.cast(ftype, baddr + fdiff)

def glibc_malloc_trim():
    # Ask glibc malloc to return free memory to the OS.
    ffi = FFI()
    ffi.cdef("int malloc_trim(size_t);")
    lib = ffi.dlopen(None)
    assert(lib.malloc_trim(0) in [0, 1])

def glibc_malloc_info():
    # Print glibc malloc statistics (XML) to stdout.
    ffi = FFI()
    ffi.cdef("int malloc_info(int, FILE*);")
    lib = ffi.dlopen(None)
    assert(lib.malloc_info(0, ffi.cast("FILE*", sys.stdout)) == 0)

def jemalloc_mallctl_purge(so_name, base_name, name):
    # Purge dirty pages from all jemalloc arenas in the given shared object
    # (arena 4096 is MALLCTL_ARENAS_ALL).
    ffi = FFI()
    so_path = get_so_path(so_name)
    ftype = "int(*)(const char*, void*, size_t*, void*, size_t*)"
    mallctl = get_so_fn(ffi, so_path, base_name, name, ftype)
    cmd = b"arena.4096.purge"
    assert(mallctl(cmd, ffi.NULL, ffi.NULL, ffi.NULL, ffi.NULL) == 0)

def jemalloc_malloc_stats_print(so_name, base_name, name):
    # Print jemalloc statistics via its malloc_stats_print entry point.
    ffi = FFI()

    @ffi.callback("void(*)(void*, const char*)")
    def write_cb(handle, msg):
        stream = ffi.from_handle(handle)
        stream.write(ffi.string(msg).decode("utf-8"))

    stream = sys.stdout
    so_path = get_so_path(so_name)
    ftype = "void(*)(void(*)(void*, char*), void*, char*)"
    malloc_stats_print = get_so_fn(
        ffi, so_path, base_name, name, ftype)
    malloc_stats_print(write_cb, ffi.new_handle(stream), ffi.NULL)

def malloc_purge_all():
    glibc_malloc_trim()
    jemalloc_mallctl_purge(
        "/polars.abi3.so", "PyInit_polars", "_rjem_mallctl")
    jemalloc_mallctl_purge(
        "/libarrow.so", "arrow_strptime", "je_arrow_mallctl")

def malloc_info_all():
    print("## polars jemalloc stats\n")
    jemalloc_malloc_stats_print(
        "/polars.abi3.so", "PyInit_polars", "_rjem_malloc_stats_print")
    print("\n## arrow jemalloc stats\n")
    jemalloc_malloc_stats_print(
        "/libarrow.so", "arrow_strptime", "je_arrow_malloc_stats_print")
    print("\n## glibc malloc info\n")
    glibc_malloc_info()
```
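For example (a usage sketch, not from the original comment), after running the scan/slice reproduction you could purge both allocators, compare the process RSS before and after, and then dump the per-allocator statistics:

```python
import psutil

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

print(f"RSS before purge: {rss_gb():.1f} GB")
malloc_purge_all()   # defined above: trims glibc and purges both jemalloc instances
print(f"RSS after purge:  {rss_gb():.1f} GB")
malloc_info_all()    # defined above: prints glibc and jemalloc statistics
```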
Thanks @traviscross. I really appreciate the help. I have some odd news. Using compiled Polars up to 8f07335, I can no longer recreate this problem reliably. In about 50 tries, I could only reproduce it once. And unfortunately, I quit the session because I had forgotten to install a package, so I was not able to try out the functions you provided. Later, I'll try to force the issue by writing a loop that keeps trying until the issue appears. (Note: I can still recreate this problem reliably using 0.13.58 downloaded from PyPI.)

The only commit since 0.13.58 that seems possibly relevant would be ad15e93, an update of arrow2. Does …? Or perhaps does compiling Polars somehow change the allocator, compared to using pip? I used the following command during the compile:

Using …, running other non-related queries against the file does cause the unreclaimed RAM to fluctuate. One thing I can try is to wait for the next release of Polars, and then see whether this difference between the compiled version and the version from PyPI persists. I tried to compile Polars up to dcb0806 (the 0.13.58 tag), but got a large number of errors.
As of Polars 0.13.59, the issue is resolved. After 100 tries, I could not replicate the issue: not a single GB of unreclaimed RAM. The OOM issues with row groups are also resolved. And thanks again @traviscross. I'm saving your code for querying glibc and jemalloc; it might come in very handy when troubleshooting future issues.
What language are you using?
Python
Have you tried latest version of polars?
yes
What version of polars are you using?
0.13.52
What operating system are you using polars on?
Linux Mint 20.3
What language version are you using
3.10.4
Describe your bug.
When using `scan_parquet` and the `slice` method, Polars allocates significant system memory that cannot be reclaimed until exiting the Python interpreter.

What are the steps to reproduce the behavior?
This is most easily seen when using a large parquet file. I'll reuse the `mem_squash` function from #3971 to create a large parquet file. Choose a significantly large value for `mem_in_GB` that is appropriate for your computing platform. Since writing a parquet file requires creating an in-memory copy of the dataset before writing, a large-ish parquet file that I can comfortably create on my system with total RAM of 512 GB is 225 GB.

Now, restart the Python interpreter. On my system, when I restart the Python interpreter, `top` shows that I have 3.55 GB of memory in use. (This does not include files buffered in RAM by the Linux system.)

Now, let's use the `slice` method to read a small number of records from the parquet file in the new Python interpreter instance. However, `top` shows RAM usage swell to 67.1 GB. (That does not include files buffered by the Linux system.) Attempting to free this RAM by forcing a garbage collection in Python does not change this.
Still, `top` shows 67.1 GB of RAM usage. And yet, I have no user-defined variables left in the Python interpreter. When I finally quit the Python interpreter, `top` shows the memory usage fall back to 3.52 GB.
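For reference, a condensed version of the reproduction above that measures RSS from inside Python instead of `top` (a sketch; `psutil` is an extra dependency not used in the original report):

```python
import gc
import polars as pl
import psutil

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

print(f"RSS at start: {rss_gb():.2f} GB")

df = pl.scan_parquet("tmp.parquet").slice(0, 100).collect()
print(f"RSS after slice(0, 100).collect(): {rss_gb():.2f} GB")

del df
gc.collect()
# On affected versions (e.g. 0.13.52), RSS stays tens of GB above the baseline.
print(f"RSS after del + gc.collect(): {rss_gb():.2f} GB")
```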
Other Notes

Using different types of slices, e.g., `slice(100, 1000)`, yields different amounts of RAM that cannot be reclaimed.

I came across this issue when attempting an answer to a Stack Overflow question. The OP's dataset is already sorted in such a way that `scan_parquet` and `slice` might be a fast, easy solution. Since the OP shares that memory pressure is an issue for the input dataset, I tried to make sure that my solution would work with large datasets. That's when I noticed odd memory issues on my machine. (#3971 was also discovered while working on a solution, as I tried to understand why I was seeing odd memory issues.)