slice on scan_parquet allocates memory that cannot be released #3972
This is also because of row groups. We have a single row group, leading to a large allocation. This would improve if we wrote smaller row groups.
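A minimal sketch of the suggested fix (not from the original thread): pyarrow's `write_table` accepts a `row_group_size`, so an existing file can be rewritten with smaller row groups. For a file of this size you would normally write it with small row groups in the first place, and newer Polars versions expose a similar argument on `write_parquet`.

```python
import pyarrow.parquet as pq

# Rewrite an existing parquet file with smaller row groups, so that a slice
# only has to materialize the row groups it actually overlaps.
table = pq.read_table("tmp.parquet")
pq.write_table(table, "tmp_small_rg.parquet", row_group_size=1_000_000)
```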
As of Polars 0.13.54, the code above (which does not use row groups) runs successfully but still yields a large chunk of non-reclaimable memory.

**Using Row Groups**

Unfortunately, attempting to write in row groups makes this worse: the result is an OOM.

**The Setup: DataFrame of 225 GB (in RAM), written in 10 Row Groups**

**Creating the Parquet File with Row Groups**

Initially, I struggled to control the number of row groups created when writing the file. So, I chose instead to create a 225 GB DataFrame composed of 10 equal-sized chunks, so that each chunk would be written as its own row group. The sole purpose of the `id` column is to distinguish the chunks.

The final DataFrame that will be written to a parquet file is 225 GB (in RAM) with a total of 1,118,481,070 rows and 27 columns, composed of 10 equal-sized chunks. After writing the parquet file, I have pyarrow verify the number of row groups.

```python
import polars as pl
import pyarrow.parquet as pq
import math

mem_in_GB = 225
nbr_row_groups = 10
mem_per_row_group = mem_in_GB / nbr_row_groups

def mem_squash(file_size_GB: float, id: int = 1) -> pl.DataFrame:
    # Build a DataFrame of roughly `file_size_GB` gigabytes of integer data.
    nbr_uint64 = file_size_GB * (2**30) / 8
    nbr_cols = math.ceil(nbr_uint64 ** (0.15))
    nbr_rows = math.ceil(nbr_uint64 / nbr_cols)
    return (
        pl.DataFrame(
            data={
                "col_" + str(col_nbr): pl.arange(0, nbr_rows, eager=True)
                for col_nbr in range(nbr_cols - 1)
            })
        .with_column(pl.repeat(id, nbr_rows, eager=True, name="id"))
        .select([pl.col('id'), pl.exclude('id')])
    )

# Concatenate 10 equal-sized chunks without rechunking, so each chunk can be
# written as its own row group.
df = pl.concat(
    items=[
        mem_squash(mem_per_row_group, id_nbr)
        for id_nbr in range(0, nbr_row_groups)
    ],
    rechunk=False,
)
df
df.estimated_size() / (2**30)
df.write_parquet('tmp.parquet')
pq.read_metadata('tmp.parquet')
```
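As a quick check (a sketch added here, not part of the original comment), the chunk count of the in-memory frame and the row-group count of the written file can be compared directly; `num_row_groups` is part of pyarrow's `FileMetaData`.

```python
# Expect 10 chunks (one per mem_squash() call) and, on the version under test,
# a matching number of row groups in the written file.
print(df.n_chunks())
print(pq.read_metadata("tmp.parquet").num_row_groups)
```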
**Scanning the Parquet File**

After restarting the Python interpreter, attempting to read a slice of the first 100 records results in an OOM.

```python
import polars as pl

ldf = pl.scan_parquet('tmp.parquet', parallel="auto")
df = ldf.slice(0, 100).collect()
```
Reading using … gives the same result. As does using …. And ….

My machine has more than enough RAM to hold even two copies of the 225 GB DataFrame in RAM.

It's odd that reading the file with no row groups runs successfully (but leads to unreclaimable RAM), while using row groups causes an OOM despite sufficient available RAM, even when using …. Please let me know if there's another scenario you'd like me to try. Happy to help.
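For completeness, a sketch of sweeping the other `parallel` strategies mentioned above (my own addition; the option names follow current Polars, i.e. "auto", "columns", "row_groups", and "none", and may have differed slightly in 0.13.x).

```python
import polars as pl

# Hypothetical sweep over the scan_parquet parallel strategies; on the affected
# versions, each of these hit the same OOM when the file contains row groups.
for strategy in ["auto", "columns", "row_groups", "none"]:
    df = pl.scan_parquet("tmp.parquet", parallel=strategy).slice(0, 100).collect()
    print(strategy, df.shape)
```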
The unreclaimable RAM when we have no row groups is expected. Arrow buffers consist of the memory buffer plus two attributes, an offset and a length; slicing only adjusts those attributes, so the whole buffer stays alive.

The OOM with row groups surprises me. I still have to research that one.
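To illustrate the buffer-sharing point (my own sketch using pyarrow, not Polars internals): a zero-copy slice keeps the entire parent buffer alive even though only a handful of values are reachable.

```python
import pyarrow as pa

arr = pa.array(range(1_000_000))   # int64 array, ~8 MB data buffer
view = arr.slice(0, 100)           # zero-copy: records an offset and a length

# The slice points at the same underlying memory as its parent...
print(view.buffers()[1].address == arr.buffers()[1].address)  # True
# ...and that buffer is still the full ~8 MB, not 100 * 8 bytes.
print(view.buffers()[1].size)
```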
I believe this should help: #4046
I'm still confused about something. I'm not concerned with Eager mode -
Linux might only give memory back to the OS in certain cases. It's normal that your process memory is equal to your peak memory: the RAM is given to the Polars process, and the Polars allocator will hold on to it and is free to reuse it for the rest of the process. See a more thorough explanation of this here. Can you do a heaptrack run to get more insight?
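One hedged way to capture such a profile from Python (my own sketch; `repro.py` is a hypothetical script containing the scan/slice reproduction, and the exact output file name depends on the heaptrack version):

```python
import subprocess
import sys

# Run the reproduction under heaptrack; it writes a heaptrack.* profile in the
# current directory that can be inspected with heaptrack_gui or heaptrack_print.
subprocess.run(["heaptrack", sys.executable, "repro.py"], check=True)
```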
Edit: I don't think it matters. I believe we already passed the correct slice length.
@cbilot: one thing you could do to show whether or not there's a memory leak here is to run ….

Both polars and libarrow use jemalloc, which is generally better about releasing memory back to the operating system. libparquet (from pyarrow) seems to use glibc's malloc.

If you want to dig into this further, you can ask glibc and jemalloc to give you details about their allocations. You can also ask them to release memory that they would not otherwise. Here's some code to do that. It may be a bit system-specific, so you may need to adjust it. The polars library distributed by pip doesn't include symbols, so you'll need to install polars from source for this to work.

```python
import psutil
import subprocess
import sys
from cffi import FFI  # `pip3 install -U cffi`
import polars as pl  # Install from source so as to get symbols.
# Also need `nm` from `binutils` installed.

def get_so_path(so_name):
    # Find the loaded shared object whose path contains `so_name`.
    return [x for x in psutil.Process().memory_maps()
            if so_name in x.path][0].path

def get_so_offset(so_path, base_name, name):
    # Offset of `name` relative to `base_name`, taken from nm's symbol table.
    syms = {
        x[2]: int(x[0], 16) for x in [
            x.split(' ') for x in subprocess.check_output([
                "nm", "--defined-only", so_path]).decode("utf-8").split("\n")
        ] if len(x) == 3}
    return syms[name] - syms[base_name]

def get_so_fn(ffi, so_path, base_name, name, ftype):
    # Resolve a non-exported symbol by adding its nm offset to the runtime
    # address of an exported base symbol.
    ffi.cdef(f"void {base_name}();")
    lib = ffi.dlopen(so_path)
    fdiff = get_so_offset(so_path, base_name, name)
    baddr = int(ffi.cast("uintptr_t", ffi.addressof(lib, base_name)))
    return ffi.cast(ftype, baddr + fdiff)

def glibc_malloc_trim():
    # Ask glibc malloc to return free memory to the OS.
    ffi = FFI()
    ffi.cdef("int malloc_trim(size_t);")
    lib = ffi.dlopen(None)
    assert(lib.malloc_trim(0) in [0, 1])

def glibc_malloc_info():
    # Print glibc malloc statistics (XML) to stdout.
    ffi = FFI()
    ffi.cdef("int malloc_info(int, FILE*);")
    lib = ffi.dlopen(None)
    assert(lib.malloc_info(0, ffi.cast("FILE*", sys.stdout)) == 0)

def jemalloc_mallctl_purge(so_name, base_name, name):
    # Purge dirty pages from all jemalloc arenas in the given shared object
    # (arena 4096 is MALLCTL_ARENAS_ALL).
    ffi = FFI()
    so_path = get_so_path(so_name)
    ftype = "int(*)(const char*, void*, size_t*, void*, size_t*)"
    mallctl = get_so_fn(ffi, so_path, base_name, name, ftype)
    cmd = b"arena.4096.purge"
    assert(mallctl(cmd, ffi.NULL, ffi.NULL, ffi.NULL, ffi.NULL) == 0)

def jemalloc_malloc_stats_print(so_name, base_name, name):
    # Print jemalloc statistics via its malloc_stats_print entry point.
    ffi = FFI()

    @ffi.callback("void(*)(void*, const char*)")
    def write_cb(handle, msg):
        stream = ffi.from_handle(handle)
        stream.write(ffi.string(msg).decode("utf-8"))

    stream = sys.stdout
    so_path = get_so_path(so_name)
    ftype = "void(*)(void(*)(void*, char*), void*, char*)"
    malloc_stats_print = get_so_fn(
        ffi, so_path, base_name, name, ftype)
    malloc_stats_print(write_cb, ffi.new_handle(stream), ffi.NULL)

def malloc_purge_all():
    glibc_malloc_trim()
    jemalloc_mallctl_purge(
        "/polars.abi3.so", "PyInit_polars", "_rjem_mallctl")
    jemalloc_mallctl_purge(
        "/libarrow.so", "arrow_strptime", "je_arrow_mallctl")

def malloc_info_all():
    print("## polars jemalloc stats\n")
    jemalloc_malloc_stats_print(
        "/polars.abi3.so", "PyInit_polars", "_rjem_malloc_stats_print")
    print("\n## arrow jemalloc stats\n")
    jemalloc_malloc_stats_print(
        "/libarrow.so", "arrow_strptime", "je_arrow_malloc_stats_print")
    print("\n## glibc malloc info\n")
    glibc_malloc_info()
```
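For example (a usage sketch, not from the original comment), after running the scan/slice reproduction you could purge both allocators, compare the process RSS before and after, and then dump the per-allocator statistics:

```python
import psutil

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

print(f"RSS before purge: {rss_gb():.1f} GB")
malloc_purge_all()   # defined above: trims glibc and purges both jemalloc instances
print(f"RSS after purge:  {rss_gb():.1f} GB")
malloc_info_all()    # defined above: prints glibc and jemalloc statistics
```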
Thanks @traviscross. I really appreciate the help. I have some odd news. Using compiled Polars up to 8f07335, I can no longer recreate this problem reliably. In about 50 tries, I could only reproduce it once. And unfortunately, I quit the session because I had forgotten to install a package, so I was not able to try out the functions you provided. Later, I'll try to force the issue by writing a loop that keeps trying until the issue appears. (Note: I can still recreate this problem reliably using 0.13.58 downloaded from PyPI.)

The only commit since 0.13.58 that seems possibly relevant would be ad15e93, an update of arrow2. Does …? Or perhaps does compiling Polars somehow change the allocator, compared to using pip? I used the following command during the compile:

Using …, running other non-related queries against the file does cause the unreclaimed RAM to fluctuate. One thing I can try is to wait for the next release of Polars, and then see whether this difference between the compiled version and the version from PyPI persists. I tried to compile Polars up to dcb0806 (the 0.13.58 tag), but got a large number of errors.
As of Polars 0.13.59, the issue is resolved. After 100 tries, I could not replicate the issue: not a single GB of unreclaimed RAM. The OOM issues with row groups are also resolved. And thanks again @traviscross. I'm saving your code for querying glibc and jemalloc; it might come in very handy when troubleshooting future issues.
What language are you using?
Python
Have you tried latest version of polars?
yes
What version of polars are you using?
0.13.52
What operating system are you using polars on?
Linux Mint 20.3
What language version are you using
3.10.4
Describe your bug.
When using `scan_parquet` and the `slice` method, Polars allocates significant system memory that cannot be reclaimed until exiting the Python interpreter.

What are the steps to reproduce the behavior?
This is most easily seen when using a large parquet file. I'll reuse the `mem_squash` function from #3971 to create a large parquet file. Choose a significantly large value for `mem_in_GB` that is appropriate for your computing platform. Since writing a parquet file requires creating an in-memory copy of the dataset before writing, a large-ish parquet file that I can comfortably create on my system with total RAM of 512 GB is 225 GB.

Now, restart the Python interpreter. On my system, when I restart the Python interpreter, `top` shows that I have 3.55 GB of memory in use. (This does not include files buffered in RAM by the Linux system.)

Now, let's use the `slice` method to read a small number of records from the parquet file in the new Python interpreter instance. However, `top` shows RAM usage swell to 67.1 GB. (That does not include files buffered by the Linux system.) Attempting to free this RAM by forcing a garbage collection in Python does not change this.
Still, `top` shows 67.1 GB of RAM usage. And yet, I have no user-defined variables left in the Python interpreter. When I finally quit the Python interpreter, `top` shows the memory usage fall back to 3.52 GB.
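For reference, a condensed version of the reproduction above that measures RSS from inside Python instead of `top` (a sketch; `psutil` is an extra dependency not used in the original report):

```python
import gc
import polars as pl
import psutil

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

print(f"RSS at start: {rss_gb():.2f} GB")

df = pl.scan_parquet("tmp.parquet").slice(0, 100).collect()
print(f"RSS after slice(0, 100).collect(): {rss_gb():.2f} GB")

del df
gc.collect()
# On affected versions (e.g. 0.13.52), RSS stays tens of GB above the baseline.
print(f"RSS after del + gc.collect(): {rss_gb():.2f} GB")
```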
Other Notes

Using different types of slices, e.g., `slice(100, 1000)`, yields different amounts of RAM that cannot be reclaimed.

I came across this issue when attempting an answer to a Stack Overflow question. The OP's dataset is already sorted in such a way that `scan_parquet` and `slice` might be a fast, easy solution. Since the OP shares that memory pressure is an issue for the input dataset, I tried to make sure that my solution would work with large datasets. That's when I noticed odd memory issues on my machine. (#3971 was also discovered while working on a solution, as I tried to understand why I was seeing odd memory issues.)