Steady increase of unfreed memory #4343

Closed
CHDev93 opened this issue Aug 9, 2022 · 6 comments
Labels
bug Something isn't working

Comments


CHDev93 commented Aug 9, 2022

What language are you using?

Python

Have you tried latest version of polars?

yes

What version of polars are you using?

0.13.62

What operating system are you using polars on?

Windows 10

What language version are you using

python 3.8

Describe your bug.

I'm doing the following in a loop

  • reading some parquet files
  • doing some processing
  • throwing away the dataframe

I would expect the memory usage "high watermark" to be independent of the number of loops I run. Instead, it seems to increase steadily with each iteration. Perhaps related to this issue.

What are the steps to reproduce the behavior?

The code below should reproduce the increasing memory when run under mprof run (from Python's memory-profiler library).

# foo.py

import time
from pathlib import Path
from typing import List

import numpy as np
import polars as pl

DATA_DIR = Path("foo_data")
DATA_DIR.mkdir(exist_ok=True, parents=True)
BYTES_PER_FLOAT = 4


# @profile
def load_files_to_dataframe(files: List[Path]):
    load_query = pl.concat([pl.scan_parquet(f) for f in sorted(files)])
    return load_query


# @profile
def process_data(load_query, columns: List[str]):
    df_diff = pl.col("a") - pl.col("b")
    feature_query = [pl.col(col) / df_diff for col in columns]
    response_query = [
        (pl.col("p0") / df_diff - 1).alias("r0"),
        (pl.col("p1") / df_diff - 1).alias("r1"),
    ]

    df = (
        load_query.select(response_query + feature_query)
        .drop_nulls(subset=["r0", "r1"])
        .fill_null(0.0)
        .collect()
    )
    return df


# @profile
def load_and_preprocess_data(files: List[Path], columns: List[str]) -> pl.DataFrame:
    load_query = load_files_to_dataframe(files)

    df = process_data(load_query, columns)

    return df


# @profile
def make_data_files(n_files: int) -> List[Path]:
    n = int(1e6)
    files = [DATA_DIR / f"{i:04}.parquet" for i in range(n_files)]

    for f in files:
        df = pl.DataFrame(
            data={
                "a": np.random.rand(n),
                "b": np.random.rand(n),
                "p0": np.random.rand(n),
                "p1": np.random.rand(n),
                "pred_0": np.random.rand(n),
            }
        )

        df.write_parquet(f)

    n_megabytes = n_files * BYTES_PER_FLOAT * len(df.columns) * n / (1 << 20)
    print(f"Wrote {n_megabytes:.2f} megabytes", flush=True)
    return files


# @profile
def main():
    n_files = 5
    files = make_data_files(n_files)
    df = load_and_preprocess_data(files, columns=["pred_0"])
    time.sleep(0.5)


if __name__ == "__main__":
    for i in range(10):
        main()

What is the actual behavior?

Run mprof run foo.py (ideally with the @profile decorator added around main) and then mprof plot; you'll see each run of main start at a slightly higher memory usage, even though the dataframe is no longer in scope.

Unfortunately I can't upload the image from the environment I'm working in, but it should be very easy to reproduce.
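(A rough sketch of the same check without the mprof CLI, using memory_profiler's Python API; this assumes the memory-profiler package is installed and is illustrative, not the exact commands used.)

from memory_profiler import memory_usage

for i in range(10):
    # memory_usage runs main() and returns a list of memory samples in MiB.
    samples = memory_usage((main, (), {}), interval=0.01)
    print(f"run {i}: start={samples[0]:.1f} MiB, peak={max(samples):.1f} MiB", flush=True)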

What is the expected behavior?

I still don't entirely understand why the memory usage doesn't go back down to ~0 after the dataframe is released (the other issue linked to a Reddit post indicating this might be a CPython thing), but I really don't understand why memory usage should be steadily increasing.

For the actual, larger problem being worked on, this causes OOM issues that can only be resolved by doing the loading and preprocessing in a subprocess, which ensures the memory gets released.
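(For reference, a rough sketch of that subprocess workaround, building on the script above; the load_in_subprocess helper and the Arrow round-trip are illustrative assumptions, not the exact code used.)

import multiprocessing as mp

def _load_worker(files: List[Path], columns: List[str], queue) -> None:
    # Runs in a short-lived child process; all of its memory is returned
    # to the OS when the process exits.
    df = load_and_preprocess_data(files, columns)
    queue.put(df.to_arrow())  # hand the result back as an Arrow table

def load_in_subprocess(files: List[Path], columns: List[str]) -> pl.DataFrame:
    # On Windows, call this from under the `if __name__ == "__main__":` guard.
    queue = mp.Queue()
    p = mp.Process(target=_load_worker, args=(files, columns, queue))
    p.start()
    table = queue.get()  # read before join() to avoid a queue/pipe deadlock
    p.join()
    return pl.from_arrow(table)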

@CHDev93 CHDev93 added the bug Something isn't working label Aug 9, 2022
@ritchie46 (Member)

> I still don't entirely understand why the memory usage doesn't go back down to ~0 after the dataframe is released (the other issue linked to a Reddit post indicating this might be a CPython thing), but I really don't understand why memory usage should be steadily increasing.

Memory will often not go down to 0 because that is not how allocators work: if any object is still alive, the memory before that object cannot be returned. However, over time your memory should saturate to a constant amount if you run things in a loop.
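(A quick way to check for that saturation, as a sketch assuming psutil is installed: log the process RSS after each iteration and see whether it levels off.)

import os
import psutil

proc = psutil.Process(os.getpid())
for i in range(10):
    main()
    # RSS should level off after the first few iterations if the allocator
    # is reusing freed pages rather than growing the heap unboundedly.
    rss_mib = proc.memory_info().rss / (1 << 20)
    print(f"iteration {i}: rss = {rss_mib:.1f} MiB", flush=True)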

I tried to run your script, and also a few other parquet files I have locally, and I cannot reproduce your experience.

Could you share a bit more about your machine? How much memory do you have, etc.?

CHDev93 (Author) commented Aug 9, 2022

Okay, I see your point regarding the memory allocation: it's not defragmenting and consolidating the heap when you free something.

Anyway, I managed to run this on my home machine (macOS) and see something similar, though perhaps not as dramatic: it goes from a peak of around 775 MiB to about 1050 MiB (a ~35% increase). I was expecting it to go up and down to the same levels each time (modulo a bit of delta from the sampling nature of the profiler).

[Screenshot, 2022-08-09: mprof plot showing the rising peaks]

For the machine I'm actually interested in, I have 256GB RAM running Windows 10 with 32 logical processors.

I ran mprof run --include-children -T 0.01 foo.py, just for completeness' sake.

ghuls (Collaborator) commented Aug 10, 2022

@CHDev93 On Linux jemalloc is used, while on macOS and Windows mimalloc is used for memory allocation. I think jemalloc is slightly better at preventing heap fragmentation (although this could change in future releases).

CHDev93 (Author) commented Aug 10, 2022

@ghuls that's very interesting, thanks for pointing that out. So you don't observe the steadily increasing memory on Linux at all? If it comes down to the allocator implementation, that is very subtle.

cbilot commented Aug 10, 2022

For reference, here's what I see running your code above on Linux Mint 21 (Ubuntu 22.04 jammy): [Figure_1: mprof plot]

CHDev93 (Author) commented Aug 10, 2022

Will go ahead and close this, as it seems to be an allocator-specific detail that isn't likely to change. Thanks for clearing up this behaviour @ritchie46, @ghuls, @cbilot!

@CHDev93 CHDev93 closed this as completed Aug 10, 2022