Steady increase of unfreed memory #4343

Closed
CHDev93 opened this issue Aug 9, 2022 · 6 comments
Labels
bug Something isn't working

Comments


CHDev93 commented Aug 9, 2022

What language are you using?

Python

Have you tried latest version of polars?

yes

What version of polars are you using?

0.13.62

What operating system are you using polars on?

Windows 10

What language version are you using

python 3.8

Describe your bug.

I'm doing the following in a loop

  • reading some parquet files
  • doing some processing
  • throwing away the dataframe

I would expect the memory usage "high watermark" to be independent of the number of loops I run. Instead, it seems to increase steadily with each iteration. Perhaps related to this issue.

What are the steps to reproduce the behavior?

The code below should reproduce the increasing memory when run under mprof run (from Python's memory-profiler library).

# foo.py

import time
from pathlib import Path
from typing import List

import numpy as np
import polars as pl

DATA_DIR = Path("foo_data")
DATA_DIR.mkdir(exist_ok=True, parents=True)
BYTES_PER_FLOAT = 4


# @profile
def load_files_to_dataframe(files: List[Path]):
    load_query = pl.concat([pl.scan_parquet(f) for f in sorted(files)])
    return load_query


# @profile
def process_data(load_query, columns: List[str]):
    df_diff = pl.col("a") - pl.col("b")
    feature_query = [pl.col(col) / df_diff for col in columns]
    response_query = [
        (pl.col("p0") / df_diff - 1).alias("r0"),
        (pl.col("p1") / df_diff - 1).alias("r1"),
    ]

    df = (
        load_query.select(response_query + feature_query)
        .drop_nulls(subset=["r0", "r1"])
        .fill_null(0.0)
        .collect()
    )
    return df


# @profile
def load_and_preprocess_data(files: List[Path], columns: List[str]) -> pl.DataFrame:
    load_query = load_files_to_dataframe(files)

    df = process_data(load_query, columns)

    return df


# @profile
def make_data_files(n_files: int) -> List[Path]:
    n = int(1e6)
    files = [DATA_DIR / f"{i:04}.parquet" for i in range(n_files)]

    for f in files:
        df = pl.DataFrame(
            data={
                "a": np.random.rand(n),
                "b": np.random.rand(n),
                "p0": np.random.rand(n),
                "p1": np.random.rand(n),
                "pred_0": np.random.rand(n),
            }
        )

        df.write_parquet(f)

    n_megabytes = n_files * BYTES_PER_FLOAT * len(df.columns) * n / (1 << 20)
    print(f"Wrote {n_megabytes:.2f} megabytes", flush=True)
    return files


# @profile
def main():
    n_files = 5
    files = make_data_files(n_files)
    df = load_and_preprocess_data(files, columns=["pred_0"])
    time.sleep(0.5)


if __name__ == "__main__":
    for i in range(10):
        main()

What is the actual behavior?

Run mprof run foo.py (ideally with the @profile decorator added around main) and then mprof plot; you'll see each run of main start at a slightly higher memory usage, even though the dataframe is no longer in scope.

Unfortunately I can't upload the image from the environment I'm working in, but it should be very easy to reproduce.
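(A rough sketch of the same check without the mprof CLI, using memory_profiler's Python API; this assumes the memory-profiler package is installed and is illustrative, not the exact commands used.)

from memory_profiler import memory_usage

for i in range(10):
    # memory_usage runs main() and returns a list of memory samples in MiB.
    samples = memory_usage((main, (), {}), interval=0.01)
    print(f"run {i}: start={samples[0]:.1f} MiB, peak={max(samples):.1f} MiB", flush=True)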

What is the expected behavior?

I still don't entirely understand why the memory usage doesn't go back down to ~0 after the dataframe is released (the other issue linked to a Reddit post indicating this might be a CPython thing), but I really don't understand why memory usage should be steadily increasing.

For the actual, larger problem being worked on, this causes OOM issues that can only be resolved by doing the loading and preprocessing in a subprocess, which ensures the memory gets released.
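(For reference, a rough sketch of that subprocess workaround, building on the script above; the load_in_subprocess helper and the Arrow round-trip are illustrative assumptions, not the exact code used.)

import multiprocessing as mp

def _load_worker(files: List[Path], columns: List[str], queue) -> None:
    # Runs in a short-lived child process; all of its memory is returned
    # to the OS when the process exits.
    df = load_and_preprocess_data(files, columns)
    queue.put(df.to_arrow())  # hand the result back as an Arrow table

def load_in_subprocess(files: List[Path], columns: List[str]) -> pl.DataFrame:
    # On Windows, call this from under the `if __name__ == "__main__":` guard.
    queue = mp.Queue()
    p = mp.Process(target=_load_worker, args=(files, columns, queue))
    p.start()
    table = queue.get()  # read before join() to avoid a queue/pipe deadlock
    p.join()
    return pl.from_arrow(table)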

@CHDev93 CHDev93 added the bug Something isn't working label Aug 9, 2022
@ritchie46 (Member)

> I still don't entirely understand why the memory usage doesn't go back down to ~0 after the dataframe is released (the other issue linked to a Reddit post indicating this might be a CPython thing), but I really don't understand why memory usage should be steadily increasing.

Memory will often not go down to 0 because that is not how allocators work: if any object is still alive, the memory before that object cannot be returned. However, over time your memory should saturate to a constant amount if you run things in a loop.
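(A quick way to check for that saturation, as a sketch assuming psutil is installed: log the process RSS after each iteration and see whether it levels off.)

import os
import psutil

proc = psutil.Process(os.getpid())
for i in range(10):
    main()
    # RSS should level off after the first few iterations if the allocator
    # is reusing freed pages rather than growing the heap unboundedly.
    rss_mib = proc.memory_info().rss / (1 << 20)
    print(f"iteration {i}: rss = {rss_mib:.1f} MiB", flush=True)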

I tried to run your script, and also a few other parquet files I have locally, and I cannot reproduce your experience.

Could you share a bit more about your machine? How much memory do you have, etc.?

CHDev93 (Author) commented Aug 9, 2022

Okay, I see your point regarding the memory allocation: it's not defragmenting and consolidating the heap when you free something.

Anyway, I managed to run this on my home machine (macOS) and see something similar, though perhaps not as dramatic: it goes from a peak of around 775 MiB to about 1050 MiB (a ~35% increase). I was expecting it to go up and down to the same levels each time (modulo a bit of delta from the sampling nature of the profiler).

[Screenshot, 2022-08-09: mprof plot showing the rising peaks]

For the machine I'm actually interested in, I have 256GB RAM running Windows 10 with 32 logical processors.

I ran mprof run --include-children -T 0.01 foo.py, just for completeness' sake.

ghuls (Collaborator) commented Aug 10, 2022

@CHDev93 On Linux jemalloc is used, while on macOS and Windows mimalloc is used for memory allocation. I think jemalloc is slightly better at preventing heap fragmentation (although this could change in future releases).

CHDev93 (Author) commented Aug 10, 2022

@ghuls that's very interesting, thanks for pointing that out. So you don't observe the steadily increasing memory on Linux at all? If it comes down to the allocator implementation, that is very subtle.

cbilot commented Aug 10, 2022

For reference, here's what I see running your code above on Linux Mint 21 (Ubuntu 22.04 jammy): [Figure_1: mprof plot]

CHDev93 (Author) commented Aug 10, 2022

Will go ahead and close this, as it seems to be an allocator-specific detail that isn't likely to change. Thanks for clearing up this behaviour @ritchie46, @ghuls, @cbilot!

@CHDev93 CHDev93 closed this as completed Aug 10, 2022