RAM usage and predicate pushdown #3974
I will answer this question in relation to #3971 and #3972 as well. I believe there are a few things at play here.
Currently polars writes a parquet file into a single row group. This has the benefit that we don't have to rechunk when reading, but it also has some downsides:
I think we should default to writing to multiple row groups.
Finally, if RAM is still tight, as a last low-memory resort we can turn off parallelism. This will lead to reading a single row group / single column at a time, and therefore only a small part of the file needs to be in memory at once.
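As a rough sketch of the multi-row-group idea (the row_group_size argument is available on write_parquet in recent Polars releases; the file name and sizes below are made up for illustration):

```python
import polars as pl

# Write the frame into many smaller row groups instead of one big one,
# so a later scan can process (and drop) one group at a time.
df = pl.DataFrame({"id": range(1_000_000), "value": range(1_000_000)})
df.write_parquet("example.parquet", row_group_size=100_000)  # ~10 row groups

# With row-group statistics, a selective filter ideally only needs to
# decompress the groups that can contain matching rows.
out = pl.scan_parquet("example.parquet").filter(pl.col("id") == 123_456).collect()
```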
Using row groups did not affect the result. I still get OOM in Step 4 (that is, Polars is requiring more than 200 GB of RAM to fetch one record from a parquet file of approx 800 million records). I kept everything else the same, but used the row-group approach described above when writing the parquet file.
That is interesting (and not what I expected). Do you know what the selectivity of the filter is (e.g. ratio of rows filtered)? I shall explore this further.
#4006 reduces peak memory when reading over row groups in parallel.
I think I'm experiencing this issue. I'm trying to inspect a large Parquet file (approx 3 GB on disk) using the lazy api.

```python
import polars as pl
df = pl.scan_parquet('file.parquet')
df.select([
    pl.col('columnA')
]).head().collect()
```

```
<polars.internals.lazy_frame.LazyFrame at 0x1e1b2cd3100>
# memory allocation of 2952790016 bytes failed
```

I was hoping that it wouldn't use too much RAM due to using the lazy API.
@daviewales try upgrading to a newer release.

If I have a chance today, I want to benchmark how much RAM Polars uses to read a Parquet file, given the size of the row groups (in RAM), given a low selectivity, and considering the number of parallel threads that are concurrently decompressing and processing row groups. For example, if each row group is 1 GB (in RAM) and I have 8 threads, how much RAM should we predict that Polars will need to process the query on a compressed parquet file? Of course, this means purposely provoking OOM situations, so this might take some time. (I can't do anything else on my machine while this is running.)
@cbilot: you might look into cgroups. You can provoke Linux to OOM kill a process long before it threatens your entire machine. E.g.:

```
$ umask 022 && cd /sys/fs/cgroup/
$ sudo mkdir pylimit && cd pylimit
$ echo $((20 * 2**20)) | sudo tee memory.max
$ echo $$ | sudo tee cgroup.procs
$ python -c "x = [x for x in range(100 * 2**20)]"
Killed
$ grep oom memory.events
oom 1
oom_kill 1
```
@cbilot Just upgraded to polars 0.13.58, and I can now do all of the following without crashing due to RAM usage:
The following still crashes:
So, a definite improvement.
One other natural and useful place to do predicate pushdown is during cross joins. E.g., this query would return only one row, but it instead runs out of memory unless you have quite a lot of it.

```python
import polars as pl

x = pl.DataFrame([pl.Series(
    "x", pl.arange(0, 2**16 - 1, eager=True) % 2**15
).cast(pl.UInt16)])
x.lazy().join(x.lazy(), how="cross", suffix="_").filter(
    (pl.col("x") & pl.col("x_")) == 0x7fff).collect()
```

We hit this in an actual use case.
Unless anyone has any remaining issues, we can close this. Using Polars 0.13.59, I created a DataFrame that occupies 225 GB of RAM, and stored this DataFrame as a Parquet file split into 10 row groups. (For reference, the saved Parquet file is 120.2 GB on disk.) Thus, each row group of the Parquet file represents (conceptually) a DataFrame that would occupy 22.5 GB of RAM when fully loaded. I then ran a simple filter query against the Parquet file using scan_parquet.
@traviscross This was a great suggestion. (FYI, I think #4194 addresses this.)
#4194 was similar, but for slices instead of predicates. I will follow up on @traviscross's suggestion.
What language are you using?
Python
Have you tried latest version of polars?
yes
What version of polars are you using?
0.13.52
What operating system are you using polars on?
Linux Mint 20.3
What language version are you using?
3.10.4
My Question
When using filter, fetch, limit, or slice with scan_parquet, it seems that the entire contents (or nearly all) of the file are loaded into RAM before the filter is applied. Should this occur?

I would post this question on Stack Overflow, except that the MWE setup is somewhat complex.
MWE
This MWE is configurable, so anyone should be able to replicate this. We'll also re-use the mem_squash function from #3972 and #3971.
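The actual mem_squash function is defined in #3972; purely as a stand-in for readers, a hypothetical helper with the same shape (a function returning a DataFrame that occupies roughly a requested number of GB) might look like this:

```python
import numpy as np
import polars as pl

def mem_squash(n_gb: float) -> pl.DataFrame:
    # Hypothetical stand-in for the helper from #3972: build a Float64 column
    # (~8 bytes per value) large enough to occupy roughly n_gb of RAM.
    n_rows = int(n_gb * (2**30) // 8)
    return pl.DataFrame({"col_0": np.random.default_rng().random(n_rows)})
```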
RAM usage and Garbage Collection
One complication with observing RAM usage is garbage collection. At any point, the RAM used by Python/Polars might include objects waiting to be garbage-collected. Thus, merely observing RAM usage using top (or similar tools) may not be representative of the RAM that is actually required by an algorithm.

As such, this MWE is designed to force an OOM situation to demonstrate that Polars is reading the entire contents (or nearly all) of a file into RAM before applying filtering.
Overview
Here's how we'll show that Polars is reading the entire contents of the file into RAM:

1. Create a DataFrame of 225 GB (in RAM) and save it to a parquet file.
2. Restart the Python interpreter.
3. Create a "boulder": fill 300 GB of RAM with objects that cannot be garbage-collected, leaving roughly 200 GB available.
4. Run scan_parquet and filter for a single record from the parquet file created in step 1 above.

In step 4 above, if Polars needs more than 200 GB of RAM to run the scan_parquet and filter, then an OOM will occur. This should demonstrate that Polars is consuming more than 200 GB to read a single record from the parquet file.

Presumably, Python/Polars will use garbage collection to reclaim as much RAM as possible before allowing an OOM. This sidesteps the issues that occur when merely watching overall RAM usage using top. In essence, we are forcing the issue to reveal itself.

Step 1: Create the Parquet File
I'll create a DataFrame of 225 GB (in RAM) and save it to a parquet file. We'll later use this in step 4 below.
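The real Step 1 code isn't reproduced here; under the assumptions above (the mem_squash-style helper sketched earlier, and an illustrative file name), it would look roughly like:

```python
# Hypothetical reconstruction of Step 1: build a ~225 GB DataFrame with the
# mem_squash-style helper sketched earlier and write it out as parquet
# (which, at the time, produced a single row group by default).
df = mem_squash(225)
df.write_parquet("tmp.parquet")
del df
```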
Steps 2 & 3: Restart the Python Interpreter and Create the Boulder
Restart the Python interpreter and use mem_squash once again, this time to create the "boulder".

The boulder is now occupying 300 GB of my system RAM and cannot be garbage-collected. top shows that I have only 200 GB of available RAM for the next step.
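Again as a sketch under the same assumptions, the "boulder" is simply a large object kept alive by a reference:

```python
# Holding the result in a variable means this ~300 GB cannot be reclaimed
# by garbage collection while the next query runs.
boulder = mem_squash(300)
```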
Step 4: scan_parquet and filter
Now we'll run a set of queries, using scan_parquet along with filter, limit, fetch, and slice, and observe what happens. The filter below should return only one record from the parquet file.
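The original query code isn't shown above; a hypothetical version of the variants being compared (the file name, column name, and matched value are assumptions):

```python
import polars as pl

lf = pl.scan_parquet("tmp.parquet").filter(pl.col("id") == 712_345_678)

lf.collect()               # filter only
lf.slice(0, 1).collect()   # slice
lf.limit(1).collect()      # limit
lf.fetch(1)                # fetch: runs the query on a limited number of rows from the source
```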
slice
Re-running Steps 2, 3, and 4 (and showing only the last two lines):

limit

fetch
fetch does succeed if there are 200 GB of remaining system RAM. However, if I increase the "boulder" to 400 GB, leaving a mere 100 GB for fetch to run, we get an OOM.

Discussion
The above came about as I was about to propose a solution to a Stack Overflow question. The OP states that memory pressure is an issue, so I wanted to ensure that my solution would work in a situation where a file is too large to load in RAM. (Hence the use of a "boulder" above, as well as the mem_squash function.)

Is the above issue one of row groups? That is, since my saved parquet file is created with a single row group, is Polars required to read the entire file contents in lazy mode?
Is this an issue of parallelism? That is, should I have set the parallel option on scan_parquet to something other than "auto"?
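On the parallelism question, a hedged example of the parallel argument to scan_parquet (valid strategies include "auto", "columns", "row_groups", and "none"; the file and column names are assumptions):

```python
import polars as pl

# "none" disables parallel reading, processing one row group / column at a
# time and trading throughput for a lower peak memory footprint.
lf = pl.scan_parquet("tmp.parquet", parallel="none")
out = lf.filter(pl.col("id") == 712_345_678).collect()
```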