[Execution State] Linux page cache is holding 425+ GB RAM, so make it drop checkpoint files to free up 294-394GB RAM #2261

fxamacker · 2022-04-06T16:30:50Z

Problem

On execution nodes, Linux eventually holds 425+ GB RAM in its file cache (shown as buff/cache in top). This is caused by Linux automatically caching files we read or write.
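The buff/cache figure can also be read directly from /proc/meminfo, which makes it easy to log before and after checkpointing. The sketch below assumes the procps convention that buff/cache is the sum of the Buffers, Cached, and SReclaimable counters; the function name is hypothetical and not part of flow-go.

```python
def buff_cache_kib(meminfo_text: str) -> int:
    """Return Buffers + Cached + SReclaimable (in KiB) from /proc/meminfo text.

    Assumption: this sum matches what top and free label as buff/cache.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            fields[name.strip()] = int(parts[0])
    return sum(fields.get(key, 0) for key in ("Buffers", "Cached", "SReclaimable"))
```

Passing the contents of /proc/meminfo to this function before and after a checkpoint run gives the cache growth attributable to that run, provided nothing else on the node is doing heavy file I/O at the same time.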

Among other problems, Grafana's EN Memory Usage chart (and other tools) doesn't exclude the Linux page cache, which obscures Go's memory usage patterns. For example, Grafana doesn't show that operational memory dropped by 250+ GB as a result of PR #1944.

UPDATE: As of June 15 (mainnet18 spork) checkpoint files are 98GB.

Reading and writing large (98GB) files can cause Linux to keep them cached even after the program exits. For example, checkpointing includes:

reading 98 GB from old checkpoint file
writing 98 GB to a new checkpoint file

This 196GB growth in the file cache after each checkpointing (98GB read plus 98GB written) is cumulative, and Linux can end up automatically caching 3-5 checkpoint files in memory.

Updates epic #1744

The Proposed Solution

Avoid clearing out the entire file system cache.
Drop the new checkpoint file (that was created) from the cache
Drop the old checkpoint file (that was read) from the cache
Optionally, also do this for WAL files
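One way to drop a single file from the page cache without touching the rest of the cache is posix_fadvise with POSIX_FADV_DONTNEED. The following is a minimal sketch of that idea, not the code merged for this issue; the function name is hypothetical, and it assumes Linux, where the advice is only a hint and dirty pages are skipped unless they are flushed first.

```python
import os


def drop_file_from_page_cache(path: str) -> None:
    """Ask the kernel to evict cached pages of one file (Linux, best effort)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # DONTNEED does not evict dirty pages, so flush pending writes first.
        # This matters for the newly written checkpoint file.
        os.fsync(fd)
        # offset=0 and length=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The call leaves file contents untouched, so it can be applied to the old checkpoint after it is read and to the new checkpoint after it is written and synced.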

Proof of concept

On benchnet (using 53GB files), checkpoint creation began with the OS file cache at around 2GB. Once checkpoint file loading and creation begin, OS cache use can peak at around 106GB and then stay at about 105GB after the benchmark program exits.

Run checkpoint.00003464 creation benchmark.
OS file cache will be around 105 GB after benchmark program exits.
Run dd if=checkpoint.00003464 iflag=nocache count=0 (these params won't modify files).
OS file cache will immediately drop by the checkpoint file size (around 53GB).

Outside of benchnet, @zhangchiqing confirmed that using the dd command on 3 checkpoint files also reduced the memory used by the OS cache by the combined file sizes.
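Applying the same dd invocation to several files can be scripted. The sketch below only wraps the command shown in the proof of concept; the helper names are hypothetical, and it assumes a GNU coreutils dd that supports iflag=nocache.

```python
import subprocess
from typing import Iterable, List


def dd_nocache_argv(path: str) -> List[str]:
    """Build the dd command from the proof of concept for one file."""
    # count=0 copies no blocks, so the file itself is not read or modified.
    return ["dd", f"if={path}", "iflag=nocache", "count=0"]


def drop_files_with_dd(paths: Iterable[str]) -> None:
    """Run the dd command once per file; raises if dd exits non-zero."""
    for path in paths:
        subprocess.run(dd_nocache_argv(path), check=True)
```

For example, drop_files_with_dd(["checkpoint.00003464"]) runs the command from the steps above for that one file.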

Caveats

This is primarily aimed at having Grafana and other tools show expected memory use (to avoid hunting for nonexistent leaks, etc.).
May need to look into special considerations when running inside a container.

fxamacker · 2022-06-03T18:42:30Z

Closed by #2280 on April 18, 2022.

fxamacker changed the title ~~[Execution State] Drop checkpoint files from OS file cache to free up 50-100GB RAM after checkpoint creation~~ [Execution State] Drop checkpoint files from OS file cache to free up 105GB RAM after checkpoint creation Apr 6, 2022

fxamacker self-assigned this Apr 15, 2022

fxamacker changed the title ~~[Execution State] Drop checkpoint files from OS file cache to free up 105GB RAM after checkpoint creation~~ [Execution State] Linux file cache is holding 425+ GB RAM, so make it drop checkpoint files to free up 132-264+GB RAM Apr 18, 2022

fxamacker changed the title ~~[Execution State] Linux file cache is holding 425+ GB RAM, so make it drop checkpoint files to free up 132-264+GB RAM~~ [Execution State] Linux file cache is holding 425+ GB RAM, so make it drop checkpoint files to free up 198-264+GB RAM Apr 18, 2022

fxamacker mentioned this issue Apr 18, 2022

Drop checkpoint files from OS file cache to prevent file cache growing to hundreds of GB #2280

Merged

fxamacker added the Performance label Jun 3, 2022

fxamacker closed this as completed Jun 3, 2022

fxamacker changed the title ~~[Execution State] Linux file cache is holding 425+ GB RAM, so make it drop checkpoint files to free up 198-264+GB RAM~~ [Execution State] Linux page cache is holding 425+ GB RAM, so make it drop checkpoint files to free up 294-394GB RAM Jun 16, 2022

fxamacker added the Execution Cadence Execution Team label Jul 15, 2022
