Memory load includes the file cache on Docker Linux #49058

Closed
robertdavidsmith opened this issue Mar 3, 2021 · 13 comments · Fixed by #51738
Labels
area-GC-coreclr tenet-performance Performance related issue
Milestone
6.0.0

Comments

@robertdavidsmith

Description

We have a complex C# application that makes heavy use of memory mapped files.

In Docker it spends most of its time at the container memory limit, but this is only because of the page cache, so it shouldn't be a problem. Our app isn't failing due to lack of memory. However, it runs at half the speed on Linux compared to Windows, and G2 garbage collections are far more frequent on Linux.

We suspected this was because the process garbage collects too aggressively, wrongly thinking it is about to run out of memory (when in fact it is only harmlessly hitting the limit because of the page cache). We ran an experiment to test this theory: we allocated and wrote to a large chunk of memory outside the container (swap is off, so this stays in RAM). This stops the C# app from hitting the container memory limit, because the page cache is now constrained by the physical RAM on the machine rather than by the cgroup limit. Allocating the large chunk outside the container doubles the speed of our C# app and reduces G2 collections by 8x!

This looks similar to #13371 but for the cgroup code path.

I tried working around the issue by setting System.GC.HighMemoryPercent to 100, but this makes no difference, probably because HighMemoryPercent is capped at 99 in the code (search for min (99, highmem_th_from_config) in https://raw.githubusercontent.com/dotnet/runtime/main/src/coreclr/gc/gc.cpp).
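
To see the effect directly from managed code, here is a minimal monitoring sketch (not our actual app; it just polls GC.GetGCMemoryInfo() once a second) that prints the memory load the runtime thinks it is under next to the high-memory threshold and the G2 collection count:

// Minimal monitoring sketch (not our production app). On Docker Linux,
// MemoryLoadBytes is the value this issue is about: it tracks the cgroup's
// usage, which includes the page cache.
using System;
using System.Threading;

class GcLoadMonitor
{
    static void Main()
    {
        while (true)
        {
            GCMemoryInfo info = GC.GetGCMemoryInfo();
            Console.WriteLine(
                $"MemoryLoad: {info.MemoryLoadBytes / (1024 * 1024)} MB, " +
                $"HighMemoryLoadThreshold: {info.HighMemoryLoadThresholdBytes / (1024 * 1024)} MB, " +
                $"TotalAvailable: {info.TotalAvailableMemoryBytes / (1024 * 1024)} MB, " +
                $"HeapSize: {info.HeapSizeBytes / (1024 * 1024)} MB, " +
                $"G2Collections: {GC.CollectionCount(2)}");
            Thread.Sleep(1000);
        }
    }
}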

Configuration

This effect has been observed using both the .NET Core 3.1.3 and 5.0.0 runtimes.

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
$ uname -a
Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ docker -v
Docker version 18.06.1-ce, build e68fc7a

Regression?

This has always been the case, as far as I know.

Data

Typical example:
~700GiB Memory free at start: 10421 seconds, GC counts: G0=111932, G1=29224, G2=812
27GiB Memory free at start: 5592 seconds, GC Counts: G0=110433, G1=29013, G2=98

(Container memory limit is 36GiB, machine has 768GiB in total)

Analysis

Possibly an issue in https://github.com/dotnet/runtime/blob/main/src/coreclr/gc/unix/cgroup.cpp?

@robertdavidsmith robertdavidsmith added the tenet-performance Performance related issue label Mar 3, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-GC-coreclr untriaged New issue has not been triaged by the area owner labels Mar 3, 2021
@ghost

ghost commented Mar 3, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9
Member

mangod9 commented Mar 3, 2021

Hey @robertdavidsmith, is there a standalone repro you could share for this issue?

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Mar 3, 2021
@mangod9 mangod9 added this to the 6.0.0 milestone Mar 3, 2021
@robertdavidsmith
Author

Not currently; the app is closed-source and relies on our internal infrastructure.

If the problem isn't obvious, I can try to create one. It looks like https://github.com/dotnet/runtime/blob/main/src/coreclr/gc/unix/cgroup.cpp is including the file cache in the memory load, though.

@mangod9
Member

mangod9 commented Mar 3, 2021

Ok sure, we will take a look.

@marcin-krystianc
Contributor

Hi @mangod9, I'm working on this issue with @robertdavidsmith, and I've prepared a standalone repro for it.

I'm using Docker on Windows, v3.2.2, with the Hyper-V backend, a 4 GB memory limit and a 1 GB swap limit. I've also tested this with Docker on Linux with no swap, and my results were very similar.

  • Check out and build the GcTesting app:
    docker build -t gctesting .
  • Execute the GcTesting app:
    docker run -i -t --env COMPlus_gcServer=0 --memory=3gb --name=gctesting gctesting --memorypressuretask=true --memorypressurerate=100mb --allocationunitsize=160Kb --minimummemoryusage=0mb
  • Every second it prints GC stats. Wait a few minutes for the GC-Rate numbers to stabilise; in my case I get about 1920 GC/min:
...
Elapsed:103s, GC-Rate:1920, Gen012:3296,3296,3296, Total:2.91 MB, GcAllocated:10.07 GB, HeapSize:430.40 KB, MemoryLoad:0 Bytes, Committed:792.00 KB, Available:2.25 GB, HighMemoryLoadThreshold:2.70 GB, CGroupUsageInBytes:28.18 MB ...
Elapsed:104s, GC-Rate:1919, Gen012:3328,3328,3328, Total:1.81 MB, GcAllocated:10.16 GB, HeapSize:461.23 KB, MemoryLoad:0 Bytes, Committed:792.00 KB, Available:2.25 GB, HighMemoryLoadThreshold:2.70 GB, CGroupUsageInBytes:27.06 MB ...
Elapsed:105s, GC-Rate:1920, Gen012:3361,3361,3361, Total:2.44 MB, GcAllocated:10.26 GB, HeapSize:492.09 KB, MemoryLoad:0 Bytes, Committed:792.00 KB, Available:2.25 GB, HighMemoryLoadThreshold:2.70 GB, CGroupUsageInBytes:27.81 MB ...
...
  • Start an interactive bash shell on the running container.
     docker exec -i -t gctesting /bin/bash
  • Run htop to see memory stats:
    [screenshot: screen-1]
  • Start another interactive bash shell on the container.
  • Run another instance of the test app inside the same container. This time it performs only I/O operations, so it fills up the file cache (a simplified sketch of the idea follows these steps):
    dotnet GcTesting.dll --memorypressuretask=false --filepressuretask=true --filepressuresize=4gb
  • Wait a few more minutes to get new GC-Rate numbers; in my case I get about 9210 GC/min (almost a 5x increase):
    [screenshot: screen-2]
...
Elapsed:231s, GC-Rate:9216, Gen012:16296,16296,6956, Total:4.45 MB, GcAllocated:22.67 GB, HeapSize:4.32 MB, MemoryLoad:3.00 GB, Committed:4.46 MB, Available:2.25 GB, HighMemoryLoadThreshold:2.70 GB, CGroupUsageInBytes:3.00 GB ...
...
  • Kill the I/O instance; the GC-Rate will not go down, because the I/O cache stays in use:
    [screenshot: screen-3]
...
Elapsed:298s, GC-Rate:9083, Gen012:26454,26454,8679, Total:1.42 MB, GcAllocated:29.21 GB, HeapSize:637.42 KB, MemoryLoad:2.97 GB, Committed:920.00 KB, Available:2.25 GB, HighMemoryLoadThreshold:2.70 GB, CGroupUsageInBytes:2.98 GB ...
...
  • To get the GC-Rate back to the initial level we need to purge the I/O cache, so we run the GcTesting command again:
    dotnet GcTesting.dll --memorypressuretask=false --unmanagedmemorypressuretask=true --minimumunmanagedmemoryusage=2gb --unmanagedmemorypressurerate=0mb
    [screenshot: screen-4]
  • Kill the I/O-cache-purging instance. This time the GC-Rate goes back to the initial level (about 1920/min).
    [screenshot: screen-5]
...
Elapsed:432s, GC-Rate:1920, Gen012:37548,37548,12608, Total:3.94 MB, GcAllocated:42.31 GB, HeapSize:1.02 MB, MemoryLoad:983.04 MB, Committed:1.21 MB, Available:2.25 GB, HighMemoryLoadThreshold:2.70 GB, CGroupUsageInBytes:999.05 MB ...
...
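
For reference, the file-pressure task is conceptually just bulk file I/O that grows the kernel page cache, and with it the cgroup's usage. A simplified sketch of the idea (not the exact code in the repo):

// Simplified sketch of the "file pressure" idea (not the exact GcTesting code):
// write a large file, then keep re-reading it so the kernel page cache, and
// therefore the cgroup's memory usage, grows towards the container limit.
using System;
using System.IO;

class FilePressure
{
    static void Main()
    {
        const long targetBytes = 4L * 1024 * 1024 * 1024; // roughly --filepressuresize=4gb
        var buffer = new byte[1024 * 1024];
        new Random().NextBytes(buffer);
        string path = Path.Combine(Path.GetTempPath(), "filepressure.bin");

        using var fs = new FileStream(path, FileMode.Create, FileAccess.ReadWrite);
        for (long written = 0; written < targetBytes; written += buffer.Length)
            fs.Write(buffer, 0, buffer.Length);
        fs.Flush();

        // Keep re-reading so the cached pages stay in use.
        while (true)
        {
            fs.Seek(0, SeekOrigin.Begin);
            while (fs.Read(buffer, 0, buffer.Length) > 0) { }
        }
    }
}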

@mangod9
Member

mangod9 commented Mar 26, 2021

@tmds @janvorli in case they have any thoughts on this.

@robertdavidsmith
Author

robertdavidsmith commented Mar 26, 2021

If you want to reproduce the crazy "performance doubles when the machine is low on memory" result, you can do the following.

This was tested on a Linux machine with 8 GB RAM and swap off; the memory settings will need to be different if you have a different amount of RAM.

Open four console windows and type the following:

Window 1 (GC tester; memorypressurerate is set very high so the limiting factor is dotnet performance, not this setting):

git clone https://github.com/robertdavidsmith/runtime_49058 # This must be the fork under robertdavidsmith, not marcin-krystianc
cd runtime_49058/GcTesting
docker build -t gctesting .
docker run -it --rm --env COMPlus_gcServer=0 --memory=5000mb --name=gctesting gctesting --memorypressurerate=100000mb --allocationunitsize=80Kb --minimummemoryusage=450mb --filepressuretask=false

Window 2 (Hammer the filesystem inside the same container):

docker exec -i -t gctesting dotnet GcTesting.dll --memorypressuretask=false --filepressuresize=6500mb

Window 3 (log stats every 5 s):

cd /sys/fs/cgroup/memory/docker
while [ 1 ]; do echo -------------------- && cat */memory.usage_in_bytes && cat */memory.stat | grep total_inactive_file || true && free -m && sleep 5; done

Window 4 (waste memory):

git clone https://github.com/robertdavidsmith/wastemem
cd wastemem
gcc -o wastemem wastemem.c
./wastemem 2048 100000000 3600

The GC tester in Window 1 logs various stats every second, including BlockAllocationsPerformed, which is the number of allocations performed in the last second. When memorypressurerate is large, BlockAllocationsPerformed is effectively a measure of performance.

When the memory waster isn't running, BlockAllocationsPerformed is around 3000 blocks per second.
When the memory waster is running, BlockAllocationsPerformed is around 7000 blocks per second.

Putting the machine under memory pressure more than doubles performance!

Looking at Window 3, you can see memory.usage_in_bytes for the container.

When wastemem isn't running, memory.usage_in_bytes equals the container memory limit (although a lot of this is only file cache; see total_inactive_file in Window 3).

When wastemem starts, the OS reclaims memory from the file cache, so the container is no longer at the memory limit. This more than doubles performance, as the GC collects less aggressively when it's not at the container limit. Also note that G2 collections per second roughly halve.
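
For completeness, wastemem just allocates a chunk of memory and keeps touching it so it stays resident. A rough managed equivalent (the real wastemem is a small C program; the argument handling here is simplified and only illustrative):

// Rough managed equivalent of the memory waster (a simplified sketch, not the
// real wastemem source). Allocate roughly N MB and keep touching every page so
// it stays resident; with swap off, the kernel has to reclaim page cache
// instead, which pulls the container back below its memory limit.
using System;
using System.Collections.Generic;

class WasteMem
{
    static void Main(string[] args)
    {
        int megabytes = args.Length > 0 ? int.Parse(args[0]) : 2048;

        var blocks = new List<byte[]>();
        for (int i = 0; i < megabytes; i++)
            blocks.Add(new byte[1024 * 1024]);

        while (true)
        {
            // Touch one byte per 4 KB page in every block.
            foreach (var block in blocks)
                for (int offset = 0; offset < block.Length; offset += 4096)
                    block[offset] = 1;
        }
    }
}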

@janvorli
Member

From the evidence provided above, it seems to me that we should really exclude the caches from the reported used memory.

@tmds
Member

tmds commented Apr 6, 2021

@janvorli maybe we can explore using the memory-usage events that come from the cgroup instead of calculating usage ourselves?

I can have a look into this. I'd start by looking at the cgroupv2 memory.events file (https://www.kernel.org/doc/Documentation/cgroup-v2.txt) and share my findings here.

@tmds
Member

tmds commented Apr 8, 2021

I can have a look into this. I'd start by looking at the cgroupv2 memory.events file (https://www.kernel.org/doc/Documentation/cgroup-v2.txt) and share my findings here.

I tried out some things, but it didn't work as I hoped. The memory.events file only generates an interesting event when someone sets the memory level in the memory.high file. By default it is at max, and the event happens at OOM, which is too late to trigger a GC.

From the evidence provided above, it seems to me that we should really exclude the caches from the reported used memory.

If we exclude the cache from usage, it probably makes sense to also exclude it from the max memory when we calculate load: load = (used - cache) / (max - cache).

For cgroup v2 we may be able to calculate something from memory.stat. A few of its fields relate to pages that are reclaimable under memory pressure. I'm not sure what we should add up and subtract.

@robertdavidsmith you may be able to work around the issue using fadvise with FADV_DONTNEED. Calling it once or twice before you close a file may get the file removed from the page cache. There is no managed API for it though, so you need to P/Invoke. The managed API lets you fadvise via FileOptions.RandomAccess and FileOptions.SequentialScan, but I don't think those affect the file cache.
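
Something along these lines should work (an untested sketch: the libc call is plain posix_fadvise, error handling is omitted, and flushing before the advise is my assumption about what is needed for dirty pages):

// Untested sketch of the P/Invoke workaround: tell the kernel the file's
// cached pages are no longer needed before closing it.
using System.IO;
using System.Runtime.InteropServices;

static class Fadvise
{
    private const int POSIX_FADV_DONTNEED = 4; // value on Linux

    [DllImport("libc")]
    private static extern int posix_fadvise(int fd, long offset, long length, int advice);

    // Drop the whole file from the page cache (offset 0, length 0 = entire file).
    public static void DropFromCache(FileStream fs)
    {
        fs.Flush(flushToDisk: true); // dirty pages can't be dropped until written back
        int fd = (int)fs.SafeFileHandle.DangerousGetHandle();
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }
}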

@robertdavidsmith
Author

robertdavidsmith commented Apr 9, 2021

Regarding cgroup v1 (I haven’t looked at v2):

I think a good starting point would be to use load = (used - total_inactive_file) / max, where used and max are the existing values and total_inactive_file comes from memory.stat.

I'm not sure whether total_active_file should be subtracted as well; it might be safer to just subtract total_inactive_file for now, as that's the smaller change. It's also what docker stats does (see https://docs.docker.com/engine/reference/commandline/stats/#extended-description).

If you don't want to read a third "file", note that the existing value of used (which comes from the "file" memory.usage_in_bytes) is roughly equal to total_cache + total_rss + total_swap from memory.stat. So an alternative way to calculate the memory load could be (total_cache + total_rss + total_swap - total_inactive_file) / max, with the first four values coming from memory.stat.
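
To make the first formula concrete, here is a rough illustration against the cgroup v1 files (the paths are the usual mount points seen inside a container; this is only a sketch, not proposed runtime code):

// Rough illustration of the proposed cgroup v1 calculation:
// load = (usage_in_bytes - total_inactive_file) / limit_in_bytes.
using System.IO;
using System.Linq;

static class CgroupV1MemoryLoad
{
    private const string Root = "/sys/fs/cgroup/memory";

    public static double Compute()
    {
        long usage = ReadLong(Path.Combine(Root, "memory.usage_in_bytes"));
        long limit = ReadLong(Path.Combine(Root, "memory.limit_in_bytes"));

        // memory.stat has one "name value" pair per line; total_inactive_file
        // is the inactive file-backed page cache the kernel can easily reclaim.
        long inactiveFile = long.Parse(
            File.ReadLines(Path.Combine(Root, "memory.stat"))
                .First(line => line.StartsWith("total_inactive_file "))
                .Split(' ')[1]);

        return (double)(usage - inactiveFile) / limit;
    }

    private static long ReadLong(string path) => long.Parse(File.ReadAllText(path).Trim());
}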

That FADV_DONTNEED trick would probably work in many cases, but our closed-source program uses the same set of files for most of the time it's running, so it isn't really suitable in our case.

Note bug report #50414, which suggests the GC is sometimes too relaxed. That bug may get worse if we relax the GC further.

@tmds
Member

tmds commented Apr 13, 2021

it might be safer to just subtract total_inactive_file for now

I can't say whether this only includes reclaimable pages, or whether unevictable pages may also be counted here.
To be safe, we could subtract unevictable from total_inactive_file.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 23, 2021
@marcin-krystianc
Contributor

@janvorli @robertdavidsmith @tmds I've opened a PR with my proposal for excluding the page cache from the memory load.

As far as I know, on Windows the page cache is not counted as used memory, but modified/dirty memory is. So I think we should use the value reported in usage_in_bytes as before and then subtract the clean pages of the page cache.

I wanted to avoid reading both the memory.usage_in_bytes and memory.stat files, but I don't think it is possible to calculate the value of usage_in_bytes from the values in memory.stat. For example, I tried total_cache + total_rss, but that is always lower than usage_in_bytes.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 6, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 5, 2021