High dhcpd memory usage #129

olljanat · 2022-03-08T12:49:25Z

We are seeing very high dhcpcd memory usage in our environment on multiple Burmilla nodes, as the monitoring screenshot attached to this issue shows.

Burmilla v1.9.3 uses dhcpcd v9.4.0, and a later version, 9.4.1, is available. The differences can be seen at NetworkConfiguration/dhcpcd@dhcpcd-9.4.0...dhcpcd-9.4.1. From a quick look, it seems the issue may already be fixed in NetworkConfiguration/dhcpcd@ba9f382.

olljanat · 2022-03-09T14:36:06Z

This should be solved in v1.9.4-rc1, but it needs more testing.

PrplHaz4 · 2022-03-09T14:44:24Z

I'm not seeing this dhcpcd issue so I don't think I could verify a fix.

As a sidebar - what are you using for host/process monitoring with Burmilla?

olljanat · 2022-03-09T14:55:00Z

> I'm not seeing this dhcpcd issue so I don't think I could verify a fix.

Yeah, that is the tricky part: we see it on multiple servers but not on all of them, so we need to run the new RC version for a couple of weeks on some of the problematic ones to be sure.

> As a sidebar - what are you using for host/process monitoring with Burmilla?

That picture is from Dynatrace, deployed as a container as described at https://www.dynatrace.com/support/help/setup-and-configuration/setup-on-container-platforms/docker/set-up-dynatrace-oneagent-as-docker-container#run-oneagent-as-a-docker-container

BurmillaOS is unsupported by Dynatrace, but it looks to be working fine.

PrplHaz4 · 2022-03-09T16:03:22Z

> > I'm not seeing this dhcpcd issue so I don't think I could verify a fix.
>
> Yeah, that is the tricky part: we see it on multiple servers but not on all of them, so we need to run the new RC version for a couple of weeks on some of the problematic ones to be sure.
>
> > As a sidebar - what are you using for host/process monitoring with Burmilla?
>
> That picture is from Dynatrace, deployed as a container as described at https://www.dynatrace.com/support/help/setup-and-configuration/setup-on-container-platforms/docker/set-up-dynatrace-oneagent-as-docker-container#run-oneagent-as-a-docker-container
>
> BurmillaOS is unsupported by Dynatrace, but it looks to be working fine.

Thanks - that looks similar to how the Elastic Beats and Telegraf agent containers work - wasn't sure if something like that should be running as a system service or if there was some better way to manage those super privileged containers.

olljanat · 2022-03-09T20:47:44Z

> Thanks - that looks similar to how the Elastic Beats and Telegraf agent containers work - wasn't sure if something like that should be running as a system service or if there was some better way to manage those super privileged containers.

In theory, the optimal solution would be to run them as system-docker containers, but since system-docker runs inside the initrd, none of the monitoring would work without heavy modifications.

Also, since we now use the Debian console, it is possible to install services inside it if needed. For example, iscsid actually needs to run there for those of us who need it.

By the way, I just found this issue, which might affect our new RC version: moby/moby#43262

olljanat · 2022-03-10T19:19:55Z

Cool. Both the new Docker v20.10.13 (which, based on the release notes, fixes at least some OOM issue) and the new LTS version 2022.02 of buildroot appear to have been released today, so I will prepare version 1.9.4 based on those.

olljanat · 2022-03-17T16:58:27Z

We are seeing more servers where this issue exists. Most probably it has something to do with dhcpcd log size, etc.

olljanat · 2022-03-22T11:26:13Z

I found out that the issue happens on servers where a lot of containers are coming and going. I used this Docker Stack on both v1.9.3 and 1.9.5-rc1:

version: "3.4"
services:
  alpine:
    image: alpine
    command: sleep 30s
    deploy:
      mode: replicated
      replicas: 10

Unfortunately, it looks like the issue still happens on 1.9.5-rc1 as well (maybe the situation is a little less bad, but still). However, one new thing I noticed is that with more aggressive settings, such as a 1s sleep and 100 replicas, dhcpcd also starts using a lot of CPU, so it is definitely also listening to DHCP requests from containers, which it shouldn't do.

So next I will try the configuration proposed at https://unix.stackexchange.com/a/634852.

olljanat · 2022-03-24T11:28:46Z

Extending the cloud-init config with this snippet (sudo ros config merge -i memlimit.yml) looks like a working workaround that can be deployed to all existing servers:

rancher:
  services:
    network:
      restart: always
      mem_limit: 20971520
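
For reference, the mem_limit value above is a byte count, and 20971520 corresponds to 20 MiB. The following is only a minimal Python sketch of that conversion; the helper name mib_to_bytes and the extra sizes in the loop are illustrative assumptions, not values taken from the Burmilla configuration.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to the plain byte count used for mem_limit."""
    return mib * 1024 * 1024


# 20 is the limit used in the workaround; 30 and 50 are arbitrary examples.
for size_mib in (20, 30, 50):
    print(f"{size_mib} MiB -> mem_limit: {mib_to_bytes(size_mib)}")
```

Picking a different limit is then a matter of replacing the number in memlimit.yml with the printed byte count.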

netsandbox · 2023-01-18T17:26:33Z

We have a hardware host with v1.9.5 where the network container constantly runs out of memory.
When the host is idle, the network container has a memory usage of 18MB.
I had to change the memory limit from 20MB to 30MB to stop the network container from restarting constantly.

In the network container I have already set denyinterfaces veth* eth1 eth2 eth3 in /etc/dhcpcd.conf,
to exclude the Docker interfaces and the unconnected hardware interfaces (we use only eth0),
but after a network container restart, the memory usage is still 18MB.
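
As a rough illustration of what that denyinterfaces line covers: the dhcpcd.conf manual describes the patterns as being passed to fnmatch(3), and Python's fnmatch module implements similar shell-style matching. The sketch below is an approximation under that assumption, and the interface names in the loop are hypothetical examples rather than names taken from this host.

```python
from fnmatch import fnmatchcase

# Patterns from the denyinterfaces line quoted above.
DENY_PATTERNS = ["veth*", "eth1", "eth2", "eth3"]


def is_denied(ifname: str, patterns=DENY_PATTERNS) -> bool:
    """Return True if the interface name matches any deny pattern."""
    return any(fnmatchcase(ifname, pattern) for pattern in patterns)


# Hypothetical interface names, for illustration only.
for name in ["eth0", "eth1", "veth1a2b3c", "docker0"]:
    print(f"{name}: {'denied' if is_denied(name) else 'allowed'}")
```

Under this approximation a bridge named docker0 would not be matched by the list, so it may be worth checking whether other Docker-related interfaces on the host need their own pattern.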

Anything I can do to debug this?
The container logs don't show any helpful messages.
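
One low-effort way to collect data would be to sample the resident memory of the dhcpcd process directly from /proc over time. This is only a sketch: it assumes Python 3 is available somewhere that can see the dhcpcd process in /proc (which may not be the case inside the network container), and the function names are made up for this example.

```python
import re
from pathlib import Path


def parse_vmrss_kib(status_text: str):
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def find_pids(name: str, proc_root: Path = Path("/proc")):
    """Return PIDs whose /proc/<pid>/comm equals the given process name."""
    pids = []
    for comm in proc_root.glob("[0-9]*/comm"):
        try:
            if comm.read_text().strip() == name:
                pids.append(int(comm.parent.name))
        except OSError:
            continue  # process exited or is not readable
    return sorted(pids)


def sample_rss(name: str = "dhcpcd", proc_root: Path = Path("/proc")):
    """Map each matching PID to its current VmRSS in KiB."""
    result = {}
    for pid in find_pids(name, proc_root):
        try:
            text = (proc_root / str(pid) / "status").read_text()
        except OSError:
            continue
        rss = parse_vmrss_kib(text)
        if rss is not None:
            result[pid] = rss
    return result


if __name__ == "__main__":
    for pid, rss_kib in sample_rss().items():
        print(f"dhcpcd pid {pid}: {rss_kib} KiB resident")
```

Running it periodically (for example from cron) and comparing the numbers before and after bursts of container starts would show whether the process itself is growing.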

olljanat · 2023-01-18T17:42:57Z

@netsandbox how long does the network container stay running when the memory limit is 30 MB?
20 MB was just a randomly selected number, so it might be too tight a limit.

> Anything I can do to debug this?

Not easily. However, I see that there are quite a few commits in dhcpcd after the 9.4.1 release (NetworkConfiguration/dhcpcd@dhcpcd-9.4.1...master), and at least two of those refer to a memory leak.

We get dhcpcd from buildroot: https://github.com/buildroot/buildroot/blob/e644e5df39c4d63ce7ae28ce2d02bfbf2a230cff/package/dhcpcd/dhcpcd.mk#L7

So we should probably try building dhcpcd from the latest version in their repo, and if that appears to fix the issue, ask them to release a new version and get it updated in buildroot.

netsandbox · 2023-01-19T11:51:24Z

When I had a look at the host this morning, I saw that there were still network container restarts in the middle of the night.
So I have now increased the memory limit from 30MB to 50MB.

We plan to upgrade the host from v1.9.5 to v1.9.6 tomorrow. Both versions still use the same dhcpcd version, but maybe the memory problem is related to a kernel library that is used for our network interfaces.
I will keep an eye on the memory usage after the upgrade and then report back here.

olljanat · 2023-01-22T15:19:38Z

I think this is actually the same bug as NetworkConfiguration/dhcpcd#157, which is already fixed, and the plan appears to be that a new dhcpcd version will be released after NetworkConfiguration/dhcpcd#149 is fixed.

However, the os-base build tooling made by Rancher appears to support patches, so I managed to build a new version of dhcpcd that includes that single patch with https://github.com/burmilla/os-base/blob/c810a8a2c1818ed36bfe4e8b625c3ad7d497026d/patches/dhcpcd-9.4.1-with-405507a.patch

That is now included in the just-released v2.0.0-beta6.

Additionally, you can update the network container in an existing v1.9.6 installation by running these commands:

sudo system-docker pull burmilla/os-base:v1.9.6-dhcpcd-patched1
sudo ros config set rancher.services.network.image burmilla/os-base:v1.9.6-dhcpcd-patched1

and rebooting. But take a backup/snapshot of the server first, and make sure that the image was pulled successfully before running the second command. Otherwise the console will not appear on the next boot at all.

netsandbox · 2023-02-03T07:46:34Z

After setting the network container memory limit to 50MB, we have seen no container restarts in the last 2 weeks.
I saw that you increased the limit for v1.9.7-rc1 to 100MB, which looks reasonable. Thanks!

Regarding the network container memory usage increase: in the last 2 weeks the usage increased on one day from 27.24MiB to 27.31MiB and has stayed stable at that value since. So from here I don't see anything that looks like a memory leak.
But I have to admit that I don't know how many container starts and stops happened during this time, because we currently have no monitoring for this in place.
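
A lightweight way to fill that gap could be to count container start and die events from the Docker event stream. The sketch below assumes the JSON lines produced by docker events --format '{{json .}}', which carry Type and Action fields; the function name and the sample lines are hypothetical and only illustrate the counting.

```python
import json
from collections import Counter


def count_container_actions(lines, actions=("start", "die")):
    """Count container events by action from JSON-formatted event lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event line
        if event.get("Type") == "container" and event.get("Action") in actions:
            counts[event["Action"]] += 1
    return counts


# Hypothetical sample lines, shaped like the JSON event output.
sample_lines = [
    '{"Type": "container", "Action": "start", "time": 1675400000}',
    '{"Type": "container", "Action": "die", "time": 1675400030}',
    '{"Type": "network", "Action": "connect", "time": 1675400031}',
    '{"Type": "container", "Action": "start", "time": 1675400040}',
]
print(dict(count_container_actions(sample_lines)))
```

In practice the lines could come from something like docker events --since 24h --until 0s --filter type=container --format '{{json .}}', captured to a file and passed to the function line by line.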

olljanat self-assigned this Mar 8, 2022

olljanat added the bug and version/v1.9.x labels Mar 8, 2022

olljanat mentioned this issue Mar 9, 2022

Publish v1.9.4-rc1 #130

Merged

olljanat closed this as completed Mar 10, 2022

olljanat reopened this Mar 10, 2022

olljanat mentioned this issue Mar 11, 2022

Release v1.9.4 #131

Merged

olljanat mentioned this issue Jul 25, 2022

2.0.0-beta update #138

Merged

olljanat mentioned this issue Sep 12, 2022

Publish v1.9.5 #141

Merged

olljanat closed this as completed Sep 23, 2022

olljanat reopened this Jan 18, 2023

netsandbox mentioned this issue Jan 20, 2023

System-docker version number is incompatible with cAdvisor #151

Closed

olljanat mentioned this issue Jan 31, 2023

Publish v1.9.7-rc1 #152

Merged

olljanat closed this as completed Aug 20, 2023
