High dhcpd memory usage #129

Closed
olljanat opened this issue Mar 8, 2022 · 14 comments
Labels: bug, version/v1.9.x

Comments

@olljanat
Member

olljanat commented Mar 8, 2022

We are seeing very high dhcpcd memory usage in our environment on multiple Burmilla nodes:
[image: dhcpcd memory usage graph]

Burmilla v1.9.3 uses dhcpcd v9.4.0, and a later version, 9.4.1, is available. The differences can be seen in NetworkConfiguration/dhcpcd@dhcpcd-9.4.0...dhcpcd-9.4.1. From a quick look, it seems the issue may already be fixed in NetworkConfiguration/dhcpcd@ba9f382

olljanat self-assigned this Mar 8, 2022
olljanat added the bug and version/v1.9.x labels Mar 8, 2022
@olljanat
Member Author

olljanat commented Mar 9, 2022

This should be solved in v1.9.4-rc1, but it needs more testing.

@PrplHaz4

PrplHaz4 commented Mar 9, 2022

I'm not seeing this dhcpd issue so I don't think I could verify a fix.

As a sidebar - what are you using for host/process monitoring with Burmilla?

@olljanat
Member Author

olljanat commented Mar 9, 2022

I'm not seeing this dhcpd issue so I don't think I could verify a fix.

Yes, that is the tricky part: we see it on multiple servers but not on all of them, so we need to run the new RC version for a couple of weeks on some of the problematic ones to be sure.

As a sidebar - what are you using for host/process monitoring with Burmilla?

That picture is from Dynatrace, deployed as a container as described at https://www.dynatrace.com/support/help/setup-and-configuration/setup-on-container-platforms/docker/set-up-dynatrace-oneagent-as-docker-container#run-oneagent-as-a-docker-container

BurmillaOS is not supported by Dynatrace, but it appears to work fine.

@PrplHaz4

PrplHaz4 commented Mar 9, 2022

Thanks - that looks similar to how the Elastic Beats and Telegraf agent containers work - wasn't sure if something like that should be running as a system service or if there was some better way to manage those super privileged containers.

@olljanat
Member Author

olljanat commented Mar 9, 2022

Thanks - that looks similar to how the Elastic Beats and Telegraf agent containers work - wasn't sure if something like that should be running as a system service or if there was some better way to manage those super privileged containers.

In theory, the optimal solution would be to run them as system-docker containers, but since system-docker runs inside the initrd, monitoring would not work without heavy modifications.

Also, since we now use the Debian console, it is possible to install services inside it if needed. For example, iscsid actually has to run for those of us who need it.


By the way, I just found this issue, which might affect our new RC version: moby/moby#43262

@olljanat
Member Author

olljanat commented Mar 10, 2022

Cool. Both the new Docker v20.10.13 (which, based on the release notes, fixes at least one OOM issue) and the new LTS version 2022.02 of buildroot appear to have been released today, so I will prepare version 1.9.4 based on those.

olljanat reopened this Mar 10, 2022
@olljanat
Member Author

We are seeing more servers where this issue exists. Most probably it has something to do with dhcpcd log size, etc.

@olljanat
Member Author

I found out that the issue happens on servers where a lot of containers are coming and going. I used this Docker Stack on both v1.9.3 and 1.9.5-rc1:

version: "3.4"
services:
  alpine:
    image: alpine
    command: sleep 30s
    deploy:
      mode: replicated
      replicas: 10

Unfortunately, it looks like the issue still happens on 1.9.5-rc1 as well (maybe the situation is a little less bad, but still). However, a new thing I noticed is that with more aggressive settings, such as a 1s sleep and 100 replicas, dhcpcd also starts using a lot of CPU, so it is definitely also listening to DHCP requests from containers, which it shouldn't do.

So next I will try the configuration proposed at https://unix.stackexchange.com/a/634852.

@olljanat
Member Author

Extending the cloud-init config with this one (sudo ros config merge -i memlimit.yml) looks like a working workaround that can be deployed to all existing servers:

rancher:
  services:
    network:
      restart: always
      mem_limit: 20971520
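
For reference, the mem_limit value above is a byte count: 20 MiB is 20 * 1024 * 1024 = 20971520. A minimal Python sketch for generating the same fragment with a different limit follows. The helper names mib_to_bytes and render_memlimit_yaml are made up for illustration and are not part of ros or BurmillaOS.

```python
# Sketch only: helper names are hypothetical, not part of ros/BurmillaOS.
MIB = 1024 * 1024  # bytes per MiB


def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to the byte count used by mem_limit."""
    if mib <= 0:
        raise ValueError("memory limit must be positive")
    return mib * MIB


def render_memlimit_yaml(mib: int) -> str:
    """Render the same cloud-init fragment as above for a limit given in MiB."""
    return (
        "rancher:\n"
        "  services:\n"
        "    network:\n"
        "      restart: always\n"
        f"      mem_limit: {mib_to_bytes(mib)}\n"
    )


if __name__ == "__main__":
    # Prints the fragment with the 20 MiB limit used in this comment.
    print(render_memlimit_yaml(20), end="")
```

The output could be saved as memlimit.yml and merged with the ros command shown above; pass a larger number if 20 MiB turns out to be too tight.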

@netsandbox

We have a hardware host with v1.9.5 where the network container permanently runs out of memory.
If the host is idle, the network container has a memory usage of 18MB.
I had to raise the memory limit from 20MB to 30MB to stop the network container from restarting constantly.

I have already set denyinterfaces veth* eth1 eth2 eth3 in /etc/dhcpcd.conf in the network container,
to exclude the Docker interfaces and the unconnected hardware interfaces (we only use eth0),
but after a network container restart, the memory usage is still 18MB.
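
As a rough check of which interface names that denyinterfaces line covers: dhcpcd documents the patterns as being passed to fnmatch(3), and Python's fnmatch module behaves closely enough for simple globs like these. The sketch below is an approximation under that assumption, and the sample interface names (veth1a2b3c, docker0) are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the denyinterfaces line above.
DENY_PATTERNS = "veth* eth1 eth2 eth3".split()


def is_denied(ifname: str, patterns=DENY_PATTERNS) -> bool:
    """Return True if the interface name matches any deny pattern."""
    return any(fnmatchcase(ifname, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical interface names, for illustration only.
    for name in ["eth0", "eth1", "veth1a2b3c", "docker0"]:
        print(name, "denied" if is_denied(name) else "allowed")
```

With these patterns, eth0 is still allowed as intended, and so is a bridge named docker0, since nothing in the list matches it.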

Anything I can do to debug this?
The container logs don't show any helpful messages.

olljanat reopened this Jan 18, 2023
@olljanat
Member Author

@netsandbox how long does the network container stay running when the memory limit is 30 MB?
20 MB was just a randomly selected number, so it might be too tight a limit.

Anything I can do to debug this?

Not easily. However, I see that there are quite a few commits in dhcpcd after the 9.4.1 release (NetworkConfiguration/dhcpcd@dhcpcd-9.4.1...master), and at least two of those refer to a memory leak.

We get dhcpcd from buildroot https://github.com/buildroot/buildroot/blob/e644e5df39c4d63ce7ae28ce2d02bfbf2a230cff/package/dhcpcd/dhcpcd.mk#L7

So we should probably try building dhcpcd from the latest version in their repo, and if that appears to fix the issue, ask them to release a new version and get it updated in buildroot.

@netsandbox

When I looked at the host this morning, I saw that there were still network container restarts in the middle of the night.
So I have now increased the memory limit from 30MB to 50MB.

We plan to upgrade the host from v1.9.5 to v1.9.6 tomorrow. Both versions still use the same dhcpcd version, but maybe the memory problem is related to a kernel library that is used for our network interfaces.
I will keep an eye on the memory usage after the upgrade and then report back here.

@olljanat
Member Author

olljanat commented Jan 22, 2023

I think this is actually the same bug as NetworkConfiguration/dhcpcd#157, which is already fixed, and the plan seems to be that a new dhcpcd version will be released after NetworkConfiguration/dhcpcd#149 is fixed.

However, the os-base build tooling made by Rancher appears to support patches, so I managed to build a new version of dhcpcd that includes that single patch with https://github.com/burmilla/os-base/blob/c810a8a2c1818ed36bfe4e8b625c3ad7d497026d/patches/dhcpcd-9.4.1-with-405507a.patch

That is now included in the just-released v2.0.0-beta6.

Additionally, you can update the network container on an existing v1.9.6 installation by running these commands:

sudo system-docker pull burmilla/os-base:v1.9.6-dhcpcd-patched1
sudo ros config set rancher.services.network.image burmilla/os-base:v1.9.6-dhcpcd-patched1

and rebooting. But take a backup/snapshot of the server first, and make sure that the image was pulled successfully before running the second command. Otherwise the console will not appear on the next boot at all.

@netsandbox

After setting the network container memory limit to 50MB, we have seen no container restarts in the last 2 weeks.
I saw that you increased the limit for v1.9.7-rc1 to 100MB, which looks reasonable. Thanks!

Regarding the network container memory usage increase: in the last 2 weeks, the usage increased on one day from 27.24MiB to 27.31MiB and has stayed stable at that value since. So from here I don't see anything that looks like a memory leak.
But I have to admit that I don't know how many container starts and stops happened during this time, because we currently have no monitoring for this in place.
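
One possible way to get a rough count of container starts and stops is to tally the JSON lines produced by docker events --format '{{json .}}'. The Python sketch below assumes each event line carries "Type" and "Action" fields and treats the "die" action as a stop. The function name count_container_events is made up for illustration, and the sample events are fabricated placeholders, not data from this host.

```python
import json
from collections import Counter
from typing import Iterable


def count_container_events(lines: Iterable[str]) -> Counter:
    """Count container 'start' and 'die' actions in docker events JSON lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(event, dict) or event.get("Type") != "container":
            continue  # ignore network, volume, image events, etc.
        action = event.get("Action")
        if action in ("start", "die"):
            counts[action] += 1
    return counts


if __name__ == "__main__":
    # Placeholder events for illustration; real input would be the lines
    # emitted by: docker events --format '{{json .}}'
    sample = [
        '{"Type": "container", "Action": "start"}',
        '{"Type": "network", "Action": "connect"}',
        '{"Type": "container", "Action": "die"}',
        '{"Type": "container", "Action": "start"}',
    ]
    counts = count_container_events(sample)
    print(f"starts={counts['start']} stops={counts['die']}")
```

Comparing such counts against the network container's memory graph over the same period would show whether the usage tracks container churn.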
