Dataplane part of forwarder-vpp leaks #1120

Open
denis-tingaikin opened this issue May 28, 2024 · 33 comments

@denis-tingaikin
Member


forwarder-vpp_goroutineprofiles_20240527074108.tar.gz
forwarder-vpp_memprofiles_20240527074153.tar.gz

TODO

  • Find potential problem places
  • Create an issue for each problem
  • Estimate each problem
  • Provide a list of issues that can be delivered by 7 Jun
  • Provide fixes for the issues from the previous point
@denis-tingaikin
Member Author

denis-tingaikin commented May 30, 2024

logs from fwd.tar.gz

Provided by @szvincze

@NikitaSkrynnik NikitaSkrynnik moved this from Todo to In Progress in Release v1.13.1 May 30, 2024
@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Jun 3, 2024

  • Add CloseOnError chain elements to forwarder and nsmgr - 3h
  • Test forwarder with the new elements - 3h
  • Change timeout in forwarder - 3h
  • Test forwarder with the new timeout - 3h
  • Check if all chain elements clean up after themselves on errors from next - 3h

@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Jun 4, 2024

Current state

It looks like VPP reuses interface indices, which somehow affects the deletion of tap interfaces on Closes. As a result, NSM reports no created tap interfaces, but VPP still has them.

Next steps

  • Implement manual management of interface indices - 3h
  • Test manual management of interface indices - 3h
  • Test without deleting files - 7h
  • Investigate what causes the problem: govpp or vpp - 12h
    • Find places where to add logs - 2h
    • Add logs to vpp - 3h
    • Try to reproduce the problem - 7h
  • Find out whether govpp fails to delete some interfaces - 3h
  • Test with Closes - 7h

Total: 35h
The estimate may increase.

@denis-tingaikin
Member Author

/cc @szvincze

@denis-tingaikin denis-tingaikin moved this from In Progress to Moved to next release in Release v1.13.1 Jun 9, 2024
@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.13.2 Jun 9, 2024
@NikitaSkrynnik NikitaSkrynnik moved this to In Progress in Release v1.14.0 Jun 18, 2024
@NikitaSkrynnik
Contributor

Current status

forwarder-vpp has two controlplane problems and one dataplane problem.

Controlplane problems:

  1. timeout chain element doesn't call Close and doesn't delete vpp interfaces
  2. Even when forwarder-vpp calls Close, some vxlan interfaces are still not deleted
    Issue: Controlplane part of forwarder-vpp leaks #1129

Dataplane problems:

  1. When forwarder-vpp calls Close, some tap interfaces are not deleted and are only switched to state: down. After investigation I found out that the problem is inside VPP.
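
A quick way to spot such leftover interfaces from outside the pods is a sketch like the following (the nsm-system namespace and the app=forwarder-vpp label are assumptions about the deployment, not verified names):

# list VPP interfaces whose state is "down" in every forwarder
# (local0 is always down and can be ignored)
for pod in $(kubectl -n nsm-system get pods -l app=forwarder-vpp -o name); do
  echo "=== $pod ==="
  kubectl -n nsm-system exec "$pod" -- vppctl show interface | awk '$3 == "down"'
done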

@denis-tingaikin denis-tingaikin moved this from In Progress to Done in Release v1.14.0 Jun 26, 2024
@elajkat
Contributor

elajkat commented Jul 10, 2024

Hi @NikitaSkrynnik, could you please help our investigation with the dataplane problems you mentioned above and attributed to VPP? I loaded the VPP (v23.10-rc0 local build) API with vxlan tunnel creates, tap creates, and ACL xConnect creates in batch for ~24 hours and was not able to reproduce such an issue. I tried to use the same Go API calls that are used by cmd-forwarder-vpp.
If you have any suggestions on how to reproduce the issue, or what direction could be followed, please don't hesitate to add them here. Thanks in advance.

@NikitaSkrynnik
Contributor

Hi, @elajkat! Unfortunately we didn't have enough time to reproduce the problem on pure VPP. You can try to reproduce it using NSM and then check the VPP that runs inside a forwarder. Here are the steps:

  1. Deploy a basic NSM setup
  2. Start scaling clients and endpoints (should be repeated at least 10 times)
    2.1. Scale the clients and endpoints up to 20 pods each
    2.2. Wait for 40 seconds
    2.3. Scale the clients and endpoints down to 0
    2.4. Wait for 20 seconds
    2.5. Scale to 20 again
    2.6. ...

After repeated scaling up/down I can observe several tap interfaces with state: DOWN in VPP inside the forwarder. The forwarder issues a request for deletion through govpp, but the interfaces are still there in state: DOWN.
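
A minimal sketch of the scaling loop described above (the namespace and deployment names are assumptions based on the basic NSM examples, not the exact commands we run):

NS=ns-kernel2ethernet2kernel
for i in $(seq 10); do
  # scale clients and endpoints up to 20 pods each
  kubectl -n "$NS" scale deployment nsc --replicas=20
  kubectl -n "$NS" scale deployment nse-kernel --replicas=20
  sleep 40
  # scale everything back down to 0
  kubectl -n "$NS" scale deployment nsc --replicas=0
  kubectl -n "$NS" scale deployment nse-kernel --replicas=0
  sleep 20
done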

@elajkat
Contributor

elajkat commented Jul 12, 2024

Hi, with the suggested scale-up/scale-down process I got some hanging interfaces, but those are in the up state, e.g.:

=== pod/forwarder-vpp-vpf55 ===
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count     
host-eth0                         1      up     1500/1500/1500/1500 rx packets                671059
                                                                    rx bytes               335612875
                                                                    tx packets                 14067
                                                                    tx bytes                 1526406
                                                                    drops                     657122
                                                                    ip4                       342905
                                                                    ip6                           79
local0                            0     down          0/0/0/0       
tap25                             42     up     1446/1446/1446/1446 rx packets                    12
                                                                    rx bytes                     936
                                                                    drops                         12
                                                                    ip6                           12
vxlan_tunnel14                    41     up     1446/1446/1446/1446 rx packets                    55
                                                                    rx bytes                    3362
                                                                    tx packets                    50
                                                                    tx bytes                    4860
                                                                    drops                          5

Next week I'll check this in detail with the VPP API trace, forwarder logs, etc., and try to reproduce it without the forwarder, using pure VPP and API calls.

@edwarnicke
Member

You may find the vpp api trace helpful for capturing the sequence of API calls that leads to the issue for handoff to VPP folks for reproduction.

@rubasov

rubasov commented Jul 17, 2024

I also played around with a slightly modified reproduction. I was able to reliably reproduce the leftover tun/vxlan_tunnel interfaces. At this time I only focused on identifying the suspected vpp bug. However, I found no indication in vpp's api trace that we have a vpp bug here. Here's what I did and saw.

The environment:

  • kind with 4 workers (but the repro seems to work with 2 workers too)
  • spire single cluster
  • nsm basic
  • registry-k8s: 2 replicas
  • nsm kernel2ethernet2kernel, slightly modified: we have wrapped the alpine pod in a deployment, so we can scale it

The reproduction loop:

  • while not having a leftover tap and/or vxlan_tunnel interface
  • in each forwarder: start vpp api tracing with an empty buffer (so that at the end we shall see only the api trace of the last up-and-down scale)
  • scale up: nse-kernel, alpine
  • scale down (to 0 replicas): alpine, nse-kernel
  • after all scale operations completed:
  • in each forwarder: check for leftover tap and/or vxlan_tunnel interfaces
  • if we have some:
  • print affected pods and interfaces
  • grep vpp api trace dumps for relevant tap_delete messages
  • grep and count vpp api trace dumps for vxlan_tunnel create/delete messages and see if we have fewer deletes than creates

In the above repro steps it slightly bothered me (due to my weak kubernetes-fu) that we do not wait for the completion of a scale operation (other than sleeping for a hardcoded period) before we start a new scale operation. Therefore I replaced the sleeps with kubectl wait commands (and a final sleep just to make sure) as you can see below:

kubectl -n ns-kernel2ethernet2kernel scale deployment nse-kernel --replicas=5
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for=condition=Available=True deployment.apps/nse-kernel

kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=10
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for=condition=Available=True deployment.apps/alpine

kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=0
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for delete pod --selector=app=alpine

kubectl -n ns-kernel2ethernet2kernel scale deployment nse-kernel --replicas=0
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for delete pod --selector=app=nse-kernel

sleep 5

(Note: please correct me if you think these kubectl wait commands are incorrect.)

With this method I could reliably produce leftover interfaces. Usually already in the 1st or 2nd iteration of the reproduction loop we have some. Always the same number of taps as vxlan_tunnels. Most of the time with the same suffix (like tap3 and vxlan_tunnel3), but sometimes with different suffixes (like tap0 and vxlan_tunnel2). These interfaces are always up.

Out of 20 runs leaving interfaces around, I have seen 20 in which the vpp api trace contains no indication that anybody ever tried to delete the leftover interfaces:

For tap interfaces I could grep for the tap_delete message with a concrete sw_if_index, e.g.:

$ kubectl exec forwarder-vpp-r2mz8 -n nsm-system -- vppctl api trace dump | egrep --no-group-separator -A1 tap_delete | egrep -w 'sw_if_index: 2'

For vxlan_tunnel interfaces unfortunately the vpp api trace contains only identifiers that are harder to pair with sw_if_index, therefore I only counted the number of vxlan_tunnel creates and deletes - looking for having fewer deletes than creates:

$ kubectl exec forwarder-vpp-2jfxd -n nsm-system -- vppctl api trace dump | agrep -d ^vl_api vxlan_add_del_tunnel | egrep is_add | sort -r | uniq -c

In all 20 test runs, the difference between creates and deletes matched the number of vxlan_tunnel interfaces left around.

In summary I could not confirm the presence of a vpp bug - or at least my reproduction method does not reproduce it.

By the way, the fact that the leftover interfaces are always present in tap+vxlan_tunnel pairs also makes me think the cause is more likely to be in nsm than in vpp.

If anybody is interested I'm happy to share the scripts I used. Let me know if you have other ideas to refine the reproduction steps.

Let me know what you think!

@elajkat
Contributor

elajkat commented Jul 17, 2024

Thanks @rubasov.
I actually used a similar setup to @rubasov's, and I am able to run the same with some well-placed kubectl waits and an extra sleep 10 after scale-up (alpine 40, nse 5) and after scale-down (alpine 0, nse 0), and I have no leftover interfaces.

@denis-tingaikin
Member Author

denis-tingaikin commented Jul 19, 2024

Hello @rubasov, thank you for testing it.

We're planning to continue working on this issue soon. As I recall, the main issue shows up only as leaked vpp interfaces with state up. At this point, it would be helpful if you shared the scripts so we could verify the steps to reproduce.

UPD: meant state: DOWN

rubasov pushed a commit to rubasov/deployments-k8s that referenced this issue Jul 22, 2024
Modify the kernel2ethernet2kernel case:

* Remove the antiaffinity, so we'll have k2e2k mixed with k2k.
* Wrap the alpine pod in a deployment, so we can scale it.
* Extend NSM_CIDR_PREFIX, so we have enough IPs to scale.

Signed-off-by: Bence Romsics <bence.romsics@gmail.com>
@rubasov

rubasov commented Jul 23, 2024

Hi @denis-tingaikin,

I made some modifications and now I believe I have a better reproduction - hopefully reproducing every bug symptom Nikita described. I'm attaching the scripts I used.

My earlier approach (as in my previous comment) always stopped after the first leftover interface was detected (no matter if it was up or down). However, these interfaces were always up and after a few minutes they got deleted, so it's not likely they could contribute to a long-term memory leak. The current approach does not stop at the first leftover interface, but continues scaling the nsc deployment up and down. Usually after many more iterations (up to 100, sometimes 150) this approach produces leftover interfaces both up and down (many up, a few down), out of which a few remain indefinitely (the down interfaces always remain).

With this complex reproduction I'm far from having a proper understanding of what's happening, however the following factors seem to be relevant:

  • the first bug of leaving many up interfaces around
  • the later cleanup process of these interfaces overlapping with the load caused by the still continuously ongoing scaling
  • errors in the forwarder logs like the following (their frequency: in a handful of iterations out of 100; see the grep sketch after this list):
    vppapi AddNodeNext returned error: context canceled
    vppapi TapCreateV3 returned error: context canceled
    vppapi SwInterfaceTagAddDel returned error: context canceled
    When these are present the corresponding vpp api trace does not show these calls and there's nothing relevant in vppctl show logging.
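
A sketch of how these errors can be collected across the forwarders (the namespace and pod label are assumptions about the deployment):

for pod in $(kubectl -n nsm-system get pods -l app=forwarder-vpp -o name); do
  echo "=== $pod ==="
  # match lines such as "vppapi TapCreateV3 returned error: context canceled"
  kubectl -n nsm-system logs "$pod" | egrep 'vppapi .* returned error: context canceled'
done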

Regarding the scripts. Please read them before running.

First run:

./setup-1.sh ; ./setup-2.sh ; ./scale.sh 3

setup-1.sh creates a kind cluster. I used it without having other kind clusters around.
scale.sh's parameter is the iteration count. When no parameter is passed it continues until there's a leftover interface in down.
setup-2.sh depends on rubasov/deployments-k8s@54f2d2d

Later runs:

./teardown-2.sh ; ./setup-2.sh ; ./scale.sh | tee scale.sh.log

nsm-scale-iface-leftover.tar.gz

@NikitaSkrynnik NikitaSkrynnik changed the title Investigate forwarder memory leaks Dataplane part of forwarder-vpp leaks Jul 23, 2024
@NikitaSkrynnik
Contributor

Hi @rubasov, @elajkat. I collected the traces for the issue with state:DOWN interfaces. Also added a small description: traces.zip. It took five up/down scalings to get two tap interfaces with state: DOWN.

Here are the scripts I use: iface-leaks.zip. Just run ./full-setup.sh and then ./scale.sh

@NikitaSkrynnik NikitaSkrynnik moved this from Done to In Progress in Release v1.14.0 Jul 23, 2024
@Ex4amp1e Ex4amp1e moved this from In Progress to Done in Release v1.14.0 Jul 26, 2024
@Ex4amp1e Ex4amp1e moved this from Done to Under review in Release v1.14.0 Jul 26, 2024
@rubasov

rubasov commented Aug 13, 2024

Hi @denis-tingaikin, @NikitaSkrynnik,

While @ljkiraly is out of office we're trying to track the status of this interface leak. We have seen networkservicemesh/sdk#1650 merged. The pull request description says it should fix all interface leaks. However Nikita's comment in networkservicemesh/sdk#1649 reads to me as if the merged patch was only part of a planned fix. We also built a new forwarder with the sdk patch merged and tested it with the reproduction method we discussed earlier. Unfortunately we still see leftover interfaces.

Can you please help us understand what exactly the merged patch fixes? Did we maybe miss another patch we should have used for a full fix?

@NikitaSkrynnik
Contributor

Hi, @rubasov. We found out that we need two more patches to fix all interface leaks:

  1. Use a context with extended timeout on Requests in begin sdk#1656
  2. Use an extended timeout in case of reselect requests sdk#1655

@denis-tingaikin
Member Author

@rubasov Many thanks for testing it. Currently, we are also looking for other solutions; it would be perfect if you could add these two patches #1120 (comment) (note: this requires the SDK with commit 301631365421a9d916a5ab7224fa9fbd847320b2) and test them.

@rubasov

rubasov commented Aug 15, 2024

Hi @NikitaSkrynnik, @denis-tingaikin,

I did some more testing and wanted to get back to you with the results:

I started with sdk commit 78bedfb4 (which already includes commit 30163136 "Add a timeout for Closes in begin.Server (#1650)").
Then cherry-picked all commits of networkservicemesh/sdk#1656 and networkservicemesh/sdk#1655 on top.

I had to resolve one merge conflict between https://github.com/networkservicemesh/sdk/pull/1656/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R45 and https://github.com/networkservicemesh/sdk/pull/1655/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R46.

Then also noticed that the s/closeTimeout/contextTimeout/ rename had to be applied here too:
https://github.com/networkservicemesh/sdk/pull/1655/files#diff-d03d72993c11636268d2d3264a65f9888845d7e1f052b40a1aa1c2c503b10814R91

Using this sdk I built a new forwarder and ran the usual reproduction steps. Unfortunately I still saw leftover interfaces as before.
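
For reference, a rough sketch of the kind of workflow described above (the paths, branch names, and checkout layout are assumptions, not the exact commands used):

# apply the two open SDK PRs on top of the chosen base commit
git clone https://github.com/networkservicemesh/sdk
cd sdk
git checkout 78bedfb4
git fetch origin pull/1656/head:pr-1656 pull/1655/head:pr-1655
git cherry-pick 78bedfb4..pr-1656
git cherry-pick 78bedfb4..pr-1655   # resolve the conflicts mentioned above

# point the forwarder at the patched sdk and rebuild
cd ../cmd-forwarder-vpp
go mod edit -replace github.com/networkservicemesh/sdk=../sdk
go mod tidy
# then rebuild the forwarder image as usual (the local ../sdk checkout must be
# visible to the image build for the replace directive to work)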

However, seeing these conflicts between PRs 1656 and 1655, I'm wondering if I'm really testing with the same code as you are. It feels like I'm still missing something.

Let me know what you think.

@NikitaSkrynnik
Contributor

@rubasov Could you try this forwarder image: nikitaxored/cmd-forwarder-vpp:no-leaks? It contains all the patches. Could you also send the script you use to reproduce the problem?

@rubasov

rubasov commented Aug 21, 2024

Hi @NikitaSkrynnik,

Thanks for the image! I started the repro script (which is still basically the same as I posted earlier) using your forwarder image. It seems to be working. Earlier, without the fixes, I believe the maximum iteration count I ever saw needed to leave an interface in down was between 160 and 170. With the fixes included, my repro is at iteration 670 and there is still no interface left in down. So I believe this at least significantly reduced the frequency, or hopefully fully fixed it.

I'm interested in how this image was built differently from my last attempt. If you have the time, please share!

Thanks,
Bence

@NikitaSkrynnik
Contributor

@rubasov, we merged all the fixes into sdk main two days ago; you can check them there. I just used them to build the forwarder image.
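
A sketch of how such a build can be reproduced against the latest sdk main (the image tag is just an example, not the actual one used):

cd cmd-forwarder-vpp
go get github.com/networkservicemesh/sdk@main
go mod tidy
docker build -t example/cmd-forwarder-vpp:sdk-main .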

@Ex4amp1e Ex4amp1e moved this from Blocked to In Progress in Release v1.14.0 Sep 2, 2024
@rubasov

rubasov commented Sep 2, 2024

Hi @NikitaSkrynnik,

I wanted to get back after we have done some longer test runs, taking the main branch with all your fixes merged. We still occasionally see some leftover interfaces, but it usually takes 500-1000 nsc scale-up/down iterations (with the usual repro scripts) until the error happens once. I believe this is 1-1.5 orders of magnitude better than without the fixes. At the moment we hope this is good enough even for long-lived deployments. Of course, if we learn anything new, we'll get back to you.

Thank you for your help!
Bence

@Ex4amp1e Ex4amp1e moved this from In Progress to Todo in Release v1.14.0 Sep 3, 2024
@Ex4amp1e Ex4amp1e moved this from Todo to In Progress in Release v1.14.0 Sep 4, 2024
@denis-tingaikin
Member Author

Statistics from rc2.

Interfaces:
2024-09-05 14:13:03 - Node: control-plane-control-seliics06748-n1 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:04 - Node: control-plane-control-seliics06749-n2 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:05 - Node: control-plane-control-seliics06750-n3 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:07 - Node: pool1-ceph-seliics06744-n7 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:08 - Node: pool1-ceph-seliics06745-n4 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:09 - Node: pool1-ceph-seliics06746-n5 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:10 - Node: pool1-ceph-seliics06747-n6 | number of ^tap interface: 3 | number of ^vxlan_tunnel interface: 0 |

@NikitaSkrynnik NikitaSkrynnik moved this from In Progress to Blocked in Release v1.14.0 Sep 11, 2024
@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Sep 17, 2024

We plan to revert the begin chain element networkservicemesh/sdk#1668 when Netlink is released: vishvananda/netlink#1019

@denis-tingaikin denis-tingaikin moved this from Blocked to Moved to next release in Release v1.14.0 Sep 24, 2024
@rubasov

rubasov commented Oct 4, 2024

Several weeks back, while working on the interface leak problem, I also wanted to understand the link between the interface leak and the memory leak. Yesterday, talking to Laszlo and Szilard, we realized I never shared the results of this. This has practically nothing to do with nsm itself, however it could still be useful in understanding how the interface leak caused the memory leak, so I summarize it here.

The first step was to realize that the main problem was the leftover tap interfaces. Knowing this, I no longer needed the complicated and long-running reproduction scripts we used in other parts of the analysis. Instead I could just create tap interfaces in bulk and observe the memory usage. I created a test vm (under libvirt/kvm) and installed vpp into it.

# install vpp 24.06 from https://packagecloud.io/fdio/2406
$ sudo apt install vpp vpp-plugin-core

# disable the service that comes from the package
$ sudo systemctl stop vpp
$ sudo systemctl disable vpp
$ sudo mkdir /var/log/vpp/

# do everything as root
$ sudo -s

# start vpp manually instead
$ i=0 ; pkill -f vpp$i ; vpp "unix { cli-listen /run/vpp/cli-vpp$i.sock cli-prompt vpp$i# log /var/log/vpp/vpp$i.log exec /home/vagrant/vpp$i.vpp } api-trace { on } api-segment { prefix vpp$i } statseg { socket-name /run/vpp/statseg-vpp$i.sock }"

# verify that vpp works
$ vppctl -s /run/vpp/cli-vpp$i.sock show version
vpp v24.06.0-2~gcd30ea1dd~b19 built by root on bdc0da124535 at 2024-07-31T17:57:13

# collect memory usage statistics while creating 100 tap interfaces
vpp_pid=$( pgrep vpp )
kill $vpp_pid
while pgrep vpp >/dev/null ; do sleep 1 ; done

i=0 ; vpp "unix { cli-listen /run/vpp/cli-vpp$i.sock cli-prompt vpp$i# log /var/log/vpp/vpp$i.log exec /home/vagrant/vpp$i.vpp } api-trace { on } api-segment { prefix vpp$i } statseg { socket-name /run/vpp/statseg-vpp$i.sock }"
sleep 1
vpp_pid=$( pgrep vpp )

(
echo 0 $( vppctl -s /run/vpp/cli-vpp$i.sock show memory main-heap | egrep -o "used: [0-9]+.[0-9]+." | cut -d ' ' -f2 ) $( ps -o rss=,vsz= $vpp_pid ) $( free -k | awk '/^Mem:/ { print $3 }' )
for j in $( seq 100 )
do
    echo >&2 $j
    vppctl -s /run/vpp/cli-vpp$i.sock create tap >/dev/null
    echo $j $( vppctl -s /run/vpp/cli-vpp$i.sock show memory main-heap | egrep -o "used: [0-9]+.[0-9]+." | cut -d ' ' -f2 ) $( ps -o rss=,vsz= $vpp_pid ) $( free -k | awk '/^Mem:/ { print $3 }' )
done
) > vpp-taps-mem-usage.log # fields in this file: number-of-taps vppctl-show-memory-main-heap-used ps-rss ps-vsz free-k-used

$ head -1 vpp-taps-mem-usage.log ; tail -1 vpp-taps-mem-usage.log
0 56.88M 95180 18012616 2842848
100 59.65M 98480 18012616 3512548

We can see that both the vpp self-reported memory usage and the kernel's report about the vpp process itself show about 30 KiB of memory usage per tap interface - which is negligibly low. At this rate we would never run out of today's typical container memory limits (usually hundreds of MiBs of free memory). However, the overall used memory in this vm grows much faster. To show the typical values and the variance, here are the results of 5 runs (overall used memory per tap):

6697 KiB/tap
6763 KiB/tap
6840 KiB/tap
6897 KiB/tap
6880 KiB/tap

This is more than 200 times what the vpp process itself uses per tap interface. Where does this memory go? I believe this discrepancy was also observed in a k8s environment between the cgroup-reported overall memory usage and the vpp process's memory usage. In this simpler environment it's already clear that the extra memory must be used by the Linux kernel. Let's find some positive proof of that!

We'll look into /proc/slabinfo where the kernel tracks several kinds of internal allocation statistics (https://man7.org/linux/man-pages/man5/slabinfo.5.html).

# The same statistics collection as before with an additional column where we track the overall memory usage of the kmalloc-8k slab in KiBs.
vpp_pid=$( pgrep vpp )
kill $vpp_pid
while pgrep vpp >/dev/null ; do sleep 1 ; done

i=0 ; vpp "unix { cli-listen /run/vpp/cli-vpp$i.sock cli-prompt vpp$i# log /var/log/vpp/vpp$i.log exec /home/vagrant/vpp$i.vpp } api-trace { on } api-segment { prefix vpp$i } statseg { socket-name /run/vpp/statseg-vpp$i.sock }"
sleep 1
vpp_pid=$( pgrep vpp )

(
echo 0 $( vppctl -s /run/vpp/cli-vpp$i.sock show memory main-heap | egrep -o "used: [0-9]+.[0-9]+." | cut -d ' ' -f2 ) $( ps -o rss=,vsz= $vpp_pid ) $( free -k | awk '/^Mem:/ { print $3 }' ) $( cat /proc/slabinfo | egrep ^kmalloc-8k | awk '{ print $2 * $4 / 1024 }' )
for j in $( seq 100 )
do
    echo >&2 $j
    vppctl -s /run/vpp/cli-vpp$i.sock create tap >/dev/null
    echo $j $( vppctl -s /run/vpp/cli-vpp$i.sock show memory main-heap | egrep -o "used: [0-9]+.[0-9]+." | cut -d ' ' -f2 ) $( ps -o rss=,vsz= $vpp_pid ) $( free -k | awk '/^Mem:/ { print $3 }' ) $( cat /proc/slabinfo | egrep ^kmalloc-8k | awk '{ print $2 * $4 / 1024 }' )
done
) > vpp-taps-mem-usage.log # fields in this file: number-of-taps vppctl-show-memory-main-heap-used ps-rss ps-vsz free-k-used slabinfo-kmalloc-8k-total-k

$ head -1 vpp-taps-mem-usage.log ; tail -1 vpp-taps-mem-usage.log
0 56.88M 95324 18012616 2804360 1360
100 59.65M 98624 18012616 3482088 614208

$ bc -ql
(3482088-2804360)/100
6777.28000000000000000000
(614208-1360)/100
6128.48000000000000000000
(614208-1360)/(3482088-2804360)
.90426837905472401907

That means we have already found 90% of the to-be-explained memory usage in the allocation statistics of the kmalloc-8k slab. At this point I was convinced I had positive proof that the extra memory is used by the kernel and did not look into the changes of other slab types (like kmallocs of different sizes).

But why does a tap interface use this much memory? Probably because vpp creates multiqueue tap interfaces (I believe for performance reasons on multicore CPUs) and all the allocated buffers are multiplied by the number of queues. The relevant code is here:

https://github.com/FDio/vpp/blob/6e8b350a0154caee6e9f037ae4938b83fb5dcf20/src/vnet/devices/tap/tap.c#L129-L737
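
The multiqueue nature is easy to confirm on the Linux side; for example (the interface name is a placeholder), each queue pair shows up under sysfs:

# count the rx/tx queue pairs of a tap created by vpp
ls /sys/class/net/tap0/queues/
# rx-0 rx-1 ... tx-0 tx-1 ...   one rx-N/tx-N pair per queue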

I briefly looked into creating the same tap interface setup with the ip and tc tools. I started off like this:

modprobe sch_multiq
ip tuntap add dev tap$i mode tap multi_queue vnet_hdr
tc qdisc add dev tap$i root multiq

However now I believe that not all tunables can be set via these tools, so I could not create the exact same thing this way. If somebody is interested, the tap creation code could be extracted from vpp.

Please also note that this memory could appear as if it were used by vpp - because the tap interfaces are nonpersistent, which means that if the vpp process dies then the Linux kernel automatically destroys all tap interfaces requested by vpp. That's why I did not need to delete the taps in the above examples; I could just kill vpp instead.

@NikitaSkrynnik
Contributor

It seems like vpp memory consumption grows constantly. Here is the graph of main-heap usage after 160h of testing:
[chart: vpp main-heap memory usage over 160h of testing]

We decided to disable binary API tracing. A new test has been running for 10h already, and we don't see any memory consumption growth in vpp.

@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.15.0 Nov 21, 2024
@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Nov 26, 2024

We can use the nitems parameter in the vpp config to limit the number of binary API trace entries. After this limit is reached, vpp memory usage stops increasing (see the startup config sketch after the table). Here is the table of the forwarders' memory consumption during the test:

test duration   forwarder-vpp-9x6f6 (vpp memory usage)   forwarder-vpp-zdfgr (vpp memory usage)
10m             385Mi (158.96M)                          357Mi (157.83M)
160m            384Mi (159.39M)                          359Mi (158.21M)
20h             386Mi (159.39M)                          367Mi (158.45M)
26h             391Mi (159.39M)                          367Mi (158.45M)
3d19h           389Mi (159.39M)                          368Mi (158.45M)
4d2h            394Mi (159.39M)                          367Mi (158.45M)
5d2h            388Mi (159.39M)                          367Mi (158.45M)
7d2h            392Mi (159.39M)                          367Mi (158.45M)
11d             390Mi (159.28M)                          367Mi (158.45M)
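
For reference, a sketch of how the trace buffer can be bounded in the VPP startup configuration (the file path is the usual VPP default and may differ in the forwarder image; 4096 is the value used in the test above):

# /etc/vpp/startup.conf (excerpt)
api-trace {
  on
  nitems 4096
}

# the same limit can also be applied at runtime from the VPP CLI:
vppctl api trace on nitems 4096
vppctl api trace status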

@elajkat
Contributor

elajkat commented Nov 26, 2024

Just wanted to note here that we started the same measurements with a limited nitems, so a big +1 for it.
The default value for nitems is 256k; which value did you use for your tests?

@elajkat
Contributor

elajkat commented Nov 26, 2024

In my local env I started a test with vpp 24.10 and nitems=1000 (small enough to reach saturation quickly); my env is in kind, with 2 forwarders.

2024-11-26 14:28:04.439240278+01:00
RX trace: used 197 of 1000 items, is enabled, has not wrapped
Thread 0 vpp_main
base 0x759c6ea01000, size 1g, locked, unmap-on-destroy, name 'main heap'
page stats: page-size 4K, total 262144, mapped 41107, not-mapped 221037
numa 0: 41107 pages, 160.57m bytes
total: 1023.99M, used: 156.67M, free: 867.33M, trimmable: 866.43M
RX trace: used 233 of 1000 items, is enabled, has not wrapped
Thread 0 vpp_main
base 0x73b549e01000, size 1g, locked, unmap-on-destroy, name 'main heap'
page stats: page-size 4K, total 262144, mapped 41107, not-mapped 221037
numa 0: 41107 pages, 160.57m bytes
total: 1023.99M, used: 156.72M, free: 867.28M, trimmable: 866.43M

.....

2024-11-26 16:21:11.986159040+01:00
RX trace: used 1000 of 1000 items, is enabled, has wrapped
Thread 0 vpp_main
base 0x759c6ea01000, size 1g, locked, unmap-on-destroy, name 'main heap'
page stats: page-size 4K, total 262144, mapped 41485, not-mapped 220659
numa 0: 41485 pages, 162.05m bytes
total: 1023.99M, used: 158.15M, free: 865.85M, trimmable: 864.96M
RX trace: used 1000 of 1000 items, is enabled, has wrapped
Thread 0 vpp_main
base 0x72245e201000, size 1g, locked, unmap-on-destroy, name 'main heap'
page stats: page-size 4K, total 262144, mapped 42016, not-mapped 220128
numa 0: 42016 pages, 164.13m bytes
total: 1023.99M, used: 159.64M, free: 864.36M, trimmable: 862.88M

Sorry, I have no time to parse the raw outputs into something easier to consume, or into a chart.
I'll leave this test running for a few more hours, but it is already visible that the memory consumption has reached a level and stabilized around it.

@NikitaSkrynnik
Contributor

I used the value 4096.

@elajkat
Contributor

elajkat commented Nov 27, 2024

I believe that setting the default value of nitems to some reasonably low number is perfect; the admins/operators can change the value at any time in the vpp CLI:
vpp-1# api trace on nitems 10000
vpp-1# api trace status
RX trace: used 0 of 10000 items, is enabled, has not wrapped

The value must be documented, with some guidance for the admins on how to change it.

@NikitaSkrynnik
Contributor

Hi, @szvincze! We don't see any memory leaks on our side. Could you tell us about your test results? Is everything stable on your side? Can we close this issue?

@elajkat
Contributor

elajkat commented Nov 29, 2024

Hi @NikitaSkrynnik, we will have a weekend test that also covers this issue (with a decreased nitems value for VPP); after that we will have results from our labs. Thanks!
