[T2][202405] Zebra process consuming a large amount of memory resulting in OOM kernel panics #20337

Closed
arista-nwolfe opened this issue Sep 23, 2024 · 11 comments
Labels: Chassis 🤖 (Modular chassis support), P0 (Priority of the issue), Triaged (this issue has been triaged)

arista-nwolfe (Contributor) commented Sep 23, 2024

On full T2 devices running 202405, Arista is seeing the zebra process in FRR consume a large amount of memory (roughly 10x what it uses on 202205).

202405:

root@cmp206-4:~# docker exec -it bgp0 bash
root@cmp206-4:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2  38116 32024 pts/0    Ss+  21:23   0:01 /usr/bin/python3 /usr/local/bin/supervisord
root          44  0.1  0.2 131684 31888 pts/0    Sl   21:23   0:06 python3 /usr/bin/supervisor-proc-exit-listener --container-name bgp
root          47  0.0  0.0 230080  4164 pts/0    Sl   21:23   0:00 /usr/sbin/rsyslogd -n -iNONE
frr           51 27.5  8.1 2018736 1283692 pts/0 Sl   21:23  16:57 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp

202205:

root@cmp210-3:~# docker exec -it bgp bash
root@cmp210-3:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.1  30524 26232 pts/0    Ss+  21:59   0:00 /usr/bin/python3 /usr/local/bin/supervisord
root          26  0.0  0.1  30808 25712 pts/0    S    21:59   0:00 python3 /usr/bin/supervisor-proc-exit-listener --container-name bgp
root          27  0.0  0.0 220836  3764 pts/0    Sl   21:59   0:00 /usr/sbin/rsyslogd -n -iNONE
frr           31  9.7  0.7 730360 128852 pts/0   Sl   21:59   2:32 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M fpm -M snmp

This leaves the system with very little free memory:

> free -m
               total        used        free      shared  buff/cache   available
Mem:           15388       15304         158         284         481          83

If we then run a command that causes zebra to consume even more memory, such as show ip route, it can trigger a kernel panic due to OOM:

[74531.234009] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled
[74531.260707] CPU: 1 PID: 735 Comm: auditd Kdump: loaded Tainted: G           OE      6.1.0-11-2-amd64 #1  Debian 6.1.38-4
[74531.313431] Call Trace:
[74531.365891]  <TASK>
[74531.418342]  dump_stack_lvl+0x44/0x5c
[74531.470844]  panic+0x118/0x2ed
[74531.523334]  out_of_memory.cold+0x67/0x7e
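
The panic string above indicates the kernel is configured to panic rather than invoke the OOM killer. A quick way to confirm that setting on a box (a generic Linux check, not specific to this issue):

sysctl vm.panic_on_oom    # 0 = run the OOM killer, 1 = panic on system-wide OOM, 2 = panic on any OOM (including memcg/cpuset)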

When we look at show memory in FRR, we see the Nexthop Max# is significantly higher on 202405 than on 202205.
202405:

show memory
Memory statistics for zebra:
  Total heap allocated:  > 2GB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :     1669    160      280536  8113264 1363218720    # ASIC0
Nexthop                       :     1535    160      258120  2097270 352476288     # ASIC1

202205:

show memory
Memory statistics for zebra:
  Total heap allocated:  72 MiB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :     1173    152      178312    36591   5563080
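
(On a multi-ASIC chassis these counters come from a separate zebra instance per ASIC, so they can be pulled per ASIC with vtysh, as in the Cisco output later in this thread. The hostname and ASIC indexes below are just examples:)

root@device:~# vtysh -n 0 -c 'show memory' | grep -i nexthop
root@device:~# vtysh -n 1 -c 'show memory' | grep -i nexthop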

NOTES:
- Both 202205 and 202405 have the same number of routes installed.
- We have also seen an increase on t2-min topologies, but the absolute memory usage there is roughly half (or less) of what T2 is seeing, so we aren't hitting OOMs on t2-min.
- The FRR version changed between releases: 202205 = FRRouting 8.2.2, 202405 = FRRouting 8.5.4.
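
(The FRR version on a given image can be confirmed from inside the BGP container; the container name bgp0 below is just an example:)

docker exec -it bgp0 vtysh -c 'show version'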

kenneth-arista (Contributor) commented:

@arlakshm @wenyiz2021

arlakshm (Contributor) commented Sep 23, 2024

@bingwang-ms bingwang-ms added Chassis 🤖 Modular chassis support Triaged this issue has been triaged labels Sep 25, 2024
@arlakshm arlakshm added the P0 Priority of the issue label Sep 25, 2024
@arlakshm arlakshm self-assigned this Sep 25, 2024
anamehra (Contributor) commented Oct 2, 2024

This is the output from Cisco Chassis LC0:

Nexthop                       :     1274    160      214128  1352174 227269168
Nexthop Group                 :        0     32           0        2        80
Nexthop Group Entry           :      440    144       71056     6096    934928
Nexthop Group Connected       :      645     40       25848     1037     41528
Nexthop Group Context         :        0   2104           0        1      2104
Nexthop tracking object       :      173    248       42920      173     42920
Nexthop                       :      159    160       26888      167     28232
Static Nexthop tracking data  :      140     88       12544      140     12544
Static Nexthop                :      141    224       32872      141     32872
root@sfd-t2-lc0:/home/cisco# vtysh  -n 1 -c 'show memory'| grep Next
Nexthop                       :     1270    160      213424  1453138 244261040
Nexthop Group                 :        0     32           0        2        80
Nexthop Group Entry           :      458    144       69824    18917   2878184
Nexthop Group Connected       :      697     40       27880     1437     57480
Nexthop Group Context         :        0   2104           0        1      2104
Nexthop tracking object       :      177    248       43928      177     43928
Nexthop                       :      163    160       27560      173     29240
Static Nexthop tracking data  :      140     88       12752      140     12752
Static Nexthop                :      141    224       33016      141     33016
root@sfd-t2-lc0:/home/cisco# vtysh  -n 0 -c 'show memory'| grep Next
Nexthop                       :     1270    160      213392   189032  31782240
Nexthop Group                 :        0     32           0        2        80
Nexthop Group Entry           :      440    144       69232     2890    439440
Nexthop Group Connected       :      641     40       25672     1191     47672
Nexthop Group Context         :        0   2104           0        1      2104
Nexthop tracking object       :      169    248       41928      169     41928
Nexthop                       :      155    160       26088      161     27096
Static Nexthop tracking data  :      140     88       12688      140     12688
Static Nexthop                :      141    224       32824      141     32824
root@sfd-t2-lc0:/home/cisco# docker exec -it bgp2 ps aux | grep frr
frr           52  0.2  2.2 1418524 718064 pts/0  Sl   05:00   1:46 /usr/lib/frr/
frr           71  0.0  0.0  44380 14460 pts/0    S    05:00   0:00 /usr/lib/frr/
frr           72  0.2  1.1 664720 354504 pts/0   Sl   05:00   1:50 /usr/lib/frr/
root@sfd-t2-lc0:/home/cisco# docker exec -it bgp1 ps aux | grep frr
frr           52  1.6  2.8 1612844 912296 pts/0  Sl   05:00  12:19 /usr/lib/frr/
frr           67  0.0  0.0  44384 14420 pts/0    S    05:00   0:01 /usr/lib/frr/
frr           68  0.2  1.2 703736 395336 pts/0   Sl   05:00   2:12 /usr/lib/frr/
root@sfd-t2-lc0:/home/cisco# docker exec -it bgp0 ps aux | grep frr
frr           58  1.9  4.0 2000064 1298412 pts/0 Sl   05:00  14:38 /usr/lib/frr/
frr           75  0.0  0.0  44380 14460 pts/0    S    05:00   0:00 /usr/lib/frr/
frr           76  0.2  0.9 619544 314072 pts/0   Sl   05:00   1:39 /usr/lib/frr/

There are some high Max# Nexthop entries for ASIC1. We did not see the OOM issue though. Will check the 202305 image for comparison.

arista-nwolfe (Contributor, Author) commented:

Thanks for the output @anamehra. The zebra/FRR memory usage looks comparable to what we're seeing on Arista, specifically the RSS in the ps aux output:
Cisco

root@sfd-t2-lc0:/home/cisco# docker exec -it bgp0 ps aux | grep frr
frr           58  1.9  4.0 2000064 1298412 pts/0 Sl   05:00  14:38 /usr/lib/frr/

Arista

root@cmp206-4:~# docker exec -it bgp0 bash
root@cmp206-4:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
frr           51 27.5  8.1 2018736 1283692 pts/0 Sl   21:23  16:57 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp

I'm guessing the difference in %MEM is due to total memory differences between the two devices.
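
A rough check using only the numbers above: the Arista zebra RSS of 1283692 kB is about 1254 MiB, and 1254 / 15388 ≈ 8.1%, which matches the reported %MEM; the Cisco zebra RSS of 1298412 kB at 4.0 %MEM would imply roughly 31 GB of total memory. So the %MEM gap is consistent with a total-memory difference rather than a difference in zebra's actual footprint (the ~31 GB figure is only an inference from these outputs).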

arista-nwolfe (Contributor, Author) commented Oct 9, 2024

We tried patching #19717 into 202405 and we saw that the amount of memory Zebra used was significantly reduced:

root@cmp210-3:~# show ip route summary
:
Route Source         Routes               FIB  (vrf default)
kernel               26                   26
connected            28                   28
ebgp                 50841                50841
ibgp                 435                  435
------
Totals               51330                51330

root@cmp210-3:~# ps aux | grep -i zebra
300        37412 23.8  1.4 930336 226108 pts/0   Sl   Oct08   7:21 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp
root       37424  0.0  0.0  96744  7696 ?        Sl   Oct08   0:00 /usr/bin/rsyslog_plugin -r /etc/rsyslog.d/zebra_regex.json -m sonic-events-bgp
root       54932  0.0  0.0   6972  2044 pts/0    S+   00:20   0:00 grep -i zebra

cmp210-3# show memory
Memory statistics for zebra:
System allocator statistics:
  Total heap allocated:  156 MiB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :      675    160      113512   446268  74995760

This explains why master doesn't see high zebra memory usage, since #19717 is currently only present in master.
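
(To keep an eye on whether the reduction holds over time, zebra's RSS and the qmem counters can be sampled periodically. This is just a sketch; the container name, ASIC index, and interval are arbitrary:)

while true; do
    date
    docker exec bgp0 ps aux | grep zebra                       # RSS column is in kB
    vtysh -n 0 -c 'show memory' | grep -E 'Total heap|Nexthop'
    sleep 300
done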

arlakshm (Contributor) commented Oct 9, 2024

@lguohan, @StormLiangMS, @dgsudharsan for visibility.

rawal01 commented Oct 9, 2024

Output from the Nokia LC:
202405:
docker exec -it bgp0 ps aux | grep frr
frr 53 0.5 3.6 1916528 1211680 pts/0 Sl Oct08 6:43 /usr/lib/frr/

docker exec -it bgp1 ps aux | grep frr
frr 53 0.7 3.5 1876120 1171496 pts/0 Sl Oct08 8:47 /usr/lib/frr/

The issue exists on master too, but the usage seems lower than on 202405.
master:
bgp0:
frr 54 6.9 1.2 1117048 415040 pts/0 Sl 14:21 5:18 /usr/lib/frr/
bgp1:
frr 54 34.4 1.5 1275912 521904 pts/0 Sl 14:21 4:21 /usr/lib/frr/

arlakshm (Contributor) commented:

Attaching the full output of show memory
frr_show_mem_output.txt

anamehra (Contributor) commented Oct 21, 2024

Hi @abdosi, do we plan to pick #19717 for 202405? Based on the above comments, it looks like this is needed for the zebra memory consumption issue on 202405. Thanks.

anamehra (Contributor) commented:

> Hi @abdosi , do we plan to pick #19717 for 202405? Based on above comments, looks like this is needed for the zebra memory consumption issue on 202405. Thanks

Please ignore, already merged in 202405.

arlakshm (Contributor) commented:

#19717 merged to 202405.
