Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bgpmon] bgpmon error flooding in syslog #5858

Closed
lolyu opened this issue Nov 8, 2020 · 16 comments · Fixed by #5859
Closed

[bgpmon] bgpmon error flooding in syslog #5858

lolyu opened this issue Nov 8, 2020 · 16 comments · Fixed by #5859
Assignees
Labels
Bug 🐛 Master Branch Quality P1 Priority of the issue, lower than P0

Comments

@lolyu
Copy link
Contributor

lolyu commented Nov 8, 2020

Description

bgpmon floods syslog with error messages like below:

Nov  8 03:30:20.736704 str-dx010-acs-4 INFO bgp#/bgpmon: bgpmon service started
Nov  8 03:30:24.209370 str-dx010-acs-4 INFO bgp#supervisord 2020-11-08 03:30:19,574 ERRO pool dependent-startup event buffer overflowed, discarding event 3948
Nov  8 03:30:24.209370 str-dx010-acs-4 INFO bgp#supervisord 2020-11-08 03:30:19,575 INFO exited: bgpmon (exit status 1; not expected)
Nov  8 03:30:24.209370 str-dx010-acs-4 INFO bgp#supervisord 2020-11-08 03:30:20,576 ERRO pool dependent-startup event buffer overflowed, discarding event 3949
Nov  8 03:30:24.209370 str-dx010-acs-4 INFO bgp#supervisord 2020-11-08 03:30:20,578 INFO spawned: 'bgpmon' with pid 4285
Nov  8 03:30:24.209370 str-dx010-acs-4 INFO bgp#supervisord 2020-11-08 03:30:20,581 ERRO pool dependent-startup event buffer overflowed, discarding event 3950
Nov  8 03:30:24.209370 str-dx010-acs-4 INFO bgp#supervisord 2020-11-08 03:30:20,581 INFO success: bgpmon entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov  8 03:30:34.799388 str-dx010-acs-4 ERR monit[479]: 'lldp|lldpmgrd' status failed (1) -- 'python /usr/bin/lldpmgrd' is not running.
Nov  8 03:30:34.965631 str-dx010-acs-4 ERR monit[479]: 'bgp|bgpcfgd' status failed (1) -- '/usr/bin/python /usr/local/bin/bgpcfgd' is not running.
Nov  8 03:30:35.012222 str-dx010-acs-4 ERR monit[479]: 'bgp|bgpmon' status failed (1) -- '/usr/bin/python /usr/local/bin/bgpmon' is not running.
Nov  8 03:30:36.058032 str-dx010-acs-4 INFO bgp#supervisord: bgpmon Traceback (most recent call last):
Nov  8 03:30:36.058032 str-dx010-acs-4 INFO bgp#supervisord: bgpmon   File "/usr/local/bin/bgpmon", line 8, in <module>
Nov  8 03:30:36.058149 str-dx010-acs-4 INFO bgp#supervisord: bgpmon     sys.exit(main())
Nov  8 03:30:36.058149 str-dx010-acs-4 INFO bgp#supervisord: bgpmon   File "/usr/local/lib/python3.7/dist-packages/bgpmon/bgpmon.py", line 166, in main
Nov  8 03:30:36.058173 str-dx010-acs-4 INFO bgp#supervisord: bgpmon     bgp_state_get.get_all_neigh_states()
Nov  8 03:30:36.058191 str-dx010-acs-4 INFO bgp#supervisord: bgpmon   File "/usr/local/lib/python3.7/dist-packages/bgpmon/bgpmon.py", line 88, in get_all_neigh_states
Nov  8 03:30:36.058191 str-dx010-acs-4 INFO bgp#supervisord: bgpmon     self.update_new_peer_states(value)
Nov  8 03:30:36.058256 str-dx010-acs-4 INFO bgp#supervisord: bgpmon   File "/usr/local/lib/python3.7/dist-packages/bgpmon/bgpmon.py", line 72, in update_new_peer_states
Nov  8 03:30:36.058256 str-dx010-acs-4 INFO bgp#supervisord: bgpmon     self.new_peer_state[peer_l[i]] = peer_dict["peers"][peer_l[i]]["state"]
Nov  8 03:30:36.058256 str-dx010-acs-4 INFO bgp#supervisord: bgpmon TypeError: 'dict_keys' object is not subscriptable

Steps to reproduce the issue:
1.
2.
3.

Describe the results you received:

  1. bgpmon fails to populate neighbors info in state DB:
admin@str-dx010-acs-4:~$ show ip bg sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 12580
RIB entries 12807, using 2356488 bytes of memory
Peers 4, using 83680 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  64600       3253       5565         0      0       0  00:02:30             6400  ARISTA01T1
10.0.0.59      4  64600       3252       5565         0      0       0  00:02:30             6400  ARISTA02T1
10.0.0.61      4  64600       3252       3255         0      0       0  00:02:26             6400  ARISTA03T1
10.0.0.63      4  64600       3253       3255         0      0       0  00:02:26             6400  ARISTA04T1

Total number of neighbors 4
admin@str-dx010-acs-4:~$ redis-cli -n 6
127.0.0.1:6379[6]> keys "NEIGH*"
(empty array)
127.0.0.1:6379[6]>

Describe the results you expected:

admin@str-dx010-acs-4:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2
RIB entries 3, using 552 bytes of memory
Peers 4, using 83680 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  64600          0          0         0      0       0  never      Active          ARISTA01T1
10.0.0.59      4  64600          0          0         0      0       0  never      Active          ARISTA02T1
10.0.0.61      4  64600          0          0         0      0       0  never      Connect         ARISTA03T1
10.0.0.63      4  64600          0          0         0      0       0  never      Idle            ARISTA04T1

Total number of neighbors 4
admin@str-dx010-acs-4:~$
admin@str-dx010-acs-4:~$ redis-cli -n 6
127.0.0.1:6379[6]> keys "NEIGH*"
1) "NEIGH_STATE_TABLE|fc00::7e"
2) "NEIGH_STATE_TABLE|10.0.0.63"
3) "NEIGH_STATE_TABLE|fc00::7a"
4) "NEIGH_STATE_TABLE|10.0.0.61"
5) "NEIGH_STATE_TABLE|fc00::72"
6) "NEIGH_STATE_TABLE|fc00::76"
7) "NEIGH_STATE_TABLE|10.0.0.57"
8) "NEIGH_STATE_TABLE|10.0.0.59"

Additional information you deem important (e.g. issue happens only occasionally):

admin@str-dx010-acs-4:~$ show version

SONiC Software Version: SONiC.master.476-13ff7b38
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: 13ff7b38
Build date: Fri Nov  6 11:08:10 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-cel_seastone-r0
HwSKU: Celestica-DX010-C32
ASIC: broadcom
Serial Number: DX010F2B118711MS100007
Uptime: 05:09:13 up  9:56,  1 user,  load average: 1.91, 4.34, 4.20

Docker images:
REPOSITORY                    TAG                   IMAGE ID            SIZE
docker-snmp                   latest                bb7ea60ff4d1        458MB
docker-snmp                   master.476-13ff7b38   bb7ea60ff4d1        458MB
docker-teamd                  latest                9e97d785f8fc        455MB
docker-teamd                  master.476-13ff7b38   9e97d785f8fc        455MB
docker-sonic-mgmt-framework   latest                6e28326754d4        583MB
docker-sonic-mgmt-framework   master.476-13ff7b38   6e28326754d4        583MB
docker-router-advertiser      latest                dffdec9f4a6b        422MB
docker-router-advertiser      master.476-13ff7b38   dffdec9f4a6b        422MB
docker-platform-monitor       latest                97d1d3581779        533MB
docker-platform-monitor       master.476-13ff7b38   97d1d3581779        533MB
docker-lldp                   latest                16cab9e37ed8        486MB
docker-lldp                   master.476-13ff7b38   16cab9e37ed8        486MB
docker-dhcp-relay             latest                747c69fc025c        429MB
docker-dhcp-relay             master.476-13ff7b38   747c69fc025c        429MB
docker-database               latest                7bbd2aec4541        422MB
docker-database               master.476-13ff7b38   7bbd2aec4541        422MB
docker-orchagent              latest                dfff38c3ec6e        469MB
docker-orchagent              master.476-13ff7b38   dfff38c3ec6e        469MB
docker-nat                    latest                cd65e3c2f8aa        458MB
docker-nat                    master.476-13ff7b38   cd65e3c2f8aa        458MB
docker-sonic-telemetry        latest                bb411052c9a2        491MB
docker-sonic-telemetry        master.476-13ff7b38   bb411052c9a2        491MB
docker-fpm-frr                latest                0cad74eaccda        471MB
docker-fpm-frr                master.476-13ff7b38   0cad74eaccda        471MB
docker-sflow                  latest                475e4ec4689d        459MB
docker-sflow                  master.476-13ff7b38   475e4ec4689d        459MB
docker-syncd-brcm             latest                fecb5fd7f54a        527MB
docker-syncd-brcm             master.476-13ff7b38   fecb5fd7f54a        527MB    ```
@lguohan
Copy link
Collaborator

lguohan commented Nov 8, 2020

do you have details?

@lolyu lolyu changed the title [bgpmon] bgpmon error docker ps [bgpmon] bgpmon error flooding in syslog Nov 8, 2020
@lolyu
Copy link
Contributor Author

lolyu commented Nov 8, 2020

do you have details?

Yes, updated, just fat-fingered to create before filling the details.

@lguohan
Copy link
Collaborator

lguohan commented Nov 8, 2020

is this catch by log analyzer? or a specific test for bgpmon?

@lguohan lguohan added the P0 Priority of the issue label Nov 8, 2020
@lolyu
Copy link
Contributor Author

lolyu commented Nov 8, 2020

is this catch by log analyzer? or a specific test for bgpmon?

Yes, it is caught by log analyzer in my debugging of pfcwd for master branch.

@lguohan
Copy link
Collaborator

lguohan commented Nov 8, 2020

@gechiang , have you added the bgpmon test? if not, please add then we won't have such regression.

@lolyu
Copy link
Contributor Author

lolyu commented Nov 8, 2020

@lguohan, I have a quick fix for this, would you mind taking a look?

@gechiang
Copy link
Collaborator

gechiang commented Nov 8, 2020

@lolyu Did someone manually edited bgpmon.py in this DUT?
I logged in to this DUT and found that this file had a very new date and 2 line of code got changed that was causing some indentation issues. Did someone by mistake logged into this DUT and updated the wrong file??
Here is the date of the file I found which is "Nov 8 05:04" while the rest shows it should have been "Nov 6 08:52 ish"

admin@str-dx010-acs-4:~$ docker exec -it bgp /bin/bash
root@str-dx010-acs-4:/# ls -l /usr/local/lib/python3.7/dist-packages/bgpmon/bgpmon.py 
-rw-r--r-- 1 root staff 7207 Nov  8 05:04 /usr/local/lib/python3.7/dist-packages/bgpmon/bgpmon.py
root@str-dx010-acs-4:/# cd /usr/local/lib/python3.7/dist-packages       
root@str-dx010-acs-4:/usr/local/lib/python3.7/dist-packages# ls -l
total 892
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 Jinja2-2.11.2.dist-info
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 MarkupSafe-1.1.1.dist-info
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 PyYAML-5.3.1.egg-info
drwxr-sr-x 1 root staff   4096 Nov  6 09:03 __pycache__
drwxr-sr-x 3 root staff   4096 Nov  6 09:03 bgpcfgd
drwxr-sr-x 1 root staff   4096 Nov  8 05:22 bgpmon
drwxr-sr-x 3 root staff   4096 Nov  6 08:52 bitarray
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 bitarray-1.5.3.egg-info
-rw-r--r-- 1 root staff   2488 Nov  6 08:52 config_samples.py
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 ipaddress-1.0.23.dist-info
-rw-r--r-- 1 root staff  79875 Nov  6 08:52 ipaddress.py
drwxr-sr-x 3 root staff   4096 Nov  6 08:52 jinja2
-rw-r--r-- 1 root staff    604 Nov  6 08:52 lazy_re.py
drwxr-sr-x 6 root staff   4096 Nov  6 08:50 lxml
drwxr-sr-x 2 root staff   4096 Nov  6 08:50 lxml-4.6.1.dist-info
drwxr-sr-x 3 root staff   4096 Nov  6 08:52 markupsafe
-rw-r--r-- 1 root staff  58769 Nov  6 08:52 minigraph.py
drwxr-sr-x 4 root staff   4096 Nov  6 08:52 natsort
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 natsort-6.2.1.dist-info
drwxr-sr-x 7 root staff   4096 Nov  6 08:52 netaddr
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 netaddr-0.8.0.dist-info
-rw-r--r-- 1 root staff 508410 Nov  6 08:52 openconfig_acl.py
drwxr-sr-x 5 root staff   4096 Nov  6 08:49 pip
drwxr-sr-x 2 root staff   4096 Nov  6 08:49 pip-20.2.4.dist-info
-rw-r--r-- 1 root staff  10521 Nov  6 08:52 portconfig.py
drwxr-sr-x 6 root staff   4096 Nov  6 08:50 pyang
drwxr-sr-x 2 root staff   4096 Nov  6 08:50 pyang-2.3.2.dist-info
drwxr-sr-x 6 root staff   4096 Nov  6 08:50 pyangbind
drwxr-sr-x 2 root staff   4096 Nov  6 08:50 pyangbind-0.8.1.egg-info
drwxr-sr-x 3 root staff   4096 Nov  6 08:52 redis
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 redis-3.5.3.dist-info
-rw-r--r-- 1 root staff    924 Nov  6 08:52 redis_bcc.py
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 redis_dump_load-1.1.egg-info
-rw-r--r-- 1 root staff  21478 Sep  1  2016 redisdl.py
drwxr-sr-x 3 root staff   4096 Nov  6 08:50 regex
drwxr-sr-x 2 root staff   4096 Nov  6 08:50 regex-2020.10.28.dist-info
drwxr-sr-x 2 root staff   4096 Nov  6 08:50 six-1.15.0.dist-info
-rw-r--r-- 1 root staff  34159 Nov  6 08:50 six.py
drwxr-sr-x 2 root staff   4096 Nov  6 09:03 sonic_bgpcfgd-1.0.dist-info
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 sonic_config_engine-1.0.dist-info
drwxr-sr-x 3 root staff   4096 Nov  6 08:52 sonic_py_common
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 sonic_py_common-1.0.dist-info
drwxr-sr-x 4 root staff   4096 Nov  6 08:52 swsssdk
drwxr-sr-x 2 root staff   4096 Nov  6 08:52 swsssdk-2.0.1.dist-info
drwxr-sr-x 3 root staff   4096 Nov  6 09:03 tests
drwxr-sr-x 3 root staff   4096 Nov  6 08:52 usr
drwxr-sr-x 3 root staff   4096 Nov  6 08:52 yaml
drwxr-sr-x 2 root staff   4096 Nov  6 09:03 zipp-1.2.0.dist-info
-rw-r--r-- 1 root staff   7039 Nov  6 09:03 zipp.py
root@str-dx010-acs-4:/usr/local/lib/python3.7/dist-packages# cd bgpmon/
root@str-dx010-acs-4:/usr/local/lib/python3.7/dist-packages/bgpmon# ls -l
total 12
-rw-r--r-- 1 root staff    0 Nov  6 09:03 __init__.py
drwxr-sr-x 1 root staff 4096 Nov  8 05:04 __pycache__
-rw-r--r-- 1 root staff 7207 Nov  8 05:04 bgpmon.py
root@str-dx010-acs-4:/usr/local/lib/python3.7/dist-packages/bgpmon# 

Also, I compared the code in the repository and the one that got modified in this DUT and found that there are two line of code that got changed by someone...

Here is the original method:

68   def update_new_peer_states(self, peer_dict):
69        peer_l = peer_dict["peers"].keys()
70        self.new_peer_l.extend(peer_l)
71        for i in range (0, len(peer_l)):
72            self.new_peer_state[peer_l[i]] = peer_dict["peers"][peer_l[i]]["state"]

But this what I see now in the DUT:

68    def update_new_peer_states(self, peer_dict):
69        peer_l = peer_dict["peers"].keys()
70        self.new_peer_l.extend(peer_l)
71        for peer in peer_l:
72            self.new_peer_state[peer] = peer_dict["peers"][peer]["state"]

Line 71 and 72 is where the code got changed...

Can you recall what actually happened that may have somehow caused the code get modified?? or you were manually changing this in attempt to fix some other issue?
Thanks!

@gechiang
Copy link
Collaborator

gechiang commented Nov 8, 2020

@lolyu
I see that this is indeed your change to fix the original issue.
Looks like it got introduced when migrating from python2 to python3.
Your changes looks good.
Thanks for investigating and fixing the issue.
I

@gechiang
Copy link
Collaborator

gechiang commented Nov 8, 2020

@lguohan This should have been caught by vs image test as part of the bgp_fact testcase.
Not sure how this vs image test passed when the code went through the python2 to python3 migration?

@lguohan
Copy link
Collaborator

lguohan commented Nov 8, 2020

@gechiang , can you investigate why bgp_fact missed this? how can we improve that?

@lolyu
Copy link
Contributor Author

lolyu commented Nov 8, 2020

Thanks for investigating and fixing the issue.

No problem, my bad not to reset my change on that DUT.

@lguohan lguohan reopened this Nov 8, 2020
@lguohan
Copy link
Collaborator

lguohan commented Nov 8, 2020

keep this issue for missing test cases.

@gechiang
Copy link
Collaborator

gechiang commented Nov 9, 2020

@lguohan I tried to check the VsImage testlog for the PR that introduced this issue (#5746) but unfortunately those results were no longer available. But with the fix that @lolyu made (#5859) I was able to view the test result and it looks like it is due to the vsimage is still built and tested using python2.7 instead of the python3...
Here is what I captured:

23:48:17  /usr/local/lib/python2.7/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
23:48:17    from cryptography.exceptions import InvalidSignature
23:48:17  ============================= test session starts ==============================
23:48:17  platform linux2 -- Python 2.7.17, pytest-4.6.5, py-1.9.0, pluggy-0.13.1
23:48:17  ansible: 2.8.12
23:48:17  rootdir: /data/sonic-mgmt/tests, inifile: pytest.ini
23:48:17  plugins: xdist-1.28.0, repeat-0.9.1, metadata-1.10.0, html-1.22.1, forked-1.3.0, ansible-2.2.2
23:48:17  collected 1 item
23:48:17  
23:49:04  bgp/test_bgp_fact.py::test_bgp_facts[vlab-01-None] PASSED                [100%]
23:49:04  
23:49:04  - generated xml file: /data/sonic-mgmt/tests/logs/1vlan/bgp/test_bgp_fact.xml --
23:49:04  ========================== 1 passed in 48.40 seconds ===========================
23:49:04  /usr/local/lib/python2.7/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
23:49:04    from cryptography.exceptions import InvalidSignature
23:49:04  ============================= test session starts ==============================
23:49:04  platform linux2 -- Python 2.7.17, pytest-4.6.5, py-1.9.0, pluggy-0.13.1
23:49:04  ansible: 2.8.12
23:49:04  rootdir: /data/sonic-mgmt/tests, inifile: pytest.ini
23:49:04  plugins: xdist-1.28.0, repeat-0.9.1, metadata-1.10.0, html-1.22.1, forked-1.3.0, ansible-2.2.2
23:49:04  collected 1 item
23:49:04  
23:51:30  bgp/test_bgp_gr_helper.py::test_bgp_gr_helper_routes_perserved PASSED    [100%]
23:51:30  
23:51:30  - generated xml file: /data/sonic-mgmt/tests/logs/1vlan/bgp/test_bgp_gr_helper.xml -
23:51:30  ========================== 1 passed in 147.00 seconds ==========================

This means we need to change vsimage build to also start using python3 instead of python2 in order to catch issues similar to this...
Perhaps we should also have someone look into how to change vsimage build to also start using python3?

@gechiang
Copy link
Collaborator

gechiang commented Nov 9, 2020

@lguohan As for the unit test test cases for bgpmon it is on my todo list. Let's keep this PR as a reminder for the missing testcases that I will raise a different PR in the near future to enhance test coverage.

@lguohan
Copy link
Collaborator

lguohan commented Nov 9, 2020

@gechiang , i think you need to talk to arvind why your testing code change is removed.

sonic-net/sonic-mgmt@4703756#diff-23e2943a406c883e83d4105aa7fa57eb5f86a590a9f36125872ed375332a729d

and someone either you or arvind needs to add it back.

@lguohan lguohan added P1 Priority of the issue, lower than P0 and removed P0 Priority of the issue labels Nov 9, 2020
@lguohan
Copy link
Collaborator

lguohan commented Nov 9, 2020

test fix sonic-net/sonic-mgmt#2490

@daall daall closed this as completed Dec 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug 🐛 Master Branch Quality P1 Priority of the issue, lower than P0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants