
Add bgpmon under sonic-bgpcfgd to be started as a new daemon under BG… #5426

Merged: 1 commit merged into sonic-net:201911 on Sep 22, 2020
Conversation

gechiang (Collaborator)

This ports the change from the master branch (https://github.com/Azure/sonic-buildimage/pull/5329/files) to the 201911 branch, since the way a daemon is started up in the 201911 branch differs from master.

This PR auto-starts the bgpmon (BGP Monitor) daemon under the BGP docker.

The purpose of bgpmon is to periodically gather BGP-related information, driven by BGP activity detection, and store BGP neighbor state in the state DB. Components such as the SNMP agent use this new BGP table in the state DB to support MIB operations.

The bgpmon processing code was originally implemented in another PR (sonic-net/sonic-swss#1429) and was later moved into this PR instead of living under the sonic-swss repo. Many valuable comments remain in that PR, so readers should refer to it for review-comment history only.

Note: this new daemon is also monitored by Monit, so that if it stops running it will be reported.
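For illustration, a Monit entry along these lines achieves that; the match pattern and alert policy here are assumptions for illustration, not the exact lines from this PR:

check process bgpmon matching "/usr/local/bin/bgpmon"
    if does not exist then alert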

The HLD doc PR can be found here:

The following PR (SNMP agent changes to use the state DB for the BGP MIB) has a dependency on this PR:

The following PR contains the pytest code that validates this feature whenever the VS image test is run as part of all PRs, and is also used in DUT pytest:
- [Azure-sonic-mgmt] sonic-net/sonic-mgmt#2241

Signed-off-by: Gen-Hwa Chiang gechiang@microsoft.com

- What I did

Added a new daemon that runs under the BGP docker, periodically gathers the BGP peer state information, and updates it in the state DB. To avoid running continuously when there is no BGP peer activity, the daemon uses the timestamp of the FRR frr.log file as an indicator of whether there is any BGP activity that warrants inspecting the peer state for possible updates.

- How I did it

  • Start a periodic timer that triggers every 15 seconds.
  • Check whether the timestamp of the frr.log file has changed against the timestamp cached at the previous update.
  • If it has, request the back-end handler (FRR/Zebra) to dump "show bgp summary json".
  • For each peer state received, check it against the previously cached state; if the state changed or the peer is new, update/add the corresponding state DB table entry.
  • After processing all the peer info, check whether any stale peer entries remain and delete the corresponding state DB entries. (A condensed sketch of this cycle follows the list.)
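The following is a condensed, illustrative sketch of that cycle, not the actual bgpmon source added by this PR. The NEIGH_STATE_TABLE|<peer> key with a "state" field and the PIPE_BATCH_MAX_COUNT flush threshold come from the PR itself; the redis-py client usage, the frr.log path, and the numeric values are assumptions for illustration.

# Illustrative sketch of the bgpmon update cycle (not the actual bgpmon.py)
import json
import os
import subprocess
import time

import redis

FRR_LOG = "/var/log/frr/frr.log"   # activity indicator (path is an assumption)
PIPE_BATCH_MAX_COUNT = 50          # flush threshold (value is an assumption)

def get_peer_states():
    # Ask the back-end handler (FRR) for the current peer states
    out = subprocess.check_output(["vtysh", "-c", "show bgp summary json"])
    summary = json.loads(out)
    states = {}
    for af in summary.values():    # e.g. ipv4Unicast, ipv6Unicast sections
        for peer, info in af.get("peers", {}).items():
            states[peer] = info.get("state", "unknown")
    return states

def run():
    db = redis.StrictRedis(db=6)   # state DB
    cached_states, cached_mtime = {}, 0.0
    while True:
        time.sleep(15)                            # periodic 15-second timer
        mtime = os.path.getmtime(FRR_LOG)
        if mtime == cached_mtime:                 # no BGP activity since last pass
            continue
        cached_mtime = mtime
        new_states = get_peer_states()
        pipe = db.pipeline()
        count = 0
        for peer, state in new_states.items():
            if cached_states.get(peer) != state:  # new peer or changed state
                pipe.hset("NEIGH_STATE_TABLE|%s" % peer, "state", state)
                count += 1
                if count > PIPE_BATCH_MAX_COUNT:  # flush in bounded batches
                    pipe.execute()
                    count = 0
        for peer in set(cached_states) - set(new_states):
            pipe.delete("NEIGH_STATE_TABLE|%s" % peer)   # stale peer cleanup
        pipe.execute()
        cached_states = new_states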

- How to verify it

Once the BGP docker is up and running, you can run "show services" to see that this new daemon "bgpmon" is running under the BGP docker service, similar to the following output:

admin@SONic:~$ show services
...
bgp     docker
---------------------------
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.2  28576 22632 pts/0    Ss+  19:10   0:08 /usr/bin/python2 /usr/bin/supervisord
root        21  0.0  0.2  22664 16716 pts/0    S    19:11   0:00 python /usr/bin/supervisor-proc-exit-listener --container-name bgp
root        25  0.0  0.0 225856  3512 pts/0    Sl   19:11   0:00 /usr/sbin/rsyslogd -n -iNONE
frr         29  1.0  0.4 525176 36580 pts/0    Sl   19:11   0:40 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M fpm -M snmp
frr         30  0.0  0.0  43304  6080 pts/0    S    19:11   0:00 /usr/lib/frr/staticd -A 127.0.0.1
frr         31  2.5  1.0 356196 84108 pts/0    Sl   19:11   1:36 /usr/lib/frr/bgpd -A 127.0.0.1 -M snmp
root        40  0.1  0.6  69172 55876 pts/0    S    19:11   0:05 /usr/bin/python /usr/local/bin/bgpcfgd
root        41  0.0  0.1  20988 15508 pts/0    S    19:11   0:00 /usr/bin/python /usr/local/bin/bgpmon
root        42  0.2  0.0  82108  6136 pts/0    Sl   19:11   0:10 fpmsyncd
...

For IPv4, you can run "show ip bgp summary" and use that output to check against the state DB for each peer.
The state DB key looks like "NEIGH_STATE_TABLE|10.0.0.33", where 10.0.0.33 is the peer IP address.
You can then read the state information with the redis command hgetall "NEIGH_STATE_TABLE|10.0.0.33" and compare it with the corresponding output of "show ip bgp summary".
Here is a sample output from the state DB:
admin@str-s6000-acs-8:~$ redis-cli -n 6
127.0.0.1:6379[6]> keys "NEIGH*"
 1) "NEIGH_STATE_TABLE|fc00::1a"
 2) "NEIGH_STATE_TABLE|fc00::32"
 3) "NEIGH_STATE_TABLE|10.0.0.33"
 4) "NEIGH_STATE_TABLE|fc00::66"
 5) "NEIGH_STATE_TABLE|fc00::62"
 6) "NEIGH_STATE_TABLE|fc00::46"
 7) "NEIGH_STATE_TABLE|10.0.0.35"
 8) "NEIGH_STATE_TABLE|10.0.0.53"
 9) "NEIGH_STATE_TABLE|10.0.0.47"
10) "NEIGH_STATE_TABLE|10.0.0.9"
11) "NEIGH_STATE_TABLE|fc00::6e"
12) "NEIGH_STATE_TABLE|10.0.0.41"
13) "NEIGH_STATE_TABLE|10.0.0.63"
14) "NEIGH_STATE_TABLE|10.0.0.17"
15) "NEIGH_STATE_TABLE|fc00::4a"
16) "NEIGH_STATE_TABLE|fc00::52"
17) "NEIGH_STATE_TABLE|fc00::5e"
18) "NEIGH_STATE_TABLE|10.0.0.29"
19) "NEIGH_STATE_TABLE|10.0.0.45"
20) "NEIGH_STATE_TABLE|fc00::2a"
21) "NEIGH_STATE_TABLE|fc00::42"
22) "NEIGH_STATE_TABLE|10.0.0.1"
23) "NEIGH_STATE_TABLE|10.0.0.59"
127.0.0.1:6379[6]> hgetall "NEIGH_STATE_TABLE|10.0.0.1"
1) "state"
2) "Established"
127.0.0.1:6379[6]>

You can do the same for IPv6 by running "show ipv6 bgp summary" and following the same steps to validate against the corresponding peer states stored in the state DB.
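If you want to script that comparison instead of eyeballing it, a quick illustrative helper (not part of this PR; the redis-py usage is an assumption) could look like this:

# Illustrative cross-check of FRR peer states vs. the state DB
import json
import subprocess

import redis

summary = json.loads(subprocess.check_output(["vtysh", "-c", "show bgp summary json"]))
db = redis.StrictRedis(db=6)   # state DB
for af in summary.values():    # ipv4Unicast and ipv6Unicast sections
    for peer, info in af.get("peers", {}).items():
        db_state = db.hget("NEIGH_STATE_TABLE|%s" % peer, "state")
        print(peer, info.get("state"), db_state)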

@qiluo-msft (Collaborator)

I see #5329 lacks a unit test or vs test, so this backport also lacks tests.

gechiang merged commit 7168fc8 into sonic-net:201911 on Sep 22, 2020
@rlhui (Contributor) commented Sep 22, 2020

@abdosi - FYI

# Check for stale state entries to be cleaned up
while len(self.peer_l) > 0:
    # remove this from the stateDB and the current neighbor state entry
    peer = self.peer_l.pop(0)
@abdosi (Contributor) commented Sep 22, 2020

@gechiang Can't we just pop() here? If we want to pop from the front, then a queue is a better data structure than a list from a performance perspective.

@gechiang (Collaborator, Author)

@abdosi Agreed. There is no need to pop from the front at all; the purpose is just to pop one item from the list and handle it for the peer-deleted case. We should address this the next time we need to touch this file, to improve performance. In most "normal" usage, neighbors are not removed, so this code segment is not expected to be exercised regularly.
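To illustrate the performance point being discussed (a sketch, not the follow-up fix itself): list.pop(0) shifts every remaining element, so draining n peers costs O(n^2) overall, while list.pop() from the end is O(1); if FIFO order were ever needed, collections.deque gives O(1) popleft():

from collections import deque

# Hypothetical stale-peer list for illustration
stale = deque(["10.0.0.1", "10.0.0.33", "fc00::1a"])
while stale:
    peer = stale.popleft()   # O(1); list.pop(0) would shift all remaining items
    print("deleting NEIGH_STATE_TABLE|%s" % peer)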

@abdosi (Contributor)

@gechiang Thanks, but why can't this be addressed now? It should be a small change with quick verification.

@gechiang (Collaborator, Author)

@abdosi Sure, I will put up another PR for this shortly. This PR was already merged before your comments arrived.

@abdosi (Contributor)

@gechiang Thanks, I meant in a new PR only.

    del_key = "NEIGH_STATE_TABLE|%s" % peer
    data[del_key] = None
    del self.peer_state[peer]
    if len(data) > PIPE_BATCH_MAX_COUNT:
@abdosi (Contributor)

@gechiang Do we need this if check inside the while loop? Even without it, the if condition below should take care of flushing the data.

@gechiang (Collaborator, Author)

@abdosi This check is required. The purpose of using a pipeline is to batch commands to reduce transaction overhead. But at the same time, the server must batch the responses to match this batch depth, and that batching effectively increases the memory held on the server side.
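In other words, the mid-loop check bounds how many commands (and therefore how many responses the server must buffer) can pile up in one batch. A minimal sketch of the pattern, with illustrative values:

import redis

PIPE_BATCH_MAX_COUNT = 50   # illustrative threshold; the real value is in bgpmon

db = redis.StrictRedis(db=6)
pipe = db.pipeline()
data = {"NEIGH_STATE_TABLE|10.0.0.1": None}   # None marks a stale entry to delete

count = 0
for key, value in data.items():
    if value is None:
        pipe.delete(key)
    else:
        pipe.hset(key, "state", value)
    count += 1
    if count > PIPE_BATCH_MAX_COUNT:
        pipe.execute()   # mid-loop flush keeps server-side buffering bounded
        count = 0
pipe.execute()           # final flush for the remainder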

@abdosi (Contributor)

@gechiang OK, so we want to keep flushing in batches of PIPE_BATCH_MAX_COUNT for good server performance.

@abdosi (Contributor) commented Sep 22, 2020

@gechiang (Collaborator, Author)

@abdosi @lguohan When bgpmon fails to run, it has no impact on BGP functionality. The only impact is to other components that consume the state DB entries bgpmon produces. So it depends on the definition of what constitutes a critical process for the BGP docker. If all agree, I can certainly add it to the critical processes.

@abdosi (Contributor) commented Sep 23, 2020

> When bgpmon fails to run, it has no impact on BGP functionality. The only impact is to other components that consume the state DB entries bgpmon produces. So it depends on the definition of what constitutes a critical process for the BGP docker. If all agree, I can certainly add it to the critical processes.

@gechiang In my opinion this is a critical process, as SNMP will use it for polling device BGP health.

@lguohan (Collaborator) commented Sep 23, 2020

I am thinking it is not critical; we can do autorestart in the supervisord configuration, but not restart the bgp process.
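For reference, a supervisord program entry along these lines would do it; the exact values and file location (the bgp docker's supervisord.conf) are assumptions for illustration:

[program:bgpmon]
command=/usr/bin/python /usr/local/bin/bgpmon
autorestart=true
stdout_logfile=syslog
stderr_logfile=syslog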

@abdosi (Contributor) commented Sep 23, 2020

> I am thinking it is not critical; we can do autorestart in the supervisord configuration, but not restart the bgp process.

OK, that is better than container auto-restart. Supervisord auto-restart should be good for this process.

@gechiang (Collaborator, Author)

@lguohan @abdosi Thanks for the comments. I will work on another PR to make this change on the master branch, hopefully cherry-pickable to the 201911 branch.

@rlhui (Contributor) commented Oct 21, 2020

@gechiang, should this PR be closed or stay? Thanks.
