[system-health] [202012] No longer check critical process/service status via monit #109

Closed
wants to merge 167 commits into from

Conversation

Junchao-Mellanox
Owner

Why I did it

The command monit summary -B can no longer display the status of each critical process, so system-health should not depend on it and needs another way to monitor the status of critical processes. This PR addresses that. monit is still used by system-health for the file system check and for customized checks.

How I did it

  1. Get container names from FEATURE table
  2. For each container, collect critical process names from file critical_processes
  3. Use "docker exec -it <container_name> bash -c 'supervisorctl status'" to get the process status inside the container, parse the output, and check whether any critical process has exited
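The parsing and checking in step 3 could look roughly like the sketch below. This is an illustration, not the code from this PR: the function names are hypothetical, the critical_processes file is assumed to hold one entry per line with an optional "program:" prefix, and the status text is assumed to have the process name and state as its first two whitespace-separated columns. The sample text is made up for the example; a real run would take it from the docker exec command above.

```python
def parse_critical_processes(text):
    """Return critical process names, one per non-empty line.

    Assumption: a line is either a bare name or 'program:<name>';
    lines with any other prefix are skipped in this sketch.
    """
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, sep, name = line.partition(":")
        if not sep:
            names.append(line)
        elif kind == "program":
            names.append(name.strip())
    return names


def parse_supervisor_status(output):
    """Map process name -> state from 'supervisorctl status' text."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return states


def find_not_running(critical, states):
    """List critical processes that are missing or not in RUNNING state."""
    return [name for name in critical if states.get(name) != "RUNNING"]


# Illustrative sample input, not captured from a real device.
SAMPLE_STATUS = """\
orchagent                        RUNNING   pid 37, uptime 0:10:12
portsyncd                        EXITED    Sep 01 10:00 AM
rsyslogd                         RUNNING   pid 12, uptime 0:10:30
"""
SAMPLE_CRITICAL = "program:orchagent\nprogram:portsyncd\n"

if __name__ == "__main__":
    critical = parse_critical_processes(SAMPLE_CRITICAL)
    states = parse_supervisor_status(SAMPLE_STATUS)
    print(find_not_running(critical, states))
```

The sketch treats a critical process that is absent from the status output as not running, which is stricter than only checking for an exited state.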

How to verify it

  1. Add unit test case to cover it
  2. Adjust sonic-mgmt cases to cover it
  3. Manual test

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

mssonicbld and others added 30 commits August 30, 2021 16:24
Co-authored-by: mssonicbld <vsts@fv-az232-326.x3jni0md3anuvcz2px3t3ecixa.bx.internal.cloudapp.net>
…nic-net#8627)

7041400 [config reload] Call systemctl reset-failed for snmp,telemetry,mgmt-framework services (sonic-net#1773) (sonic-net#1786)
399d370 Fix logic in RIF counters print (sonic-net#1732)
8329544 [vnet_route_check] don't hardcode prefix length of /24 (sonic-net#1756)
193b028 [neighbor-advertiser] delete the tunnel maps appropriately (sonic-net#1663)
2c82bcf [neighbor_advertiser] Use existing tunnel if present for creating tunnel mappings (sonic-net#1589)
8e22960 [202012][Config] Update config command of Kdump. (sonic-net#1778)
be3e5c6 [show][config] cli refactor for muxcable with abstract class implementation from vendors (sonic-net#1722) (sonic-net#1782)
The Lodoga platform also matched crow, which was hardcoding the flash
size to 3700. This change enables autodetect on Clearlake, which in turn
allows autodetect for Lodoga.

The threshold was bumped from 3700 to 4000 because size computation can
differ slightly and report slightly above 3700.
…ks for first 4 ports (sonic-net#8624)

Why I did it
The first 4 ports on this dut are breakout ports. They might not always be connected in lab. Mark them as 'RJ45' to skip the SFP check since they are by default disabled.

How to verify it
run platform test_reboot.py

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
The saiserver docker needs different configuration files on different platforms. Make the saiserver docker mount them in the hwsku folder.

Co-authored-by: Ubuntu <richardyu@richardyu-ubuntu-vm0.trsxrdzozv2e1czsze2t05vqzh.ix.internal.cloudapp.net>
Signed-off-by: Guohan Lu <lguohan@gmail.com>
…c-net#8602)

- Why I did it
Update SDK\FW version to 4.4.3326\2008.3326. This version contains:

New Features:
1. Add support for Fast Boot for SN3800

Bug Fixing:
1. In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to an incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 & L3)

2. On Spectrum systems, when using Async Router API with IPV6, an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.

3. On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.

4. Some Molex cables do not support speed after reboot

- How I did it
Update submodule and .mk files

- How to verify it
Verified by running regression tests that include the complete set of supported sonic-mgmt tests

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
* d240291 Update port_rates & rif_rates lua scripts to convert poll_interval to MS (sonic-net#1855)
* a71a5d3 [acl mirror action] Mirror session ref count fix at acl rule attachment (sonic-net#1761)
* 197f427 Fix vs test failure in test_buffer_traditional (sonic-net#1881)
* 8471f42 Revert "[debugcounterorch] check if counter type is supported before querying… (sonic-net#1789)" (sonic-net#1884)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
…aemons (sonic-net#8607)


This PR updates the following commits in sonic-platform-common
9d2e7d5 Add y-cable driver for simulated mux (#213)
e3e8f09 [Y-Cable][Broadcom] Broadcom implementation of YCable class which inherits from YCableBase required for Y-Cable API's in sonic-platform-daemons (#208)

This PR updates the following commits in sonic-platform-daemons

ebc4f3f [Y-Cable] create unknown entries for mux_cable when there is a cable present but module definition is not present/invalid module
b10c417 [xcvrd] initial support for integrating vendor specfic class objects for calling Y-Cable API's inside xcvrd (#197) (#213)
f3fc1ea [y-cable] fix for logging the xcvrd metrics before writing the state to the State-DB (#208)


Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Account for sfputil_helper indexing being 0 based

Co-authored-by: Carl Keene <keene@nokia.com>
Commits on Sep 01, 2021
hw-mgmt: attributes: Add PSU power sensor attributes d8fce39

Commits on Sep 02, 2021
Remove MFT package flint tool from hw-management dump generation. 53d06b2
hw-mgmt: debug: Add timeout to generate-dump.sh b661fa3 

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
#### Why I did it
During a SONiC upgrade, the image is extracted to tmpfs for installation, so the tmpfs size must be larger than the image size. Image installation fails if the image size is larger than the tmpfs size.

We are facing the error below while installing a debug image whose size is greater than the tmpfs size, which is 1.5 GB on the Marvell armhf platform.

sonic-installer install <url>
New image will be installed, continue? [y/N]: y
Downloading image...
...99%, 1744 MB, 708 KB/s, 0 seconds left...
Installing image SONiC-OS-202012.0-dirty-20210311.224845 and setting it as default...
Command: bash /tmp/sonic_image
tar: installer/fs.zip: Wrote only 7680 of 10240 bytes
tar: installer/onie-image-arm64.conf: Cannot write: No space left on device
tar: Exiting with failure status due to previous errors
Verifying image checksum ... OK.
Preparing image archive ...

#### How I did it
Compare the downloaded image size with the tmpfs size. If the tmpfs size is less than the image size, update the tmpfs size according to the image size.
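A minimal sketch of that comparison is shown below. It is an illustration only and not the installer code from this change: the function names are hypothetical, the optional headroom parameter is an assumption that the description does not mention, and the remount command is only formatted as a string rather than executed.

```python
import os
import shutil


def required_tmpfs_size(image_size, tmpfs_size, headroom=0):
    """Return the new tmpfs size in bytes, or None if no resize is needed.

    headroom is a hypothetical extra margin; the PR description only
    says the tmpfs size is updated according to the image size.
    """
    needed = image_size + headroom
    if tmpfs_size >= needed:
        return None
    return needed


def remount_command(image_path, mount_point="/tmp"):
    """Build (but do not run) a tmpfs remount command if a resize is needed."""
    image_size = os.path.getsize(image_path)
    tmpfs_size = shutil.disk_usage(mount_point).total
    new_size = required_tmpfs_size(image_size, tmpfs_size)
    if new_size is None:
        return None
    return "mount -o remount,size={} {}".format(new_size, mount_point)
```

For example, required_tmpfs_size(1900, 1500) asks for a resize to 1900, while required_tmpfs_size(1000, 1500) returns None because the existing tmpfs is already large enough.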

#### How to verify it
Install an image whose size is larger than the tmpfs size. We verified this by installing a debug image of 1.9 GB, which is larger than the tmpfs size of 1.5 GB.
Why I did it
Power cycle test case fails for Z9332f in sonic-mgmt framework(sonic-net#8605).

How I did it
Modified the platform API to return expected strings.

How to verify it
Power cycle the device and verify the reboot reason.
Run sonic-mgmt test_reboot script.
* [Nokia ixs7215] Miscellaneous platform API fixes

This commit delivers the following fixes for the Nokia ixs7215 platform

- Fix bug in a fan API error path
- Add support for setting the fan drawer led
- Add support for getting/setting the front panel PSU status led
- Add support for getting the min/max observed temperature value

* [Nokia ixs7215] code review changes: temperature min/max values
fix mac address format for get_system_eeprom_info
harden pmbus status reading for Clearlake
force loading PSUs on Cloverdale
* 0323d5e noaOrMlnx Fix flex counters logic of converting poll interval to seconds from MS (sonic-net#878)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
* d03ba4f [202012] [portstat, intfstat] added rates and utilization  (sonic-net#1812)
* 499ad3f [config reload] Fix config reload failure due to sonic.target job cancellation (sonic-net#1814)
* 96d658c [202012][sonic installer] Add swap setup support (sonic-net#1815)
* a9c6970 platform pre-check for reboot in 202012 branch (sonic-net#1788)
* 0e0478b Unify the number format in the ourput of portstat and pfcstat in all cases (sonic-net#1795)
* 2d1e00e [ecnconfig] Fix exception seen during display and add unit tests (sonic-net#1784) (sonic-net#1789)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
#### Why I did it
This is a partial backport of sonic-net#8034,
needed to unblock cherry-picking other test code commits from master to 202012.
Fix sonic-net#8672

add two missing commits in caclmgrd: monitor state_db to update dhcp acl sonic-net#8222 when porting to 202012 branch
Why I did it
fstrim depends on the pmon docker.

How I did it
start fstrim timer after sonic.target.

How to verify it
local test and PR test.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
*Removed execute permissions from the systemd copp-config.service file. 
Without this we will get a warning: "Configuration file /lib/systemd/system/copp-config.service is marked executable. Please remove executable permission bits. Proceeding anyway."
…pdate submodule sonic-swss-common (sonic-net#8513)

#### Why I did it
Backport sonic-net#8034 to 202012 branch

The sonic-swss-common submodule update includes the following commits:
```
a6b98da 2021-04-29 | Add support for config_db subscribe and unsubscribe python apis (sonic-net#481) [arlakshm]
2506ca0 2021-08-22 | [ci] Fix azure pipeline DownloadPipelineArtifact source branch (sonic-net#514) [Qi Luo]
```
*Update makefiles for Innovium 202012 support
* 2a8957d 2021-09-14 | [202012][sonic-utilities] CLI support for port auto negotiation (sonic-net#1817) (HEAD, origin/202012) [vdahiya12]

Signed-off-by: Guohan Lu <lguohan@gmail.com>
- add test code to check dhcp acl update
- port sonic-net#8359 (caclmgrd: add test code to check dhcp acl update) to 202012 branch
theasianpianist and others added 27 commits November 10, 2021 18:54
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
[bgp]: Switch mux to standby if BGP container exits

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
[write_standby]: Ignore non-auto interfaces

* In the event that `write_standby.py` is used to automatically switchover interfaces when linkmgrd or bgp crashes, ignore any interfaces that are not configured to auto-switch

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
ead0d5a 2021-11-10 | Exclude *.a files from python deb packages (sonic-net#554) [Qi Luo]
3a660ac 2021-10-20 | Fix the option missing in kernel config issue (sonic-net#541) [xumia]
Signed-off-by: Mykhailo Onipko <monipko@barefootnetworks.com>
…212995, SONIC-51583, CS00012215744, and SONIC-51638 (sonic-net#9252)

This is to pick up BRCM SAI 4.3.5.1-7 fixes which contains the following fixes:

1.  CS00012209390: SONIC-50037, Used SAI_SWITCH_ATTR_QOS_DSCP_TO_TC_MAP as a default decap map for IPinIP tunnels.
2.  CS00012212995: SONIC-50948 SAI_API_QUEUE:_brcm_sai_cosq_stat_get:1353 egress Min limit get failed with error Invalid parameter 
3.  SONIC-51583: Fixed acl group member creation failure with priority of -1
4.  CS00012215744:SONIC-51395 [TH, TH2] WB 3.5 to 4.3 fails at APPLY_VIEW while setting SAI_PORT_ATTR_EGRESS_ACL
5.  SONIC-51638: SDK-249337 ERROR: AddressSanitizer: heap-buffer-overflow in _tlv_print_array

Preliminary tests look fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:
```
     fib/test_fib.py
     vxlan/test_vxlan_decap.py
     fdb/test_fdb.py
     decap/test_decap.py
     ipfwd/test_dip_sip.py 
     ipfwd/test_dir_bcast.py
     acl/test_acl.py
     vlan/test_vlan.py
     platform_tests/test_reboot.py
```
Fix support for DHCPv6 Relay multi-VLAN functionality. Make sure the relayed packet is received at the correct interface.

How I did it
Bind a socket to each vlan interface's global address and another to its link-local address.
The socket bound to the global address is used for relaying data from the client to the server and for receiving data from servers.
The socket bound to the link-local address is used for relaying data received from the server back to the client.
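The per-VLAN address selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the relay's implementation: the function name and the input shape (a dict of VLAN name to address strings) are hypothetical, any address that is not link-local is treated as "global", and no sockets are actually opened.

```python
import ipaddress


def plan_relay_sockets(vlan_addresses):
    """Pick one global and one link-local IPv6 address per VLAN interface.

    vlan_addresses: dict of VLAN name -> list of IPv6 address strings,
    optionally with a '/prefixlen' suffix (hypothetical input shape).
    """
    plan = {}
    for vlan, addrs in vlan_addresses.items():
        entry = {"global": None, "link_local": None}
        for text in addrs:
            addr = ipaddress.IPv6Address(text.split("/")[0])
            key = "link_local" if addr.is_link_local else "global"
            # Keep the first address of each kind.
            if entry[key] is None:
                entry[key] = str(addr)
        plan[vlan] = entry
    return plan


if __name__ == "__main__":
    print(plan_relay_sockets({"Vlan1000": ["2001:db8::1/64", "fe80::1/64"]}))
```

Each VLAN then gets one socket per selected address. Binding to a link-local address generally also needs the interface's scope ID, which this sketch leaves out.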
- add a new service "mark_dhcp_packet" to mux container
- apply packet marks on a per-interface basis in ebtables
- write packet marks to "DHCP_PACKET_MARK" table in state_db
- update DHCP_PACKET_MARK schema in state_db
- this is an update over PR: Add service mark_dhcp_packet to mux container sonic-net#9015
9dd3025 2021-05-11 | [Command-Reference.md] Document new SNMP show and config commands (sonic-net#1600) [Travis Van Duyn]
be40767 2021-05-05 | [show][config] Add new snmp commands (sonic-net#1347) [Travis Van Duyn]
…y triggered/restored when pause frames are sent continuously to both queues of a port (sonic-net#9296)

1.  CS00012211718 [4.3] Pfcwd getting continuously triggered/restored when pause frames are sent continuously to both queues of a port (TD2/Th/Th2/TD3) MSFT Default

Preliminary tests look fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:
```
     fib/test_fib.py
     vxlan/test_vxlan_decap.py
     fdb/test_fdb.py
     decap/test_decap.py
     ipfwd/test_dip_sip.py 
     ipfwd/test_dir_bcast.py
     acl/test_acl.py
     vlan/test_vlan.py
     platform_tests/test_reboot.py
```
6f198d0 (HEAD -> 202012, origin/202012) [Y-Cable][Broadcom] upgrade to support Broadcom Y-Cable API to release (#230)
1c3e422 SSD Health: Retrieve SSD health and temperature values from generic SSD info (#229)

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
c31a362 - 2021-11-18 : [202012][Mux orch] set default as standby, change mux orch priority (sonic-net#2015) [Prince Sunny]
9a9e8e6 - 2021-11-18 : [202012] Check VS test failure (sonic-net#2033) [Prince Sunny]
7eaabca - 2021-11-11 : [202012] Fix random failure in PR/CI build. (sonic-net#2016) [Shilong Liu]
85230fe - 2021-11-04 : [orchagent] Fix group name of port-buffer-drop in flexcounterorch.cpp (sonic-net#1967) [Junchao-Mellanox]
a55c2ca - 2021-11-03 : [teammgrd]: Handle LAGs cleanup gracefully on Warm/Fast reboot. (sonic-net#1934) [Nazarii Hnydyn]
In the Azure pipeline template, 'set -e' does not work as expected.
[cherry-pick PR sonic-net#9123 ]

Why I did it
When sshd realizes that a login cannot succeed due to internal device state
or configuration, it does not fail right there. Instead it proceeds to prompt for the
password, so that the user does not get any clue about where the failure point is.

To ensure that this login still does not proceed, sshd replaces the user-provided password
with a specific pattern of characters matching the length of the user-provided password.
This pattern is "<BS><LF><CR><DEL>INCORRECT", which is bound to fail.

If the user-provided length is smaller than or equal to the pattern length, the password is overwritten with a substring of the pattern.
If the user-provided length is greater, the pattern is repeated until the length is exhausted.

But if the PAM-tacacs plugin sent this password to AAA, the user could get
locked out by AAA for providing an incorrect value.

How I did it
Hence this fix matches the obtained password against the pattern. If it matches, fail just before
reaching the AAA server.
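The matching rule can be illustrated with the short sketch below. It is not the plugin code from this fix, which is not shown here: the names are hypothetical, and treating an empty password as a non-match is an assumption made for the example.

```python
# <BS><LF><CR><DEL>INCORRECT, as described above.
SSHD_JUNK = "\b\n\r\x7fINCORRECT"


def expected_junk(length):
    """Build the pattern text for a password of the given length.

    Shorter lengths get a prefix of the pattern; longer lengths get the
    pattern repeated and then cut to size.
    """
    repeats = length // len(SSHD_JUNK) + 1
    return (SSHD_JUNK * repeats)[:length]


def is_sshd_junk_password(password):
    """True if the password is exactly the substituted pattern.

    Assumption: an empty password is never treated as a match.
    """
    return len(password) > 0 and password == expected_junk(len(password))


if __name__ == "__main__":
    print(is_sshd_junk_password(expected_junk(20)))
    print(is_sshd_junk_password("hunter2"))
```

A caller would run this check on the password it received and return an authentication failure locally on a match, so that nothing is sent to the AAA server.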

How to verify it
Make sure tacacs is properly configured.
Try logging in as, say "user-A"; ensure it succeeds
Pick another user, say user-B, and ensure this user has not logged into this device before (look into /etc/passwd & folders under /home)
Disable monit service (as that could fix the issue using disk_check.py)
Start TCP dump for all TACACS servers.
Simulate Read-only disk
Try logging in using user-B.
Verify it fails, after 3 attempts
Stop tcp dump.
TCP dump should show "authentication" for user-A only
Submodule update for sonic-utilities with following change:

ec9e5ee Backport [generate_dump] remove secrets from dump files sonic-net#1886 to 202012 (sonic-net#1938)
ce3b856 [fdbshow]: Handle FDB cleanup gracefully. (sonic-net#1926)
1437bf2 [202012] Add DHCPv6 Relay counter and ipv6 helper CLI (sonic-net#1917)
The recent release of redis 4.0.0 or newer (for python3) breaks the sonic-config-engine unit tests. Pin to the last known good version.

ref: https://pypi.org/project/redis/#history
…nit activation delay (sonic-net#8896)

#### Why I did it
With the current code, the delay takes place even when a simple 'config reload' command is executed, which is not desired.
This delay should be used only when fast-rebooting.

#### How I did it
Change the type of delay to OnBootSec instead of OnActiveSec.

#### How to verify it
Fast-reboot with this PR and observe the delay.
Run the 'config reload' command and observe that no delay occurs.
Currently, when we update a SAI package downloaded from a remote server, we also need to update the version file. Because the reproducible build feature is not enabled in master, a missing update can only be detected when the code is merged into release branches such as 202106 and 202012.
The reproducible build feature is meant to reduce build failures, so there is no need to break the build when the version is not specified. If the version is not specified, the best choice is to accept the version from the remote server.

Co-authored-by: Ubuntu <xumia@xumia-vm1.jqzc3g5pdlluxln0vevsg3s20h.xx.internal.cloudapp.net>
@Junchao-Mellanox Junchao-Mellanox deleted the system-health-202012 branch June 12, 2023 04:37
Junchao-Mellanox pushed a commit that referenced this pull request Jun 25, 2023
…atically (sonic-net#15373)

src/sonic-telemetry

* 5a83c07 - (HEAD -> 202205, origin/202205) Merge pull request #109 from zbud-msft/backport-202205 (86 minutes ago) [Ying Xie]
* 0f75a64 - [202012] Workaround gomonkey for armhf (#88) (3 weeks ago) [Zain Budhwani]
* 1b0dc75 - Fix Makefile error (3 weeks ago) [Zain Budhwani]
* 72e4540 - Add diff cov (#85) (3 weeks ago) [Zain Budhwani]
* 11aace5 - Add get-update to azp yml (#79) (3 weeks ago) [Zain Budhwani]
* 3eaaccc - Add net core and code coverage results (#77) (3 weeks ago) [Zain Budhwani]
* 3cf3883 - Enable unit test (3 weeks ago) [ganglyu]
* 2a0928f - Change dir name in pipeline (#75) (4 weeks ago) [Zain Budhwani]
* baa845a - Update yml (4 weeks ago) [Zain Budhwani]
* c533d52 - Fix format (4 weeks ago) [ganglyu]