[Mellanox] Advance hw-mgmt to v.7.0020.4104 (#13372)
- Why I did it
Advance hw-mgmt service to V.7.0020.4104
Add missing thermal sensors that are supported by hw-mgmt package
Delay the system health service until hw-mgmt has started on Mellanox platforms, to avoid reading sensors before they are ready.
Depends on sonic-net/sonic-linux-kernel#305

- How I did it
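1. Update the hw-mgmt version in hw-management.mk and advance the hw-mgmt submodule
2. Add the missing "SODIMM 1 Temp" sensor to platform.json on MSN2010, MSN2100, MSN2410, MSN2700 and SN2201
3. Order system-health.service after hw-management by extending the Before= line of the hw-management unit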

- How to verify it
Regression test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
stephenxs authored and mssonicbld committed Mar 1, 2023
1 parent 73f5729 commit 76a5c75
Showing 11 changed files with 78 additions and 46 deletions.
3 changes: 3 additions & 0 deletions device/mellanox/x86_64-mlnx_msn2010-r0/platform.json
@@ -67,6 +67,9 @@
 },
 {
 "name": "CPU Core 3 Temp"
+},
+{
+"name": "SODIMM 1 Temp"
 }
 ],
 "sfps": [
3 changes: 3 additions & 0 deletions device/mellanox/x86_64-mlnx_msn2100-r0/platform.json
@@ -67,6 +67,9 @@
 },
 {
 "name": "CPU Core 3 Temp"
+},
+{
+"name": "SODIMM 1 Temp"
 }
 ],
 "sfps": [
3 changes: 3 additions & 0 deletions device/mellanox/x86_64-mlnx_msn2410-r0/platform.json
@@ -114,6 +114,9 @@
 },
 {
 "name": "CPU Pack Temp"
+},
+{
+"name": "SODIMM 1 Temp"
 }
 ],
 "sfps": [
3 changes: 3 additions & 0 deletions device/mellanox/x86_64-mlnx_msn2700-r0/platform.json
@@ -114,6 +114,9 @@
 },
 {
 "name": "CPU Pack Temp"
+},
+{
+"name": "SODIMM 1 Temp"
 }
 ],
 "sfps": [
3 changes: 3 additions & 0 deletions device/mellanox/x86_64-nvidia_sn2201-r0/platform.json
@@ -101,6 +101,9 @@
 },
 {
 "name": "ASIC"
+},
+{
+"name": "SODIMM 1 Temp"
 }
 ],
 "sfps": [{
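
The five platform.json hunks above all append the same thermal entry. A quick spot check that a given platform.json declares it could look like the sketch below. `has_named_entry` is a hypothetical helper, not part of SONiC, and the plain-text match assumes the `"name": "..."` spacing shown in these hunks; it does not parse JSON, so it cannot confirm that the entry sits in the thermal list.

```shell
#!/bin/sh
# Hypothetical helper, not part of SONiC: succeed if file $1 contains an
# entry of the form "name": "<$2>". This is a fixed-string text match only;
# it does not parse JSON and cannot tell which list the entry belongs to.
has_named_entry() {
    grep -qF "\"name\": \"$2\"" "$1"
}
```

From the repository root, a call such as `has_named_entry device/mellanox/x86_64-mlnx_msn2700-r0/platform.json "SODIMM 1 Temp" && echo present` would report whether the entry is there.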
2 changes: 1 addition & 1 deletion platform/mellanox/hw-management.mk
@@ -16,7 +16,7 @@
 #
 # Mellanox HW Management

-MLNX_HW_MANAGEMENT_VERSION = 7.0020.3006
+MLNX_HW_MANAGEMENT_VERSION = 7.0020.4104

 export MLNX_HW_MANAGEMENT_VERSION

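
The version bump is a one-line change in hw-management.mk. A sketch for reading that value back, for example to compare it against the version named in the commit title, is shown here. `mk_hw_mgmt_version` is a hypothetical helper name, and the pattern assumes the `NAME = value` form used in this file.

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to MLNX_HW_MANAGEMENT_VERSION
# in the makefile fragment given as $1. The "export ..." line is not matched
# because the pattern is anchored at the start of the line.
mk_hw_mgmt_version() {
    sed -n 's/^MLNX_HW_MANAGEMENT_VERSION[[:space:]]*=[[:space:]]*//p' "$1"
}
```

For instance, `[ "$(mk_hw_mgmt_version platform/mellanox/hw-management.mk)" = "7.0020.4104" ]` succeeds only when the makefile agrees with the title.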
@@ -1,16 +1,16 @@
-From 1a1011b6da491d35001df5a7204d4eecb2769767 Mon Sep 17 00:00:00 2001
+From 489764eb124e03087eb408dec27d769fa4f98459 Mon Sep 17 00:00:00 2001
 From: keboliu <kebol@mellanox.com>
 Date: Fri, 15 Jan 2021 14:41:16 +0800
-Subject: [PATCH] Make SONiC determine-reboot-cause service start after hw-mgmt
-service
+Subject: [PATCH 1/4] Make SONiC determine-reboot-cause service start after
+hw-mgmt service

 Signed-off-by: Kebo Liu <kebol@nvidia.com>
 ---
 debian/hw-management.hw-management.service | 1 +
 1 file changed, 1 insertion(+)

 diff --git a/debian/hw-management.hw-management.service b/debian/hw-management.hw-management.service
-index 39a2a54..2104b87 100755
+index 8bdcaef..1c25ffb 100755
 --- a/debian/hw-management.hw-management.service
 +++ b/debian/hw-management.hw-management.service
 @@ -1,6 +1,7 @@
@@ -22,5 +22,5 @@ index 39a2a54..2104b87 100755
 [Service]
 Type=oneshot
 --
-1.9.1
+2.20.1

@@ -1,59 +1,47 @@
-From 79dadd5b0d2f5e860b525c12d4d3843607b03a9f Mon Sep 17 00:00:00 2001
+From 422b64397f2f33b394d037820f0ceb4c09e3a725 Mon Sep 17 00:00:00 2001
 From: Alexander Allen <arallen@nvidia.com>
 Date: Fri, 21 Jan 2022 16:47:19 +0000
-Subject: [PATCH] Disable hw-mgmt on SimX platforms
+Subject: [PATCH 2/4] Disable hw-mgmt on SimX platforms

 ---
-usr/usr/bin/hw-management-ready.sh | 31 ++++++++++++++++--------------
+usr/usr/bin/hw-management-ready.sh | 11 +++++++----
 usr/usr/bin/hw-management.sh | 9 +++++++++
-2 files changed, 26 insertions(+), 14 deletions(-)
+2 files changed, 16 insertions(+), 4 deletions(-)

 diff --git a/usr/usr/bin/hw-management-ready.sh b/usr/usr/bin/hw-management-ready.sh
-index 5a9698c..364f906 100755
+index 88672a8..7558c68 100755
 --- a/usr/usr/bin/hw-management-ready.sh
 +++ b/usr/usr/bin/hw-management-ready.sh
-@@ -51,19 +51,22 @@ if [ -d /var/run/hw-management ]; then
+@@ -51,17 +51,20 @@ if [ -d /var/run/hw-management ]; then
 rm -fr /var/run/hw-management
 fi

 -case $board_type in
 -VMOD0014)
-- while [ ! -d /sys/devices/pci0000:00/0000:00:1f.0/NVSN2201:00/mlxreg-hotplug/hwmon ]
-- do
-- sleep 1
-- done
-- ;;
++if [ -z "$(lspci -vvv | grep SimX)" ]; then
++ case $board_type in
++ VMOD0014)
+if [ ! -d /sys/devices/pci0000:00/0000:00:1f.0/NVSN2201:00/mlxreg-hotplug/hwmon ]; then
+timeout 180 bash -c 'until [ -d /sys/devices/pci0000:00/0000:00:1f.0/NVSN2201:00/mlxreg-hotplug/hwmon ]; do sleep 0.2; done'
+fi
+;;
 -*)
-- while [ ! -d /sys/devices/platform/mlxplat/mlxreg-hotplug/hwmon ]
-- do
-- sleep 1
-- done
-- ;;
++ *)
+if [ ! -d /sys/devices/platform/mlxplat/mlxreg-hotplug/hwmon ]; then
+timeout 180 bash -c 'until [ -d /sys/devices/platform/mlxplat/mlxreg-hotplug/hwmon ]; do sleep 0.2; done'
+fi
+;;
 -esac
-+if [ -z "$(lspci -vvv | grep SimX)" ]; then
-+ case $board_type in
-+ VMOD0014)
-+ while [ ! -d /sys/devices/pci0000:00/0000:00:1f.0/NVSN2201:00/mlxreg-hotplug/hwmon ]
-+ do
-+ sleep 1
-+ done
-+ ;;
-+ *)
-+ while [ ! -d /sys/devices/platform/mlxplat/mlxreg-hotplug/hwmon ]
-+ do
-+ sleep 1
-+ done
-+ ;;
-+ esac
++ esac
 +fi
 +
 echo "Start Chassis HW management service."
 logger -t hw-management -p daemon.notice "Start Chassis HW management service."
 diff --git a/usr/usr/bin/hw-management.sh b/usr/usr/bin/hw-management.sh
-index ebfabb0..c0c038e 100755
+index 1ee05b5..50d922b 100755
 --- a/usr/usr/bin/hw-management.sh
 +++ b/usr/usr/bin/hw-management.sh
-@@ -1495,6 +1495,13 @@ do_chip_down()
+@@ -2310,6 +2310,13 @@ do_chip_down()
 /usr/bin/hw-management-thermal-events.sh change hotplug_asic down %S %p
 }

@@ -67,7 +55,7 @@ index ebfabb0..c0c038e 100755
 __usage="
 Usage: $(basename "$0") [Options]

-@@ -1520,6 +1527,8 @@ Options:
+@@ -2335,6 +2342,8 @@ Options:
 force-reload Performs hw-management 'stop' and the 'start.
 "

@@ -77,5 +65,5 @@ index ebfabb0..c0c038e 100755
 start)
 if [ -d /var/run/hw-management ]; then
 --
-2.17.1
+2.20.1
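
In the refreshed patch, the context lines of hw-management-ready.sh wait for the mlxreg-hotplug hwmon directory with `timeout 180 bash -c 'until [ -d ... ]; do sleep 0.2; done'`, where the earlier revision looped on `sleep 1` with no upper bound. A minimal sketch of that bounded-poll pattern as a standalone function follows. `wait_for_dir` is a hypothetical name, not something hw-management provides, and it polls in whole seconds rather than every 0.2 s so that it does not rely on fractional `sleep`.

```shell
#!/bin/sh
# Hypothetical helper: poll until a directory exists or a deadline passes.
#   $1 = directory to wait for
#   $2 = timeout in whole seconds (defaults to 180, the value in the patch)
# Returns 0 once the directory exists, 1 if the timeout is reached first.
wait_for_dir() {
    wfd_dir=$1
    wfd_limit=${2:-180}
    wfd_elapsed=0
    while [ ! -d "$wfd_dir" ]; do
        if [ "$wfd_elapsed" -ge "$wfd_limit" ]; then
            return 1    # gave up: the directory never appeared
        fi
        sleep 1
        wfd_elapsed=$((wfd_elapsed + 1))
    done
    return 0
}
```

A caller could then write `wait_for_dir /sys/devices/platform/mlxplat/mlxreg-hotplug/hwmon 180 || echo "hwmon directory not ready"` and decide how to proceed, instead of blocking boot indefinitely.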

@@ -1,14 +1,14 @@
-From 14b06a12802fc0e15116a64f419d002d0d21d695 Mon Sep 17 00:00:00 2001
+From 439639e939f896f9aee42a4dbd5216feb728220c Mon Sep 17 00:00:00 2001
 From: Alexander Allen <arallen@nvidia.com>
 Date: Thu, 17 Feb 2022 04:19:50 +0000
-Subject: [PATCH] Remove unused non-upstream kernel modules from load
+Subject: [PATCH 3/4] Remove unused non-upstream kernel modules from load

 ---
 usr/etc/modules-load.d/05-hw-management-modules.conf | 2 --
 1 file changed, 2 deletions(-)

 diff --git a/usr/etc/modules-load.d/05-hw-management-modules.conf b/usr/etc/modules-load.d/05-hw-management-modules.conf
-index 39f621e..c0980bc 100644
+index cfcfaa4..dd3b5ca 100644
 --- a/usr/etc/modules-load.d/05-hw-management-modules.conf
 +++ b/usr/etc/modules-load.d/05-hw-management-modules.conf
 @@ -15,8 +15,6 @@ xdpe12284
@@ -21,5 +21,5 @@ index 39f621e..c0980bc 100644
 gpio-pca953x
 pmbus
 --
-2.17.1
+2.20.1

@@ -0,0 +1,29 @@
+From 038bce6bf808ec9d082e96fec4184e060b3a85a9 Mon Sep 17 00:00:00 2001
+From: Stephen Sun <stephens@nvidia.com>
+Date: Mon, 28 Nov 2022 03:55:14 +0000
+Subject: [PATCH 4/4] Make system-health service starts after hw-management to
+avoid failures
+
+On SN2410, it can fail to read the file led_status_capability if it starts from ONIE
+
+Signed-off-by: Stephen Sun <stephens@nvidia.com>
+---
+debian/hw-management.hw-management.service | 2 +-
+1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/debian/hw-management.hw-management.service b/debian/hw-management.hw-management.service
+index 1c25ffb..0fbd877 100755
+--- a/debian/hw-management.hw-management.service
++++ b/debian/hw-management.hw-management.service
+@@ -1,7 +1,7 @@
+[Unit]
+Description=Chassis HW management service of Mellanox systems
+Documentation=man:hw-management.service(8)
+-Before=determine-reboot-cause.service
++Before=determine-reboot-cause.service system-health.service
+
+[Service]
+Type=oneshot
+--
+2.20.1
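
Patch 4/4 works purely through systemd ordering. With system-health.service listed in Before=, and both units queued for start, systemd holds back system-health until the hw-management unit has finished starting, which for a Type=oneshot unit means its start command has exited. The sketch below checks that a unit file carries the expected ordering. `unit_orders_before` is a hypothetical helper, and it only inspects single-line, space-separated Before= entries like the one in this patch.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the unit file $1 names unit $2 in a
# Before= line. Each Before= line is split on '=' and spaces into one token
# per line, then matched exactly as a fixed string.
unit_orders_before() {
    grep '^Before=' "$1" | tr ' =' '\n\n' | grep -qxF "$2"
}
```

Against the patched file this would be `unit_orders_before debian/hw-management.hw-management.service system-health.service`. On a running switch, `systemctl show -p Before hw-management.service` should report the same ordering, assuming the unit is installed under that name.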

2 changes: 1 addition & 1 deletion platform/mellanox/hw-management/hw-mgmt
Submodule hw-mgmt updated 70 files
+64 −0 .contrib/version_tag.py
+76 −0 .github/workflows/autotag.yml
+17 −0 README.md
+140 −35 debian/Release.txt
+3 −3 debian/changelog
+105 −92 recipes-kernel/linux/Patch_Status_Table.txt
+1 −2 recipes-kernel/linux/linux-4.19/0011-platform-x86-mlx-platform-Modify-setting-for-new-sys.patch
+1 −1 recipes-kernel/linux/linux-4.19/0027-platform-x86-mlx-platform-Remove-PSU-EEPROM-configur.patch
+3 −5 recipes-kernel/linux/linux-4.19/0071-leds-mlxreg-Allow-multi-instantiation-of-same-name-L.patch
+1 −2 recipes-kernel/linux/linux-4.19/0085-platform-x86-mlx-platform-Add-initial-support-for-ne.patch
+357 −0 recipes-kernel/linux/linux-4.19/0157-platform-mellanox-Introduce-support-for-COMe-managem.patch
+0 −1,019 recipes-kernel/linux/linux-4.19/0157-platform-mellanox-Introduce-support-for-NDR-InfiniBa.patch
+167 −76 recipes-kernel/linux/linux-4.19/0158-platform-mellanox-Introduce-support-for-rack-switch-.patch
+11 −11 recipes-kernel/linux/linux-4.19/0159-platform-mellanox-Add-COME-board-revision-register.patch
+52 −68 recipes-kernel/linux/linux-4.19/0160-platform-mellanox-Introduce-support-for-rack-manager.patch
+0 −0 recipes-kernel/linux/linux-4.19/0161-DS-leds-leds-mlxreg-Send-udev-event-from-leds-mlxreg.patch
+0 −0 recipes-kernel/linux/linux-4.19/0162-platform-mellanox-mlxreg-io-Add-locking-for-io-opera.patch
+0 −0 recipes-kernel/linux/linux-4.19/0163-TMP-mlxsw-i2c-Prevent-transaction-execution-for-spec.patch
+0 −0 recipes-kernel/linux/linux-4.19/0164-i2c-mlxcpld-Fix-register-setting-for-400KHz-frequenc.patch
+0 −0 recipes-kernel/linux/linux-4.19/0165-platform-mellanox-mlxreg-lc-Fix-cleanup-on-failure-a.patch
+59 −0 recipes-kernel/linux/linux-4.19/0166-mlxsw-core-Add-support-for-OSFP-transceiver-modules.patch
+259 −0 recipes-kernel/linux/linux-4.19/0167-hwmon-pmbus-Add-support-for-Infineon-Digital-Multi-p.patch
+2,246 −0 recipes-kernel/linux/linux-4.19/0168-DS-iio-pressure-icp20100-add-driver-for-InvenSense-ICP-.patch
+30 −0 recipes-kernel/linux/linux-4.19/0169-platform-mellanox-fix-reset_pwr_converter_fail-attri.patch
+8 −10 recipes-kernel/linux/linux-5.10/0051-leds-mlxreg-Allow-multi-instantiation-of-same-name-L.patch
+1 −2 recipes-kernel/linux/linux-5.10/0056-platform-x86-mlx-platform-Add-initial-support-for-ne.patch
+20 −23 recipes-kernel/linux/linux-5.10/0160-platform-mellanox-Introduce-support-for-COMe-managem.patch
+4 −4 recipes-kernel/linux/linux-5.10/0161-platform-x86-mlx-platform-Add-support-for-new-system.patch
+10 −10 recipes-kernel/linux/linux-5.10/0162-platform-mellanox-Add-COME-board-revision-register.patch
+111 −81 recipes-kernel/linux/linux-5.10/0163-platform-mellanox-Introduce-support-for-rack-manager.patch
+0 −0 recipes-kernel/linux/linux-5.10/0164-hwmon-jc42-Add-support-for-Seiko-Instruments-S-34TS0.patch
+0 −0 recipes-kernel/linux/linux-5.10/0165-platform-mellanox-mlxreg-io-Add-locking-for-io-opera.patch
+0 −0 recipes-kernel/linux/linux-5.10/0166-DS-leds-leds-mlxreg-Send-udev-event-from-leds-mlxreg.patch
+2 −2 recipes-kernel/linux/linux-5.10/0167-DS-lan743x-Add-support-for-fixed-phy.patch
+2 −3 recipes-kernel/linux/linux-5.10/0168-TMP-mlxsw-minimal-Ignore-error-reading-SPAD-register.patch
+0 −0 recipes-kernel/linux/linux-5.10/0169-TMP-mlxsw-i2c-Prevent-transaction-execution-for-spec.patch
+0 −0 recipes-kernel/linux/linux-5.10/0170-i2c-mlxcpld-Fix-register-setting-for-400KHz-frequenc.patch
+0 −0 recipes-kernel/linux/linux-5.10/0171-platform-mellanox-mlxreg-lc-Fix-cleanup-on-failure-a.patch
+10 −10 recipes-kernel/linux/linux-5.10/0172-DS-platform-mlx-platform-Add-SPI-path-for-rack-switc.patch
+59 −0 recipes-kernel/linux/linux-5.10/0173-core-Add-support-for-OSFP-transceiver-modules.patch
+271 −0 recipes-kernel/linux/linux-5.10/0175-hwmon-pmbus-Add-support-for-Infineon-Digital-Multi-p.patch
+31 −0 recipes-kernel/linux/linux-5.10/0176-platform-mellanox-fix-reset_pwr_converter_fail-attri.patch
+38 −0 recipes-kernel/linux/linux-5.10/0177-Documentation-ABI-fix-description-of-fix-reset_pwr_c.patch
+295 −0 recipes-kernel/linux/linux-5.10/0178-platform-mellanox-Introduce-support-for-next-generat.patch
+2,246 −0 recipes-kernel/linux/linux-5.10/0179-DS-iio-pressure-icp20100-add-driver-for-InvenSense-ICP-.patch
+65 −0 recipes-kernel/linux/linux-5.10/0180-hwmon-pmbus-Fix-sensors-readouts-for-MPS-Multi-phase.patch
+109 −0 recipes-kernel/linux/linux-5.10/0181-Revert-Fix-out-of-bounds-memory-accesses-in-thermal.patch
+1 −1 usr/etc/hw-management-sensors/mqm9510_sensors.conf
+2 −2 usr/etc/hw-management-sensors/mqm9520_sensors.conf
+133 −7 usr/etc/hw-management-sensors/mqm9700_rev1_sensors.conf
+192 −0 usr/etc/hw-management-sensors/msn3700_A1_sensors.conf
+105 −0 usr/etc/hw-management-sensors/msn4700_respin_sensors.conf
+25 −11 usr/etc/hw-management-sensors/sn3750sx_sensors.conf
+301 −0 usr/etc/hw-management-sensors/sn5600_sensors.conf
+1 −0 usr/etc/modules-load.d/05-hw-management-modules.conf
+30 −5 usr/lib/udev/rules.d/50-hw-management-events.rules
+94 −10 usr/usr/bin/hw-management-chassis-events.sh
+239 −0 usr/usr/bin/hw-management-devtree-check.sh
+599 −0 usr/usr/bin/hw-management-devtree.sh
+1 −1 usr/usr/bin/hw-management-generate-dump.sh
+59 −0 usr/usr/bin/hw-management-helpers.sh
+74 −0 usr/usr/bin/hw-management-if-rename.sh
+46 −0 usr/usr/bin/hw-management-liquid-cooling.sh
+2 −1 usr/usr/bin/hw-management-parse-eeprom.sh
+6 −8 usr/usr/bin/hw-management-ready.sh
+2 −69 usr/usr/bin/hw-management-start-post.sh
+58 −42 usr/usr/bin/hw-management-thermal-control.sh
+196 −86 usr/usr/bin/hw-management-thermal-events.sh
+555 −218 usr/usr/bin/hw-management.sh
+11 −2 usr/usr/bin/hw_management_psu_fw_update_delta.py
