-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pmon] Chassis DB cleanup when module is down #394
Conversation
Following changes are done in this commit: - In line card pmon:chassisd: - line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries. - In supervisor pomon:chassisd: - When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss. - When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor. (1) SYSTEM_NEIGH (2) SYSTEM_INTERFACE (3) SYSTEM_LAG_MEMBER_TABLE (4) SYSTEM_LAG_TABLE The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET. Signed-off-by: vedganes <veda.ganesan@nokia.com>
@vganesan-nokia , can you add more unit test coverage so that the coverage % is improved? Pull Request Coverage Thanks! |
@vganesan-nokia , @aravindmani-1 , @judyjoseph , @abdosi , |
Changes for the review comments. Changes include: - Syslog for module down and up state changes - Unit test cases added to improve code coverage of the chassis db consistency implementation - mock_swsscommon.py changes for Table::get() function to return value similar to what real swsscommon.py returns. Changes are done in chassisd other unit tests to accommodate this mock_swsscommon.py change Signed-off-by: vedganes <veda.ganesan@nokia.com>
From the chassis db clean up point of view, The clean up (from supervisor) and host name update (from linecard) happen in chassis app db (db id 12). So if there is no supervisor card or if there is no chassis app db, all places of the chassis db consistency code will be "check and no op" |
Added unit tests for chassis db cleanup changes |
@gechiang @judyjoseph @arlakshm @deepak-singhal0408 @rlhui Please help review this PR which is for the DB consistency check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor comments
sonic-chassisd/scripts/chassisd
Outdated
@@ -12,8 +12,11 @@ try: | |||
import signal | |||
import sys | |||
import threading | |||
import time | |||
import subprocess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort imports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
sonic-chassisd/scripts/chassisd
Outdated
hostname_key = "{}{}".format(ModuleBase.MODULE_TYPE_LINE, int(self.my_slot) - 1) | ||
hostname = try_get(device_info.get_hostname, default="None") | ||
hostname_fvs = swsscommon.FieldValuePairs([(CHASSIS_MODULE_INFO_SLOT_FIELD, self.my_slot), | ||
(CHASSIS_MODULE_INFO_HOSTNAME_FIELD, hostname), | ||
(CHASSIS_MODULE_INFO_NUM_ASICS_FIELD, str(len(module_info_dict[CHASSIS_MODULE_INFO_ASICS])))]) | ||
self.hostname_table.set(hostname_key, hostname_fvs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation is not right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
sonic-chassisd/scripts/chassisd
Outdated
else: | ||
# Module is operational. Remove it from down time tracking. | ||
if down_module_key in self.down_modules: | ||
self.log_notice("Module {} recoverecd on-line!".format(key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
dc886f0
Changes for review comments - Sorted the imported modules - Corrected indentation for the line to push host name to the chassis app db. - Module recovery syslog typo fixed Signed-off-by: vedganes <veda.ganesan@nokia.com>
dc886f0
to
1cb9fcd
Compare
@arlakshm , @deepak-singhal0408 and @gechiang, while fixing the review comments, I had to amend a commit (to avoid an extra commit) and force push which reset your approvals. Would you please review and re-approve? Thanks |
sonic-chassisd/scripts/chassisd
Outdated
@@ -186,10 +193,16 @@ class ModuleUpdater(logger.Logger): | |||
if self._is_supervisor(): | |||
self.asic_table = swsscommon.Table(self.chassis_state_db, | |||
CHASSIS_FABRIC_ASIC_INFO_TABLE) | |||
self.lc_asic_table = swsscommon.Table(self.chassis_state_db, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is lc_asic_table used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed unused table.
sonic-chassisd/scripts/chassisd
Outdated
hostname = fvs[CHASSIS_MODULE_INFO_HOSTNAME_FIELD] | ||
down_module_key = key+'|'+hostname | ||
else: | ||
down_module_key = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any cleanup required for fabric cards which might become un-operational? If not, a comment might be useful to indicate the fabric cards donot need this cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No cleanup required for supervisor/fabric cards and also for the unequipped/unprovisioned slots. Added to comment to clarify this.
@@ -266,6 +303,16 @@ class ModuleUpdater(logger.Logger): | |||
(CHASSIS_ASIC_ID_IN_MODULE_FIELD, str(asic_id))]) | |||
self.asic_table.set(asic_key, asic_fvs) | |||
|
|||
# In line card push the hostname of the module and num_asics to the chassis state db. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be done periodically or it can be done once during initialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is possibility of changing hostname and restarting the swss without restarting the pmon or rebooting line card. For these cases, this is the simplest way to use the existing infrastructure code to get the changed host name to the supervisor.
Following changes done for review comments - To avoid repeated module down/up syslogs for unequipped slots - To avoid doing chassis clean up for supervisor and unequipped/ unprovisioned linecards which will not have hostname - Comment added to indicate that cleanup will not be attempted for supervisor/fabric cards. - Added syslog for midplane connectivity recovery - Removed connection to unused table (linecard asic table is not required in supervisor any more. A separate hostname table is defined in chassis app db Signed-off-by: vedganes <veda.ganesan@nokia.com>
@yxieca , @StormLiangMS |
for module in self.down_modules: | ||
if self.down_modules[module]['cleaned'] == False: | ||
down_time = self.down_modules[module]['down_time'] | ||
delta = (time_now - down_time) / 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, SYSTEM_NEIGH table requires more frequent clean up then 30mins. When the entries for a asic that is no more existing, then leaving those entries for 30mins in the neighbor table will be cause a traffic blackhole for remote neighbors of other asics. So is it possible to have a more granular 'cleanup' instead of clearing all the tables after 30mins ?
Description Following changes are done in this PR - In line card pmon:chassisd: - line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries. - In supervisor pomon:chassisd: - When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss. - When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor. (1) SYSTEM_NEIGH (2) SYSTEM_INTERFACE (3) SYSTEM_LAG_MEMBER_TABLE (4) SYSTEM_LAG_TABLE The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET. Motivation and Context In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time. How Has This Been Tested? After the chassis is up with a line card (1) Pull out the line card. Notice the error log message indicating line card down. Re-insert the line card/Bring up the line card within 30 minutes and observe that the chassis db entries created by this line card are not cleaned. (2) Pull out the line card. Notice the erroe log message indicating line card down. After more than 30 minutes, observe that the chassis db does not have entries created by the removed line card.
if not self._is_supervisor(): | ||
hostname_key = "{}{}".format(ModuleBase.MODULE_TYPE_LINE, int(self.my_slot) - 1) | ||
hostname = try_get(device_info.get_hostname, default="None") | ||
hostname_fvs = swsscommon.FieldValuePairs([(CHASSIS_MODULE_INFO_SLOT_FIELD, self.my_slot), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is causing crash on LC
sonic-net/sonic-buildimage#16465
Looks like the code assumes the get_my_slot API to provide a string. Is that expected, I do not see it enforced by the module.
…t#394) Description Following changes are done in this PR - In line card pmon:chassisd: - line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries. - In supervisor pomon:chassisd: - When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss. - When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor. (1) SYSTEM_NEIGH (2) SYSTEM_INTERFACE (3) SYSTEM_LAG_MEMBER_TABLE (4) SYSTEM_LAG_TABLE The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET. Motivation and Context In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time. How Has This Been Tested? After the chassis is up with a line card (1) Pull out the line card. Notice the error log message indicating line card down. Re-insert the line card/Bring up the line card within 30 minutes and observe that the chassis db entries created by this line card are not cleaned. (2) Pull out the line card. Notice the erroe log message indicating line card down. After more than 30 minutes, observe that the chassis db does not have entries created by the removed line card.
Description Following changes are done in this PR - In line card pmon:chassisd: - line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries. - In supervisor pomon:chassisd: - When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss. - When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor. (1) SYSTEM_NEIGH (2) SYSTEM_INTERFACE (3) SYSTEM_LAG_MEMBER_TABLE (4) SYSTEM_LAG_TABLE The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET. Motivation and Context In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time. How Has This Been Tested? After the chassis is up with a line card (1) Pull out the line card. Notice the error log message indicating line card down. Re-insert the line card/Bring up the line card within 30 minutes and observe that the chassis db entries created by this line card are not cleaned. (2) Pull out the line card. Notice the erroe log message indicating line card down. After more than 30 minutes, observe that the chassis db does not have entries created by the removed line card.
Description
Following changes are done in this PR
- In line card pmon:chassisd:
- line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries.
- In supervisor pomon:chassisd:
- When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss.
- When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor.
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET.
Motivation and Context
In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time.
How Has This Been Tested?
Additional Information (Optional)