Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pmon] Chassis DB cleanup when module is down #394

Merged
merged 4 commits into from
Sep 1, 2023

Conversation

vganesan-nokia
Copy link
Contributor

Description

Following changes are done in this PR
- In line card pmon:chassisd:
- line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries.
- In supervisor pomon:chassisd:
- When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss.
- When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor.
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET.

Motivation and Context

In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time.

How Has This Been Tested?

  • After the chassis is up with a line card
  • (1) Pull out the line card. Notice the error log message indicating line card down. Re-insert the line card/Bring up the line card within 30 minutes and observe that the chassis db entries created by this line card are not cleaned.
  • (2) Pull out the line card. Notice the erroe log message indicating line card down. After more than 30 minutes, observe that the chassis db does not have entries created by the removed line card.

Additional Information (Optional)

Following changes are done in this commit:
    - In line card pmon:chassisd:
        - line card's hostname and number of asics is pushed to a new table
        CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server.
        The hostname and number of asics are required to clean up chassis add db.
        'hostname' and asic name are used to construct  key for chassis app db entries.
    - In supervisor pomon:chassisd:
        - When midplane connectivity loss is deteced, error syslog is generated.
        No chassis app db clean up is done for midplane connectivity loss.
        - When a module goes down and if it is in down state for more than 30
        minutes, chassis app db clean up is done for all the asics of the module
        that went down. As part of the clean up entries created by all the asics
        of the down module are deleted from the following tables in chassis app db
        in redis_chassis server in supervisor.
        (1) SYSTEM_NEIGH
        (2) SYSTEM_INTERFACE
        (3) SYSTEM_LAG_MEMBER_TABLE
        (4) SYSTEM_LAG_TABLE
        The LAG IDs used by the asics of the down module are also
        de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@gechiang
Copy link
Contributor

@vganesan-nokia , can you add more unit test coverage so that the coverage % is improved?
Currently it is at this coverage %:

Pull Request Coverage
Total: 69 lines
Missing: 42 lines
Coverage: 39%
Threshold: 50%

Thanks!

@gechiang
Copy link
Contributor

@vganesan-nokia , @aravindmani-1 , @judyjoseph , @abdosi ,
This chassisd script is Only run/used with "Chassis" platform and never used on "Pizzabox" platform am I correct?
Trying to check the impact of this change in terms of its blast radius in case there is a question on whether this change can be safely cherry-picked (back-ported) to releases such as 202205.

Changes for the review comments. Changes include:
- Syslog for module down and up state changes
- Unit test cases added to improve code coverage of the chassis db
consistency implementation
- mock_swsscommon.py changes for Table::get() function to return value
similar to what real swsscommon.py returns. Changes are done in chassisd
other unit tests to accommodate this mock_swsscommon.py change

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@vganesan-nokia
Copy link
Contributor Author

@vganesan-nokia , @aravindmani-1 , @judyjoseph , @abdosi , This chassisd script is Only run/used with "Chassis" platform and never used on "Pizzabox" platform am I correct? Trying to check the impact of this change in terms of its blast radius in case there is a question on whether this change can be safely cherry-picked (back-ported) to releases such as 202205.

From the chassis db clean up point of view, The clean up (from supervisor) and host name update (from linecard) happen in chassis app db (db id 12). So if there is no supervisor card or if there is no chassis app db, all places of the chassis db consistency code will be "check and no op"

@vganesan-nokia
Copy link
Contributor Author

vganesan-nokia commented Aug 25, 2023

@vganesan-nokia , can you add more unit test coverage so that the coverage % is improved? Currently it is at this coverage %:

Pull Request Coverage Total: 69 lines Missing: 42 lines Coverage: 39% Threshold: 50%

Thanks!

Added unit tests for chassis db cleanup changes

@mlok-nokia
Copy link
Contributor

@gechiang @judyjoseph @arlakshm @deepak-singhal0408 @rlhui Please help review this PR which is for the DB consistency check

gechiang
gechiang previously approved these changes Aug 29, 2023
arlakshm
arlakshm previously approved these changes Aug 29, 2023
Copy link
Contributor

@arlakshm arlakshm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comments

@@ -12,8 +12,11 @@ try:
import signal
import sys
import threading
import time
import subprocess
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort imports?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines 309 to 314
hostname_key = "{}{}".format(ModuleBase.MODULE_TYPE_LINE, int(self.my_slot) - 1)
hostname = try_get(device_info.get_hostname, default="None")
hostname_fvs = swsscommon.FieldValuePairs([(CHASSIS_MODULE_INFO_SLOT_FIELD, self.my_slot),
(CHASSIS_MODULE_INFO_HOSTNAME_FIELD, hostname),
(CHASSIS_MODULE_INFO_NUM_ASICS_FIELD, str(len(module_info_dict[CHASSIS_MODULE_INFO_ASICS])))])
self.hostname_table.set(hostname_key, hostname_fvs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation is not right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

else:
# Module is operational. Remove it from down time tracking.
if down_module_key in self.down_modules:
self.log_notice("Module {} recoverecd on-line!".format(key))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

sonic-chassisd/scripts/chassisd Show resolved Hide resolved
Changes for review comments
    - Sorted the imported modules
    - Corrected indentation for the line to push host name to the
    chassis app db.
    - Module recovery syslog typo fixed

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@vganesan-nokia
Copy link
Contributor Author

@arlakshm , @deepak-singhal0408 and @gechiang, while fixing the review comments, I had to amend a commit (to avoid an extra commit) and force push which reset your approvals. Would you please review and re-approve? Thanks

@@ -186,10 +193,16 @@ class ModuleUpdater(logger.Logger):
if self._is_supervisor():
self.asic_table = swsscommon.Table(self.chassis_state_db,
CHASSIS_FABRIC_ASIC_INFO_TABLE)
self.lc_asic_table = swsscommon.Table(self.chassis_state_db,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is lc_asic_table used?

Copy link
Contributor Author

@vganesan-nokia vganesan-nokia Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed unused table.

hostname = fvs[CHASSIS_MODULE_INFO_HOSTNAME_FIELD]
down_module_key = key+'|'+hostname
else:
down_module_key = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any cleanup required for fabric cards which might become un-operational? If not, a comment might be useful to indicate the fabric cards donot need this cleanup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No cleanup required for supervisor/fabric cards and also for the unequipped/unprovisioned slots. Added to comment to clarify this.

@@ -266,6 +303,16 @@ class ModuleUpdater(logger.Logger):
(CHASSIS_ASIC_ID_IN_MODULE_FIELD, str(asic_id))])
self.asic_table.set(asic_key, asic_fvs)

# In line card push the hostname of the module and num_asics to the chassis state db.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be done periodically or it can be done once during initialization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is possibility of changing hostname and restarting the swss without restarting the pmon or rebooting line card. For these cases, this is the simplest way to use the existing infrastructure code to get the changed host name to the supervisor.

sonic-chassisd/scripts/chassisd Show resolved Hide resolved
arlakshm
arlakshm previously approved these changes Aug 31, 2023
Following changes done for review comments
    - To avoid repeated module down/up syslogs for unequipped slots
    - To avoid doing chassis clean up for supervisor and unequipped/
    unprovisioned linecards which will not have hostname
    - Comment added to indicate that cleanup will not be attempted for
    supervisor/fabric cards.
    - Added syslog for midplane connectivity recovery
    - Removed connection to unused table (linecard asic table is not
    required in supervisor any more. A separate hostname table
    is defined in chassis app db

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@arlakshm arlakshm merged commit c1c43f6 into sonic-net:master Sep 1, 2023
4 checks passed
@gechiang
Copy link
Contributor

gechiang commented Sep 1, 2023

@yxieca , @StormLiangMS
MSFT ADO: 25038353
Appreciate approval for the requested branches.
Thanks!

for module in self.down_modules:
if self.down_modules[module]['cleaned'] == False:
down_time = self.down_modules[module]['down_time']
delta = (time_now - down_time) / 60

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, SYSTEM_NEIGH table requires more frequent clean up then 30mins. When the entries for a asic that is no more existing, then leaving those entries for 30mins in the neighbor table will be cause a traffic blackhole for remote neighbors of other asics. So is it possible to have a more granular 'cleanup' instead of clearing all the tables after 30mins ?

yxieca pushed a commit that referenced this pull request Sep 1, 2023
Description
Following changes are done in this PR
- In line card pmon:chassisd:
- line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries.
- In supervisor pomon:chassisd:
- When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss.
- When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor.
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET.

Motivation and Context
In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time.

How Has This Been Tested?
After the chassis is up with a line card
(1) Pull out the line card. Notice the error log message indicating line card down. Re-insert the line card/Bring up the line card within 30 minutes and observe that the chassis db entries created by this line card are not cleaned.
(2) Pull out the line card. Notice the erroe log message indicating line card down. After more than 30 minutes, observe that the chassis db does not have entries created by the removed line card.
if not self._is_supervisor():
hostname_key = "{}{}".format(ModuleBase.MODULE_TYPE_LINE, int(self.my_slot) - 1)
hostname = try_get(device_info.get_hostname, default="None")
hostname_fvs = swsscommon.FieldValuePairs([(CHASSIS_MODULE_INFO_SLOT_FIELD, self.my_slot),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is causing crash on LC
sonic-net/sonic-buildimage#16465

Looks like the code assumes the get_my_slot API to provide a string. Is that expected, I do not see it enforced by the module.

tshalvi pushed a commit to tshalvi/sonic-platform-daemons that referenced this pull request Sep 11, 2023
…t#394)

Description
Following changes are done in this PR
- In line card pmon:chassisd:
- line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries.
- In supervisor pomon:chassisd:
- When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss.
- When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor.
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET.

Motivation and Context
In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time.

How Has This Been Tested?
After the chassis is up with a line card
(1) Pull out the line card. Notice the error log message indicating line card down. Re-insert the line card/Bring up the line card within 30 minutes and observe that the chassis db entries created by this line card are not cleaned.
(2) Pull out the line card. Notice the erroe log message indicating line card down. After more than 30 minutes, observe that the chassis db does not have entries created by the removed line card.
StormLiangMS pushed a commit that referenced this pull request Sep 21, 2023
Description
Following changes are done in this PR
- In line card pmon:chassisd:
- line card's hostname and number of asics is pushed to a new table CHASSIS_MODULE_TABLE in chassis state db in redis_chassis server. The hostname and number of asics are required to clean up chassis add db. 'hostname' and asic name are used to construct key for chassis app db entries.
- In supervisor pomon:chassisd:
- When midplane connectivity loss is deteced, error syslog is generated. No chassis app db clean up is done for midplane connectivity loss.
- When a module goes down and if it is in down state for more than 30 minutes, chassis app db clean up is done for all the asics of the module that went down. As part of the clean up entries created by all the asics of the down module are deleted from the following tables in chassis app db in redis_chassis server in supervisor.
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
The LAG IDs used by the asics of the down module are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET.

Motivation and Context
In an operational system, if a line card is brought down the entries created by the down line card are still present in the chassis db and hence the corresponding voq system entries (such as system interface, systen neighbor and so on ) in all other line cards. These stale entries may affect the accuracy of the current entries. To fix this, we cleanup the chassis db for all asics of a given line card that is detected to be down for a long period of time.

How Has This Been Tested?
After the chassis is up with a line card
(1) Pull out the line card. Notice the error log message indicating line card down. Re-insert the line card/Bring up the line card within 30 minutes and observe that the chassis db entries created by this line card are not cleaned.
(2) Pull out the line card. Notice the erroe log message indicating line card down. After more than 30 minutes, observe that the chassis db does not have entries created by the removed line card.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.