Handle Port oper down error status notification #3350

Merged (1 commit), Nov 8, 2024

prgeor (Contributor) commented Nov 2, 2024

Handle the port oper-down error status notification, reported via the sai_port_error_status_t flags, as part of the existing port oper status notification sai_port_oper_status_notification_t.

What I did
Whenever orchagent (OA) receives a port_state_change notification for a link-down event, it updates, in the new STATE_DB table PORT_OPERR_TABLE, the error count for each error flag that is set and the timestamp of the port-down error event.
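
As a rough illustration of the flow described above, the sketch below walks a bitmask of error flags on a link-down event, bumps a per-error counter, and emits `<name>_count` / `<name>_time` field pairs like those stored in `PORT_OPERR_TABLE`. The `ErrorFlag` enum, its bit values, `ErrorRecord`, `makeRecords`, and `onLinkDown` are hypothetical stand-ins, not the SAI header or the orchagent code; the real handler writes to STATE_DB rather than returning a vector.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for sai_port_error_status_t; bit values are
// illustrative only and are not taken from the SAI headers.
enum ErrorFlag : uint32_t {
    ERR_CLEAR            = 0,
    ERR_MAC_LOCAL_FAULT  = 1u << 0,
    ERR_MAC_REMOTE_FAULT = 1u << 1,
    ERR_FEC_SYNC_LOSS    = 1u << 2,
};

// Per-error bookkeeping kept for one port (assumed layout).
struct ErrorRecord {
    std::string name;      // prefix used for the DB field names
    uint64_t count = 0;    // link-down events that carried this flag
    std::string lastTime;  // timestamp of the most recent such event
};

using FieldValue = std::pair<std::string, std::string>;

// Build the initial (all-zero) record table for a port.
std::map<uint32_t, ErrorRecord> makeRecords()
{
    std::map<uint32_t, ErrorRecord> records;
    records[ERR_MAC_LOCAL_FAULT].name  = "mac_local_fault";
    records[ERR_MAC_REMOTE_FAULT].name = "mac_remote_fault";
    records[ERR_FEC_SYNC_LOSS].name    = "fec_sync_loss";
    return records;
}

// On a link-down event, update every record whose flag is set and return
// the field/value pairs that would be written for the port.
std::vector<FieldValue> onLinkDown(std::map<uint32_t, ErrorRecord>& records,
                                   uint32_t flags,
                                   const std::string& eventTime)
{
    std::vector<FieldValue> fields;
    fields.emplace_back("oper_error_status", std::to_string(flags));
    for (auto& entry : records) {
        if ((flags & entry.first) == 0) continue;  // flag not set in this event
        ErrorRecord& rec = entry.second;
        ++rec.count;
        rec.lastTime = eventTime;
        fields.emplace_back(rec.name + "_count", std::to_string(rec.count));
        fields.emplace_back(rec.name + "_time", rec.lastTime);
    }
    return fields;
}
```

Errors whose flag is not set in a given event keep their previous count and timestamp, which would explain how fields in the same table entry can carry different timestamps.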

Why I did it
To correlate various error events like MAC local and remote fault with the link oper down status notification.

How I verified it
Since we don't have real hardware to test the modified notification, I modified sairedis to generate this new notification on every link-down event.

root@smsn2700~# redis-cli -n 6 hgetall "PORT_OPERR_TABLE|Ethernet4"
 1) "oper_error_status"
 2) "5442"
 3) "mac_local_fault_count"
 4) "2"
 5) "mac_local_fault_time"
 6) "2024-11-02 04:00:05"
 7) "fec_sync_loss_count"
 8) "2"
 9) "fec_sync_loss_time"
10) "2024-11-02 04:00:05"
11) "fec_alignment_loss_count"
12) "2"
13) "fec_alignment_loss_time"
14) "2024-11-02 04:00:05"
15) "high_ser_error_count"
16) "2"
17) "high_ser_error_time"
18) "2024-11-02 04:00:05"
19) "high ber_error_count"
20) "2"
21) "high ber_error_time"
22) "2024-11-02 04:00:05"
23) "data_unit_crc_error_count"
24) "2"
25) "data_unit_crc_error_time"
26) "2024-11-02 04:00:05"
27) "data_unit_misalignment_error_count"
28) "2"
29) "data_unit_misalignment_error_time"
30) "2024-11-02 04:00:05"
31) "signal_local_error_count"
32) "2"
33) "signal_local_error_time"
34) "2024-11-02 04:00:05"
35) "mac_remote_fault_count"
36) "2"
37) "mac_remote_fault_time"
38) "2024-11-02 04:00:50"
39) "crc_rate_count"
40) "2"
41) "crc_rate_time"
42) "2024-11-02 04:00:50"
43) "data_unit_size_count"
44) "2"
45) "data_unit_size_time"
46) "2024-11-02 04:00:50"
47) "code_group_error_count"
48) "2"
49) "code_group_error_time"
50) "2024-11-02 04:00:50"
51) "no_rx_reachability_count"
52) "2"
53) "no_rx_reachability_time"
54) "2024-11-02 04:00:50"
root@msn2700~# 
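
The `oper_error_status` field appears to hold the raw flag bitmask, so it can be split into its set bit positions with plain arithmetic. The sketch below does only that; `setBitPositions` is a hypothetical helper, and mapping each position to an error name would require the `sai_port_error_status_t` definition, which is not reproduced here.

```cpp
#include <cstdint>
#include <vector>

// Return the zero-based positions of all bits set in `status`.
std::vector<int> setBitPositions(uint32_t status)
{
    std::vector<int> positions;
    for (int pos = 0; status != 0; ++pos, status >>= 1) {
        if (status & 1u) positions.push_back(pos);
    }
    return positions;
}
```

For the value 5442 this yields positions 1, 6, 8, 10 and 12 (2 + 64 + 256 + 1024 + 4096), five bits in total, which is consistent with the five fields stamped 04:00:50 in the output.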

@prgeor prgeor requested a review from prsunny as a code owner November 2, 2024 18:09
@prgeor prgeor changed the title from "Handler Port oper down error status notification" to "Handle Port oper down error status notification" Nov 2, 2024
prgeor (Contributor Author) commented Nov 2, 2024

@moshemos @eddyk-nvidia please review this PR

orchagent/portsorch.cpp (outdated)

if (port.m_portOperErrorToEvent.find(error_status) == port.m_portOperErrorToEvent.end())
{
++errors;
Collaborator

If error_status is 0, there is no need to query the map. If it is not 0 and not in the map, we should probably record an error log.
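
One way to read this suggestion, as a hedged sketch: return early when no error bit is set, then split the remaining bits into those the map knows and those it does not, so the caller can log the unknown ones. The `knownErrors` table, its bit values, `SplitStatus`, and `splitErrorStatus` are hypothetical names for illustration; the real code would use the PR's own map and the orchagent logging facilities.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Result of classifying a raw error-status bitmask.
struct SplitStatus {
    uint32_t known = 0;    // bits that have an entry in the table
    uint32_t unknown = 0;  // bits set by the notification but absent from the table
};

// Hypothetical table of recognised error bits (values are illustrative).
const std::unordered_map<uint32_t, std::string>& knownErrors()
{
    static const std::unordered_map<uint32_t, std::string> table = {
        {1u << 0, "mac_local_fault"},
        {1u << 1, "mac_remote_fault"},
    };
    return table;
}

SplitStatus splitErrorStatus(uint32_t status)
{
    SplitStatus result;
    if (status == 0) {
        return result;  // nothing set: skip the map lookups entirely
    }
    // Visit each of the 32 bit positions; `bit` wraps to 0 after the top bit.
    for (uint32_t bit = 1; bit != 0; bit <<= 1) {
        if ((status & bit) == 0) continue;
        if (knownErrors().count(bit) != 0) {
            result.known |= bit;
        } else {
            result.unknown |= bit;  // caller should emit an error log for these
        }
    }
    return result;
}
```

A caller would update counters for `known` and write one error log line when `unknown` is non-zero.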

static const std::unordered_map<sai_port_error_status_t, std::string> db_key_errors;

private:
sai_port_error_status_t m_errorFlag = SAI_PORT_ERROR_STATUS_CLEAR;
Collaborator

do we really need this member? I think the port event from redis already contains the error status.

Contributor Author

@Junchao-Mellanox probably not required, but I thought that if I need to take an OA coredump for future debugging, I won't need a redis dump.

private:
sai_port_error_status_t m_errorFlag = SAI_PORT_ERROR_STATUS_CLEAR;
size_t m_errorCount = 0;
std::string m_dbKeyError; // DB key for this port error
Collaborator

do we really need this member? The DB key name is already in the static map

sai_port_error_status_t m_errorFlag = SAI_PORT_ERROR_STATUS_CLEAR;
size_t m_errorCount = 0;
std::string m_dbKeyError; // DB key for this port error
std::time_t m_eventTime = 0;
Collaborator

Do we really need this member? The event time can be obtained whenever there is an error event. There seems to be no need to save it in the class.
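
To illustrate the point that the event time can be computed on demand rather than stored, here is a small sketch that formats a `std::time_t` in the `YYYY-MM-DD HH:MM:SS` layout seen in the table output. `formatEventTime` is a hypothetical helper; it uses UTC via `std::gmtime`, which is an assumption, since the PR does not say which time zone the stored timestamps use.

```cpp
#include <ctime>
#include <string>

// Format a timestamp as "YYYY-MM-DD HH:MM:SS" (UTC assumed for this sketch).
std::string formatEventTime(std::time_t t)
{
    std::tm* utc = std::gmtime(&t);  // not thread-safe; acceptable for a sketch
    if (utc == nullptr) return std::string();
    char buf[20];  // 19 characters plus the terminating NUL
    std::size_t n = std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", utc);
    return std::string(buf, n);
}
```

A handler could call `formatEventTime(std::time(nullptr))` at the moment the notification is processed and write the result directly to the table.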

@@ -193,6 +232,9 @@ class Port
sai_object_id_t m_system_side_id = 0;
sai_object_id_t m_line_side_id = 0;

/* Port oper error status to event map*/
std::unordered_map<sai_port_error_status_t, PortOperErrorEvent> m_portOperErrorToEvent;
Collaborator

I suppose we only need a std::unordered_map<sai_port_error_status_t, uint32_t> here to record the occurrence count for each error type.
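
A minimal sketch of the simpler layout suggested here, assuming each port stores only a counter per error type and everything else (names, timestamps) is derived elsewhere. `PortErrorStatus` is a hypothetical stand-in for `sai_port_error_status_t` with made-up bit values, and `PortErrorCounters` is not a class from the PR.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for sai_port_error_status_t (values illustrative).
enum PortErrorStatus : uint32_t {
    PORT_ERR_MAC_LOCAL_FAULT  = 1u << 0,
    PORT_ERR_MAC_REMOTE_FAULT = 1u << 1,
};

// Error types tracked by this sketch.
const PortErrorStatus kTracked[] = {
    PORT_ERR_MAC_LOCAL_FAULT,
    PORT_ERR_MAC_REMOTE_FAULT,
};

// Per-port state reduced to one counter per error type.
class PortErrorCounters {
public:
    // Increment the counter for every tracked error bit set in `status`.
    void record(uint32_t status)
    {
        for (PortErrorStatus e : kTracked) {
            if (status & e) ++m_counts[e];
        }
    }

    // Number of events seen for `e`; 0 if it never occurred.
    uint32_t count(PortErrorStatus e) const
    {
        auto it = m_counts.find(e);
        return it == m_counts.end() ? 0 : it->second;
    }

private:
    // Enum keys rely on std::hash support for enumerations (C++14 and later).
    std::unordered_map<PortErrorStatus, uint32_t> m_counts;
};
```

The trade-off raised in the thread still applies: with only counters in memory, the last error flags and timestamps are visible in redis but not in an orchagent coredump.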

const sai_port_error_status_t error_status = error.first;
std::string error_name = error.second;

port.m_portOperErrorToEvent[error_status] = PortOperErrorEvent(error_status, error_name);
Collaborator

Not sure this is required.


auto key = pevent->getDbKey();
vector<FieldValueTuple> tuples;
FieldValueTuple tup1("oper_error_status", std::to_string(port.m_oper_error_status));

Please check the latest change to notification structure - opencomputeproject/SAI#2087
Was discussed in the SAI community meeting

Contributor Author

@eddyk-nvidia that is for a future use case. I can't wait for that to merge and start using it, as it's already not meeting the 202411 release timeline.

prgeor (Contributor Author) commented Nov 4, 2024

/azpw run

mssonicbld (Collaborator) commented

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor prgeor force-pushed the port-err branch 3 times, most recently from 023e14f to 6955d48 Compare November 7, 2024 22:49
Signed-off-by: Prince George <prgeor@microsoft.com>
@prsunny prsunny merged commit 956ebd6 into sonic-net:master Nov 8, 2024
17 checks passed
stepanblyschak pushed a commit to stepanblyschak/sonic-swss that referenced this pull request Nov 13, 2024