xcvrd crash observed during boot #11707

Closed
anamehra opened this issue Aug 11, 2022 · 4 comments · Fixed by sonic-net/sonic-platform-daemons#286

Comments

@anamehra
Contributor

Description

During boot on line cards, xcvrd crashes, which causes port optics initialization to fail:

Aug 11 17:09:52.008708 sfd-vt2-lc0 INFO pmon#supervisord 2022-08-11 17:09:52,007 INFO success: xcvrd entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
Aug 11 17:09:56.747253 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd ERROR: execvpe(/usr/sbin/smartctl) failed
Aug 11 17:09:56.747582 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd : [2] No such file or directory
Aug 11 17:09:56.751125 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd ERROR: command '/usr/sbin/smartctl' failed
Aug 11 17:09:56.751354 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd : [116] Stale file handle
Aug 11 17:10:57.423570 sfd-vt2-lc0 NOTICE pmon#xcvrd[121]: CMIS: Starting...
Aug 11 17:10:57.526003 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd Process Process-1:
Aug 11 17:10:57.526805 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
Aug 11 17:10:57.526805 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
Aug 11 17:10:57.526805 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd self.run()
Aug 11 17:10:57.527188 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
Aug 11 17:10:57.527204 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd self._target(*self._args, **self._kwargs)
Aug 11 17:10:57.527212 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1268, in task_worker
Aug 11 17:10:57.527212 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd self.port_dict[lport]['admin_status'] = self.get_port_admin_status(lport)
Aug 11 17:10:57.527223 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1226, in get_port_admin_status
Aug 11 17:10:57.527223 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd admin_status = dict(port_info)['admin_status']
Aug 11 17:10:57.527236 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd KeyError: 'admin_status'
Aug 11 17:10:57.527274 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd Starting
Aug 11 17:10:57.929726 sfd-vt2-lc0 INFO pmon#supervisord: xcvrd DBG _optics_init_once:OPTICS_INIT_ONCE: start one time optics lib initialization

Steps to reproduce the issue:

  1. Boot the image on line cards.
  2. Run ps aux | grep xcvrd and check that all 3 xcvrd processes are running.
  3. Observe that only 2 are running and that syslog shows the crash.

Describe the results you received:

Front panel ports failed to come oper up.

Describe the results you expected:

No xcvrd crash; front panel ports should come oper up.

Output of show version:

sha1 used to build the image:
sonic-net/sonic-buildimage-msft@b6bfd6a



@anamehra
Contributor Author

anamehra commented Aug 11, 2022

@abdosi, @prsunny, FYI:

I used the following fix in xcvrd, line 1226, to resolve the issue:

1226c1226
<             admin_status = dict(port_info)['admin_status']
---
>             admin_status = dict(port_info).get('admin_status', 'down')
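
A minimal sketch of what the one-line change does, assuming port_info is a sequence of (field, value) pairs as the dict(port_info) call in the traceback suggests. The helper name admin_status_or_default and the sample data are hypothetical and are not part of xcvrd:

```python
def admin_status_or_default(port_info, default="down"):
    """Return the port's admin_status, or `default` when the field is absent.

    Hypothetical stand-alone helper; it only mirrors the .get() change
    proposed for line 1226 of xcvrd.py.
    """
    return dict(port_info).get("admin_status", default)


# Hypothetical field/value pairs modelled on the PORT entries in this issue.
configured_port = [("admin_status", "up"), ("alias", "Eth0/2/31")]
unconfigured_port = [("alias", "Eth0/2/32")]

print(admin_status_or_default(configured_port))    # up
print(admin_status_or_default(unconfigured_port))  # down

# The original indexing form raises KeyError for the unconfigured port.
try:
    dict(unconfigured_port)["admin_status"]
except KeyError as err:
    print("KeyError:", err)
```

With .get(), a port that has no admin_status entry is reported as "down" instead of terminating the worker process.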

@yxieca
Contributor

yxieca commented Aug 17, 2022

@anamehra would you mind creating a PR to address the issue, since you have already identified the fix? @vdahiya12 and/or @prgeor to review.

yxieca added the "Triaged" label Aug 17, 2022

@prgeor
Contributor

prgeor commented Aug 23, 2022

Instead of fixing this crash in xcvrd, we should find out why the 'admin_status' field is missing from the PORT table of CONFIG_DB. xcvrd starts running this code only after PortConfigDone: https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1295

@anamehra
Contributor Author

> Instead of fixing this crash in xcvrd, we should find out why the 'admin_status' field is missing from the PORT table of CONFIG_DB. xcvrd starts running this code only after PortConfigDone: https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L1295

Hi Prince, what I observed while debugging is that, at the point of the crash, the data being used comes from CONFIG_DB.
For Ethernet ports that are not configured for far-end connectivity, the admin_status field is missing from the config, and that is what caused the crash. I think that if the field is missing, admin_status is treated as "down" by default.

For example, Ethernet31 is configured for a far end and has this field populated, while Ethernet32, which was not in use, is missing it.

    "Ethernet31": {
        "admin_status": "up",
        "alias": "Eth0/2/31",
        "asic_port_name": "Eth31-ASIC2",
        "description": "ARISTA32T3:Ethernet1",
        "fec": "rs",
        "index": "31",
        "lanes": "512,513,514,515",
        "mtu": "9100",
        "pfc_asym": "off",
        "role": "Ext",
        "speed": "100000",
        "tpid": "0x8100"
    },
    "Ethernet32": {
        "alias": "Eth0/2/32",
        "asic_port_name": "Eth32-ASIC2",
        "description": "Eth0/2/32",
        "fec": "rs",
        "index": "32",
        "lanes": "264,265,266,267",
        "mtu": "9100",
        "pfc_asym": "off",
        "role": "Ext",
        "speed": "100000",
        "tpid": "0x8100"
    },
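
To check a configuration for other ports in the same state, something like the following could be used. It is only a sketch: it assumes the port entries sit under a top-level "PORT" key in a JSON dump of the config, and the function name ports_missing_field and the trimmed sample data are hypothetical.

```python
import json


def ports_missing_field(port_table, field="admin_status"):
    """Return the sorted names of ports whose entry lacks `field`."""
    return sorted(name for name, attrs in port_table.items() if field not in attrs)


# Trimmed, hypothetical sample modelled on the two entries above.
sample_config = json.loads("""
{
    "PORT": {
        "Ethernet31": {"admin_status": "up", "alias": "Eth0/2/31"},
        "Ethernet32": {"alias": "Eth0/2/32"}
    }
}
""")

print(ports_missing_field(sample_config["PORT"]))  # ['Ethernet32']
```

Any port listed by this check would hit the KeyError in get_port_admin_status unless the lookup falls back to a default.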

@abdosi
