[Arista] Increase switch PCIe timeout for 7060 #9248

Merged
merged 1 commit into from
Dec 17, 2021

Conversation

zzhiyuan
Contributor

The platform-init, similar to hwsku-init, would be scripts that need to
be called for a specific platform.

Why I did it

The Arista 7060 platform has a rare PCIe timeout that cannot be reproduced on demand and that may be resolved by increasing the switch PCIe completion timeout value. To do this, we call a script for this platform that increases the PCIe timeout on boot-up.

No issues are expected from the setpci command. From the PCIe spec:

"Software is permitted to change the value in this field at any
time. For Requests already pending when the Completion
Timeout Value is changed, hardware is permitted to use either
the new or the old value for the outstanding Requests, and is
permitted to base the start time for each Request either on when
this value was changed or on when each request was issued. "
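
For reference, the field quoted above is the Completion Timeout Value in bits 3:0 of the PCIe Device Control 2 register. The following is a minimal sketch of how a new encoding can be merged into a register value without touching the other bits. The function names are hypothetical, and the conversation does not show which encoding the platform script writes; the 210ms figure in the testing notes suggests the 65ms to 210ms range, but that is an inference.

```python
# Completion Timeout Value encodings for Device Control 2, bits 3:0,
# as listed in the PCIe Base Specification. Other encodings are reserved.
COMPLETION_TIMEOUT_RANGES = {
    0x0: "50us to 50ms (default)",
    0x1: "50us to 100us",
    0x2: "1ms to 10ms",
    0x5: "16ms to 55ms",
    0x6: "65ms to 210ms",
    0x9: "260ms to 900ms",
    0xA: "1s to 3.5s",
    0xD: "4s to 13s",
    0xE: "17s to 64s",
}

CTO_VALUE_MASK = 0x000F  # bits 3:0 of the 16-bit Device Control 2 register


def with_completion_timeout(devctl2: int, encoding: int) -> int:
    """Return devctl2 with only the Completion Timeout Value field replaced."""
    if encoding not in COMPLETION_TIMEOUT_RANGES:
        raise ValueError(f"reserved encoding: {encoding:#x}")
    return (devctl2 & ~CTO_VALUE_MASK & 0xFFFF) | encoding


def describe_completion_timeout(devctl2: int) -> str:
    """Return the timeout range selected by a Device Control 2 value."""
    return COMPLETION_TIMEOUT_RANGES.get(devctl2 & CTO_VALUE_MASK, "reserved")
```

A read-modify-write like this is what the `value:mask` form of setpci performs on the device.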

How I did it

Add "platform-init" support to the swss docker, called the same way as "hwsku-init" but applied to every device belonging to a platform. The script resides in the platform's device data folder.

Additionally, add the pciutils dependency to docker-orchagent so it can run the setpci commands.
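
The hook pattern described above can be sketched roughly as follows. This is an illustration in Python, not the actual change, which lives in the container's init script. The two paths and both function names are assumptions; the description only says the script sits in the device data folder and is invoked like hwsku-init.

```python
import os
import subprocess

# Assumed in-container locations of the optional init hooks (hypothetical).
INIT_HOOKS = (
    "/usr/share/sonic/platform/platform-init",  # per-platform hook
    "/usr/share/sonic/hwsku/hwsku-init",        # per-HWSKU hook
)


def runnable_hooks(paths=INIT_HOOKS):
    """Return the hooks that exist as executable regular files, in order."""
    return [p for p in paths if os.path.isfile(p) and os.access(p, os.X_OK)]


def run_init_hooks(paths=INIT_HOOKS):
    """Run each available hook; platforms without a hook are skipped."""
    for hook in runnable_hooks(paths):
        # check=False: a failing hook does not abort the remaining startup.
        subprocess.run([hook], check=False)
```

Because a missing or non-executable hook is simply skipped, platforms that do not ship a platform-init script are unaffected.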

How to verify it

After an Arista 7060 boots, run:
lspci -vv -s 01:00.0 | grep -i "devctl2"
to check that the timeout has changed.
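
To automate this check, the DevCtl2 line can be parsed from the lspci output. The sketch below assumes the usual `DevCtl2: Completion Timeout: <range>, ...` layout printed by `lspci -vv`; the exact trailing fields vary between pciutils versions, and the sample line is illustrative rather than captured from a real Arista 7060. The function name is hypothetical.

```python
import re

# Matches the range text that follows "Completion Timeout:" on the DevCtl2 line.
_CTO_RE = re.compile(r"DevCtl2:\s*Completion Timeout:\s*([^,]+)", re.IGNORECASE)


def completion_timeout_from_lspci(text: str):
    """Return the Completion Timeout range from `lspci -vv` text, or None."""
    match = _CTO_RE.search(text)
    return match.group(1).strip() if match else None


# Illustrative line only; not real device output.
SAMPLE = "\t\tDevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled"
```

Passing the text produced by `lspci -vv -s 01:00.0` to `completion_timeout_from_lspci` yields the range string to compare against the expected value.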

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog


@zzhiyuan zzhiyuan changed the title Add platform-init support to docker-orchagent [Arista] Increase switch PCIe timeout for 7060 Nov 16, 2021
@zzhiyuan
Contributor Author

/azp run Azure.sonic-buildimage

@azure-pipelines

Commenter does not have sufficient privileges for PR 9248 in repo Azure/sonic-buildimage

@zzhiyuan
Contributor Author

Testing done:

  1. On the 7060 platform, changed the counter poll interval to 10ms.
  2. With a 50ms PCIe timeout, saw no errors.
  3. With a 210ms PCIe timeout, saw no errors.
  4. Ran a script that alternates the PCIe timeout from 50ms to 210ms and back again repeatedly with no sleep: in an hour of running, saw 18 occurrences of "getResAvailableCounters: Failed to get availability for object_type 63".
  5. Ran a script that alternates the PCIe timeout from 50ms to 210ms and back again repeatedly and echoes the loop count: in several hours of running, with 500,000+ loops, did not see any errors.

From these findings I believe there is no danger in changing the timeout on a production device.
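
The alternating test in steps 4 and 5 can be sketched as below. This is not the script that was actually run. The encodings 0x0 and 0x6 are assumed from the spec ranges that end at 50ms and 210ms, the setpci register expression is an assumption about how the field could be written, and all function names are hypothetical.

```python
import subprocess
from typing import Callable, Iterable

# Assumed Completion Timeout Value encodings (Device Control 2, bits 3:0).
CTO_50MS = 0x0   # 50us to 50ms
CTO_210MS = 0x6  # 65ms to 210ms


def alternate_timeouts(
    write_timeout: Callable[[int], None],
    loops: int,
    values: Iterable[int] = (CTO_50MS, CTO_210MS),
) -> int:
    """Write each value in turn, back to back, for `loops` full cycles."""
    values = tuple(values)
    writes = 0
    for _ in range(loops):
        for value in values:
            write_timeout(value)
            writes += 1
    return writes


def setpci_writer(device: str = "01:00.0") -> Callable[[int], None]:
    """Build a writer that updates only bits 3:0 via setpci's value:mask form."""
    def write(value: int) -> None:
        # Offset 0x28 in the PCI Express capability is Device Control 2.
        subprocess.run(
            ["setpci", "-s", device, f"CAP_EXP+28.w={value:04x}:000f"],
            check=True,
        )
    return write
```

On a test device, `alternate_timeouts(setpci_writer(), loops=500000)` would approximate step 5; passing a stub writer allows the loop to be exercised without hardware.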

@sujinmkang
Collaborator

sujinmkang commented Dec 2, 2021

/azp run Azure.sonic-buildimage

@azure-pipelines

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@sujinmkang sujinmkang self-assigned this Dec 3, 2021
@sujinmkang sujinmkang self-requested a review December 17, 2021 16:19
@sujinmkang sujinmkang self-requested a review December 17, 2021 16:27
@sujinmkang
Collaborator

@zzhiyuan is x86_64-arista_7060_cx32s the only platform sku applicable for this change?

@sujinmkang sujinmkang merged commit a6d0a27 into sonic-net:master Dec 17, 2021
zzhiyuan added a commit to zzhiyuan/sonic-buildimage that referenced this pull request Jan 19, 2022
Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
abdosi pushed a commit that referenced this pull request Mar 2, 2022
@gechiang
Collaborator

It looks like this fix is already included in the 20191130 branch but is missing from the 202012 and 20181130 branches.
202205 was branched after this fix already existed, so it has the changes by default.

qiluo-msft pushed a commit that referenced this pull request Nov 23, 2022
richardyu-ms pushed a commit to richardyu-ms/sonic-buildimage that referenced this pull request Nov 25, 2022
yxieca pushed a commit that referenced this pull request Jan 25, 2023
5 participants