Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[action] [PR:17567] [Mellanox] Disable SSD NCQ on Mellanox platforms #17960

Merged
merged 1 commit into from
Jan 31, 2024

Conversation

mssonicbld
Copy link
Collaborator

Why I did it

Based on some research some products might experience an occasional IO failures in the communication between CPU and SSD because of NCQ.
There seems to be a problem between some kernel versions and some SATA controllers.

Syslog error message examples:

  • Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
  • Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".

Some vendors already disabled NCQ on their platforms in SONiC due to similar issue:

Also there are other discussions on Debian/Ubuntu forums about similar issues and it was suggested to disable NCQ:

Work item tracking
  • Microsoft ADO (number only):

How I did it

Add a kernel parameter to tell libata to disable NCQ

How to verify it

Use FIO tool - fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

Test results with NCQ enabled:

 READ: bw=128MiB/s (135MB/s), 128MiB/s-128MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1924-1924msec
WRITE: bw=131MiB/s (138MB/s), 131MiB/s-131MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1924-1924msec
…
 READ: bw=130MiB/s (136MB/s), 130MiB/s-130MiB/s (136MB/s-136MB/s), io=247MiB (259MB), run=1902-1902msec
WRITE: bw=133MiB/s (139MB/s), 133MiB/s-133MiB/s (139MB/s-139MB/s), io=253MiB (265MB), run=1902-1902msec
…
 READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1919-1919msec
WRITE: bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1919-1919msec

Test results with NCQ disabled:

 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2354-2354msec
WRITE: bw=107MiB/s (113MB/s), 107MiB/s-107MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2354-2354msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305
  • 202311

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

- Why I did it
Based on some research some products might experience an occasional IO failures in the communication between CPU and SSD because of NCQ.
There seems to be a problem between some kernel versions and some SATA controllers.

Syslog error message examples:

Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
Some vendors already disabled NCQ on their platforms in SONiC due to similar issue:

[Arista] Disable ATA NCQ for a few products sonic-net#13739 [Arista] Disable ATA NCQ for a few products
[Arista] Disable SSD NCQ on DCS-7050CX3-32S sonic-net#13964 [Arista] Disable SSD NCQ on DCS-7050CX3-32S
Also there are other discussions on Debian/Ubuntu forums about similar issues and it was suggested to disable NCQ:

https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous

- How I did it
Add a kernel parameter to tell libata to disable NCQ

- How to verify it
Use FIO tool - fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
@mssonicbld
Copy link
Collaborator Author

Original PR: #17567

@mssonicbld mssonicbld merged commit 15a0f91 into sonic-net:202305 Jan 31, 2024
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants