[action] [PR:17567] [Mellanox] Disable SSD NCQ on Mellanox platforms #17960

Merged 1 commit on Jan 31, 2024

Commits on Jan 31, 2024

  1. [Mellanox] Disable SSD NCQ on Mellanox platforms (sonic-net#17567)

    - Why I did it
    Research indicates that some products may experience occasional I/O failures in communication between the CPU and the SSD because of NCQ (Native Command Queuing).
    The problem appears to involve an interaction between some kernel versions and some SATA controllers.
    
    Syslog error message examples:
    
    Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
    Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
    Some vendors have already disabled NCQ on their platforms in SONiC because of similar issues:
    
    [Arista] Disable ATA NCQ for a few products (sonic-net#13739)
    [Arista] Disable SSD NCQ on DCS-7050CX3-32S (sonic-net#13964)
    There are also discussions on Debian/Ubuntu forums about similar issues, where disabling NCQ was suggested:
    
    https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous
    
    - How I did it
    Add a kernel parameter that tells libata to disable NCQ.
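
The commit text does not quote the exact parameter. As a minimal sketch, assuming it is the standard libata option `libata.force=noncq` (optionally with a `PORT[.DEVICE]:` prefix), the hypothetical helper `has_noncq` below checks whether a kernel command line string carries that option. The helper name and the sample command line are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the kernel command line passed as $1
# contains a libata.force option whose value list includes "noncq".
# Assumption: the parameter has the form libata.force=[ID:]VAL[,[ID:]VAL...].
has_noncq() {
    for arg in $1; do
        case "$arg" in
            libata.force=*)
                vals=${arg#libata.force=}
                old_ifs=$IFS
                IFS=,
                for v in $vals; do
                    # Drop an optional "PORT[.DEVICE]:" prefix before comparing,
                    # so "1.00:noncq" matches but "noncqtrim" does not.
                    if [ "${v##*:}" = "noncq" ]; then
                        IFS=$old_ifs
                        return 0
                    fi
                done
                IFS=$old_ifs
                ;;
        esac
    done
    return 1
}

# Example with a literal (made-up) command line; on a live system,
# pass "$(cat /proc/cmdline)" instead.
if has_noncq "root=/dev/sda1 libata.force=noncq quiet"; then
    echo "NCQ disabled via kernel command line"
fi
```

Matching the exact value rather than a substring keeps the check from treating `noncqtrim` (which only disables queued TRIM) as a full NCQ disable.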
    
    - How to verify it
    Run the fio tool to stress the SSD:
    fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
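
After the fio run, one way to confirm the result is to look for the NCQ failure messages quoted above in the kernel log. The following is a sketch under that assumption, not part of the original change; the function name `count_ncq_failures` is hypothetical, and it only counts lines that match the "failed command: READ/WRITE FPDMA QUEUED" pattern from the syslog examples.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and prints the number of
# lines reporting a failed NCQ read or write command.
count_ncq_failures() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c -E 'failed command: (READ|WRITE) FPDMA QUEUED' || true
}

# Usage on a live system (may require root):
#   dmesg | count_ncq_failures
# A count of 0 after the stress run suggests the errors no longer occur.
```

On many systems the effective queue depth can also be read from `/sys/block/<device>/device/queue_depth`, where a value of 1 is expected once NCQ is disabled; the device name depends on the platform.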
    volodymyrsamotiy authored and mssonicbld committed Jan 31, 2024
    Commit 158e5fe