Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix QAT allocation failure return value #9788

Merged
merged 1 commit into from
Jan 6, 2020

Conversation

behlendorf
Copy link
Contributor

Motivation and Context

Issue #9784

Description

When qat_compress() fails to allocate the required contigeous memory
it mistakenly returns success. This prevents the fallback software
compression from taking over and (un)compressing the block.

Resolve the issue by correctly setting the local 'status' variable
on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.

How Has This Been Tested?

Visually inspected. Unfortunately, I don't currently have access to
a qat accelerator to verify the fix.

@wli5 @cfzhu would it be possible for you to review and test this.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Dec 30, 2019
@behlendorf behlendorf requested a review from tcaputi December 30, 2019 18:47
@codecov
Copy link

codecov bot commented Dec 31, 2019

Codecov Report

Merging #9788 into master will increase coverage by <1%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #9788    +/-   ##
========================================
+ Coverage      79%      80%   +<1%     
========================================
  Files         385      385            
  Lines      121481   121481            
========================================
+ Hits        96461    96600   +139     
+ Misses      25020    24881   -139
Flag Coverage Δ
#kernel 80% <ø> (ø) ⬇️
#user 67% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82e996c...bd2a933. Read the comment docs.

When qat_compress() fails to allocate the required contigeous memory
it mistakenly returns success.  This prevents the fallback software
compression from taking over and (un)compressing the block.

Resolve the issue by correctly setting the local 'status' variable
on all exit paths.  Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#9784
@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing and removed Type: Defect Incorrect behavior (e.g. crash, hang) labels Dec 31, 2019
@wli5
Copy link
Contributor

wli5 commented Dec 31, 2019

@behlendorf thanks for finding and fixing the issue, we are on vacation, will validate it after back.

@AGI-chandler
Copy link

AGI-chandler commented Dec 31, 2019

Thanks but I have a problem building on CentOS 6.10 Linux 2.6.32-754.23.1.el6,

/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:544: error: expected declaration specifiers or ‘...’ before ‘zfs_kernel_param_t’
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c: In function ‘param_set_qat_compress’:
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:547: error: ‘kp’ undeclared (first use in this function)
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:547: error: (Each undeclared identifier is reported only once
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:547: error: for each function it appears in.)
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c: At top level:
/usr/local/src/zfs-0.8.2/module/zfs/qat_compress.c:565: warning: initialization from incompatible pointer type

I ran ./autogen.sh then ./configure --with-qat=/usr/local/src/qat1.7.l.4.7.0-00006 then make and attached the config.log, is there anything else that would be helpful?

I tried to use newer GCC 7 from Software Collections but got an error that it can't build modules,

configure:45666: checking whether modules can be built
configure:45700: cp conftest.c conftest.h build && make modules -C /usr/src/kernels/2.6.32-754.23.1.el6.x86_64 EXTRA_CFLAGS=-Werror -Wframe-larger-than=4096   M=/usr/loc
al/src/zfs-0.8.2/build 
In file included from include/linux/compiler.h:43:0,
                 from include/linux/stddef.h:4,
                 from include/linux/posix_types.h:4,
                 from include/linux/types.h:14,
                 from include/linux/list.h:4,
                 from include/linux/module.h:9,
                 from /usr/local/src/zfs-0.8.2/build/conftest.mod.c:1:
include/linux/compiler-gcc.h:94:1: fatal error: linux/compiler-gcc7.h: No such file or directory
 #include gcc_header(__GNUC__)
 ^~~~
compilation terminated.
make[1]: *** [/usr/src/kernels/2.6.32-754.23.1.el6.x86_64/scripts/Makefile.modpost:117: /usr/local/src/zfs-0.8.2/build/conftest.mod.o] Error 1
make: *** [Makefile:1452: modules] Error 2

Well I know it's an old system, but it is our most important and most expensive, plus CentOS 6's EOL isn't happening for 11 more months. We are planning to replace it within a couple months, though, and will put the latest CentOS on it. So if you can still help out with getting this old machine working it would be appreciated.

@AGI-chandler
Copy link

If I change line 544 to use const struct kernel_param *kp then I just get warnings and the build finishes, but I'm pretty sure that breaks the fix for #9268

@PrivatePuffin

This comment has been minimized.

@behlendorf
Copy link
Contributor Author

This is something we'll want to include in master and 0.8.3 (which does still support CentOS 6). So we'll make sure to resolve any qat build issues.

@AGI-chandler
Copy link

AGI-chandler commented Jan 2, 2020

Thanks, we would like to implement the fix ASAP though so we can read our data. If you know a definitive fix please let us know.

@PrivatePuffin

This comment has been minimized.

@behlendorf
Copy link
Contributor Author

Manually, setting zfs_kernel_param for your build shouldn't cause any issues. Though, it's unclear why the configure check didn't work as intended. I was able to build this PR applied to 0.8.2 with QAT support enabled on CentOS 7 without warnings.

If you know a definitive fix please let us know.

Assuming your systems encountered the low memory error handling problem fixed by this PR then this will prevent any future damaged blocks. Blocks written prior to applying this fix may not have been compressed correctly on disk resulting in the errors reported.

@AGI-chandler
Copy link

Thanks @behlendorf, but what do you mean specifically by "Manually, setting zfs_kernel_param for your build"? Sorry am not very familiar with the code.

Since, after restoring data to the ZFS volume, I am getting a new full backup (which reads all the data), if I see I/O errors then would know that data is corrupted and should be restored from the old backup, and the rest of the data is still ok.

@behlendorf
Copy link
Contributor Author

what do you mean specifically by "Manually, setting zfs_kernel_param for your build"?

Specifically, I was referencing the change you made to resolve the compilation failure. That change shouldn't cause any problems, we'll just need to investigate why it was needed at all.

@AGI-chandler
Copy link

Thanks for clarifying that, we will give it a shot.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jan 6, 2020
@behlendorf behlendorf merged commit bc9cef1 into openzfs:master Jan 6, 2020
@cfzhu
Copy link
Contributor

cfzhu commented Jan 7, 2020

Thanks for fixing the issue, I have run some tests to verify this fix, it works well.

@AGI-chandler
Copy link

@cfzhu what tests did you run to tell it works? It doesn't work for us. We copy a file to QAT-compressed ZFS storage and dc_fails counter increases, there are no messages in the system log indicating the CPU was used to compress the file, and it gets corrupted--when we try to read the file we get input/output errors.

@PrivatePuffin
Copy link
Contributor

@AGI-admin just to be sure, did you make sure you rebooted to make sure all modules are loaded correctly?

@AGI-chandler
Copy link

@Ornias1993 Yes rebooted and updated the kernel, recompiled ZFS against the new kernel and rebooted again

@cfzhu
Copy link
Contributor

cfzhu commented Jan 10, 2020

@AGI-admin I also copy a test file to the ZFS storage with QAT compression , it works well,

[root@qat-server-104 ~]# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 9422905776166652 9423335546751550
name                            type data
comp_requests                   4    0
comp_total_in_bytes             4    0
comp_total_out_bytes            4    0
decomp_requests                 4    0
decomp_total_in_bytes           4    0
decomp_total_out_bytes          4    0
dc_fails                        4    0
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    20
cksum_total_in_bytes            4    1933312
cksum_fails                     4    0
[root@qat-server-104 ~]# cp test /testpool/fs/
[root@qat-server-104 ~]# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 9422905776166652 9423397240391439
name                            type data
comp_requests                   4    6378
comp_total_in_bytes             4    835977216
comp_total_out_bytes            4    145444956
decomp_requests                 4    0
decomp_total_in_bytes           4    0
decomp_total_out_bytes          4    0
dc_fails                        4    0
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    20
cksum_total_in_bytes            4    1933312
cksum_fails                     4    0
[root@qat-server-104 ~]# cp  /testpool/fs/test  testa
[root@qat-server-104 ~]# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 9422905776166652 9423442609172061
name                            type data
comp_requests                   4    8292
comp_total_in_bytes             4    1086849024
comp_total_out_bytes            4    179660916
decomp_requests                 4    1693
decomp_total_in_bytes           4    180093440
decomp_total_out_bytes          4    221904896
dc_fails                        4    0
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    20
cksum_total_in_bytes            4    1933312
cksum_fails                     4    0
[root@qat-server-104 ~]# diff test testa
[root@qat-server-104 ~]#

In addition, I have run test when status was CPA_STATUS_FAIL,

 [root@qat-server-104 zfs]# git diff
 diff --git a/module/os/linux/zfs/qat_compress.c b/module/os/linux/zfs/qat_compress.c
 index ad3ead3..7cf364d 100644
 --- a/module/os/linux/zfs/qat_compress.c
 +++ b/module/os/linux/zfs/qat_compress.c
 @@ -279,6 +279,7 @@ qat_compress_impl(qat_compress_dir_t dir, char *src, int src_len,

        status = QAT_PHYS_CONTIG_ALLOC(&in_pages,
            num_src_buf * sizeof (struct page *));
+        status = CPA_STATUS_FAIL;
        if (status != CPA_STATUS_SUCCESS)
                goto fail;

[root@qat-server-104 zfs]# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 9423816134663902 9423903589269567
name                            type data
comp_requests                   4    0
comp_total_in_bytes             4    0
comp_total_out_bytes            4    0
decomp_requests                 4    0
decomp_total_in_bytes           4    0
decomp_total_out_bytes          4    0
dc_fails                        4    0
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    20
cksum_total_in_bytes            4    1933312
cksum_fails                     4    0
[root@qat-server-104 ~]# cp test /testpool/fs/
[root@qat-server-104 ~]# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 9424064643306808 9424121325531885
name                            type data
comp_requests                   4    0
comp_total_in_bytes             4    0
comp_total_out_bytes            4    0
decomp_requests                 4    0
decomp_total_in_bytes           4    0
decomp_total_out_bytes          4    0
dc_fails                        4    5125
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    20
cksum_total_in_bytes            4    1933312
cksum_fails                     4    0
[root@qat-server-104 ~]# cp /testpool/fs/test  testb
[root@qat-server-104 ~]# cat /proc/spl/kstat/zfs/qat
19 1 0x01 17 4624 9424064643306808 9424161484988615
name                            type data
comp_requests                   4    0
comp_total_in_bytes             4    0
comp_total_out_bytes            4    0
decomp_requests                 4    0
decomp_total_in_bytes           4    0
decomp_total_out_bytes          4    0
dc_fails                        4    10688
encrypt_requests                4    0
encrypt_total_in_bytes          4    0
encrypt_total_out_bytes         4    0
decrypt_requests                4    0
decrypt_total_in_bytes          4    0
decrypt_total_out_bytes         4    0
crypt_fails                     4    0
cksum_requests                  4    20
cksum_total_in_bytes            4    1933312
cksum_fails                     4    0
[root@qat-server-104 ~]# diff test testb
[root@qat-server-104 ~]#

@PrivatePuffin
Copy link
Contributor

@AGI-admin Are you certain your QAT card or chipset is (still) good?
As it surely isn't layer 8, just to make sure it isn't layer 1 in your case...

@AGI-chandler
Copy link

@Ornias1993 It is a new QAT card only 1-2 months old but the server is over 5 years old, still running fine though as far as I can tell... is there a way to know if chipset or other hardware is going bad? I don't notice any other major problems like this. We are planning to replace the server in the next 1-2 months and will move the QAT card+storage to it. I also filed a case with Intel for the QAT card so will see what they say. It could be a different bug that we are experiencing, according to recent notes on #9784

@wli5
Copy link
Contributor

wli5 commented Jan 10, 2020

@AGI-admin if you can run QAT sample code successfully that means QAT hardware is fine. Have you ever run ZFS with QAT successfully before?

@AGI-chandler
Copy link

@wli5 I've tried make samples but there is an error with "undefined reference" so maybe will ask Intel about that. We have ZFS 0.8.1 with QAT running fine so far on our new backup server with Debian 10.

@PrivatePuffin
Copy link
Contributor

Interesting... 1-2 Months, is within the DOA range of hardware...
(most hardware either fails in the first 3 months or after a few year, an inverse normal distribution simply put)

@AGI-chandler
Copy link

Yes and doesn't seem any one can re-create the error with the same data.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 22, 2020
When qat_compress() fails to allocate the required contiguous memory
it mistakenly returns success.  This prevents the fallback software
compression from taking over and (un)compressing the block.

Resolve the issue by correctly setting the local 'status' variable
on all exit paths.  Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9784
Closes openzfs#9788
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020
When qat_compress() fails to allocate the required contiguous memory
it mistakenly returns success.  This prevents the fallback software
compression from taking over and (un)compressing the block.

Resolve the issue by correctly setting the local 'status' variable
on all exit paths.  Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9784
Closes #9788
@behlendorf behlendorf deleted the issue-9784 branch April 19, 2021 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants