-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix kernel panic induced by redacted send #11297
Conversation
For folks who may run into this bug in the future (in older releases), it would be helpful if the PR/commit message contained a description of the bugs that this is fixing, and their impact. I think it's:
|
In the redaction list traversal code, there is a bug in the binary search logic when looking for the resume point. Maxbufid can be decremented to -1, causing us to read the last possible block of the object instead of the one we wanted. This can cause incorrect resume behavior, or possibly even a hang in some cases. In addition, when examining non-last blocks, we can treat the block as being the same size as the last block, causing us to miss entries in the redaction list when determining where to resume. Finally, we were ignoring the case where the resume point was found in the buffer being searched, and resuming from minbufid. All these issues have been corrected, and the code has been significantly simplified to make future issues less likely. Signed-off-by: Paul Dagnelie <pcd@delphix.com>
It looks like this change causes an ASSERT to be hit when running the redacted_send tests. [12368.647136] VERIFY3(redact_block_zb_compare(rb, resume) >= 0) failed (-1 >= 0)
[12368.649085] PANIC at dsl_bookmark.c:1684:dsl_redaction_list_traverse()
[12368.650696] Showing stack for process 3370004
[12368.651814] CPU: 1 PID: 3370004 Comm: redact_list Kdump: loaded Tainted: P OE --------- - - 4.18.0-240.1.1.el8_3.x86_64 #1
[12368.654914] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[12368.656504] Call Trace:
[12368.657137] dump_stack+0x5c/0x80
[12368.657979] spl_panic+0xd3/0xfb [spl]
[12368.669539] dsl_redaction_list_traverse+0x321/0x350 [zfs]
[12368.672040] redact_list_thread+0x7f/0xe0 [zfs]
[12368.674333] thread_generic_wrapper+0x78/0xb0 [spl]
[12368.675547] kthread+0x112/0x130 |
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
The problem is that the assertion is incorrect; we won't necessarily always come out of the binary search after or equal to the record that overlaps with the resume point. In particular, there may be no such record; if we are beyond the final redaction record, for example, the binary search will return the final redaction record, but the resume point will be after it. There may be other situations where this can occur, depending on exactly how the binary search behaves. With that in mind, I've reverted that portion of the change, and am testing it now. |
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
In the redaction list traversal code, there is a bug in the binary search logic when looking for the resume point. Maxbufid can be decremented to -1, causing us to read the last possible block of the object instead of the one we wanted. This can cause incorrect resume behavior, or possibly even a hang in some cases. In addition, when examining non-last blocks, we can treat the block as being the same size as the last block, causing us to miss entries in the redaction list when determining where to resume. Finally, we were ignoring the case where the resume point was found in the buffer being searched, and resuming from minbufid. All these issues have been corrected, and the code has been significantly simplified to make future issues less likely. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes openzfs#11297
In the redaction list traversal code, there is a bug in the binary search logic when looking for the resume point. Maxbufid can be decremented to -1, causing us to read the last possible block of the object instead of the one we wanted. This can cause incorrect resume behavior, or possibly even a hang in some cases. In addition, when examining non-last blocks, we can treat the block as being the same size as the last block, causing us to miss entries in the redaction list when determining where to resume. Finally, we were ignoring the case where the resume point was found in the buffer being searched, and resuming from minbufid. All these issues have been corrected, and the code has been significantly simplified to make future issues less likely. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #11297
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Co-authored-by: John Kennedy <john.kennedy@delphix.com>
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in #11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes #11764
In the redaction list traversal code, there is a bug in the binary search logic when looking for the resume point. Maxbufid can be decremented to -1, causing us to read the last possible block of the object instead of the one we wanted. This can cause incorrect resume behavior, or possibly even a hang in some cases. In addition, when examining non-last blocks, we can treat the block as being the same size as the last block, causing us to miss entries in the redaction list when determining where to resume. Finally, we were ignoring the case where the resume point was found in the buffer being searched, and resuming from minbufid. All these issues have been corrected, and the code has been significantly simplified to make future issues less likely. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes openzfs#11297
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in #11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes #11764
In the redaction list traversal code, there is a bug in the binary search logic when looking for the resume point. Maxbufid can be decremented to -1, causing us to read the last possible block of the object instead of the one we wanted. This can cause incorrect resume behavior, or possibly even a hang in some cases. In addition, when examining non-last blocks, we can treat the block as being the same size as the last block, causing us to miss entries in the redaction list when determining where to resume. Finally, we were ignoring the case where the resume point was found in the buffer being searched, and resuming from minbufid. All these issues have been corrected, and the code has been significantly simplified to make future issues less likely. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes openzfs#11297
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in openzfs#11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes openzfs#11764
This change adds a new test that covers a bug fix in the binary search in the redacted send resume logic that causes a kernel panic. The bug was fixed in #11297. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com> Closes #11764
Signed-off-by: Paul Dagnelie pcd@delphix.com
Motivation and Context
There is a bug in the binary search in the redacted send resume logic. This changes simplifies the code by switching to a per-record binary search, and fixes outstanding issues causing a kernel panic.
How Has This Been Tested?
Passes the zfs-test suite and testing involving redacted send/recv
Types of changes
Checklist:
Signed-off-by
.