Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix kernel panic induced by redacted send #11297

Merged
merged 6 commits into from
Dec 11, 2020
Merged

Conversation

pcd1193182
Copy link
Contributor

Signed-off-by: Paul Dagnelie pcd@delphix.com

Motivation and Context

There is a bug in the binary search in the redacted send resume logic. This changes simplifies the code by switching to a per-record binary search, and fixes outstanding issues causing a kernel panic.

How Has This Been Tested?

Passes the zfs-test suite and testing involving redacted send/recv

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

module/zfs/dsl_bookmark.c Outdated Show resolved Hide resolved
module/zfs/dsl_bookmark.c Outdated Show resolved Hide resolved
module/zfs/dsl_bookmark.c Show resolved Hide resolved
module/zfs/dsl_bookmark.c Show resolved Hide resolved
module/zfs/dsl_bookmark.c Outdated Show resolved Hide resolved
module/zfs/dsl_bookmark.c Outdated Show resolved Hide resolved
@ahrens
Copy link
Member

ahrens commented Dec 7, 2020

For folks who may run into this bug in the future (in older releases), it would be helpful if the PR/commit message contained a description of the bugs that this is fixing, and their impact. I think it's:

  1. maxbufid can be decremented to -1, causing us to read the last possible block of the object (offset 2^64 - 128K to offset 2^64-1) instead of the one we wanted. maxbufid = midbufid - 1; Not sure what the impact is upstream?
  2. when examining non-last blocks, we compare with an entry in the middle of the block, rather than the last entry in the block. redact_block_zb_compare(&buf[last_entry(rl, bufsize, maxbufid)], resume) maxbufid should be midbufid. This can cause us to not invoke the callback for some of the redaction records, when resuming.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 7, 2020
In the redaction list traversal code, there is a bug in the binary search
logic when looking for the resume point. Maxbufid can be decremented to -1,
causing us to read the last possible block of the object instead of the one we
wanted. This can cause incorrect resume behavior, or possibly even a hang in
some cases. In addition, when examining non-last blocks, we can treat the
block as being the same size as the last block, causing us to miss entries in
the redaction list when determining where to resume. Finally, we were ignoring
the case where the resume point was found in the buffer being searched, and
resuming from minbufid. All these issues have been corrected, and the code has
been significantly simplified to make future issues less likely.

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@ahrens ahrens added the Component: Send/Recv "zfs send/recv" feature label Dec 8, 2020
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) Status: Code Review Needed Ready for review and testing and removed Status: Code Review Needed Ready for review and testing Status: Accepted Ready to integrate (reviewed, tested) labels Dec 9, 2020
@behlendorf
Copy link
Contributor

It looks like this change causes an ASSERT to be hit when running the redacted_send tests.

[12368.647136] VERIFY3(redact_block_zb_compare(rb, resume) >= 0) failed (-1 >= 0)
[12368.649085] PANIC at dsl_bookmark.c:1684:dsl_redaction_list_traverse()
[12368.650696] Showing stack for process 3370004
[12368.651814] CPU: 1 PID: 3370004 Comm: redact_list Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[12368.654914] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[12368.656504] Call Trace:
[12368.657137]  dump_stack+0x5c/0x80
[12368.657979]  spl_panic+0xd3/0xfb [spl]
[12368.669539]  dsl_redaction_list_traverse+0x321/0x350 [zfs]
[12368.672040]  redact_list_thread+0x7f/0xe0 [zfs]
[12368.674333]  thread_generic_wrapper+0x78/0xb0 [spl]
[12368.675547]  kthread+0x112/0x130

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@pcd1193182
Copy link
Contributor Author

The problem is that the assertion is incorrect; we won't necessarily always come out of the binary search after or equal to the record that overlaps with the resume point. In particular, there may be no such record; if we are beyond the final redaction record, for example, the binary search will return the final redaction record, but the resume point will be after it. There may be other situations where this can occur, depending on exactly how the binary search behaves. With that in mind, I've reverted that portion of the change, and am testing it now.

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 10, 2020
@behlendorf behlendorf merged commit 7d4b365 into openzfs:master Dec 11, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 23, 2020
In the redaction list traversal code, there is a bug in the binary search
logic when looking for the resume point. Maxbufid can be decremented to -1,
causing us to read the last possible block of the object instead of the one we
wanted. This can cause incorrect resume behavior, or possibly even a hang in
some cases. In addition, when examining non-last blocks, we can treat the
block as being the same size as the last block, causing us to miss entries in
the redaction list when determining where to resume. Finally, we were ignoring
the case where the resume point was found in the buffer being searched, and
resuming from minbufid. All these issues have been corrected, and the code has
been significantly simplified to make future issues less likely.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#11297
behlendorf pushed a commit that referenced this pull request Dec 23, 2020
In the redaction list traversal code, there is a bug in the binary search
logic when looking for the resume point. Maxbufid can be decremented to -1,
causing us to read the last possible block of the object instead of the one we
wanted. This can cause incorrect resume behavior, or possibly even a hang in
some cases. In addition, when examining non-last blocks, we can treat the
block as being the same size as the last block, causing us to miss entries in
the redaction list when determining where to resume. Finally, we were ignoring
the case where the resume point was found in the buffer being searched, and
resuming from minbufid. All these issues have been corrected, and the code has
been significantly simplified to make future issues less likely.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11297
palashgandhi pushed a commit to palashgandhi/zfs that referenced this pull request Mar 17, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
behlendorf pushed a commit that referenced this pull request Mar 20, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in #11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes #11764
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
In the redaction list traversal code, there is a bug in the binary search
logic when looking for the resume point. Maxbufid can be decremented to -1,
causing us to read the last possible block of the object instead of the one we
wanted. This can cause incorrect resume behavior, or possibly even a hang in
some cases. In addition, when examining non-last blocks, we can treat the
block as being the same size as the last block, causing us to miss entries in
the redaction list when determining where to resume. Finally, we were ignoring
the case where the resume point was found in the buffer being searched, and
resuming from minbufid. All these issues have been corrected, and the code has
been significantly simplified to make future issues less likely.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#11297
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
adamdmoss pushed a commit to adamdmoss/zfs that referenced this pull request Apr 10, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
@PaulZ-98 PaulZ-98 mentioned this pull request Apr 26, 2021
12 tasks
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
ghost pushed a commit to truenas/zfs that referenced this pull request May 7, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
ghost pushed a commit to truenas/zfs that referenced this pull request May 13, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
behlendorf pushed a commit that referenced this pull request May 20, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in #11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes #11764
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
In the redaction list traversal code, there is a bug in the binary search
logic when looking for the resume point. Maxbufid can be decremented to -1,
causing us to read the last possible block of the object instead of the one we
wanted. This can cause incorrect resume behavior, or possibly even a hang in
some cases. In addition, when examining non-last blocks, we can treat the
block as being the same size as the last block, causing us to miss entries in
the redaction list when determining where to resume. Finally, we were ignoring
the case where the resume point was found in the buffer being searched, and
resuming from minbufid. All these issues have been corrected, and the code has
been significantly simplified to make future issues less likely.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#11297
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in openzfs#11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes openzfs#11764
tonyhutter pushed a commit that referenced this pull request Jun 23, 2021
This change adds a new test that covers a bug fix in the binary search
in the redacted send resume logic that causes a kernel panic.
The bug was fixed in #11297.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
Closes #11764
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Send/Recv "zfs send/recv" feature Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants