
Block cloning conditionally destroy ARC buffer #16337

Merged
merged 1 commit into openzfs:master on Aug 2, 2024

Conversation

bwatkinson
Contributor

@bwatkinson bwatkinson commented Jul 10, 2024

dmu_buf_will_clone() calls arc_buf_destroy() if there is an ARC buffer associated with the dbuf. However, this can only be done conditionally. If the previous dirty record's dr_data is pointed at db_buf, then destroying it can lead to a NULL pointer dereference when syncing out the previous dirty record.

This updates dmu_buf_will_clone() to only call arc_buf_destroy() if the previous dirty record's dr_data is not pointing to db_buf. The block clone will still set the dbuf's db_buf and db_data to NULL, but this will not cause any issues as any previous dirty record's dr_data will still be pointing at the ARC buffer.

Updated dmu_buf_will_clone() to conditionally call arc_buf_destroy().

Motivation and Context

dmu_buf_will_clone() always called arc_buf_destroy() if there was an ARC buffer associated with the dbuf. However, this can lead to a NULL pointer dereference, which can occur when a previous dirty record is being synced and its dr_data is pointing at the ARC buffer also pointed to by db_buf.

Description

This updates dmu_buf_will_clone() to only call arc_buf_destroy() if the previous dirty record's dr_data is not pointing to db_buf. The block clone will still set the dbuf's db_buf and db_data to NULL, but this will not cause any issues as any previous dirty record's dr_data will still be pointing at the ARC buffer.

How Has This Been Tested?

Ran the ZTS tests with the bclone and block_cloning tags for 5 iterations without any issues.

Testing was done with kernel 4.18.0-408.el8.x86_64.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@bwatkinson bwatkinson mentioned this pull request Jul 10, 2024
17 tasks
Review comment on module/zfs/dbuf.c (outdated, resolved)
@bwatkinson bwatkinson force-pushed the bclone_arc_buf_destroy branch from 6a55f0c to 330b0e7 Compare July 11, 2024 01:19
@bwatkinson bwatkinson force-pushed the bclone_arc_buf_destroy branch from 330b0e7 to e463c2b Compare July 11, 2024 20:14
@bwatkinson bwatkinson requested a review from amotin July 11, 2024 20:16
Member

@amotin amotin left a comment


It still looks OK to me. I am not sure I like running a heavy I/O test for 3 minutes each time to test for one scenario though. Tests already take too much time. But I'll leave it to somebody who is closer to the tests area to comment. I wonder, since we know where the problem was, could we try to craft something more specific to trigger it?

@bwatkinson
Contributor Author

bwatkinson commented Jul 12, 2024

It still looks OK to me. I am not sure I like running a heavy I/O test for 3 minutes each time to test for one scenario though. Tests already take too much time. But I'll leave it to somebody who is closer to the tests area to comment. I wonder, since we know where the problem was, could we try to craft something more specific to trigger it?

No disagreement from me that running a test for 3 mins is not great... I have direct I/O tests that could also trigger this, but oftentimes it would take running them for 100+ iterations to get it to hit. It is hard to get things to trigger with timing issues like this. This is also why I asked @ixhamza for a test case; the one he supplied us was a good case that could easily duplicate the issue. We might be able to craft something for sure. I just wonder if we will get into the same dilemma of it being such a timing-dependent thing to trigger that we would still wind up having to run multiple iterations of it. Open to ideas though for a better reproducer if we can craft one.

@tonyhutter
Contributor

Looks like the new test case timed out on FreeBSD 13:

Test: /usr/local/share/zfs/zfs-tests/tests/functional/block_cloning/block_cloning_overwrites (run as root) [10:00] [KILLED]

@bwatkinson bwatkinson force-pushed the bclone_arc_buf_destroy branch from e463c2b to a7ba93c Compare July 18, 2024 20:35
@bwatkinson
Contributor Author

Looks like the new test case timed out on FreeBSD 13:

Test: /usr/local/share/zfs/zfs-tests/tests/functional/block_cloning/block_cloning_overwrites (run as root) [10:00] [KILLED]

Yeah, looking at the output, the test never made it past doing the first sync of the pool. I played around with this a bit in a FreeBSD VM with limited resources to mimic the CI runners. The vast majority of the time was spent just doing the initial dd command. I wound up tweaking the block_cloning_overwrites test a bit so it is less intense. I removed setting the dataset record sizes to 4k and reduced the overall file size down to 128M from the original 1G file that was written. This functionally should be exactly the same as the example @ixhamza provided us, but less CPU intensive on the CI runners. Let's see if this now works better with the test runs.

@tonyhutter
Contributor

Looking at the test results it seems like for every ~20 clonefiles/sync:

clonefile -f /testpool/testfs1/file1 /testpool/testfs2/file2

...there's 1 dd:

dd if=/dev/urandom of=/testpool/testfs2/file2 bs=1M count=128

is that ok?

dmu_buf_will_clone() calls arc_buf_destroy() if there is an ARC buffer
associated with the dbuf. However, this can only be done conditionally.
If the previous dirty record's dr_data is pointed at db_buf then
destroying it can lead to a NULL pointer dereference when syncing out
the previous dirty record.

This updates dmu_buf_will_clone() to only call arc_buf_destroy() if the
previous dirty record's dr_data is not pointing to db_buf. The block
clone will still set the dbuf's db_buf and db_data to NULL, but this
will not cause any issues as any previous dirty record's dr_data will
still be pointing at the ARC buffer.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
@bwatkinson bwatkinson force-pushed the bclone_arc_buf_destroy branch from a7ba93c to d3bb628 Compare July 26, 2024 19:18
@bwatkinson
Contributor Author

Looking at the test results it seems like for every ~20 clonefiles/sync:

clonefile -f /testpool/testfs1/file1 /testpool/testfs2/file2

...there's 1 dd:

dd if=/dev/urandom of=/testpool/testfs2/file2 bs=1M count=128

is that ok?

I did some local testing as well on a node I have, and I was seeing about the same thing. I just decided to remove the test case. It has been locally tested with the original test case by @ixhamza to show that, without this patch, the error occurs. We might be able to come up with a test case in the future that would be better at stressing this, but I don't think we should hold off merging a fix for a NULL pointer dereference based on a test case.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Aug 1, 2024
@behlendorf behlendorf merged commit c8184d7 into openzfs:master Aug 2, 2024
24 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
dmu_buf_will_clone() calls arc_buf_destroy() if there is an associated
ARC buffer with the dbuf. However, this can only be done conditionally.
If the previous dirty record's dr_data is pointed at db_buf then
destroying it can lead to a NULL pointer dereference when syncing out
the previous dirty record.

This updates dmu_buf_will_clone() to only call arc_buf_destroy() if the
previous dirty record's dr_data is not pointing to db_buf. The block
clone will still set the dbuf's db_buf and db_data to NULL, but this
will not cause any issues as any previous dirty record's dr_data will
still be pointing at the ARC buffer.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes openzfs#16337