
avoid retrieving unused snapshot props #8077

Merged: 1 commit merged into openzfs:master from snap_iter on Mar 12, 2019

Conversation

alek-p (Contributor) commented Oct 31, 2018

This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Motivation and Context

Currently, when we do zfs rollback and send operations we have to iterate through the full list of snapshots even though we don't end up doing anything with some of those snapshots that the kernel gives us.

Description

To make the snapshot iteration more efficient when possible we added new parameters to the zfs_ioc_snapshot_list_next() ioctl. When passed in they tell us to skip snapshots created before and/or after the specified creation txg. This allows the kernel to avoid returning snapshots to the caller that the caller doesn't need. As a result, we reduce the number of userland/kernel boundary crossings and we also avoid the slow (often read-from-disk) operation of reading in snapshot stats (zfs_ioc_objset_stats_impl()) for the skipped snapshot.
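
As a rough illustration (not code from this patch), a userland caller could pass the new bounds in the ioctl's source nvlist along these lines. The SNAP_ITER_* keys are the ones added by this PR; the zfs_cmd_t/handle setup and the from_txg/to_txg variables are assumed for the example:

```c
/*
 * Hedged sketch only: ask the kernel to skip snapshots created outside
 * [from_txg, to_txg] while iterating with LIST_NEXT. Error handling and
 * most of the zfs_cmd_t setup are omitted.
 */
nvlist_t *range_nvl = fnvlist_alloc();
fnvlist_add_uint64(range_nvl, SNAP_ITER_SKIP_TO_TXG, from_txg);   /* drop snaps born before from_txg */
fnvlist_add_uint64(range_nvl, SNAP_ITER_SKIP_AFTER_TXG, to_txg);  /* drop snaps born after to_txg */

if (zcmd_write_src_nvlist(zhp->zfs_hdl, &zc, range_nvl) == 0) {
	/* Each successful call returns the next snapshot within the requested range. */
	while (zfs_ioctl(zhp->zfs_hdl, ZFS_IOC_SNAPSHOT_LIST_NEXT, &zc) == 0) {
		/* ... process zc.zc_name and zc.zc_objset_stats ... */
	}
}
fnvlist_free(range_nvl);
```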

How Has This Been Tested?

I've run zfs-tests and traced ioctls to confirm that the number of times the zfs_ioc_snapshot_list_next() ioctl is called now depends on which snapshots we are working with.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@alek-p alek-p requested a review from sdimitro October 31, 2018 17:35
@ahrens ahrens added the Status: Design Review Needed Architecture or design is under discussion label Oct 31, 2018
ahrens (Member) commented Oct 31, 2018

This is what channel programs were designed for. Is there any missing functionality that would prevent us from doing this with a channel program, instead of adding more ioctl-specific tweaks?

alek-p (Contributor, Author) commented Oct 31, 2018

The advantage is that we don't have to write a channel program to get some of the efficiency gains. Everyone already using the affected zfs commands would get the benefits when they upgrade.

module/zfs/zfs_ioctl.c: review thread resolved (outdated)
ahrens (Member) commented Oct 31, 2018

@alek-p I'm suggesting that libzfs use channel programs to get the benefit for everyone using the CLI. The user experience would be identical to what you have here. Where you are using the LIST_NEXT ioctl with the new arguments, instead you can use a channel program to find the next snapshot whose birth is within the specified range.

Another consideration (for either the LIST_NEXT ioctl or channel program) is making sure that new userland bits will run on old kernel bits. That may "just work" the way you have it now (with the old kernel ignoring the birth time constraint), but for channel programs libzfs should probably check that the kernel supports channel programs and if not then fall back on LIST_NEXT.

alek-p (Contributor, Author) commented Oct 31, 2018

I see what you mean now; I can check what it would take to make this happen through zcp. I guess the concern is that modifying an existing ioctl could break third-party software.

ahrens (Member) commented Oct 31, 2018

@alek-p I think the main concern with modifying the ioctl is that when upgrading, you can run new libzfs on an old kernel. In that case we still want zfs rollback etc to work.

ahrens (Member) commented Oct 31, 2018

We should also think about how long this ioctl can hold various locks, if you have a lot of snapshots. It looks like we could hold the dp_config_rwlock while skipping over a lot of snapshots. This can prevent spa_sync() from making progress. We'd have the same issue with a channel program as well, and it might be easier to address this with the LIST_NEXT ioctl than with the channel program (at least with the existing channel program functionality).

ahrens (Member) commented Oct 31, 2018

We might want to think at a higher level about how to improve zfs rollback and zfs send -R/-I. For example, we might want to introduce sorted iteration starting from a named snapshot as a first-class concept to channel programs (using ds_prev/next_snap_obj). This could allow us to do some of these tasks in O(snapshots we care about) rather than O(total number of snapshots).

I'm not sure any of my ideas are total replacements for what you've proposed here (it is a pretty simple way to save looking up the snapshot details), but we should probably consider if there are other ways to improve the high-level operations.

Have you measured how much performance improvement you get from this? And out of curiosity, do you know which blocks we avoid reading from disk? Looks like if it's a zvol we save reading its contents (zvol_get_stats), which could be a big win. Otherwise the only blocks I see are for snapshot properties, which are not often used. Looks like we can already avoid getting both zvol stats and props if we don't pass in a zc_nvlist_dst.

alek-p (Contributor, Author) commented Oct 31, 2018

Yeah, it looks like with the current channel program functionality the way to do this would be similar to what I've done: I would basically need to add a zcp-based LIST_NEXT. As you pointed out, that's not ideal; we should have an interface that is O(1).
In instances where we do need to iterate over most/all snapshots, it would also be nice to have an interface that is able to iterate over more than one snapshot at a time. Not sure all of that belongs in this PR.
When I did a simple timing performance test, the results depended on which snapshot was used for the comparison. The reason I was trying to avoid zfs_ioc_objset_stats_impl() calls is that the full snapshot properties aren't often used, so they are likely not cached.

alek-p (Contributor, Author) commented Oct 31, 2018

I've tried using old userland binaries with the new kernel module and that seems to work. It looks like the right thing is happening because of the check for the presence of the src nvlist.

@alek-p alek-p changed the title snapshot iteration should be more efficient avoid retrieving unused snapshot props Nov 2, 2018
alek-p (Contributor, Author) commented Nov 2, 2018

I've run some performance tests to show the impact of not fetching props that are discarded in userspace. This was a simple timing of how long it took zfs to complete the CLI command. The uncached results are the time it takes to complete the command right after the zfs module is loaded, while the cached results are the time it takes to complete the same command again.

                                    cached send -R    uncached send -R    cached rollback    uncached rollback
SSD 10k snaps                       103s              106s                61s                72s
SSD 10k snaps skipped               84s               87s                 31s                32s
10k snaps skipped on SSD, % gain    18%               18%                 49%                55%
HDD 35k snaps                       8m                41.5m               80s                32m
HDD 35k snaps skipped               7m                41m                 7s                 21m
35k snaps skipped on HDD, % gain    12%               2%                  91%                34%

Rollback is benefiting more than send, so there may be more room for optimisation of sends. While testing I noticed that even for a small filesystem that has just 10 snapshots, doing a replicate (-R) send of the first snapshot causes 33 LIST_NEXT ioctl calls, while doing the same thing with this patch causes 24 LIST_NEXT ioctl calls.
The cached vs. uncached results are somewhat surprising, as I expected more of the benefit to come from not having to read the drives than from skipping the ioctls. It would be good to see someone else confirm these results.

alek-p (Contributor, Author) commented Nov 3, 2018

I should have mentioned that the SSD and HDD systems are different physical servers, with the HDD one having 256 GB of RAM, which is 8x the RAM of the SSD system.

ahrens (Member) commented Nov 6, 2018

It looks like the only cases that are improving by more than a minute are the HDD rollback ones. I'm not totally opposed to this change, but I wonder if we can do better in terms of avoiding i/o in these cases. For example, with zfs rollback, if we are not doing -r/-R, we could just check if there are any snapshots after the target one in O(1) time, rather than listing all of them (e.g. with the (hidden) prevsnap property, or by letting the kernel do the check for us). If we wanted, we could even retain the existing behavior (of printing all the problematic snapshots) if the number of snapshots is small.

alek-p (Contributor, Author) commented Nov 8, 2018

To me, the perf test data is showing that this patch has a much higher impact on the rollback command, even in the SSD case. The send case could probably use more optimization, as I alluded to in my previous comment.
Thanks for the suggestion on how to avoid more I/O in the simpler rollbacks. I think what I came up with is a decent balance of avoiding unneeded I/O and preserving the potentially useful prints of younger bookmarks/snapshots.

alek-p (Contributor, Author) commented Dec 12, 2018

I've rebased this PR in order to pick up the ztest patches that make testing more reliable.

ahrens (Member) commented Dec 20, 2018

I'd like to review this but I probably won't have time to until after the new year. Please wait for my review if possible.

alek-p (Contributor, Author) commented Jan 5, 2019

In this latest update, I made additional optimizations for the zfs send use cases. The number of snapshot LIST_NEXT calls is now down to 6 (from 33) in the simple 10-snapshot test case described at the bottom of this comment: #8077 (comment)
I'll work on getting more zfs send performance data.

@behlendorf behlendorf added the Type: Feature Feature request or new feature label Jan 18, 2019
loli10K (Contributor) commented Jan 18, 2019

This seems to work great, thank you. Tested on my Raspberry Pi with 10K snapshots.

master:

root@linux:~# time zfs send -nv -i gorgon/pull-8077-10K@s9999 gorgon/pull-8077-10K@s10000
send from @s9999 to gorgon/pull-8077-10K@s10000 estimated size is 624B
total estimated size is 624B

real	1m31.617s
user	0m7.350s
sys	1m24.100s
root@linux:~# time zfs send -nv -i gorgon/pull-8077-10K@s9999 gorgon/pull-8077-10K@s10000
send from @s9999 to gorgon/pull-8077-10K@s10000 estimated size is 624B
total estimated size is 624B

real	1m31.470s
user	0m7.190s
sys	1m24.210s
root@linux:~# time zfs send -nv -i gorgon/pull-8077-10K@s9999 gorgon/pull-8077-10K@s10000
send from @s9999 to gorgon/pull-8077-10K@s10000 estimated size is 624B
total estimated size is 624B

real	1m32.138s
user	0m7.680s
sys	1m24.450s

running this PR:

root@linux:~# time zfs send -nv -i gorgon/pull-8077-10K@s9999 gorgon/pull-8077-10K@s10000
send from @s9999 to gorgon/pull-8077-10K@s10000 estimated size is 624B
total estimated size is 624B

real	0m9.205s
user	0m0.020s
sys	0m8.780s
root@linux:~# time zfs send -nv -i gorgon/pull-8077-10K@s9999 gorgon/pull-8077-10K@s10000
send from @s9999 to gorgon/pull-8077-10K@s10000 estimated size is 624B
total estimated size is 624B

real	0m8.486s
user	0m0.000s
sys	0m8.480s
root@linux:~# time zfs send -nv -i gorgon/pull-8077-10K@s9999 gorgon/pull-8077-10K@s10000
send from @s9999 to gorgon/pull-8077-10K@s10000 estimated size is 624B
total estimated size is 624B

real	0m8.482s
user	0m0.000s
sys	0m8.480s
root@linux:~# 

Resolved review threads: include/sys/fs/zfs.h (outdated), lib/libzfs/libzfs_iter.c, lib/libzfs/libzfs_sendrecv.c (outdated), module/zfs/zfs_ioctl.c (4 threads)
@alek-p alek-p force-pushed the snap_iter branch 2 times, most recently from 585f3bc to 6cc638c Compare February 7, 2019 22:45
alek-p (Contributor, Author) commented Feb 8, 2019

It turns out that skipping snapshots when doing a zfs send --replicate/-R isn't appropriate, so this latest patch no longer optimizes "replicate" sends.

alek-p (Contributor, Author) commented Feb 11, 2019

It would be awesome if someone from @delphix could run their internal zfs send/recv testing on this patch, as it does change the way we do snapshot iteration for some sends and rollbacks.

tcaputi (Contributor) left a comment:

Looks good to me. Would be nice if we could skip right to the dataset we wanted and iterate from there. Maybe as future work.

@alek-p alek-p force-pushed the snap_iter branch 2 times, most recently from 8eff959 to b300ce6 Compare February 19, 2019 23:11
@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing and removed Status: Design Review Needed Architecture or design is under discussion labels Feb 20, 2019
behlendorf (Contributor) left a comment:

Sorry about the delay getting this reviewed. The current version of this looks good, only a few cosmetic issues.

Resolved review threads (outdated): cmd/zfs/zfs_main.c, lib/libzfs/libzfs_iter.c
@alek-p alek-p force-pushed the snap_iter branch 2 times, most recently from 777335c to 8ee991a Compare February 25, 2019 14:29
@@ -2313,7 +2313,8 @@ zfs_ioc_dataset_list_next(zfs_cmd_t *zc)
* inputs:
* zc_name name of filesystem
* zc_cookie zap cursor
* zc_nvlist_dst_size size of buffer for property nvlist
* zc_nvlist_dst property nvlist
* zc_nvlist_dst_size size of property nvlist
Member review comment:

did you mean to describe zc_nvlist_src[_size] here?

* zfs_iter_snapshot/bookmark iteration so we can fail fast and
* avoid iterating over the rest of the younger objects
*/
(void) fprintf(stderr, gettext("these are the first %d "
Member review comment:

"first" implies a sorting which I don't think is relevant to the user (it's in the ZAP hash order, right?). Maybe we should instead say something like "Output limited to %d snapshots/bookmarks"?

* the "list next snapshot" ioctl
*/
#define SNAP_ITER_SKIP_AFTER_TXG "snap_iter_skip_after_txg"
#define SNAP_ITER_SKIP_TO_TXG "snap_iter_skip_to_txg"
Member review comment:

how about SKIP_AFTER_TXG and SKIP_BEFORE_TXG, or perhaps better yet, SNAP_ITER_MAX_TXG and SNAP_ITER_MIN_TXG?

@@ -141,7 +141,7 @@ zfs_iter_filesystems(zfs_handle_t *zhp, zfs_iter_f func, void *data)
*/
int
zfs_iter_snapshots(zfs_handle_t *zhp, boolean_t simple, zfs_iter_f func,
void *data)
void *data, nvlist_t *range_nvl)
Member review comment:

I think that this should take the min and max as uint64_t arguments, rather than making the caller construct the range_nvl. This minimizes the code that needs to know about how the arguments are marshalled to the kernel.
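
For concreteness, the suggested interface might look roughly like this (a sketch only; the min_txg/max_txg parameter names are assumed here, not taken from the patch):

```c
/*
 * Sketch of the suggested signature: the caller passes plain txg bounds
 * and libzfs builds the nvlist for the ioctl internally. A value of 0
 * would mean "no bound on this side".
 */
int
zfs_iter_snapshots(zfs_handle_t *zhp, boolean_t simple, zfs_iter_f func,
    void *data, uint64_t min_txg, uint64_t max_txg);
```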

@@ -282,7 +289,8 @@ zfs_snapshot_compare(const void *larg, const void *rarg)
}

int
zfs_iter_snapshots_sorted(zfs_handle_t *zhp, zfs_iter_f callback, void *data)
zfs_iter_snapshots_sorted(zfs_handle_t *zhp, zfs_iter_f callback, void *data,
nvlist_t *range_nvl)
Member review comment:

same here

@@ -141,7 +141,7 @@ zfs_iter_filesystems(zfs_handle_t *zhp, zfs_iter_f func, void *data)
*/
int
zfs_iter_snapshots(zfs_handle_t *zhp, boolean_t simple, zfs_iter_f func,
void *data)
void *data, nvlist_t *range_nvl)
Member review comment:

We don't care about changing this interface and breaking 3rd party libzfs consumers, right? They should be using libzfs_core, the CLI, or the kernel interface directly if they want a stable interface?

Contributor (Author) reply:

Hmm, good point; I could create zfs_iter_snapshots_impl() and use that so we leave the signature of zfs_iter_snapshots() as-is, but it sounds like that's not necessary.
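
For illustration, the wrapper being described could look something like this (hypothetical; zfs_iter_snapshots_impl() is only the name proposed above, and the extra range argument is assumed):

```c
/*
 * Hypothetical compatibility shim: keep the old zfs_iter_snapshots()
 * signature for third-party libzfs consumers and forward to an _impl
 * variant that understands an optional txg range.
 */
int
zfs_iter_snapshots(zfs_handle_t *zhp, boolean_t simple, zfs_iter_f func,
    void *data)
{
	/* NULL range nvlist: iterate over every snapshot, as before. */
	return (zfs_iter_snapshots_impl(zhp, simple, func, data, NULL));
}
```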

if (!sd->replicate && fromsnap_txg != 0) {
range_nvl = fnvlist_alloc();
fnvlist_add_uint64(range_nvl, SNAP_ITER_SKIP_TO_TXG,
fromsnap_txg_save);
Member review comment:

Why does the check use fromsnap_txg, but we "skip to" fromsnap_txg_save? From reading the code, it seems like fromsnap_txg is the txg of the fromsnap in this fs, whereas fromsnap_txg_save is the txg of the fromsnap in the parent filesystem. So we should be using fromsnap_txg in both places here? The test case would be one where the parent and child fromsnap's have different txg's (i.e. snapshots with the same name were created at different times).

if (range_nvl == NULL)
range_nvl = fnvlist_alloc();
fnvlist_add_uint64(range_nvl, SNAP_ITER_SKIP_AFTER_TXG,
sd->tosnap_txg);
Member review comment:

It looks like if tosnap_txg is nonzero then sd->tosnap_txg will be the same, but I think this assumption makes the code harder to understand. I think we should be using the local tosnap_txg here.

module/zfs/zfs_ioctl.c: review thread resolved
error = zfs_ioc_objset_stats_impl(zc, ossnap);
if ((skip_after != 0 && dsl_get_creationtxg(ds) > skip_after) ||
(skip_to != 0 && dsl_get_creationtxg(ds) < skip_to)) {
dsl_dataset_rele(ds, FTAG);
Member review comment:

maybe it's just github, but the indentation here looks wrong (one too many tabs?)

@alek-p alek-p force-pushed the snap_iter branch 2 times, most recently from f5b38e3 to 5ea09e4 Compare March 1, 2019 07:08
alek-p (Contributor, Author) commented Mar 1, 2019

Looks like I need to rework how replicate send is detected; I should be able to get to that by Monday.

@alek-p alek-p force-pushed the snap_iter branch 2 times, most recently from 9c5df87 to 4d13ad4 Compare March 5, 2019 01:56
alek-p (Contributor, Author) commented Mar 5, 2019

Thanks for the review, @ahrens! I've updated the PR to address your comments.

Resolved review threads (outdated): lib/libzfs/libzfs_sendrecv.c, module/zfs/zfs_ioctl.c
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
codecov bot commented Mar 12, 2019

Codecov Report

Merging #8077 into master will increase coverage by <.01%.
The diff coverage is 89.1%.


@@            Coverage Diff             @@
##           master    #8077      +/-   ##
==========================================
+ Coverage   78.57%   78.58%   +<.01%     
==========================================
  Files         380      380              
  Lines      116057   116069      +12     
==========================================
+ Hits        91194    91208      +14     
+ Misses      24863    24861       -2
Flag Coverage Δ
#kernel 79.03% <86.84%> (ø) ⬆️
#user 67.23% <90.47%> (-0.14%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1b94e9...13f3dd0. Read the comment docs.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 12, 2019
@behlendorf behlendorf merged commit 4c0883f into openzfs:master Mar 12, 2019
@alek-p alek-p deleted the snap_iter branch March 12, 2019 20:43