avoid retrieving unused snapshot props #8077
This is what channel programs were designed for. Is there any missing functionality that would prevent us from doing this with a channel program, instead of adding more ioctl-specific tweaks? |
The advantage is we don't have to write a channel program to get some of the efficiency gains. Everyone already using the involved zfs commands would get the benefits when they upgrade. |
@alek-p I'm suggesting that libzfs use channel programs to get the benefit for everyone using the CLI. The user experience would be identical to what you have here. Where you are using the LIST_NEXT ioctl with the new arguments, instead you can use a channel program to find the next snapshot whose birth is within the specified range. Another consideration (for either the LIST_NEXT ioctl or channel program) is making sure that new userland bits will run on old kernel bits. That may "just work" the way you have it now (with the old kernel ignoring the birth time constraint), but for channel programs libzfs should probably check that the kernel supports channel programs and if not then fall back on LIST_NEXT. |
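The compatibility point above (new userland on an old kernel that ignores the range) can be modelled with a small sketch. This is not libzfs code: `txg_range_t`, `list_next()`, and `visit_range()` are hypothetical names, and the "kernel" is just a loop over an array. It only illustrates that if userland keeps re-checking the range, the result is the same on either kernel and only the number of entries returned differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type; a bound of 0 means "not set". */
typedef struct txg_range {
	uint64_t min_txg;
	uint64_t max_txg;
} txg_range_t;

static bool
in_range(uint64_t txg, const txg_range_t *r)
{
	return ((r->min_txg == 0 || txg >= r->min_txg) &&
	    (r->max_txg == 0 || txg <= r->max_txg));
}

/*
 * Hypothetical model of "list next": returns the index of the next
 * snapshot handed to userland at or after cursor, or n when done.
 * A kernel that does not honor the range returns every snapshot.
 */
size_t
list_next(const uint64_t *txgs, size_t n, size_t cursor,
    const txg_range_t *r, bool kernel_honors_range)
{
	while (cursor < n && kernel_honors_range && !in_range(txgs[cursor], r))
		cursor++;
	return (cursor);
}

/*
 * Userland loop: re-checks the range on every result, so the set of
 * snapshots acted on is the same with either kernel. *returned counts
 * how many entries crossed the (modelled) kernel boundary.
 */
size_t
visit_range(const uint64_t *txgs, size_t n, const txg_range_t *r,
    bool kernel_honors_range, size_t *returned)
{
	size_t used = 0;

	assert(returned != NULL);
	*returned = 0;
	for (size_t i = list_next(txgs, n, 0, r, kernel_honors_range); i < n;
	    i = list_next(txgs, n, i + 1, r, kernel_honors_range)) {
		(*returned)++;
		if (in_range(txgs[i], r))
			used++;
	}
	return (used);
}
```

Calling `visit_range()` with `kernel_honors_range` set to false models the upgrade case: the same snapshots are used, but every entry is still returned to userland.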
I see what you mean now; I can check what it would take to make this happen through zcp. I guess the concern is that modifying the existing ioctl could break third-party software.
@alek-p I think the main concern with modifying the ioctl is that when upgrading, you can run new libzfs on an old kernel. In that case we still want |
We should also think about how long this ioctl can hold various locks, if you have a lot of snapshots. It looks like we could hold the dp_config_rwlock while skipping over a lot of snapshots. This can prevent spa_sync() from making progress. We'd have the same issue with a channel program as well, and it might be easier to address this with the LIST_NEXT ioctl than with the channel program (at least with the existing channel program functionality). |
We might want to think at a higher level about how to improve I'm not sure any of my ideas are total replacements for what you've proposed here (it is a pretty simple way to save looking up the snapshot details), but we should probably consider if there are other ways to improve the high-level operations. Have you measured how much performance improvement you get from this? And out of curiosity, do you know which blocks we avoid reading from disk? It looks like if it's a zvol we save reading its contents (zvol_get_stats), which could be a big win. Otherwise the only blocks I see are for snapshot properties, which are not often used. It looks like we can already avoid getting both zvol stats and props if we don't pass in a zc_nvlist_dst.
Yeah it looks like with the current channel program functionality the way to do this would be similar to what I've done - I would need to add a zcp based LIST_NEXT basically. As you pointed out that's not ideal, and should have an interface that is O(1). |
I've tried using old userland binaries with new kernel module and that seems to work. Looks like the right thing is happening because of the check for the presence of the src nvl. |
I've run some performance tests to show the impact of not fetching props that are discarded in userspace. This was a simple
Rollback is benefiting more than send, so there may be more room for optimisation of sends. While testing I noticed that even for a small filesystem that has just 10 snapshots, doing a replicate ( |
I should have mentioned that the SSD and the HDD systems are different physical servers with the HDD one having 256 GB ram which is 8x the ram of SSD system. |
It looks like the only cases that are improving by more than a minute are the HDD rollback ones. I'm not totally opposed to this change, but I wonder if we can do better in terms of avoiding i/o in these cases. For example, with |
To me, the perf test data is showing that this patch has a much higher impact on rollback command even in the SSD case. The send case could probably use more optimization as I alluded to in my previous comment. |
I've rebased this PR in order to pick up the ztest patches that make testing more reliable |
I'd like to review this but I probably won't have time to until after the new year. Please wait for my review if possible. |
In this latest update, I made additional optimizations for the |
This seems to work great, thank you. Tested on my raspberry with 10K snapshots master:
running this PR:
|
Turns out skipping snapshots when doing a |
would be awesome if someone from @delphix could run their internal |
Looks good to me. Would be nice if we could skip right to the dataset we wanted and iterate from there. Maybe as future work.
Sorry about the delay getting this reviewed. The current version of this looks good, only a few cosmetic issues.
module/zfs/zfs_ioctl.c
Outdated
@@ -2313,7 +2313,8 @@ zfs_ioc_dataset_list_next(zfs_cmd_t *zc)
  * inputs:
  * zc_name name of filesystem
  * zc_cookie zap cursor
- * zc_nvlist_dst_size size of buffer for property nvlist
+ * zc_nvlist_dst property nvlist
+ * zc_nvlist_dst_size size of property nvlist
Did you mean to describe zc_nvlist_src[_size] here?
cmd/zfs/zfs_main.c
Outdated
 * zfs_iter_snapshot/bookmark iteration so we can fail fast and
 * avoid iterating over the rest of the younger objects
 */
(void) fprintf(stderr, gettext("these are the first %d "
"first" implies a sorting which I don't think is relevant to the user (it's in the ZAP hash order, right?). Maybe we should instead say something like "Output limited to %d snapshots/bookmarks"?
include/sys/fs/zfs.h
Outdated
 * the "list next snapshot" ioctl
 */
#define SNAP_ITER_SKIP_AFTER_TXG "snap_iter_skip_after_txg"
#define SNAP_ITER_SKIP_TO_TXG "snap_iter_skip_to_txg"
How about SKIP_AFTER_TXG and SKIP_BEFORE_TXG, or perhaps better yet, SNAP_ITER_MAX_TXG and SNAP_ITER_MIN_TXG?
lib/libzfs/libzfs_iter.c
Outdated
@@ -141,7 +141,7 @@ zfs_iter_filesystems(zfs_handle_t *zhp, zfs_iter_f func, void *data)
  */
 int
 zfs_iter_snapshots(zfs_handle_t *zhp, boolean_t simple, zfs_iter_f func,
-    void *data)
+    void *data, nvlist_t *range_nvl)
I think that this should take the min and max as uint64_t arguments, rather than making the caller construct the range_nvl. This minimizes the code that needs to know about how the arguments are marshalled to the kernel.
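A minimal sketch of that suggestion, under stated assumptions: `iter_args_t`, `pack_range()`, and `iter_snapshots()` are hypothetical stand-ins (the real code marshals an nvlist and calls into the kernel), and snapshots are modelled as an array of creation txgs. The point is only that callers pass two plain integers and a single helper knows the packed form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the marshalled ioctl arguments. */
typedef struct iter_args {
	uint64_t ia_min_txg;	/* 0 means "no lower bound" */
	uint64_t ia_max_txg;	/* 0 means "no upper bound" */
} iter_args_t;

typedef int (*iter_f)(uint64_t creation_txg, void *data);

/* The only place that knows how the range is packed. */
iter_args_t
pack_range(uint64_t min_txg, uint64_t max_txg)
{
	iter_args_t args = { min_txg, max_txg };
	return (args);
}

/* Callers pass plain integers and never see the packed form. */
int
iter_snapshots(const uint64_t *txgs, size_t n, iter_f func, void *data,
    uint64_t min_txg, uint64_t max_txg)
{
	iter_args_t args = pack_range(min_txg, max_txg);

	assert(func != NULL);
	for (size_t i = 0; i < n; i++) {
		if (args.ia_min_txg != 0 && txgs[i] < args.ia_min_txg)
			continue;
		if (args.ia_max_txg != 0 && txgs[i] > args.ia_max_txg)
			continue;
		int err = func(txgs[i], data);
		if (err != 0)
			return (err);
	}
	return (0);
}

/* Example callback: counts the snapshots it is shown. */
int
count_cb(uint64_t creation_txg, void *data)
{
	(void) creation_txg;
	(*(int *)data)++;
	return (0);
}
```

Passing 0 for both bounds keeps the old "visit everything" behavior, so existing callers would only need to append two zero arguments.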
lib/libzfs/libzfs_iter.c
Outdated
@@ -282,7 +289,8 @@ zfs_snapshot_compare(const void *larg, const void *rarg)
 }

 int
-zfs_iter_snapshots_sorted(zfs_handle_t *zhp, zfs_iter_f callback, void *data)
+zfs_iter_snapshots_sorted(zfs_handle_t *zhp, zfs_iter_f callback, void *data,
+    nvlist_t *range_nvl)
same here
lib/libzfs/libzfs_iter.c
Outdated
@@ -141,7 +141,7 @@ zfs_iter_filesystems(zfs_handle_t *zhp, zfs_iter_f func, void *data)
  */
 int
 zfs_iter_snapshots(zfs_handle_t *zhp, boolean_t simple, zfs_iter_f func,
-    void *data)
+    void *data, nvlist_t *range_nvl)
We don't care about changing this interface and breaking 3rd party libzfs consumers, right? They should be using libzfs_core, the CLI, or the kernel interface directly if they want a stable interface?
hmm good point; I could create zfs_iter_snapshots_impl() and use that so that we leave the signature for zfs_iter_snapshots() as is but it sounds like that's not necessary
lib/libzfs/libzfs_sendrecv.c
Outdated
if (!sd->replicate && fromsnap_txg != 0) {
	range_nvl = fnvlist_alloc();
	fnvlist_add_uint64(range_nvl, SNAP_ITER_SKIP_TO_TXG,
	    fromsnap_txg_save);
Why does the check use fromsnap_txg, but we "skip to" fromsnap_txg_save? From reading the code, it seems like fromsnap_txg is the txg of the fromsnap in this fs, whereas fromsnap_txg_save is the txg of the fromsnap in the parent filesystem. So we should be using fromsnap_txg in both places here? The test case would be one where the parent and child fromsnaps have different txgs (i.e. snapshots with the same name were created at different times).
lib/libzfs/libzfs_sendrecv.c
Outdated
if (range_nvl == NULL)
	range_nvl = fnvlist_alloc();
fnvlist_add_uint64(range_nvl, SNAP_ITER_SKIP_AFTER_TXG,
    sd->tosnap_txg);
It looks like if tosnap_txg is nonzero then sd->tosnap_txg will be the same, but I think this assumption makes the code harder to understand. I think we should be using the local tosnap_txg here.
module/zfs/zfs_ioctl.c
Outdated
error = zfs_ioc_objset_stats_impl(zc, ossnap);
if ((skip_after != 0 && dsl_get_creationtxg(ds) > skip_after) ||
(skip_to != 0 && dsl_get_creationtxg(ds) < skip_to)) {
dsl_dataset_rele(ds, FTAG);
maybe it's just github, but the indentation here looks wrong (one too many tabs?)
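The skip condition in the hunk above can be read as a small predicate. The following is a standalone sketch, not the kernel code: `snap_out_of_range()` and its parameter names are hypothetical, with `max_txg` playing the role of `skip_after` and `min_txg` the role of `skip_to`. As in the hunk, a bound of 0 is treated as "not set". The sanity assertion on the bounds is an addition of this sketch, not something the hunk does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a snapshot with the given creation txg falls
 * outside the requested range and should be skipped. A bound of 0
 * means "not set", mirroring the != 0 checks in the hunk.
 */
bool
snap_out_of_range(uint64_t creation_txg, uint64_t min_txg, uint64_t max_txg)
{
	/* Sketch-only sanity check: if both bounds are set, min <= max. */
	assert(min_txg == 0 || max_txg == 0 || min_txg <= max_txg);

	if (max_txg != 0 && creation_txg > max_txg)
		return (true);
	if (min_txg != 0 && creation_txg < min_txg)
		return (true);
	return (false);
}
```

Taking the creation txg as a parameter means it is looked up once per snapshot, whereas the hunk calls `dsl_get_creationtxg(ds)` in both halves of the condition.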
Looks like I need to rework how replicate send is detected, I should be able to get to that by Monday. |
thanks for the review @ahrens! I've updated the PR to address your comments |
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations. Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Codecov Report
@@ Coverage Diff @@
## master #8077 +/- ##
==========================================
+ Coverage 78.57% 78.58% +<.01%
==========================================
Files 380 380
Lines 116057 116069 +12
==========================================
+ Hits 91194 91208 +14
+ Misses 24863 24861 -2
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.
Motivation and Context
Currently, when we do zfs rollback and send operations we have to iterate through the full list of snapshots even though we don't end up doing anything with some of those snapshots that the kernel gives us.
Description
To make the snapshot iteration more efficient when possible we added new parameters to the zfs_ioc_snapshot_list_next() ioctl. When passed in, they tell us to skip snapshots created before and/or after the specified creation txg. This allows the kernel to avoid returning snapshots to the caller that the caller doesn't need. As a result, we reduce the number of userland/kernel boundary crossings and we also avoid the slow (often read-from-disk) operation of reading in snapshot stats (zfs_ioc_objset_stats_impl()) for the skipped snapshots.
How Has This Been Tested?
I've run zfs-tests and traced ioctls to confirm that the number of times the zfs_ioc_snapshot_list_next() ioctl is called now depends on which snapshots we are working with.
Types of changes