
zfs create/destroy/mount bad scalability with respect to number of mounted datasets #845

Closed
baryluk opened this issue Jul 20, 2012 · 31 comments

Comments

@baryluk

baryluk commented Jul 20, 2012

With about 1000 datasets in my ZFS pool, it takes about 20 seconds to create a new one, or to mount one more.
zpool import or zfs mount -a can take about an hour.

I believe the bug is in mount.zfs, which checks /etc/mtab and /proc/mounts multiple times
and updates these files. This is extremely unnecessary, and it makes
zfs mount/create/destroy/import extremely slow when the system or zfs has many mounted datasets.

A solution would be to use something better than a plain linear
structure for mtab, or better yet, to ignore mtab entirely: just try to mount without checking
whether the dataset is already mounted, and only check in case of an error.

root@lavinia:~# time zpool import smptank3

real    22m18.381s
user    2m34.898s
sys 19m9.852s
root@lavinia:~# zfs list | wc -l
722

root@lavinia:~# time zfs create smptank3/a

real    0m4.796s
user    0m0.888s
sys 0m3.420s

Precise measurements are visualised in this plot: http://i.imgur.com/oZGlb.png

An strace of zfs create / zfs mount shows that a large amount of time is spent reading from and writing to /etc/mtab.


open("/dev/zfs", O_RDWR)                = 4
open("/etc/mtab", O_RDONLY)             = 5
open("/etc/dfs/sharetab", O_RDONLY)     = 6
open("/etc/dfs/sharetab", O_RDONLY)     = 7
close(7)                                = 0
lseek(5, 0, SEEK_SET)                   = 0
read whole file
lseek(5, 0, SEEK_SET)                   = 0
read whole file
lseek(5, 0, SEEK_SET)                   = 0
read whole file
lseek(5, 0, SEEK_SET)                   = 0
read whole file
...
/// exactly 724 lseeks + full file reads, which is equal to the number of already mounted datasets
...

open("/etc/dfs/sharetab.wAwWn4", O_RDWR|O_CREAT|O_EXCL, 0600) = 7
lseek(7, 0, SEEK_CUR)                   = 0
close(7)                                = 0
close(4)                                = 0
close(5)                                = 0
close(6)                                = 0

...

open("/etc/mtab", O_RDONLY)             = 9
open("/etc/dfs/sharetab", O_RDONLY)     = 10
open("/etc/dfs/sharetab", O_RDONLY)     = 11
close(11)                               = 0
lseek(9, 0, SEEK_SET)                   = 0
lseek(9, 0, SEEK_SET)                   = 0
lseek(9, 0, SEEK_SET)                   = 0
lseek(9, 0, SEEK_SET)                   = 0

// similarly, 724 lseeks

close(5)                                = 0
close(6)                                = 0
close(7)                                = 0
open("/etc/dfs/sharetab.BQrkk0", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
lseek(5, 0, SEEK_CUR)                   = 0
close(5)                                = 0
close(8)                                = 0
close(9)                                = 0
close(10)                               = 0
close(4)                                = 0

After each lseek, the whole /etc/mtab file is read.

/etc/mtab should be read only once and cached in memory.
If possible it should not be used at all: just try to mount; if the mount fails, check why;
if it succeeds, append the proper line to /etc/mtab (with proper file locking).
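A minimal sketch of that "mount first, record later" idea (hypothetical code, not the actual mount.zfs implementation; the mtab locking and option handling are deliberately simplified):

/*
 * Hypothetical sketch: attempt the mount unconditionally and only append a
 * single /etc/mtab entry on success, instead of rescanning the whole file.
 * Not the real mount.zfs code; locking of /etc/mtab is omitted for brevity.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/mount.h>
#include <mntent.h>

static int
mount_and_record(const char *dataset, const char *mountpoint)
{
    /* Try the mount first; inspect state only if it fails. */
    if (mount(dataset, mountpoint, "zfs", MS_NOATIME, "zfsutil") != 0) {
        if (errno == EBUSY)
            return (0);    /* most likely already mounted */
        fprintf(stderr, "mount %s on %s: %s\n",
            dataset, mountpoint, strerror(errno));
        return (-1);
    }

    /* Append one entry rather than re-reading /etc/mtab hundreds of times. */
    FILE *fp = setmntent("/etc/mtab", "a");
    if (fp != NULL) {
        struct mntent me = {
            .mnt_fsname = (char *)dataset,
            .mnt_dir = (char *)mountpoint,
            .mnt_type = "zfs",
            .mnt_opts = "rw,noatime",
            .mnt_freq = 0,
            .mnt_passno = 0,
        };
        addmntent(fp, &me);
        endmntent(fp);
    }
    return (0);
}

int
main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s dataset mountpoint\n", argv[0]);
        return (1);
    }
    return (mount_and_record(argv[1], argv[2]) != 0);
}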

Thanks,
Witek

@baryluk
Author

baryluk commented Jul 23, 2012

I was thinking that maybe commit b740d60 is to blame, which fixes issue #329 by disabling mnttab caching in the zfs command, but it does not appear to be the main performance problem. (Reverting this commit and re-enabling the cache doesn't make it much faster in my case: yes, the mtab file is read and parsed only once, but the command still takes essentially the same amount of time, mostly due to lots of ioctls.)

I discovered that actually any zfs/zpool command starts to take a lot of time. Even zfs -?, which should just print the usage/help text, takes 7 seconds every time.

I tracked this down to the function libshare_init in lib/libshare/libshare.c, which is called before main() in zfs_main.c. It calls sa_init, which calls update_zfs_shares (link to code), which in turn calls zfs_is_mounted (link to context); because mnttab caching is not enabled by default, this takes a long time. So I enabled the cache by calling libzfs_mnttab_cache in libzfs_init (which is called from sa_init), and the mtab file is then read only once, but update_zfs_shares still takes a long time. I traced that to the large number of ioctls the zfs command issues to the kernel (probably from calling zfs_iter_filesystems recursively).

Disabling update_zfs_shares entirely in sa_init solved the problem for me; every command is now much quicker. Because the comment in libshare_init says sa_init is only called there to make sure the actual NFS/SMB exports are consistent with dataset properties, I can probably safely remove the call to update_zfs_shares from sa_init/libshare_init. The same functionality can probably still be achieved by manually calling zfs share -a, which can easily be done once in an init script.

Another possibility is to disable the sa_fini(sa_init()) call in libshare_init entirely.

Another is to invent a better transport than ioctls, because they may be a problem here.

Differences with update_zfs_shares disabled in sa_init:

zfs -?: 0.06s, down from 6.11s
zfs list: 6.3s, down from 12.1s

zfs -? without the change performs about 2500 context switches (according to perf); after the change it is just 24.

@baryluk
Author

baryluk commented Aug 3, 2012

I believe the best option would be to wrap update_zfs_shares in sa_init() with a condition on some environment variable, like ZFS_AUTO_UPDATE_SHARES=yes.

I can prepare a patch for this.
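A rough sketch of what such a guard could look like (ZFS_AUTO_UPDATE_SHARES is only a suggested name, not an existing tunable, and refresh_shares() merely stands in for the expensive update_zfs_shares() walk):

/*
 * Sketch of the proposed opt-in guard around the share refresh.
 * ZFS_AUTO_UPDATE_SHARES is a hypothetical variable name and
 * refresh_shares() is a placeholder for update_zfs_shares().
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
refresh_shares(void)
{
    /* placeholder: iterate datasets and update the sharetab */
    puts("refreshing NFS/SMB share state");
}

static void
maybe_refresh_shares(void)
{
    const char *env = getenv("ZFS_AUTO_UPDATE_SHARES");

    /* Only pay the cost when explicitly requested. */
    if (env != NULL && strcmp(env, "yes") == 0)
        refresh_shares();
}

int
main(void)
{
    maybe_refresh_shares();
    return (0);
}

An init script that wants consistent exports at boot could then opt in once (for example by setting the variable before a single zfs share -a), while every other zfs/zpool invocation would skip the walk.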

@behlendorf
Contributor

@baryluk This was recently fixed along the lines you originally proposed; see commit 27ccd41. Basically, we now only invoke nfs_check_exportfs() when invoking a zfs/zpool command which requires the results. This fix is in -rc11.

@williamstein

(I'm not a ZFS developer, just a big fan and user.) Perhaps this ticket should not be closed. The problem is still present in the released 0.6.2, and it is trivial to replicate: just make a pool on a sparse image file (say), create 1000 filesystems (all empty), and run "time zfs -?"; it takes several seconds. There's another ticket at #821 and several discussions all over the web about how one can't use ZFS with a large number of filesystems, all because of this.

If you comment out the line "update_zfs_shares(impl_handle, NULL);" in lib/libshare/libshare.c and then reinstall ZFS from source, "time zfs -?" is nearly instant again, as is everything else, including listing all snapshots, making snapshots, etc. They all become "1000 times" faster for me. For my project, which absolutely requires having 10,000+ filesystems in a single pool, this speedup is absolutely critical. I'm not using NFS at all, so disabling update_zfs_shares is fine for me.

@FransUrbo
Contributor

Even though this is closed (which it shouldn't be), it is related to #1484.

@Rudd-O
Contributor

Rudd-O commented Dec 19, 2013

Why would update_zfs_shares() be called for listings? This takes AGES to execute on spinning rust. Please fix! :-(

@FransUrbo
Contributor

libshare is initialized globally, for all commands.

This means that even something like zpool status suffers from this delay! It takes my system nine seconds to run it, compared to being 'instantaneous' without libshare...

I've tried to find where and when libshare is initialized/started, but it is initialized even before zpool's main(), so I have no idea how to fix this.

@behlendorf
Contributor

@williamstein @FransUrbo At a minimum for 0.6.3 we should be able to update the code so update_zfs_shares isn't called for all zfs commands, just those that manipulate shares.

behlendorf reopened this Dec 19, 2013
@FransUrbo
Contributor

@behlendorf That was my idea as well. I started with this, but when I noticed that libshare was initialized before main() and couldn't figure out why, I didn't know how to proceed.

Have any hints for me?

@behlendorf
Contributor

@FransUrbo Sure, that's caused by the libshare_init() function having the constructor attribute set. What needs to happen is for the constructor attribute to be removed and for sa_init() to only be called prior to an operation which requires libshare. The infrastructure is already in place for this, and zfs_init_libshare() will call sa_init() during share and unshare for the initialization. The only wrinkle that remains is ensuring libshare_nfs_init() and libshare_smb_init() are called prior to sa_init().

It's worth taking a look at _zfs_init_libshare() in Illumos to see what they've done. They dlopen() the library in a constructor as well since I suppose it might not always be available on their systems. But for ZoL we opted for a normal shared library which we guarantee will be available since it's installed with the package.
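Purely as an illustration (this is not the libshare source), the difference between constructor-time and lazy initialization discussed above looks roughly like this:

/*
 * Illustration only: constructor-time vs. lazy initialization.
 * The commented-out calls show where the real libshare hooks would sit.
 */
#include <stdio.h>

static int libshare_ready;

/* Old pattern: a constructor runs before main() for every zfs/zpool command. */
__attribute__((constructor))
static void
eager_init(void)
{
    /* the expensive sa_fini(sa_init(0)) call used to live here */
}

/* Proposed pattern: initialize only when a share/unshare path needs it. */
static void
lazy_init(void)
{
    if (!libshare_ready) {
        /* libshare_nfs_init(); libshare_smb_init(); then sa_init() */
        libshare_ready = 1;
    }
}

int
main(void)
{
    /* Most commands (zfs list, zpool status, ...) never need libshare. */
    lazy_init();    /* only the share/unshare code paths would call this */
    printf("libshare initialized: %d\n", libshare_ready);
    return (0);
}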

@behlendorf
Contributor

For anyone suffering from this issue could you please try the following patch. It does two things:

  • Enables the /etc/mtab cache, which prevents the zfs command from having to repeatedly open and read the /etc/mtab file. Instead, an AVL tree of the mounted filesystems is created and used. This means that if non-zfs filesystems are mounted using the normal mount(8) command, zfs will not be aware of them. Disabling this was originally done out of an abundance of paranoia.
  • Removes the unconditional sharetab update when running a zfs command. This means the sharetab might become out of date if users are manually adding/removing shares with exportfs, but we shouldn't punish all callers of zfs in order to handle that unlikely case. If we observe issues because of this, it can always be added back to just the share/unshare call paths.

I wasn't able to consistently reproduce the slow behavior in my VM so I'd be interested to see how these two small changes help your systems.

diff --git a/cmd/zfs/zfs_main.c b/cmd/zfs/zfs_main.c
index 3f54985..9fac5b2 100644
--- a/cmd/zfs/zfs_main.c
+++ b/cmd/zfs/zfs_main.c
@@ -6467,7 +6467,7 @@ main(int argc, char **argv)
        /*
         * Run the appropriate command.
         */
-       libzfs_mnttab_cache(g_zfs, B_FALSE);
+       libzfs_mnttab_cache(g_zfs, B_TRUE);
        if (find_command_idx(cmdname, &i) == 0) {
                current_command = &command_table[i];
                ret = command_table[i].func(argc - 1, argv + 1);
diff --git a/lib/libshare/libshare.c b/lib/libshare/libshare.c
index 6625a1b..ea59dcd 100644
--- a/lib/libshare/libshare.c
+++ b/lib/libshare/libshare.c
@@ -105,14 +105,6 @@ libshare_init(void)
 {
        libshare_nfs_init();
        libshare_smb_init();
-
-       /*
-        * This bit causes /etc/dfs/sharetab to be updated before libzfs gets a
-        * chance to read that file; this is necessary because the sharetab file
-        * might be out of sync with the NFS kernel exports (e.g. due to reboots
-        * or users manually removing shares)
-        */
-       sa_fini(sa_init(0));
 }

 static void
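
For comparison, here is a stand-alone sketch of what the enabled cache buys: read /etc/mtab once with getmntent(3) and answer "is this dataset mounted?" from memory. (Hypothetical code; the real libzfs cache uses an AVL tree, a flat array is used here only to keep the example short.)

/*
 * Sketch of an in-memory mtab cache. Illustrative only; libzfs builds an
 * AVL tree keyed by dataset name, this example just keeps a flat array.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mntent.h>

struct mtab_cache {
    char **names;
    size_t n;
};

static struct mtab_cache
load_mtab_cache(void)
{
    struct mtab_cache c = { NULL, 0 };
    FILE *fp = setmntent("/etc/mtab", "r");
    struct mntent *me;

    if (fp == NULL)
        return (c);
    /* One pass over /etc/mtab instead of one pass per dataset. */
    while ((me = getmntent(fp)) != NULL) {
        char **tmp = realloc(c.names, (c.n + 1) * sizeof (char *));
        if (tmp == NULL)
            break;
        c.names = tmp;
        c.names[c.n++] = strdup(me->mnt_fsname);
    }
    endmntent(fp);
    return (c);
}

static int
is_mounted(const struct mtab_cache *c, const char *dataset)
{
    for (size_t i = 0; i < c->n; i++)
        if (strcmp(c->names[i], dataset) == 0)
            return (1);
    return (0);
}

int
main(int argc, char **argv)
{
    struct mtab_cache c = load_mtab_cache();
    const char *ds = (argc > 1) ? argv[1] : "tank/home";

    printf("%s mounted: %s\n", ds, is_mounted(&c, ds) ? "yes" : "no");
    return (0);
}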

@FransUrbo
Contributor

@behlendorf In my tests for #1484 I found that enabling the mtab cache does help. Not by a huge amount, but it helps. But is it a good idea?

The script I used to test this looks like this:

#!/bin/bash

zfs create share/tests
zfs create share/tests/multitst

do_create () {
    path=$*
    zfs create share/tests/multitst/$path
}

for i in {1..100}; do
    echo -n "."
    i1=`printf "%03d" $i`
    do_create tst$i1
    for j in {1..2}; do
        j1=`printf "%03d" $j`
        do_create tst$i1/sub$j1
        for k in {1..2}; do
            k1=`printf "%03d" $k`
            do_create tst$i1/sub$j1/lst$k1
        done
    done
done
echo

This took several hours initially, but some of the improvements I've sent pull requests for, plus enabling the mtab cache (and a much, much newer ZoL), have cut this substantially. I'll rerun the test, with and without libshare, and post some numbers as soon as I'm sure my ZVOLs work as they're supposed to.

But if you change the sub levels from {1..2} to something like {1..100}, I'm sure the universe will be very empty and cold once it's done :)

But it's weird that you can't reproduce the problem. Did you test with libshare completely disabled/enabled, not just the libshare_init() fix above?

On my live machine, it currently takes about an hour and a half to mount 613 filesystems.

@chrisrd
Contributor

chrisrd commented Dec 23, 2013

For what it's worth...

When I had approximately 1000 filesystems in my pool, zfs mount -a was taking over an hour.

I had a look at strace -f -o /tmp/strace zfs mount xxx and couldn't believe it was doing open("/etc/mtab") over 1700 times for each mount. Given that on my (Debian) system /etc/mtab is a symlink to /proc/mounts, and as such is maintained automatically by the kernel, that seemed completely unnecessary.

In the end, mount() is just a system call, so I wrote a simple zfs-mount wrapper around the system call. Of course this is far less flexible than the proper zfs mount call, but on the upside it's fast!

I currently have 2100 filesystems in my pool and using the script and zfs-mount program below it takes less than 4 minutes to mount everything.

#!/bin/bash
#
# Mount all ZFS filesystems
#
# snap
#
zfs list -H -t filesystem -o name,mountpoint |
while IFS=$'\t' read a b
do
        zfs-mount "${a}" "${b}"
done

Source for zfs-mount:

/*
 * ZFS mounter
 *
 * WARNING: hard-coded mount options!!
 *
 * compile with:
 *   make zfs-mount
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <string.h>
#include <sys/mount.h>

static void
usage()
{
    fprintf(stderr, "Usage: zfs-mount [options] zfs mountpoint\n\n");
    fprintf(stderr, "Options:\n");
    fprintf(stderr, "\t--help|-h\tthis message\n");
    exit(EXIT_FAILURE);
}

extern char *optarg;
extern int optind, opterr, optopt;

static char *opts = "h";
static struct option longopts[] = {
    { "help", no_argument, 0, 'h', },
    { 0, 0, 0, 0, },    /* getopt_long() requires a zero-filled terminating entry */
};

int
main(int argc, char *argv[])
{
    while (1) {
        int c;
        int optidx;
        c = getopt_long(argc, argv, opts, longopts, &optidx);
        if (c == -1)
            break;
        switch (c) {
        case 'h':
            usage();
            break;
        default:
            abort();
        }
    }

    if (argc - optind != 2)
        usage();

    char *source = argv[optind++];
    char *target = argv[optind++];

    unsigned long mountflags = MS_NOATIME;
    char *opts = "noatime,xattr,rw,zfsutil";

    /*
     * Snapshots are mounted ro
     */
    if (strchr(source, '@')) {
        mountflags |= MS_RDONLY;
        opts = "noatime,xattr,ro";
    }

    if (mount(source, target, "zfs", mountflags, opts) != 0) {
        fprintf(stderr, "mount(%s, %s) returned %d: %s\n", source, target, errno, strerror(errno));
        return(EXIT_FAILURE);
    }

    return(EXIT_SUCCESS);
}

@behlendorf
Contributor

I found that enabling the mtab cache DO help. Not by a huge amount, but anyway. But is it a good idea?

Allowing the mtab file to be cached means there's a window where the version cached by the zfs command and the actual /etc/mtab could be out of sync if another process is mounting filesystems. Depending on how your system is configured, that may or may not be likely. Because I was being uber paranoid when I originally wrote this code, I disabled the cache. However, in practice, if they're slightly out of sync it's not the worst thing ever and may be a reasonable tradeoff to make. It's less of an issue on, say, Illumos, where they only have two real filesystems, so they have high confidence all mounts are going through zfs. At a minimum we could make using the cache a zfs mount command line option and enable it by default. @chrisrd I expect this would substantially help your use case.

@FransUrbo I'll try your test script. I did something similar for my testing, but I could never get it to take more than about 1 minute in my VM to mount and share 1000 filesystems.

@chrisrd
Contributor

chrisrd commented Dec 23, 2013

@behlendorf I'll do some testing with the mtab caching enabled. However, I'm wondering: for the situations where mtab is managed by the kernel, is there a way of detecting this and turning off the mtab manipulation altogether?

It's also interesting that you haven't been able to produce more than a minute of delay mounting 1000 filesystems, whereas I was seeing over an hour. I'll see what strace can tell us about where the time is being spent, and perhaps dive deeper into the kernel if any particular system calls stand out.

@behlendorf
Contributor

is there a way of detecting this and turning off the mtab manipulation altogether?

To some degree this is already done. These days the mount helper will detect if /etc/mtab is a symlink and, if so, won't manually update it. Enabling the mtab cache should resolve the rest of the issue: /etc/mtab will be read once, and the cached copy will be used for subsequent checks of whether a filesystem is mounted. So with the cache enabled I'd expect you to see good results. If not, we should figure out where the time is going and why.
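A trivial sketch of that symlink check (illustrative only, not the actual mount helper code):

/*
 * Sketch: if /etc/mtab is a symlink (typically to /proc/mounts) the kernel
 * maintains it, so a mount helper should never rewrite it by hand.
 */
#include <stdio.h>
#include <sys/stat.h>

static int
mtab_needs_manual_update(void)
{
    struct stat st;

    if (lstat("/etc/mtab", &st) == 0 && S_ISLNK(st.st_mode))
        return (0);    /* kernel-managed; leave it alone */
    return (1);
}

int
main(void)
{
    printf("update /etc/mtab manually: %s\n",
        mtab_needs_manual_update() ? "yes" : "no");
    return (0);
}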

If someone could write a little script and verify that it reproduces the bad behavior in a VM, that would be helpful to me. I'm not having much luck causing a problem as serious as the one described here. What happens clearly isn't optimal, but in my testing you could certainly live with it.

@chrisrd
Contributor

chrisrd commented Dec 24, 2013

@behlendorf With the mtab cache enabled, the time to zfs mount -a 1170 filesystems is a bit longer, at 2m55s, compared to my custom zfs-mount at 2m33s, but that's not a difference that's going to worry me. I vote for enabling the mtab cache, by default or by switch!

@FransUrbo
Contributor

I've now been running without the sa_fini(sa_init(0)) part for about a week and everything seems to be working. I have not done any speed tests to see if it actually made any difference, but it doesn't seem like it does...

@behlendorf have you been successful in reproducing the problem?

PS. I've just recently (yesterday!) been able to boot my ZFS root installation with Debian GNU/Linux Wheezy (all 64-bit), which does have mtab as a symlink, but I don't notice any difference in mount speed. Note though that I was extremely paranoid when I created the dataset, so I used dedup=on,copies=3,sync=standard,encryption=off, which might take a lot of the speed out of the patches posted above (the speed problems regarding copies are described elsewhere).

Oh, I just triple-checked: I missed enabling the mtab cache. I'll enable that as well and see what I find.

@FransUrbo
Contributor

Do note that the mtab cache part is incomplete. I found in #1484 that there are a lot of places where the code opens/reads/seeks/closes mtab directly, without going through libzfs... Finding and fixing those might speed things up even more.

@behlendorf
Contributor

It sounds like re-enabling the cache has been helpful for @chrisrd so I'll apply that patch. I'm also going to make the sa_fini(sa_init(0)) change which hasn't caused any problems. If something unexpected crops up after the merge we can always revert this before the official tag. But that seems unlikely.

@FransUrbo I can easily believe that not everything uses the mtab cache. We should probably address those cases one by one as we discover them, but in the meantime I don't think that needs to prevent us from enabling the cache.

Unfortunately, I still haven't been able to reproduce the issue on any of my test systems.

behlendorf added a commit that referenced this issue Jan 7, 2014
Re-enable the /etc/mtab cache to prevent the zfs command from
having to repeatedly open and read from the /etc/mtab file.
Instead an AVL tree of the mounted filesystems is created and
used to vastly speed up lookups. This means that if non-zfs
filesystems are mounted concurrently the 'zfs mount' will not
immediately detect them.  In practice that will rarely happen
and even if it does the absolute worst case would be a failed
mount.  This was originally disabled out of an abundance of
paranoia.

NOTE: There may still be some parts of the code which do not
consult the mtab cache.  They should be updated to check the
mtab cache as they are discovered to be a problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #845
behlendorf added a commit that referenced this issue Jan 7, 2014
Removes the unconditional sharetab update when running any zfs
command. This means the sharetab might become out of date if
users are manually adding/removing shares with exportfs.  But
we shouldn't punish all callers to zfs in order to handle that
unlikely case. In the unlikely event we observe issues because
of this it can always be added back to just the share/unshare
call paths where we need an up to date sharetab.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #845
@behlendorf
Contributor

Bumping to 0.6.4. Things have improved here somewhat but there's clearly more to do.

@douardda

I've been seriously hit by this bug tonight after upgrading my zfsonlinux host from squeeze to wheezy (the bug prevented the system from booting, or at least I didn't have enough patience to wait). Now the zpool import remains quite long (maybe a minute or two), and zfs mount -a also takes less than 2 minutes. Back to business (and having a zfs -h that responds immediately is greatly appreciated).

David

@behlendorf
Contributor

@douardda @chrisrd @baryluk @williamstein @Rudd-O in your opinion, are we at a point with the 0.6.3 tag where the performance when managing 1000's of filesystems is generally acceptable?

behlendorf added Bug - Minor and removed Bug, Type: Documentation, Type: Feature labels Oct 6, 2014
behlendorf removed this from the 0.6.4 milestone Oct 6, 2014
@FransUrbo
Contributor

are we at a point with the 0.6.3 tag where the performance when managing 1000's of filesystems is generally acceptable?

Depends on what's acceptable, but I'd say so:

# time zfs list -tall | wc -l
671

real    0m9.487s
user    0m0.044s
sys     0m0.536s

A large part of the problem, I believe, was fixed in 0bc7a7a and abbfdca (issue #1498).

@williamstein

Is it linear in the number of filesystems, so if there were 6710, then it would take 94 seconds just to do the above?

For what it's worth, I re-architected my site (cloud.sagemath.com) to use one ZFS filesystem, lots of directories, and rsync for replication, because of this performance issue (even with the workarounds). Most of my VMs have around 10,000 users (hence directories), so several minutes of waiting at bootup (or for various other operations) isn't fast enough. So I'm not easily able to test things. (I'm very happy with how ZFS is improving in general!)


@behlendorf
Contributor

Is it linear in the number of filesystems, so if there were 6710, then it would take 94 seconds just to do the above?

Yes, it should be roughly linear.

OK, thanks for the feedback. I'm glad you found a way to make ZFS work in your environment.

@FransUrbo
Contributor

It doesn't look like it's linear:

# time zpool import -d /dev/disk/by-id/ rpool

real    2m48.028s
user    0m6.576s
sys     2m4.900s
# time zfs list | wc -l
3863

real    0m1.633s
user    0m0.164s
sys     0m1.472s

I honestly don't know why that list takes less than two seconds, while the one with only 670 filesystems took almost ten...

Creating those 3,800 filesystems took forever, though!

@behlendorf
Contributor

OK, then I'm closing out this issue. If there are still specific use cases which need to be improved, let's open new bugs for them.

@mailinglists35

[screenshot: 2015-12-12_02-45-19]
I have 600 datasets in a 30% fragmented mirror pool built from two regular 1 TB SATA drives, in an old system (Intel Q33 Express chipset with 4 GB of RAM and a Core 2 Duo E8200 @ 2.66 GHz). The cache and log vdevs are on a separate SSD.

zpool import takes more than 30 seconds. The import timeline looks like this: first zpool saturates the disk and CPU, then dbu_evict takes a significant share of the CPU, then short bursts of multiple dmu_objset_find happen quickly with dbu_evict still running, then zpool climbs back up the CPU while dbu_evict continues to keep a significant share of the CPU busy, then finally there is lots of CPU with no I/O, waiting for systemd processes (I'll try to attach a screencast).

What are the expected numbers per create/destroy/mount operation given the above configuration (assuming the ZoL code is perfect and there's nothing that can be done to speed it up)? Create was much slower; it took several minutes to create the datasets.

Also, I've noticed that a few seconds (3-4) of the total are consumed by systemd --user while the mount command is running (mount takes 0-1% CPU and systemd takes 20% for each open login session, which multiplies with the number of login shells I have open, near the end of the import operation when the zfs code has finished initialization and is waiting for mount to finish). Is this normal, or should I file an issue against systemd? By the way, zpool export takes very long waiting for systemd; zpool export itself would have finished instantly if systemd wasn't in the way.

@Rudd-O
Contributor

Rudd-O commented Jan 16, 2016

This may be worthy of another bug, @mailinglists35.

@mailinglists35

mailinglists35 commented Jul 2, 2016

@behlendorf is it normal/expected on the latest release (which already has the mtab cache enabled) to still observe zfs create reading mtab as many times as there are existing datasets?

root@linux:/usr/local/src/zfs-linux-0.6.5.7# time strace -f zfs create pool/test > /tmp/create 2>&1

real    0m13.391s
user    0m0.228s
sys     0m12.632s

root@linux:/usr/local/src/zfs-linux-0.6.5.7# time strace -f zfs list > /tmp/list 2>&1     
real    0m12.356s
user    0m0.144s
sys     0m11.920s

root@linux:/usr/local/src/zfs-linux-0.6.5.7# grep -c mtab /tmp/create /tmp/list
/tmp/create:1638
/tmp/list:1

However, create -o mountpoint=none or create -o canmount=off reads mtab only once and finishes instantly.
