Backfilling metadnode degrades object create rates #4460

Closed
nedbass
Contributor

@nedbass nedbass commented Mar 26, 2016

Object creation rates may be degraded when dmu_object_alloc() tries
to backfill the metadnode array by restarting its search at offset 0.
The method of searching the dnode space for holes is inefficient and
unreliable, leading to many failed attempts to obtain a dnode hold.
These failed attempts are expensive and limit overall system
throughput. This patch changes the default behavior to disable
backfilling, and it adds a zfs_metadnode_backfill module parameter to
allow the old behavior to be enabled.

The search offset restart happens at most once per call to
dmu_object_alloc() when the previously allocated object number is a
multiple of 4096. If the hold on the requested object fails because
the object is allocated, dmu_object_next() is called to find the next
hole. That function should theoretically identify the next free
object that the next loop iteration can successfully obtain a hold
on. In practice, however, dmu_object_next() may falsely identify a
recently allocated dnode as free because the in-memory copy of the
dnode_phys_t is not up to date. The next hold attempt then fails, and
this process repeats for up to 4096 loop iterations before the search
skips ahead to a sparse region of the metadnode. A similar pathology
occurs if dmu_object_next() returns ESRCH when it fails to find a
hole in the current dnode block. In this case dmu_object_alloc()
simply increments the object number and retries, resulting again in
up to 4096 failed dnode hold attempts.
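
The retry pattern described above can be illustrated with a small userspace model. This is only a sketch: `ToyObjset`, its two sets, and the skip-ahead rule are hypothetical simplifications, not the real `dmu_object_alloc()` logic. It assumes the hole search sees only synced allocations and that a restarted search gives up after 4096 failed holds, as the description states.

```python
CHUNK = 4096  # restart interval and retry bound quoted in the description


class ToyObjset:
    """Toy stand-in for an object set; not the real ZFS data structures."""

    def __init__(self):
        self.synced = {0}      # allocations visible in the "on-disk" view (0 is reserved here)
        self.in_core = set()   # allocated in this txg, not yet visible on disk
        self.next_obj = 0
        self.failed_holds = 0

    def _hold_free(self, obj):
        """A hold succeeds only if the object is free in BOTH views."""
        if obj in self.synced or obj in self.in_core:
            self.failed_holds += 1
            return False
        return True

    def _stale_next_hole(self, start):
        """Hole search that consults only the synced view, so it can be wrong."""
        obj = start
        while obj in self.synced:
            obj += 1
        return obj

    def alloc(self, backfill):
        start = self.next_obj
        if backfill and start % CHUNK == 0:
            start = 0                      # restart the search at offset 0
        obj = self._stale_next_hole(start + 1)
        attempts = 0
        while not self._hold_free(obj):
            attempts += 1
            if attempts >= CHUNK:
                # Give up and skip ahead past everything allocated so far.
                obj = max(self.synced | self.in_core) + 1
            else:
                obj = self._stale_next_hole(obj + 1)
        self.in_core.add(obj)
        self.next_obj = obj
        return obj

    def sync(self):
        """End of txg: in-core allocations become visible to the hole search."""
        self.synced |= self.in_core
        self.in_core.clear()


def failed_holds(backfill, creates=3 * CHUNK, sync_every=None):
    objset = ToyObjset()
    for i in range(creates):
        objset.alloc(backfill)
        if sync_every and (i + 1) % sync_every == 0:
            objset.sync()
    return objset.failed_holds


print(failed_holds(backfill=True))                    # 8192
print(failed_holds(backfill=False))                   # 0
print(failed_holds(backfill=True, sync_every=CHUNK))  # 0
```

In this model the failed holds appear only when the search restarts while earlier allocations are still unsynced, which is the case the description identifies.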

We can avoid these pathologies by not attempting to backfill the
metadnode array. This may result in sparse dnode blocks, potentially
costing disk space, memory overhead, and increased disk I/O. These
penalties appear to be outweighed by the performance cost of the
current approach. Future work could implement a more efficient means
to search for holes and allow us to reenable backfilling by default.

=== Benchmark Results ===

We measured a 46% increase in average file creation rate by
setting zfs_metadnode_backfill=0.

The createmany benchmark used is available at
http://github.com/nedbass/createmany. It used 32 threads to create 16
million files over 16 iterations. The pool was freshly created for each
of the two tests. The test system was a d2.xlarge Amazon AWS virtual
machine with three 2 TB disks in a raidz pool.

zfs_metadnode_backfill Average creates/second
---------------------- ----------------------
                     0                  43879
                     1                  30040

$ zpool create tank raidz /dev/xvd{b,c,d}
$ echo 0 > /sys/module/zfs/parameters/zfs_metadnode_backfill
$ for ((i=0;i<16;i++)) ; do ./createmany -o -t 32 -D $(mktemp -d /tank/XXXXX) 1000000 ; done
total: 1000000 creates in 21.142829 seconds: 47297.359852 creates/second
total: 1000000 creates in 21.421943 seconds: 46681.108566 creates/second
total: 1000000 creates in 21.996960 seconds: 45460.826977 creates/second
total: 1000000 creates in 22.031947 seconds: 45388.637143 creates/second
total: 1000000 creates in 21.597262 seconds: 46302.165727 creates/second
total: 1000000 creates in 21.194397 seconds: 47182.281302 creates/second
total: 1000000 creates in 23.844561 seconds: 41938.285457 creates/second
total: 1000000 creates in 25.678497 seconds: 38943.089478 creates/second
total: 1000000 creates in 22.400553 seconds: 44641.757449 creates/second
total: 1000000 creates in 22.011262 seconds: 45431.290857 creates/second
total: 1000000 creates in 21.848749 seconds: 45769.211022 creates/second
total: 1000000 creates in 26.574808 seconds: 37629.622928 creates/second
total: 1000000 creates in 22.326124 seconds: 44790.580077 creates/second
total: 1000000 creates in 23.562593 seconds: 42440.152541 creates/second
total: 1000000 creates in 26.825597 seconds: 37277.828270 creates/second
total: 1000000 creates in 22.277026 seconds: 44889.297413 creates/second

$ zpool destroy tank
$ zpool create tank raidz /dev/xvd{b,c,d}
$ echo 1 > /sys/module/zfs/parameters/zfs_metadnode_backfill
$ for ((i=0;i<16;i++)) ; do ./createmany -o -t 32 -D $(mktemp -d /tank/XXXXX) 1000000 ; done
total: 1000000 creates in 31.947285 seconds: 31301.564265 creates/second
total: 1000000 creates in 31.511260 seconds: 31734.687822 creates/second
total: 1000000 creates in 31.984121 seconds: 31265.515618 creates/second
total: 1000000 creates in 31.960720 seconds: 31288.406458 creates/second
total: 1000000 creates in 32.651408 seconds: 30626.550663 creates/second
total: 1000000 creates in 32.579218 seconds: 30694.414826 creates/second
total: 1000000 creates in 36.163562 seconds: 27652.143474 creates/second
total: 1000000 creates in 33.621352 seconds: 29743.003829 creates/second
total: 1000000 creates in 33.097268 seconds: 30213.974061 creates/second
total: 1000000 creates in 34.419482 seconds: 29053.313476 creates/second
total: 1000000 creates in 34.014244 seconds: 29399.448204 creates/second
total: 1000000 creates in 32.972573 seconds: 30328.236705 creates/second
total: 1000000 creates in 34.757156 seconds: 28771.054526 creates/second
total: 1000000 creates in 32.194859 seconds: 31060.859951 creates/second
total: 1000000 creates in 32.464407 seconds: 30802.966165 creates/second
total: 1000000 creates in 37.443681 seconds: 26706.776650 creates/second

Signed-off-by: Ned Bass <bass6@llnl.gov>
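
The averages in the table can be recomputed from the per-iteration output above. The sketch below copies the sixteen rates from each run (rounded to two decimals); `parse_rate` is a hypothetical helper that assumes the `total: ... creates/second` line format shown in the transcript.

```python
import re
from statistics import mean

# Per-iteration rates copied from the two runs above, rounded to two decimals.
BACKFILL_OFF = [
    47297.36, 46681.11, 45460.83, 45388.64, 46302.17, 47182.28, 41938.29, 38943.09,
    44641.76, 45431.29, 45769.21, 37629.62, 44790.58, 42440.15, 37277.83, 44889.30,
]
BACKFILL_ON = [
    31301.56, 31734.69, 31265.52, 31288.41, 30626.55, 30694.41, 27652.14, 29743.00,
    30213.97, 29053.31, 29399.45, 30328.24, 28771.05, 31060.86, 30802.97, 26706.78,
]

RATE = re.compile(r"seconds: ([\d.]+) creates/second$")


def parse_rate(line):
    """Extract the creates/second figure from one createmany output line."""
    match = RATE.search(line.strip())
    if match is None:
        raise ValueError(f"not a createmany total line: {line!r}")
    return float(match.group(1))


off, on = mean(BACKFILL_OFF), mean(BACKFILL_ON)
print(round(off), round(on))                                   # 43879 30040
print(f"{(off / on - 1) * 100:.0f}% faster with backfill disabled")  # 46% ...
```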

@nedbass
Contributor Author

nedbass commented Mar 26, 2016

@ahrens this might be of interest to your metadata performance work.

It highlights a problem I noticed while working on large dnodes #3542. When dnode_next_offset_level() searches the dnode space it looks at dnode_phys_t buffers. But if the object was recently created that buffer is not in sync with the in-core dnode_t. It will be empty and dnode_next_offset_level() falsely identifies a recently allocated dnode as a hole because the dn_type field appears to be DMU_OT_NONE. In #3542 I had to inspect both dnode_phys_t and dnode handles to deal with this issue.

Addressing that doesn't fix this performance issue, however, because dmu_object_alloc() doesn't handle ESRCH errors from dmu_object_next(). It simply increments the object number and retries in that case, so it still may end up iterating through a whole L2 block pointer's worth of allocated dnodes before moving on.

Disabling backfill is a band-aid. In the long term we need a better way to find holes, i.e. space maps.

@ahrens
Member

ahrens commented Mar 26, 2016

Thanks @nedbass. I recently discovered this as well. I agree we need to take into account allocated dnodes that have not been written to disk yet. I'll be working on a design for that. I'll try to avoid changing the on-disk structure though (i.e. not add space maps; range trees could be useful, though, for tracking what's allocated but not yet synced).

@behlendorf
Contributor

@nedbass nice find. This issue of not being able to cheaply determine if a dnode has just been dirtied but not yet written has come up a few times recently. It clearly has a significant impact on create performance. I think space maps, range trees, or even bitmaps might all be reasonable approaches depending on exactly what use case is being optimized. None of these solutions necessarily require us to change the on disk format (which I agree would be a good thing).

@behlendorf behlendorf added the "Type: Performance" label Mar 27, 2016
@behlendorf behlendorf added this to the 0.7.0 milestone Mar 27, 2016
@nedbass
Contributor Author

nedbass commented Mar 30, 2016

Related to openzfs/openzfs#82

@@ -31,6 +31,8 @@
#include <sys/zap.h>
#include <sys/zfeature.h>

int zfs_metadnode_backfill = 0;
Contributor

It probably makes sense to add a pool parameter for this instead of just a module option, so that it can be set persistently if users care more about dense allocations than create performance.

Member

You mean to make it a zpool property? I think that would be a bad idea, since this is essentially a workaround for a performance bug.

Contributor Author

Yes, if we can reliably detect holes using openzfs/openzfs#82, and add appropriate handling of ESRCH from dmu_object_next(), then this patch shouldn't be needed.

Member

I don't think that openzfs/openzfs#82 is sufficient to address the performance problem that this patch is working around. The problem is that dnode_next_offset (and dmu_object_next) don't take into account objects allocated in memory but not yet synced to disk. Therefore, if we allocate more than an L1's worth of dnodes in one txg (the comment about L2 is inaccurate), we will end up calling dnode_hold_impl / dmu_object_next on every allocated-in-memory object when we cross the L1 boundary.
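
As a rough worked example of what "an L1's worth of dnodes" means, the arithmetic below assumes 512-byte dnodes, 16 KiB dnode blocks, 128-byte block pointers, and 16 KiB indirect blocks. None of these sizes are stated in this thread; they are assumptions chosen because they reproduce the 4096 figure quoted in the PR description.

```python
# Assumed sizes (not stated in the thread); all in bytes.
DNODE_SIZE = 512
DNODE_BLOCK_SIZE = 16 * 1024
BLKPTR_SIZE = 128
INDIRECT_BLOCK_SIZE = 16 * 1024

dnodes_per_block = DNODE_BLOCK_SIZE // DNODE_SIZE             # dnodes in one leaf block
blkptrs_per_indirect = INDIRECT_BLOCK_SIZE // BLKPTR_SIZE     # leaf blocks under one L1
dnodes_per_l1 = dnodes_per_block * blkptrs_per_indirect

print(dnodes_per_block, blkptrs_per_indirect, dnodes_per_l1)  # 32 128 4096
```

Under these assumptions, a single txg that creates more than 4096 objects crosses an L1 boundary while its earlier allocations are still unsynced.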

Contributor Author

OK, I misunderstood what that patch fixes. The symptoms are similar (dnode_next_offset() can detect fictional holes) but happen under different conditions.

@sempervictus
Contributor

If this patch were to be applied in a stack, used on a pool, and then removed for a subsequent iteration, will the holes missed due to this workaround be back-filled?
Secondly, what effect, if any, would this have on ZVOLs? We've observed degrading performance on ZVOLs under heavy write load (while the VDEV volumes comprising the pool show ~20% load), and I'm wondering if that could be related, and potentially (temporarily) addressed this way.
Thanks.

@nedbass
Contributor Author

nedbass commented Apr 8, 2016

@sempervictus Yes, the holes would be backfilled. In fact the backfilling behavior would be immediately restored if you dynamically set zfs_metadnode_backfill=1.

This patch will have no effect on ZVOL performance, since a ZVOL is effectively one giant object and doesn't allocate new objects internally.

@behlendorf
Contributor

While this isn't the ideal long term solution for this problem it is a small safe change which does significantly improve meta-data performance today. @nedbass let me know if you're happy with this as a final version so it can be merged.

@nedbass
Contributor Author

nedbass commented Apr 12, 2016

@behlendorf I'm fine with it in its current form. We'll probably want to revert this once the underlying issues are fixed.

@adilger
Contributor

adilger commented Apr 13, 2016

AFAIK, the problem with using this patch in production is that it will result in monotonically increasing dnode numbers, and for a workload that is creating and deleting files continuously the metadnode would become very large and sparse, until the filesystem is remounted? For long-running servers this might be unacceptable.

Is there some way to track the number of freed dnodes (in-memory per-pool counter) and reset the scanning once it hits some threshold (e.g. 4x the average number of dnodes allocated in the past few TXGs)? That makes it worthwhile to go back and re-scan, while ensuring the new dnodes are unlikely to hit recently allocated dnodes. It may have a noticeable performance cost to change from allocating all new dnodes in a block to filling in holes.
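
A minimal sketch of the threshold idea in this comment, assuming a simple in-memory counter. `BackfillGate`, its method names, and the four-TXG window are hypothetical; only the "freed count versus 4x the recent average allocation rate" rule comes from the comment.

```python
from collections import deque


class BackfillGate:
    """Decides when enough dnodes have been freed to justify a rescan from offset 0."""

    def __init__(self, window=4, factor=4):
        self.recent_allocs = deque(maxlen=window)  # allocations in each recent TXG
        self.allocs_this_txg = 0
        self.freed_since_rescan = 0
        self.factor = factor

    def on_alloc(self):
        self.allocs_this_txg += 1

    def on_free(self):
        self.freed_since_rescan += 1

    def on_txg_sync(self):
        """Close out the current TXG and remember how many dnodes it allocated."""
        self.recent_allocs.append(self.allocs_this_txg)
        self.allocs_this_txg = 0

    def should_rescan(self):
        if not self.recent_allocs or self.freed_since_rescan == 0:
            return False
        average = sum(self.recent_allocs) / len(self.recent_allocs)
        return self.freed_since_rescan >= self.factor * average

    def mark_rescanned(self):
        self.freed_since_rescan = 0


gate = BackfillGate()
for _ in range(100):
    gate.on_alloc()
gate.on_txg_sync()
for _ in range(400):
    gate.on_free()
print(gate.should_rescan())  # True: 400 frees >= 4 * 100 average allocations
```

With a create-only workload the counter never moves, so the allocator would never pay for a rescan that cannot find new holes.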

@nedbass
Contributor Author

nedbass commented Apr 13, 2016

@adilger That's a good point. I wonder if metadata compression makes it less of an issue though. I think zeroed-out blocks do not actually consume space on disk. And mostly-zero blocks should compress well. There is still a memory overhead problem with a very sparse metadnode if the working set is spread across the entire dnode space.
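
To give a feel for how well mostly-zero blocks compress, here is a small experiment using Python's `zlib` as a stand-in compressor. The 16 KiB block size and 512-byte dnode size are assumptions, and zlib is not necessarily the algorithm ZFS would apply to these blocks, so the numbers are only indicative.

```python
import os
import zlib

BLOCK_SIZE = 16 * 1024   # assumed dnode block size
DNODE_SIZE = 512         # assumed dnode size

empty = bytes(BLOCK_SIZE)                    # block with every dnode freed
sparse = bytearray(BLOCK_SIZE)               # block with one live dnode
sparse[:DNODE_SIZE] = os.urandom(DNODE_SIZE)
full = os.urandom(BLOCK_SIZE)                # worst case: incompressible contents

for name, block in (("empty", empty), ("sparse", bytes(sparse)), ("full", full)):
    print(f"{name:6s} {len(zlib.compress(block)):6d} bytes compressed")
```

The sparse block compresses to roughly the size of its one live dnode, which supports the point about disk space. It does not help with the in-memory cost mentioned in the same comment.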

@adilger
Contributor

adilger commented Apr 13, 2016 via email

@ahrens
Member

ahrens commented Apr 19, 2016

@adilger

the problem with using this patch in production is that it will result in monotonically increasing dnode numbers, and for a workload that is creating and deleting files continuously the metadnode would become very large and sparse, until the filesystem is remounted

Yes

track the number of freed dnodes (in-memory per-pool counter) and reset the scanning once it hits some threshold

That's a good idea.

@behlendorf
Contributor

the problem with using this patch in production is that it will result in monotonically increasing dnode numbers, and for a workload that is creating and deleting files continuously the metadnode would become very large and sparse, until the filesystem is remounted

Yes, but is having a sparse dnode object actually a real problem? Sure, it's not the ideal long term fix but aside from possibly slightly worse memory utilization this seems like it wouldn't cause any issues.

That said, I'm happy to hold off on doing anything here. If we get a little time I agree it would be interesting to take a crack at what @adilger suggested, which is nice and simple.

@adilger
Contributor

adilger commented Apr 22, 2016

I think there are a few potential drawbacks of never trying to backfill in a workload that does both creates and unlinks:

  • depending on the workload the leaf blocks may not be totally empty, so the whole block will be kept in cache even if only one dnode is left, which might consume a fair amount of memory. It is better to try and keep those blocks more full.
  • metadnode gets very large and sparse, which increases the depth of the tree needed to address leaf blocks which (possibly, I don't recall the exact implementation) will increase the transaction size for each dnode update based on the max file size. That is a relatively small effect, if any (it might be that the transaction size is based on the number of indirect blocks to rewrite any data block for the maximum possible file size instead of the actual file size)
  • if the metadnode gets too large and sparse, it will take more IOPS to do scrubs and other dnode traversals, since it could only get a few dnodes per leaf block, while a dense metadnode will get many dnodes from disk for each read.
  • if we wait too long before trying to backfill (e.g. after remount instead of after a few tens of seconds if there are many files being deleted) then the creation of new files will have to read these many metadnode blocks from disk to fill in space, rather than getting them from ARC

@ahrens
Member

ahrens commented Apr 22, 2016

@adilger

  • The metadnode unfortunately has a fixed depth.
  • Sparseness hurts other operations that need to visit all dnodes (in addition to scrub), most notably zfs send and zfs destroy. These could devolve into 1 dnode per block (or even 1 dnode per 2 blocks -- if there's only one object under each L1 indirect block!).
  • You could fill up the metadnode and reach the limit of # files per filesystem (2^48) - though admittedly it would take 45 years at 200,000 creations/second.
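
The 45-year estimate in the last bullet can be checked directly. The only assumption added here is a 365.25-day year.

```python
MAX_OBJECTS = 2 ** 48            # per-filesystem object limit quoted above
CREATES_PER_SECOND = 200_000     # creation rate quoted above
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = MAX_OBJECTS / CREATES_PER_SECOND / SECONDS_PER_YEAR
print(f"{years:.1f} years")      # 44.6 years
```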

@bzzz77
Contributor

bzzz77 commented Apr 27, 2016

is it correct to say that the major issue is dnode_next_offset_level() being unable to detect just-allocated dnodes?

@nedbass
Contributor Author

nedbass commented Apr 27, 2016

@bzzz77 yes that's correct. A secondary issue is that dmu_object_alloc() doesn't handle ESRCH returned from dmu_object_next(). So if an L1 BP's worth of dnode blocks is full, as will be the case with a create-only workload, it still calls dnode_hold_impl() on every one.

@bzzz77
Contributor

bzzz77 commented Apr 27, 2016

then what if we have an in-memory structure tracking allocations in current TXG and consult with that? then TXG sync would release that structure

@nedbass
Contributor Author

nedbass commented Apr 27, 2016

@bzzz77 I think that's exactly what @ahrens was proposing to do using range trees.
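
The in-memory tracking idea from this exchange could look roughly like the sketch below. It uses a plain Python set where the thread suggests range trees, and `TxgAllocTracker` and its methods are hypothetical names; the point is only that the hole search consults pending allocations and that sync releases them.

```python
class TxgAllocTracker:
    """Tracks objects allocated in the open TXG so the hole search can skip them."""

    def __init__(self, synced):
        self.synced = set(synced)   # allocations already visible on disk
        self.pending = set()        # allocated in the current TXG, not yet synced

    def alloc(self, obj):
        self.pending.add(obj)

    def is_free(self, obj):
        return obj not in self.synced and obj not in self.pending

    def next_hole(self, start):
        """First object at or after start that is free in both views."""
        obj = start
        while not self.is_free(obj):
            obj += 1
        return obj

    def sync(self):
        """TXG sync: pending allocations reach disk and the tracker is released."""
        self.synced |= self.pending
        self.pending.clear()


tracker = TxgAllocTracker(synced={0, 1, 2})
tracker.alloc(3)
tracker.alloc(4)
print(tracker.next_hole(0))  # 5: objects 3 and 4 are skipped although unsynced
tracker.sync()
print(tracker.next_hole(0))  # 5: same answer, now from the synced view alone
```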

@adilger
Contributor

adilger commented Apr 28, 2016

Even if there is an efficient structure for tracking in-progress allocations, I think there is still a benefit from not doing any scanning of metadnode blocks if there aren't any files being unlinked. For HPC at least, there may be a few 100k's of file creates in one group, and then a similar number of deletes in a later group, so rescanning the metadnode for holes repeatedly during creation is wasteful unless there is some reason to expect that there are new holes (i.e. some reasonable number of dnodes have been deleted since the last time the metadnode was scanned).

@bzzz77
Contributor

bzzz77 commented Apr 28, 2016

yes, obviously it makes sense to track deletes in some way as well.

@nedbass
Contributor Author

nedbass commented May 28, 2016

Closing in favor of #4711

@nedbass nedbass closed this May 28, 2016