
Hierarchical bandwidth and operations rate limits. #16205

Open · wants to merge 2 commits into master from ratelimits

Conversation

@pjd (Contributor) commented May 17, 2024

Introduce six new properties: limit_{bw,op}_{read,write,total}.

The limit_bw_* properties limit the read, write, or combined bandwidth, respectively, that a dataset and its descendants can consume. Limits are applied to both file systems and ZFS volumes.

The configured limits are hierarchical, just like quotas; i.e., even if a higher limit is configured on the child dataset, the parent's lower limit will be enforced.

The limits are applied at the VFS level, not at the disk level. The dataset is charged for each operation even if no disk access is required (e.g., due to caching, compression, deduplication, or NOP writes) or if the operation will cause more traffic (due to the copies property, mirroring, or RAIDZ).
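One practical consequence, sketched below with a hypothetical pool/dataset layout (tank/ds and the 100M figure are only illustrative): because the charge happens at the VFS layer, even a read served entirely from the ARC counts against the limit.

$ zfs set limit_bw_read=100M tank/ds
$ dd if=/tank/ds/bigfile of=/dev/null bs=1M   # first read: throttled to ~100 MB/s
$ dd if=/tank/ds/bigfile of=/dev/null bs=1M   # repeat read served from the ARC: still ~100 MB/s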

Read bandwidth consumption is based on:

  • read-like syscalls, e.g., aio_read(2), pread(2), preadv(2), read(2), readv(2), sendfile(2)

  • syscalls like getdents(2) and getdirentries(2)

  • reading via mmapped files

  • zfs send

Write bandwidth consumption is based on:

  • write-like syscalls, e.g., aio_write(2), pwrite(2), pwritev(2), write(2), writev(2)

  • writing via mmapped files

  • zfs receive

The limit_op_* properties limit the read, write, or combined metadata operations, respectively, that a dataset and its descendants can generate.

Read operations consumption is based on:

  • read-like syscalls where the number of operations is equal to the number of blocks being read (never less than 1)

  • reading via mmapped files, where the number of operations is equal to the number of pages being read (never less than 1)

  • syscalls accessing metadata: readlink(2), stat(2)

Write operations consumption is based on:

  • write-like syscalls where the number of operations is equal to the number of blocks being written (never less than 1)

  • writing via mmapped files, where the number of operations is equal to the number of pages being written (never less than 1)

  • syscalls modifying a directory's content: bind(2) (UNIX-domain sockets), link(2), mkdir(2), mkfifo(2), mknod(2), open(2) (file creation), rename(2), rmdir(2), symlink(2), unlink(2)

  • syscalls modifying metadata: chflags(2), chmod(2), chown(2), utimes(2)

  • updating the access time of a file when reading it

Just like limit_bw_* limits, the limit_op_* limits are also hierarchical and applied at the VFS level.
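
A minimal usage sketch of the hierarchical behavior (dataset names are hypothetical; suffixes such as M are accepted for the limit_bw_* properties, as noted later in this thread):

$ zfs set limit_bw_write=200M tank/customers
$ zfs set limit_bw_write=500M tank/customers/alice
# The child's higher 500M setting is not an escape hatch: limits are
# hierarchical, so writes to tank/customers/alice remain capped by the
# parent's 200M.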


@tonyhutter (Contributor)

Some first-pass questions/comments:

Maybe I missed it, but do you specify the units anywhere? Are the limit_bw_* props in units of bytes/second and limit_op_* in ops/second? Can you mention the units in the man pages?

If you specify 100MB/s and the system is idle with 8GB/s of bandwidth available, will it still only use 100MB/s?

What happens if someone specifies a *_total value that is larger/smaller than *_read or *_write combined? Does it just get capped to the minimum?

Does anything bad happen if you cap it to something super low, like 1 byte/sec?

The zfsprops.7 man page is roughly in alphabetical order. Could you move the limit_* section to just after the keylocation sections?

Is copy_file_range() counted?

@pjd (Contributor, author) commented May 20, 2024

Maybe I missed it, but do you specify the units anywhere? Are the limit_bw_* props in units of bytes/second and limit_op_* in ops/second? Can you mention the units in the man pages?

Sure, I'll add that.

If you specify 100MB/s and the system is idle with 8GB/s of bandwidth available, will it still only use 100MB/s?

Correct.

What happens if someone specifies a *_total value that is larger/smaller than *_read or *_write combined? Does it just get capped to the minimum?

Correct. The lowest limit will always be enforced. The same applies if the parent has a lower limit than the child or children combined.
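
For example (hypothetical dataset, illustrative numbers):

$ zfs set limit_bw_read=300M tank/ds
$ zfs set limit_bw_write=300M tank/ds
$ zfs set limit_bw_total=100M tank/ds
# Reads and writes combined are then held to about 100M/s, because the
# lowest applicable limit (here the total) is the one that gets enforced.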

Does anything bad happen if you cap it to something super low, like 1 byte/sec?

It cannot be lower than the resolution (which is 16 per second), so it will be rounded up to 16, but it will also allocate a large number of slots to keep the history; in this case, one slot per byte of each pending request.

The zfsprops.7 man page is roughly in alphabetical order. Could you move the limit_* section to just after the keylocation sections?

Sure.

Is copy_file_range() counted?

It was counted when it fell back to read/write, but it wasn't counted in the block cloning case, which I think it should be, so I just added it.

@tonyhutter (Contributor)

I did some hand testing of this and it works just as described 👍

$ ./zpool create -f tank -O compression=off /dev/nvme{0..8}n1 && dd if=/dev/zero of=/tank/bigfile conv=sync bs=1M count=10000 && ./zpool destroy tank 
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 3.91677 s, 2.7 GB/s


$ ./zpool create -f tank -O compression=off /dev/nvme{0..8}n1 && ./zfs set limit_bw_write=$((1024 * 1024 * 200)) tank && dd if=/dev/zero of=/tank/bigfile conv=sync bs=1M count=10000 && ./zpool destroy tank
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 50.9869 s, 206 MB/s

I also verified that it worked for multithreaded writes, and verified that a top-level dataset's values correctly overrode its children's values.

@pjd (Contributor, author) commented May 24, 2024

I did some hand testing of this and it works just as described 👍

$ ./zpool create -f tank -O compression=off /dev/nvme{0..8}n1 && dd if=/dev/zero of=/tank/bigfile conv=sync bs=1M count=10000 && ./zpool destroy tank 
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 3.91677 s, 2.7 GB/s


$ ./zpool create -f tank -O compression=off /dev/nvme{0..8}n1 && ./zfs set limit_bw_write=$((1024 * 1024 * 200)) tank && dd if=/dev/zero of=/tank/bigfile conv=sync bs=1M count=10000 && ./zpool destroy tank
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 50.9869 s, 206 MB/s

I also verified that it worked for multithreaded writes, and verified that a top-level dataset's values correctly overrode its children's values.

Thank you!

BTW, you can use suffixes for the limit_bw_* properties, e.g. limit_bw_write=200M.

Comment on lines -11 to -13
# This run file contains all of the common functional tests. When
# adding a new test consider also adding it to the sanity.run file
# if the new test runs to completion in only a few seconds.
Contributor:

Did you want this change included in this PR?

Contributor (author):

Not really part of this PR, but I was hoping to smuggle it in instead of creating a separate PR for it:)

@pjd force-pushed the ratelimits branch 2 times, most recently from ce9c37a to 61f0f95 on May 24, 2024 at 22:29
@robn (Member) left a comment

I've done a first pass and added some fairly minor comments. Overall the structure looks pretty sensible and straightforward.

I'll do a more thorough read of the logic soon, hopefully later today. I expect to be sticking a +1 on this before long.

Good stuff!

extern "C" {
#endif

struct vfs_ratelimit;
Member:

This barely matters, but we don't have anything else really called vfs_. Should this be zfs_ratelimit, or zfs_vfs_ratelimit, or something like that?

(and change the function names etc to match, of course)

Member:

FOLLOWUP: Having just read the zvol changes, I'm even more convinced this should not say vfs; it should just be zfs_ratelimit, not vfs or anything else. vfs is sort of a strange term to see in the zvol code.

Contributor (author):

Of course I wanted to use zfs_ratelimit, but it is already taken:) See module/zfs/zfs_ratelimit.c. I'm open to changing the name.

* would be calling vfs_ratelimit_data_read() way too often and each
* call accounts for a single operation.
*/
vfs_ratelimit_data_read(os, zp->z_blksz, outcount);
Member:

Why are directory entries accounted as a data read and not a metadata read? Maybe it's a question of the definition of metadata.

  • anything not file contents: directory entries are metadata
  • anything managed by the user: directory entries are data
  • data created by the user vs its overhead: filenames are data, struct dirent is metadata

I don't think you necessarily should change it, so long as it's consistent. For example, I see getattr/setattr are accounted as metadata reads. They're more like the dirent part perhaps? On the other hand, filenames vary in length as the user wishes, which makes them more obviously like data.

I don't want to think very hard, and I don't think it matters too much as long as there are no glaring loopholes. So if you tell me "yes, I thought about it, they're all consistent", then that's good enough for me :)

Contributor (author):

The reason I decided to treat directory content as data is that directories can be big and require a lot of disk bandwidth. All the other metadata operations are just single-block updates. Also note that each data operation is accounted against both the bandwidth and the operations limits, where operations = total_size / record_size.
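
For scale (assuming a 128 KiB recordsize, used here purely as an example value): a single 1 MiB data read would be charged 1 MiB against limit_bw_read and 1 MiB / 128 KiB = 8 operations against limit_op_read.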

@@ -304,6 +387,10 @@ dsl_dir_hold_obj(dsl_pool_t *dp, uint64_t ddobj,
dd = winner;
} else {
spa_open_ref(dp->dp_spa, dd);

if (dd->dd_myname[0] != '$') {
Member:

FYI, I considered excluding '%' here as well (temporary snapshots), and I think I come down on the side of leaving that out, i.e., no change. They're not really seen by the user, but they are real work, and so should be counted.

.It
reading via mmaped files
.It
.Nm zfs Cm send
Member:

Similar to above, maybe mention snapshot mounts?

@amotin (Member) commented Jun 12, 2024

Does anything bad happen if you cap it to something super low, like 1 byte/sec?

It cannot be lower than the resolution (which is 16 per second), so it will be rounded up to 16, but it will also allocate a large number of slots to keep the history; in this case, one slot per byte of each pending request.

I am not sure what Tony's original concern was, but my first thought was that a 1 MB write operation may get stuck in the kernel for 11 days. I am only starting to read the code, and I can only hope that the wait is interruptible; otherwise it may become an administrative nightmare.
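(For scale: 1 MB at 1 byte/second is about 10^6 seconds, i.e. roughly 11.5 days; even at the 16-per-second floor mentioned above it would still be on the order of 18 hours.)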

@pjd (Contributor, author) commented Jun 12, 2024

Does anything bad happen if you cap it to something super low, like 1 byte/sec?

It cannot be lower than the resolution (which is 16 per second), so it will be rounded up to 16, but it will also allocate a large number of slots to keep the history; in this case, one slot per byte of each pending request.

I am not sure what Tony's original concern was, but my first thought was that a 1 MB write operation may get stuck in the kernel for 11 days. I am only starting to read the code, and I can only hope that the wait is interruptible; otherwise it may become an administrative nightmare.

It was not interruptible on purpose initially, as I assumed it would be easy to generate more traffic than configured, but as long as we do the accounting before we go to sleep we should be fine. There are a few cases where I do the accounting after the fact, but those are not the main ones. I'll make it interruptible.

@@ -3294,6 +3327,9 @@ zfs_symlink(znode_t *dzp, char *name, vattr_t *vap, char *link,
zfs_exit(zfsvfs, FTAG);
return (SET_ERROR(EDQUOT));
}

vfs_ratelimit_metadata_write(zfsvfs->z_os);
Member:

An observation: you account for file names when directories are read, but not when they are written or when symlinks are created. I think at least symlink targets, like extended attributes, could be considered data, at least where it is easy to do.

Comment on lines +207 to +208
if (strcmp(myname, setpoint) != 0) {
/* Property is not set here. */
Member:

I am not very familiar with this, but do we need all this name magic, considering that the properties are not inheritable? Will dsl_prop_get_dd() ever return the parent's or something else's values in some cases?

@snajpa (Contributor) commented Jun 21, 2024

While some way to limit IO with OpenZFS is badly needed, I think this is not going to be very useful for setups that pass the advantages of the ARC and write-back buffering/optimizations on to the user. I'd like to communicate an idea rather than any negativity; this is awesome work, and with the world moving on to NVMe-only setups, it could get a pass even in our lower-cost environment eventually :)

There's the obvious catch of pairing ZIOs to objsets, but I'm thinking: can't that be solved using the SPL's TSD (thread-specific data)? You've done the hardest parts I didn't want to do (identifying the accounting points plus all the infrastructure for limits management) to get to the point of being able to actually prove this idea, so I might just go and try that on top of whatever eventually lands; but if you have the time, you could probably try it way sooner before I get to it. The idea is to mark the thread with the specific objset ID when entering the ZPL, in the top layers where that info is available, and pull it out of TSD, probably in zio_create, directly into struct zio.

If you guys don't see any obvious reason why that approach won't work, the very least we can do now is put some thought into how the UI should look if we eventually get the ability to limit only L1/L2 read misses and synchronous writes, so it doesn't cause headaches later on...
limit_{bw,op}_write_sync, limit_{bw,op}_read_l1arc_miss, limit_{bw,op}_read_l2arc_miss? I think that could work :)

@pjd (Contributor, author) commented Sep 8, 2024

If you guys don't see any obvious reason why that approach won't work, the very least we can do now is put some thought into how the UI should look if we eventually get the ability to limit only L1/L2 read misses and synchronous writes, so it doesn't cause headaches later on... limit_{bw,op}_write_sync, limit_{bw,op}_read_l1arc_miss, limit_{bw,op}_read_l2arc_miss? I think that could work :)

One of the goals of this design was to provide predictability. When you move the benefits of the ARC to the user and we don't charge the user for what's already in the ARC, it will be hard to tell what to expect. Also, if we do that, why not move the benefits of deduplication, NOP writes, block cloning, and compression to the user as well? And while we are at it, why not charge the user for some extra costs, like the write inflation caused by RAIDZ, mirroring, or the copies property, or even metadata? Not to mention that limiting writes too low is a problem when we are in the TXG syncing context, as we don't want to slow down the whole transaction.

This design, I think, fits better in environments where most people would like to use it - where you share the storage among many independent consumers like VMs or containers. As a storage provider you get the benefits of compression, deduplication, etc., but you also absorb the cost of various inflations that would be hard to justify to the consumers. Also, with this design it is trivial for a consumer to verify that they are getting the limits they see configured.

@pjd (Contributor, author) commented Sep 10, 2024

Does anything bad happen if you cap it to something super low, like 1 byte/sec?

It cannot be lower than the resolution (which is 16 per second), so it will be rounded up to 16, but it will also allocate a large number of slots to keep the history; in this case, one slot per byte of each pending request.

I am not sure what Tony's original concern was, but my first thought was that a 1 MB write operation may get stuck in the kernel for 11 days. I am only starting to read the code, and I can only hope that the wait is interruptible; otherwise it may become an administrative nightmare.

I reimplemented all the vfs_ratelimit functions to be interruptible.

@snajpa (Contributor) commented Sep 10, 2024

@pjd actually the use case for this for us (vpsfree.org) is to rate-limit containers, which mirror the original Zones/Jails/OpenVZ motivations - there is an admin for each of those separate virtual environments - and they manage these containers as if they were a full VM (even running their own lxd + docker, etc., nested down).

A full VM would give those individual admins VFS caching, but if we limit VFS ops instead of IO ops, they're getting hit by a limit that isn't there when you run a full VM. Consider static web serving as a simple example: it relies heavily on VFS caching, and a performance-wise sane design is to ensure that frequently touched data fits in RAM. People usually don't have to resort to explicitly caching the contents (and metadata) of those files/directories in userspace; the kernel does it for them well enough to cover the gist of such use cases.

Arguably, ZFS should be even better for these use cases due to its MRU/MFU split by design.

Now, when we're limiting IOPS or IO bandwidth, it's due to actual backing-storage limitations; the only limitation RAM has is its capacity. The backing devices are what is bandwidth/IOPS constrained, and when thinking of QoS, this is where I'd like to ensure fairness the most. I see some of our containers pulling multiple GB/s from the ARC regularly in peak hours, and I think it's wonderful that they're able to do that - all while not touching the backing store. I don't think we can find a set of values for limits at the VFS layer that really ensures fairness on the backend while not severely limiting the frontend case (where data is already in RAM), when the difference between those two in absolute numbers is still this high.

Once the hardware catches up with RAM, then arguably the ARC is basically relevant only for caching decompressed/decrypted hot (meta)data, but I would still like to push that benefit towards the admins of our containers, because it's still what they would get from a full VM, so it is what they're expecting.

While backing-store QoS is made hard-mode in ZFS due to the ZFS IO pipeline thread split, I think that is actually the area where it makes the most sense to focus, given the general containers use case especially. Even for docker/k8s/etc. Moreover, QEMU and other hypervisors can already do IO limiting on their own... so containers remain the primary concern - and they, I think, are used to their VFS cache being a cache.

And yes, being able to partition the ARC, or actually to have upper bounds on possible ARC usage for a single container, would be even better to have. And nice to have, too. But in practice this has never been a concern; I only ever dug around the dbuf kstats out of curiosity (to see how much ARC each container occupies and how that evolves during the day) - I thought that would be a concern for us too, but it wasn't in the end.

I can imagine it being a concern when the ARC is sized below the full working-set size. But frankly we like the ARC and are giving it 12.5-25% of system RAM on every box (12.5% on systems with >1T; 25% for 256G to 1T systems), so we get consistently awesome hit rates.

@snajpa (Contributor) commented Sep 10, 2024

Also, if we do that, why not move the benefits of deduplication, NOP writes, block cloning, and compression to the user as well?

To the best of my knowledge, these are all actually accounted in a way that already benefits the user. For example, there is no quota I could set that would cap the uncompressed size; it's the compressed size that is under the quota (and that also gets presented to userspace). The same goes for all the rest, I believe.

@allanjude (Contributor)

Also, if we do that, why not move the benefits of deduplication, NOP writes, block cloning, and compression to the user as well?

To the best of my knowledge, these are all actually accounted in a way that already benefits the user. For example, there is no quota I could set that would cap the uncompressed size; it's the compressed size that is under the quota (and that also gets presented to userspace). The same goes for all the rest, I believe.

User quotas in ZFS are on the logical data, not the compressed (physical) data.

@snajpa (Contributor) commented Sep 11, 2024

@allanjude I've never used those, and most people never do; those are user quotas, where there are some preexisting expectations that webhosters especially have... but OK, that's one area I didn't know about. All the rest is there to benefit the user, not the provider; I'd say that user-level quotas are a huge exception. Can you find another?

I think it would definitely be nice to have lsize-based quotas for datasets, FWIW.

But I also said I'd go and try to implement the idea with TSD marking + ZIO objset ID propagation; I'm not saying anyone other than me should do the work, I wouldn't even dare to think that :) I'm just asking for some consideration of whether we'd be okay with an additional set of limits with suffixes (it's those _misses that are the most interesting to me)... and whether anyone sees an obvious reason why that approach would be doomed even before I try :)

@@ -1642,8 +1673,7 @@ zfs_rmdir_(vnode_t *dvp, vnode_t *vp, const char *name, cred_t *cr)
error = dmu_tx_assign(tx, TXG_WAIT);
if (error) {
dmu_tx_abort(tx);
- zfs_exit(zfsvfs, FTAG);
- return (error);
+ goto out;
Member:

What are we committing to ZIL here in case of error?

Contributor (author):

Nothing. This function is a mess. You can see above that we jump to out even when the file type != VDIR and then do zil_commit(). We should only do zil_commit() when error == 0. Unfortunately there are more cases like this (where we either commit to the ZIL on error or update atime on error).

vnevent_rmdir(vp, dvp, name, ct);

vfs_ratelimit_metadata_write(zfsvfs->z_os);
Member:

I see that on both FreeBSD and Linux you account some removals as two operations. Could you shed some light on why? You are not doing it for creations, so why removals?

Contributor (author):

Hmm, this was some kind of mismerge. There should be no vfs_ratelimit_*() calls without checking for error.

Comment on lines 4085 to +4087
if (zp->z_atime_dirty && zp->z_unlinked == B_FALSE) {
if (vfs_ratelimit_metadata_write(zfsvfs->z_os) != 0) {
goto out;
Member:

I am not very well versed in the Linux VFS inode lifecycle, but I have a suspicion that zfs_inactive() may not be a good place to rate-limit atime updates. I worry that it may penalize the host system, not the user who caused it. Please correct me if I am wrong.

* vfs_ratelimit_data_read_spin() will sleep in short periods and return
* immediately when a signal is pending.
*/
vfs_ratelimit_data_read_spin(os, 0, BP_GET_LSIZE(bp));
@amotin (Member) commented Sep 12, 2024

Unrelated to this particular chunk, I wonder about the expected practice of backing up and restoring datasets that have rate limits. From a server admin standpoint, I am not sure I'd like my backup, and especially my restore, procedures to be throttled by user-oriented limits, or even accounted towards them.

ratelimit_is_none(const uint64_t *limits)
{

for (int i = ZFS_RATELIMIT_FIRST; i < ZFS_RATELIMIT_NTYPES; i++) {
@amotin (Member) commented Sep 12, 2024

Suggested change:
- for (int i = ZFS_RATELIMIT_FIRST; i < ZFS_RATELIMIT_NTYPES; i++) {
+ for (int i = ZFS_RATELIMIT_FIRST; i <= ZFS_RATELIMIT_LAST; i++) {

The effect is the same; it's just that IMHO using ZFS_RATELIMIT_NTYPES in combination with ZFS_RATELIMIT_FIRST is wrong. If you prefer ZFS_RATELIMIT_NTYPES, then I'd start from 0. Both ways assume there are no holes, though. The same applies below in vfs_ratelimit_alloc().

}

rl->rl_timeslot[type] += count / rl->rl_limits[type];
rl->rl_reminder[type] = count % rl->rl_limits[type];;
Member:

Suggested change:
- rl->rl_reminder[type] = count % rl->rl_limits[type];;
+ rl->rl_reminder[type] = count % rl->rl_limits[type];

Comment on lines +469 to +475
rrm_enter_read(&os->os_spa->spa_ratelimit_lock, FTAG);

timeslot = ratelimit_account_all(os, counts);

rrm_exit(&os->os_spa->spa_ratelimit_lock, FTAG);

return (ratelimit_sleep(timeslot));
Member:

I think it would be nice if, in case of an abort, we undid the accounting already done; otherwise following requests will still wait for whatever insane time we could have been given.

Comment on lines -80 to +82
- static uint64_t zfs_vnops_read_chunk_size = 1024 * 1024;
+ //static uint64_t zfs_vnops_read_chunk_size = 1024 * 1024;
+ static uint64_t zfs_vnops_read_chunk_size = 1024 * 512;
Member:

What's this? A leftover from debugging?

*/
#ifdef _KERNEL
if (delay_sig((hz / RATELIMIT_RESOLUTION) * (timeslot - now))) {
error = EINTR;
Member:

Suggested change:
- error = EINTR;
+ error = SET_ERROR(EINTR);
