Commit 4bceb28

aggregate reads across gaps (openzfs#828)
The zettacache aggregates contiguous reads, issuing one large `pread()` syscall instead of several small, contiguous `pread()`s. This improves performance by decreasing the iops load. However, if there are small gaps between the blocks we need to read, aggregation is not performed, which decreases performance in two ways: the iops load increases, and the implementation issues subsequent almost-contiguous i/os (up to `DISK_AGG_CHUNK` apart) serially, i.e. one after another rather than in parallel.

This commit improves performance by allowing small gaps (up to `DISK_READ_GAP_LIMIT` = 8KB) to be spanned to create a larger aggregate read. On a workload of 1MB reads from 4 files (with recordsize=8k), which were written concurrently, performance doubles: on disks that can do 15K IOPS, we go from 14K IOPS to 33K IOPS.
1 parent dd1e7cd commit 4bceb28

File tree

1 file changed: +17 −9 lines
cmd/zfs_object_agent/zettacache/src/block_access.rs

```diff
@@ -53,6 +53,10 @@ tunable! {
     // Stop aggregating if run would exceed DISK_READ/WRITE_MAX_AGGREGATION_SIZE
     pub static ref DISK_READ_MAX_AGGREGATION_SIZE: ByteSize = ByteSize::kib(128);
     pub static ref DISK_WRITE_MAX_AGGREGATION_SIZE: ByteSize = ByteSize::kib(128);
+    // Typical disks and instances will max out iops and throughput simultaneously at 16KB i/o
+    // size. By keeping the gap limit below this, we will not hit the throughput limit due to
+    // unneeded gaps.
+    static ref DISK_READ_GAP_LIMIT: ByteSize = ByteSize::kib(8);
     // CHUNK must be > MAX_AGG_SIZE, see Disk::write()
     static ref DISK_AGG_CHUNK: ByteSize = ByteSize::mib(1);
     static ref DISK_WRITE_QUEUE_EMPTY_DELAY: Duration = Duration::from_millis(1);
@@ -287,19 +291,23 @@ impl Disk {
         } else {
             return None;
         };
+        assert_ne!(len, 0);
         for (&offset, message) in iter {
-            if len > 0 && len + message.size > DISK_READ_MAX_AGGREGATION_SIZE.as_usize() {
-                break;
-            }
             if message.io_type != io_type {
-                break;
+                break; // different type
             }
-            if offset == run[0] + len as u64 {
-                run.push(offset);
-                len += message.size;
-            } else {
-                break;
+            let Some(gap) = offset.checked_sub(run[0] + len as u64) else {
+                break; // message is before the run
+            };
+            let gap = gap.as_usize();
+            if gap > DISK_READ_GAP_LIMIT.as_usize() {
+                break; // gap is too large
+            }
+            if len + gap + message.size > DISK_READ_MAX_AGGREGATION_SIZE.as_usize() {
+                break; // run is too large
             }
+            run.push(offset);
+            len += gap + message.size;
         }
         Some((run, len, io_type))
     }
```
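The gap-spanning control flow in the new loop can be sketched as a standalone function. This is a simplified illustration, not the actual implementation: the `aggregate_reads` name, the `(offset, size)` slice representation, and the plain `u64` constants are hypothetical stand-ins (the real code iterates a map of pending messages and uses `ByteSize` tunables), but the gap and size checks mirror the diff above.

```rust
// Limits corresponding to DISK_READ_GAP_LIMIT (8 KiB) and
// DISK_READ_MAX_AGGREGATION_SIZE (128 KiB) in the diff above.
const READ_GAP_LIMIT: u64 = 8 * 1024;
const READ_MAX_AGGREGATION_SIZE: u64 = 128 * 1024;

/// Given pending reads as sorted (offset, size) pairs, return how many of the
/// leading reads can be issued as a single aggregated pread(), and the total
/// length of that aggregate, including any spanned gaps.
fn aggregate_reads(pending: &[(u64, u64)]) -> (usize, u64) {
    let mut iter = pending.iter();
    let Some(&(start, first_size)) = iter.next() else {
        return (0, 0); // nothing pending
    };
    let mut count = 1;
    let mut len = first_size;
    for &(offset, size) in iter {
        // Distance from the end of the current run to the next read;
        // checked_sub fails if the next read starts before the run ends.
        let Some(gap) = offset.checked_sub(start + len) else {
            break; // read is before the end of the run
        };
        if gap > READ_GAP_LIMIT {
            break; // gap is too large to be worth spanning
        }
        if len + gap + size > READ_MAX_AGGREGATION_SIZE {
            break; // aggregate would exceed the cap
        }
        count += 1;
        len += gap + size;
    }
    (count, len)
}
```

For example, reads at offsets 0 and 12K (8K each) are separated by a 4K gap, so they aggregate into one 20K read; a following read at 40K sits past the 8K gap limit and starts a new aggregate.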
