Commit 4bceb28

aggregate reads across gaps (openzfs#828)
The zettacache aggregates contiguous reads, issuing one large `pread()` syscall instead of several small, contiguous `pread()`s. This improves performance by decreasing the iops load. However, if there are small gaps between the blocks we need to read, aggregation is not performed, which decreases performance in two ways: the iops load increases, and the implementation issues subsequent almost-contiguous i/os (up to `DISK_AGG_CHUNK` apart) serially, i.e. one after another rather than in parallel.

This commit improves performance by allowing small gaps (up to `DISK_READ_GAP_LIMIT` = 8KB) to be spanned to create a larger aggregate read. On a workload of 1MB reads from 4 files (with recordsize=8k), which were written concurrently, performance doubles: on disks that can do 15K IOPS, we go from 14K IOPS to 33K IOPS.
1 parent dd1e7cd commit 4bceb28

File tree

1 file changed: +17 −9 lines
cmd/zfs_object_agent/zettacache/src/block_access.rs

```diff
@@ -53,6 +53,10 @@ tunable! {
     // Stop aggregating if run would exceed DISK_READ/WRITE_MAX_AGGREGATION_SIZE
     pub static ref DISK_READ_MAX_AGGREGATION_SIZE: ByteSize = ByteSize::kib(128);
     pub static ref DISK_WRITE_MAX_AGGREGATION_SIZE: ByteSize = ByteSize::kib(128);
+    // Typical disks and instances will max out iops and throughput simultaneously at 16KB i/o
+    // size. By keeping the gap limit below this, we will not hit the throughput limit due to
+    // unneeded gaps.
+    static ref DISK_READ_GAP_LIMIT: ByteSize = ByteSize::kib(8);
     // CHUNK must be > MAX_AGG_SIZE, see Disk::write()
     static ref DISK_AGG_CHUNK: ByteSize = ByteSize::mib(1);
     static ref DISK_WRITE_QUEUE_EMPTY_DELAY: Duration = Duration::from_millis(1);
@@ -287,19 +291,23 @@ impl Disk {
         } else {
             return None;
         };
+        assert_ne!(len, 0);
         for (&offset, message) in iter {
-            if len > 0 && len + message.size > DISK_READ_MAX_AGGREGATION_SIZE.as_usize() {
-                break;
-            }
             if message.io_type != io_type {
-                break;
+                break; // different type
             }
-            if offset == run[0] + len as u64 {
-                run.push(offset);
-                len += message.size;
-            } else {
-                break;
+            let Some(gap) = offset.checked_sub(run[0] + len as u64) else {
+                break; // message is before the run
+            };
+            let gap = gap.as_usize();
+            if gap > DISK_READ_GAP_LIMIT.as_usize() {
+                break; // gap is too large
+            }
+            if len + gap + message.size > DISK_READ_MAX_AGGREGATION_SIZE.as_usize() {
+                break; // run is too large
             }
+            run.push(offset);
+            len += gap + message.size;
         }
         Some((run, len, io_type))
     }
```
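The gap-spanning control flow in the new loop can be sketched as a standalone function. This is a simplified illustration, not the actual implementation: the `aggregate_reads` name, the `(offset, size)` slice representation, and the plain `u64` constants are hypothetical stand-ins (the real code iterates a map of pending messages and uses `ByteSize` tunables), but the gap and size checks mirror the diff above.

```rust
// Limits corresponding to DISK_READ_GAP_LIMIT (8 KiB) and
// DISK_READ_MAX_AGGREGATION_SIZE (128 KiB) in the diff above.
const READ_GAP_LIMIT: u64 = 8 * 1024;
const READ_MAX_AGGREGATION_SIZE: u64 = 128 * 1024;

/// Given pending reads as sorted (offset, size) pairs, return how many of the
/// leading reads can be issued as a single aggregated pread(), and the total
/// length of that aggregate, including any spanned gaps.
fn aggregate_reads(pending: &[(u64, u64)]) -> (usize, u64) {
    let mut iter = pending.iter();
    let Some(&(start, first_size)) = iter.next() else {
        return (0, 0); // nothing pending
    };
    let mut count = 1;
    let mut len = first_size;
    for &(offset, size) in iter {
        // Distance from the end of the current run to the next read;
        // checked_sub fails if the next read starts before the run ends.
        let Some(gap) = offset.checked_sub(start + len) else {
            break; // read is before the end of the run
        };
        if gap > READ_GAP_LIMIT {
            break; // gap is too large to be worth spanning
        }
        if len + gap + size > READ_MAX_AGGREGATION_SIZE {
            break; // aggregate would exceed the cap
        }
        count += 1;
        len += gap + size;
    }
    (count, len)
}
```

For example, reads at offsets 0 and 12K (8K each) are separated by a 4K gap, so they aggregate into one 20K read; a following read at 40K sits past the 8K gap limit and starts a new aggregate.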
