ZFS utterly Slow on NVME (BTRFS vs ZFS) #16993

Open
Ramalama2 opened this issue Jan 27, 2025 · 25 comments
Labels: Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

Ramalama2 commented Jan 27, 2025

This may already be known to everyone, but something has to change, because ZFS is getting more and more unusable on fast NVMe drives.

System information

Distribution Name | Proxmox 8.3.3 (Debian 12)
Kernel Version | 6.8 / 6.11
Architecture | x64
OpenZFS Version | 2.2.7
CPU: Genoa 9374F
Memory: 12x 64GB Dimms (768GB Total)
Drives: 8x Micron 7450 Max 3,2TB / 2x Micron 7500 Max 3,2TB

I ran a lot of benchmarks, fio and complex MySQL queries, on ZFS, EXT4 and BTRFS.
Tested MariaDB version: 11.6
Total Database Space used: 446 GB

FIO Command:
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile

Query Command:
Complex SELECT queries with LEFT JOINs, INNER JOINs and GROUP BY, across multiple databases.


MariaDB 11.6 Results:

I tested 6 queries; they are all different, but "Query 1" and "Query 3" are hard to cache, so I would look only at these two.
After the VM starts, each query was run only 3 times and only the third result is measured.
(The first run after startup is even more painful for ZFS vs. BTRFS: BTRFS takes ~5 s, while ZFS takes over 15 minutes!)

MySQL query times (seconds):

          BTRFS 1x nvme  BTRFS Raid0    EXT4 1x nvme   ZFS 1x nvme   ZFS 1x nvme   ZFS Raid10
          (default)      2x nvme                       (default)     (tuned)       8x nvme (tuned)
Query 1:  3,823          3,778          3,749          629,239       158,180       162,816
Query 2:  0,023          0,022          0,028          0,028         0,025         0,025
Query 3:  0,329          0,323          0,332          74,994        41,417        42,032
Query 4:  0,030          0,028          0,034          0,034         0,030         0,030
Query 5:  0,022          0,022          0,028          0,029         0,025         0,025
Query 6:  0,023          0,022          0,028          0,028         0,025         0,025
Total:    4,25s          4,195s         4,199s         704,352s      199,702s      204,953s

FIO results (paired with the MySQL benchmarks above):

ZFS-Default means:
primarycache=metadata
Everything else default
ZFS-Tuned means:
primarycache=metadata
recordsize=16k
atime=off
xattr=sa
logbias=throughput
dnodesize=auto
redundant_metadata=most
Everything else default
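
For reference, the "tuned" properties above correspond to zfs set commands like the following minimal sketch (the dataset name tank/mysql is a placeholder, not from this setup):

zfs set primarycache=metadata tank/mysql
zfs set recordsize=16k tank/mysql
zfs set atime=off tank/mysql
zfs set xattr=sa tank/mysql
zfs set logbias=throughput tank/mysql
zfs set dnodesize=auto tank/mysql
zfs set redundant_metadata=most tank/mysql
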
ZFS-Default (single, 1x 7450, 7,9k IOPS, 20ms latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=30.9MiB/s][r=7915 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1284: Wed Jan 22 22:19:04 2025
  read: IOPS=7914, BW=30.9MiB/s (32.4MB/s)(1855MiB/60001msec)
    slat (usec): min=3, max=2406, avg=504.25, stdev=234.26
    clat (nsec): min=1773, max=26833k, avg=15665600.77, stdev=1402757.85
     lat (usec): min=543, max=27441, avg=16169.85, stdev=1425.07
    clat percentiles (usec):
     |  1.00th=[12387],  5.00th=[13304], 10.00th=[13960], 20.00th=[14484],
     | 30.00th=[15008], 40.00th=[15401], 50.00th=[15664], 60.00th=[16057],
     | 70.00th=[16319], 80.00th=[16909], 90.00th=[17433], 95.00th=[17695],
     | 99.00th=[19006], 99.50th=[20055], 99.90th=[21890], 99.95th=[22676],
     | 99.99th=[24773]
   bw (  KiB/s): min=29584, max=32920, per=100.00%, avg=31658.49, stdev=129.42, samples=476
   iops        : min= 7396, max= 8230, avg=7914.62, stdev=32.35, samples=476
  lat (usec)   : 2=0.01%, 4=0.01%, 750=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.02%, 20=99.48%, 50=0.49%
  cpu          : usr=0.20%, sys=3.38%, ctx=394391, majf=14, minf=433
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=474864,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=30.9MiB/s (32.4MB/s), 30.9MiB/s-30.9MiB/s (32.4MB/s-32.4MB/s), io=1855MiB (1945MB), run=60001-60001msec
ZFS-Tuned (single, 1x 7450, 27,7k IOPS, 10ms latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=108MiB/s][r=27.7k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1054: Wed Jan 22 23:36:47 2025
  read: IOPS=27.6k, BW=108MiB/s (113MB/s)(6460MiB/60001msec)
    slat (usec): min=3, max=3754, avg=144.08, stdev=74.25
    clat (nsec): min=1733, max=8822.9k, avg=4499481.31, stdev=449415.46
     lat (usec): min=6, max=9019, avg=4643.56, stdev=457.49
    clat percentiles (usec):
     |  1.00th=[ 3326],  5.00th=[ 3720], 10.00th=[ 3916], 20.00th=[ 4146],
     | 30.00th=[ 4293], 40.00th=[ 4424], 50.00th=[ 4555], 60.00th=[ 4621],
     | 70.00th=[ 4752], 80.00th=[ 4883], 90.00th=[ 5080], 95.00th=[ 5211],
     | 99.00th=[ 5407], 99.50th=[ 5538], 99.90th=[ 5800], 99.95th=[ 6259],
     | 99.99th=[ 7439]
   bw (  KiB/s): min=105912, max=129912, per=100.00%, avg=110265.88, stdev=987.47, samples=476
   iops        : min=26478, max=32478, avg=27566.47, stdev=246.87, samples=476
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=13.15%, 10=86.85%
  cpu          : usr=0.60%, sys=10.79%, ctx=1299849, majf=14, minf=491
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1653684,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=6460MiB (6773MB), run=60001-60001msec
ZFS-Tuned (Raid-10, 8x 7450, 26,7k IOPS, 10ms latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=104MiB/s][r=26.7k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1057: Wed Jan 22 22:26:15 2025
  read: IOPS=26.5k, BW=103MiB/s (108MB/s)(6204MiB/60001msec)
    slat (usec): min=3, max=2108, avg=150.01, stdev=74.80
    clat (nsec): min=1871, max=7816.5k, avg=4684713.57, stdev=424597.41
     lat (usec): min=182, max=7998, avg=4834.73, stdev=431.56
    clat percentiles (usec):
     |  1.00th=[ 3654],  5.00th=[ 3982], 10.00th=[ 4146], 20.00th=[ 4359],
     | 30.00th=[ 4490], 40.00th=[ 4621], 50.00th=[ 4686], 60.00th=[ 4817],
     | 70.00th=[ 4948], 80.00th=[ 5014], 90.00th=[ 5211], 95.00th=[ 5342],
     | 99.00th=[ 5604], 99.50th=[ 5669], 99.90th=[ 5932], 99.95th=[ 6128],
     | 99.99th=[ 6980]
   bw (  KiB/s): min=102320, max=109584, per=100.00%, avg=105884.57, stdev=374.69, samples=476
   iops        : min=25580, max=27396, avg=26471.14, stdev=93.67, samples=476
  lat (usec)   : 2=0.01%, 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=5.81%, 10=94.19%
  cpu          : usr=0.59%, sys=11.38%, ctx=1259655, majf=13, minf=276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1588287,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=103MiB/s (108MB/s), 103MiB/s-103MiB/s (108MB/s-108MB/s), io=6204MiB (6506MB), run=60001-60001msec
EXT4 (single, 1x 7450, 703k IOPS, 250µs latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=2747MiB/s][r=703k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1059: Wed Jan 22 17:23:54 2025
  read: IOPS=701k, BW=2739MiB/s (2872MB/s)(160GiB/60002msec)
    slat (usec): min=2, max=4011, avg= 3.49, stdev= 2.43
    clat (usec): min=5, max=4174, avg=178.80, stdev=62.09
     lat (usec): min=8, max=4178, avg=182.29, stdev=61.89
    clat percentiles (usec):
     |  1.00th=[   48],  5.00th=[  111], 10.00th=[  125], 20.00th=[  143],
     | 30.00th=[  155], 40.00th=[  163], 50.00th=[  172], 60.00th=[  180],
     | 70.00th=[  192], 80.00th=[  212], 90.00th=[  245], 95.00th=[  269],
     | 99.00th=[  330], 99.50th=[  363], 99.90th=[  766], 99.95th=[  988],
     | 99.99th=[ 1532]
   bw (  MiB/s): min= 2423, max= 3070, per=100.00%, avg=2740.79, stdev=28.37, samples=476
   iops        : min=620500, max=786132, avg=701642.39, stdev=7263.04, samples=476
  lat (usec)   : 10=0.01%, 20=0.03%, 50=1.09%, 100=2.42%, 250=87.56%
  lat (usec)   : 500=8.69%, 750=0.10%, 1000=0.06%
  lat (msec)   : 2=0.04%, 4=0.01%, 10=0.01%
  cpu          : usr=10.38%, sys=71.87%, ctx=661563, majf=17, minf=275
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=42069353,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
 
Run status group 0 (all jobs):
   READ: bw=2739MiB/s (2872MB/s), 2739MiB/s-2739MiB/s (2872MB/s-2872MB/s), io=160GiB (172GB), run=60002-60002msec
 
Disk stats (read/write):
  loop0: ios=42008706/21, sectors=336069648/432, merge=0/33, ticks=3829799/8, in_queue=3829806, util=100.00%
BTRFS (single, 1x 7450, 673k IOPS, 250µs latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=2629MiB/s][r=673k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1058: Wed Jan 22 20:06:34 2025
  read: IOPS=676k, BW=2643MiB/s (2771MB/s)(155GiB/60002msec)
    slat (usec): min=2, max=6080, avg= 3.97, stdev= 4.51
    clat (usec): min=9, max=6426, avg=184.95, stdev=56.29
     lat (usec): min=12, max=6429, avg=188.93, stdev=56.24
    clat percentiles (usec):
     |  1.00th=[  105],  5.00th=[  133], 10.00th=[  143], 20.00th=[  159],
     | 30.00th=[  169], 40.00th=[  176], 50.00th=[  180], 60.00th=[  184],
     | 70.00th=[  188], 80.00th=[  204], 90.00th=[  235], 95.00th=[  260],
     | 99.00th=[  318], 99.50th=[  359], 99.90th=[  490], 99.95th=[  660],
     | 99.99th=[ 2311]
   bw (  MiB/s): min= 2323, max= 2786, per=100.00%, avg=2645.18, stdev=20.80, samples=476
   iops        : min=594816, max=713440, avg=677164.82, stdev=5324.43, samples=476
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.08%, 100=0.84%, 250=92.45%
  lat (usec)   : 500=6.54%, 750=0.05%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=9.21%, sys=79.07%, ctx=431313, majf=0, minf=419
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=40591264,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
 
Run status group 0 (all jobs):
   READ: bw=2643MiB/s (2771MB/s), 2643MiB/s-2643MiB/s (2771MB/s-2771MB/s), io=155GiB (166GB), run=60002-60002msec
 
Disk stats (read/write):
  loop1: ios=40533616/22, sectors=324268936/432, merge=0/32, ticks=2531160/3, in_queue=2531164, util=99.95%
BTRFS (Raid-0, 2x 7500, 642k IOPS, 250µs latency)
Jobs: 3 (f=0): [f(1),_(1),f(2)][100.0%][r=2506MiB/s][r=642k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1132: Mon Jan 27 13:42:20 2025
  read: IOPS=660k, BW=2578MiB/s (2703MB/s)(151GiB/60002msec)
    slat (usec): min=2, max=7019, avg= 3.87, stdev= 3.76
    clat (usec): min=10, max=7185, avg=189.80, stdev=65.71
     lat (usec): min=13, max=7188, avg=193.67, stdev=65.52
    clat percentiles (usec):
     |  1.00th=[  110],  5.00th=[  129], 10.00th=[  145], 20.00th=[  161],
     | 30.00th=[  167], 40.00th=[  172], 50.00th=[  176], 60.00th=[  180],
     | 70.00th=[  186], 80.00th=[  217], 90.00th=[  265], 95.00th=[  297],
     | 99.00th=[  371], 99.50th=[  412], 99.90th=[  578], 99.95th=[  816],
     | 99.99th=[ 2180]
   bw (  MiB/s): min= 2267, max= 2796, per=100.00%, avg=2579.29, stdev=28.73, samples=476
   iops        : min=580424, max=715926, avg=660299.14, stdev=7353.97, samples=476
  lat (usec)   : 20=0.01%, 50=0.14%, 100=0.70%, 250=85.98%, 500=12.99%
  lat (usec)   : 750=0.13%, 1000=0.03%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%
  cpu          : usr=9.08%, sys=75.75%, ctx=367562, majf=0, minf=1286
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=39593164,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2578MiB/s (2703MB/s), 2578MiB/s-2578MiB/s (2703MB/s-2703MB/s), io=151GiB (162GB), run=60002-60002msec

Disk stats (read/write):
  loop1: ios=39448612/25, sectors=315588896/624, merge=0/53, ticks=2908870/6, in_queue=2908877, util=99.90%

ZFS is most horrible on databases especially; for general usage, file transfers and so on, ZFS is almost on par with BTRFS.
In the end, for mixed environments, BTRFS is a much better choice for NVMe drives.

Ramalama2 added the Type: Defect label on Jan 27, 2025
amotin (Member) commented Jan 27, 2025

When some test shows two orders of magnitude difference, it usually means you are comparing incomparable.

Ramalama2 (Author) commented:

When some test shows two orders of magnitude difference, it usually means you are comparing incomparable.

What is incomparable there?
So what, MySQL benchmarks are not allowed on ZFS? Or how am I supposed to understand that?

tonyhutter (Contributor) commented:

quick observations -

- fio blocksize should match recordsize.
- fio --numjobs=6 could be higher for your Genoa 9374F (32-core/64-thread) CPU. Consider increasing it to 32 or 64.
- You're running fio with --direct=1, but also running zfs-2.2.7, which doesn't natively support Direct IO. Consider upgrading to zfs-2.3.0.
- Consider benchmarking against a zvol.

Side note - we have a PR going through review, #16591, that could potentially double IOPS for fio runs like this.
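
For illustration, a minimal sketch of a run following these suggestions (16k block size to match the tuned recordsize=16k, more jobs for a 32-core/64-thread CPU; --direct=1 only takes effect natively on OpenZFS >= 2.3.0):

fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --direct=1 --size=4G --numjobs=32 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile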

clhedrick commented Jan 27, 2025

We'd certainly notice a 20 msec latency.

On a production system having home directories we see 241 usec read, 198 usec write (from nfsiostat on the client). I think those are a bit high, as Linux doesn't do a good job of giving latency when several transactions are in the pipeline at once.

A system on hard disk used for people who need more space gives 1 msec read and 373 usec write, with the same caveat.

Of course we get fairly high hit rates on the arc, so that may be a bit misleading, but I certainly wouldn't expect to see 20 msec average read.

To see what happens when things actually hit the disk, I use zpool iostat -v -l.

It shows around 13 msec read and 30 msec write on hard disk. (That's total wait. Disk wait is like 11 read and 9 write.) Users would never see the write latencies, because writes are acknowledged as soon as the data goes into the arc or gets written to an NVMe slog, depending. iostat on our NVMe system shows around 140 usec total read and 50 usec write.

These are not heavily loaded systems. Presumably you can find a way to load a system badly enough that latencies will go very high.
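
For reference, the per-vdev latency view mentioned above can be produced with something like the following (pool name and refresh interval are placeholders):

zpool iostat -v -l tank 5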

Ramalama2 commented Jan 27, 2025

quick observations -

- fio blocksize should match recordsize.
- fio --numjobs=6 could be higher for your Genoa 9374F (32-core/64-thread) CPU. Consider increasing it to 32 or 64.
- You're running fio with --direct=1, but also running zfs-2.2.7, which doesn't natively support Direct IO. Consider upgrading to zfs-2.3.0.
- Consider benchmarking against a zvol.

Side note - we have a PR going though review #16591 that could potentially double IOPS for fio runs like this.

I know that Direct IO is coming with ZFS 2.3.0 (which I sadly cannot test at the moment), and I'm pretty sure that Direct IO will greatly increase NVMe IOPS.
The issue is, I fear that Direct IO will be risky in the beginning. (Don't forget about the blk-mq data corruptions...)
Sure, ZFS without Direct IO is much safer than BTRFS, but not with Direct IO (at least in the beginning).
I also don't believe that ZFS with Direct IO will be comparable to BTRFS performance-wise.
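
For reference, OpenZFS 2.3.0 exposes Direct IO as a per-dataset property; a minimal sketch (dataset name is a placeholder):

zfs set direct=standard pool/dataset   # standard honors O_DIRECT requests, always forces it, disabled ignores it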

1. Without --direct, the IOPS on BTRFS skyrocket, while on ZFS this doesn't change anything:

fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --size=4G --numjobs=4 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile

ZFS-Tuned (Raid-10, 8x nvme, 26.5k IOPS, 10ms latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=104MiB/s][r=26.5k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1088: Mon Jan 27 21:09:35 2025
  read: IOPS=26.0k, BW=101MiB/s (106MB/s)(6088MiB/60001msec)
    slat (usec): min=3, max=1770, avg=152.91, stdev=69.90
    clat (nsec): min=1634, max=8063.4k, avg=4774462.53, stdev=398207.21
     lat (usec): min=143, max=8257, avg=4927.38, stdev=404.78
    clat percentiles (usec):
     |  1.00th=[ 3785],  5.00th=[ 4080], 10.00th=[ 4228], 20.00th=[ 4424],
     | 30.00th=[ 4555], 40.00th=[ 4686], 50.00th=[ 4817], 60.00th=[ 4883],
     | 70.00th=[ 5014], 80.00th=[ 5145], 90.00th=[ 5276], 95.00th=[ 5407],
     | 99.00th=[ 5604], 99.50th=[ 5669], 99.90th=[ 5866], 99.95th=[ 5932],
     | 99.99th=[ 6915]
   bw (  KiB/s): min=100296, max=107960, per=99.99%, avg=103886.92, stdev=395.60, samples=476
   iops        : min=25074, max=26990, avg=25971.73, stdev=98.90, samples=476
  lat (usec)   : 2=0.01%, 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=3.17%, 10=96.82%
  cpu          : usr=0.56%, sys=10.33%, ctx=1280241, majf=0, minf=295
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1558472,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=101MiB/s (106MB/s), 101MiB/s-101MiB/s (106MB/s-106MB/s), io=6088MiB (6384MB), run=60001-60001msec
BTRFS (Raid-0, 2x nvme, 1178k IOPS, 100-250µs latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=4600MiB/s][r=1178k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=2346: Mon Jan 27 20:00:02 2025
  read: IOPS=615k, BW=2402MiB/s (2518MB/s)(141GiB/60001msec)
    slat (nsec): min=1010, max=11379k, avg=5341.16, stdev=7605.82
    clat (nsec): min=1574, max=289679k, avg=202367.65, stdev=1466173.70
     lat (usec): min=3, max=289680, avg=207.71, stdev=1466.58
    clat percentiles (usec):
     |  1.00th=[   73],  5.00th=[   74], 10.00th=[   75], 20.00th=[   77],
     | 30.00th=[   90], 40.00th=[  108], 50.00th=[  137], 60.00th=[  178],
     | 70.00th=[  235], 80.00th=[  314], 90.00th=[  424], 95.00th=[  494],
     | 99.00th=[  562], 99.50th=[  570], 99.90th=[  619], 99.95th=[  717],
     | 99.99th=[ 1254]
   bw (  MiB/s): min=  379, max= 6456, per=99.01%, avg=2377.99, stdev=409.28, samples=476
   iops        : min=97044, max=1652924, avg=608765.60, stdev=104775.58, samples=476
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=36.08%, 250=36.11%, 500=23.11%, 750=4.66%, 1000=0.02%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 500=0.01%
  cpu          : usr=8.48%, sys=56.65%, ctx=9443003, majf=0, minf=254
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=36890308,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2402MiB/s (2518MB/s), 2402MiB/s-2402MiB/s (2518MB/s-2518MB/s), io=141GiB (151GB), run=60001-60001msec

Disk stats (read/write):
  loop1: ios=9434251/25, sectors=75474008/448, merge=0/31, ticks=88814/6, in_queue=88821, util=65.65%

2. Increasing the block size to match the recordsize:

fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4G --numjobs=4 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile

ZFS-Tuned (Raid-10, 8x nvme, 21,8k IOPS, 10ms latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=340MiB/s][r=21.8k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=1160: Mon Jan 27 21:15:10 2025
  read: IOPS=21.7k, BW=338MiB/s (355MB/s)(19.8GiB/60001msec)
    slat (usec): min=3, max=2133, avg=183.57, stdev=16.41
    clat (nsec): min=1643, max=7821.2k, avg=5724788.11, stdev=134541.72
     lat (usec): min=148, max=8004, avg=5908.36, stdev=137.60
    clat percentiles (usec):
     |  1.00th=[ 5342],  5.00th=[ 5538], 10.00th=[ 5604], 20.00th=[ 5604],
     | 30.00th=[ 5669], 40.00th=[ 5669], 50.00th=[ 5735], 60.00th=[ 5735],
     | 70.00th=[ 5800], 80.00th=[ 5800], 90.00th=[ 5866], 95.00th=[ 5932],
     | 99.00th=[ 5997], 99.50th=[ 5997], 99.90th=[ 6325], 99.95th=[ 6456],
     | 99.99th=[ 7177]
   bw (  KiB/s): min=337536, max=360544, per=100.00%, avg=346584.91, stdev=1354.53, samples=476
   iops        : min=21096, max=22534, avg=21661.50, stdev=84.65, samples=476
  lat (usec)   : 2=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=99.99%
  cpu          : usr=0.42%, sys=9.57%, ctx=1296859, majf=0, minf=673
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1299724,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=19.8GiB (21.3GB), run=60001-60001msec
BTRFS (Raid-0, 2x nvme, 615k IOPS, 250-500µs latency)
Jobs: 4 (f=4): [r(4)][100.0%][r=9603MiB/s][r=615k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=2774: Mon Jan 27 21:17:34 2025
  read: IOPS=348k, BW=5432MiB/s (5696MB/s)(320GiB/60324msec)
    slat (nsec): min=1505, max=18407k, avg=8557.97, stdev=12083.31
    clat (usec): min=4, max=553038, avg=357.37, stdev=5820.89
     lat (usec): min=63, max=553041, avg=365.93, stdev=5821.08
    clat percentiles (usec):
     |  1.00th=[   110],  5.00th=[   118], 10.00th=[   121], 20.00th=[   123],
     | 30.00th=[   137], 40.00th=[   157], 50.00th=[   196], 60.00th=[   260],
     | 70.00th=[   351], 80.00th=[   474], 90.00th=[   652], 95.00th=[   758],
     | 99.00th=[   865], 99.50th=[   889], 99.90th=[  1045], 99.95th=[  1123],
     | 99.99th=[522191]
   bw (  MiB/s): min=  514, max=15524, per=100.00%, avg=5484.25, stdev=982.37, samples=478
   iops        : min=32934, max=993584, avg=350991.80, stdev=62871.48, samples=478
  lat (usec)   : 10=0.01%, 250=58.89%, 500=22.75%, 750=13.16%, 1000=5.06%
  lat (msec)   : 2=0.13%, 4=0.01%, 10=0.01%, 20=0.01%, 750=0.01%
  cpu          : usr=5.02%, sys=65.73%, ctx=5246511, majf=0, minf=682
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=20971524,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=5432MiB/s (5696MB/s), 5432MiB/s-5432MiB/s (5696MB/s-5696MB/s), io=320GiB (344GB), run=60324-60324msec

Disk stats (read/write):
  loop1: ios=5244373/22, sectors=167818512/432, merge=0/32, ticks=80576/4, in_queue=80580, util=65.46%

3. Increasing numjobs: fio benefits a little, but MySQL does not, so this is basically not helpful in any way. However:

fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile

ZFS-Tuned (Raid-10, 8x nvme, 31,2k IOPS, 10ms latency)
Jobs: 6 (f=6): [r(6)][100.0%][r=488MiB/s][r=31.2k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=6): err= 0: pid=1067: Mon Jan 27 21:01:17 2025
  read: IOPS=31.5k, BW=492MiB/s (516MB/s)(28.8GiB/60001msec)
    slat (usec): min=4, max=10122, avg=189.34, stdev=37.93
    clat (nsec): min=1634, max=57735k, avg=5904775.41, stdev=585473.13
     lat (usec): min=190, max=60107, avg=6094.12, stdev=601.68
    clat percentiles (usec):
     |  1.00th=[ 5604],  5.00th=[ 5735], 10.00th=[ 5735], 20.00th=[ 5800],
     | 30.00th=[ 5800], 40.00th=[ 5866], 50.00th=[ 5866], 60.00th=[ 5932],
     | 70.00th=[ 5932], 80.00th=[ 5997], 90.00th=[ 5997], 95.00th=[ 6063],
     | 99.00th=[ 6390], 99.50th=[ 7111], 99.90th=[10945], 99.95th=[16909],
     | 99.99th=[36963]
   bw (  KiB/s): min=361472, max=525920, per=100.00%, avg=504160.57, stdev=2435.07, samples=714
   iops        : min=22592, max=32870, avg=31509.98, stdev=152.19, samples=714
  lat (usec)   : 2=0.01%, 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=99.88%, 20=0.08%, 50=0.03%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.51%, sys=10.99%, ctx=1888299, majf=0, minf=1604
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1890143,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=28.8GiB (31.0GB), run=60001-60001msec
BTRFS (Raid-0, 2x nvme, 941k IOPS, 250µs latency)
Jobs: 6 (f=6): [r(6)][100.0%][r=14.4GiB/s][r=941k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=6): err= 0: pid=2421: Mon Jan 27 20:15:21 2025
  read: IOPS=576k, BW=9001MiB/s (9438MB/s)(528GiB/60071msec)
    slat (nsec): min=1485, max=14164k, avg=7539.35, stdev=11243.87
    clat (usec): min=3, max=544945, avg=323.58, stdev=5632.81
     lat (usec): min=42, max=544948, avg=331.12, stdev=5633.05
    clat percentiles (usec):
     |  1.00th=[   111],  5.00th=[   121], 10.00th=[   123], 20.00th=[   125],
     | 30.00th=[   127], 40.00th=[   129], 50.00th=[   145], 60.00th=[   178],
     | 70.00th=[   251], 80.00th=[   383], 90.00th=[   627], 95.00th=[   807],
     | 99.00th=[   979], 99.50th=[  1029], 99.90th=[  1188], 99.95th=[  1369],
     | 99.99th=[505414]
   bw (  MiB/s): min=  100, max=23579, per=100.00%, avg=9049.01, stdev=1159.70, samples=717
   iops        : min= 6438, max=1509094, avg=579136.82, stdev=74221.00, samples=717
  lat (usec)   : 4=0.01%, 10=0.01%, 100=0.01%, 250=69.83%, 500=15.68%
  lat (usec)   : 750=7.97%, 1000=5.77%
  lat (msec)   : 2=0.74%, 4=0.01%, 10=0.01%, 20=0.01%, 500=0.01%
  lat (msec)   : 750=0.01%
  cpu          : usr=5.44%, sys=66.87%, ctx=5781226, majf=0, minf=1043
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=34603014,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=9001MiB/s (9438MB/s), 9001MiB/s-9001MiB/s (9438MB/s-9438MB/s), io=528GiB (567GB), run=60071-60071msec

Disk stats (read/write):
  loop1: ios=5775437/23, sectors=184812488/432, merge=0/31, ticks=110198/4, in_queue=110202, util=67.96%

4. Conclusion:

Don't forget, I am trying to improve MySQL performance, and BTRFS is worlds ahead for MySQL, and actually for almost everything real-world related.
However, none of the suggestions above helps MySQL performance or shows any big improvement compared to BTRFS.

amotin commented Jan 27, 2025

What is incomparable there?

You haven't specified what "Queries" you have tested. Are those all selects, or are there any writes? ZFS is known for its rigorous transactional safety guarantees, and even among SSDs not all are made equal with respect to synchronous writes, which database workloads produce in abundance. Do the other file systems provide the same guarantees?

You've disabled all data caching in ZFS and are actively running sub-block accesses. That predictably creates massive read/write inflation. Do the other file systems provide checksumming, compression, etc., preventing sub-block disk accesses? What block sizes do they use?

ZFS does have some IOPS bottlenecks, for some of which there are already PRs floating around, but in my experience those bottlenecks start closer to 300-400K IOPS. 27K IOPS is not serious; there must be some reason(s).
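
To make the inflation point concrete, a rough illustration (assuming ZFS has to read a whole record to verify its checksum, and with primarycache=metadata the data is not kept in ARC between requests):

4 KiB random read with recordsize=16k: 16 KiB read from disk per request, roughly 4x read inflation.
4 KiB random read with recordsize=128k (default): 128 KiB read from disk per request, roughly 32x read inflation.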

Ramalama2 (Author) commented:

What is incomparable there?

You haven't specified what "Queries" you have tested. Are those all selects, or are there any writes? ZFS is known for its rigorous transactional safety guarantees, and even among SSDs not all are made equal with respect to synchronous writes, which database workloads produce in abundance. Do the other file systems provide the same guarantees?

You've disabled all data caching in ZFS and are actively running sub-block accesses. That predictably creates massive read/write inflation. Do the other file systems provide checksumming, compression, etc., preventing sub-block disk accesses? What block sizes do they use?

ZFS does have some IOPS bottlenecks, for some of which there are already PRs floating around, but in my experience those bottlenecks start closer to 300-400K IOPS. 27K IOPS is not serious; there must be some reason(s).

  1. As I wrote already, those are only SELECT queries with some INNER JOINs, which means no writing, just reading.
     It's reading from a 446 GB database, not some WordPress/Nextcloud mini database that can be cached in memory.

  2. I'm caching only metadata because I don't want to cache my queries in ARC.
     For such fast NVMe devices, ARC makes no sense anyway, especially not for 8x Micron 7450 Max in ZFS Raid-10.

  3. BTRFS supports compression, checksumming for data & metadata, and "etc"...

Ramalama2 commented Jan 27, 2025

Here is the example Query 1 (the one used for the benchmark):

SELECT
    T1.col_artikel AS Artikel,
    REPLACE(REPLACE(T2.col_text, '\n', ''), '\r', '') AS Text,
    T1.col_company AS Company,
    T1.col_account AS Account,
    IF(T1.col_company = 'A', T3.col_editor, '') AS Editor,
    CONCAT(DATE_FORMAT(T1.col_date_start, '%d.%m.%y'), ' - ', DATE_FORMAT(T1.col_date_end, '%d.%m.%y')) AS Validity,
    T1.col_min_qty AS MinQty,
    T1.col_price AS Price,
    CONCAT(T1.col_discount_1, ' / ', T1.col_discount_2) AS Discount,
    ROUND(T1.col_price * (100 - T1.col_discount_1) / 100 * (100 - T1.col_discount_2) / 100, 3) AS NetPrice1,
    T1.col_status AS Status,
    CONCAT(DATE_FORMAT(T4.col_date_start, '%d.%m.%y'), ' - ', DATE_FORMAT(T4.col_date_end, '%d.%m.%y')) AS Validity2,
    T4.col_min_qty AS MinQty2,
    T4.col_price AS Price2,
    CONCAT(T4.col_discount_1, ' / ', T4.col_discount_2) AS Discount2,
    ROUND(T4.col_price * (100 - T4.col_discount_1) / 100 * (100 - T4.col_discount_2) / 100, 3) AS NetPrice2,
    IF(T1.col_company = 'A', '1', '2') AS SortOrder,
    (SELECT user_email
     FROM users
     WHERE user_shortname = T3.col_editor
       AND user_firm IN ('01', '51', '52', '53')
     LIMIT 1) AS EditorEmail
FROM
    prices AS T1
LEFT JOIN
    prices AS T4
    ON T1.col_artikel = T4.col_artikel
       AND T1.col_company = T4.col_company
       AND T1.col_account = T4.col_account
       AND T1.col_date_start = T4.col_date_start
       AND T1.col_date_end = T4.col_date_end
       AND T1.col_min_qty < T4.col_min_qty
INNER JOIN
    items AS T2
    ON T2.col_item_number = T1.col_artikel
LEFT JOIN
    accounts AS T3
    ON T3.col_account_number = T2.col_supplier
WHERE
    CURDATE() BETWEEN T1.col_date_start AND T1.col_date_end
    AND T4.col_price IS NOT NULL
    AND ROUND(T1.col_price * (100 - T1.col_discount_1) / 100 * (100 - T1.col_discount_2) / 100, 3) < ROUND(T4.col_price * (100 - T4.col_discount_1) / 100 * (100 - T4.col_discount_2) / 100, 3)
    AND T1.col_artikel NOT IN ('EXCLUDE_ARTICLE')
    AND T1.col_account NOT IN ('EXCLUDE_ACCOUNT')
    AND T1.col_company NOT IN ('EXCLUDE_COMPANY')
ORDER BY
    IF(T1.col_company = 'A', '1', '2'),
    T3.col_editor,
    T1.col_account,
    T1.col_artikel,
    T1.col_date_start;

Just so that others do not complain about the query I use.
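
For reproducibility, a minimal sketch of one way such a query can be timed from the shell (query1.sql is a hypothetical file containing the SELECT above, mydb is a placeholder database name, connection options omitted):

time mysql mydb < query1.sql > /dev/null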

Ramalama2 (Author) commented:

We'd certainly notice a 20 msec latency.

On a production system having home directories we see 241 usec read, 198 usec write (from nfsiostat on the client). I think those are a bit high, as Linux doesn't do a good job of giving latency when several transactions are in the pipeline at once.

A system on hard disk used for people who need more space gives 1 msec read and 373 usec write, with the same caveat.

Of course we get fairly high hit rates on the arc, so that may be a bit misleading, but I certainly wouldn't expect to see 20 msec average read.

To see what happens when things actually hit the disk, I use zpool iostat -v -l.

It shows around 13 msec read and 30 msec write on hard disk. (That's total wait. Disk wait is like 11 read and 9 write.) Users would never see the write latencies, because writes are acknowledged as soon as the data goes into the arc or gets written to an NVMe slog, depending. iostat on our NVMe system shows around 140 usec total read and 50 usec write.

These are not heavily loaded systems. Presumably you can find a way to load a system badly enough that latencies will go very high.

It's similar here:

                                                       capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                                                 alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
---------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
Storage-Default                                      5.02T  6.60T    810    283  19.6M  15.3M  252us    1ms  200us  217us    3us    1us   70us    1ms  141us    2ms      -
  mirror-0                                           1.25T  1.65T    202     71  4.89M  3.83M  251us    1ms  199us  217us    3us  917ns   69us    1ms  150us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX          -      -    101     35  2.44M  1.91M  251us    1ms  199us  217us    3us  890ns   69us    1ms  186us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX_1        -      -    101     35  2.44M  1.91M  250us    1ms  199us  217us    3us  943ns   69us    1ms  115us    2ms      -
  mirror-1                                           1.25T  1.65T    202     72  4.90M  3.87M  251us    1ms  199us  215us    3us    2us   69us    1ms  205us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX          -      -    101     36  2.45M  1.94M  251us    1ms  199us  215us    3us    2us   69us    1ms  161us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX          -      -    101     36  2.45M  1.94M  251us    1ms  199us  215us    3us    2us   69us    1ms  246us    2ms      -
  mirror-2                                           1.26T  1.65T    202     70  4.91M  3.81M  253us    1ms  200us  219us    3us    1us   71us    1ms  133us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX          -      -    101     35  2.45M  1.90M  253us    1ms  200us  220us    3us    1us   71us    1ms  158us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX          -      -    101     35  2.46M  1.90M  252us    1ms  200us  217us    3us    1us   71us    1ms  110us    2ms      -
  mirror-3                                           1.26T  1.65T    201     69  4.91M  3.78M  252us    1ms  200us  219us    3us  871ns   69us    1ms   15us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX_1        -      -    100     34  2.46M  1.89M  252us    1ms  200us  219us    3us  835ns   69us    1ms   15us    2ms      -
    nvme-Micron_7450_MTFDKCC3T2TFS_XXXXXXXXXXXX_1        -      -    101     34  2.46M  1.89M  252us    1ms  200us  219us    3us  907ns   69us    1ms   15us    2ms      -
---------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rpool                                                27.7G   860G     15     31   351K   414K  130us  298us  130us   47us  651ns  491ns   14us  270us  467us      -      -
  mirror-0                                           27.7G   860G     15     31   351K   414K  130us  298us  130us   47us  651ns  491ns   14us  270us  467us      -      -
    nvme-eui.3636353057b1XXXXXXXXXXXe00000001-part3      -      -      8     15   243K   207K  146us  297us  146us   47us  650ns  472ns   15us  269us  544us      -      -
    nvme-eui.3636353057b1XXXXXXXXXXXe00000001-part3      -      -      7     15   108K   207K  112us  298us  112us   47us  652ns  510ns   13us  270us  397us      -      -
---------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

But the storage for the MariaDB VM is not directly on the pool; it's on a ZFS filesystem dataset:

ZFS Storage Details
zfs list | grep 182
Storage-Default/Storage-Mysql/subvol-182-disk-0    385G   215G   385G  /Storage-Default/Storage-Mysql/subvol-182-disk-0
Storage-Default/subvol-182-disk-0                  672M  15.3G   672M  /Storage-Default/subvol-182-disk-0

zfs get all Storage-Default/Storage-Mysql/subvol-182-disk-0
NAME                                             PROPERTY              VALUE                                             SOURCE
Storage-Default/Storage-Mysql/subvol-182-disk-0  type                  filesystem                                        -
Storage-Default/Storage-Mysql/subvol-182-disk-0  creation              Mon Jan 27 21:12 2025                             -
Storage-Default/Storage-Mysql/subvol-182-disk-0  used                  385G                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  available             215G                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  referenced            385G                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  compressratio         1.08x                                             -
Storage-Default/Storage-Mysql/subvol-182-disk-0  mounted               yes                                               -
Storage-Default/Storage-Mysql/subvol-182-disk-0  quota                 none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  reservation           none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  recordsize            16K                                               inherited from Storage-Default/Storage-Mysql
Storage-Default/Storage-Mysql/subvol-182-disk-0  mountpoint            /Storage-Default/Storage-Mysql/subvol-182-disk-0  default
Storage-Default/Storage-Mysql/subvol-182-disk-0  sharenfs              off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  checksum              on                                                default
Storage-Default/Storage-Mysql/subvol-182-disk-0  compression           on                                                inherited from Storage-Default
Storage-Default/Storage-Mysql/subvol-182-disk-0  atime                 off                                               inherited from Storage-Default/Storage-Mysql
Storage-Default/Storage-Mysql/subvol-182-disk-0  devices               on                                                default
Storage-Default/Storage-Mysql/subvol-182-disk-0  exec                  on                                                default
Storage-Default/Storage-Mysql/subvol-182-disk-0  setuid                on                                                default
Storage-Default/Storage-Mysql/subvol-182-disk-0  readonly              off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  zoned                 off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  snapdir               hidden                                            default
Storage-Default/Storage-Mysql/subvol-182-disk-0  aclmode               discard                                           default
Storage-Default/Storage-Mysql/subvol-182-disk-0  aclinherit            restricted                                        default
Storage-Default/Storage-Mysql/subvol-182-disk-0  createtxg             5244121                                           -
Storage-Default/Storage-Mysql/subvol-182-disk-0  canmount              on                                                default
Storage-Default/Storage-Mysql/subvol-182-disk-0  xattr                 sa                                                local
Storage-Default/Storage-Mysql/subvol-182-disk-0  copies                1                                                 default
Storage-Default/Storage-Mysql/subvol-182-disk-0  version               5                                                 -
Storage-Default/Storage-Mysql/subvol-182-disk-0  utf8only              off                                               -
Storage-Default/Storage-Mysql/subvol-182-disk-0  normalization         none                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  casesensitivity       sensitive                                         -
Storage-Default/Storage-Mysql/subvol-182-disk-0  vscan                 off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  nbmand                off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  sharesmb              off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  refquota              600G                                              local
Storage-Default/Storage-Mysql/subvol-182-disk-0  refreservation        none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  guid                  3357670813066738377                               -
Storage-Default/Storage-Mysql/subvol-182-disk-0  primarycache          metadata                                          inherited from Storage-Default/Storage-Mysql
Storage-Default/Storage-Mysql/subvol-182-disk-0  secondarycache        all                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  usedbysnapshots       0B                                                -
Storage-Default/Storage-Mysql/subvol-182-disk-0  usedbydataset         385G                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  usedbychildren        0B                                                -
Storage-Default/Storage-Mysql/subvol-182-disk-0  usedbyrefreservation  0B                                                -
Storage-Default/Storage-Mysql/subvol-182-disk-0  logbias               throughput                                        inherited from Storage-Default/Storage-Mysql
Storage-Default/Storage-Mysql/subvol-182-disk-0  objsetid              66392                                             -
Storage-Default/Storage-Mysql/subvol-182-disk-0  dedup                 off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  mlslabel              none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  sync                  standard                                          default
Storage-Default/Storage-Mysql/subvol-182-disk-0  dnodesize             auto                                              inherited from Storage-Default
Storage-Default/Storage-Mysql/subvol-182-disk-0  refcompressratio      1.08x                                             -
Storage-Default/Storage-Mysql/subvol-182-disk-0  written               385G                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  logicalused           416G                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  logicalreferenced     416G                                              -
Storage-Default/Storage-Mysql/subvol-182-disk-0  volmode               default                                           default
Storage-Default/Storage-Mysql/subvol-182-disk-0  filesystem_limit      none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  snapshot_limit        none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  filesystem_count      none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  snapshot_count        none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  snapdev               hidden                                            default
Storage-Default/Storage-Mysql/subvol-182-disk-0  acltype               posix                                             local
Storage-Default/Storage-Mysql/subvol-182-disk-0  context               none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  fscontext             none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  defcontext            none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  rootcontext           none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  relatime              on                                                default
Storage-Default/Storage-Mysql/subvol-182-disk-0  redundant_metadata    most                                              inherited from Storage-Default/Storage-Mysql
Storage-Default/Storage-Mysql/subvol-182-disk-0  overlay               on                                                default
Storage-Default/Storage-Mysql/subvol-182-disk-0  encryption            off                                               default
Storage-Default/Storage-Mysql/subvol-182-disk-0  keylocation           none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  keyformat             none                                              default
Storage-Default/Storage-Mysql/subvol-182-disk-0  pbkdf2iters           0                                                 default
Storage-Default/Storage-Mysql/subvol-182-disk-0  special_small_blocks  0                                                 default
Storage-Default/Storage-Mysql/subvol-182-disk-0  prefetch              all                                               default

It's an LXC container on Proxmox (not a VM), so there is no additional overhead like another scheduler etc...
The filesystem is simply passed through; in other words, the container just has native access... (basically bare metal)

So I'm not even sure what the issue is, or let's say I don't believe that there is an issue.
I believe that most people are simply using ARC and because of that think that ZFS is fast.

The funny part is that I get 573k IOPS with ARC on a SAS array (on my backup server):

ZFS (Raid-10, 8x HDD(SAS) + 2x Special nvme, 573k IOPS, 500µs latency)
  pool: HDD-SAS
 state: ONLINE
  scan: scrub repaired 0B in 04:26:55 with 0 errors on Sun Jan 12 04:50:56 2025
config:

	NAME                                                STATE     READ WRITE CKSUM
	HDD-SAS                                             ONLINE       0     0     0
	  mirror-0                                          ONLINE       0     0     0
	    wwn-0x5000cca2c1848944                          ONLINE       0     0     0
	    wwn-0x5000cca2cc0005b4                          ONLINE       0     0     0
	  mirror-1                                          ONLINE       0     0     0
	    wwn-0x5000cca25806aee0                          ONLINE       0     0     0
	    wwn-0x5000cca259029b90                          ONLINE       0     0     0
	  mirror-2                                          ONLINE       0     0     0
	    wwn-0x5000cca25806ef60                          ONLINE       0     0     0
	    wwn-0x5000cca2db04db2c                          ONLINE       0     0     0
	  mirror-3                                          ONLINE       0     0     0
	    wwn-0x50000399d8914779                          ONLINE       0     0     0
	    wwn-0x50000399d892a6a1                          ONLINE       0     0     0
	special
	  mirror-4                                          ONLINE       0     0     0
	    nvme-SAMSUNG_MZPLJ1T6HBJR-00007_S55JNC0W900178  ONLINE       0     0     0
	    nvme-SAMSUNG_MZPLJ1T6HBJR-00007_S55JNC0W900193  ONLINE       0     0     0

errors: No known data errors

FIO:
Jobs: 6 (f=6): [r(6)][100.0%][r=8956MiB/s][r=573k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=6): err= 0: pid=1656637: Mon Jan 27 23:05:34 2025
  read: IOPS=610k, BW=9539MiB/s (10.0GB/s)(559GiB/60001msec)
    slat (usec): min=3, max=2515, avg= 8.79, stdev=14.52
    clat (nsec): min=1838, max=7249.4k, avg=305328.10, stdev=136332.47
     lat (usec): min=7, max=7682, avg=314.12, stdev=139.92
    clat percentiles (usec):
     |  1.00th=[  262],  5.00th=[  265], 10.00th=[  269], 20.00th=[  273],
     | 30.00th=[  273], 40.00th=[  273], 50.00th=[  277], 60.00th=[  277],
     | 70.00th=[  281], 80.00th=[  285], 90.00th=[  293], 95.00th=[  494],
     | 99.00th=[ 1020], 99.50th=[ 1237], 99.90th=[ 1729], 99.95th=[ 1926],
     | 99.99th=[ 2442]
   bw (  MiB/s): min= 4134, max=10635, per=99.96%, avg=9534.58, stdev=182.46, samples=714
   iops        : min=264626, max=680644, avg=610212.84, stdev=11677.35, samples=714
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.10%, 500=95.21%, 750=2.51%, 1000=1.12%
  lat (msec)   : 2=1.03%, 4=0.04%, 10=0.01%
  cpu          : usr=11.96%, sys=87.94%, ctx=4610, majf=7, minf=2687
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=36629131,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=9539MiB/s (10.0GB/s), 9539MiB/s-9539MiB/s (10.0GB/s-10.0GB/s), io=559GiB (600GB), run=60001-60001msec

So ARC does a ton, but I don't want ARC, because I have really huge databases and servers with only 768GB of memory.
So in my case I can simply switch to BTRFS and won't need almost any memory at all...

Because of this, ZFS with primarycache=metadata seems broken to me.
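
One way to see how much ARC is (or is not) contributing during such a run is the arcstat utility that ships with OpenZFS; a minimal sketch, printing hit/miss counters once per second while the benchmark runs:

arcstat 1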

Ramalama2 commented Jan 27, 2025

primarycache=all vs primarycache=metadata

Command:
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4G --numjobs=5 --runtime=60 --time_based=1 --group_reporting --filename=/datasets/Backup-HDD-SATA/testfile
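
A minimal sketch of how the two runs can be switched, assuming the dataset under test is the HDD-SATA pool root shown below (re-running the fio command above after each change):

zfs set primarycache=metadata HDD-SATA
zfs set primarycache=all HDD-SATA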

ZFS HDD-SATA Storage Details
  pool: HDD-SATA
 state: ONLINE
  scan: resilvered 1.01T in 03:01:25 with 0 errors on Sat Jan 25 10:31:43 2025
config:

	NAME                                     STATE     READ WRITE CKSUM
	HDD-SATA                                 ONLINE       0     0     0
	  raidz1-0                               ONLINE       0     0     0
	    ata-ST4000NM000B-2TF100_WX1078TY     ONLINE       0     0     0
	    ata-MB4000GCWDC_Z1Z61VC4             ONLINE       0     0     0
	    ata-MB4000GCWLV_PEG58SJX             ONLINE       0     0     0
	  raidz1-1                               ONLINE       0     0     0
	    ata-WDC_WD4003FFBX-68MU3N0_VBJ0A9EF  ONLINE       0     0     0
	    ata-ST4000NM000B-2TF100_WX108XSL     ONLINE       0     0     0
	    ata-MB4000GCWLV_PEGNL51T             ONLINE       0     0     0
	  raidz1-2                               ONLINE       0     0     0
	    ata-MB4000GCWLV_PEGNMRYT             ONLINE       0     0     0
	    ata-MB4000GCWDC_Z1Z61453             ONLINE       0     0     0
	    ata-ST4000NM000B-2TF100_WX108YAF     ONLINE       0     0     0
	  raidz1-3                               ONLINE       0     0     0
	    ata-WDC_WD4003FFBX-68MU3N0_VBJ0AGHF  ONLINE       0     0     0
	    ata-MB4000GCWLV_PEHGY31S             ONLINE       0     0     0
	    ata-MB4000GCWLV_PEHH4KUS             ONLINE       0     0     0
	spares
	  ata-ST4000NM000B-2TF100_WX108YLL       AVAIL

errors: No known data errors

zfs get all HDD-SATA
NAME      PROPERTY              VALUE                  SOURCE
HDD-SATA  type                  filesystem             -
HDD-SATA  creation              Fri Apr  5 11:38 2024  -
HDD-SATA  used                  10.8T                  -
HDD-SATA  available             20.9T                  -
HDD-SATA  referenced            5.62T                  -
HDD-SATA  compressratio         2.41x                  -
HDD-SATA  mounted               yes                    -
HDD-SATA  quota                 none                   default
HDD-SATA  reservation           none                   default
HDD-SATA  recordsize            1M                     local
HDD-SATA  mountpoint            /HDD-SATA              local
HDD-SATA  sharenfs              off                    default
HDD-SATA  checksum              on                     default
HDD-SATA  compression           zstd                   local
HDD-SATA  atime                 on                     local
HDD-SATA  devices               on                     default
HDD-SATA  exec                  on                     default
HDD-SATA  setuid                on                     default
HDD-SATA  readonly              off                    default
HDD-SATA  zoned                 off                    default
HDD-SATA  snapdir               hidden                 default
HDD-SATA  aclmode               discard                default
HDD-SATA  aclinherit            restricted             default
HDD-SATA  createtxg             1                      -
HDD-SATA  canmount              on                     default
HDD-SATA  xattr                 sa                     local
HDD-SATA  copies                1                      default
HDD-SATA  version               5                      -
HDD-SATA  utf8only              off                    -
HDD-SATA  normalization         none                   -
HDD-SATA  casesensitivity       sensitive              -
HDD-SATA  vscan                 off                    default
HDD-SATA  nbmand                off                    default
HDD-SATA  sharesmb              off                    default
HDD-SATA  refquota              none                   default
HDD-SATA  refreservation        none                   default
HDD-SATA  guid                  1106475049581885209    -
HDD-SATA  primarycache          metadata               local
HDD-SATA  secondarycache        all                    default
HDD-SATA  usedbysnapshots       0B                     -
HDD-SATA  usedbydataset         5.62T                  -
HDD-SATA  usedbychildren        5.18T                  -
HDD-SATA  usedbyrefreservation  0B                     -
HDD-SATA  logbias               throughput             local
HDD-SATA  objsetid              54                     -
HDD-SATA  dedup                 on                     local
HDD-SATA  mlslabel              none                   default
HDD-SATA  sync                  standard               default
HDD-SATA  dnodesize             auto                   local
HDD-SATA  refcompressratio      3.55x                  -
HDD-SATA  written               5.62T                  -
HDD-SATA  logicalused           25.9T                  -
HDD-SATA  logicalreferenced     19.9T                  -
HDD-SATA  volmode               default                default
HDD-SATA  filesystem_limit      none                   default
HDD-SATA  snapshot_limit        none                   default
HDD-SATA  filesystem_count      none                   default
HDD-SATA  snapshot_count        none                   default
HDD-SATA  snapdev               hidden                 default
HDD-SATA  acltype               off                    default
HDD-SATA  context               none                   default
HDD-SATA  fscontext             none                   default
HDD-SATA  defcontext            none                   default
HDD-SATA  rootcontext           none                   default
HDD-SATA  relatime              on                     default
HDD-SATA  redundant_metadata    all                    default
HDD-SATA  overlay               on                     default
HDD-SATA  encryption            off                    default
HDD-SATA  keylocation           none                   default
HDD-SATA  keyformat             none                   default
HDD-SATA  pbkdf2iters           0                      default
HDD-SATA  special_small_blocks  0                      default
HDD-SATA  prefetch              all                    default
ZFS primarycache=metadata (781 IOPS, 500ms latency)
Jobs: 5 (f=0): [f(5),_(1)][100.0%][r=12.2MiB/s][r=781 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=6): err= 0: pid=1661936: Mon Jan 27 23:19:00 2025
  read: IOPS=532, BW=8524KiB/s (8729kB/s)(500MiB/60023msec)
    slat (usec): min=4, max=307509, avg=11255.39, stdev=12387.16
    clat (usec): min=3, max=958991, avg=347838.62, stdev=77674.88
     lat (msec): min=14, max=973, avg=359.09, stdev=79.06
    clat percentiles (msec):
     |  1.00th=[  192],  5.00th=[  236], 10.00th=[  259], 20.00th=[  288],
     | 30.00th=[  309], 40.00th=[  326], 50.00th=[  342], 60.00th=[  359],
     | 70.00th=[  380], 80.00th=[  405], 90.00th=[  443], 95.00th=[  477],
     | 99.00th=[  575], 99.50th=[  634], 99.90th=[  827], 99.95th=[  877],
     | 99.99th=[  936]
   bw (  KiB/s): min= 1792, max=12448, per=99.34%, avg=8468.01, stdev=280.37, samples=714
   iops        : min=  112, max=  778, avg=529.13, stdev=17.54, samples=714
  lat (usec)   : 4=0.01%, 10=0.01%
  lat (msec)   : 20=0.02%, 50=0.04%, 100=0.15%, 250=7.51%, 500=89.35%
  lat (msec)   : 750=2.75%, 1000=0.17%
  cpu          : usr=0.05%, sys=0.59%, ctx=19106, majf=0, minf=1101
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.3%, 32=99.4%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=31978,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=8524KiB/s (8729kB/s), 8524KiB/s-8524KiB/s (8729kB/s-8729kB/s), io=500MiB (524MB), run=60023-60023msec
ZFS primarycache=all (536k IOPS, 500µs latency)
Jobs: 5 (f=5): [r(5)][100.0%][r=8382MiB/s][r=536k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=5): err= 0: pid=1667557: Mon Jan 27 23:32:36 2025
  read: IOPS=512k, BW=8004MiB/s (8393MB/s)(469GiB/60002msec)
    slat (usec): min=3, max=7354, avg= 8.67, stdev=14.94
    clat (nsec): min=1784, max=11157k, avg=303233.82, stdev=135451.62
     lat (usec): min=8, max=11166, avg=311.90, stdev=138.96
    clat percentiles (usec):
     |  1.00th=[  255],  5.00th=[  260], 10.00th=[  262], 20.00th=[  265],
     | 30.00th=[  269], 40.00th=[  269], 50.00th=[  273], 60.00th=[  273],
     | 70.00th=[  277], 80.00th=[  281], 90.00th=[  293], 95.00th=[  498],
     | 99.00th=[  996], 99.50th=[ 1205], 99.90th=[ 1598], 99.95th=[ 1762],
     | 99.99th=[ 2245]
   bw (  MiB/s): min= 4185, max= 9122, per=100.00%, avg=8010.26, stdev=186.69, samples=595
   iops        : min=267860, max=583862, avg=512656.77, stdev=11948.16, samples=595
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.27%, 500=94.76%, 750=2.81%, 1000=1.18%
  lat (msec)   : 2=0.95%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=12.58%, sys=87.27%, ctx=4587, majf=0, minf=1318
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=30737492,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=8004MiB/s (8393MB/s), 8004MiB/s-8004MiB/s (8393MB/s-8393MB/s), io=469GiB (504GB), run=60002-60002msec

So on the same SATA-Pool, with the only difference in primarycache=metadata vs primarycache=all the difference is abnormal:
536k IOPS, 500µs latency -> primarycache=all
781 IOPS, 500ms latency -> primarycache=metadata

So primarycache=metadata is definitively broken and no one noticed it.
It's probably not noticeable on normal hard drives, but on fast NVMe drives it's a dealbreaker.
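For anyone who wants to reproduce the comparison, a rough sketch (the dataset name matches this SATA pool, the testfile path and file size are placeholders, and exporting/importing the pool is just one way to start from a cold ARC):

zfs set primarycache=metadata HDD-SATA
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/HDD-SATA/testfile

zfs set primarycache=all HDD-SATA
zpool export HDD-SATA && zpool import HDD-SATA    # pool must be idle; this empties its ARC contents
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/HDD-SATA/testfile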

Cheers

@amotin
Copy link
Member

amotin commented Jan 27, 2025

So on the same SATA-Pool, with the only difference in primarycache=metadata vs primarycache=all the difference is abnormal: 536k IOPS, 500µs latency -> primarycache=all 781 IOPS, 500ms latency -> primarycache=metadata

So primarycache=metadata is definitively broken and no one noticed it.

The difference is definitely big, but how many IOPS would you expect from 12 spinning disks? Especially in RAIDZ vdevs, which are typically known to have IOPS of single disk, just with higher throughput and capacity? Would you have that pool as a stripe or a set of mirrors, you would get 2-3 times more IOPS.

@Ramalama2
Copy link
Author

So on the same SATA-Pool, with the only difference in primarycache=metadata vs primarycache=all the difference is abnormal: 536k IOPS, 500µs latency -> primarycache=all 781 IOPS, 500ms latency -> primarycache=metadata
So primarycache=metadata is definitively broken and no one noticed it.

The difference is definitely big, but how many IOPS would you expect from 12 spinning disks? Especially in RAIDZ vdevs, which are typically known to have IOPS of single disk, just with higher throughput and capacity? Would you have that pool as a stripe or a set of mirrors, you would get 2-3 times more IOPS.

It's not about that, it's about raw disk performance on NVMe drives between ZFS (no ARC) and BTRFS.
The difference in both cases, where you rely on pure filesystem speed, is 700k IOPS vs 26k IOPS...

My databases are simply too big to fit in ARC. However, without ARC, ZFS should reach at least half the IOPS of BTRFS to be at all usable on NVMe drives.

@tonyhutter
Copy link
Contributor

@Ramalama2 I see you're running fio --rw=randread --bs=16k with:

HDD-SATA  recordsize            1M                     local
HDD-SATA  primarycache          metadata               local
HDD-SATA  logbias               throughput             local
HDD-SATA  prefetch              all                    default

Try matching recordsize to your fio block size, set logbias=latency and set prefetch=metadata, if you want good fio random read performance. Also consider using Direct IO with zfs-2.3.0.
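For reference, a rough sketch of what that tuning could look like on this dataset; the property names are standard ZFS, but the per-dataset direct property is only available with zfs-2.3.0, so treat that line as an assumption for this setup:

zfs set recordsize=16k HDD-SATA      # only applies to data written after the change
zfs set logbias=latency HDD-SATA
zfs set prefetch=metadata HDD-SATA   # don't speculatively prefetch data for a random-read workload
zfs set direct=standard HDD-SATA     # zfs-2.3.0+ only: honor O_DIRECT requests from applications
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --direct=1 --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/HDD-SATA/testfile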

@Ramalama2
Copy link
Author

@Ramalama2 I see you're running fio --rw=randread --bs=16k with:

HDD-SATA  recordsize            1M                     local
HDD-SATA  primarycache          metadata               local
HDD-SATA  logbias               throughput             local
HDD-SATA  prefetch              all                    default

Try matching recordsize to your fio block size, set logbias=latency and set prefetch=metadata, if you want good fio random read performance. Also consider using Direct IO with zfs-2.3.0.

Hey Tony, that test was only meant to compare metadata vs all.
HDD-SATA is an absolutely crap pool that I don't even use for backups xD
I just used it to compare the primarycache settings.

Basically it should only show that if I enable primarycache on the NVMe pool, the NVMe pool from the first post would skyrocket to 700k IOPS, like BTRFS (for FIO only).
But it won't improve the MySQL query benchmarks, since the MySQL database is far too big for the ARC.

Cheers

@clhedrick
Copy link

If this database is the only thing on your server, I might try primarycache=all. The amount of metadata should be fairly small. There just aren't many files. Direct I/O bypasses the ARC, and might (under the right conditions) get better performance. But if I understand it, primarycache=metadata doesn't cause data I/O to take paths that are any more efficient than with primarycache=all. I'd expect it to be useful in situations where you have lots of metadata, and a cache small enough that that's all that will fit. I'd expect that to be fairly specialized.

It's true that you can't fit the whole database into cache, but I'd bet that data is not accessed in an entirely random fashion. I'd bet some of it will turn out to be "hot" and would benefit from cache. It would also matter if mysql's record size doesn't match your file system's: you seem to have a ZFS recordsize of 1M, while I believe InnoDB uses 16K pages. Without cache, if data in a table sits in sequential blocks, you'd get a read amplification of 64. I don't know how innodb works internally, but I'd hope it makes some attempt to keep tables together.
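As a sketch of closing that gap, the dataset holding the database could use a recordsize matching InnoDB's default 16K page (the dataset names here are placeholders, and recordsize only affects blocks written after the change):

zfs create -o recordsize=16k -o primarycache=all rpool/mysql   # hypothetical dataset for /var/lib/mysql
zfs set recordsize=16k HDD-SATA                                # or adjust an existing dataset; existing 1M blocks stay 1M until rewritten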

@Ramalama2
Copy link
Author

If this database is the only thing on your server, I might try primarycache=all. The amount of metadata should be fairly small. There just aren't many files. Direct I/O bypasses the ARC, and might (under the right conditions) get better performance. But if I understand it, primarycache=metadata doesn't cause data I/O to take paths that are any more efficient than with primarycache=all. I'd expect it to be useful in situations where you have lots of metadata, and a cache small enough that that's all that will fit. I'd expect that to be fairly specialized.

It's true that you can't fit the whole database into cache, but I'd bet that data is not accessed in an entirely random fashion. I'd bet some of it will turn out to be "hot" and would benefit from cache. It would also matter if mysql's record size doesn't match your file system's: you seem to have a ZFS recordsize of 1M, while I believe InnoDB uses 16K pages. Without cache, if data in a table sits in sequential blocks, you'd get a read amplification of 64. I don't know how innodb works internally, but I'd hope it makes some attempt to keep tables together.

Well, it gets weird, because I just tried with primarycache=all.
With primarycache=all, the query runtimes in my MySQL benchmark on ZFS drop from 162s to 10s.
And the database is definitely not in ARC.

But that means to me that there is something broken with primarycache=metadata altogether, since it should not be slower compared to primarycache=all with an empty ARC.

It seems to me like something is broken in the ZFS pipeline with primarycache=metadata.

I don't understand anything anymore, to be honest.
Some dev should read this, and I fear that if we keep talking here it gets too much to read for any dev.

@clhedrick
Copy link

clhedrick commented Jan 27, 2025 via email

@clhedrick
Copy link

clhedrick commented Jan 28, 2025

I should note that 6 is probably not enough threads. I tried this on a production system, with recordsize 16K and a fio read size of 16K. It took me 32 threads to reach the maximum, 235K IOPS. (I should probably apologize to my users, since this surely affected their performance.)

I left primarycache=all, but used a 4 TB file to minimize cache effects. (arc 273G)

Single threaded I get 11.5K IOPS. This is not surprising, since NVMe typically has more than 100 usec read latency.

fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4T --numjobs=1 --runtime=60 --time_based=1 --group_reporting --filename=testfile

ZFS 2.2.7
Redhat 9.5
AMD EPYC 9474F
512 G memory (300 G arc)
9 mirrors, mix of Intel SSDPE2KX080T8 and Samsung MZQL215THBLA-00A07

The Intel drives are 4 years old, so they are not state of the art.

Note that I didn't specify direct, because ZFS 2.2.7 doesn't support it. But without it, I/O is synchronous; that's why it is limited to ~10K IOPS per thread, which is the hardware spec. I thought --buffered=0 might help, but the man page seems to imply that this is simply direct, which of course won't work with ZFS 2.2.7.

I believe innodb does use async I/O, so 2.3 might help it.

Our real usage is 100% NFS. Can anyone comment whether that allows async I/O? Will implementation of direct I/O in 2.3 give NFS any additional functionality?
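Going back to the thread-count point: if anyone wants to find that plateau on their own hardware, a rough sketch of sweeping the job count (file path and size are placeholders):

for jobs in 1 2 4 8 16 32 64; do
  fio --name=randread-$jobs --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4T --numjobs=$jobs --runtime=60 --time_based=1 --group_reporting --filename=testfile | grep 'read: IOPS'
done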

@xeghia
Copy link

xeghia commented Jan 28, 2025

Several months ago I found a similar performance issue to Ramalama2 when using ZFS 2.2.x in my own workflow where I had set primarycache=metadata on a single device pool (a SATA SSD) as this data was only going to be discarded immediately after processing.

For some context, this was downloading multipart archives in parallel to this temporary SSD pool using lftp, then immediately extracting the archive out again to the main pool using unrar. Put another way: writing 12 x 50MB files simultaneously to speed up downloads, then reading them back in sequence while extracting the archive.
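(For illustration only, a rough sketch of that workflow; host, paths and archive names are placeholders:)

lftp -e "mirror --parallel=12 /remote/archives /mnt/scratch; quit" sftp://user@example-host
unrar x /mnt/scratch/archive.part01.rar /mnt/mainpool/extracted/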

This did not perform as expected: it was very slow when primarycache=metadata was set, but was easily resolved by setting it back to primarycache=all while changing nothing else. From memory, the former was about 50 MB/s and the latter around 450 MB/s. After confirming there were no hardware issues, I eventually shrugged it off, left the pool set to primarycache=all and forgot about it.

I have gone back and checked this scenario again today before commenting here, and I'm not able to replicate it using ZFS 2.3.0. The odd performance hit when setting primarycache=metadata was present when I last tried this somewhere back around 2.2.2 (give or take); it now seems fixed in 2.3.0, but still appears to be present in the latest 2.2 release.

I'm guessing primarycache=metadata is not a common use case, so I'm not surprised it went unnoticed; I found nothing online about it when I searched several months back.

@ChristophSchmidpeter
Copy link

Possibly duplicate of #16966 ?

@Ramalama2
Copy link
Author

Ramalama2 commented Jan 28, 2025

Possibly duplicate of #16966 ?

Not really, I have no issues with booting or stuttering/hangs. It's just about the performance of primarycache=all vs primarycache=metadata on NVMe drives.
350k IOPS vs 26k IOPS.

BTW, as a thread update: I have now changed everything to primarycache=all, since metadata is somewhat buggy.
Performance returned to a normal state (almost everything is now 12x faster).
The MySQL benchmarks went from 162s to 10s (and I'm getting around 350k IOPS instead of 26k).

I'm happy, but still not entirely happy, since I still get twice (almost 3x) the performance with BTRFS: 3,7s vs 10,4s for the Query 1 benchmark xD

However, I need to wait for ZFS 2.3.0 until it arrives in Proxmox, and then retest. (@xeghia)
Thanks everyone :-) (Except @scineram, who looks like a troll to me xD)

@clhedrick
Copy link

clhedrick commented Jan 28, 2025

You can get some performance from using multiple threads, as you can tell by increasing the number of processes in fio. innodb has a parameter innodb_read_io_threads (and the write equivalent) to control the number of threads. It defaults to the number of logical processors / 2. That makes sense if it can use async I/O, since the main advantage of multiple threads would be to get lots of processor power. But without AIO, it might make sense to use a larger value.

Look at "show engine innodb status". If the number of queued queries is high, you might benefit from more i/O threads or use of ZFS 2.3.0. https://dba.stackexchange.com/questions/299461/how-do-you-tune-innodb-read-io-threads But there are other things you should do as well: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#database-workloads However logbias=throughput isn't always appropriate with NVMe. Note that the recommendation disables async I/O. I don't know whether that's still true for ZFS 2.3.0. See also https://www.reddit.com/r/zfs/comments/u1xklc/mariadbmysql_database_settings_for_zfs/

@clhedrick
Copy link

Finally, take a look at your memory allocation. The reason people recommend primarycache=metadata isn't because cache is useless, but because they assume innodb's cache is better than ZFS's for mysql data. If primarycache=all gives dramatically better performance that suggests that either 1) innodb's cache isn't as good as we'd hoped or 2) you haven't given innodb enough memory for its cache.

If innodb's cache is in fact better than ZFS's for database data, you'd be better off decreasing the size of the ARC and increasing the size of the innodb cache. Half of memory for the innodb cache is certainly reasonable (just as ZFS typically uses half of memory for its cache). Of course, the total of the ARC and innodb's cache needs to be smaller than total memory, or you'll create swapping or some other misbehavior.
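A sketch of one way to rebalance that, assuming roughly 768GB of RAM as in the original post; the 256GiB/384G split is purely illustrative:

# cap the ARC (value in bytes), e.g. in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=274877906944
# or at runtime:
echo 274877906944 > /sys/module/zfs/parameters/zfs_arc_max

# and give InnoDB a larger buffer pool in my.cnf:
# innodb_buffer_pool_size = 384G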

@Ramalama2
Copy link
Author

Ramalama2 commented Jan 28, 2025

Finally, take a look at your memory allocation. The reason people recommend primarycache=metadata isn't because cache is useless, but because they assume innodb's cache is better than ZFS's for mysql data. If primarycache=all gives dramatically better performance that suggests that either 1) innodb's cache isn't as good as we'd hoped or 2) you haven't given innodb enough memory for its cache.

If innodb's cache is in fact better than ZFS's for database data, you'd be better off decreasing the size of the ARC and increasing the size of the innodb cache. Half of memory for the innodb cache is certainly reasonable (just as ZFS typically uses half of memory for its cache). Of course, the total of the ARC and innodb's cache needs to be smaller than total memory, or you'll create swapping or some other misbehavior.

I'm using MyISAM tables here without caching.
You are comparing ZFS's caching mechanism vs MySQL's; I'm comparing ZFS (all, but empty ARC) to ZFS (metadata) to BTRFS.
ZFS with primarycache=metadata -> 162s
ZFS with primarycache=all -> 10s (empty ARC)
BTRFS (no caching) -> 3,7s

It's entirely about native FS speed; the primarycache=all run is done with an empty ARC, so no cache either. With "empty" memory (nothing in ARC), ZFS has 12x the performance of primarycache=metadata.
It would make sense for primarycache=all to be 12 times faster with something cached in ARC, but not with an empty ARC.

This leads to a simple conclusion: something is broken with primarycache=metadata, some blocking or extra steps somewhere, as if ZFS takes a longer path (which costs time) instead of shortening it or leaving it as is.
BTRFS has no caching at all and I don't need additional RAM for anything; it simply reads from the NVMe drives and gets 52x the performance of ZFS (metadata) or 3x the performance of ZFS (all).


--> Even if metadata is not used by a lot of people, I would expect ZFS to reach at least half of the raw NVMe array speed with primarycache=metadata (instead of being 52x slower).

--> With primarycache=all, I expect ZFS to be even faster than raw NVMe speed, because of the additional use of insanely fast 12-channel memory. (Instead of being 3x slower.)

PS: Sure, there is a limit to what the CPU can handle. But I reach the CPU limit only with BTRFS, while on ZFS there is maybe a peak of 20%.

@clhedrick
Copy link

OK, if you're using MyISAM with no caching, the results make sense. I believe btrfs, like other Linux file systems, uses the Linux page cache. To my knowledge, ZFS is the only file system that does its own caching. So disabling ZFS caching is not comparable to btrfs; to be comparable with btrfs, you need primarycache=all.
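A sketch of how to make the runs comparable, assuming the goal is a cold cache on both sides (pool name is a placeholder):

# before the btrfs/ext4 run: flush and drop the Linux page cache
sync
echo 3 > /proc/sys/vm/drop_caches
# before the ZFS run: keep primarycache=all but start from a cold ARC,
# e.g. by exporting and re-importing the pool while it is idle
zpool export <pool> && zpool import <pool>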
