ZFS utterly Slow on NVME (BTRFS vs ZFS) #16993
When a test shows a two-orders-of-magnitude difference, it usually means you are comparing the incomparable. |
What is incomparable there? |
quick observations -
Side note - we have a PR going through review, #16591, that could potentially double IOPS for |
We'd certainly notice a 20 msec latency. On a production system hosting home directories we see 241 usec read, 198 usec write (from nfsiostat on the client). I think those are a bit high, as Linux doesn't do a good job of reporting latency when several transactions are in the pipeline at once. A system on hard disk used for people who need more space gives 1 msec read and 373 usec write, with the same caveat. Of course we get fairly high hit rates on the ARC, so that may be a bit misleading, but I certainly wouldn't expect to see a 20 msec average read. To see what happens when things actually hit the disk, I use zpool iostat -v -l. It shows around 13 msec read and 30 msec write on hard disk. (That's total wait. Disk wait is more like 11 read and 9 write.) Users would never see the write latencies, because writes are acknowledged as soon as the data goes into the ARC or gets written to an NVMe slog, depending. iostat on our NVMe system shows around 140 usec total read and 50 usec write. These are not heavily loaded systems. Presumably you can find a way to load a system badly enough that latencies will go very high.
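For reference, the latency figures I'm quoting come from something like this (the pool name is a placeholder, with 5-second sampling):
zpool iostat -v -l tank 5
# the -l columns show total_wait, disk_wait, syncq_wait and asyncq_wait for reads and writes, per vdev with -v
|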
I know that direct I/O is coming with ZFS 2.3.0 (which I cannot test at the moment, sadly), but I'm pretty sure that direct I/O will greatly increase NVMe IOPS. 1. Without --direct, the IOPS on BTRFS will skyrocket, while on ZFS this won't change anything (see the two fio variants sketched after this list):
ZFS-Tuned (Raid-10, 8x nvme, 26.5k IOPS, 10ms latency)
BTRFS (Raid-0, 2x nvme, 1178k IOPS, 100-250µs latency)
2. Increasing the fio block size to match the recordsize:
ZFS-Tuned (Raid-10, 8x nvme, 21.8k IOPS, 10ms latency)
BTRFS (Raid-0, 2x nvme, 615k IOPS, 250-500µs latency)
3. Increasing numjobs: fio benefits a little, but MySQL does not, so this is basically not really helpful in any way. However:
ZFS-Tuned (Raid-10, 8x nvme, 31.2k IOPS, 10ms latency)
BTRFS (Raid-0, 2x nvme, 941k IOPS, 250µs latency)
4. Conclusion: Don't forget, I am trying to improve MySQL performance, and BTRFS is worlds ahead for MySQL; actually for almost everything real-world related.
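For point 1, the two fio variants are just the command from the first post with --direct toggled (same path as used there):
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=0 --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile
# with --direct=0 the BTRFS numbers skyrocket (page cache), while the ZFS numbers barely move
|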
You haven't specified what "queries" you have tested. Are those all selects, or are there any writes? ZFS is known for its rigorous transactional safety guarantees, and even among SSDs not all are made equal with respect to synchronous writes, which database workloads produce plentifully. Do the other file systems provide the same guarantees? You've disabled all data caching in ZFS and are actively running sub-block accesses. That predictably creates massive read/write inflation. Do the other file systems provide checksumming, compression, etc., preventing sub-block disk accesses? What block sizes do they use? ZFS does have some IOPS bottlenecks, for some of which there are already PRs floating around, but in my experience those bottlenecks start closer to 300-400K IOPS. 27K IOPS is not serious. There must be some reason(s). |
Here is Example Query 1 (the one used for the benchmark):
Just so that others do not complain about the query that I use. |
It's similar here:
But the storage for the MariaDB VM is not directly on the pool, it's a ZFS filesystem on a dataset: ZFS Storage Details
It's an LXC container on Proxmox (not a VM), so there is no additional overhead like another scheduler etc... So I'm not even sure what the issue is, or let's say I don't believe that there is an issue. The funny part is that on a SAS array (on my backup server), with ARC, I get 573k IOPS: ZFS (Raid-10, 8x HDD(SAS) + 2x Special nvme, 573k IOPS, 500µs latency)
So ARC does a ton, but I don't want ARC, because I have really huge databases and servers with only 768GB memory. Because of this, ZFS with primarycache=metadata seems broken to me. |
primarycache=all vs primarycache=metadata Command: ZFS HDD-SATA Storage Details
ZFS primarycache=metadata (781 IOPS, 500ms latency)
ZFS primarycache=all (536k IOPS, 500µs latency)
So on the same SATA pool, with the only difference being primarycache=metadata vs primarycache=all, the difference is abnormal. So primarycache=metadata is definitely broken and nobody noticed it. Cheers
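The only thing toggled between the two runs is the dataset property (the dataset name below is a placeholder):
zfs set primarycache=metadata tank/mysql   # slow case: only metadata is kept in the ARC
# ...run the same fio command as in the first post...
zfs set primarycache=all tank/mysql        # fast case: data and metadata are cached
# ...run it again, nothing else changed
|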
The difference is definitely big, but how many IOPS would you expect from 12 spinning disks? Especially in RAIDZ vdevs, which are typically known to have the IOPS of a single disk, just with higher throughput and capacity. If you had that pool as a stripe or a set of mirrors, you would get 2-3 times more IOPS. |
It's not about that, it's about raw disk performance on NVMe drives between ZFS (no ARC) and BTRFS. My databases are simply too big to fit in ARC. However, even without ARC, ZFS should reach at least half the IOPS of BTRFS to be anywhere near usable on NVMe drives. |
@Ramalama2 I see you're running
Try matching |
Hey Tony, that test was only made to compare metadata vs all. Basically it should only show that if I enabled primarycache for the NVMe pool, the NVMe pool from the first post would skyrocket to 700k IOPS, like BTRFS (for fio only). Cheers |
If this database is the only thing on your server, I might try primarycache=all. The amount of metadata should be fairly small; there just aren't many files. Direct I/O bypasses the ARC, and might (under the right conditions) get better performance. But if I understand it, primarycache=metadata doesn't cause data I/O to take paths that are any more efficient than with primarycache=all. I'd expect it to be useful in situations where you have lots of metadata, and a cache small enough that that's all that will fit. I'd expect that to be fairly specialized. It's true that you can't fit the whole database into cache, but I'd bet that the data is not accessed in an entirely random fashion. I'd bet some of it will turn out to be "hot", and would benefit from cache. Cache would also help if mysql's record size doesn't match your file system's. You seem to have a ZFS record size of 1M, and I believe innodb uses 16K. With cache turned off, if data in a table sits in sequential blocks, you'd get a read amplification of 64. I don't know how innodb lays out its data, but I'd hope they make some attempt to keep tables together.
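A minimal sketch of what matching those sizes would look like (the dataset name is a placeholder; recordsize only applies to blocks written after the change, so existing data would need to be rewritten):
zfs set recordsize=16k tank/mysql   # match innodb's 16K page size instead of the 1M recordsize
# with recordsize=1M, each random 16K read pulls in a whole 1M record: 1M / 16K = 64x read amplification
|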
Well, it gets weird, because I just tried primarycache=all: on my MySQL benchmark, the query runtime on ZFS drops from 162s to 10s, and the database is definitely not in ARC. But that means to me that there is something broken with primarycache=metadata altogether, since it should not be slower compared to primarycache=all with an empty ARC. Seems to me like there is something broken in the ZFS pipeline with primarycache=metadata. I don't understand anything anymore, to be honest. |
Not necessarily. It may mean that my conjecture is right. Primarycache=metadata is harmless only if access at the 16k innodb level is truly random. But there are all kinds of reasons why it might not be. You could see as much as a factor of 64 improvement from “all,” though that much is unlikely.
|
I should note that 6 is probably not enough threads. I tried this on a production system, with recordsize 16K and a 16K fio read size. It took me 32 threads to reach a maximum of 235K IOPS. (I should probably apologize to my users, since this surely affected their performance.) I left primarycache=all, but used a 4 TB file to minimize cache effects (ARC 273G). Single threaded I get 11.5K IOPS. This is not surprising, since NVMe typically has more than 100 usec read latency.
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4T --numjobs=1 --runtime=60 --time_based=1 --group_reporting --filename=testfile
This is ZFS 2.2.7. The Intel drives are 4 years old, so they are not state of the art. Note that I didn't specify --direct, because ZFS 2.2.7 doesn't support it. But without that, I/O is synchronous; that's why it is limited to about 10K IOPS per thread. That's the hardware spec. I thought --buffered=0 might help, but the man page seems to imply that this is simply direct, which of course won't work with ZFS 2.2.7. I believe innodb does use async I/O, so 2.3 might help it. Our real usage is 100% NFS. Can anyone comment on whether that allows async I/O? Will the implementation of direct I/O in 2.3 give NFS any additional functionality?
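The saturation point can be found with a sweep along these lines (a sketch; same file and sizes as the command above):
for j in 1 2 4 8 16 32 64; do
  fio --name=randread-$j --ioengine=libaio --iodepth=32 --rw=randread --bs=16k --size=4T --numjobs=$j --runtime=60 --time_based=1 --group_reporting --filename=testfile --output=randread-$j.log
done
|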
Several months ago I found a similar performance issue to Ramalama2 when using ZFS 2.2.x in my own workflow, where I had set primarycache=metadata on a single-device pool (a SATA SSD) as this data was only going to be discarded immediately after processing. For some context, this was downloading multipart archives in parallel to this temporary SSD pool using lftp, then immediately extracting that archive out again to the main pool using unrar. Put another way, writing 12 x 50MB files simultaneously to speed up downloads, then reading them back in sequence while extracting the archive. This did not perform as expected: it was very slow when primarycache=metadata was set, but was easily resolved by setting it back to primarycache=all while changing nothing else. From memory the former was about 50 MB/s and the latter was around 450 MB/s. After confirming there were no hardware issues I eventually shrugged it off, left the pool set to primarycache=all and forgot about it. I have gone back and checked this scenario again today before commenting here. I'm not able to replicate this using ZFS 2.3.0; it appears this odd performance issue with primarycache=metadata existed when I last tried it somewhere back around 2.2.2 (give or take). It now seems fixed in 2.3.0, but still appears to be present in the latest 2.2 release. I'm guessing primarycache=metadata is not a common use case, so I'm not surprised it went unnoticed; I found nothing online at the time when I searched about it several months back. |
Possibly duplicate of #16966 ? |
Not really, I have no issues with booting or stuttering/hangs. It's just about the performance of primarycache=all vs primarycache=metadata on NVMe drives. BTW, for the thread update: I have now changed everything to primarycache=all, since metadata is somewhat buggy. I'm happy, but still not entirely happy, since I still get twice (almost 3x) the performance with BTRFS: 3.7s vs 10.4s for the Query 1 benchmark xD However, I need to wait for ZFS 2.3.0 until it arrives in Proxmox, and retest. (@xeghia ) |
You can get some performance from using multiple threads, as you can tell by increasing the number of processes in fio. innodb has a parameter innodb_read_io_threads (and the write equivalent) to control the number of threads. It defaults to the number of logical processors / 2. That makes sense if it can use async I/O, since the main advantage of multiple threads would be to get lots of processor power. But without AIO, it might make sense to use a larger value. Look at "show engine innodb status"; if the number of queued queries is high, you might benefit from more I/O threads or from ZFS 2.3.0. https://dba.stackexchange.com/questions/299461/how-do-you-tune-innodb-read-io-threads
But there are other things you should do as well: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#database-workloads However, logbias=throughput isn't always appropriate with NVMe. Note that the recommendation disables async I/O; I don't know whether that's still true for ZFS 2.3.0. See also https://www.reddit.com/r/zfs/comments/u1xklc/mariadbmysql_database_settings_for_zfs/
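For the io-thread knobs above, a minimal config drop-in sketch (the file path and values are illustrative, not recommendations):
cat >> /etc/mysql/mariadb.conf.d/99-io-threads.cnf <<'EOF'
[mysqld]
innodb_read_io_threads  = 16
innodb_write_io_threads = 16
EOF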
Finally, take a look at your memory allocation. The reason people recommend primarycache=metadata isn't because cache is useless, but because they assume innodb's cache is better than ZFS's for mysql data. If primarycache=all gives dramatically better performance, that suggests that either 1) innodb's cache isn't as good as we'd hoped or 2) you haven't given innodb enough memory for its cache. If innodb's cache is in fact better than ZFS's for database data, you'd be better off decreasing the size of the ARC and increasing the size of the innodb cache. Half of memory for the innodb cache is certainly reasonable (just as ZFS typically uses half of memory for its cache). Of course the total of the ARC and innodb's cache needs to be smaller than total memory, or you'll create swapping or some other misbehavior.
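A sketch of what that split could look like on a 768GB box (the numbers are illustrative assumptions, not tuned values; the conf file names are arbitrary):
# cap the ARC at e.g. 256 GiB (value in bytes), persistent across reboots:
echo "options zfs zfs_arc_max=274877906944" > /etc/modprobe.d/zfs-arc.conf
# or apply immediately without a reboot:
echo 274877906944 > /sys/module/zfs/parameters/zfs_arc_max
# and give innodb a correspondingly larger buffer pool, e.g. in a my.cnf drop-in:
cat >> /etc/mysql/mariadb.conf.d/99-buffer-pool.cnf <<'EOF'
[mysqld]
innodb_buffer_pool_size = 384G
EOF
|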
I'm using MyISAM tables here without caching. It's entirely about native FS speed; primarycache=all was tested with an empty ARC, so no cache there either. With "empty" memory (nothing in ARC), ZFS has 12x the performance of primarycache=metadata. This leads to a simple conclusion: there is something broken with primarycache=metadata, some blocks or whatever, as if ZFS takes some extra steps (which take time) instead of shortening the path or leaving it as is.
--> Even if metadata is not used by a lot of people, I would expect ZFS to reach at least half of the raw NVMe array speed with primarycache=metadata (instead of being 52x slower).
--> With primarycache=all, I expect ZFS to be even faster than raw NVMe speed, because of the additional use of insanely fast 12-channel memory (instead of being 3x slower).
PS: Sure, there is a limit to what the CPU can handle. But I reach the CPU limit only with BTRFS, while on ZFS there is maybe a peak of 20%. |
OK, if you're using MyISAM with no caching, the results make sense. I believe btrfs, like other Linux file systems, uses the Linux page cache. To my knowledge, ZFS is the only file system that does its own caching. So disabling ZFS caching is not comparable to btrfs; to be comparable with btrfs you need primarycache=all. |
This is maybe known to everyone, but something has to change, because ZFS is getting more and more unusable on fast NVMe drives.
System information
Distribution Name | Proxmox 8.3.3 (Debian 12)
Kernel Version | 6.8 / 6.11
Architecture | x64
OpenZFS Version | 2.2.7
CPU: Genoa 9374F
Memory: 12x 64GB Dimms (768GB Total)
Drives: 8x Micron 7450 Max 3,2TB / 2x Micron 7500 Max 3,2TB
I ran a lot of benchmarks, fio and complex MySQL queries, on ZFS, EXT4 & BTRFS
Tested MariaDB version: 11.6
Total Database Space used: 446 GB
FIO Command:
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=4G --numjobs=6 --runtime=60 --time_based=1 --group_reporting --filename=/var/lib/mysql/testfile
Query Command:
Complex SELECT query with LEFT JOINs, INNER JOINs and GROUP BY, across multiple databases.
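Purely to illustrate the shape of such a query (this is not the actual query; the database, table and column names are made up):
SELECT o.customer_id, c.region, COUNT(*) AS orders, SUM(i.amount) AS total
FROM shop.orders AS o
INNER JOIN shop.order_items AS i ON i.order_id = o.id
LEFT JOIN crm.customers AS c ON c.id = o.customer_id
GROUP BY o.customer_id, c.region;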
MariaDB 11.6 Results:
I tested 6 queries; they are all different, but "Query 1" and "Query 3" are hard to cache, so I would look only at these 2.
After the startup of the VM, each query was run only 3x and only the third result is measured.
(The first result on BTRFS vs ZFS is even a lot more painful: BTRFS takes ~5s on the first run after startup, while ZFS takes over 15 minutes!)
FIO Results (Paired for the MySQL-Benchmarks above):
ZFS-Default means:
ZFS-Tuned means:
ZFS-Default (single, 1x 7450, 7.9k IOPS, 20ms latency)
ZFS-Tuned (single, 1x 7450, 27.7k IOPS, 10ms latency)
ZFS-Tuned (Raid-10, 8x 7450, 26.7k IOPS, 10ms latency)
EXT4 (single, 1x 7450, 703k IOPS, 250µs latency)
BTRFS (single, 1x 7450, 673k IOPS, 250µs latency)
BTRFS (Raid-0, 2x 7500, 642k IOPS, 250µs latency)
ZFS is most horrible on databases especially; on general usage / file transfer and so on, ZFS is almost on par with BTRFS.
In the end, for mixed environments, BTRFS is a much better choice for NVMe drives.