Large sync writes perform worse with slog #14378
Conversation
For synchronous write workloads with large IO sizes, a pool configured with a slog performs worse than one with an embedded zil:

sequential_writes 1m sync ios, 16 threads

| Metric | embedded zil | slog | change |
| --- | --- | --- | --- |
| Write IOPS | 1292 | 438 | -66.10% |
| Write Bandwidth (KiB/s) | 1323570 | 448910 | -66.08% |
| Write Latency (ns) | 12128400 | 36330970 | 3.0x |

sequential_writes 1m sync ios, 32 threads

| Metric | embedded zil | slog | change |
| --- | --- | --- | --- |
| Write IOPS | 1293 | 430 | -66.74% |
| Write Bandwidth (KiB/s) | 1324184 | 441188 | -66.68% |
| Write Latency (ns) | 24486278 | 74028536 | 3.0x |

The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`, if a zil block is greater than 768K, the priority of the write is downgraded from sync to async. Increasing the value allows greater throughput. To select a value for this PR, I ran an fio workload with the following values for `zil_slog_bulk`:

| zil_slog_bulk | KiB/s |
| --- | --- |
| 1048576 | 422132 |
| 2097152 | 478935 |
| 4194304 | 533645 |
| 8388608 | 623031 |
| 12582912 | 827158 |
| 16777216 | 1038359 |
| 25165824 | 1142210 |
| 33554432 | 1211472 |
| 50331648 | 1292847 |
| 67108864 | 1308506 |
| 100663296 | 1306821 |
| 134217728 | 1304998 |

At 64M, the results with a slog are now improved to parity with an embedded zil:

sequential_writes 1m sync ios, 16 threads

| Metric | slog before (768K) | slog after (64M) | change |
| --- | --- | --- | --- |
| Write IOPS | 438 | 1288 | 2.9x |
| Write Bandwidth (KiB/s) | 448910 | 1319062 | 2.9x |
| Write Latency (ns) | 36330970 | 12163408 | -66.52% |

sequential_writes 1m sync ios, 32 threads

| Metric | slog before (768K) | slog after (64M) | change |
| --- | --- | --- | --- |
| Write IOPS | 430 | 1290 | 3.0x |
| Write Bandwidth (KiB/s) | 441188 | 1321693 | 3.0x |
| Write Latency (ns) | 74028536 | 24519698 | -66.88% |

None of the other tests in the performance suite (run with a zil or slog) had a significant change, including the random_write_zil tests, which use multiple datasets.

Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
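To make the mechanism concrete, below is a small standalone model of the decision described above. It is a sketch only: the names mirror the OpenZFS tunable and zio priorities, but this is not the actual `zil.c` source, and the quantity compared against `zil_slog_bulk` is simplified for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical standalone model of the slog write-priority decision;
 * illustrative only, not the actual OpenZFS zil.c code. */
enum zio_priority { ZIO_PRIORITY_SYNC_WRITE, ZIO_PRIORITY_ASYNC_WRITE };

static uint64_t zil_slog_bulk = 768 * 1024;	/* default before this PR */

static enum zio_priority
lwb_write_priority(int on_slog, uint64_t zil_bytes_outstanding)
{
	/* Only slog writes are subject to the cutoff: once the ZIL
	 * backlog exceeds zil_slog_bulk, the log write is queued at
	 * async rather than sync priority. */
	if (on_slog && zil_bytes_outstanding > zil_slog_bulk)
		return (ZIO_PRIORITY_ASYNC_WRITE);
	return (ZIO_PRIORITY_SYNC_WRITE);
}

int
main(void)
{
	uint64_t burst = 16ULL << 20;	/* e.g. 16 threads x 1MB writes */

	printf("768K cutoff: %s\n", lwb_write_priority(1, burst) ==
	    ZIO_PRIORITY_SYNC_WRITE ? "sync" : "async");

	zil_slog_bulk = 64ULL << 20;	/* value proposed in this PR */
	printf("64M cutoff:  %s\n", lwb_write_priority(1, burst) ==
	    ZIO_PRIORITY_SYNC_WRITE ? "sync" : "async");
	return (0);
}
```

With the 768K default, the 1MB-blocksize bursts in the benchmark above exceed the cutoff immediately and are queued at async priority; at 64M they stay at sync priority.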
Raising the limit makes sense given the improvement in disk write performance since the tunable was introduced, though I wonder whether it's necessary to keep this limit at all. Perhaps we can revisit that another time.
Change looks great! Thanks for chasing down this problem.
@jwk404 I think your test is flawed, at least on the point that 64MB is the right value. With 32 threads doing 1MB writes you can never get more than 32MB of ZIL log, so setting `zil_slog_bulk` to 64MB simply disables the functionality. If you used 4MB writes, for example, I bet bigger values could have an effect. The problem with 64MB is that it can barely be reached by sync=always workloads. I think it can only be reached if some application did plenty of writes and then requested sync(). But the real question is why does it really help? What is so bad about async write priority that it hurts performance so much in your case? My first guess is that it can make latency less fair by switching from FIFO to elevator sorting, but I'm not sure why that should cause a 3x slowdown. The second guess is that async priority is limited in the number of concurrent requests by the I/O scheduler. It would be good to find out for better understanding. Maybe my idea of using async priority there was not the best, or maybe it could be improved.
With respect to why this helps, I believe your second guess is the correct one. When comparing the output of `zpool_influxdb` between runs with zil vs. slog without this change, I noticed that the pool with the slog had significantly fewer IOs in the active sync write vdev IO queue. It was roughly a 10x difference. I agree that a setting of 64MB effectively disables the functionality. I considered removing it, but thought it better to leave the tunable in case some other workload regressed with the new value. I tested workloads where we have multiple datasets being written synchronously to see if changing the tunable had any effect, but it didn't. Can you describe a workload where this tunable improves performance? I wasn't able to create one.
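For context on why an async downgrade can slash concurrency: the ZFS I/O scheduler gives each priority class its own active-I/O limits, and async writes only ramp up toward their maximum as dirty data accumulates. The sketch below is a standalone illustrative model of that ramp; the variable names mirror OpenZFS's vdev queue tunables, but the default values shown are assumptions for demonstration, not taken from this PR.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of per-class vdev queue concurrency limits
 * (loosely after vdev_queue.c); defaults here are assumptions. */
static uint32_t zfs_vdev_sync_write_max_active = 10;
static uint32_t zfs_vdev_async_write_min_active = 2;
static uint32_t zfs_vdev_async_write_max_active = 10;
static int zfs_vdev_async_write_active_min_dirty_percent = 30;
static int zfs_vdev_async_write_active_max_dirty_percent = 60;

/* Async writes only approach their max concurrency when the pool
 * holds a lot of dirty data; the limit scales linearly in between. */
static uint32_t
async_write_max_active(int dirty_pct)
{
	int min_p = zfs_vdev_async_write_active_min_dirty_percent;
	int max_p = zfs_vdev_async_write_active_max_dirty_percent;

	if (dirty_pct <= min_p)
		return (zfs_vdev_async_write_min_active);
	if (dirty_pct >= max_p)
		return (zfs_vdev_async_write_max_active);
	return (zfs_vdev_async_write_min_active +
	    (zfs_vdev_async_write_max_active -
	    zfs_vdev_async_write_min_active) *
	    (dirty_pct - min_p) / (max_p - min_p));
}

int
main(void)
{
	/* A sync-heavy workload keeps dirty data low, so lwb writes
	 * downgraded to async priority may be allowed far fewer
	 * concurrent I/Os per vdev than at sync priority. */
	printf("sync write limit: %u\n",
	    (unsigned)zfs_vdev_sync_write_max_active);
	printf("async write limit at 10%% dirty: %u\n",
	    (unsigned)async_write_max_active(10));
	return (0);
}
```

Under these assumed defaults the downgraded writes would run with a fraction of the sync-priority concurrency, which is consistent in direction with the ~10x queue-depth difference observed via `zpool_influxdb` above.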
@amotin Do you have thoughts on this?
The idea behind this was that if you have two datasets, one of which is more latency sensitive, like iSCSI with sync=always, while another, like async NFS, generates huge sync bursts from time to time, the first should not suffer from the bursts of the second. With the growth of I/O sizes, queue depths, and speeds, it may be that 768K is indeed too small. I don't think I'd have objections to bumping it up, just not sure how high. My unmotivated feeling is ~16MB, but it would probably be good to check how big the sync bursts created by modern NFS clients that properly support server cache control (Linux, FreeBSD, etc.) are. As an alternative or addition to this, I think we may investigate completely disabling offset sorting in the I/O scheduler for non-rotational (SSD) vdevs even for async priorities, while keeping aggregation. SSDs should not care much about offsets if the writes are large, while it should make latency more stable and predictable. IIRC, at the time the I/O scheduler was written, ZFS had no information about whether the device was rotational or not.
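The FIFO-vs-elevator distinction raised here can be sketched with a toy model. This is purely illustrative; OpenZFS's scheduler actually keeps offset-sorted AVL trees per priority class and aggregates adjacent I/Os, so the real implementation differs.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model of the two dequeue disciplines discussed above. */
typedef struct {
	uint64_t offset;	/* device offset */
	uint64_t seq;		/* arrival order */
} io_t;

/* FIFO: issue the oldest request, so latency tracks arrival order. */
static size_t
pick_fifo(const io_t *q, size_t n)
{
	size_t best = 0;
	for (size_t i = 1; i < n; i++)
		if (q[i].seq < q[best].seq)
			best = i;
	return (best);
}

/* Elevator: issue the request nearest the last issued offset. Good
 * for rotational media, but it reorders requests, which can make
 * per-request latency less fair on SSDs. */
static size_t
pick_elevator(const io_t *q, size_t n, uint64_t last)
{
	size_t best = 0;
	uint64_t bestd = UINT64_MAX;
	for (size_t i = 0; i < n; i++) {
		uint64_t d = q[i].offset > last ?
		    q[i].offset - last : last - q[i].offset;
		if (d < bestd) {
			bestd = d;
			best = i;
		}
	}
	return (best);
}

int
main(void)
{
	io_t q[] = { { 900, 0 }, { 400, 1 }, { 120, 2 } };

	/* FIFO issues seq 0 first; the elevator, sitting at offset
	 * 110, issues seq 2 first and makes seq 0 wait. */
	printf("fifo: seq %llu\n",
	    (unsigned long long)q[pick_fifo(q, 3)].seq);
	printf("elevator: seq %llu\n",
	    (unsigned long long)q[pick_elevator(q, 3, 110)].seq);
	return (0);
}
```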
I did some further testing along these lines. Using 4 NFS clients, I was able to create relatively high bursts of sync IO. Using a bcc script (and the 64MB change suggested in this PR), I was able to monitor `zilog->zl_cur_used` and saw it hovering under 100MB, with peaks in the 200-300MB range. With the default, it averaged in the 380MB range with a peak of almost 2GB. So I think 64MB is a reasonable default; I don't think a complex config (mixed workloads in a pool) should be prioritized over a simple config (any pool with a slog).
@jwk404 What burst sizes do you see with one client doing sequential writes? I'd prefer the tunable to be set below that level.
I did a series of experiments where I launched a 60-second local sync workload and, after 15 seconds, launched a similar 30-second workload over NFS. The table below shows average throughput in MB/s at varying tunable sizes.
Given that, what do you think about 32M for the tunable?
I asked you to monitor `zilog->zl_cur_used` when only the NFS client is used, maybe with different intensity. From the "1 NFS" column I can only guess that it is likely somewhere between 768K and 16MB. Also, what was your "local" workload, that it required 64MB of simultaneous requests? The whole idea of this tunable is to separate "bulk" and "interactive" workloads, giving the latter a higher priority. 64MB does not sound interactive to me.
`zilog->zl_cur_used` varied from a few MB to peaks as high as 320MB. The average over 3 runs was 42MB.
"1MB writes over 64 threads" is not exactly an interactive workload, especially if all the threads are writing to the same dataset. It may not depend on individual request latency as much as would one thread doing 16KB writes. |
@jwk404 OK. Could you please update or remove the date change in zfs.4 to resolve the conflict, and let's merge it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes openzfs#14378
Motivation and Context

For synchronous write workloads with large IO sizes, a pool configured with a slog performs worse than one with an embedded zil (see the benchmark tables above).

Description

The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`, if a zil block is greater than 768K, the priority of the write is downgraded from sync to async. Increasing the value allows greater throughput. To select a value for this PR, I ran an fio workload with the values for `zil_slog_bulk` tabulated above.

How Has This Been Tested?

At 64M, the results with a slog are improved to parity with an embedded zil (see the tables above). None of the other tests in the performance suite (run with a zil or slog) had a significant change, including the random_write_zil tests, which use multiple datasets.

Types of changes

Checklist:

All commit messages are properly formatted and contain `Signed-off-by`.