
Refactor mmap(2) writepages (WIP) #2413

Closed
wants to merge 1 commit

Conversation

behlendorf
Contributor

The idea here is to pull the zfs_range_lock() out of the writepage()
function and into the writepages() function. In theory, this
should reduce our overhead, improve performance, and simplify
the code. In practice, the reworked code is still complicated
and the performance may be worse.

Moving the zfs_range_lock() out of writepage() makes ensuring the
data-integrity semantics difficult. We cannot rely on the generic
write_cache_pages() function in WB_SYNC_ALL mode because it can
deadlock as follows.

--- Process 1 ---    --- Process 2 ---
zfs_range_lock       sync_page
zfs_get_data         wait_on_page_bit
zil_commit           write_cache_pages
zfs_putpage          zfs_putpage
zpl_writepages       zpl_writepages

That means we need to implement our own logic which is similar
to that in write_cache_pages() to ensure pages which are already
in writeback and are redirtied do not get skipped. This patch
accomplishes that by tagging the pages with the TOWRITE tag but
the convergence logic is overly broad. If other processes are
calling msync() over the same file range, the pages may be written
more often than needed. That said, it is semantically correct.
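To make this concrete, below is a rough C sketch of the intended
ordering (not the actual patch): take the range lock once in
->writepages, tag the currently dirty pages, walk only the tagged
pages, and finish with a single zil_commit(). It assumes the
0.6.x-era in-module interfaces; zpl_writepages_sketch() and
zpl_putpage_towrite_range() are hypothetical names, the latter
standing in for the write_cache_pages()-like walk the patch
implements.

--- Illustrative sketch (not the actual patch) ---

static int
zpl_writepages_sketch(struct address_space *mapping,
    struct writeback_control *wbc)
{
        struct inode *ip = mapping->host;
        znode_t *zp = ITOZ(ip);
        zfs_sb_t *zsb = ITOZSB(ip);
        rl_t *rl;
        int error;

        /* One range lock for the entire writeback pass. */
        rl = zfs_range_lock(zp, 0, UINT64_MAX, RL_WRITER);

        /*
         * Tag the pages which are dirty right now.  Pages redirtied
         * while we hold the range lock keep their TOWRITE tag, so the
         * walk below will not skip them and WB_SYNC_ALL data-integrity
         * semantics are preserved.
         */
        tag_pages_for_writeback(mapping, 0, (pgoff_t)-1);

        /*
         * Hypothetical helper: write out every TOWRITE-tagged page via
         * zfs_putpage() without retaking the range lock per page.
         */
        error = zpl_putpage_towrite_range(mapping, wbc);

        zfs_range_unlock(rl);

        /* A single zil_commit() covers the pass in WB_SYNC_ALL mode. */
        if (error == 0 && wbc->sync_mode == WB_SYNC_ALL)
                zil_commit(zsb->z_log, zp->z_id);

        return (error);
}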

This needs to be carefully profiled and benchmarked under mmap(2)
workloads to determine if it's going to work well. I've been
using an fio workload which does small 4k IOs from 32 threads.
To make the workload more realistic the sync_file_range option
was added to cause msync() to be called every 32 writes.

--- FIO workload ---
[global]
bs=4k
ioengine=mmap
iodepth=1
size=1g
direct=0
runtime=60
directory=/tank/fio
filename=mmap.test.file
numjobs=32
sync_file_range=write:32

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

Original-patch-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

@behlendorf
Contributor Author

@ryao I've refreshed this patch. This is something I'm fairly happy with, although I'm still not having much luck showing that it improves performance. The bottleneck for the workloads I've tested never appears to be the overhead introduced by the additional range lock calls. Although I have no doubt that under certain circumstances it could be, I'm just having trouble reproducing those circumstances.

If you could review this and provide some feedback I'd appreciate it. I've queued this patch up for additional testing and it's working well. I've also done some manual torture testing to ensure no accidental deadlocks were introduced.

The idea here is to pull the zfs_range_lock() out of the writepage()
function and into the writepages() function.  In theory, this
should reduce our overhead, improve performance, and simplify
the code.  In practice, the reworked code is still complicated
and the performance may be worse.

Moving the zfs_range_lock() out of writepage() makes ensuring the
data-integrity semantics difficult.  We cannot rely on the generic
write_cache_pages() function in WB_SYNC_ALL mode because it can
deadlock as follows.

--- Process 1 ---    --- Process 2 ---
zfs_range_lock       sync_page
zfs_get_data         wait_on_page_bit
zil_commit           write_cache_pages
zfs_putpage          zfs_putpage
zpl_writepages       zpl_writepages

That means we need to implement our own logic which is similar
to that in write_cache_pages() to ensure pages which are already
in writeback and are redirtied do not get skipped.  This patch
accomplishes that by tagging the pages with the TOWRITE tag, but
this may offset any performance gains from the refactoring.  It
needs to be carefully profiled and benchmarked under relevant
workloads.  This is still a work in progress.

Original-patch-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf modified the milestones: 0.6.4, 0.7.0 Jun 23, 2014
@ryao
Contributor

ryao commented Jun 24, 2014

I have run the following commands on a few configurations:

modprobe brd rd_size=8388608
zpool create -m /tmp/test -o cachefile=/tmp/zpool.cache -O recordsize=4k test /dev/ram0
echo 0 > /proc/sys/kernel/randomize_va_space
filebench << END
load randomwrite
set \$dir=/test
set \$nthreads=32
set \$iosize=4096
run 60
END

zpool destroy test
modprobe -r brd

modprobe brd rd_size=8388608
zpool create -m /tmp/test -o cachefile=/tmp/zpool.cache -O compression=lz4 test /dev/ram0
filebench << END
load randomwrite
set \$dir=/test
set \$nthreads=32
set \$iosize=4096
run 60
END

zpool destroy test
modprobe -r brd

modprobe brd rd_size=8388608
zpool create -m /tmp/test -o cachefile=/tmp/zpool.cache test /dev/ram0
filebench << END
load randomwrite
set \$dir=/test
set \$nthreads=32
set \$iosize=4096
run 60
END

This appears to slightly harm performance, and paradoxically, the unpatched ZoL 0.6.3 is exhibiting the performance levels that I had observed when testing my original patch against HEAD a few months ago:

Linux 3.14.2 with Brian's patch and recordsize=4k:

 6727: 61.025: Run took 60 seconds...
 6727: 61.025: Per-Operation Breakdown
rand-write1          964141ops    16068ops/s  62.8mb/s      0.1ms/op     4571us/op-cpu [0ms - 4ms]
 6727: 61.025: IO Summary: 964141 ops, 16067.888 ops/s, (0/16068 r/w),  62.8mb/s,    390us cpu/op,   0.1ms latency
 6727: 61.025: Shutting down processes

Linux 3.14.2 with Brian's patch and compression=lz4:

 6875: 61.028: Run took 60 seconds...
 6875: 61.029: Per-Operation Breakdown
rand-write1          963114ops    16051ops/s  62.7mb/s      0.1ms/op     4573us/op-cpu [0ms - 4ms]
 6875: 61.029: IO Summary: 963114 ops, 16050.746 ops/s, (0/16051 r/w),  62.7mb/s,    392us cpu/op,   0.1ms latency
 6875: 61.029: Shutting down processes

Linux 3.14.2 with Brian's patch and default settings:

 7174: 61.025: Per-Operation Breakdown
rand-write1          959603ops    15992ops/s  62.5mb/s      0.1ms/op     4581us/op-cpu [0ms - 4ms]
 7174: 61.025: IO Summary: 959603 ops, 15992.265 ops/s, (0/15992 r/w),  62.5mb/s,    392us cpu/op,   0.1ms latency
 7174: 61.025: Shutting down processes

Linux 3.14.2 with 0.6.3 and recordsize=4k:

 6124: 61.026: Run took 60 seconds...
 6124: 61.026: Per-Operation Breakdown
rand-write1          978864ops    16313ops/s  63.7mb/s      0.1ms/op     4555us/op-cpu [0ms - 4ms]
 6124: 61.026: IO Summary: 978864 ops, 16313.189 ops/s, (0/16313 r/w),  63.7mb/s,    383us cpu/op,   0.1ms latency
 6124: 61.026: Shutting down processes

Linux 3.14.2 with 0.6.3 and compression=lz4:

 6272: 61.026: Run took 60 seconds...
 6272: 61.026: Per-Operation Breakdown
rand-write1          962920ops    16048ops/s  62.7mb/s      0.1ms/op     4574us/op-cpu [0ms - 4ms]
 6272: 61.026: IO Summary: 962920 ops, 16047.539 ops/s, (0/16048 r/w),  62.7mb/s,    393us cpu/op,   0.1ms latency
 6272: 61.026: Shutting down processes

Linux 3.14.2 with 0.6.3 and default settings:

  6419: 61.028: Run took 60 seconds...
 6419: 61.028: Per-Operation Breakdown
rand-write1          967937ops    16131ops/s  63.0mb/s      0.1ms/op     4545us/op-cpu [0ms - 4ms]
 6419: 61.028: IO Summary: 967937 ops, 16131.162 ops/s, (0/16131 r/w),  63.0mb/s,    388us cpu/op,   0.1ms latency
 6419: 61.028: Shutting down processes

Linux 2.6.32-431.11.2.el6.x86_64 with Brian's patch and recordsize=4k:

 6187: 61.029: Run took 60 seconds...
 6187: 61.029: Per-Operation Breakdown
rand-write1          971822ops    16196ops/s  63.3mb/s      0.1ms/op     4549us/op-cpu [0ms - 2ms]
 6187: 61.029: IO Summary: 971822 ops, 16195.872 ops/s, (0/16196 r/w),  63.3mb/s,    385us cpu/op,   0.1ms latency
 6187: 61.029: Shutting down processes

Linux 2.6.32-431.11.2.el6.x86_64 with Brian's patch and compression=lz4:

 6350: 61.028: Run took 60 seconds...
 6350: 61.028: Per-Operation Breakdown
rand-write1          977579ops    16292ops/s  63.6mb/s      0.1ms/op     4551us/op-cpu [0ms - 7ms]
 6350: 61.028: IO Summary: 977579 ops, 16291.845 ops/s, (0/16292 r/w),  63.6mb/s,    382us cpu/op,   0.1ms latency
 6350: 61.028: Shutting down processes

Linux 2.6.32-431.11.2.el6.x86_64 with Brian's patch and default settings:

 6498: 61.025: Run took 60 seconds...
 6498: 61.026: Per-Operation Breakdown
rand-write1          959725ops    15994ops/s  62.5mb/s      0.1ms/op     4565us/op-cpu [0ms - 3ms]
 6498: 61.026: IO Summary: 959725 ops, 15994.278 ops/s, (0/15994 r/w),  62.5mb/s,    392us cpu/op,   0.1ms latency
 6498: 61.026: Shutting down processes

Linux 2.6.32-431.11.2.el6.x86_64 with 0.6.3 and recordsize=4k:

 5232: 61.031: Run took 60 seconds...
 5232: 61.031: Per-Operation Breakdown
rand-write1          1088007ops    18132ops/s  70.8mb/s      0.1ms/op     3390us/op-cpu [0ms - 5ms]
 5232: 61.031: IO Summary: 1088007 ops, 18132.268 ops/s, (0/18132 r/w),  70.8mb/s,    300us cpu/op,   0.1ms latency
 5232: 61.031: Shutting down processes

Linux 2.6.32-431.11.2.el6.x86_64 with 0.6.3 and compression=lz4:

 5450: 61.032: Run took 60 seconds...
 5450: 61.033: Per-Operation Breakdown
rand-write1          1090814ops    18179ops/s  71.0mb/s      0.1ms/op     3382us/op-cpu [0ms - 11ms]
 5450: 61.033: IO Summary: 1090814 ops, 18178.800 ops/s, (0/18179 r/w),  71.0mb/s,    298us cpu/op,   0.1ms latency
 5450: 61.033: Shutting down processes

Linux 2.6.32-431.11.2.el6.x86_64 with 0.6.3 and default settings:

 5722: 61.029: Run took 60 seconds...
 5722: 61.029: Per-Operation Breakdown
rand-write1          1086150ops    18101ops/s  70.7mb/s      0.1ms/op     3387us/op-cpu [0ms - 3ms]
 5722: 61.029: IO Summary: 1086150 ops, 18101.283 ops/s, (0/18101 r/w),  70.7mb/s,    299us cpu/op,   0.1ms latency
 5722: 61.030: Shutting down processes

@behlendorf I think it would be a good idea to shelve this until an actual improvement is realized. I do not know why I was seeing abnormal unpatched performance a few months ago, but it appears that something was merged to HEAD that fixed it. Whatever it was, it eliminated the benefit of this change.

@behlendorf
Contributor Author

I don't think this will work as you expect it to work on systems where we have more than 8TB of data mmap'ed. write_cache_pages()

Good thought. Let me look more closely at that case. Although, even if there is a problem in that situation, we'd suffer from the same issue with the current code.

dirty pages were written out to avoid the second zil_commit()

Maybe. I'd need to look carefully at the logic and see if it's something we can depend on. Remember this code may skip pages and the range may be sparse. Also, the second zil_commit() is likely a no-op and only applies to the data-integrity case, so the cost of doing the safe thing here is small.

I think it would be a good idea to shelve this until an actual improvement is realized

Let's not shelve this just yet. I think the refactored code is a real improvement. We just need to take the care required to benchmark and validate that improvement.

@sempervictus
Contributor

After removing 2250 and replacing it with this patch, something rather odd happened: scrubs went from starting @ 3MB/s and working up to ~40-50MB/s (taking many hours to complete), to starting and running at 200MB/s on a 1T SSD.
Unfortunately I can't say for sure that this patch was the cause - I went to 3.15 on this machine at the same time. I'll boot into 3.14 again to check, but if it is related, then that's a pretty significant win.

@ryao
Contributor

ryao commented Jun 25, 2014

@sempervictus This patch affects the POSIX API. It should have no effect on scrub performance, which is in a different component.

@sempervictus
Contributor

@ryao thanks, something screwy is going on here - these specific SSDs have been very unhappy with ZFS, showing sub-1MB/s random write performance on tiotest and other unpleasant behaviors. Digging through the changes made over the last few days, and testing on 3.14 with the version of ZFS I'd built for that kernel, does show the performance drop... I'll need to eliminate factors and figure out what helped.

@behlendorf
Contributor Author

While on the surface making this change sounds like a good idea, the performance testing indicates otherwise. This patch does not significantly improve performance for any of the tested mmap(2) workloads and in several cases has a severe negative impact. Therefore this change won't be merged.

Initially I was surprised by this result, but after some reflection and profiling I can explain what's going on. Fundamentally, this change pulls the zfs range lock up a level, allowing us to take one lock while we iterate over all the pages covered by the lock. This reduces the overhead considerably compared to the existing implementation, which must acquire and drop the range lock for every page. This reduction in overhead was supposed to translate into improved performance.

However, in practice any performance gains appear to be more than offset by reduced concurrency. With a single large lock covering the entire range, it's far more likely that concurrent callers will need to block waiting to acquire that lock. When the locking was per page it was uncommon to see any lock contention, and callers were able to proceed concurrently.
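For reference, the per-page pattern kept by the stock code, which the benchmarks favor, looks roughly like the sketch below. This is simplified from the 0.6.x-era zfs_putpage() with the page-writeback bookkeeping and most error handling omitted, and zfs_putpage_perpage_sketch() is a hypothetical name; the point is only that each page takes and drops its own small range lock, so concurrent callers rarely contend.

static int
zfs_putpage_perpage_sketch(struct inode *ip, struct page *pp,
    struct writeback_control *wbc)
{
        znode_t *zp = ITOZ(ip);
        zfs_sb_t *zsb = ITOZSB(ip);
        loff_t off = page_offset(pp);
        size_t len = PAGE_CACHE_SIZE;
        dmu_tx_t *tx;
        rl_t *rl;
        void *va;

        /* Lock only the region backing this one page. */
        rl = zfs_range_lock(zp, off, len, RL_WRITER);

        tx = dmu_tx_create(zsb->z_os);
        dmu_tx_hold_write(tx, zp->z_id, off, len);
        dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
        if (dmu_tx_assign(tx, TXG_NOWAIT) != 0) {
                /* The real code redirties the page and retries; simplified. */
                dmu_tx_abort(tx);
                zfs_range_unlock(rl);
                return (-EAGAIN);
        }

        va = kmap(pp);
        dmu_write(zsb->z_os, zp->z_id, off, len, va, tx);
        kunmap(pp);

        dmu_tx_commit(tx);
        zfs_range_unlock(rl);   /* Lock held only for this single page. */

        /* Data-integrity callers additionally zil_commit() per page. */
        if (wbc->sync_mode == WB_SYNC_ALL)
                zil_commit(zsb->z_log, zp->z_id);

        return (0);
}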

Ironically, while one of the goals of this patch was to get ZoL in sync with Illumos by changing the ZoL code, the testing implies that instead Illumos should adjust its code so that the range locking is done on a per-page basis. This should improve their mmap(2) performance.

zfs-0.6.3 stock (results in MB/s)

N-Thread  Workload   1M IO, msync=32  4K IO, msync=32  1M IO, msync=0  4K IO, msync=0
32        Seq Write              323               31            1422            1532
16        Seq Write              246               31            1599             928
8         Seq Write              137               31             960             488
1         Seq Write               49               16             119              61
32        Rnd Write               94                4              71              48
16        Rnd Write               68                5              62              38
8         Rnd Write               53                4              58              41
1         Rnd Write               42                7              27              40

zfs-0.6.3 + 2413 writepage patch (results in MB/s)

N-Thread  Workload   1M IO, msync=32  4K IO, msync=32  1M IO, msync=0  4K IO, msync=0
32        Seq Write              215               26            1101             877
16        Seq Write              163               26             949             834
8         Seq Write              148               28             291             321
1         Seq Write               46               15              49              57
32        Rnd Write               94                8              57              45
16        Rnd Write               85                8              57              45
8         Rnd Write               70                8              54              45
1         Rnd Write               45                8              39              45

fio test script

[global]
#bs=4k
bs=1m
ioengine=mmap
iodepth=1
size=512m
direct=0
runtime=60
directory=/tank/fio
filename=mmap.test.file
numjobs=8
sync_file_range=write:32

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

zpool configuration

  pool: tank
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        vdb1    ONLINE       0     0     0
        vdc1    ONLINE       0     0     0
    logs
      vdd1      ONLINE       0     0     0

@behlendorf behlendorf closed this Jun 25, 2014
@ryao
Contributor

ryao commented Jul 29, 2014

@behlendorf I took a fresh look at these figures today. Random IO did benefit, which is what I had expected. Unexpectedly, sequential IO suffered, but I have an idea of why. Our default recordsize is 128KB and we call dmu_write() on each page. On sequential IO, the first operation a thread performs in a record will be the following:

dbuf_read
dmu_buf_hold_array_by_dnode
dmu_buf_hold_array
dmu_write
zfs_putpage

Subsequent operations will be:

dnode_hold_impl
dnode_hold
dmu_buf_hold_array_by_dnode
dmu_buf_hold_array
dmu_write
zfs_putpage

In the current code, we have multiple threads enter the first case and block. When the read is done, they continue with their work and consequently, the second case is entered fewer times and each thread does only a portion of the work. When we change things to take the range lock only once, a single thread does everything while others block on the rangelock. The single thread copying pages ends up looping 31 times through the second case. My first thought is that we would be copying all 128KB each time because of dmu_buf_will_dirty(), but the code is clever enough to reuse buffers whenever possible, so that doesn't happen.

Illumos is rather different. On the first page in a record, it will call zfs_putapage(), which is similar to our present zfs_putpage(), but its VFS provides pvn_write_kluster(), which allows it to seek forward to find additional dirty pages up to the full blocksize. This works because Illumos' struct page has an internal linked list and pvn_write_kluster() will append pages for writeback to the tail of that list. The implication is that if all pages in the record are dirty, then Illumos is able to avoid the read entirely and do a single write. In either case, zfs_putapage() then calls dmu_write_pages(), which takes the new length and continues copying beyond the boundary of the page by iterating through the linked list.
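To make the contrast concrete, here is a rough reconstruction of the Illumos path just described. It is simplified from memory, not verbatim Illumos code: EOF clamping, error handling, and the log/commit logic are omitted, zfs_putapage_sketch() is a hypothetical name, and the caller (the VOP_PUTPAGE handler) is assumed to have already taken the range lock once for the whole pushed range.

static int
zfs_putapage_sketch(vnode_t *vp, page_t *pp, u_offset_t *offp,
    size_t *lenp, int flags, cred_t *cr)
{
        znode_t *zp = VTOZ(vp);
        zfsvfs_t *zfsvfs = zp->z_zfsvfs;
        u_offset_t off = pp->p_offset;
        size_t len = PAGESIZE;
        dmu_tx_t *tx;

        /*
         * Kluster: when the block size is larger than a page, gather the
         * other dirty pages of the same record (linked through page_t) so
         * a full block can be overwritten without a read-modify-write.
         */
        if (zp->z_blksz > PAGESIZE) {
                size_t klen = P2ROUNDUP((ulong_t)zp->z_blksz, PAGESIZE);
                u_offset_t koff = P2ALIGN(off, (u_offset_t)klen);

                pp = pvn_write_kluster(vp, pp, &off, &len, koff, klen, flags);
        }

        tx = dmu_tx_create(zfsvfs->z_os);
        dmu_tx_hold_write(tx, zp->z_id, off, len);
        dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
        VERIFY0(dmu_tx_assign(tx, TXG_WAIT));   /* simplified */

        /* One copy for the whole kluster, iterating the page list. */
        (void) dmu_write_pages(zfsvfs->z_os, zp->z_id, off, len, pp, tx);

        dmu_tx_commit(tx);
        pvn_write_done(pp, B_WRITE | flags);

        *offp = off;
        *lenp = len;
        return (0);
}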

Switching to a single rangelock per ->writepages did improve random IO with msync=32 and helped 4K random IO with msync=0. I think the penalty in sequential IO is the result of a combination of the large number of iterations that the single thread must now make and the overhead from calling dmu_write() for each page. If we had klustering, the performance on sequential IO would have been better, although how much better is hard to say. A large improvement would result from not having to do the read, while a smaller improvement would result from having tighter loops.

There might be a way to get the best of both worlds. Since Linux gives us the hooks we need to manage pages ourselves, we could make the leaves of the radix tree be records rather than pages. This would look like a hybrid between a radix tree and a b-tree. If we treat the records themselves as being in writeback, we should be able to decrease overhead in a way that allows writeback to still be concurrent.

@ryao
Contributor

ryao commented Jul 29, 2014

To summarize my remarks: a significant strength of Illumos' approach of taking the rangelock only once per VOP_PUTPAGE is that it can avoid the read entirely by enabling klustering to safely add more pages. When I wrote this patch, I had viewed klustering as something to add after this was merged. I see now that it should be added together with this.

@behlendorf behlendorf mentioned this pull request Dec 19, 2014
@behlendorf behlendorf deleted the writepages branch February 16, 2017 00:55
Labels
Type: Performance Performance improvement or performance problem