Refactor mmap(2) writepages (WIP) #2413
Conversation
@ryao I've refreshed this patch. This is something I'm fairly happy with, although I'm still not having much luck showing that it improves performance. The bottleneck for the workloads I've tested never appears to be the overhead introduced by the additional range lock calls. Although I have no doubt that under certain circumstances it could be, I'm just having trouble reproducing those circumstances. If you could review this and provide some feedback I'd appreciate it. I've queued this patch up for additional testing and it's working well. I've also done some manual torture testing to ensure no accidental deadlocks were introduced.
The idea here is to pull the zfs_range_lock() out of the writepage() function and into the writepages() function. In theory, this should reduce our overhead, improve performance, and simplify the code. In practice, the reworked code is still complicated and the performance may be worse.

Moving the zfs_range_lock() outside writepages makes ensuring the data-integrity semantics difficult. We cannot rely on the generic write_cache_pages() function in WB_SYNC_ALL mode because it can deadlock as follows.

--- Process 1 ---                --- Process 2 ---
zfs_range_lock                   sync_page
zfs_get_data                     wait_on_page_bit
zil_commit                       write_cache_pages
zfs_putpage                      zfs_putpage
zpl_writepages                   zpl_writepages

That means we need to implement our own logic, similar to that in write_cache_pages(), to ensure pages which are already in writeback and are redirtied do not get skipped. This patch accomplishes that by tagging the pages with the TOWRITE tag, but this may offset any performance gains from the refactoring. It needs to be carefully profiled and benchmarked under relevant workloads.

This is still a work in progress.

Original-patch-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
I have run the following commands on a few configurations:
This appears to slightly harm performance. Paradoxically, the unpatched ZoL 0.6.3 is exhibiting performance levels that I had observed when testing my original patch against HEAD a few months ago:
@behlendorf I think it would be a good idea to shelve this until an actual improvement is realized. I do not know why I was seeing abnormal unpatched performance a few months ago, but it appears that something was merged to HEAD that fixed it. Whatever it was, it eliminated the benefit of this change.
Good thought. Let me look more closely at that case. Although, even if there is a problem in that situation, we'd suffer from the same issue with the current code.
Maybe I'd need to look carefully at the logic and see if it's something we can depend on. Remember, this code may skip pages and the range may be sparse. Also, the second zil_commit() is likely a no-op and only applies to the data-integrity case, so the cost of doing the safe thing here is small.
Let's not shelve this just yet. I think the refactored code is a real improvement. We just need to take the care required to benchmark and validate that improvement.
After removing #2250 and replacing it with this patch, something rather odd happened: scrubs went from starting at 3 MB/s and ramping up to ~40-50 MB/s (taking many hours to complete) to starting and running at 200 MB/s on a 1 TB SSD.
@sempervictus This patch affects the POSIX API. It should have no effect on scrub performance, which is in a different component. |
@ryao thanks, something screwy is going on here: these specific SSDs have been very unhappy with ZFS, showing sub-1 MB/s random write performance on tiotest and other unpleasant behaviors. Digging through the changes made over the last few days, and testing on 3.14 with the version of ZFS I'd built for that kernel, does show the performance drop... will need to eliminate factors and figure out what helped.
While on the surface making this change sounds like a good idea, the performance testing indicates otherwise. This patch does not significantly improve performance for any of the tested workloads. Initially I was surprised by this result, but after some reflection and profiling I can explain what's going on.

Fundamentally, this change pulls the zfs range lock up a level, allowing us to take one lock while we iterate over all the pages covered by the lock. This reduces the overhead considerably compared to the existing implementation, which must acquire and drop the range lock for every page. This reduction in overhead was supposed to translate into improved performance.

However, in practice any performance gains appear to be more than offset by reduced concurrency. With a single large lock covering the entire range, it's far more likely that concurrent callers will need to block waiting to acquire that lock. When the locking was per-page it was uncommon to see any lock contention, and both callers would be allowed to proceed concurrently.

Ironically, while one of the goals of this patch was to get ZoL in sync with Illumos by changing the ZoL code, the testing implies that instead Illumos should adjust their code so that the range locking is done on a per-page basis. This should improve their performance as well.

zfs-0.6.3 stock (results in MB/s)
zfs-0.6.3 + 2413 writepage patch (results in MB/s)
fio test script
zpool configuration
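To make the locking-granularity tradeoff described above concrete, here is a minimal sketch, not the actual patch: the zfs_range_lock()/zfs_range_unlock() calls follow the ZoL 0.6.x signatures, while write_one_page_locked() is a hypothetical helper standing in for the real page writeback path.

--- Locking sketch (illustrative only) ---
/*
 * Per-page locking (existing code): every page write acquires and
 * drops its own range lock, so concurrent writers to different
 * offsets of the file rarely contend.
 */
for (pgoff_t i = start; i <= end; i++) {
        rl_t *rl = zfs_range_lock(zp, (uint64_t)i << PAGE_SHIFT,
            PAGE_SIZE, RL_WRITER);
        write_one_page_locked(zp, i);   /* hypothetical helper */
        zfs_range_unlock(rl);
}

/*
 * Per-range locking (this patch): one lock covers every page in
 * the writepages() call.  Lock overhead drops, but any concurrent
 * caller touching the same file range now blocks for the whole
 * iteration.
 */
rl_t *rl = zfs_range_lock(zp, (uint64_t)start << PAGE_SHIFT,
    (uint64_t)(end - start + 1) << PAGE_SHIFT, RL_WRITER);
for (pgoff_t i = start; i <= end; i++)
        write_one_page_locked(zp, i);   /* hypothetical helper */
zfs_range_unlock(rl);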
@behlendorf I took a fresh look at these figures today. Random IO did benefit, which is what I had expected. Unexpectedly, sequential IO suffered, but I have an idea of why. Our default recordsize is 128KB and we call dmu_write() on each page. On sequential IO, the first operation a thread performs in a record will be the following:
Subsequent operations will be:
In the current code, we have multiple threads enter the first case and block. When the read is done, they continue with their work, and consequently the second case is entered fewer times and each thread does only a portion of the work. When we change things to take the range lock only once, a single thread does everything while the others block on the rangelock. The single thread copying pages ends up looping 31 times through the second case. My first thought is that we would be copying all 128KB each time.

Illumos is rather different. On the first page in a record, it takes the rangelock and writes the whole record at once, so the lock is held for work that is batched rather than repeated. Switching to a single rangelock per writepages() call does not give us that batching; it only serializes the per-page work.

There might be a way to get the best of both worlds. Since Linux gives us the hooks we need to manage pages ourselves, we could make the leaves of the radix tree records rather than pages. This would look like a hybrid between a radix tree and a b-tree. If we treat the records themselves as being in writeback, we should be able to decrease overhead in a way that allows writeback to still be concurrent.
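As a sanity check on the accounting above, a tiny userspace C program (purely illustrative) showing where the 31 comes from:

--- Record arithmetic (illustrative only) ---
#include <stdio.h>

int main(void)
{
        const unsigned recordsize = 128 * 1024; /* ZFS default recordsize */
        const unsigned pagesize = 4 * 1024;     /* typical x86 page size */
        unsigned pages = recordsize / pagesize; /* 32 pages per record */

        /* With one rangelock per writepages() call, a single thread
         * services the whole record: the first page pays the
         * read-modify-write (the first case above) and the remaining
         * pages each take the copy path (the second case). */
        printf("pages per record:       %u\n", pages);
        printf("second-case iterations: %u\n", pages - 1);
        return 0;
}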
To summarize my remark, a significant strength of Illumos' approach is that it takes the rangelock only once per record, pairing the coarser lock with a batched whole-record write.
The idea here is to pull the zfs_range_lock() out of the writepage()
function and into the writepages() function. In theory, this
should reduce our overhead, improve performance, and simplify
the code. In practice, the reworked code is still complicated
and the performance may be worse.
Moving the zfs_range_lock() outside writepages makes ensuring the
data-integrity semantics difficult. We cannot rely on the generic
write_cache_pages() function in WB_SYNC_ALL mode because it can
deadlock as follows.
--- Process 1 ---                --- Process 2 ---
zfs_range_lock                   sync_page
zfs_get_data                     wait_on_page_bit
zil_commit                       write_cache_pages
zfs_putpage                      zfs_putpage
zpl_writepages                   zpl_writepages
That means we need to implement our own logic, similar to that
in write_cache_pages(), to ensure pages which are already
in writeback and are redirtied do not get skipped. This patch
accomplishes that by tagging the pages with the TOWRITE tag, but
the convergence logic is overly broad. If other processes are
calling msync() over the same file range, the pages may be written
more often than needed. That said, it is semantically correct.
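For illustration, here is a minimal sketch of the tagging scheme described above; it is not the actual patch. It assumes the mainline tag_pages_for_writeback() and pagevec helpers (Linux 2.6.36+) plus ZoL's internal zfs_range_lock()/zpl_putpage() interfaces; zpl_writepages_sync() is a hypothetical name.

--- TOWRITE tagging sketch (illustrative only) ---
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/writeback.h>

/* Data-integrity writeback without write_cache_pages(): tag the
 * dirty pages up front, take the range lock once, then visit only
 * the tagged pages so nothing redirtied during the walk is skipped. */
static int
zpl_writepages_sync(struct address_space *mapping,
    struct writeback_control *wbc)
{
        znode_t *zp = ITOZ(mapping->host);
        pgoff_t start = wbc->range_start >> PAGE_SHIFT;
        pgoff_t end = wbc->range_end >> PAGE_SHIFT;
        struct pagevec pvec;
        rl_t *rl;
        int i;

        /* Convert the DIRTY tags to TOWRITE so concurrent msync()
         * callers over the same range each see a stable set of pages. */
        tag_pages_for_writeback(mapping, start, end);

        /* One range lock for the entire writepages() call. */
        rl = zfs_range_lock(zp, (uint64_t)start << PAGE_SHIFT,
            (uint64_t)(end - start + 1) << PAGE_SHIFT, RL_WRITER);

        pagevec_init(&pvec, 0);
        while (pagevec_lookup_tag(&pvec, mapping, &start,
            PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE)) {
                for (i = 0; i < pagevec_count(&pvec); i++) {
                        struct page *pp = pvec.pages[i];

                        lock_page(pp);
                        /* zpl_putpage() consumes the page lock, as
                         * with the write_cache_pages() callback.  A
                         * complete version would also stop at 'end'
                         * and recheck the page state under the lock. */
                        zpl_putpage(pp, wbc, mapping);
                }
                pagevec_release(&pvec);
        }

        zfs_range_unlock(rl);
        return (0);
}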
This needs to be carefully profiled and benchmarked under mmap(2)
workloads to determine if it's going to work well. I've been
using an fio workload which does small 4k IOs from 32 threads.
To make the workload more realistic the sync_file_range option
was added to cause msync() to be called every 32 writes.
--- FIO workload ---
[global]
bs=4k
ioengine=mmap
iodepth=1
size=1g
direct=0
runtime=60
directory=/tank/fio
filename=mmap.test.file
numjobs=32
sync_file_range=write:32
[seq-read]
rw=read
stonewall
[rand-read]
rw=randread
stonewall
[seq-write]
rw=write
stonewall
[rand-write]
rw=randwrite
stonewall
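For reference, assuming the job file above is saved as, say, mmap.fio (the name is arbitrary), it runs unmodified with a stock fio as "fio mmap.fio". Each stonewall directive makes a phase wait for the previous one to finish, so the four access patterns are measured independently.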
Original-patch-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>