[WIP] Pipelines - Reduce write lock contention #35267
Conversation
Doesn't this mean that a segment which might otherwise have been available for the continuation is now much less likely to be? Is that not a problem at scale?
Pool size is 16 and segments are 4096 bytes, so this would only affect runs of many small writes without a Flush that total more than about 60 kB (15 segments * 4096 bytes), or more if the ArrayPool is being used. If you are doing single-shot large writes, it would take 16 writes without a Flush (say 1 MB x 16 = 16 MB) to hit this. On the flip side, holding the lock (which is shared by both reader and writer) and delaying the signalling of the continuation is more of an issue; e.g. the following runs for every block returned:
; Assembly listing for method BufferSegment:ResetMemory():this
; Lcl frame size = 40
G_M401_IG01:
57 push rdi
56 push rsi
4883EC28 sub rsp, 40
488BF1 mov rsi, rcx
G_M404_IG02:
488B5628 mov rdx, gword ptr [rsi+40]
48B9B8D075C9FA7F0000 mov rcx, 0x7FFAC975D0B8
E8546A615F call CORINFO_HELP_ISINSTANCEOFINTERFACE
4885C0 test rax, rax
7417 je SHORT G_M404_IG03
488BC8 mov rcx, rax
49BB081536C9FA7F0000 mov r11, 0x7FFAC9361508
3909 cmp dword ptr [rcx], ecx
FF1552B89AFF call [IDisposable:Dispose():this]
EB49 jmp SHORT G_M404_IG04
G_M404_IG03:
488B5628 mov rdx, gword ptr [rsi+40]
48B9020D50C9FA7F0000 mov rcx, 0x7FFAC9500D02
E845274F5F call CORINFO_HELP_ISINSTANCEOFARRAY
488BF8 mov rdi, rax
4885FF test rdi, rdi
742E je SHORT G_M404_IG04
48B9003F45C9FA7F0000 mov rcx, 0x7FFAC9453F00
BA77000000 mov edx, 119
E8B96C5F5F call CORINFO_HELP_CLASSINIT_SHARED_DYNAMICCLASS
48B9E02E951095010000 mov rcx, 0x19510952EE0
488B09 mov rcx, gword ptr [rcx]
488BD7 mov rdx, rdi
4533C0 xor r8d, r8d
3909 cmp dword ptr [rcx], ecx
E89795FFFF call TlsOverPerCoreLockedStacksArrayPool`1:Return(ref,bool):this
G_M405_IG04:
33C0 xor rax, rax
48894608 mov gword ptr [rsi+8], rax
33C0 xor rax, rax
48894610 mov qword ptr [rsi+16], rax
33C0 xor rax, rax
33D2 xor edx, edx
488D4E18 lea rcx, bword ptr [rsi+24]
488901 mov gword ptr [rcx], rax
895108 mov dword ptr [rcx+8], edx
89510C mov dword ptr [rcx+12], edx
33C0 xor rax, rax
48894628 mov gword ptr [rsi+40], rax
48894630 mov gword ptr [rsi+48], rax
33C0 xor eax, eax
894638 mov dword ptr [rsi+56], eax
33C0 xor rax, rax
33D2 xor edx, edx
4883C640 add rsi, 64
488906 mov gword ptr [rsi], rax
895608 mov dword ptr [rsi+8], edx
89560C mov dword ptr [rsi+12], edx
G_M407_IG05:
4883C428 add rsp, 40
5E pop rsi
5F pop rdi
C3 ret
; Total bytes of code 197, prolog size 6 for method BufferSegment:ResetMemory():this
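For reference, the listing above corresponds roughly to the following work (a hedged C# sketch; the class and field names here are illustrative, not the exact Pipelines source):

```csharp
using System;
using System.Buffers;

internal sealed class BufferSegmentSketch
{
    private object _memoryOwner;            // either an IMemoryOwner<byte> or a pooled byte[]
    private BufferSegmentSketch _next;
    private Memory<byte> _memory;
    private Memory<byte> _availableMemory;
    private int _end;

    public void ResetMemory()
    {
        // Return the backing memory to wherever it came from...
        if (_memoryOwner is IMemoryOwner<byte> owner)
        {
            owner.Dispose();                      // the isinst IDisposable -> Dispose path in the listing
        }
        else if (_memoryOwner is byte[] array)
        {
            ArrayPool<byte>.Shared.Return(array); // the TlsOverPerCoreLockedStacksArrayPool`1:Return path
        }

        // ...then blank every field so the segment can be pooled and reused
        // (the run of xor/mov instructions zeroing ~80 bytes of the object).
        _memoryOwner = null;
        _next = null;
        _memory = default;
        _availableMemory = default;
        _end = 0;
    }
}
```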
c175b05 to e778d5b
Is there a way to make the segment pool non-locking? It doesn't have to be strict; losing some segments to the GC is acceptable.
Before we merge this one, we need to see the impact on our benchmarks.
Easiest way is to use the ConcurrentQueueSegment, though it's a bit chunky. It's single reader, single writer though, so we might be able to do something simpler; SingleProducerConsumerQueue looks like the ticket?
Made it lock-free and interleaved the Resets with the Returns.
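As a rough illustration of the single-producer/single-consumer structure being discussed, here is a minimal sketch of a fixed-capacity SPSC ring used as a segment pool (this is not the corefx SingleProducerConsumerQueue; the type and member names are made up for illustration):

```csharp
using System.Threading;

// Fixed-capacity single-producer/single-consumer ring usable as a segment pool.
// Only one thread may call TryRent (the writer path) and one thread TryReturn (the reader path).
internal sealed class SpscSegmentPool<T> where T : class
{
    private readonly T[] _slots;
    private int _head; // next slot to rent from (consumer index)
    private int _tail; // next slot to return into (producer index)

    public SpscSegmentPool(int capacity) => _slots = new T[capacity];

    public bool TryRent(out T item)
    {
        int head = Volatile.Read(ref _head);
        if (head == Volatile.Read(ref _tail))
        {
            item = null;   // pool empty; the caller allocates a fresh segment instead
            return false;
        }
        item = _slots[head % _slots.Length];
        _slots[head % _slots.Length] = null;
        Volatile.Write(ref _head, head + 1); // publish after the slot has been cleared
        return true;
    }

    public bool TryReturn(T item)
    {
        int tail = Volatile.Read(ref _tail);
        if (tail - Volatile.Read(ref _head) == _slots.Length)
        {
            return false;  // pool full; let the segment fall to the GC
        }
        _slots[tail % _slots.Length] = item;
        Volatile.Write(ref _tail, tail + 1); // publish after the slot has been written
        return true;
    }

    // Note: indices grow monotonically; a production version would guard against int overflow.
}
```

Losing a segment when the pool is full simply means it gets collected, which matches the "doesn't have to be strict" requirement above.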
Updated summary
@stephentoub does the conversion of
src/System.IO.Pipelines/src/System/IO/Pipelines/SingleProducerSingleConsumerPool.cs
Removed the lock-free segment pool and reduced the lock in the write path further, following discussion in #35484.
Summary change
Fixed race between setting writing active and
*Note: the race condition was in this PR, not in the current version.
Using a stack array for pooling rather than a heap array; this adds a chunk of zeroing, but it's now done after completion and outside the lock, so it shouldn't hit perf.
Uses a
This reverts commit a3cb00f.
Reverted the ValueArray as it didn't help perf and had the issues noted above.
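For context, the reverted "ValueArray" idea amounts to storing the pool slots as fields of a struct embedded in the Pipe rather than in a separate heap-allocated array, roughly along these lines (a hedged, hypothetical sketch, not the reverted code itself):

```csharp
// Hypothetical "value array": pool slots stored as fields of a struct embedded in the
// Pipe object, avoiding a separate heap array allocation per pipe (at the cost of
// zeroing the whole struct when the pool is reset).
internal struct ValueArray4<T> where T : class
{
    public const int Length = 4; // illustrative; the real segment pool size is 16

    private T _item0, _item1, _item2, _item3;

    public T this[int index]
    {
        get
        {
            switch (index)
            {
                case 0: return _item0;
                case 1: return _item1;
                case 2: return _item2;
                default: return _item3;
            }
        }
        set
        {
            switch (index)
            {
                case 0: _item0 = value; break;
                case 1: _item1 = value; break;
                case 2: _item2 = value; break;
                default: _item3 = value; break;
            }
        }
    }
}
```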
Debian.8.Amd64.Open-x64-Release failures
@@ -97,6 +98,7 @@ public Pipe(PipeOptions options)
        }

        _bufferSegmentPool = new BufferSegment[SegmentPoolSize];
+       _bufferSegmentsToReturn = new BufferSegment[SegmentPoolSize];
Ugh, larger pipe and more allocations per pipe.
Any benchmark for the last iteration?
Yea... that's why I added WIP 😉
Whoops, sorry.
There's a mild regression in some scenarios on the Pipe throughput benchmark, and I don't fully understand why; I've dropped down to the asm and am tweaking to improve the asm code quality. I may fold the changes back in as a separate PR if I don't find the root cause.
@benaadams any update?
@benaadams any update / plans? (I assume you will be busy at the MVP Summit this week)
One month with no update; closing for now. Please feel free to reopen it once you're ready to push it forward. Thanks!
BufferSegment:ResetMemory() is not an inexpensive operation (blanking 80 bytes plus returning data to the pool), but it doesn't need to be done under the Pipe's reader/writer sync lock. The scheduling/marking of completion can also happen before the segments are reset and returned to the pool, allowing for lower latency.
This change resets the segments outside of the lock, then uses the pool itself as a separate lock for returning segments to the pool (and taking them from it), so it's a fast lock that covers only the pooled segments and doesn't lock the whole Pipe.
It also reduces the scope of the lock on the write side when acquiring Memory, to only when it modifies the readHead or needs to change state to Writing (so most write Advancing will skip the lock).
The Flush lock when writing is also shrunk, as Flush can modify the write head; the lock is then re-acquired afterwards when signalling the Reader.
/cc @davidfowl @pakrym @jkotalik
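Putting the summary together, the shape of the return path after this change is roughly as follows (a hedged sketch of the pattern only; the type and member names are illustrative and not the actual Pipe source):

```csharp
using System;
using System.Collections.Generic;

// Placeholder for the real Pipelines type; ResetMemory is the work shown in the
// asm listing earlier in the thread.
internal sealed class BufferSegment
{
    public void ResetMemory() { /* dispose/return backing memory, blank fields */ }
}

internal sealed class PipeSketch
{
    private const int SegmentPoolSize = 16;
    private readonly object _sync = new object();       // the Pipe's reader/writer lock
    private readonly object _poolSync = new object();   // separate, short-lived pool lock
    private readonly Stack<BufferSegment> _segmentPool = new Stack<BufferSegment>();

    // The reader has advanced past `consumed` segments and the writer may be awaiting a flush.
    private void CompleteAdvance(List<BufferSegment> consumed, Action writerContinuation)
    {
        // 1. Only the Pipe state transition happens under the main lock, then it is released.
        lock (_sync)
        {
            // update read head / flags (elided)
        }

        // 2. Signal the writer before the expensive cleanup, for lower latency.
        writerContinuation?.Invoke();

        // 3. Reset segments outside any lock; this work doesn't touch Pipe state.
        foreach (BufferSegment segment in consumed)
        {
            segment.ResetMemory();
        }

        // 4. Return them under the pool's own lock, which is never held while
        //    readers or writers contend on _sync.
        lock (_poolSync)
        {
            foreach (BufferSegment segment in consumed)
            {
                if (_segmentPool.Count < SegmentPoolSize)
                {
                    _segmentPool.Push(segment);
                }
            }
        }
    }
}
```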