Write length and payload separately for double the throughput #388
Conversation
The 4MiB benchmark still doesn't work, but the smaller benchmarks no longer hang indefinitely.
This greatly improves the throughput of sendPacket by avoiding a copy of each packet before it is sent, and by avoiding copying the payload of Write and Data packets at all.

Benchmarked against OpenSSH using

TMPDIR=/dev/shm go test -run=- -integration \
    -bench='kWrite([0-9]*k|1MiB)' -benchmem

on Linux/amd64. Results:

name          old time/op    new time/op    delta
Write1k-8      312ms ±12%     305ms ± 5%       ~     (p=0.684 n=10+10)
Write16k-8    56.9ms ± 2%    31.9ms ± 4%    -43.92%  (p=0.000 n=10+10)
Write32k-8    43.2ms ± 3%    18.1ms ± 6%    -58.22%  (p=0.000 n=10+10)
Write128k-8   31.1ms ± 2%    13.5ms ± 4%    -56.72%  (p=0.000 n=10+9)
Write512k-8   17.5ms ± 5%    10.1ms ± 4%    -42.30%  (p=0.000 n=10+10)
Write1MiB-8   14.9ms ± 1%     9.3ms ± 4%    -37.72%  (p=0.000 n=8+10)

name          old speed      new speed      delta
Write1k-8     33.7MB/s ±11%  34.4MB/s ± 5%     ~      (p=0.684 n=10+10)
Write16k-8     184MB/s ± 2%   329MB/s ± 4%   +78.39%  (p=0.000 n=10+10)
Write32k-8     243MB/s ± 3%   581MB/s ± 6%   +139.45% (p=0.000 n=10+10)
Write128k-8    337MB/s ± 2%   778MB/s ± 4%   +131.11% (p=0.000 n=10+9)
Write512k-8    599MB/s ± 5%  1038MB/s ± 4%   +73.29%  (p=0.000 n=10+10)
Write1MiB-8    702MB/s ± 1%  1128MB/s ± 4%   +60.67%  (p=0.000 n=8+10)

name          old alloc/op   new alloc/op   delta
Write1k-8     61.6MB ± 0%    35.7MB ± 0%    -42.01%  (p=0.000 n=10+10)
Write16k-8    27.9MB ± 0%     2.2MB ± 0%    -91.99%  (p=0.000 n=10+10)
Write32k-8    29.9MB ± 0%     1.1MB ± 0%    -96.26%  (p=0.000 n=10+7)
Write128k-8   29.2MB ± 0%     0.3MB ± 0%    -98.87%  (p=0.000 n=10+10)
Write512k-8   29.0MB ± 0%     0.1MB ± 0%    -99.55%  (p=0.000 n=10+10)
Write1MiB-8   28.9MB ± 0%     0.1MB ± 0%    -99.66%  (p=0.000 n=10+10)

name          old allocs/op  new allocs/op  delta
Write1k-8     92.2k ± 0%     82.0k ± 0%     -11.09%  (p=0.000 n=9+9)
Write16k-8    5.81k ± 0%     5.17k ± 0%     -11.07%  (p=0.000 n=10+10)
Write32k-8    2.93k ± 0%     2.60k ± 0%     -11.04%  (p=0.000 n=10+9)
Write128k-8   2.45k ± 0%     2.12k ± 0%     -13.21%  (p=0.000 n=8+10)
Write512k-8   2.33k ± 0%     2.00k ± 0%     -13.91%  (p=0.000 n=9+9)
Write1MiB-8   2.31k ± 0%     1.99k ± 0%     -14.03%  (p=0.000 n=6+10)
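As a rough illustration of the idea (a sketch, not the actual code in this PR), writing the 4-byte length header and the already-marshaled payload as two separate writes means the payload slice of a Write or Data packet never has to be copied into a combined buffer:

package sketch

import (
	"encoding/binary"
	"io"
)

// sendPacket is a hypothetical stand-in for the library's method of the same
// name: the length prefix lives in a small stack buffer and the payload slice
// is written as-is, with no concatenation into a fresh buffer.
func sendPacket(w io.Writer, payload []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

A real implementation would also have to hold the connection's write lock across both calls so that packets sent from concurrent goroutines cannot interleave.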
The test failure is the one already fixed in #387.
What I would ideally like to see is packets marshaling into a slice that already has space for the length reserved at the front, just unset. Then, right before writing, the length is filled into that header and the whole slice is written at once. The
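A rough sketch of that reserve-then-backfill approach (the packet interface and names here are hypothetical, not the library's API):

package sketch

import "encoding/binary"

// packet is a hypothetical interface for anything that can append its wire
// encoding to a buffer; it stands in for the library's marshaling code.
type packet interface {
	encodedSize() int
	appendTo(b []byte) []byte
}

// appendWithLength reserves the 4-byte length header up front, marshals the
// packet directly after it, then backfills the header, so header and body can
// be written as a single slice without an extra copy.
func appendWithLength(p packet) []byte {
	buf := make([]byte, 4, 4+p.encodedSize())
	buf = p.appendTo(buf)
	binary.BigEndian.PutUint32(buf[:4], uint32(len(buf)-4))
	return buf
}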
I think everything here would probably still continue to work if we break up all of these writes, but I have been bitten before by weird segmentation issues where
The code in this PR is not as elegant as it could be, but it's relatively local. I can try to implement a more thorough solution, but (1) it will be a much larger patch, and (2) the packet and the payload really need to be kept separate to get an O(1) sendPacket rather than the current O(n).
I think you're confusing arrays and slices. A [4]byte is just four bytes, without a slice header, and with value semantics. When you assign one to an interface{}, it gets copied to the heap. Try

package pool_test

import (
	"sync"
	"testing"
)

func BenchmarkArrayPool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return [4]byte{} }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().([4]byte)
		pool.Put(a) // array copied to the heap
	}
}

func BenchmarkPointerPool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return new([4]byte) }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().(*[4]byte)
		pool.Put(a) // only the pointer is stored in the interface
	}
}

func BenchmarkSlicePool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return make([]byte, 4) }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().([]byte)
		pool.Put(a) // slice header copied to the heap
	}
}

On my machine, that gives
With this library? Because I don't see how the SFTP protocol could be affected.
So, I’ve played around with this a bit and came up with the commits in https://github.com/pkg/sftp/tree/more-optimization (which also deal with some subtle race conditions). By just allocating and starting after the length, it’s a lot of lines of code changed, but simpler in the end without a need for a

The other improvements were mostly with an eye towards improving the

I’ll have something tomorrow either way: either the code sitting uncommitted on my laptop, or a restart that isn’t overhauling
I've benchmarked your branch @ 113f5e9 against master:
And against this PR (new = yours):
That's quite a bit better for small writes!
I’ve pushed another commit to the branch with the alternate concurrency pattern. Ideally, benchmarking with the 4MiB-with-delays case should demonstrate the better concurrency performance. Perhaps your benchmark tool’s output can show better statistical data than I get from my single runs, to get an idea of whether this commit would be better just backed out.
Here's new=113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37 against old=2b85c58483d85114825b740918f7fed96cf330cf (the delayed benchmarks won't run on master):
With f80745d applied, this becomes (old is now 113f5e9):
If you want to try this too, here's what I did:
Benchstat can also take more than two logs as input, but then it can't do significance testing.
Maybe the concurrent WriteAt can be simplified using golang.org/x/sync/errgroup?
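For illustration, a minimal sketch of that pattern (not the library's actual WriteAt; the chunk size and function name are made up): split the buffer into chunks, issue each write concurrently, and let errgroup collect the first error.

package sketch

import (
	"io"

	"golang.org/x/sync/errgroup"
)

const chunkSize = 32 * 1024 // stand-in for the per-packet payload size

// concurrentWriteAt issues one WriteAt per chunk on its own goroutine and
// returns the first error once all writes have finished.
func concurrentWriteAt(w io.WriterAt, b []byte, off int64) error {
	var g errgroup.Group
	for start := 0; start < len(b); start += chunkSize {
		end := start + chunkSize
		if end > len(b) {
			end = len(b)
		}
		chunk, chunkOff := b[start:end], off+int64(start)
		g.Go(func() error {
			_, err := w.WriteAt(chunk, chunkOff)
			return err
		})
	}
	return g.Wait()
}

errgroup.Group needs no initialization, and Wait blocks until every goroutine returns, which keeps the error handling simpler than a hand-rolled WaitGroup plus error channel.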
Thanks. I’m already eyeing WriteAt and ReadFrom. But like I said, while the delayed writes and reads are way faster, the non-delayed ones are where the real question lies: are they close enough to a wash to justify a shift that makes long pipes faster? And benchmarks are only rough gauges of actual performance. The reduced memory allocation is definitely a big win, with multiple factors of 10 dropped, but the throughput picture at this stage is hazy. Thanks for the link so I can run the benchmarks myself, though. I might check it out a bit more over the weekend or the coming week and get a feel for the right direction to move forward. (Even if that’s throwing out the concurrency model change I just wrote. 😆 )
These ideas were wrapped into #397. Thanks @greatroar for the contribution! I’m sure we will all see much better performance as a result. (Even if it took me a while to get around to getting it all done.)