Write length and payload separately for double the throughput #388
Conversation
The 4MiB benchmark still doesn't work, but the smaller benchmarks no longer hang indefinitely.
This greatly improves the throughput of sendPacket by avoiding a copy of each packet before it is sent, and by avoiding copying the payload of Write and Data packets at all.

Benchmarked against OpenSSH using

TMPDIR=/dev/shm go test -run=- -integration \
    -bench='kWrite([0-9]*k|1MiB)' -benchmem

on Linux/amd64. Results:

name          old time/op    new time/op    delta
Write1k-8      312ms ±12%     305ms ± 5%       ~     (p=0.684 n=10+10)
Write16k-8    56.9ms ± 2%    31.9ms ± 4%    -43.92%  (p=0.000 n=10+10)
Write32k-8    43.2ms ± 3%    18.1ms ± 6%    -58.22%  (p=0.000 n=10+10)
Write128k-8   31.1ms ± 2%    13.5ms ± 4%    -56.72%  (p=0.000 n=10+9)
Write512k-8   17.5ms ± 5%    10.1ms ± 4%    -42.30%  (p=0.000 n=10+10)
Write1MiB-8   14.9ms ± 1%     9.3ms ± 4%    -37.72%  (p=0.000 n=8+10)

name          old speed      new speed      delta
Write1k-8     33.7MB/s ±11%  34.4MB/s ± 5%     ~      (p=0.684 n=10+10)
Write16k-8     184MB/s ± 2%   329MB/s ± 4%   +78.39%  (p=0.000 n=10+10)
Write32k-8     243MB/s ± 3%   581MB/s ± 6%   +139.45% (p=0.000 n=10+10)
Write128k-8    337MB/s ± 2%   778MB/s ± 4%   +131.11% (p=0.000 n=10+9)
Write512k-8    599MB/s ± 5%  1038MB/s ± 4%   +73.29%  (p=0.000 n=10+10)
Write1MiB-8    702MB/s ± 1%  1128MB/s ± 4%   +60.67%  (p=0.000 n=8+10)

name          old alloc/op   new alloc/op   delta
Write1k-8     61.6MB ± 0%    35.7MB ± 0%    -42.01%  (p=0.000 n=10+10)
Write16k-8    27.9MB ± 0%     2.2MB ± 0%    -91.99%  (p=0.000 n=10+10)
Write32k-8    29.9MB ± 0%     1.1MB ± 0%    -96.26%  (p=0.000 n=10+7)
Write128k-8   29.2MB ± 0%     0.3MB ± 0%    -98.87%  (p=0.000 n=10+10)
Write512k-8   29.0MB ± 0%     0.1MB ± 0%    -99.55%  (p=0.000 n=10+10)
Write1MiB-8   28.9MB ± 0%     0.1MB ± 0%    -99.66%  (p=0.000 n=10+10)

name          old allocs/op  new allocs/op  delta
Write1k-8     92.2k ± 0%     82.0k ± 0%     -11.09%  (p=0.000 n=9+9)
Write16k-8    5.81k ± 0%     5.17k ± 0%     -11.07%  (p=0.000 n=10+10)
Write32k-8    2.93k ± 0%     2.60k ± 0%     -11.04%  (p=0.000 n=10+9)
Write128k-8   2.45k ± 0%     2.12k ± 0%     -13.21%  (p=0.000 n=8+10)
Write512k-8   2.33k ± 0%     2.00k ± 0%     -13.91%  (p=0.000 n=9+9)
Write1MiB-8   2.31k ± 0%     1.99k ± 0%     -14.03%  (p=0.000 n=6+10)
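As a rough illustration of the idea (a sketch, not the actual code in this PR), writing the 4-byte length header and the already-marshaled payload as two separate writes means the payload slice of a Write or Data packet never has to be copied into a combined buffer:

package sketch

import (
	"encoding/binary"
	"io"
)

// sendPacket is a hypothetical stand-in for the library's method of the same
// name: the length prefix lives in a small stack buffer and the payload slice
// is written as-is, with no concatenation into a fresh buffer.
func sendPacket(w io.Writer, payload []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

A real implementation would also have to hold the connection's write lock across both calls so that packets sent from concurrent goroutines cannot interleave.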
The test failure is the one already fixed in #387.
What I would ideally like to see is packets marshaling into a slice that already has space for the length reserved at the front, just unset. Then, right before writing, the length is filled into that header and the whole slice is written at once. The
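A rough sketch of that reserve-then-backfill approach (the packet interface and names here are hypothetical, not the library's API):

package sketch

import "encoding/binary"

// packet is a hypothetical interface for anything that can append its wire
// encoding to a buffer; it stands in for the library's marshaling code.
type packet interface {
	encodedSize() int
	appendTo(b []byte) []byte
}

// appendWithLength reserves the 4-byte length header up front, marshals the
// packet directly after it, then backfills the header, so header and body can
// be written as a single slice without an extra copy.
func appendWithLength(p packet) []byte {
	buf := make([]byte, 4, 4+p.encodedSize())
	buf = p.appendTo(buf)
	binary.BigEndian.PutUint32(buf[:4], uint32(len(buf)-4))
	return buf
}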
I think everything here would probably still continue to work if we break up all of these writes, but I have been bitten before by weird segmentation issues where
The code in this PR is not as elegant as it could be, but it's relatively local. I can try to implement a more thorough solution, but (1) it will be a much larger patch, and (2) the packet and the payload really need to be kept separate to get an O(1) sendPacket rather than the current O(n).
I think you're confusing arrays and slices. A [4]byte is just four bytes, without a slice header, and with value semantics. When you assign one to an interface{}, it gets copied to the heap. Try

package pool_test

import (
	"sync"
	"testing"
)

func BenchmarkArrayPool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return [4]byte{} }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().([4]byte)
		pool.Put(a) // array copied to the heap
	}
}

func BenchmarkPointerPool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return new([4]byte) }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().(*[4]byte)
		pool.Put(a) // only the pointer is stored in the interface
	}
}

func BenchmarkSlicePool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return make([]byte, 4) }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().([]byte)
		pool.Put(a) // slice header copied to the heap
	}
}

On my machine, that gives
With this library? Because I don't see how the SFTP protocol could be affected.
So, I’ve played around with this a bit and came up with the commits in https://github.com/pkg/sftp/tree/more-optimization (which also deal with some subtle race conditions). By just allocating and starting after the length, it’s a lot of lines of code changed, but simpler in the end without a need for a

The other improvements were mostly with an eye towards improving the

I’ll have something tomorrow either way: either the code sitting uncommitted on my laptop, or a restart that isn’t overhauling
I've benchmarked your branch @ 113f5e9 against master:
And against this PR (new = yours):
That's quite a bit better for small writes!
I’ve pushed another commit to the branch with the alternate concurrency pattern. Ideally, benchmarking with the 4MiB-with-delays case should demonstrate the better concurrency performance. Perhaps your benchmark tool’s output can show better statistical data than I get from my single runs, to get an idea of whether this commit would be better just backed out.
Here's new=113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37 against old=2b85c58483d85114825b740918f7fed96cf330cf (the delayed benchmarks won't run on master):
With f80745d applied, this becomes (old is now 113f5e9):
If you want to try this too, here's what I did:
Benchstat can also take more than two logs as input, but then it can't do significance testing.
Maybe the concurrent WriteAt can be simplified using golang.org/x/sync/errgroup?
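For illustration, a minimal sketch of that pattern (not the library's actual WriteAt; the chunk size and function name are made up): split the buffer into chunks, issue each write concurrently, and let errgroup collect the first error.

package sketch

import (
	"io"

	"golang.org/x/sync/errgroup"
)

const chunkSize = 32 * 1024 // stand-in for the per-packet payload size

// concurrentWriteAt issues one WriteAt per chunk on its own goroutine and
// returns the first error once all writes have finished.
func concurrentWriteAt(w io.WriterAt, b []byte, off int64) error {
	var g errgroup.Group
	for start := 0; start < len(b); start += chunkSize {
		end := start + chunkSize
		if end > len(b) {
			end = len(b)
		}
		chunk, chunkOff := b[start:end], off+int64(start)
		g.Go(func() error {
			_, err := w.WriteAt(chunk, chunkOff)
			return err
		})
	}
	return g.Wait()
}

errgroup.Group needs no initialization, and Wait blocks until every goroutine returns, which keeps the error handling simpler than a hand-rolled WaitGroup plus error channel.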
Thanks. I’m already eyeing WriteAt and ReadFrom. But like I said, while the delayed writes and reads are way faster, the non-delayed ones are where the real question lies: are they close enough to a wash to justify a shift that makes long pipes faster? And benchmarks are only rough gauges of actual performance. The reduced memory allocation is definitely a big win, with multiple factors of 10 dropped, but the throughput picture at this stage is hazy. Thanks for the link so I can run the benchmarks myself, though. I might check it out a bit more over the weekend or the coming week and get a feel for the right direction to move forward. (Even if that’s throwing out the concurrency model change I just wrote. 😆 )
These ideas were wrapped into #397. Thanks @greatroar for the contribution! I’m sure we will all see much better performance as a result. (Even if it took me a while to get around to getting it all done.)