Performance improvements #334

Closed
drakkan opened this issue Feb 12, 2020 · 15 comments · Fixed by #344

@drakkan
Collaborator

drakkan commented Feb 12, 2020

Hi,

SFTPGo uses this library for SFTP and a home-made implementation for SCP.

Generally the bottleneck is encryption, but for some ciphers this is not the case: for example, aes128-gcm@openssh.com has an implicit MAC and is really fast. Other ciphers can be optimized too; for example, for AES-CTR we can apply this patch:

https://go-review.googlesource.com/c/go/+/51670

and SHA-256, used for the MAC, can be optimized this way:

drakkan/crypto@cc78d71

With these patches encryption is no longer the bottleneck, and SFTPGo's SCP implementation has speed comparable to OpenSSH; sadly, SFTP is still slower than OpenSSH.

I understand that SCP is a simpler protocol than SFTP and my implementation is really basic too: I have only one goroutine that reads/writes to/from a preallocated []byte, with no packet manager or multiple goroutines as in pkg/sftp.

In SFTPGo if I put this line:

https://github.com/drakkan/sftpgo/blob/master/sftpd/scp.go#L402

inside the for loop, I get a 25% performance regression. pkg/sftp seems to do exactly this: looking at the pkg/sftp code, for each packet there is at least one line such as this one:

data := make([]byte, clamp(length, maxTxPacket))

I think we can improve performance using preallocated slices. I'm thinking about something like a global or per-request sync.Pool, but it doesn't seem so easy to follow allocation/deallocation in all cases. Do you have better suggestions? We could make pkg/sftp as fast as the OpenSSH implementation!
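
To make the idea concrete, here is a minimal sketch of the kind of fixed-size buffer pool I have in mind (the names and the poolBufSize value are only illustrative, not pkg/sftp API); as discussed below, the hard part is knowing when a buffer can safely be put back:

package sftp

import "sync"

// Illustrative sketch only: a pool of fixed-capacity buffers so the hot path
// can avoid one make([]byte, ...) per packet.
const poolBufSize = 1 << 15 // assumed packet cap, roughly the data packet size

var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, poolBufSize) },
}

// getBuf returns a buffer of length n; n must be <= poolBufSize.
func getBuf(n int) []byte {
	return bufPool.Get().([]byte)[:n]
}

// putBuf returns a buffer to the pool; the caller must guarantee that
// nothing still references it.
func putBuf(b []byte) {
	bufPool.Put(b[:cap(b)])
}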

Please take a look here for some numbers and profiling results:

drakkan/sftpgo#69

thanks!

@puellanivis
Collaborator

In general, you do not actually want to store variable-length buffers in a sync.Pool for later reuse: golang/go#23199 because many of the long-capacity buffers will never be released, even when usage does not require such a long buffer.

Using a global buffer is prone to having to deal with threading issues, because now you cannot have that code run in parallel anymore, without duplicating the whole global object.

I’m not saying that the code cannot be optimized to make fewer allocations, but a lot of the structure and design of the library relies upon heavy multi-threaded access, where allocating per-request byte-slices may be the only way to do such a thing safely.

In particular, you point to a few examples where we make a byte-slice with a length determined by clamp. Even if we set up a sync.Pool to be able to return these slices back, when and where are their uses finished? In order to be able to put them back into a pool, we would have to know when they are no longer being used, and put them back into the pool. We cannot have any dangling references to them, so we need to be sure. What happens if/when we return these back across goroutines, etc.

That’s a lot of bookkeeping that we’re not set up to do right now, and might even require a big refactor of the whole package to get done. 🤷‍♀

@drakkan
Collaborator Author

drakkan commented Feb 12, 2020

In general, you do not actually want to store variable-length buffers in a sync.Pool for later reuse: golang/go#23199 because many of the long-capacity buffers will never be released, even when usage does not require such a long buffer.

Using a global buffer is prone to having to deal with threading issues, because now you cannot have that code run in parallel anymore, without duplicating the whole global object.

I’m not saying that the code cannot be optimized to make fewer allocations, but a lot of the structure and design of the library relies upon heavy multi-threaded access, where allocating per-request byte-slices may be the only way to do such a thing safely.

In particular, you point to a few examples where we make a byte-slice with a length determined by clamp. Even if we set up a sync.Pool to be able to return these slices back, when and where are their uses finished? In order to be able to put them back into a pool, we would have to know when they are no longer being used, and put them back into the pool.

I agree, this seems the most difficult part, and it's what I meant when I wrote "but it doesn't seem so easy to follow allocation/deallocation in all cases"; it's the main reason why I asked for advice before starting to write some code.

We cannot have any dangling references to them, so we need to be sure. What happens if/when we return these back across goroutines, etc.

That’s a lot of bookkeeping that we’re not set up to do right now, and might even require a big refactor of the whole package to get done.

For my use case, saturating a gigabit connection is enough and this can be easily achieved; anyway, it would be really good to be as fast as OpenSSH. I would be glad to help, but I don't want to rewrite the whole library.

@puellanivis thanks for your response
@eikenb do you want to add something?

@drakkan
Collaborator Author

drakkan commented Feb 12, 2020

Anyway, we could start with the simplest things. For example: is this copy really needed?

https://github.com/pkg/sftp/blob/master/packet.go#L841

something like this seems to work and tests pass:

// p.Data = make([]byte, p.Length)
// copy(p.Data, b)
p.Data = b
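
One general Go caveat with aliasing instead of copying (a hypothetical illustration, dataPacket is not the real pkg/sftp type): the packet now shares memory with the buffer it was unmarshaled from, so this is only safe as long as that buffer is not reused for the next read.

type dataPacket struct{ Data []byte }

// zero-copy: p.Data aliases b, so b must not be reused or modified afterwards
func (p *dataPacket) unmarshalAlias(b []byte) { p.Data = b }

// copying: slower, but safe even if the caller reuses b for the next packet
func (p *dataPacket) unmarshalCopy(b []byte) {
	p.Data = make([]byte, len(b))
	copy(p.Data, b)
}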

Here is the profile result with the current code:

    1260ms 26.47% 26.47%     1320ms 27.73%  syscall.Syscall
     420ms  8.82% 35.29%      420ms  8.82%  syscall.Syscall6
     290ms  6.09% 41.39%      290ms  6.09%  runtime.memclrNoHeapPointers
     220ms  4.62% 46.01%      220ms  4.62%  runtime.memmove
     190ms  3.99% 50.00%      190ms  3.99%  runtime.epollwait
     180ms  3.78% 53.78%      340ms  7.14%  runtime.scanobject
     160ms  3.36% 57.14%      160ms  3.36%  runtime.futex
     160ms  3.36% 60.50%      160ms  3.36%  runtime.procyield
     150ms  3.15% 63.66%      150ms  3.15%  crypto/aes.gcmAesDec
     120ms  2.52% 66.18%      120ms  2.52%  golang.org/x/crypto/argon2.blamkaSSE4

and here is the result with the suggested modification:

    1280ms 27.35% 27.35%     1320ms 28.21%  syscall.Syscall
     390ms  8.33% 35.68%      390ms  8.33%  crypto/aes.gcmAesDec
     260ms  5.56% 41.24%      260ms  5.56%  syscall.Syscall6
     250ms  5.34% 46.58%      250ms  5.34%  runtime.memclrNoHeapPointers
     200ms  4.27% 50.85%      390ms  8.33%  runtime.scanobject
     180ms  3.85% 54.70%      180ms  3.85%  runtime.futex
     180ms  3.85% 58.55%      180ms  3.85%  runtime.memmove
     150ms  3.21% 61.75%      150ms  3.21%  runtime.epollwait
     130ms  2.78% 64.53%      130ms  2.78%  runtime.procyield
     100ms  2.14% 66.67%      110ms  2.35%  runtime.findObject

@eikenb
Member

eikenb commented Feb 15, 2020

@drakkan .. I'd be happy to try to work with you to increase the performance of sftp, but I'd like to make sure we keep the flexibility and don't complicate things.

I agree that more targeted optimizations would be better than a large refactor at this point. I think the code needs some time spent refactoring first, to eliminate the duplication of server code and simplify the data flow. Then larger-scale optimizations would be more feasible, maybe even becoming more obvious with more attention paid to the data flow.

So I'd love to review any targeted PRs for optimizations. Even simple ones, like eliminating the copy in your last post, add up and could make a significant impact. But I think more work needs to be done before something like a pool would be feasible.

@drakkan
Collaborator Author

drakkan commented Feb 16, 2020

@eikenb I'll try to read the code again to see if I can find some other simple improvement.

Regarding the refactoring, I could try to help, but can you please explain better what you think needs to be done? For example, we have server and request-server and some code is duplicated between these two implementations; do we need to keep both?

In general if you can find some time to describe a refactoring plan this would be helpful.

Regarding the allocator, I just did a quick test with this very basic allocator:

package sftp

import (
	"sync"
)

type pagelist struct {
	pages [][]byte
}

type allocator struct {
	// map key is the size of the contained list of []byte
	available map[int]pagelist
	// map key is the request order, the value is the same as available
	used map[uint32]map[int]pagelist
	sync.Mutex
}

func newAllocator() *allocator {
	return &allocator{
		available: make(map[int]pagelist),
		used:      make(map[uint32]map[int]pagelist),
	}
}

// GetPage returns a previously allocated and unused []byte or creates
// a new one
func (a *allocator) GetPage(size int, requestID uint32) []byte {
	a.Lock()
	defer a.Unlock()
	var result []byte
	if avail, ok := a.available[size]; ok {
		// get the available page and remove it from the available ones
		result = avail.pages[len(avail.pages)-1]
		if len(avail.pages) == 1 {
			delete(a.available, size)
		} else {
			avail.pages = avail.pages[:len(avail.pages)-1]
			a.available[size] = avail
		}
	} else {
		result = make([]byte, size)
	}
	// put result in used page
	if used, ok := a.used[requestID]; ok {
		if val, ok := used[size]; ok {
			val.pages = append(val.pages, result)
			a.used[requestID][size] = val
		} else {
			p := pagelist{
				pages: nil,
			}
			p.pages = append(p.pages, result)
			used[size] = p
			a.used[requestID] = used
		}
	} else {
		used := make(map[int]pagelist)
		p := pagelist{
			pages: nil,
		}
		p.pages = append(p.pages, result)
		used[size] = p
		a.used[requestID] = used
	}
	return result
}

// ReleasePages marks as unused all the pages in use for the given requestID
func (a *allocator) ReleasePages(requestID uint32) {
	a.Lock()
	defer a.Unlock()
	if used, ok := a.used[requestID]; ok {
		for size, p := range used {
			if avail, ok := a.available[size]; ok {
				p.pages = append(p.pages, avail.pages...)
			}
			a.available[size] = p
		}
		delete(a.used, requestID)
	}
}

func (a *allocator) Free() {
	a.Lock()
	defer a.Unlock()
	a.available = make(map[int]pagelist)
	a.used = make(map[uint32]map[int]pagelist)
}

The idea is to use an allocator for each packetManager and mark allocated pages reusable each time a request ends (so each allocation is bound to the request order).

I used it in recvPacket to see how it works and it didn't change anything for downloads (the received packets are small), while it increased performance for uploads (from 490 MB/s to 530 MB/s).

I'll postpone my experiments on this allocator for now.
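
For reference, this is roughly how the packet-reading path could use it (a hypothetical sketch, not my exact patch; readPacketWith and the missing sanity checks are illustrative): the length prefix tells us the packet size, the buffer comes from the allocator instead of make, and everything is released once the response for that order id has been sent.

package sftp

import (
	"encoding/binary"
	"io"
)

// readPacketWith is an illustrative sketch of a recvPacket-style read that
// takes its buffers from the allocator above instead of allocating per packet.
func readPacketWith(a *allocator, r io.Reader, orderID uint32) ([]byte, error) {
	lenBuf := a.GetPage(4, orderID)
	if _, err := io.ReadFull(r, lenBuf); err != nil {
		return nil, err
	}
	length := binary.BigEndian.Uint32(lenBuf)

	body := a.GetPage(int(length), orderID)
	if _, err := io.ReadFull(r, body); err != nil {
		return nil, err
	}
	// the caller eventually calls a.ReleasePages(orderID) once the
	// response for this order id has been written to the network.
	return body, nil
}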

@puellanivis
Collaborator

If you use map[int]*pagelist and map[uint32]map[int]*pagelist then you do not have to keep reassigning everything back into the maps. Since everything is already locked under a mutex, you don’t need to worry about mutating the values in the data structure.

Don’t delete the pagelist out of the available map when it is empty. It’s not a large data structure, so keeping it around is not a big overhead, and its internal [][]byte slice with its already allocated capacity can then be reused. Though it would be a good idea to assign nil into values before truncating them out of the [][]byte slice, to avoid long-held references from blocking garbage collection.

Instead of used, ok := a.used[requestID]; ok you can use used := a.used[requestID]; used != nil

With the above, and recognizing where pointer semantics apply, you can simplify your allocator a lot:

// getPage returns a previously allocated and unused []byte or creates a new one.
// no need to export receiver methods of unexported types.
func (a *allocator) getPage(size int, requestID uint32) []byte {
	a.Lock()
	defer a.Unlock()

	var result []byte
	if avail := a.available[size]; avail != nil {
		if len(avail.pages) > 0 {
			truncLength := len(avail.pages) - 1
			result = avail.pages[truncLength]

			avail.pages[truncLength] = nil // clear out the internal pointer
			avail.pages = avail.pages[:truncLength] // truncate the slice
		}
	}

	// no preallocated slice found, doesn’t matter why, just alloc a new one
	if result == nil {
		result = make([]byte, size)
	}

	// put result in used page
	used := a.used[requestID]
	if used == nil {
		used = make(map[int]*pagelist)
		a.used[requestID] = used
	}

	page := used[size]
	if page == nil {
		page = new(pagelist) // you don’t need to explicitly assign zero values into structs
		used[size] = page
		// no need to assign a.used[requestID] = used, because maps are pointer semantics
	}
	page.pages = append(page.pages, result)

	// no need to assign used[size] = page, because it’s a pointer
	// no need to assign a.used[requestID] = used, because maps are pointer semantics

	return result
}

We can also reduce the chances of stale pointers dragging out garbage collection by making sure we don’t leave spare pointers around when we release pages:

// releasePages mark unused all pages in use for the given requestID
// no need to export receiver methods of unexported types.
func (a *allocator) releasePages(requestID uint32) {
	a.Lock()
	defer a.Unlock()
	if used := a.used[requestID]; used != nil {
		for size, p := range used {
			avail := a.available[size]
			if avail == nil {
				avail = new(pagelist)
				a.available[size] = avail
			}

			// try to always copy things into the destination, rather than copying the source onto the destination and then assigning the source into the destination.
			// doing it the latter way increases the amount of internal slice reallocation going on.
			avail.pages = append(avail.pages, p.pages...)

			for i := range p.pages {
				p.pages[i] = nil // we want to clear out the internal pointer here, so it is not left hanging around.
			}
			// now we can just drop `p` on the floor. The garbage collector will clean it up.
			delete(used, size)
		}
		delete(a.used, requestID)
	}
}
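
If you want to check that the pooling actually pays off, a rough pair of benchmarks along these lines (illustrative only; it would live in an allocator_test.go and assumes the pointer-based maps shown above) should show the difference in allocs/op between the pooled path and a plain make:

package sftp

import "testing"

var benchSink []byte

// run with: go test -bench=Page -benchmem
func BenchmarkAllocatorGetPage(b *testing.B) {
	a := &allocator{
		available: make(map[int]*pagelist),
		used:      make(map[uint32]map[int]*pagelist),
	}
	for i := 0; i < b.N; i++ {
		requestID := uint32(i % 16)
		benchSink = a.getPage(32*1024, requestID)
		a.releasePages(requestID)
	}
}

func BenchmarkMakePage(b *testing.B) {
	for i := 0; i < b.N; i++ {
		benchSink = make([]byte, 32*1024)
	}
}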

@drakkan
Collaborator Author

drakkan commented Feb 23, 2020

@puellanivis, thanks! I'll test your suggestions, I'm sure they will improve the allocator.

I'll also try to convert random-access reads/writes to sequential reads/writes, using a memory cache of a configurable size, to understand if this can improve performance too. Since I'm using the request server I can make these changes directly inside my app. I think I can find some time for these tests after finalizing the sftpgo 0.9.6 release.

@drakkan
Collaborator Author

drakkan commented Feb 25, 2020

I did a quick test with the allocator posted above (a really ugly patch); here are some numbers using aes128-gcm@openssh.com.

In my test I write to a ramfs and I use the sftp and scp CLIs as test tools.

  1. pkg/sftp + #338 (data packet unmarshal binary: remove uneeded copy), upload: 526.6MB/s, download: 428.3MB/s
  2. pkg/sftp + #338 (data packet unmarshal binary: remove uneeded copy) + allocator, upload: 613.2MB/s, download: 523.8MB/s
  3. my scp implementation (https://github.com/drakkan/sftpgo/blob/master/sftpd/scp.go), upload: 638.1MB/s, download: 1.0GB/s
  4. OpenSSH sftp using the same cipher, upload: 943.6MB/s, download: 804.3MB/s
  5. OpenSSH scp using the same cipher, upload: 792.8MB/s, download: 766.8MB/s

So basically Go's SSH seems very fast, at least for downloads, but there is something in pkg/sftp that makes it slow. I'm a bit lost; I cannot find other bottlenecks. Any help is appreciated, thanks.

@drakkan
Collaborator Author

drakkan commented Mar 4, 2020

Hi again,

I published my allocator patch here:

https://github.com/drakkan/sftp/commits/master

please take a look at the two latest commits. I think we can still improve download performance but I was unable to find out how for now (the scp implementation in sftpgo is still faster). I also tried to send the data packet directly from fileget here:

https://github.com/drakkan/sftp/blob/master/request.go#L242

instead of returning the data packet, but this does not improve anything.

Anyway, the tests on real hardware done by the user that initially reported the issue against sftpgo show that with this patch we have performance similar to OpenSSH (but higher CPU usage). Probably the disk limits were reached.

My allocator patch is very hacky; I would like to discuss how to improve it and eventually how to refactor pkg/sftp to get it merged.

The idea is to use an allocator for each packetManager. Each new request, identified by its order id, requests new slices as needed and releases them in maybeSendPackets:

https://github.com/drakkan/sftp/blob/master/packet-manager.go#L186

The released slices can be reused by other requests inside the same packetManager, so we need:

  1. a way to easily access the packet manager from inner requests
  2. sshFxpDataPacket seems to be the only big packet we send out to the network; to avoid a copy/reallocation in marshal we could allocate all the needed bytes from the beginning, here:

https://github.com/drakkan/sftp/blob/master/request.go#L236

In my patch I marshal the data packet directly in fileget, which is very hacky. We could, for example, just leave the needed unallocated bytes at the beginning, but then we would need to do the same for other packets too, or special-case the data packet in sendPacket as I did here (another hack):

https://github.com/drakkan/sftp/blob/master/packet.go#L125
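
To illustrate the idea without the hack: read the file data into a buffer that already reserves room for the SSH_FXP_DATA header (type + request id + data length), so the packet can be framed in place and no extra allocation+copy is needed when sending. The names below (dataHeaderLen, readDataPacket, frameDataPacket) are only illustrative, they are not the pkg/sftp API:

package sftp

import (
	"encoding/binary"
	"io"
)

// 1-byte packet type + 4-byte request id + 4-byte data length.
const dataHeaderLen = 1 + 4 + 4

// readDataPacket reads up to maxLen bytes at off into a buffer that reserves
// dataHeaderLen bytes at the start for the header (illustrative sketch only,
// the buffer could also come from the allocator).
func readDataPacket(r io.ReaderAt, off int64, id, maxLen uint32) ([]byte, error) {
	buf := make([]byte, dataHeaderLen+int(maxLen))
	n, err := r.ReadAt(buf[dataHeaderLen:], off)
	if err != nil && err != io.EOF {
		return nil, err
	}
	return frameDataPacket(buf, id, uint32(n)), nil
}

// frameDataPacket fills the reserved header in place and returns the framed packet.
func frameDataPacket(buf []byte, id, length uint32) []byte {
	buf[0] = 103 // SSH_FXP_DATA
	binary.BigEndian.PutUint32(buf[1:5], id)
	binary.BigEndian.PutUint32(buf[5:9], length)
	return buf[:dataHeaderLen+int(length)]
}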

I'm open to suggestions and even to completely rewriting my patch, but I would like some guidance/suggestions, thanks!

@puellanivis
Collaborator

puellanivis commented Mar 4, 2020

I’m not 100% sure that we necessarily need a caching allocator like this yet, because ideally, Go should already be handling that sort of optimization reasonably well for us. (And it adds a lot of complexity, so I would like to see benchmarks demonstrating it is patently absolutely necessary.)

I’m sure there are plenty more improvements available just by avoiding reallocations. If the performance is the same, but higher CPU, then it’s way more likely that there is a bunch of memcpy going around, rather than issues with allocating fresh buffers.

@drakkan
Collaborator Author

drakkan commented Mar 6, 2020

I’m not 100% sure that we necessarily need a caching allocator like this yet, because ideally, Go should already be handling that sort of optimization reasonably well for us. (And it adds a lot of complexity, so I would like to see benchmarks demonstrating it is patently absolutely necessary.)

It is not absolutely necessary, but it improves performance; here are some numbers:

https://github.com/HiFiPhile/sftpgo/blob/work/docs/performance.md#cipher-aes128gcmopensshcom

For the aes128-gcm@openssh.com cipher only the allocator patch applies; the other patches improve AES-CTR and the MAC, while aes128-gcm uses implicit message authentication.

The baseline configuration includes my latest two patches, so pkg/sftp git master performance is compared against the allocator branch.

I’m sure there are plenty more improvements available just by avoiding reallocations. If the performance is the same, but higher CPU, then it’s way more likely that there is a bunch of memcpy going around, rather than issues with allocating fresh buffers.

The performance is the same as OpenSSH, not the same as without the allocator patch. The CPU usage is a bit higher than OpenSSH's; anyway, the user didn't include CPU usage in the summary.

I don't have such hardware available for testing myself. If it's all the same to you, I suggest trying my allocator branch using a RAM filesystem with the aes128-gcm@openssh.com cipher, which doesn't require patching Go itself. I developed and tested the patch this way.

Thanks

@puellanivis
Collaborator

puellanivis commented Mar 6, 2020

What I’m saying is that if we can avoid at least some of the copies into new slices, we should decrease the pressure being relieved by the allocator. And then we would be addressing the root cause, and not the symptoms.

After we have removed all the copies into new slices, then we should be able to evaluate if building in an allocator would be worthwhile.

Like I said, I would really like to address the root cause of the memory allocation pressure without an allocator first, and then evaluate the need of adding a complicated allocator code. The driving interests here: K.I.S.S. and “No code is best code.”

P.S.: If we implement an allocator, we have to ensure that it eventually releases memory back to the system, otherwise we will have basically just written an over-engineered memory leak.

P.P.S.: getting a good code review on the allocator is going to be a complex process, because it’s complex code. Meanwhile, getting rid of unnecessary allocate+copies is a simple code review that we should be able to turn around quickly.

@drakkan
Collaborator Author

drakkan commented Mar 6, 2020

What I’m saying is that if we can avoid at least some of the copies into new slices, we should decrease the pressure being relieved by the allocator. And then we would be addressing the root cause, and not the symptoms.

After we have removed all the copies into new slices, then we should be able to evaluate if building in an allocator would be worthwhile.

Like I said, I would really like to address the root cause of the memory allocation pressure without an allocator first, and then evaluate the need of adding a complicated allocator code. The driving interests here: K.I.S.S. and “No code is best code.”

P.S.: If we implement an allocator, we have to ensure that it eventually releases memory back to the system, otherwise we will have basically just written an over-engineered memory leak.

Well, I'll try to read the code once again, but I cannot see any other copies in the critical paths (uploads and downloads; I don't worry about other SFTP packets).

@drakkan
Collaborator Author

drakkan commented Mar 6, 2020

What I’m saying is that if we can avoid at least some of the copies into new slices, we should decrease the pressure being relieved by the allocator. And then we would be addressing the root cause, and not the symptoms.

After we have removed all the copies into new slices, then we should be able to evaluate if building in an allocator would be worthwhile.

Like I said, I would really like to address the root cause of the memory allocation pressure without an allocator first, and then evaluate the need of adding a complicated allocator code. The driving interests here: K.I.S.S. and “No code is best code.”

P.S.: If we implement an allocator, we have to ensure that it eventually releases memory back to the system, otherwise we will have basically just written an over-engineered memory leak.

In my branch the allocator is associated with the packet manager and the memory is released when the packet manager ends:

https://github.com/drakkan/sftp/blob/master/request-server.go#L154

We can have a memory leak only if downloads and uploads require packets of different sizes all the time; eventually I can add a guard against this, limiting the maximum number of preallocated packets, if there is interest in this approach. Thanks.

P.P.S.: getting a good code review on the allocator is going to be a complex process, because it’s complex code. Meanwhile, getting rid of unnecessary allocate+copies is a simple code review that we should be able to turn around quickly.
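
To be concrete about the guard mentioned above, here is a minimal sketch based on the pointer-based releasePages posted earlier in this thread (the cap value and the releasePagesCapped name are only illustrative):

// Illustrative cap, not a tuned value.
const maxAvailablePagesPerSize = 64

// releasePagesCapped is releasePages with a guard: pages beyond the cap are
// dropped instead of kept, so the allocator cannot grow without bound when
// requests keep asking for different sizes.
func (a *allocator) releasePagesCapped(requestID uint32) {
	a.Lock()
	defer a.Unlock()

	used := a.used[requestID]
	if used == nil {
		return
	}
	for size, p := range used {
		avail := a.available[size]
		if avail == nil {
			avail = new(pagelist)
			a.available[size] = avail
		}
		for i, page := range p.pages {
			if len(avail.pages) < maxAvailablePagesPerSize {
				avail.pages = append(avail.pages, page)
			}
			p.pages[i] = nil // drop the reference either way
		}
		delete(used, size)
	}
	delete(a.used, requestID)
}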

@drakkan
Collaborator Author

drakkan commented Mar 9, 2020

Just for info, this is a memory profile while downloading a 1GB file using current master:

go tool pprof --alloc_space /tmp/memprofile
...
(pprof) top
Showing nodes accounting for 2247.56MB, 98.88% of 2273.02MB total
Dropped 123 nodes (cum <= 11.37MB)
      flat  flat%   sum%        cum   cum%
 1254.24MB 55.18% 55.18%  1254.24MB 55.18%  github.com/pkg/sftp.sshFxpDataPacket.MarshalBinary
  991.81MB 43.63% 98.81%   991.81MB 43.63%  github.com/pkg/sftp.fileget
       1MB 0.044% 98.86%  1255.24MB 55.22%  github.com/pkg/sftp.(*packetManager).maybeSendPackets
    0.50MB 0.022% 98.88%  1260.24MB 55.44%  github.com/pkg/sftp.(*packetManager).controller
         0     0% 98.88%   991.81MB 43.63%  github.com/pkg/sftp.(*Request).call
         0     0% 98.88%   993.31MB 43.70%  github.com/pkg/sftp.(*RequestServer).Serve.func1.1
         0     0% 98.88%   993.31MB 43.70%  github.com/pkg/sftp.(*RequestServer).packetWorker
         0     0% 98.88%  1254.24MB 55.18%  github.com/pkg/sftp.(*conn).sendPacket
         0     0% 98.88%  1254.24MB 55.18%  github.com/pkg/sftp.sendPacket

As you can see, there is an allocation both in fileget and in MarshalBinary.

Here is a memory profile while downloading the same file using my allocator fork:

(pprof) top
Showing nodes accounting for 31.26MB, 78.59% of 39.78MB total
Showing top 10 nodes out of 93
      flat  flat%   sum%        cum   cum%
   16.50MB 41.48% 41.48%    16.50MB 41.48%  github.com/pkg/sftp.(*allocator).GetPage
       3MB  7.54% 49.02%        5MB 12.57%  golang.org/x/crypto/ssh.Marshal
    2.50MB  6.28% 55.30%     3.50MB  8.80%  github.com/pkg/sftp.orderedPackets.Sort
       2MB  5.03% 60.33%        2MB  5.03%  github.com/pkg/sftp.makePacket
    1.76MB  4.43% 64.76%     1.76MB  4.43%  compress/flate.NewWriter
    1.50MB  3.77% 68.53%     1.50MB  3.77%  github.com/pkg/sftp.(*packetManager).incomingPacket
       1MB  2.51% 71.05%        1MB  2.51%  golang.org/x/crypto/ssh.(*connectionState).readPacket
       1MB  2.51% 73.56%        1MB  2.51%  internal/reflectlite.Swapper
       1MB  2.51% 76.08%        1MB  2.51%  github.com/pkg/sftp.(*packetManager).readyPacket
       1MB  2.51% 78.59%        3MB  7.54%  github.com/pkg/sftp.fileget

Maybe we could at least avoid the second allocation in MarshalBinary; a pull request is coming.
