Tuning of readPool buffer sizes #556

Open · darthShadow opened this issue Jan 16, 2025 · 25 comments

darthShadow commented Jan 16, 2025

Would it be possible to tune the buffer sizes a bit more intelligently for the readPool here:

go-fuse/fuse/server.go

Lines 218 to 229 in aa9c516

ms.readPool.New = func() interface{} {
    targetSize := o.MaxWrite + int(maxInputSize)
    if targetSize < _FUSE_MIN_READ_BUFFER {
        targetSize = _FUSE_MIN_READ_BUFFER
    }
    // O_DIRECT typically requires buffers aligned to
    // blocksize (see man 2 open), but requirements vary
    // across file systems. Presumably, we could also fix
    // this by reading the requests using readv.
    buf := make([]byte, targetSize+logicalBlockSize)
    buf = alignSlice(buf, unsafe.Sizeof(WriteIn{}), logicalBlockSize, uintptr(targetSize))
    return buf
}

It starts with a minimum buffer size of 1M (because of MaxWrite being set to 1M) which, as per my understanding, is not necessary for most requests like LOOKUP, FORGET, READDIR etc.

I have a mostly read-heavy mount (with reads done via passthrough), and in times of high usage the memory grows quite a bit, with the majority of it due to the readPool allocations. It does go down eventually during quieter periods, so there is no leak; it's just something that could be improved.

I guess I could revert to the default MaxWrite value, but I wanted to check if the pool could be improved instead, letting me keep my faster writes 😅.

@darthShadow (Author)

Trimmed output from pprof:

Showing nodes accounting for 1723.22MB, 98.53% of 1748.92MB total
Dropped 104 nodes (cum <= 8.74MB)
Showing top 10 nodes out of 39
      flat  flat%   sum%        cum   cum%
 1192.96MB 68.21% 68.21%  1192.96MB 68.21%  github.com/hanwen/go-fuse/v2/fuse.NewServer.func2
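
(For scale: if NewServer.func2 here is the readPool.New closure quoted above, each buffer is roughly MaxWrite, i.e. 1 MiB, plus a small header and block-alignment overhead, so the ~1193 MB attributed to it corresponds to on the order of a thousand buffers retained by the pool.)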

hanwen (Owner) commented Jan 18, 2025

It starts with a minimum buffer size of 1M (because of MaxWrite being set to 1M) which, as per my understanding, is not necessary for most requests like LOOKUP, FORGET, READDIR etc.

If the request is small, the 1M buffer is returned immediately after it is read.

AFAIK, you don't know upfront which request you get, so you have to be prepared for the worst (happy to be proven wrong; maybe you could have two read loops, one with a large buffer and one with a small one).

hanwen (Owner) commented Jan 18, 2025

The code you quote provisions space for answering READ calls. It isn't used for LOOKUP/FORGET/etc.

darthShadow (Author) commented Jan 18, 2025

know upfront which request you get

I was wondering if it is possible to do it similarly to how the outPayloadSize is calculated for buffers from the ms.buffers pool:

go-fuse/fuse/server.go

Lines 559 to 572 in aa9c516

h, inSize, outSize, outPayloadSize, code := parseRequest(req.inputBuf, &ms.kernelSettings)
if !code.Ok() {
    ms.opts.Logger.Printf("parseRequest: %v", code)
    return code
}
req.inPayload = req.inputBuf[inSize:]
req.inputBuf = req.inputBuf[:inSize]
req.outputBuf = req.outBuf[:outSize+int(sizeOfOutHeader)]
copy(req.outputBuf, zeroOutBuf[:])
if outPayloadSize > 0 {
    req.outPayload = ms.buffers.AllocBuffer(uint32(outPayloadSize))
    req.bufferPoolOutputBuf = req.outPayload
}

Perhaps a smaller buffer could be used to determine the type of request from the header, and the buffer could then be sized appropriately.


answering READ calls. It isn't used for LOOKUP/FORGET/etc.

It seems to be allocated for all types of requests per this; I don't see any if condition that allocates the buffer only for READ or WRITE requests:

go-fuse/fuse/server.go

Lines 351 to 359 in aa9c516

destIface := ms.readPool.Get()
dest := destIface.([]byte)
var n int
err := handleEINTR(func() error {
    var err error
    n, err = syscall.Read(ms.mountFd, dest)
    return err
})

hanwen (Owner) commented Jan 18, 2025

It seems to be allocated for all types of requests per this; I don't see any if condition that allocates the buffer only for READ or WRITE requests:

Sorry, I was confused by the naming: the readPool is for reading requests, not for servicing READ requests (that happens in handleRequest()).

This buffer is put back into the pool if it's not used; see

ms.readPool.Put(destIface)

@darthShadow (Author)

Yeah, but that's still a huge allocation that may not be fully used, which is what I want to reduce.

It could result in a lot of huge buffers sitting in the sync.Pool (as seen in the pprof output above) after a burst of high activity, and they may take a long time to be GC-ed.

hanwen (Owner) commented Jan 18, 2025

It could result in a lot of huge buffers sitting in the sync.Pool (as seen in the pprof output above) after a burst of high activity, and they may take a long time to be GC-ed.

That is WAI (working as intended), no? High activity costs memory, and memory needs GC to be reclaimed.

Perhaps a smaller buffer could be used to determine the type of request from the header, and the buffer could then be sized appropriately.

How would that work? If you try to read from the device with a buffer that is too small, the read will fail with EINVAL.

@darthShadow (Author)

Perhaps the buffer allocation could be skipped until we know the input size?

So, use just a plain byte slice for the initial syscall.Read and let it grow as needed, with the actual pool alloc coming later for the inputBuf once we know the inSize?

Meaning the pool alloc moves from here:

go-fuse/fuse/server.go

Lines 351 to 359 in aa9c516

destIface := ms.readPool.Get()
dest := destIface.([]byte)
var n int
err := handleEINTR(func() error {
    var err error
    n, err = syscall.Read(ms.mountFd, dest)
    return err
})

to here:

go-fuse/fuse/server.go

Lines 559 to 572 in aa9c516

h, inSize, outSize, outPayloadSize, code := parseRequest(req.inputBuf, &ms.kernelSettings)
if !code.Ok() {
    ms.opts.Logger.Printf("parseRequest: %v", code)
    return code
}
req.inPayload = req.inputBuf[inSize:]
req.inputBuf = req.inputBuf[:inSize]
req.outputBuf = req.outBuf[:outSize+int(sizeOfOutHeader)]
copy(req.outputBuf, zeroOutBuf[:])
if outPayloadSize > 0 {
    req.outPayload = ms.buffers.AllocBuffer(uint32(outPayloadSize))
    req.bufferPoolOutputBuf = req.outPayload
}

hanwen (Owner) commented Jan 18, 2025

I suggest you try out the idea, and send me a PR if it works.

@darthShadow (Author)

Btw, just reading the code further, the max concurrency for requests seems to be 16, am I right?

With that value, I’m not sure how so many byte slices are even allocated because they should have been getting re-used. Something for me to check further.

@trapexit

@hanwen Forgive me as I'm unfamiliar with go-fuse, but... how does it manage /dev/fuse? Is it like libfuse, with a thread pool reading from the fd and then handling each request, or is it reading messages and then passing them off to something else to process? If the latter, is that queue bounded?

hanwen (Owner) commented Jan 21, 2025

@trapexit - look for singleReader in server.go and read surrounding code.

@darthShadow (Author)

Using a fixed-size pool with a channel for synchronization seems to be a lot better on memory usage. That function has all but disappeared from the memory profile, and the allocs in the benchmarks have also gone down.

+ go test ./benchmark -test.bench '.*' -test.cpu 1,2
goos: linux
goarch: amd64
pkg: github.com/hanwen/go-fuse/v2/benchmark
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkGoFuseMemoryRead      	    2266	    501278 ns/op	4183.61 MB/s	    3385 B/op	      72 allocs/op
BenchmarkGoFuseMemoryRead-2    	    4238	    279978 ns/op	7490.41 MB/s	    5060 B/op	      87 allocs/op
BenchmarkGoFuseFDRead          	   48939	     25596 ns/op	2560.39 MB/s	      36 B/op	       1 allocs/op
BenchmarkGoFuseFDRead-2        	   48085	     26591 ns/op	2464.63 MB/s	      45 B/op	       1 allocs/op
BenchmarkGoFuseStat            	    7188	    148202 ns/op
BenchmarkGoFuseStat-2          	    6966	    145004 ns/op
BenchmarkGoFuseReaddir         	    4059	    288141 ns/op
BenchmarkGoFuseReaddir-2       	    4729	    242145 ns/op
BenchmarkTimeNow               	21389954	        55.94 ns/op
BenchmarkTimeNow-2             	21435914	        55.98 ns/op
BenchmarkCFuseThreadedStat     	    7597	    139052 ns/op
BenchmarkCFuseThreadedStat-2   	   15082	     80008 ns/op

Will let it run for a few days to see if it remains stable before sending a PR.
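
For illustration, a minimal sketch of what such a channel-backed fixed-size pool could look like (the names and blocking behaviour here are hypothetical; the actual change may differ):

// fixedPool hands out at most cap(bufs) buffers; Get blocks when all of
// them are in use, which also acts as backpressure on the reader loop.
type fixedPool struct {
    bufs chan []byte
}

func newFixedPool(n, size int) *fixedPool {
    p := &fixedPool{bufs: make(chan []byte, n)}
    for i := 0; i < n; i++ {
        p.bufs <- make([]byte, size) // all buffers allocated up front
    }
    return p
}

// Get blocks until a buffer is free, so at most n buffers ever exist.
func (p *fixedPool) Get() []byte { return <-p.bufs }

// Put returns a buffer to the pool for reuse.
func (p *fixedPool) Put(b []byte) { p.bufs <- b }

With this shape, memory for read buffers is capped at n × size regardless of how bursty the traffic is, at the cost of Get blocking when all buffers are in use.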

@trapexit

@trapexit - look for singleReader in server.go and read surrounding code.

I'm just trying to help @darthShadow. The question is merely whether the message-reading functionality is bounded or unbounded. @darthShadow, are you using singleReader?

The reason unbounded is bad: while your FUSE server is likely to respond faster than incoming requests can be made, that generally holds true only until "forget" messages come in. They can arrive in huge waves when there is any cache/memory pressure or there are forced node forgets. These messages require no response and can come in by the hundreds or thousands in a short period.

@darthShadow (Author)

are you using singleReader

No, this is with singleReader set to false.

hanwen (Owner) commented Jan 22, 2025

They can arrive in huge waves when there is any cache/memory pressure or there are forced node forgets. These messages require no response and can come in by the hundreds or thousands in a short period.

Recent kernels send them as batches (see the BatchForget type).

bounded or unbounded message reading functionality. @darthShadow are you using singleReader?

it's unbounded, both for singleReader=false and singleReader=true.

hanwen (Owner) commented Jan 22, 2025

Using a fixed-size pool with a channel for synchronization seems to be a lot better on memory usage.

I would expect a performance degradation for highly parallel filesystem access. It would be interesting to see a benchmark that reproduces that, so we can understand the tradeoff better.

until "forget" messages come in. They can arrive in huge waves when there is any cache/memory pressure or there are forced node forgets.

This should be easy to reproduce; you can force a cache drop by writing into /proc/sys/vm/drop_caches. Check memory usage right before and right after.

@darthShadow (Author)

it's unbounded

For singleReader=false, it is still bounded by maxReaders (which is capped at 16), right? Or am I missing something in the code?

go-fuse/fuse/server.go

Lines 193 to 198 in aa9c516

maxReaders := runtime.GOMAXPROCS(0)
if maxReaders < minMaxReaders {
    maxReaders = minMaxReaders
} else if maxReaders > maxMaxReaders {
    maxReaders = maxMaxReaders
}

go-fuse/fuse/server.go

Lines 342 to 345 in aa9c516

if ms.reqReaders > ms.maxReaders {
    ms.reqMu.Unlock()
    return nil, OK
}

I’ve set the size to 2 × maxReaders, which I thought would avoid any performance problems, based on my understanding that the library processes only maxReaders requests at once when singleReader is false.

go-fuse/fuse/server.go

Lines 544 to 548 in aa9c516

if ms.singleReader {
    go ms.handleRequest(req)
} else {
    ms.handleRequest(req)
}

Let me know if my understanding is wrong here and I’ve missed something.

hanwen (Owner) commented Jan 23, 2025

maxReaders is exactly what it says: it is the maximum number of goroutines that can be reading the /dev/fuse device, and therefore also the maximum number of outstanding buffers from the readPool.

The number of concurrent requests is still unbounded: as soon as a read completes, we start processing the request, but also start a new reader to pick up the slack.
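
Schematically (an illustrative sketch only, not the actual go-fuse code), the pattern is roughly:

// At most cap(slots) goroutines are blocked reading /dev/fuse, but each
// completed read starts a replacement reader before handling the request,
// so the number of requests being handled concurrently is not capped.
func serve(slots chan struct{}, read func() []byte, handle func([]byte)) {
    slots <- struct{}{} // acquire one of maxReaders reader slots
    req := read()       // blocking read on /dev/fuse
    <-slots             // release the slot
    go serve(slots, read, handle) // start a new reader to pick up the slack
    handle(req)                   // handle in this goroutine
}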

trapexit commented Jan 23, 2025

recent kernels would send them as batches (see BatchForget type).

Not always. I ran into this issue of unbounded processing within the past 24 months. Batch forget has been around for many years, and even on recent-ish kernels I've seen floods of regular forgets.

hanwen (Owner) commented Jan 24, 2025

Not always. I ran into this issue of unbounded processing within the past 24 months.

How does https://review.gerrithub.io/c/hanwen/go-fuse/+/1207746 work for you?

@darthShadow (Author)

Thanks for the explanation and for correcting my flawed understanding.

I’ve cherry-picked your commit into my fork for testing, but it will be a while before I get to it, sorry.

hanwen (Owner) commented Feb 8, 2025

I've submitted the backpressure change because it seemed sensible in principle. I am not sure if this will fix the original issue, but for that I would need a reliable reproduction of the problem.

@darthShadow (Author)

So, I modified the fixed-size pool slightly by backing it with a sync.Pool, which increased the allocs but should allow for better reuse of the structs.

Reference: darthShadow@7af94a6#diff-8e280562298ff7d5f517dba57766a75956945d309b7b99d805e0fa2dec1e55c8

With that and the backpressure change, the numbers are quite promising. The percentage of alloc_space in the heap profile for the readPool has gone down from almost 30% to less than 1%. I will keep it running for a few days to see if the numbers hold up over time.
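
Roughly, the idea is a bounded channel acting as a semaphore in front of a sync.Pool, along these lines (a hypothetical sketch, not the linked diff):

// boundedPool caps how many buffers can be checked out at once, while
// sync.Pool handles reuse and lets idle buffers be reclaimed by the GC.
type boundedPool struct {
    sem  chan struct{}
    pool sync.Pool
}

func newBoundedPool(n, size int) *boundedPool {
    return &boundedPool{
        sem:  make(chan struct{}, n),
        pool: sync.Pool{New: func() interface{} { return make([]byte, size) }},
    }
}

func (p *boundedPool) Get() []byte {
    p.sem <- struct{}{} // block while n buffers are already checked out
    return p.pool.Get().([]byte)
}

func (p *boundedPool) Put(b []byte) {
    p.pool.Put(b)
    <-p.sem
}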

hanwen (Owner) commented Feb 9, 2025

It would be interesting if you could tease apart which change made the difference: your custom pool or the backpressure.

So, I modified the fixed-size pool slightly by backing it with a sync.Pool

?

the blog posts that you link are both very old (one says explicitly "THIS BLOG POST IS VERY OLD NOW. YOU PROBABLY DON'T WANT TO USE THE TECHNIQUE DESCRIBED HERE. GO'S sync.Pool IS A BETTER WAY TO GO.").

The sync.Pool implementation has had a lot of optimization. It would be surprising if we could do better by simply sticking a channel in front of it.
