Tuning of readPool buffer sizes #556

Open · darthShadow opened this issue Jan 16, 2025 · 25 comments

darthShadow commented Jan 16, 2025

Would it be possible to tune the buffer sizes a bit more intelligently for the readPool here:

go-fuse/fuse/server.go

Lines 218 to 229 in aa9c516

ms.readPool.New = func() interface{} {
    targetSize := o.MaxWrite + int(maxInputSize)
    if targetSize < _FUSE_MIN_READ_BUFFER {
        targetSize = _FUSE_MIN_READ_BUFFER
    }
    // O_DIRECT typically requires buffers aligned to
    // blocksize (see man 2 open), but requirements vary
    // across file systems. Presumably, we could also fix
    // this by reading the requests using readv.
    buf := make([]byte, targetSize+logicalBlockSize)
    buf = alignSlice(buf, unsafe.Sizeof(WriteIn{}), logicalBlockSize, uintptr(targetSize))
    return buf
}

It starts with a minimum buffer size of 1M (because of MaxWrite being set to 1M) which, as per my understanding, is not necessary for most requests like LOOKUP, FORGET, READDIR etc.

I have a mostly read-heavy mount (with reads done via passthrough), and in times of high usage the memory grows quite a bit, with the majority of it due to the readPool allocations. It does go down eventually during quieter periods, so there is no leak; it's just something that could be improved.

I guess I could revert to the default MaxWrite value, but I wanted to check if the pool could be improved instead, letting me keep my faster writes 😅.

@darthShadow (Author)

Trimmed output from pprof:

Showing nodes accounting for 1723.22MB, 98.53% of 1748.92MB total
Dropped 104 nodes (cum <= 8.74MB)
Showing top 10 nodes out of 39
      flat  flat%   sum%        cum   cum%
 1192.96MB 68.21% 68.21%  1192.96MB 68.21%  github.com/hanwen/go-fuse/v2/fuse.NewServer.func2
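
(For scale: if NewServer.func2 here is the readPool.New closure quoted above, each buffer is roughly MaxWrite, i.e. 1 MiB, plus a small header and block-alignment overhead, so the ~1193 MB attributed to it corresponds to on the order of a thousand buffers retained by the pool.)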

hanwen (Owner) commented Jan 18, 2025

It starts with a minimum buffer size of 1M (because of MaxWrite being set to 1M) which, as per my understanding, is not necessary for most requests like LOOKUP, FORGET, READDIR etc.

If the request is small, the 1M buffer is returned immediately after it is read.

AFAIK, you don't know upfront which request you get, so you have to be prepared for the worst (happy to be proven wrong; maybe you could have two read loops, one with a large buffer and one with a small one).

hanwen (Owner) commented Jan 18, 2025

The code you quote provisions space for answering READ calls. It isn't used for LOOKUP/FORGET/etc.

darthShadow (Author) commented Jan 18, 2025

know upfront which request you get

I was wondering if it is possible to do it similarly to how the outPayloadSize is calculated for buffers from the ms.buffers pool:

go-fuse/fuse/server.go

Lines 559 to 572 in aa9c516

h, inSize, outSize, outPayloadSize, code := parseRequest(req.inputBuf, &ms.kernelSettings)
if !code.Ok() {
    ms.opts.Logger.Printf("parseRequest: %v", code)
    return code
}
req.inPayload = req.inputBuf[inSize:]
req.inputBuf = req.inputBuf[:inSize]
req.outputBuf = req.outBuf[:outSize+int(sizeOfOutHeader)]
copy(req.outputBuf, zeroOutBuf[:])
if outPayloadSize > 0 {
    req.outPayload = ms.buffers.AllocBuffer(uint32(outPayloadSize))
    req.bufferPoolOutputBuf = req.outPayload
}

Perhaps a smaller buffer could be used to determine the type of request from the header, and the buffer could then be sized appropriately.


answering READ calls. It isn't used for LOOKUP/FORGET/etc.

It seems to be allocated for all types of requests per this; I don't see any if condition that allocates the buffer only for READ or WRITE requests:

go-fuse/fuse/server.go

Lines 351 to 359 in aa9c516

destIface := ms.readPool.Get()
dest := destIface.([]byte)
var n int
err := handleEINTR(func() error {
    var err error
    n, err = syscall.Read(ms.mountFd, dest)
    return err
})

hanwen (Owner) commented Jan 18, 2025

It seems to be allocated for all types of requests per this; I don't see any if condition that allocates the buffer only for READ or WRITE requests:

Sorry, I was confused by the naming: the readPool is for reading requests, not for servicing READ requests (that happens in handleRequest()).

This buffer is put back into the pool if it's not used; see

ms.readPool.Put(destIface)

@darthShadow (Author)

Yeah, but that's still a huge allocation that may not be fully used, which is what I want to reduce.

It could result in a lot of huge buffers sitting in the sync.Pool (as seen in the pprof output above) after a burst of high activity, and they may take a long time to be GC-ed.

hanwen (Owner) commented Jan 18, 2025

It could result in a lot of huge buffers sitting in the sync.Pool (as seen in the pprof output above) after a burst of high activity, and they may take a long time to be GC-ed.

That is WAI (working as intended), no? High activity costs memory, and memory needs GC to be reclaimed.

Perhaps a smaller buffer could be used to determine the type of request from the header, and the buffer could then be sized appropriately.

How would that work? If you try to read from the device with a buffer that is too small, the read will fail with EINVAL.

@darthShadow (Author)

Perhaps the buffer allocation could be skipped until we know the input size?

So, use just a plain byte slice for the initial syscall.Read and let it grow as needed, with the actual pool alloc coming later for the inputBuf once we know the inSize?

Meaning the pool alloc moves from here:

go-fuse/fuse/server.go

Lines 351 to 359 in aa9c516

destIface := ms.readPool.Get()
dest := destIface.([]byte)
var n int
err := handleEINTR(func() error {
    var err error
    n, err = syscall.Read(ms.mountFd, dest)
    return err
})

to here:

go-fuse/fuse/server.go

Lines 559 to 572 in aa9c516

h, inSize, outSize, outPayloadSize, code := parseRequest(req.inputBuf, &ms.kernelSettings)
if !code.Ok() {
    ms.opts.Logger.Printf("parseRequest: %v", code)
    return code
}
req.inPayload = req.inputBuf[inSize:]
req.inputBuf = req.inputBuf[:inSize]
req.outputBuf = req.outBuf[:outSize+int(sizeOfOutHeader)]
copy(req.outputBuf, zeroOutBuf[:])
if outPayloadSize > 0 {
    req.outPayload = ms.buffers.AllocBuffer(uint32(outPayloadSize))
    req.bufferPoolOutputBuf = req.outPayload
}

hanwen (Owner) commented Jan 18, 2025

I suggest you try out the idea, and send me a PR if it works.

@darthShadow (Author)

Btw, just reading the code further, the max concurrency for requests seems to be 16, am I right?

With that value, I’m not sure how so many byte slices are even allocated because they should have been getting re-used. Something for me to check further.

@trapexit

@hanwen Forgive me as I'm unfamiliar with go-fuse, but... how does it manage /dev/fuse? Is it like libfuse, with a thread pool reading from the fd and then handling each request, or is it reading messages and then passing them off to something else to process? If the latter, is that queue bounded?

hanwen (Owner) commented Jan 21, 2025

@trapexit - look for singleReader in server.go and read surrounding code.

@darthShadow (Author)

Using a fixed-size pool with a channel for synchronization seems to be a lot better on memory usage. That function has all but disappeared from the memory profile, and the allocs in the benchmarks have also gone down.

+ go test ./benchmark -test.bench '.*' -test.cpu 1,2
goos: linux
goarch: amd64
pkg: github.com/hanwen/go-fuse/v2/benchmark
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkGoFuseMemoryRead      	    2266	    501278 ns/op	4183.61 MB/s	    3385 B/op	      72 allocs/op
BenchmarkGoFuseMemoryRead-2    	    4238	    279978 ns/op	7490.41 MB/s	    5060 B/op	      87 allocs/op
BenchmarkGoFuseFDRead          	   48939	     25596 ns/op	2560.39 MB/s	      36 B/op	       1 allocs/op
BenchmarkGoFuseFDRead-2        	   48085	     26591 ns/op	2464.63 MB/s	      45 B/op	       1 allocs/op
BenchmarkGoFuseStat            	    7188	    148202 ns/op
BenchmarkGoFuseStat-2          	    6966	    145004 ns/op
BenchmarkGoFuseReaddir         	    4059	    288141 ns/op
BenchmarkGoFuseReaddir-2       	    4729	    242145 ns/op
BenchmarkTimeNow               	21389954	        55.94 ns/op
BenchmarkTimeNow-2             	21435914	        55.98 ns/op
BenchmarkCFuseThreadedStat     	    7597	    139052 ns/op
BenchmarkCFuseThreadedStat-2   	   15082	     80008 ns/op

Will let it run for a few days to see if it remains stable before sending a PR.
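
For illustration, a minimal sketch of what such a channel-backed fixed-size pool could look like (the names and blocking behaviour here are hypothetical; the actual change may differ):

// fixedPool hands out at most cap(bufs) buffers; Get blocks when all of
// them are in use, which also acts as backpressure on the reader loop.
type fixedPool struct {
    bufs chan []byte
}

func newFixedPool(n, size int) *fixedPool {
    p := &fixedPool{bufs: make(chan []byte, n)}
    for i := 0; i < n; i++ {
        p.bufs <- make([]byte, size) // all buffers allocated up front
    }
    return p
}

// Get blocks until a buffer is free, so at most n buffers ever exist.
func (p *fixedPool) Get() []byte { return <-p.bufs }

// Put returns a buffer to the pool for reuse.
func (p *fixedPool) Put(b []byte) { p.bufs <- b }

With this shape, memory for read buffers is capped at n × size regardless of how bursty the traffic is, at the cost of Get blocking when all buffers are in use.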

@trapexit

@trapexit - look for singleReader in server.go and read surrounding code.

I'm just trying to help @darthShadow. The question is merely whether the message-reading functionality is bounded or unbounded. @darthShadow, are you using singleReader?

The reason unbounded is bad: while your FUSE server is likely to respond faster than incoming requests can be made, that generally holds true only until "forget" messages come in. They can arrive in huge waves when there is any cache/memory pressure or there are forced node forgets. These messages require no response and can come in by the hundreds or thousands in a short period.

@darthShadow (Author)

are you using singleReader

No, this is with singleReader set to false.

hanwen (Owner) commented Jan 22, 2025

They can arrive in huge waves when there is any cache/memory pressure or there are forced node forgets. These messages require no response and can come in by the hundreds or thousands in a short period.

Recent kernels send them as batches (see the BatchForget type).

bounded or unbounded message reading functionality. @darthShadow are you using singleReader?

it's unbounded, both for singleReader=false and singleReader=true.

hanwen (Owner) commented Jan 22, 2025

Using a fixed-size pool with a channel for synchronization seems to be a lot better on memory usage.

I would expect a performance degradation for highly parallel filesystem access. It would be interesting to see a benchmark that reproduces that, so we can understand the tradeoff better.

until "forget" messages come in. They can arrive in huge waves when there is any cache/memory pressure or there are forced node forgets.

This should be easy to reproduce; you can force a cache drop by writing into /proc/sys/vm/drop_caches. Check memory usage right before and right after.

@darthShadow (Author)

it's unbounded

For singleReader=false, it is still bounded by maxReaders (which is capped at 16), right? Or am I missing something in the code?

go-fuse/fuse/server.go

Lines 193 to 198 in aa9c516

maxReaders := runtime.GOMAXPROCS(0)
if maxReaders < minMaxReaders {
    maxReaders = minMaxReaders
} else if maxReaders > maxMaxReaders {
    maxReaders = maxMaxReaders
}

go-fuse/fuse/server.go

Lines 342 to 345 in aa9c516

if ms.reqReaders > ms.maxReaders {
    ms.reqMu.Unlock()
    return nil, OK
}

I’ve set the size to 2 × maxReaders, which I thought would avoid any performance problems, based on my understanding that the library processes only maxReaders requests at once when singleReader is false.

go-fuse/fuse/server.go

Lines 544 to 548 in aa9c516

if ms.singleReader {
    go ms.handleRequest(req)
} else {
    ms.handleRequest(req)
}

Let me know if my understanding is wrong here and I’ve missed something.

hanwen (Owner) commented Jan 23, 2025

maxReaders is exactly what it says: it is the maximum number of goroutines that can be reading the /dev/fuse device, and therefore also the maximum number of outstanding buffers from the readPool.

The number of concurrent requests is still unbounded: as soon as a read completes, we start processing the request, but also start a new reader to pick up the slack.
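
Schematically (an illustrative sketch only, not the actual go-fuse code), the pattern is roughly:

// At most cap(slots) goroutines are blocked reading /dev/fuse, but each
// completed read starts a replacement reader before handling the request,
// so the number of requests being handled concurrently is not capped.
func serve(slots chan struct{}, read func() []byte, handle func([]byte)) {
    slots <- struct{}{} // acquire one of maxReaders reader slots
    req := read()       // blocking read on /dev/fuse
    <-slots             // release the slot
    go serve(slots, read, handle) // start a new reader to pick up the slack
    handle(req)                   // handle in this goroutine
}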

trapexit commented Jan 23, 2025

recent kernels would send them as batches (see BatchForget type).

Not always. I ran into this issue of unbounded processing within the past 24 months. Batch forget has been around for many years, and even on recent-ish kernels I've seen floods of regular forgets.

hanwen (Owner) commented Jan 24, 2025

Not always. I ran into this issue of unbounded processing within the past 24 months.

How does https://review.gerrithub.io/c/hanwen/go-fuse/+/1207746 work for you?

@darthShadow (Author)

Thanks for the explanation and for correcting my flawed understanding.

I’ve cherry-picked your commit into my fork for testing, but it will be a while before I get to it, sorry.

hanwen (Owner) commented Feb 8, 2025

I've submitted the backpressure change because it seemed sensible in principle. I am not sure if this will fix the original issue, but for that I would need a reliable reproduction of the problem.

@darthShadow (Author)

So, I modified the fixed-size pool slightly by backing it with a sync.Pool, which increased the allocs but should allow for better reuse of the structs.

Reference: darthShadow@7af94a6#diff-8e280562298ff7d5f517dba57766a75956945d309b7b99d805e0fa2dec1e55c8

With that and the backpressure change, the numbers are quite promising. The percentage of alloc_space in the heap profile for the readPool has gone down from almost 30% to less than 1%. I will keep it running for a few days to see if the numbers hold up over time.
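
Roughly, the idea is a bounded channel acting as a semaphore in front of a sync.Pool, along these lines (a hypothetical sketch, not the linked diff):

// boundedPool caps how many buffers can be checked out at once, while
// sync.Pool handles reuse and lets idle buffers be reclaimed by the GC.
type boundedPool struct {
    sem  chan struct{}
    pool sync.Pool
}

func newBoundedPool(n, size int) *boundedPool {
    return &boundedPool{
        sem:  make(chan struct{}, n),
        pool: sync.Pool{New: func() interface{} { return make([]byte, size) }},
    }
}

func (p *boundedPool) Get() []byte {
    p.sem <- struct{}{} // block while n buffers are already checked out
    return p.pool.Get().([]byte)
}

func (p *boundedPool) Put(b []byte) {
    p.pool.Put(b)
    <-p.sem
}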

hanwen (Owner) commented Feb 9, 2025

It would be interesting if you could tease apart which change made the difference: your custom pool or the backpressure.

So, I modified the fixed-size pool slightly by backing it with a sync.Pool

?

the blog posts that you link are both very old (one says explicitly "THIS BLOG POST IS VERY OLD NOW. YOU PROBABLY DON'T WANT TO USE THE TECHNIQUE DESCRIBED HERE. GO'S sync.Pool IS A BETTER WAY TO GO.").

The sync.Pool implementation has had a lot of optimization. It would be surprising if we could do better by simply sticking a channel in front of it.
