
Conversation

@alanprot commented Aug 11, 2025

Description

Hi,

After upgrading grpc-go, we started seeing the following error:

stream terminated by RST_STREAM with error code: REFUSED_STREAM

Upon investigation, it appears that in some scenarios, we fail to properly delete streams from the activeStreams map, resulting in this error being wrongly returned.

I added a test case that seems to trigger the problem:

PS: To test the behavior I added an "ActiveStreamTracker" hook; I'm not sure this is the best way, but I just wanted to demonstrate the problem (rough sketch at the end of this comment).

end2end_test.go:4015: leak streams:  5

The issue seems to have been introduced in the following PR: #8071

Specifically, the check added here:

if oldState == streamDone {
	return
}

If this conditional is removed, all streams are correctly cleaned up from the activeStreams map, and the test passes without leaks.
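Roughly the shape of the check I'm talking about (sketch only; ActiveStreamTracker and its Active() method are the ad-hoc hook from my branch, not an existing grpc-go API):

// Hypothetical leak assertion used by the repro test. The tracker is assumed
// to observe additions to and deletions from the server transport's
// activeStreams map, so Active() returning non-zero after all RPCs have
// finished means some streams were never deleted.
func assertNoLeakedStreams(t *testing.T, tracker *ActiveStreamTracker) {
	t.Helper()
	if n := tracker.Active(); n != 0 {
		t.Fatalf("leak streams: %d", n)
	}
}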

@alanprot (Author)

cc @arjan-bal @dfawley

@alanprot changed the title to "Failing to delete the stream from the activeStreams map leading to REFUSED_STREAM errors." Aug 11, 2025
@arjan-bal (Contributor) commented Aug 12, 2025

Hi @alanprot, thanks for sharing the test to repro. I haven't debugged the test yet, but wanted to ask if you know why the added check causes stream deletion to be skipped? From a quick look at the places where streamDone is being set, I found only two:

oldState := s.swapState(streamDone)
if oldState == streamDone {
	// If the stream was already done, return.
	return
}
hdr.cleanup = &cleanupStream{
	streamID: s.id,
	rst:      rst,
	rstCode:  rstCode,
	onWrite: func() {
		t.deleteStream(s, eosReceived)

func (t *http2Server) closeStream(s *ServerStream, rst bool, rstCode http2.ErrCode, eosReceived bool) {
	// In case stream sending and receiving are invoked in separate
	// goroutines (e.g., bi-directional streaming), cancel needs to be
	// called to interrupt the potential blocking on other goroutines.
	s.cancel()
	oldState := s.swapState(streamDone)
	if oldState == streamDone {
		return
	}
	t.deleteStream(s, eosReceived)

In both places, deleteStream is called eventually. Do you know how the code ends up in a situation where the stream is marked as done, but it's not deleted from the map?

Edit: It seems like the onWrite callback may not be getting executed.

@arjan-bal self-assigned this Aug 12, 2025
@alanprot (Author) commented Aug 12, 2025

Hi @arjan-bal

As far as I could see, the problem is that we set the cleanup on the hdr and push it onto the controlbuf at:

hdr.cleanup = &cleanupStream{

And we get to:

if str.state != empty { // either active or waiting on stream quota.
	// add it str's list of items.
	str.itl.enqueue(h)
	return nil
}

And at this point, the stream is not on the activeStreams list, so processData returns here:

str := l.activeStreams.dequeue() // Remove the first stream.
if str == nil {
	return true, nil
}

And I think this stream never gets added back to the activeStreams list, so we never clean it up.

So it seems this only happens when the server side is sending lots of data and waiting for flow-control quota?

@arjan-bal (Contributor)

Thank you @alanprot, I think I understand the sequence of events leading to this.

  1. The server stream exhausts the stream quota and starts buffering data frames.
  2. The server attempts to finish the RPC gracefully by sending trailers via finishStream.
  3. The write of the header frame gets blocked behind the pending data frames that are awaiting flow-control quota.
  4. Once the deadline expires, the server's deadline-monitoring timer tries to call closeStream, which is a no-op since the stream state was already marked streamDone by finishStream in step 2.
  5. The client sends an RST_STREAM; the server calls closeStream, which is again a no-op.

The simplest fix is to remove the guard block added in closeStream in #8071. We may end up pushing redundant cleanupStream events on to the controlbuf, but it doesn't cause correctness issues since cleanupStreamHandler returns early in such cases.
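Roughly what that would look like (untested sketch based on the closeStream snippet quoted above; the trailing controlbuf push is elided):

func (t *http2Server) closeStream(s *ServerStream, rst bool, rstCode http2.ErrCode, eosReceived bool) {
	// In case stream sending and receiving are invoked in separate
	// goroutines (e.g., bi-directional streaming), cancel needs to be
	// called to interrupt the potential blocking on other goroutines.
	s.cancel()
	// Mark the stream done but do NOT return early: even if finishStream
	// already set streamDone, its buffered cleanup may never be written,
	// so deleteStream must still run here. A redundant cleanupStream event
	// is harmless because cleanupStreamHandler returns early for streams
	// that were already cleaned up.
	s.swapState(streamDone)
	t.deleteStream(s, eosReceived)
	// ... then push a cleanupStream onto the controlbuf as before (elided).
}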

@dfawley what do you think?

@dfawley (Member) commented Aug 13, 2025

The simplest fix is to remove the guard block added in closeStream in #8071. We may end up pushing redundant cleanupStream events on to the controlbuf, but it doesn't cause correctness issues since cleanupStreamHandler returns early in such cases.

That makes sense to me. I can't think of any other way to do this. Does the transport properly handle the rest of this situation, especially: dropping the queued data frames when the RST_STREAM is sent (or shortly thereafter)?

@dfawley removed their assignment Aug 13, 2025
@alanprot (Author)

Also, maybe we want to consider updating this metric only if the stream was actually present in the activeStreams map?

t.mu.Lock()
if _, ok := t.activeStreams[s.id]; ok {
	delete(t.activeStreams, s.id)
	if len(t.activeStreams) == 0 {
		t.idle = time.Now()
	}
}
t.mu.Unlock()
if channelz.IsOn() {
	if eosReceived {
		t.channelz.SocketMetrics.StreamsSucceeded.Add(1)
	} else {
		t.channelz.SocketMetrics.StreamsFailed.Add(1)
	}
}
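A rough sketch of what I mean (untested; it just threads the result of the existing map lookup into the channelz update, assuming deleteStream's receiver and parameters match the snippets above):

func (t *http2Server) deleteStream(s *ServerStream, eosReceived bool) {
	t.mu.Lock()
	_, wasActive := t.activeStreams[s.id]
	if wasActive {
		delete(t.activeStreams, s.id)
		if len(t.activeStreams) == 0 {
			t.idle = time.Now()
		}
	}
	t.mu.Unlock()
	// Only record the stream's outcome if it was actually tracked, so that
	// a redundant deleteStream call doesn't double-count it in channelz.
	if wasActive && channelz.IsOn() {
		if eosReceived {
			t.channelz.SocketMetrics.StreamsSucceeded.Add(1)
		} else {
			t.channelz.SocketMetrics.StreamsFailed.Add(1)
		}
	}
}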

@arjan-bal (Contributor)

That makes sense to me. I can't think of any other way to do this. Does the transport properly handle the rest of this situation, especially: dropping the queued data frames when the RST_STREAM is sent (or shortly thereafter)?

All the pending Data and Header frames are dropped in cleanupStreamHandler.

When does our max incoming streams quota go back up?

The check for max concurrent streams depends on the size of the activeStreams map.
We delete from the activeStreams map only in deleteStream().
So the question is: When is deleteStream() called?

Without timeout:
After the trailers are written, at the beginning of cleanupStreamHandler() due to cleanupStream.onWrite().

With timeout:
Just before the cleanupStream event is pushed into the controlBuf. So there is a small interval, before cleanupStreamHandler is called, during which an extra stream could be started. We can minimize this interval by using cleanupStream.onWrite(), similar to the no-timeout case above, but this is not super urgent since it's been this way for a while.

@arjan-bal (Contributor)

@alanprot are you interested in raising a PR with the fix and a unit test for catching regressions?

@alanprot (Author)

@arjan-bal I can definitely do that, but I’m out until Tuesday, so I’d only be able to pick it up then. :)

@arjan-bal (Contributor)

I'll be closing this PR and using #8517 to track the fix, as per our normal bug-fixing process. When the PR with the fix is merged, #8517 can be marked as completed.
