Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failed to close writer: context canceled #474

Open
lukaspj opened this issue Aug 30, 2024 · 12 comments
Open

failed to close writer: context canceled #474

lukaspj opened this issue Aug 30, 2024 · 12 comments
Labels

Comments

@lukaspj
Copy link

lukaspj commented Aug 30, 2024

I'm using this library to create a zero-trust tunnel into a Kubernetes cluster, when I encountered this error message:

failed to write msg: failed to close writer: failed to acquire lock: context canceled

So what happened was that when client closed the connection, I would send a "connection closed" message over the WebSocket connection, but I (mistakenly) used the request context for the write command.
This context would get cancelled somewhere around this Write call, sometimes after message was written but before it released the lock.

As a result, the lock would never get released and the websocket was no longer usable.

The solution was to create a new context for the write:

	writeCtx, _ := context.WithTimeout(context.Background(), 2*time.Second)

	err = c.Websocket.Write(writeCtx, websocket.MessageText, marshal)

And I don't think I should have used the request context to begin with, but I do think it's not optimal behaviour that the websocket can be left in a broken state by mistimed context cancel.

In any case, just thought I'd share it. Thanks for maintaining this awesome library!

@mafredri
Copy link
Member

mafredri commented Aug 30, 2024

Hey @lukaspj, thanks for the user experience report.

You are definitely not the first! This is a very common mistake to make. There is one valid use-case for r.Context() after the connection being hijacked1, but in general it's almost never what you want.

I see this issue is further exacerbated by the readme and examples using it incorrectly (https://github.com/coder/websocket?tab=readme-ov-file#server).

We should at least update the readmes and examples, but I'll try to think about some alternative solutions as well.

Perhaps the most obvious way would be to put a context on websocket.Conn, but defining it's behavior would be a bit tricky.

Anyway, if you have anything else to add, how you wish this had worked instead, what lead you down the wrong path, etc, please feel free to share.

1 A valid use-case is when you're running the websocket logic in a separate goroutine, and want it to stop when the HTTP handler exits. Usually the logic runs in the HTTP handler, though, where using r.Context() is useless. (https://pkg.go.dev/net/http#Hijacker)

@lukaspj
Copy link
Author

lukaspj commented Aug 30, 2024

I think you pin-pointed it, if the readme and examples explained it, it would probably help.

Additionally, it would be nice if the function documentation would state this danger of closing the context. Something like "if the context is cancelled, it can cause a deadlock" or just "the context should be the context of the websocket not the context of the message being written". Idk if there's a nice way to write it tbh, but would be nice if it showed up on the documentation somewhere on "how to use this library".

@spikecurtis
Copy link

@mafredri I don't think a docs update alone will resolve this issue.

@lukaspj seems to be saying that if the context you pass to Write() gets canceled, it can leave the Websocket in a broken state (a lock that never gets released). While it's true that you probably don't want to use the r.Context() for calls to Write() in most cases, I would still consider it a bug if an ill-timed context timeout leaves the websocket in a broken state.

@mafredri
Copy link
Member

mafredri commented Sep 9, 2024

@spikecurtis I was thinking about the original report, but that's a good callout. We should split the second issue into a new ticket.

There may be multiple issues in how connection closure is propagated and context cancellation is handled. There seems to be multiple code-paths depending on scenario.

I also noticed while looking at #476 that a read/write context that is cancelled actually ends up closing the connection abruptly, whereas I would expect we just give up on the write.

@maggie44
Copy link
Contributor

maggie44 commented Nov 22, 2024

Hey @lukaspj, thanks for the user experience report.

You are definitely not the first! This is a very common mistake to make. There is one valid use-case for r.Context() after the connection being hijacked1, but in general it's almost never what you want.

I see this issue is further exacerbated by the readme and examples using it incorrectly (https://github.com/coder/websocket?tab=readme-ov-file#server).

We should at least update the readmes and examples, but I'll try to think about some alternative solutions as well.

Perhaps the most obvious way would be to put a context on websocket.Conn, but defining it's behavior would be a bit tricky.

Anyway, if you have anything else to add, how you wish this had worked instead, what lead you down the wrong path, etc, please feel free to share.

1 A valid use-case is when you're running the websocket logic in a separate goroutine, and want it to stop when the HTTP handler exits. Usually the logic runs in the HTTP handler, though, where using r.Context() is useless. (https://pkg.go.dev/net/http#Hijacker)

The HiJacker docs referenced note:

 The original Request's Context remains valid and
	// is not canceled until the Request's ServeHTTP method
	// returns.

This sounds to me like the context is still useful after a hijack, I'm not following why "in general it's almost never what you want"? Wanting to stop when the HTTP handler exits is probably a very common use case? And not too different to having a context created in the handler function.

In the PR that has been merged (https://github.com/coder/websocket/pull/477/files), it says:

// Set the context as needed. Use of r.Context() is not recommended
// to avoid surprising behavior (see http.Hijacker).

It doesn't go on to explain why r.Context() is not recommended, it doesn't explain what "surprising behaviour" is, and the http.Hijacker seems to suggest that r.Context() is valid after a Hijack. Could you clarify what is meant here?

If a request.Context or any other context is used in Write, it shouldn't lead to a broken websocket connection. If Write only accepts context.Background, is there any value in passing a context at all?

@mafredri
Copy link
Member

@maggie44 certainly. In my experience, a common use-case is that the websocket is managed within the http.Handler, thus when r.Context() is used, a common misconception is that the context will be cancelled when the other side disconnects, which is not the case.

In the aforementioned scenario, the r.Context() can't be utilized in any way and is mainly a dangerous gotcha.

That's not to say there aren't use-cases, one could imagine the following method on a service:

func (s *Server) handleWS(w http.ResponseWriter, r *http.Request) {
	conn, _ := websocket.Accept(s.ctx, r, nil) // <- Note, r.Context() would be wrong here because it will **never** cancel.
    s.connHandler <- conn
    <-s.ctx.Done()
}

I'd be interested to know if there's a relatively common scenario in which you would find the behavior useful?

@maggie44
Copy link
Contributor

maggie44 commented Nov 22, 2024

I think the behaviour described in the first post is what I would have expected:

failed to write msg: failed to close writer: failed to acquire lock: context canceled

c.Websocket.Write has used r.Context(). The websocket has been closed, all the functions have returned up the chain to the handler where the websocket was accepted which returns nil, which triggers a cancel on r.Context() ("The original Request's Context remains valid and is not canceled until the Request's ServeHTTP method returns"):

func handler(w http.ResponseWriter, r *http.Request) {
	websocket.Accept(... )
	// Doing things. 
	things(r.Context())
	// Returns nil
}

A c.Websocket.Write in the initial post is trying to write something to a connection that has closed, which we know because the handler has returned, so a context canceled error is returned through .Write. This context cancelled accurately reflects what happened, and can be handled gracefully. I.e. there isn't anyone to write the websocket message to anymore, so aborts.

Where it is problematic is in this scenario it doesn't release the lock, and it seems the Write hangs, but up until that stage it seems that r.Context() is doing what I would expect.

It sounds like r.Context() in the Hijack scenario is equivalent to:

func handler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	websocket.Accept(... )
	// Doing things. 
	things(ctx)
	// Returns nil
}

As you mentioned, this isn't what is expected in terms of the user disconnecting doesn't call a cancel, but is a useful way to attach multiple functions, process, particularly goroutines to the broader user connection. One of those things is communicating to writers and readers that nobody is here anymore to hear the requests? An example might be using the Ping() functionality in this package. I might start a goroutine that loops every 10 seconds, to call a ping for keep alive. I would pass r.Context() to this so that it will keep pinging until the connection is closed.

I think long story short, the Write should probably be able to accept r.Context() and the cancel it calls without blocking the websocket. And that r.Context() would be a better context to pass down than context.Background(), even if it isn't performing as it would in a non-hijack scenario.

@mafredri
Copy link
Member

mafredri commented Nov 24, 2024

There's nothing inherently wrong with passing the r.Context() other than being the source of logic bugs and misunderstanding the behavior. And you're absolutely right that a deadlock as described in this issue shouldn't happen. That part is still a bug and the reason this issue wasn't closed by the documentation changes. That bug is also unrelated to recommending against request context usage.

[...] but is a useful way to attach multiple functions, process, particularly goroutines to the broader user connection.

I will reluctantly say maybe, but all the request context says is that the HTTP handler returned and nothing else.

One of those things is communicating to writers and readers that nobody is here anymore to hear the requests?

This is incorrect and precisely the misconception the documentation changes is trying to address. The request context has no association to the connection after Hijack.

Consider the following modified (from example) echo server handler:

func (s echoServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c, err := websocket.Accept(w, r, &websocket.AcceptOptions{})
	if err != nil {
		s.logf("%v", err)
		return
	}

	go func() { // Observe, we return immediately after Accept from ServeHTTP, r.Context() is cancelled.
		defer c.CloseNow()

		l := rate.NewLimiter(rate.Every(time.Millisecond*100), 10)
		for {
			err = echo(c, l) // Using r.Context() here would break this implementation.
			if websocket.CloseStatus(err) == websocket.StatusNormalClosure {
				return
			}
			if err != nil {
				s.logf("failed to echo with %v: %v", r.RemoteAddr, err)
				return
			}
		}
	}()
}

A completely valid use-case where we don't and shouldn't care about r.Context() or what the HTTP handler does. This demonstrates that r.Context() doesn't tell us anything about the other end or the status of the connection.

@maggie44
Copy link
Contributor

maggie44 commented Nov 24, 2024

Interesting conversation, and I think this thread is turning in to exactly what is needed, a place for people to read through that flushes out the different scenarios.

The request context has no association to the connection after Hijack.

The request context (r.Context()) after Hijack, and after return from ServeHTTP is cancelled. Which passes the context canceled to the various write/writer and read/readers, or anything else it is passed to. It doesn't close the connection, but indicates to anything that is using it that ServeHTTP has returned and therefore the function is returned and likely should be closed. Is this not valuable behaviour?

Perhaps I have not been using ServeHTTP as others have, but I have something in ServeHTTP blocking for the life of the websocket connection to mimic HTTP behaviour. I presume others do too, as defer c.CloseNow() is in ServeHTTP in all the examples.

@mafredri
Copy link
Member

mafredri commented Nov 24, 2024

Interesting conversation, and I think this thread is turning in to exactly what is needed, a place for people to read through that flushes out the different scenarios.

Glad to hear it 👍🏻

[...] It doesn't close the connection, but indicates to anything that is using it that ServeHTTP has returned and therefore the connection is closed. Is this not valuable behaviour?

Depends on the author, I could make up use-cases but I don't think that'll add anything to the conversation. Rather I'd like to see it proved useful enough to warrant changing the recommendation from against using it. (Am I right to understand that's what's being discussed here?)

My take is that it's useful maybe 1/3 times (but rare), in 1/3 times it's harmful and in 1/3 times it'll prevent you from progressing until you figure it out (or cause unexpected behaviors depending on when the handler exits). So if someone wants to use it, they should really know what they're doing and preferably have a deep understanding of the behavior. Ultimately the recommendation is just that, a recommendation. Users are free to ignore it.

@maggie44
Copy link
Contributor

Rather I'd like to see it proved useful enough to warrant changing the recommendation from against using it. (Am I right to understand that's what's being discussed here?)

1 part thinking about what the recommendation should be, 2 parts checking over my code to make sure it has been doing what I thought it had been doing all along 😆 .

defer c.CloseNow() <-- this in the below closes the connection on exit of the function.

So r.Context() would indeed reflect that the connection is closed.

http.HandlerFunc(func (w http.ResponseWriter, r *http.Request) {
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		// ...
	}
	defer c.CloseNow()

	// Set the context as needed. Use of r.Context() is not recommended
	// to avoid surprising behavior (see http.Hijacker).
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
	defer cancel()

	var v interface{}
	err = wsjson.Read(ctx, c, &v)
	if err != nil {
		// ...
	}

	log.Printf("received: %v", v)

	c.Close(websocket.StatusNormalClosure, "")
})

If defer c.CloseNow() is a recommendation then perhaps r.Context() is still valid.

I think I'm happy with my implementation now at least, and learnt a few things about the Hijack behaviour. I guess we will see how people decide to implement. 👍

@mafredri
Copy link
Member

mafredri commented Nov 24, 2024

1 part thinking about what the recommendation should be, 2 parts checking over my code to make sure it has been doing what I thought it had been doing all along 😆 .

Haha, alright. 😄

So r.Context() would indeed reflect that the connection is closed.

Trying to read or write to the connection will tell you if the connection has closed irrespective of the request context. The bigger question is, where would you even find use for the request context as a signal the connection may have been closed? That question is rhetorical though, I don't expect an answer 😄

If defer c.CloseNow() is a recommendation then perhaps r.Context() is still valid.

I wouldn't say those two are necessarily related, i.e. defer doesn't prove the validity of the request context. For instance, the code you shared is a traditional use-case, perhaps the most common one. All logic is in the handler and executes synchronously. From this point of view, the context will never cancel. Because by the time it does it's irrelevant and we've already stopped all work and the connection is closed.

I think I'm happy with my implementation now at least, and learnt a few things about the Hijack behaviour. I guess we will see how people decide to implement. 👍

Great to hear, and good luck with evolving your implementation! 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants