
reverseproxy: streaming timeouts #5567

Merged: 5 commits merged into caddyserver:master from the ws-close-timeouts branch on Jun 19, 2023

Conversation

Contributor

@mmm444 mmm444 commented Jun 8, 2023

This PR adds two settings related to streaming connections to the reverse_proxy handler:

stream_timeout defines the maximum lifetime of a streaming connection in the reverse proxy; when the connection reaches this age, it is closed
stream_close_delay defines how long the proxy waits before closing streaming connections when the handler is cleaned up, i.e. after a configuration change
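
For example, a minimal Caddyfile sketch of how these options could be set (the upstream address and the durations are just placeholders):

reverse_proxy localhost:8080 {
	stream_timeout 24h
	stream_close_delay 5m
}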

#5471


CLAassistant commented Jun 8, 2023

CLA assistant check
All committers have signed the CLA.

@francislavoie francislavoie added the feature ⚙️ New feature or request label Jun 8, 2023
@francislavoie francislavoie added this to the v2.7.1 milestone Jun 8, 2023
@francislavoie
Member

Whoops, looks like there's a conflict. Could you rebase and fix it?

@mmm444 mmm444 force-pushed the ws-close-timeouts branch from fab2ddc to 84d4844 on June 8, 2023 21:10
Member

@mholt mholt left a comment

Thank you for working on this, we're really excited to get something like this merged soon.

I just did a quick first pass on the code -- let me know what you think, or if you have any questions!

	connections           map[io.ReadWriteCloser]openConnection
	connectionsMu         *sync.Mutex
	connectionsCloseTimer **time.Timer
Member

You must be a C programmer 😃

Can we get this down to a *time.Timer? As far as I can tell, we never use connectionsCloseTimer directly; it's always *connectionsCloseTimer.

Contributor Author

The reason for this is the same as the reason there is a *sync.Mutex and not just sync.Mutex. As I found during debugging, there are multiple instances of the Handler struct floating around at runtime, and I needed to synchronize whether there is a *time.Timer among all of them, or at least some of them. Specifically, the registration and the cleanup code are called on different instances of Handler. Why is this so? Is it an unfortunately leaked reference from a non-pointer receiver method, or is there some other reason?

And no, although I can write some C I would not call myself a C programmer. 😄

Member

They should be on the same instance. The only reason I can think of why it would be different is that some of the methods don't have pointer receivers (i.e. some methods are (h Handler) because, at least currently, they don't intend to modify the handler, so this acts as a kind of safeguard). That h will be different from the one in a method with a pointer receiver, i.e. (h *Handler).
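
As a rough standalone illustration of the receiver semantics in play (this is not Caddy's code; handler, timer, and count are made-up names): a value receiver operates on a copy of the struct, so plain field writes are lost, while a field that is itself a shared pointer still refers to the same underlying slot from every copy:

package main

import "fmt"

// handler is a stand-in for a struct like reverseproxy.Handler.
type handler struct {
	timer **int // double pointer: every copy of handler shares the same slot
	count int
}

// byValue has a value receiver, so it operates on a copy of the struct:
// the write to h.count is lost when it returns, but writing through
// *h.timer is visible to all copies, because they hold the same outer pointer.
func (h handler) byValue(v int) {
	h.count = v   // modifies the copy only
	*h.timer = &v // modifies the shared slot
}

// byPointer has a pointer receiver, so plain field writes stick.
func (h *handler) byPointer(v int) {
	h.count = v
}

func main() {
	var slot *int
	h := handler{timer: &slot}

	h.byValue(1)
	fmt.Println(h.count, **h.timer) // 0 1: count write lost, shared slot updated

	h.byPointer(2)
	fmt.Println(h.count) // 2
}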

Contributor Author

The only reason I can think of why it would be different is because some of the methods don't have pointer receivers

This is exactly the cause. I have looked into this some more and found that one of the two code paths dealing with the timer goes through some methods that don't have pointer receivers. It is the cleanup path, and it looks like this:

(*Handler).registerConnection.func1
(Handler).handleUpgradeResponse
(Handler).finalizeResponse
(*Handler).reverseProxy
(*Handler).proxyLoopIteration
(*Handler).ServeHTTP
...

If I change finalizeResponse and handleUpgradeResponse to also have pointer receivers, we can get rid of the double pointer. Do you think that is the right way to do it?

Member

Yes! Absolutely.

modules/caddyhttp/reverseproxy/reverseproxy.go (review comment marked outdated and resolved)
Comment on lines 409 to 415
	// this is potentially blocking while we have the lock on the connections
	// map, but that should be OK since the server has in theory shut down
	// and we are no longer using the connections map
	gracefulErr := oc.gracefulClose()
	if gracefulErr != nil && err == nil {
		err = gracefulErr
	}
Member

Slightly nervous about this, but maybe we can revisit this one more time before merge.

Contributor Author

Me too, OK.

@mholt mholt added the under review 🧐 Review is pending before merging label Jun 15, 2023
@francislavoie
Member

Thanks a ton for finishing this off for me! The timer/channel bits are not a strength of mine. I don't have much to offer in terms of review but it's looking pretty good to me! 😊

@mholt
Member

mholt commented Jun 19, 2023

Thanks!

I think I want to try this in beta 2, and I don't want to wait much longer on that. AFAICT this change looks alright to me for a beta, but would you be alright if we work on it more after merging? I want to see what we can do about that double pointer, and also verify a few logical things. Oh, and we should probably mark this as experimental so we can change it later, before the final tag.

But I think there's value in having this for people to try right away.

@mholt mholt removed the under review 🧐 Review is pending before merging label Jun 19, 2023
@mmm444
Contributor Author

mmm444 commented Jun 19, 2023

I think I want to try this in beta 2, and I don't want to wait much longer on that. AFAICT this change looks alright to me for a beta, but would you be alright if we work on it more after merging?

Yes. No problem.

I want to see what we can do about that double pointer, and also verify a few logical things.

I have outlined a solution in the comment above.

So just 2 questions from me:

  1. Shall I commit the removal of the double pointer before you merge it? I already have the commit ready, tested a little bit and it seems to be working. I just don't want to spontaneously change the code under review.
  2. I have noticed in the contributors guide that you prefer the PR to be squashed into one commit. Shall I do that?

@mholt
Member

mholt commented Jun 19, 2023

@mmm444

Shall I commit the removal of the double pointer before you merge it? I already have the commit ready, tested a little bit and it seems to be working. I just don't want to spontaneously change the code under review.

Yes, please! Then I will merge this.

I have noticed in the contributors guide that you prefer the PR to be squashed into one commit. Shall I do that?

Nowadays, GitHub has a button that does this, so no need for you to do it. :)

Thank you for working on this!

Let's continue any adjustments here after beta 2 is released in the next day or so.

@mholt mholt merged commit 424ae0f into caddyserver:master Jun 19, 2023
@francislavoie francislavoie modified the milestones: v2.7.1, v2.7.0 Jun 19, 2023
@mmm444 mmm444 deleted the ws-close-timeouts branch June 20, 2023 08:15
@ghost

ghost commented Jul 6, 2023

Perhaps I'm doing something wrong to test this?

I've tried adding stream_close_delay 3 or stream_close_delay 30 to a reverse_proxy block and when I reload my config, I still see my websocket connection die (edit: immediately) and then get recreated.

Am I missing something? Am I misunderstanding something?

@francislavoie
Member

francislavoie commented Jul 6, 2023

30 means 30 nanoseconds. Try 30s instead for 30 seconds. See https://caddyserver.com/docs/conventions#durations

@ghost

ghost commented Jul 6, 2023

30 means 30 nanoseconds. Try 30s instead for 30 seconds. See https://caddyserver.com/docs/conventions#durations

D'oh!

Unfortunately, switching to 30s instead still yields the same result.

@mholt
Member

mholt commented Jul 6, 2023

@xnaas Can you verify that you are running the correct Caddy build and config together?

@mmm444 Any ideas on this? @xnaas, maybe if you share your config, @mmm444 could help.

@ghost

ghost commented Jul 6, 2023

$ docker exec -it caddy caddy version
v2.7.0-beta.2 h1:jaS1odoRuDR2W8igaKgVGvVjhTNt8xfoz3YPC4bcenA=

I have a snippet like:

(proxyheaders) {
	flush_interval -1
	stream_close_delay 30s
	header_up X-Real-IP {remote_host}
	header_up X-Fake-Test "lul"
}

I have an https://*.domain.com {} block with multiple matchers like @something host something.domain.com. Below that is something like:

	handle @thelounge {
		reverse_proxy http://thelounge:9000 {
			import proxyheaders
		}
	}

I have the header_up X-Fake-Test "lul" line up above so I can toggle it with a comment, to verify I'm actually reloading the config each time.

Edit: I've tested this with multiple services I use that use websockets. It's not just The Lounge. If I leave the network logs open in Firefox, I can see the websockets die almost immediately and then get recreated.

@mholt
Member

mholt commented Jul 6, 2023

Thanks. @mmm444 do you have a chance to help out with this?

@mmm444
Contributor Author

mmm444 commented Jul 6, 2023

I would really like to help, but I am on vacation right now and testing this remotely is beyond what I can manage. I can get to this next Thursday.

The situation seems to me like the basic use case of this feature, which I have checked multiple times. The only difference is that I have never used the Caddyfile, only JSON. Maybe there is a bug in the adaptation code? I admit I didn't check that.

@mholt
Member

mholt commented Jul 6, 2023

Enjoy your vacation 😊 Thanks for replying.

I just did a quick overview of the Caddyfile code, and it looks like it should be fine. 🤔 Maybe @xnaas could verify that the JSON is correct (use caddy adapt --pretty to inspect).

Hopefully we can figure this out when you get back then!

@ghost

ghost commented Jul 7, 2023

Is this more or less what it should look like?

                    {
                      "group": "group62",
                      "handle": [
                        {
                          "handler": "subroute",
                          "routes": [
                            {
                              "handle": [
                                {
                                  "flush_interval": -1,
                                  "handler": "reverse_proxy",
                                  "headers": {
                                    "request": {
                                      "set": {
                                        "X-Fake-Test": [
                                          "lul"
                                        ],
                                        "X-Real-Ip": [
                                          "{http.request.remote.host}"
                                        ]
                                      }
                                    }
                                  },
                                  "stream_close_delay": 30000000000,
                                  "upstreams": [
                                    {
                                      "dial": "thelounge:9000"
                                    }
                                  ]
                                }
                              ]
                            }
                          ]
                        }
                      ],
                      "match": [
                        {
                          "host": [
                            "tl.asak.gg"
                          ]
                        }
                      ]
                    },

Edit: Wow...that's...super indented lol.

Edit 2: Less indented. :P

@mmm444
Contributor Author

mmm444 commented Jul 14, 2023

Ok. I have looked into this and it is caused by the flush_interval -1 setting. Without it, it seems to work. 😁

Setting flush_interval to -1 causes the upstream (proxy-to-server) connections to get bound to the context of the whole reverse proxy handler (h.ctx) via ignoreClientGoneContext instead of to the client connection, and this context terminates when the configuration is reloaded. Closing the h.ctx context in turn terminates one of the websocket copy goroutines, which effectively closes the websocket connection before any timeout can take place.

I am not sure yet what the best fix is here. I will try to come up with something in the following days.

If someone wants to play with the situation, here is a minimal websocket server that can be used to reproduce the reported behavior. https://gist.github.com/mmm444/efc3e25fbbb0056f9c759d6dac3f65f0

Shall I file a bug about it or is it ok to continue the discussion here?

@francislavoie
Member

Fascinating.

Coincidentally, Go 1.21 is introducing a new context.WithoutCancel() function. I think that would be helpful in this case to allow WS connections to outlive the context. We can't use it quite yet though (probably for another 6 months to a year) because Caddy's minimum Go version will need to be increased to match.

I think the best thing to do for now is probably to add a warning log when these two config options are combined to mention that it might not work properly, unless you can figure out some other creative fix. We can open an issue as a reminder to implement WithoutCancel when we have the Go version to do it.
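
For reference, a minimal standalone sketch of the Go 1.21 behavior being referred to (context.WithoutCancel keeps the parent's values but drops its cancellation, so it would need Go 1.21 as the minimum version):

package main

import (
	"context"
	"fmt"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())

	// detached keeps parent's values but never observes parent's
	// cancellation, so work tied to it can outlive e.g. a config reload.
	detached := context.WithoutCancel(parent)

	cancel()

	fmt.Println(parent.Err())   // context.Canceled
	fmt.Println(detached.Err()) // <nil>, unaffected by the cancellation
}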

@mholt
Member

mholt commented Jul 14, 2023

@mmm444 Thanks for coming back to this!

Francis pointed out to me in Slack that this is why that happens:

	// if FlushInterval is explicitly configured to -1 (i.e. flush continuously to achieve
	// low-latency streaming), don't let the transport cancel the request if the client
	// disconnects: user probably wants us to finish sending the data to the upstream
	// regardless, and we should expect client disconnection in low-latency streaming
	// scenarios (see issue #4922)
	if h.FlushInterval == -1 {
		req = req.WithContext(ignoreClientGoneContext{req.Context(), h.ctx.Done()})
	}

That said, I feel like when the server flushes data should be orthogonal to when the connection is hung up.

Is there, like, a done chan we could give it that never gets closed? Or doesn't get closed until the stream_close_delay is up?
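
Just to make that idea concrete, a wrapper that reports done only when a channel of our choosing is closed could look roughly like this. It is a standalone sketch loosely modeled on the ignoreClientGoneContext quoted above, not the actual implementation; a nil channel would mean "never done", and a timer-closed channel would mean "done once stream_close_delay is up":

package main

import (
	"context"
	"fmt"
	"time"
)

// detachedContext keeps the values of the embedded context but reports
// "done" only when the supplied channel is closed. With a nil channel it
// never reports done, since a receive from a nil channel blocks forever.
type detachedContext struct {
	context.Context
	done <-chan struct{}
}

func (c detachedContext) Done() <-chan struct{} { return c.done }

func (c detachedContext) Err() error {
	select {
	case <-c.done:
		return context.Canceled
	default:
		return nil
	}
}

func main() {
	// A done channel that closes only after a delay, standing in for
	// "don't hang up until stream_close_delay is up".
	done := make(chan struct{})
	time.AfterFunc(2*time.Second, func() { close(done) })

	ctx := detachedContext{Context: context.Background(), done: done}
	<-ctx.Done()
	fmt.Println(ctx.Err()) // context.Canceled, after ~2s
}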

@ghost

ghost commented Jul 15, 2023

Perhaps this is on me for using flush_interval -1 for all of my reverse proxies. I guess Plex is really the only one that specifically benefits from this.

Edit: Just editing to say that I shifted my Caddyfile around a bit and now Plex is the only one that has flush_interval -1. I can confirm that stream_close_delay 30s is working as expected now.

@francislavoie
Member

I'm thinking we should add a condition to that if h.FlushInterval == -1 { check to skip that behaviour for WebSocket requests. I can't think of a situation where it would be beneficial for WebSockets; it seems like it would only be harmful. It could make sense to configure flush_interval -1 for a single backend which does both streaming and WebSockets, but we'd want to ignore that option for any WebSocket requests.
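
A rough sketch of what such a check could look like; the isWebSocketUpgrade helper below is made up for illustration (Caddy may already have an internal equivalent), and it uses a simplified reading of the Connection and Upgrade headers:

package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isWebSocketUpgrade reports whether the request asks to upgrade the
// connection to a WebSocket, based on the standard Connection and Upgrade
// headers (simplified: a full check would parse the comma-separated list).
func isWebSocketUpgrade(r *http.Request) bool {
	return strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade") &&
		strings.EqualFold(r.Header.Get("Upgrade"), "websocket")
}

func main() {
	req, _ := http.NewRequest("GET", "http://example.com/ws", nil)
	req.Header.Set("Connection", "Upgrade")
	req.Header.Set("Upgrade", "websocket")
	fmt.Println(isWebSocketUpgrade(req)) // true
}

The existing guard could then become something like if h.FlushInterval == -1 && !isWebSocketUpgrade(req) { ... }, if that turns out to be the right place for it.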

@mmm444
Contributor Author

mmm444 commented Jul 17, 2023

I have been thinking about this over the weekend and have come to the conclusion that the correct-for-now (tm) solution is to not use the h.ctx context for shutting down any connections when h.FlushInterval == -1. It should be achievable using something like the WithoutCancel feature from Go 1.21. The reasons for this are:

  • Ongoing downloads/uploads will not get terminated upon config reloads. I find it a bit surprising, maybe even buggy, that it happens now. It does not happen when h.FlushInterval != -1.
  • The timeouts on websockets will get fixed.
  • From the users' standpoint, flush-interval tuning has nothing to do with websocket timeouts, so it is better not to leak the implementation details (about configuration reloading) to them.

I can submit a PR if you want.

@francislavoie
Member

Yeah, PR certainly welcome.

I don't think we can use WithoutCancel yet like I said earlier because we'd need to have Go 1.21 in go.mod as the minimum version (and obviously because Go 1.21 isn't out yet).

@mholt
Member

mholt commented Jul 18, 2023

@mmm444 Thanks for the careful thought. I agree with you. I initially used that context because if the client canceling the request doesn't close the connection, we still had to ensure it closed at some point: a config reload seemed like the right time. But it's really not, like you pointed out.

I just hope it won't lead to leaking resources. I guess if it's just proxying the stream, it's up to the client or the backend to actually close them.

@mholt
Member

mholt commented Jul 26, 2023

@mmm444 Would you still like to open a PR? If not, I can take a stab at it, but you're more attuned to this than I am currently 😃

We just need to do it without WithoutCancel for now.

@mholt
Member

mholt commented Jul 26, 2023

What about this as a simple patch (two more conditions on the if statement)?

	if h.FlushInterval == -1 && h.StreamCloseDelay == 0 && h.StreamTimeout == 0 {
		req = req.WithContext(ignoreClientGoneContext{req.Context(), h.ctx.Done()})
	}

Since if the close delay or timeout are set, the connection will definitely be closed at that point, right? We just need to ensure the connection closes eventually otherwise.

@mmm444
Contributor Author

mmm444 commented Jul 31, 2023

@mmm444 Would you still like to open a PR? If not, I can take a stab at it, but you're more attuned to this than I am currently 😃

Yes. I am looking into this right now. Will submit a PR with some explanation in a few hours, hopefully. 😄 Sorry for the delay, I have been away from the internet once again.

@mholt
Member

mholt commented Jul 31, 2023

Awesome, thank you @mmm444 !

@berkant

berkant commented Sep 25, 2023

So, should I put an incredibly long duration for stream_close_delay to keep WebSocket connections alive after reloading the config? My use case involves configuring Caddy on the fly to add/remove reverse_proxy routes to various WebSocket upstreams through the Admin API (http://localhost:2019/config/), but for the time being, all connections break after doing this.

@francislavoie
Member

Yeah, you should set it to some reasonable amount of time, depending on the usual session length of your users. But you should definitely make sure to have proper reconnect logic on the client side, so that even if a connection is closed, it reconnects cleanly.
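
As a sketch of what that reconnect logic can look like for a Go client (a browser client would do the analogous thing in JavaScript; gorilla/websocket and the URL here are just example choices):

package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	url := "wss://example.com/socket" // placeholder upstream URL

	// Reconnect loop: if the proxy closes the connection (e.g. once
	// stream_close_delay expires after a config reload), wait briefly
	// and dial again.
	for {
		conn, _, err := websocket.DefaultDialer.Dial(url, nil)
		if err != nil {
			log.Printf("dial failed: %v; retrying in 2s", err)
			time.Sleep(2 * time.Second)
			continue
		}
		// Read until the connection dies, then loop around and reconnect.
		for {
			if _, _, err := conn.ReadMessage(); err != nil {
				log.Printf("connection closed: %v; reconnecting", err)
				break
			}
		}
		conn.Close()
	}
}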
