
reverseproxy: streaming timeouts #5567

Merged: 5 commits merged into caddyserver:master from the ws-close-timeouts branch on Jun 19, 2023

Conversation

Contributor

@mmm444 mmm444 commented Jun 8, 2023

This PR adds two settings related to streaming connections to the reverse_proxy handler:

stream_timeout defines the maximum lifetime of a streaming connection in the reverse proxy; when the connection reaches this age, it is closed
stream_close_delay defines how long the proxy waits before closing streaming connections when the handler is cleaned up, i.e. after a configuration change
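
For example, a minimal Caddyfile sketch of how these options could be set (the upstream address and the durations are just placeholders):

reverse_proxy localhost:8080 {
	stream_timeout 24h
	stream_close_delay 5m
}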

#5471


CLAassistant commented Jun 8, 2023

CLA assistant check
All committers have signed the CLA.

@francislavoie francislavoie added the feature ⚙️ New feature or request label Jun 8, 2023
@francislavoie francislavoie added this to the v2.7.1 milestone Jun 8, 2023
@francislavoie
Member

Whoops, looks like there's a conflict. Could you rebase and fix it?

@mmm444 mmm444 force-pushed the ws-close-timeouts branch from fab2ddc to 84d4844 on June 8, 2023 21:10
Member

@mholt mholt left a comment

Thank you for working on this, we're really excited to get something like this merged soon.

I just did a quick first pass on the code -- let me know what you think, or if you have any questions!

	connections           map[io.ReadWriteCloser]openConnection
	connectionsMu         *sync.Mutex
	connectionsCloseTimer **time.Timer
Member

You must be a C programmer 😃

Can we get this down to a *time.Timer? As far as I can tell, we never use connectionsCloseTimer directly; it's always *connectionsCloseTimer.

Contributor Author

The reason for this is the same as the reason there is a *sync.Mutex and not just sync.Mutex. As I found during debugging, there are multiple instances of the Handler struct floating around at runtime, and I needed to synchronize whether there is a *time.Timer among all of them, or at least some of them. Specifically, the registration and the cleanup code are called on different instances of Handler. Why is this so? Is it an unfortunately leaked reference from a non-pointer receiver method, or is there some other reason?

And no, although I can write some C I would not call myself a C programmer. 😄

Member

They should be on the same instance. The only reason I can think of why it would be different is that some of the methods don't have pointer receivers (i.e. some methods are (h Handler) because, at least currently, they don't intend to modify the handler, so this acts as a kind of safeguard). That h will be different from the one in a method with a pointer receiver, i.e. (h *Handler).
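
As a rough standalone illustration of the receiver semantics in play (this is not Caddy's code; handler, timer, and count are made-up names): a value receiver operates on a copy of the struct, so plain field writes are lost, while a field that is itself a shared pointer still refers to the same underlying slot from every copy:

package main

import "fmt"

// handler is a stand-in for a struct like reverseproxy.Handler.
type handler struct {
	timer **int // double pointer: every copy of handler shares the same slot
	count int
}

// byValue has a value receiver, so it operates on a copy of the struct:
// the write to h.count is lost when it returns, but writing through
// *h.timer is visible to all copies, because they hold the same outer pointer.
func (h handler) byValue(v int) {
	h.count = v   // modifies the copy only
	*h.timer = &v // modifies the shared slot
}

// byPointer has a pointer receiver, so plain field writes stick.
func (h *handler) byPointer(v int) {
	h.count = v
}

func main() {
	var slot *int
	h := handler{timer: &slot}

	h.byValue(1)
	fmt.Println(h.count, **h.timer) // 0 1: count write lost, shared slot updated

	h.byPointer(2)
	fmt.Println(h.count) // 2
}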

Contributor Author

The only reason I can think of why it would be different is because some of the methods don't have pointer receivers

This is exactly the cause. I have looked into this some more and found that one of the two code paths dealing with the timer goes through some methods that don't have pointer receivers. It is the cleanup path, and it looks like this:

(*Handler).registerConnection.func1
(Handler).handleUpgradeResponse
(Handler).finalizeResponse
(*Handler).reverseProxy
(*Handler).proxyLoopIteration
(*Handler).ServeHTTP
...

If I change finalizeResponse and handleUpgradeResponse to also have pointer receivers, we can get rid of the double pointer. Do you think that is the right way to do it?

Member

Yes! Absolutely.

modules/caddyhttp/reverseproxy/reverseproxy.go (review comment marked outdated and resolved)
Comment on lines 409 to 415
	// this is potentially blocking while we have the lock on the connections
	// map, but that should be OK since the server has in theory shut down
	// and we are no longer using the connections map
	gracefulErr := oc.gracefulClose()
	if gracefulErr != nil && err == nil {
		err = gracefulErr
	}
Member

Slightly nervous about this, but maybe we can revisit this one more time before merge.

Contributor Author

Me too, OK.

@mholt mholt added the under review 🧐 Review is pending before merging label Jun 15, 2023
@francislavoie
Member

Thanks a ton for finishing this off for me! The timer/channel bits are not a strength of mine. I don't have much to offer in terms of review but it's looking pretty good to me! 😊

@mholt
Member

mholt commented Jun 19, 2023

Thanks!

I think I want to try this in beta 2, and I don't want to wait much longer on that. AFAICT this change looks alright to me for a beta, but would you be alright if we work on it more after merging? I want to see what we can do about that double pointer, and also verify a few logical things. Oh, and we should probably mark this as experimental so we can change it later, before the final tag.

But I think there's value in having this for people to try right away.

@mholt mholt removed the under review 🧐 Review is pending before merging label Jun 19, 2023
@mmm444
Contributor Author

mmm444 commented Jun 19, 2023

I think I want to try this in beta 2, and I don't want to wait much longer on that. AFAICT this change looks alright to me for a beta, but would you be alright if we work on it more after merging?

Yes. No problem.

I want to see what we can do about that double pointer, and also verify a few logical things.

I have outlined a solution in the comment above.

So just 2 questions from me:

  1. Shall I commit the removal of the double pointer before you merge it? I already have the commit ready, tested a little bit and it seems to be working. I just don't want to spontaneously change the code under review.
  2. I have noticed in the contributors guide that you prefer the PR to be squashed into one commit. Shall I do that?

@mholt
Member

mholt commented Jun 19, 2023

@mmm444

Shall I commit the removal of the double pointer before you merge it? I already have the commit ready, tested a little bit and it seems to be working. I just don't want to spontaneously change the code under review.

Yes, please! Then I will merge this.

I have noticed in the contributors guide that you prefer the PR to be squashed into one commit. Shall I do that?

Nowadays, GitHub has a button that does this, so no need for you to do it. :)

Thank you for working on this!

Let's continue any adjustments here after beta 2 is released in the next day or so.

@mholt mholt merged commit 424ae0f into caddyserver:master Jun 19, 2023
@francislavoie francislavoie modified the milestones: v2.7.1, v2.7.0 Jun 19, 2023
@mmm444 mmm444 deleted the ws-close-timeouts branch June 20, 2023 08:15
@ghost

ghost commented Jul 6, 2023

Perhaps I'm doing something wrong to test this?

I've tried adding stream_close_delay 3 or stream_close_delay 30 to a reverse_proxy block and when I reload my config, I still see my websocket connection die (edit: immediately) and then get recreated.

Am I missing something? Am I misunderstanding something?

@francislavoie
Member

francislavoie commented Jul 6, 2023

30 means 30 nanoseconds. Try 30s instead for 30 seconds. See https://caddyserver.com/docs/conventions#durations

@ghost

ghost commented Jul 6, 2023

30 means 30 nanoseconds. Try 30s instead for 30 seconds. See https://caddyserver.com/docs/conventions#durations

D'oh!

Unfortunately, switching to 30s instead still yields the same result.

@mholt
Member

mholt commented Jul 6, 2023

@xnaas Can you verify that you are running the correct Caddy build and config together?

@mmm444 Any ideas on this? @xnaas, maybe if you share your config, @mmm444 could help.

@ghost

ghost commented Jul 6, 2023

$ docker exec -it caddy caddy version
v2.7.0-beta.2 h1:jaS1odoRuDR2W8igaKgVGvVjhTNt8xfoz3YPC4bcenA=

I have a snippet like:

(proxyheaders) {
	flush_interval -1
	stream_close_delay 30s
	header_up X-Real-IP {remote_host}
	header_up X-Fake-Test "lul"
}

I have an https://*.domain.com {} block with multiple matchers like @something host something.domain.com. Below that is something like:

	handle @thelounge {
		reverse_proxy http://thelounge:9000 {
			import proxyheaders
		}
	}

I have the header_up X-Fake-Test "lul" line up above so I can toggle it with a comment, to verify I'm actually reloading the config each time.

Edit: I've tested this with multiple services I use that use websockets. It's not just The Lounge. If I leave the network logs open in Firefox, I can see the websockets die almost immediately and then get recreated.

@mholt
Member

mholt commented Jul 6, 2023

Thanks. @mmm444 do you have a chance to help out with this?

@mmm444
Contributor Author

mmm444 commented Jul 6, 2023

I would really like to help, but I am on vacation right now and testing this remotely is beyond what I can manage. I can get to this next Thursday.

The situation seems to me like the basic use case of this feature, which I have checked multiple times. The only difference is that I have never used the Caddyfile, only JSON. Maybe there is a bug in the adaptation code? I admit I didn't check that.

@mholt
Member

mholt commented Jul 6, 2023

Enjoy your vacation 😊 Thanks for replying.

I just did a quick overview of the Caddyfile code, and it looks like it should be fine. 🤔 Maybe @xnaas could verify that the JSON is correct (use caddy adapt --pretty to inspect).

Hopefully we can figure this out when you get back then!

@ghost

ghost commented Jul 7, 2023

Is this more or less what it should look like?

                    {
                      "group": "group62",
                      "handle": [
                        {
                          "handler": "subroute",
                          "routes": [
                            {
                              "handle": [
                                {
                                  "flush_interval": -1,
                                  "handler": "reverse_proxy",
                                  "headers": {
                                    "request": {
                                      "set": {
                                        "X-Fake-Test": [
                                          "lul"
                                        ],
                                        "X-Real-Ip": [
                                          "{http.request.remote.host}"
                                        ]
                                      }
                                    }
                                  },
                                  "stream_close_delay": 30000000000,
                                  "upstreams": [
                                    {
                                      "dial": "thelounge:9000"
                                    }
                                  ]
                                }
                              ]
                            }
                          ]
                        }
                      ],
                      "match": [
                        {
                          "host": [
                            "tl.asak.gg"
                          ]
                        }
                      ]
                    },

Edit: Wow...that's...super indented lol.

Edit 2: Less indented. :P

@mmm444
Contributor Author

mmm444 commented Jul 14, 2023

Ok. I have looked into this and it is caused by the flush_interval -1 setting. Without it, it seems to work. 😁

Setting flush_interval to -1 causes the upstream (proxy-to-server) connections to get bound to the context of the whole reverse proxy handler (h.ctx) via ignoreClientGoneContext instead of to the client connection, and this context terminates when the configuration is reloaded. Closing the h.ctx context in turn terminates one of the websocket copy goroutines, which effectively closes the websocket connection before any timeout can take place.

I am not sure yet what the best fix is here. I will try to come up with something in the following days.

If someone wants to play with the situation, here is a minimal websocket server that can be used to reproduce the reported behavior. https://gist.github.com/mmm444/efc3e25fbbb0056f9c759d6dac3f65f0

Shall I file a bug about it or is it ok to continue the discussion here?

@francislavoie
Member

Fascinating.

Coincidentally, Go 1.21 is introducing a new context.WithoutCancel() function. I think that would be helpful in this case to allow WS connections to outlive the context. We can't use it quite yet though (probably for another 6 months to a year) because Caddy's minimum Go version will need to be increased to match.

I think the best thing to do for now is probably to add a warning log when these two config options are combined to mention that it might not work properly, unless you can figure out some other creative fix. We can open an issue as a reminder to implement WithoutCancel when we have the Go version to do it.
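
For reference, a minimal standalone sketch of the Go 1.21 behavior being referred to (context.WithoutCancel keeps the parent's values but drops its cancellation, so it would need Go 1.21 as the minimum version):

package main

import (
	"context"
	"fmt"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())

	// detached keeps parent's values but never observes parent's
	// cancellation, so work tied to it can outlive e.g. a config reload.
	detached := context.WithoutCancel(parent)

	cancel()

	fmt.Println(parent.Err())   // context.Canceled
	fmt.Println(detached.Err()) // <nil>, unaffected by the cancellation
}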

@mholt
Member

mholt commented Jul 14, 2023

@mmm444 Thanks for coming back to this!

Francis pointed out to me in Slack that this is why that happens:

	// if FlushInterval is explicitly configured to -1 (i.e. flush continuously to achieve
	// low-latency streaming), don't let the transport cancel the request if the client
	// disconnects: user probably wants us to finish sending the data to the upstream
	// regardless, and we should expect client disconnection in low-latency streaming
	// scenarios (see issue #4922)
	if h.FlushInterval == -1 {
		req = req.WithContext(ignoreClientGoneContext{req.Context(), h.ctx.Done()})
	}

That said, I feel like when the server flushes data should be orthogonal to when the connection is hung up.

Is there, like, a done chan we could give it that never gets closed? Or doesn't get closed until the stream_close_delay is up?
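
Just to make that idea concrete, a wrapper that reports done only when a channel of our choosing is closed could look roughly like this. It is a standalone sketch loosely modeled on the ignoreClientGoneContext quoted above, not the actual implementation; a nil channel would mean "never done", and a timer-closed channel would mean "done once stream_close_delay is up":

package main

import (
	"context"
	"fmt"
	"time"
)

// detachedContext keeps the values of the embedded context but reports
// "done" only when the supplied channel is closed. With a nil channel it
// never reports done, since a receive from a nil channel blocks forever.
type detachedContext struct {
	context.Context
	done <-chan struct{}
}

func (c detachedContext) Done() <-chan struct{} { return c.done }

func (c detachedContext) Err() error {
	select {
	case <-c.done:
		return context.Canceled
	default:
		return nil
	}
}

func main() {
	// A done channel that closes only after a delay, standing in for
	// "don't hang up until stream_close_delay is up".
	done := make(chan struct{})
	time.AfterFunc(2*time.Second, func() { close(done) })

	ctx := detachedContext{Context: context.Background(), done: done}
	<-ctx.Done()
	fmt.Println(ctx.Err()) // context.Canceled, after ~2s
}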

@ghost

ghost commented Jul 15, 2023

Perhaps this is on me for using flush_interval -1 for all of my reverse proxies. I guess Plex is really the only one that specifically benefits from this.

Edit: Just editing to say that I shifted my Caddyfile around a bit and now Plex is the only one that has flush_interval -1. I can confirm that stream_close_delay 30s is working as expected now.

@francislavoie
Member

I'm thinking we should add a condition to that if h.FlushInterval == -1 { check to skip that behaviour for WebSocket requests. I can't think of a situation where it would be beneficial for WebSockets; it seems like it would only be harmful. It could make sense to configure flush_interval -1 for a single backend which does both streaming and WebSockets, but we'd want to ignore that option for any WebSocket requests.
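
A rough sketch of what such a check could look like; the isWebSocketUpgrade helper below is made up for illustration (Caddy may already have an internal equivalent), and it uses a simplified reading of the Connection and Upgrade headers:

package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isWebSocketUpgrade reports whether the request asks to upgrade the
// connection to a WebSocket, based on the standard Connection and Upgrade
// headers (simplified: a full check would parse the comma-separated list).
func isWebSocketUpgrade(r *http.Request) bool {
	return strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade") &&
		strings.EqualFold(r.Header.Get("Upgrade"), "websocket")
}

func main() {
	req, _ := http.NewRequest("GET", "http://example.com/ws", nil)
	req.Header.Set("Connection", "Upgrade")
	req.Header.Set("Upgrade", "websocket")
	fmt.Println(isWebSocketUpgrade(req)) // true
}

The existing guard could then become something like if h.FlushInterval == -1 && !isWebSocketUpgrade(req) { ... }, if that turns out to be the right place for it.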

@mmm444
Contributor Author

mmm444 commented Jul 17, 2023

I have been thinking about this over the weekend and have come to the conclusion that the correct-for-now (tm) solution is to not use the h.ctx context for shutting down any connections when h.FlushInterval == -1. It should be achievable using something like the WithoutCancel feature from Go 1.21. The reasons for this are:

  • Ongoing downloads/uploads will not get terminated upon config reloads. I find it a bit surprising, maybe even buggy, that it happens now. It does not happen when h.FlushInterval != -1.
  • The timeouts on websockets will get fixed.
  • From the users' standpoint, flush-interval tuning has nothing to do with websocket timeouts, so it is better not to leak the implementation details (about configuration reloading) to them.

I can submit a PR if you want.

@francislavoie
Member

Yeah, PR certainly welcome.

I don't think we can use WithoutCancel yet like I said earlier because we'd need to have Go 1.21 in go.mod as the minimum version (and obviously because Go 1.21 isn't out yet).

@mholt
Member

mholt commented Jul 18, 2023

@mmm444 Thanks for the careful thought. I agree with you. I initially used that context because if the client canceling the request doesn't close the connection, we still had to ensure it closed at some point: a config reload seemed like the right time. But it's really not, like you pointed out.

I just hope it won't lead to leaking resources. I guess if it's just proxying the stream, it's up to the client or the backend to actually close them.

@mholt
Member

mholt commented Jul 26, 2023

@mmm444 Would you still like to open a PR? If not, I can take a stab at it, but you're more attuned to this than I am currently 😃

We just need to do it without WithoutCancel for now.

@mholt
Member

mholt commented Jul 26, 2023

What about this as a simple patch (two more conditions on the if statement)?

	if h.FlushInterval == -1 && h.StreamCloseDelay == 0 && h.StreamTimeout == 0 {
		req = req.WithContext(ignoreClientGoneContext{req.Context(), h.ctx.Done()})
	}

Since if the close delay or timeout are set, the connection will definitely be closed at that point, right? We just need to ensure the connection closes eventually otherwise.

@mmm444
Contributor Author

mmm444 commented Jul 31, 2023

@mmm444 Would you still like to open a PR? If not, I can take a stab at it, but you're more attuned to this than I am currently 😃

Yes. I am looking into this right now. Will submit a PR with some explanation in a few hours, hopefully. 😄 Sorry for the delay, I have been away from the internet once again.

@mholt
Member

mholt commented Jul 31, 2023

Awesome, thank you @mmm444 !

@berkant

berkant commented Sep 25, 2023

So, should I put an incredibly long duration for stream_close_delay to keep WebSocket connections alive after reloading the config? My use case involves configuring Caddy on the fly to add/remove reverse_proxy routes to various WebSocket upstreams through the Admin API (http://localhost:2019/config/), but for the time being, all connections break after doing this.

@francislavoie
Member

Yeah, you should set it to some reasonable amount of time, depending on the usual session length of your users. But you should definitely make sure to have proper reconnect logic on the client side, so that even if a connection is closed, it reconnects cleanly.
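
As a sketch of what that reconnect logic can look like for a Go client (a browser client would do the analogous thing in JavaScript; gorilla/websocket and the URL here are just example choices):

package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	url := "wss://example.com/socket" // placeholder upstream URL

	// Reconnect loop: if the proxy closes the connection (e.g. once
	// stream_close_delay expires after a config reload), wait briefly
	// and dial again.
	for {
		conn, _, err := websocket.DefaultDialer.Dial(url, nil)
		if err != nil {
			log.Printf("dial failed: %v; retrying in 2s", err)
			time.Sleep(2 * time.Second)
			continue
		}
		// Read until the connection dies, then loop around and reconnect.
		for {
			if _, _, err := conn.ReadMessage(); err != nil {
				log.Printf("connection closed: %v; reconnecting", err)
				break
			}
		}
		conn.Close()
	}
}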
