reverseproxy: streaming timeouts #5567
Conversation
Whoops, looks like there's a conflict. Could you rebase and fix it?
Thank you for working on this, we're really excited to get something like this merged soon.
I just did a quick first pass on the code -- let me know what you think, or if you have any questions!
```go
connectionsMu         *sync.Mutex
connections           map[io.ReadWriteCloser]openConnection
connectionsCloseTimer **time.Timer
```
You must be a C programmer 😃

Can we get this down to a `*time.Timer`? Seeing as I don't see us ever use `connectionsCloseTimer`, it's always `*connectionsCloseTimer`.
The reason for this is the same as why there is `*sync.Mutex` and not just `sync.Mutex`. As I found during debugging, there are multiple instances of the `Handler` struct floating around at runtime, and I needed to share the `*time.Timer` among all of them, or at least some of them. Specifically, the registration and the cleanup code are called on different instances of `Handler`. Why is this so? Is it an unfortunately leaked reference from a non-pointer receiver method? Or does it have some other reason?

And no, although I can write some C, I would not call myself a C programmer. 😄
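To illustrate the double pointer's role here: if copies of a struct share a `**time.Timer`, assigning a timer through one copy is visible through all of them. A minimal, self-contained sketch (the `handler` type is hypothetical, not Caddy code):

```go
package main

import (
	"fmt"
	"time"
)

type handler struct {
	timer **time.Timer // all copies share the same timer slot
}

func main() {
	var t *time.Timer
	h1 := handler{timer: &t}
	h2 := h1 // a copy, like the one a value receiver sees

	// reassign the timer through the copy; the shared slot is updated
	*h2.timer = time.NewTimer(time.Second)

	fmt.Println(*h1.timer == *h2.timer) // true: both copies see the new timer
}
```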
They should be on the same instance. The only reason I can think of why it would be different is because some of the methods don't have pointer receivers (i.e. some methods are `(h Handler)` because they, at least currently, don't intend to modify the handler, so this is kind of a safeguard -- but that `h` will be different than if the method has a pointer receiver, i.e. `(h *Handler)`).
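For readers unfamiliar with this Go subtlety, here is a small self-contained demo of the difference (a hypothetical type, not the PR's `Handler`):

```go
package main

import "fmt"

type Handler struct{ n int }

// value receiver: h is a copy, so mutations are lost
func (h Handler) bumpCopy() { h.n++ }

// pointer receiver: h points at the original
func (h *Handler) bump() { h.n++ }

func main() {
	h := Handler{}
	h.bumpCopy()
	fmt.Println(h.n) // 0: only the copy was modified
	h.bump()
	fmt.Println(h.n) // 1: the original was modified
}
```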
> The only reason I can think of why it would be different is because some of the methods don't have pointer receivers

This is exactly the cause. I have looked more into this and found that one of the two code paths dealing with the timer goes through some methods that don't have pointer receivers. It is the cleanup path, and it looks like this:

```
(*Handler).registerConnection.func1
(Handler).handleUpgradeResponse
(Handler).finalizeResponse
(*Handler).reverseProxy
(*Handler).proxyLoopIteration
(*Handler).ServeHTTP
...
```

If I change `finalizeResponse` and `handleUpgradeResponse` to also have a pointer receiver, we can get rid of the double pointer. Do you think that is the right way to do it?
Yes! Absolutely.
```go
// this is potentially blocking while we have the lock on the connections
// map, but that should be OK since the server has in theory shut down
// and we are no longer using the connections map
gracefulErr := oc.gracefulClose()
if gracefulErr != nil && err == nil {
	err = gracefulErr
}
```
Slightly nervous about this, but maybe we can revisit this one more time before merge.
Me too, OK.
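For context, the hunk discussed above plausibly sits inside a cleanup path that drains the connections map. A hedged sketch of that shape, with stand-in types (the names `openConnection`, `gracefulClose`, and `cleanup` are assumptions, not necessarily the PR's code):

```go
package sketch

import (
	"io"
	"sync"
)

// stand-ins for the PR's types
type openConnection struct{ conn io.ReadWriteCloser }

func (oc openConnection) gracefulClose() error { return oc.conn.Close() }

type Handler struct {
	connectionsMu *sync.Mutex
	connections   map[io.ReadWriteCloser]openConnection
}

// cleanup drains the connections map, keeping the first error it sees
func (h *Handler) cleanup() error {
	h.connectionsMu.Lock()
	defer h.connectionsMu.Unlock()

	var err error
	for _, oc := range h.connections {
		// potentially blocking while holding the lock, but the server has
		// in theory shut down, so nothing else is using the map anymore
		if gracefulErr := oc.gracefulClose(); gracefulErr != nil && err == nil {
			err = gracefulErr
		}
	}
	h.connections = nil
	return err
}
```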
Thanks a ton for finishing this off for me! The timer/channel bits are not a strength of mine. I don't have much to offer in terms of review but it's looking pretty good to me! 😊
Thanks! I think I want to try this in beta 2, and I don't want to wait much longer on that. AFAICT this change looks alright to me for a beta, but would you be alright if we work on it more after merging? I want to see what we can do about that double pointer, and also verify a few logical things. Oh, and we should probably mark this as experimental so we can change it later, before the final tag. But I think there's value in having this for people to try right away.
Yes. No problem.
I have outlined a solution in the comment above. So just 2 questions from me:
Yes, please! Then I will merge this.
Nowadays, GitHub has a button that does this, so no need for you to do it. :) Thank you for working on this! Let's continue any adjustments after beta 2 is released here in the next day or so.
Perhaps I'm doing something wrong to test this? I've tried adding […] Am I missing something? Am I misunderstanding something?
D'oh! Unfortunately, switching to […]
```
$ docker exec -it caddy caddy version
v2.7.0-beta.2 h1:jaS1odoRuDR2W8igaKgVGvVjhTNt8xfoz3YPC4bcenA=
```

I have a snippet like: […]

I have an […]

I have the […]

Edit: I've tested this with multiple services I use that use websockets. It's not just The Lounge. If I leave the network logs open in Firefox, I can see the websockets die almost immediately and then get recreated.
Thanks. @mmm444 do you have a chance to help out with this?
I would really like to help, but I am on vacation right now and testing this remotely is beyond my possibilities. I can get to this next Thursday. The situation seems to me like the basic use case of this feature, which I have checked multiple times. The only difference is that I have never used the Caddyfile, only JSON. Maybe there is a bug in the adaptation code? I admit I didn't check that.
Enjoy your vacation 😊 Thanks for replying. I just did a quick overview of the Caddyfile code, and it looks like it should be fine. 🤔 Maybe @xnaas could verify that the JSON is correct (use `caddy adapt`). Hopefully we can figure this out when you get back then!
Is this more or less what it should look like?

```json
{
  "group": "group62",
  "handle": [
    {
      "handler": "subroute",
      "routes": [
        {
          "handle": [
            {
              "flush_interval": -1,
              "handler": "reverse_proxy",
              "headers": {
                "request": {
                  "set": {
                    "X-Fake-Test": [
                      "lul"
                    ],
                    "X-Real-Ip": [
                      "{http.request.remote.host}"
                    ]
                  }
                }
              },
              "stream_close_delay": 30000000000,
              "upstreams": [
                {
                  "dial": "thelounge:9000"
                }
              ]
            }
          ]
        }
      ]
    }
  ],
  "match": [
    {
      "host": [
        "tl.asak.gg"
      ]
    }
  ]
},
```

Edit: Wow...that's...super indented lol.

Edit 2: Less indented. :P
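Aside: in Caddy's JSON config, durations given as bare integers are nanoseconds, so the `30000000000` above is 30 seconds. A quick check in Go:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// time.Duration counts nanoseconds, same as Caddy's integer JSON durations
	fmt.Println(time.Duration(30000000000)) // 30s
}
```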
Ok. I have looked into this, and it is caused by the special request context the proxy uses when `flush_interval` is `-1` (quoted in the next comment). Setting `stream_close_delay` has no effect in that case, because the request is detached from client cancellation. I am not sure yet what the best fix is here. I will try to come up with something in the following days.

If someone wants to play with the situation, here is a minimal websocket server that can be used to reproduce the reported behavior: https://gist.github.com/mmm444/efc3e25fbbb0056f9c759d6dac3f65f0

Shall I file a bug about it, or is it OK to continue the discussion here?
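The gist linked above is the authoritative reproduction; for a sense of its shape, here is a hedged sketch of a minimal websocket echo server (using gorilla/websocket; not necessarily what the gist contains):

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{} // default options

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		c, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			log.Println("upgrade:", err)
			return
		}
		defer c.Close()
		// echo frames back until the client or the proxy closes the stream
		for {
			mt, msg, err := c.ReadMessage()
			if err != nil {
				return
			}
			if err := c.WriteMessage(mt, msg); err != nil {
				return
			}
		}
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```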
Fascinating. Coincidentally, Go 1.21 is introducing a new `context.WithoutCancel` for exactly this kind of situation. I think the best thing to do for now is probably to add a warning log when these two config options are combined, to mention that it might not work properly, unless you can figure out some other creative fix. We can open an issue as a reminder to implement it with `context.WithoutCancel` later.
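For reference, the Go 1.21 API in question is `context.WithoutCancel(parent)`, which returns a context that keeps the parent's values but is never canceled when the parent is. A small demo:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())
	detached := context.WithoutCancel(parent) // Go 1.21+

	cancel()
	fmt.Println(parent.Err())   // context canceled
	fmt.Println(detached.Err()) // <nil>: detached from parent cancellation
}
```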
@mmm444 Thanks for coming back to this! Francis pointed out to me in Slack this is why that happens:

```go
// if FlushInterval is explicitly configured to -1 (i.e. flush continuously to achieve
// low-latency streaming), don't let the transport cancel the request if the client
// disconnects: user probably wants us to finish sending the data to the upstream
// regardless, and we should expect client disconnection in low-latency streaming
// scenarios (see issue #4922)
if h.FlushInterval == -1 {
	req = req.WithContext(ignoreClientGoneContext{req.Context(), h.ctx.Done()})
}
```

That said, I feel like when the server flushes data should be orthogonal to when the connection is hung up. Is there, like, a done chan we could give it that never gets closed? Or doesn't get closed until the stream_close_delay is up?
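Caddy's actual definition of `ignoreClientGoneContext` isn't quoted in this thread, but the composite literal above implies a shape roughly like this hedged sketch (field layout inferred, not copied from the source):

```go
package sketch

import "context"

// a context that hides the client's disconnection: Done and Err report
// cancellation only once the handler's own channel closes
type ignoreClientGoneContext struct {
	context.Context      // parent request context (values, deadline)
	done <-chan struct{} // e.g. h.ctx.Done(), closed on config unload
}

func (c ignoreClientGoneContext) Done() <-chan struct{} { return c.done }

func (c ignoreClientGoneContext) Err() error {
	select {
	case <-c.done:
		return context.Canceled
	default:
		return nil
	}
}
```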
Perhaps this is on me for using `flush_interval -1`?

Edit: Just editing to say that I shifted my Caddyfile around a bit and now Plex is the only one that has `flush_interval -1`.
I'm thinking we should add a condition to that `if`.
I have been thinking about this over the weekend, and I have come to the conclusion that the correct-for-now (tm) solution is to not use the handler's `h.ctx` as the cancellation signal for streaming requests at all.

I can submit a PR if you want.
Yeah, PR certainly welcome. I don't think we can use `context.WithoutCancel` just yet, though.
@mmm444 Thanks for the careful thought. I agree with you. I initially used that context because if the client canceling the request doesn't close the connection, we still had to ensure it closed at some point: a config reload seemed like the right time. But it's really not, like you pointed out. I just hope it won't lead to leaking resources. I guess if it's just proxying the stream, it's up to the client or the backend to actually close them.
@mmm444 Would you still like to open a PR? If not, I can take a stab at it, but you're more attuned to this than I am currently 😃 We just need to do it without `WithoutCancel` for now.
What about this as a simple patch (two more conditions on the `if`)?

```go
if h.FlushInterval == -1 && h.StreamCloseDelay == 0 && h.StreamTimeout == 0 {
	req = req.WithContext(ignoreClientGoneContext{req.Context(), h.ctx.Done()})
}
```

Since if the close delay or timeout are set, the connection will definitely be closed at that point, right? We just need to ensure the connection closes eventually otherwise.
Yes. I am looking into this right now. Will submit a PR with some explanation in a few hours, hopefully. 😄 Sorry for the delay, I have been away from the internet once again.
Awesome, thank you @mmm444!
So, should I put an incredibly long duration for `stream_close_delay`?
Yeah, you should set it to some reasonable amount of time, dependent on the usual session length of your users. But you should definitely make sure to have proper reconnect logic on the client side, to ensure that even if the connections are closed, the client reconnects cleanly.
This PR adds two settings related to streaming connections to the `reverse_proxy` handler:

- `stream_timeout` defines the maximum lifetime of a streaming connection in the reverse proxy; when the connection reaches this age, it is closed.
- `stream_close_delay` defines how long the proxy waits before closing streaming connections when it is cleaned up, i.e. after a configuration change.

#5471
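To make the `stream_timeout` semantics concrete, here is a hedged, self-contained sketch of the underlying idea: arm a timer when the stream starts, and close the connection when it fires (an illustration of the mechanism, not the PR's implementation):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

func main() {
	client, server := net.Pipe() // stand-in for a proxied stream

	// enforce a maximum lifetime on the stream
	streamTimeout := 100 * time.Millisecond
	timer := time.AfterFunc(streamTimeout, func() { server.Close() })
	defer timer.Stop() // don't fire if the stream ends on its own first

	go io.Copy(io.Discard, server) // stand-in for the proxy's copy loop

	client.Write([]byte("hello")) // succeeds while the stream is alive
	time.Sleep(200 * time.Millisecond)

	_, err := client.Write([]byte("too late"))
	fmt.Println(err) // io: read/write on closed pipe
}
```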