-
Notifications
You must be signed in to change notification settings - Fork 844
Support graceful shutdown on both HTTP/1.1 and HTTP/2 #2106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@zizhong Please take a look. |
|
@maskit This is a 7.1.0 candidate? |
jpeach
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review placeholder :)
I'll try to look this weekend.
|
@maskit I trust your judgement here, let me know what you think is the right thing to do. My goal right now is to make 7.1.0 rock solid, and since we're working on other issues, there's still time. |
| { | ||
| sslNetProcessor.stop_accept(); | ||
| netProcessor.stop_accept(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work correctly. If you are running under traffic_manager then what happens is that the client connection just hangs because traffic_manager is still listening on the sockets.
Refusing new connections while draining should absolutely be configurable. Many configurations need to continue to serve clients while the GSLB directs traffic elsewhere. @zwoop.
I saw an assert at one point:
traffic_server: using root directory '/opt/ats'
[Jun 10 20:51:40.206] Server {0x7fd9da1b8700} DEBUG: <Main.cc:417 (periodic)> (server) limiting connections based on memory usage has been disabled
[Jun 10 20:52:28.020] Server {0x7fd9da3bc700} DEBUG: <Main.cc:285 (periodic)> (server) received exit signal, shutting down in 30secs
Fatal: UnixNetVConnection.cc:1401: failed assertion `con.fd == NO_FD`
traffic_server: received signal 6 (Aborted)
traffic_server - STACK TRACE:
/opt/ats/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xc3)[0x504c72]
/lib64/libpthread.so.0(+0x115c0)[0x7fd9ddb065c0]
/lib64/libc.so.6(gsignal+0x9f)[0x7fd9dccb891f]
/lib64/libc.so.6(abort+0x16a)[0x7fd9dccba51a]
/opt/ats/lib/libtsutil.so.8(_Z11ink_warningPKcz+0x0)[0x7fd9df2d58f2]
/opt/ats/lib/libtsutil.so.8(_Z17ats_base64_encodePKhmPcmPm+0x0)[0x7fd9df2d2d65]
/opt/ats/bin/traffic_server(_ZN18UnixNetVConnection4freeEP7EThread+0x2d9)[0x7a8425]
/opt/ats/bin/traffic_server(_ZN18UnixNetVConnection11acceptEventEiP5Event+0x16d)[0x7a7049]
/opt/ats/bin/traffic_server(_ZN12Continuation11handleEventEiPv+0x72)[0x507de6]
/opt/ats/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x134)[0x7ca102]
/opt/ats/bin/traffic_server(_ZN7EThread7executeEv+0x10f)[0x7ca391]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refusing new connections while draining should absolutely be configurable. Many configurations need to continue to serve clients while the GSLB directs traffic elsewhere.
I'm fine with keeping sockets open but I think it should have been done before starting graceful shutdown. If we don't close sockets, number of active connections won't goes down without external traffic control.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work correctly. If you are running under traffic_manager then what happens is that the client connection just hangs because traffic_manager is still listening on the sockets.
Right. Is it OK if I close sockets only if traffic_manager isn't used ? This PR doesn't include support for traffic_manager.
(This still depends on proxy.config.stop.shutdown_timeout. I'm going to add support for traffic_ctl stop --drain later.)
It was not clear. I mean another Pull Request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the assert, it seems the NetVConnection has to be closed with close_UnixNetVConnection() before calling its free. Otherwise, net_connections_currently_open_stat won't be decremented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't close sockets, number of active connections won't goes down
without external traffic control.
@maskit That's right. If you are running a CDN you have external load balancing, so you don't need to give errors to clients. I can see the utility of refusing connections, but it should not be the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are running a CDN you have external load balancing
What if you aren't?
| is_draining() const | ||
| { | ||
| return http_client_session_draining; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's any value in disguising the global variable here. Just use it directly where you need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrapped it because the variable will be replaced with a metric you suggested.
proxy/ProxyClientSession.h
Outdated
| #include "InkAPIInternal.h" | ||
| #include "http/HttpServerSession.h" | ||
|
|
||
| extern volatile bool http_client_session_draining; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a metric that publishes this state so that operators can tell we are in draining state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I left that as a future work? Because main purpose of this PR is generalizing the flag and supporting graceful shutdown on HTTP/1.1. This is just a replacement of http2_drain.
proxy/http2/Http2ConnectionState.h
Outdated
| }; | ||
|
|
||
| enum Http2ShutdownState { NOT_INITIATED, INITIATED, IN_PROGRESS }; | ||
| enum Http2ShutdownState { NOT_PLANNED, NOT_INITIATED, INITIATED, IN_PROGRESS }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do these states mean? I can't figure out from the usage what you intend here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These enums should be renamed to be consistent with the rest of the code. I think that all you need are:
enum Http2ShutdownState {
HTTP2_SHUTDOWN_NONE,
HTTP2_SHUTDOWN_BEGIN,
HTTP2_SHUTDOWN_INPROGRESS
};There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOT_PLANNED: Shutdown is not planned (it'll keep working)
NOT_INITIATED: Shutdown is planned but nothing is happened yet (The 1st GOAWAY will be sent)
INITIATED: The first GOAWAY is sent and waiting the 2sec before sending the 2nd GOAWAY
IN_PROGRESS: The second GOAWAY is sent and waiting shutdown timeout
I'll rename them. Let me know if you have better names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start by making them consistent with other enums, so:
HTTP2_SHUTDOWN_NONE,
HTTP2_SHUTDOWN_NOT_INITIATED,
HTTP2_SHUTDOWN_INITIATED,
HTTP2_SHUTDOWN_IN_PROGRESSWould you be able to make another PR later that reduces this down to just 3 values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible if I change the conditions, but I guess the condition will be a bit complex.
I'll make them consistent for now.
proxy/http2/Http2Stream.h
Outdated
| void reenable(VIO *vio); | ||
| virtual void transaction_done(); | ||
| virtual bool | ||
| ignore_keep_alive() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool ignore_keep_alive() override { ... }There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do after #2113. (Many warnings show up.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
| virtual bool | ||
| ignore_keep_alive() | ||
| { | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a comment explaining it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I don't understand why it was true. Without this change, Connection: close is always added (and is removed while converting headers from 1.1 to 2). I think it should be handled as the same as HTTP/1.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sounds good. Just comment the explanation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
proxy/http2/Http2ConnectionState.h
Outdated
| }; | ||
|
|
||
| enum Http2ShutdownState { NOT_INITIATED, INITIATED, IN_PROGRESS }; | ||
| enum Http2ShutdownState { NOT_PLANNED, NOT_INITIATED, INITIATED, IN_PROGRESS }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code for handling HTTP2_SESSION_EVENT_SHUTDOWN_INIT, is it safe to schedule the HTTP2_SESSION_EVENT_SHUTDOWN_CONT in 2 seconds? What guarantees that the Http2ConnectionState is still alive at that time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but you can say the same thing for other events, it should be guaranteed though. It's not special.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only one other schedule_in in HTTP/2 and that keeps hold of the returned Action so it can be canceled. I don't this this usage is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it. Are you suggesting any of these?
- Keep
Http2ConnectionStatealive until the event processed - Make it cancelable and cancel the event when
Http2ConnectoinStatedies
Is it guaranteed that Http2ConnectionState is still alive when HTTP2_SESSION_EVENT_RECV is going to be processed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't understand how this schedule_in is safe. If the object is destroyed before the 2sec timeout, the event will run on a freed object AFAICT. I think you have to make it cancelable and cancel the event when Http2ConnectionState dies.
The HTTP2_SESSION_EVENT_RECV is different, since send_connection_event() delivers the event directly, so it is reentrant on the event handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand your point.
HTTP2_SESSION_EVENT_RECV was a bad example :) I wanted to know how other places guarantee the continuation is still alive. Anyway, I'll look into it and try to make it solid.
|
Even if I couldn't complete the whole feature, I think most changes we have already merged on master (and code on this PR) is still valid. HTTP/2 sessions should be closed if (proxy's) response headers have |
|
Done
WIP
Need discussion
|
|
I don't see any reason to not refuse new connections while draining. Refusing new connections might be overkill for users who use load balancer, but if ATS doesn't refuse connections, draining feature will be only for users who use load balancer (or some external controller). You can remove servers from load balancer before starting draining so I think it shouldn't be a problem even if the behavior is not configurable. |
|
Moving out to 8.0.0 for now, since there seems to be some contention on this topic. :) |
|
On Jun 13, 2017, at 1:10 AM, Masakazu Kitajo ***@***.***> wrote:
I don't see any reason to not refuse new connections while draining.
It definitely has to be configurable.
Refusing new connections might be overkill for users who use load balancer, but if ATS doesn't refuse connections, draining feature will be only for users who use load balancer (or some external controller).
That's everyone. You have to manage getting the traffic into ATS somehow :) Also, if you refuse connections you have to actually refuse them; you can't leave them to time out in the listen backlog and you can't shut them down once you have accepted them.
You can remove servers from load balancer before starting draining so I think it shouldn't be a problem even if the behavior is not configurable.
Not if you are using a DNS-based GSLB.
|
Can we assume that all ATS operators can control it? If the answer was yes, why is it need to be configurable?
I agree. That's why I tried to close listening sockets. So, is it OK to close the sockets from TS if TS isn't managed by TM (and users want it) ?
Why not? Isn't it possible to remove servers? I'm not familiar with DNS-based GLSB. |
|
[approve ci autest] |
|
@maskit @masaori335 Where are we with this? Is this still a WIP, or getting ready for review / merge? If so, please remove the WIP label. |
|
It's ready for review. |
|
This needs to be reviewed. |
|
I'm thinking about adding tests. @maskit do you have any good idea about how to test this? |
|
@zizhong I think autest would be an option. We would need an origin server ( or a plugin) which returns a response with few seconds of delay without The test would be something like:
I think we don't have any ways to test this behavior on HTTP/2. |
|
@maskit Something more useful is that ATS sends the GOAWAY frame if a response header has |
|
@zizhong That logic is already implemented :) |
|
Rebased on the latest master. The code hasn't been changed since Jul 20. I just resolved conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. One minor change is required. Also commits should be squashed in few commits.
proxy/ProxyClientSession.cc
Outdated
| #include "HttpDebugNames.h" | ||
| #include "ProxyClientSession.h" | ||
|
|
||
| volatile bool http_client_session_draining = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We removed volatile keyword by #2603. Should we use std::atomic<bool>?
For HTTP/1.1, add Connection: close header For HTTP/2, send GOAWAY frames
|
Made a small change. RecInt timeout = 0;
- if (RecGetRecordInt("proxy.config.stop.shutdown_timeout", &timeout) == REC_ERR_OKAY && timeout) {
+ if (RecGetRecordInt("proxy.config.stop.shutdown_timeout", &timeout) == REC_ERR_OKAY && timeout &&
+ !http_client_session_draining) {
http_client_session_draining = true; |
|
@jpeach Can you take another look? |
shinrich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this PR. Looks great! We had an internal issue filed on using the GOAWAY to make HTTP/2 draining go faster. This is a superset of that issue.
|
[approve ci] |
| This setting specifies the number of active client connections | ||
| for use by :option:`traffic_ctl server restart --drain`. | ||
|
|
||
| .. ts:cv:: CONFIG proxy.config.restart.stop_listening INT 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maskit
Tests are needed for both HTTP/1 draining and HTTP/2 graceful shutdown. They can help you when you add "traffic_ctl --drain" feature.
For how to test it in HTTP/2, you can check out this gist. https://gist.github.com/zizhong/401bf0618c0a977d0cd7e1042ee632fc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test. Nice job. I tested it manually but we should automate it, and the test is helpful even without supporting "traffic_ctl --drain". Can you make the test autest compatible and submit a PR?
Also, as you know, the GOAWAY frame for graceful shutdown has the unique stream id. So, it would be great if the test checks the stream id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will do.
I think the requested changes (at least on this PR) are already addressed.
|
@jpeach I dismissed your change request because I think they are already addressed. Please file issues if there's something I missed. |
For HTTP/1.1, add
Connection: closeheaderFor HTTP/2, send
GOAWAYframesConnection: closeThe last change also fixes an inconsistent behavior that H2 sessions won't be closed even if
Connection: closeheader is set by a plugin.(This still depends on
proxy.config.stop.shutdown_timeout. I'm going to add support fortraffic_ctl stop --drainlater.)