Skip to content

Conversation

@maskit
Copy link
Member

@maskit maskit commented Jun 9, 2017

For HTTP/1.1, add Connection: close header
For HTTP/2, send GOAWAY frames

  • Generalize the flag name
  • Close listening sockets when graceful shutdown has been scheduled
  • Send the GOAWAY frame if a response header hasConnection: close

The last change also fixes an inconsistent behavior that H2 sessions won't be closed even if Connection: close header is set by a plugin.

(This still depends on proxy.config.stop.shutdown_timeout. I'm going to add support for traffic_ctl stop --drain later.)

@maskit maskit added this to the 8.0.0 milestone Jun 9, 2017
@maskit maskit self-assigned this Jun 9, 2017
@maskit maskit requested review from jpeach and masaori335 June 9, 2017 07:38
@maskit
Copy link
Member Author

maskit commented Jun 9, 2017

@mlibbey This will enable to shed H2 traffic with header_rewrite. (#1693)

@maskit
Copy link
Member Author

maskit commented Jun 9, 2017

@zizhong Please take a look.

@zwoop
Copy link
Contributor

zwoop commented Jun 9, 2017

@maskit This is a 7.1.0 candidate?

Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review placeholder :)

I'll try to look this weekend.

@maskit
Copy link
Member Author

maskit commented Jun 9, 2017

@zwoop I marked this 8.0 because of necessity but that should be possible. We'll need #1710, #1937 and #2094 if we backport this.

@zwoop
Copy link
Contributor

zwoop commented Jun 9, 2017

@maskit I trust your judgement here, let me know what you think is the right thing to do. My goal right now is to make 7.1.0 rock solid, and since we're working on other issues, there's still time.

@maskit
Copy link
Member Author

maskit commented Jun 9, 2017

@zwoop Then I'd say yes.

Although it's not merged yet, we'll need #2096 too.

@maskit maskit modified the milestones: 7.1.0, 8.0.0 Jun 9, 2017
@maskit maskit added the Backport Marked for backport for an LTS patch release label Jun 9, 2017
jpeach
jpeach previously requested changes Jun 11, 2017
{
sslNetProcessor.stop_accept();
netProcessor.stop_accept();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work correctly. If you are running under traffic_manager then what happens is that the client connection just hangs because traffic_manager is still listening on the sockets.

Refusing new connections while draining should absolutely be configurable. Many configurations need to continue to serve clients while the GSLB directs traffic elsewhere. @zwoop.

I saw an assert at one point:

traffic_server: using root directory '/opt/ats'
[Jun 10 20:51:40.206] Server {0x7fd9da1b8700} DEBUG: <Main.cc:417 (periodic)> (server) limiting connections based on memory usage has been disabled
[Jun 10 20:52:28.020] Server {0x7fd9da3bc700} DEBUG: <Main.cc:285 (periodic)> (server) received exit signal, shutting down in 30secs
Fatal: UnixNetVConnection.cc:1401: failed assertion `con.fd == NO_FD`
traffic_server: received signal 6 (Aborted)
traffic_server - STACK TRACE:
/opt/ats/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xc3)[0x504c72]
/lib64/libpthread.so.0(+0x115c0)[0x7fd9ddb065c0]
/lib64/libc.so.6(gsignal+0x9f)[0x7fd9dccb891f]
/lib64/libc.so.6(abort+0x16a)[0x7fd9dccba51a]
/opt/ats/lib/libtsutil.so.8(_Z11ink_warningPKcz+0x0)[0x7fd9df2d58f2]
/opt/ats/lib/libtsutil.so.8(_Z17ats_base64_encodePKhmPcmPm+0x0)[0x7fd9df2d2d65]
/opt/ats/bin/traffic_server(_ZN18UnixNetVConnection4freeEP7EThread+0x2d9)[0x7a8425]
/opt/ats/bin/traffic_server(_ZN18UnixNetVConnection11acceptEventEiP5Event+0x16d)[0x7a7049]
/opt/ats/bin/traffic_server(_ZN12Continuation11handleEventEiPv+0x72)[0x507de6]
/opt/ats/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x134)[0x7ca102]
/opt/ats/bin/traffic_server(_ZN7EThread7executeEv+0x10f)[0x7ca391]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refusing new connections while draining should absolutely be configurable. Many configurations need to continue to serve clients while the GSLB directs traffic elsewhere.

I'm fine with keeping sockets open but I think it should have been done before starting graceful shutdown. If we don't close sockets, number of active connections won't goes down without external traffic control.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work correctly. If you are running under traffic_manager then what happens is that the client connection just hangs because traffic_manager is still listening on the sockets.

Right. Is it OK if I close sockets only if traffic_manager isn't used ? This PR doesn't include support for traffic_manager.

(This still depends on proxy.config.stop.shutdown_timeout. I'm going to add support for traffic_ctl stop --drain later.)

It was not clear. I mean another Pull Request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the assert, it seems the NetVConnection has to be closed with close_UnixNetVConnection() before calling its free. Otherwise, net_connections_currently_open_stat won't be decremented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't close sockets, number of active connections won't goes down
without external traffic control.

@maskit That's right. If you are running a CDN you have external load balancing, so you don't need to give errors to clients. I can see the utility of refusing connections, but it should not be the default.

Copy link
Member Author

@maskit maskit Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are running a CDN you have external load balancing

What if you aren't?

is_draining() const
{
return http_client_session_draining;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any value in disguising the global variable here. Just use it directly where you need it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrapped it because the variable will be replaced with a metric you suggested.

#include "InkAPIInternal.h"
#include "http/HttpServerSession.h"

extern volatile bool http_client_session_draining;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a metric that publishes this state so that operators can tell we are in draining state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I left that as a future work? Because main purpose of this PR is generalizing the flag and supporting graceful shutdown on HTTP/1.1. This is just a replacement of http2_drain.

};

enum Http2ShutdownState { NOT_INITIATED, INITIATED, IN_PROGRESS };
enum Http2ShutdownState { NOT_PLANNED, NOT_INITIATED, INITIATED, IN_PROGRESS };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these states mean? I can't figure out from the usage what you intend here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These enums should be renamed to be consistent with the rest of the code. I think that all you need are:

enum Http2ShutdownState {
    HTTP2_SHUTDOWN_NONE,
    HTTP2_SHUTDOWN_BEGIN,
    HTTP2_SHUTDOWN_INPROGRESS
};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOT_PLANNED: Shutdown is not planned (it'll keep working)
NOT_INITIATED: Shutdown is planned but nothing is happened yet (The 1st GOAWAY will be sent)
INITIATED: The first GOAWAY is sent and waiting the 2sec before sending the 2nd GOAWAY
IN_PROGRESS: The second GOAWAY is sent and waiting shutdown timeout

I'll rename them. Let me know if you have better names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's start by making them consistent with other enums, so:

HTTP2_SHUTDOWN_NONE,
HTTP2_SHUTDOWN_NOT_INITIATED,
HTTP2_SHUTDOWN_INITIATED,
HTTP2_SHUTDOWN_IN_PROGRESS

Would you be able to make another PR later that reduces this down to just 3 values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible if I change the conditions, but I guess the condition will be a bit complex.
I'll make them consistent for now.

void reenable(VIO *vio);
virtual void transaction_done();
virtual bool
ignore_keep_alive()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool ignore_keep_alive() override { ... }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do after #2113. (Many warnings show up.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

virtual bool
ignore_keep_alive()
{
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a comment explaining it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't understand why it was true. Without this change, Connection: close is always added (and is removed while converting headers from 1.1 to 2). I think it should be handled as the same as HTTP/1.1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good. Just comment the explanation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

};

enum Http2ShutdownState { NOT_INITIATED, INITIATED, IN_PROGRESS };
enum Http2ShutdownState { NOT_PLANNED, NOT_INITIATED, INITIATED, IN_PROGRESS };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code for handling HTTP2_SESSION_EVENT_SHUTDOWN_INIT, is it safe to schedule the HTTP2_SESSION_EVENT_SHUTDOWN_CONT in 2 seconds? What guarantees that the Http2ConnectionState is still alive at that time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but you can say the same thing for other events, it should be guaranteed though. It's not special.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one other schedule_in in HTTP/2 and that keeps hold of the returned Action so it can be canceled. I don't this this usage is safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. Are you suggesting any of these?

  • Keep Http2ConnectionState alive until the event processed
  • Make it cancelable and cancel the event when Http2ConnectoinState dies

Is it guaranteed that Http2ConnectionState is still alive when HTTP2_SESSION_EVENT_RECV is going to be processed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't understand how this schedule_in is safe. If the object is destroyed before the 2sec timeout, the event will run on a freed object AFAICT. I think you have to make it cancelable and cancel the event when Http2ConnectionState dies.

The HTTP2_SESSION_EVENT_RECV is different, since send_connection_event() delivers the event directly, so it is reentrant on the event handler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I understand your point.

HTTP2_SESSION_EVENT_RECV was a bad example :) I wanted to know how other places guarantee the continuation is still alive. Anyway, I'll look into it and try to make it solid.

@jpeach
Copy link
Contributor

jpeach commented Jun 13, 2017

@zwoop @maskit I don't think this is a backport candidate until the whole feature is complete. I don't want us to get stuck on something just for compatibility reasons.

@maskit
Copy link
Member Author

maskit commented Jun 13, 2017

Even if I couldn't complete the whole feature, I think most changes we have already merged on master (and code on this PR) is still valid. HTTP/2 sessions should be closed if (proxy's) response headers have Connection: close. I'd rather remove only the shutdown part than postpone all the changes to 8.0.

@maskit
Copy link
Member Author

maskit commented Jun 13, 2017

Done

  • Made enum names consistent
  • Added override
  • Explained why ignore_keep_alive() should return false
  • Removed code that closes listening sockets
    • It refuses new connections on SessionAccept instead (and it's not configurable for now)

WIP

  • Cancel HTTP2_SESSION_EVENT_SHUTDOWN_CONT event when Http2ConnectionState dies

Need discussion

  • Whether refuse new connections while draining

@maskit maskit added the WIP label Jun 13, 2017
@maskit
Copy link
Member Author

maskit commented Jun 13, 2017

I don't see any reason to not refuse new connections while draining. Refusing new connections might be overkill for users who use load balancer, but if ATS doesn't refuse connections, draining feature will be only for users who use load balancer (or some external controller). You can remove servers from load balancer before starting draining so I think it shouldn't be a problem even if the behavior is not configurable.

@zwoop zwoop modified the milestones: 8.0.0, 7.1.0 Jun 13, 2017
@zwoop
Copy link
Contributor

zwoop commented Jun 13, 2017

Moving out to 8.0.0 for now, since there seems to be some contention on this topic. :)

@jpeach
Copy link
Contributor

jpeach commented Jun 13, 2017 via email

@maskit
Copy link
Member Author

maskit commented Jun 13, 2017

You have to manage getting the traffic into ATS somehow :)

Can we assume that all ATS operators can control it? If the answer was yes, why is it need to be configurable?

Also, if you refuse connections you have to actually refuse them;

I agree. That's why I tried to close listening sockets. So, is it OK to close the sockets from TS if TS isn't managed by TM (and users want it) ?

Not if you are using a DNS-based GSLB.

Why not? Isn't it possible to remove servers? I'm not familiar with DNS-based GLSB.

@bryancall
Copy link
Contributor

[approve ci autest]

@zwoop
Copy link
Contributor

zwoop commented Jul 20, 2017

@maskit @masaori335 Where are we with this? Is this still a WIP, or getting ready for review / merge? If so, please remove the WIP label.

@maskit maskit removed WIP Backport Marked for backport for an LTS patch release labels Jul 20, 2017
@maskit
Copy link
Member Author

maskit commented Jul 20, 2017

It's ready for review.

@bryancall
Copy link
Contributor

This needs to be reviewed.

@zizhong
Copy link
Member

zizhong commented Aug 18, 2017

I'm thinking about adding tests. @maskit do you have any good idea about how to test this?

@maskit
Copy link
Member Author

maskit commented Aug 19, 2017

@zizhong I think autest would be an option.

We would need an origin server ( or a plugin) which returns a response with few seconds of delay without Connection: close.

The test would be something like:

  1. Set proxy.config.stop.shutdown_timeout to 10 sec
  2. Launch Traffic Server
  3. Send an HTTP request
  4. Send SIGTERM to the process
  5. Check if the response has Connection: close
  6. Wait for 10 sec
  7. Check if the process doesn't exist

I think we don't have any ways to test this behavior on HTTP/2.

@zizhong
Copy link
Member

zizhong commented Aug 21, 2017

@maskit Something more useful is that ATS sends the GOAWAY frame if a response header has Connection: close no matter the draining started or not. This will enable a plugin to initiate the graceful shutdown process for both HTTP/1 and HTTP/2.

@maskit
Copy link
Member Author

maskit commented Aug 22, 2017

@maskit
Copy link
Member Author

maskit commented Oct 25, 2017

Rebased on the latest master. The code hasn't been changed since Jul 20. I just resolved conflicts.

Copy link
Contributor

@masaori335 masaori335 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One minor change is required. Also commits should be squashed in few commits.

#include "HttpDebugNames.h"
#include "ProxyClientSession.h"

volatile bool http_client_session_draining = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We removed volatile keyword by #2603. Should we use std::atomic<bool>?

masaori335
masaori335 previously approved these changes Nov 6, 2017
For HTTP/1.1, add Connection: close header
For HTTP/2, send GOAWAY frames
@maskit
Copy link
Member Author

maskit commented Nov 6, 2017

Made a small change.

       RecInt timeout = 0;
-      if (RecGetRecordInt("proxy.config.stop.shutdown_timeout", &timeout) == REC_ERR_OKAY && timeout) {
+      if (RecGetRecordInt("proxy.config.stop.shutdown_timeout", &timeout) == REC_ERR_OKAY && timeout &&
+          !http_client_session_draining) {
         http_client_session_draining = true;

@maskit
Copy link
Member Author

maskit commented Nov 6, 2017

@jpeach Can you take another look?

Copy link
Member

@shinrich shinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this PR. Looks great! We had an internal issue filed on using the GOAWAY to make HTTP/2 draining go faster. This is a superset of that issue.

@maskit
Copy link
Member Author

maskit commented Dec 5, 2017

[approve ci]

This setting specifies the number of active client connections
for use by :option:`traffic_ctl server restart --drain`.

.. ts:cv:: CONFIG proxy.config.restart.stop_listening INT 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maskit
Tests are needed for both HTTP/1 draining and HTTP/2 graceful shutdown. They can help you when you add "traffic_ctl --drain" feature.
For how to test it in HTTP/2, you can check out this gist. https://gist.github.com/zizhong/401bf0618c0a977d0cd7e1042ee632fc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test. Nice job. I tested it manually but we should automate it, and the test is helpful even without supporting "traffic_ctl --drain". Can you make the test autest compatible and submit a PR?

Also, as you know, the GOAWAY frame for graceful shutdown has the unique stream id. So, it would be great if the test checks the stream id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will do.

@zizhong zizhong mentioned this pull request Dec 7, 2017
@maskit maskit dismissed jpeach’s stale review December 13, 2017 05:03

I think the requested changes (at least on this PR) are already addressed.

@maskit
Copy link
Member Author

maskit commented Dec 13, 2017

@jpeach I dismissed your change request because I think they are already addressed. Please file issues if there's something I missed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants