Fix for dropped connections due to next_inactivity_timeout_at==0 and inactivity_timeout_in!=0 #1675

keesspoelstra · 2017-04-14T14:25:09Z

Under certain conditions it is possible that read_disable or write_disable will set vc->next_inactivity_timeout to 0 while inactivity_timeout_in still has a value.
In manage_active_queue this will lead to a close of the vc, resulting in interrupted service for the client.
Checking the source code the best fix seems to be to check for next_inactivity_timeout_at!=0 in manage_active_queue.
The value 0 does not seem to be used to force an early close/cleanup. The current time is used for that in certain cases.

The active timeout does not have this problem. If it is set to 0 the active_timeout_in is also set to 0. This cannot be done for inactive timeout because the net_activity call needs inactivity_timeout_in.

…inactivity_timeout_in!=0 Under certain conditions it is possible that read_disable or write_disable will set vc->next_inactivity_timeout to 0 while inactivity_timeout_in still has a value. In manage_active_queue this will lead to a close of the vc, resulting in interrupted service for the client. Checking the source code the best fix seems to be to check for next_inactivity_timeout_at!=0 in manage_active_queue. The value 0 does not seem to be used to force an early close/cleanup. The current time is used for that in certain cases. The active timeout does not have this problem. If it is set to 0 the active_timeout_in is also set to 0. This cannot be done for inactive timeout because the net_activity call needs inactivity_timeout_in.

PSUdaemon · 2017-04-17T17:46:46Z

The commit message is missing the note about which commit on master this was cherry-picked from. Please use the -x option when cherry picking. Also, don't forget to uncomment any conflicts.

zwoop · 2017-04-18T17:15:39Z

How does this apply to master now? Is this an issue there? I saw your comment on my other Issue, maybe this is now a candidate for 7.1.x as well ?

keesspoelstra · 2017-04-18T20:10:31Z

@zwoop this applies to 7,1.x also,It can be reliably applied to master also,

However for master the 425b696 commit mitigates the behavior by calling the vc->set_inactivity_timeout(0); instead of vc->next_inactivity_timeout_at = 0;
This will prevent the condition to clean up the connection to be satisfied (inactivity_timeout_in will be 0).

425b696 causes other problems though as you've found out.

Do I need to do anything to ready it for master/7.1?

oknet · 2017-04-19T03:45:40Z

In the master branch,

Call vc->set_inactivity_timeout(0) instead of set next_inactivity_timeout_at to 0.
The inactivity_timeout_in is always greater than 0, it will be nh->default_inactivity_timeout if you set_inactivity_timeout(0).

Therefore, the patch for 7.0.x and later is below,

diff --git a/iocore/net/UnixNetVConnection.cc b/iocore/net/UnixNetVConnection.cc
index b73bcb3..d850e69 100644
--- a/iocore/net/UnixNetVConnection.cc
+++ b/iocore/net/UnixNetVConnection.cc
@@ -75,10 +75,7 @@ net_activity(UnixNetVConnection *vc, EThread *thread)
       vc->inactivity_timeout = 0;
   }
 #else
-  if (vc->inactivity_timeout_in)
-    vc->next_inactivity_timeout_at = Thread::get_hrtime() + vc->inactivity_timeout_in;
-  else
-    vc->next_inactivity_timeout_at = 0;
+  vc->next_inactivity_timeout_at = Thread::get_hrtime() + vc->inactivity_timeout_in;
 #endif
 }

@keesspoelstra Thanks for helping us figure it out. @zwoop

keesspoelstra · 2017-04-19T06:38:43Z

@oknet , seems reasonable, but I don't know if reverting back to default inactivity timeout is the way to go, in the old code
next_inactivity_timeout_at=0 seems to be used to stall the inactivity timeout during write_disable, read_disable, not resetting it to default. Default might even be higher than the current inactivity timeout causing more connections for some servers.

To make it more explicit I would opt for an extra function pause_inactivity_timeout or something like that.

oknet · 2017-04-19T11:53:46Z

@keesspoelstra in the old InactivityCop::check_inactivity() code, it will set next_inactivity_timeout_at to default if it is 0.

 83       // set a default inactivity timeout if one is not set
 84       if (vc->next_inactivity_timeout_at == 0 && default_inactivity_timeout > 0) {
 85         Debug("inactivity_cop", "vc: %p inactivity timeout not set, setting a default of %d", vc, default_inactivity_timeout);
 86         vc->set_inactivity_timeout(HRTIME_SECONDS(default_inactivity_timeout));
 87         NET_INCREMENT_DYN_STAT(default_inactivity_timeout_stat);
 88       } else {
 89         Debug("inactivity_cop_verbose", "vc: %p now: %" PRId64 " timeout at: %" PRId64 " timeout in: %" PRId64, vc, now,
 90               ink_hrtime_to_sec(vc->next_inactivity_timeout_at), ink_hrtime_to_sec(vc->inactivity_timeout_in));
 91       }

oknet · 2017-04-19T12:51:00Z

"set next_inactivity_timeout_at to default" is a safety mechanism that enforce the netvc is closed in a expected time.

The manage_active_queue() try to close the netvc that has timeouted.
The netvc won't be closed by manage_active_queue() if the netvc has read_disable() and write_disable().

keesspoelstra · 2017-04-19T14:17:06Z

Ok, now we're on to something. Because the VCs return from disable very quickly (we hope so) the inactivity cop does not see them often and does not set the default, thus not introducing the lengthening or shortening of inactivity timeout.

But by setting the inactivity_timeout_in explicitly to default when we're disabling and on other places we're making matters worse because we're introducing longer (or shorter) timeouts in some cases.

No access to source code , but I can imagine SSL handshakes being vulnerable to this.

zwoop · 2017-04-22T21:19:19Z

What's the consensus here? Note that I had to revert @oknet's patch for inactivity timeout on 7.1.x, we likely need to do it on master too unless we can resolve the performance issues (they are severe). How does this PR apply to master? I'm confused with the fact that this is being developed on the 6.2.x branch.

keesspoelstra · 2017-04-24T09:05:05Z

If you reverted you will need this patch. @oknet , agree?

Without it every once in a few 1000-100.000 requests connections can be dropped giving partial content depending on plugins and timing of the system.

It was developed against 6.2.x because at the time it was only applicable to 6.2.x and earlier. What is the procedure to land this on master or 7.1.x?

oknet · 2017-04-24T12:56:36Z

@keesspoelstra agree with you, @zwoop please apply this PR if you have revert #771 .
Because next_inactivity_timeout_at=0 for a new netvc.

ATS maybe close netvc in active queue if add_to_active_queue() called.

oknet · 2017-04-24T13:01:56Z

@keesspoelstra @zwoop

Without #771 , the cancel_inactivity_timeout() is useless because the InactivityCop::check_inactivity() will reset next_inactivity_timeout_at to default value in the next second.

codes from 7.1.x branch:

437 TS_INLINE void
438 UnixNetVConnection::cancel_inactivity_timeout()
439 {
440   Debug("socket", "Cancel inactive timeout for NetVC=%p", this);
441   inactivity_timeout_in = 0;
442 #ifdef INACTIVITY_TIMEOUT
443   if (inactivity_timeout) {
444     Debug("socket", "Cancel inactive timeout for NetVC=%p", this);
445     inactivity_timeout->cancel_action(this);
446     inactivity_timeout = nullptr;
447   }
448 #else
449   next_inactivity_timeout_at = 0;   // set to 0
450 #endif
451 }

Set next_inactivity_timeout_at by set_inactivity_timeout() if it is 0.

 83       // set a default inactivity timeout if one is not set
 84       if (vc->next_inactivity_timeout_at == 0 && default_inactivity_timeout > 0) {
 85         Debug("inactivity_cop", "vc: %p inactivity timeout not set, setting a default of %d", vc, default_inactivity_timeout);
 86         vc->set_inactivity_timeout(HRTIME_SECONDS(default_inactivity_timeout));
 87         NET_INCREMENT_DYN_STAT(default_inactivity_timeout_stat);
 88       } else {

PSUdaemon · 2017-05-15T19:55:58Z

[approve ci]

PSUdaemon · 2017-05-15T19:58:28Z

[approve ci]

zwoop · 2017-05-31T17:50:58Z

Where are we with this PR? Do we need this on master and/or 7.1.x as well? Or is it already in either (or both) ?

oknet · 2017-06-01T09:54:41Z

@zwoop I think this is only to 6.2.x.

PSUdaemon · 2017-07-26T16:39:44Z

This PR seems a little confused. It seems like a legitimate bug, but this is not set up as a backport of an upstream fix. I'd like to see that fixed so I can merge, but I cannot merge this as is. I am going to close it, but please reopen it if it can be fixed, or open a new one as a proper backport of an upstream fix.

keesspoelstra mentioned this pull request Apr 14, 2017

More and slower active connections in 7.1.x #1527

Closed

PSUdaemon self-assigned this Apr 17, 2017

PSUdaemon added Backport Marked for backport for an LTS patch release Network labels Apr 17, 2017

PSUdaemon added this to the 6.2.2 milestone Apr 17, 2017

oknet approved these changes Jun 1, 2017

View reviewed changes

PSUdaemon closed this Jul 26, 2017

PSUdaemon removed this from the 6.2.2 milestone Aug 2, 2017

Fix for dropped connections due to next_inactivity_timeout_at==0 and inactivity_timeout_in!=0 #1675

Fix for dropped connections due to next_inactivity_timeout_at==0 and inactivity_timeout_in!=0 #1675

Uh oh!

Conversation

keesspoelstra commented Apr 14, 2017

Uh oh!

PSUdaemon commented Apr 17, 2017

Uh oh!

zwoop commented Apr 18, 2017

Uh oh!

keesspoelstra commented Apr 18, 2017

Uh oh!

oknet commented Apr 19, 2017

Uh oh!

keesspoelstra commented Apr 19, 2017

Uh oh!

oknet commented Apr 19, 2017

Uh oh!

oknet commented Apr 19, 2017

Uh oh!

keesspoelstra commented Apr 19, 2017

Uh oh!

zwoop commented Apr 22, 2017

Uh oh!

keesspoelstra commented Apr 24, 2017

Uh oh!

oknet commented Apr 24, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oknet commented Apr 24, 2017

Uh oh!

PSUdaemon commented May 15, 2017

Uh oh!

PSUdaemon commented May 15, 2017

Uh oh!

zwoop commented May 31, 2017

Uh oh!

oknet commented Jun 1, 2017

Uh oh!

PSUdaemon commented Jul 26, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

oknet commented Apr 24, 2017 •

edited

Loading