Revert the runner-side idle timeout to 1s #1483
Conversation
We stumbled over a problem where *some* FDKs are idling their UDS HTTP connections at periods lower than the 120s that this was expecting. This was giving rise to spurious errors (from older versions of the node FDK, for instance), where invocations beyond the first were seeing 502 gateway errors from a prematurely closed UDS socket. The notion of an idle timeout here is good, but we should check that all FDKs have the appropriate behaviour and give users time to rev their functions before reintroducing this.
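For context, a minimal sketch of what the reverted setting looks like, assuming the runner pools its UDS connections through a `net/http` transport; the function name, socket path, and wiring here are illustrative, not the actual runner code:

```go
package runner

import (
	"context"
	"net"
	"net/http"
	"time"
)

// newUDSClient builds an HTTP client that dials the function container's
// unix socket and caps how long an idle pooled connection is kept around.
func newUDSClient(sockPath string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		// Dial the container's unix socket instead of TCP.
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", sockPath)
		},
		// Reverted from 120s back to 1s: the runner now drops an idle
		// connection before an FDK with a shorter idle window closes it.
		IdleConnTimeout: 1 * time.Second,
	}}
}
```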
Force-pushed from de7066c to b2dac56.
LGTM
I'm not convinced that we've done all the investigative work necessary to establish that this change will actually solve our problem. To be clear, any notion of connection caching that cannot survive one of the endpoints unexpectedly closing the connection will be problematic no matter what the length of the timeout is. We've fixed the FDKs where we've observed this problem, so I don't think haste is required for this PR. I'd like for us to gather more information and debug further. Based on some of the other problems we've run into with concurrent invokes, I'm suspicious that there's a race between inserting an fd into the connection cache and having it closed from the remote side. I intend to dig into this more, but if that's the case, it would be worth considering how to disable the connection re-use, or harden it to cope with losing such a race (both options are sketched below).
For this PR, do we have any evidence that this helps solve problems beyond the node FDK? (Note also that the node FDK has been fixed for days.)
Withdrawing my objection after an offline discussion. We need to temporarily revert part of the timeout change to unblock deployment downstream, and will continue to debug and look for a more comprehensive fix in parallel.
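A minimal sketch of those two options, assuming the runner's connection cache sits behind a `net/http` transport; the function names are hypothetical, not the runner's actual API:

```go
package mitigations

import (
	"errors"
	"net/http"
	"syscall"
)

// Option 1: disable connection re-use entirely, so every invoke dials the
// UDS afresh and there is no cached fd to race against.
func disableReuse(tr *http.Transport) {
	tr.DisableKeepAlives = true
}

// Option 2: keep re-use but tolerate losing the race: if the FDK closed the
// cached connection between checkout and write, retry once on a fresh one.
// Only safe when the request has no body or a replayable one.
func doWithRetry(c *http.Client, req *http.Request) (*http.Response, error) {
	resp, err := c.Do(req)
	if err != nil && errors.Is(err, syscall.ECONNRESET) {
		resp, err = c.Do(req)
	}
	return resp, err
}
```

The retry variant trades a duplicate delivery risk for availability, which is why it is restricted to replayable requests; disabling re-use sidesteps the race entirely at the cost of a fresh dial per invoke.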
I'm okay with this too until we figure out a better way to increase this.
I used a node-FDK-based "hello world" from prior to the fix fnproject/fdk-node#26 like this (a rough sketch of the reproduction follows the before/after results):
Before:
After:
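The before/after invocation output was not captured in this page. As an illustration only, a sketch of the failure mode under the same assumptions as above; the socket path and timings are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			// Hypothetical path to the function container's unix socket.
			return (&net.Dialer{}).DialContext(ctx, "unix", "/tmp/iofs/lsnr.sock")
		},
		IdleConnTimeout: 120 * time.Second, // the pre-revert runner setting
	}}

	for i := 0; i < 2; i++ {
		resp, err := client.Get("http://localhost/call")
		if err != nil {
			// With a pre-fix node FDK, the second call can land on a cached
			// connection the FDK already closed; at the gateway this
			// surfaced as a 502.
			fmt.Println("invoke", i, "failed:", err)
		} else {
			fmt.Println("invoke", i, "->", resp.Status)
			resp.Body.Close()
		}
		// Wait longer than the FDK's idle window but well under 120s, so
		// only the FDK side gives up on the cached connection.
		time.Sleep(10 * time.Second)
	}
}
```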
I'm a fan of tracking down any remaining duff timeouts in our FDKs (ideally, the FDK probably shouldn't time out connections at all), but we should give people time to adjust before shifting this value.
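For what that would mean in practice, a hedged sketch of an FDK-style UDS server that never idles out its connections and leaves the idle policy to the runner; this is illustrative only, not any FDK's actual code:

```go
package main

import (
	"net"
	"net/http"
)

func main() {
	// Hypothetical socket path; real FDKs take this from their environment.
	ln, err := net.Listen("unix", "/tmp/iofs/lsnr.sock")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("hello world"))
		}),
		// With IdleTimeout (and ReadTimeout) left at zero, the server never
		// closes a keep-alive connection for idleness; the runner's
		// IdleConnTimeout becomes the only idle policy in play.
		IdleTimeout: 0,
	}
	if err := srv.Serve(ln); err != nil {
		panic(err)
	}
}
```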