
Revert the runner-side idle timeout to 1s #1483

Merged
merged 1 commit into master on Apr 25, 2019
Conversation

jan-g
Contributor

@jan-g commented on Apr 25, 2019

We stumbled over a problem where some FDKs idle out their UDS HTTP
connections at periods shorter than the 120s this setting was expecting.

This gave rise to spurious errors (from older versions of the node
FDK, for instance), where invocations beyond the first saw
502 gateway errors from a prematurely closed UDS socket.

The notion of an idle timeout here is sound, but we should check that all
FDKs have the appropriate behaviour, and give users time to rev their
functions, before reintroducing the longer value.
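To illustrate the mismatch, here is a minimal, hypothetical Go sketch (not the actual runner or FDK code; socket path and durations are made up): a UDS HTTP server stands in for an FDK that idles out kept-alive connections early, and the client transport keeps its IdleConnTimeout shorter than the server's idle period, so pooled connections are discarded before the far side can close them underneath us.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Hypothetical socket path; the real runner/FDK paths differ.
	sock := filepath.Join(os.TempDir(), "fdk-demo.sock")
	os.Remove(sock)
	ln, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	defer os.Remove(sock)

	// Stand-in for an FDK that closes idle connections early
	// (here after 500ms, far short of a 120s expectation).
	srv := &http.Server{
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, `{"message":"Hello World"}`)
		}),
		IdleTimeout: 500 * time.Millisecond,
	}
	go srv.Serve(ln)

	// Runner-side analogue: keeping IdleConnTimeout below the FDK's
	// idle period means we never try to reuse a connection the FDK
	// may already have closed.
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(_ context.Context, _, _ string) (net.Conn, error) {
			return net.Dial("unix", sock)
		},
		IdleConnTimeout: 100 * time.Millisecond,
	}}

	for i := 1; i <= 2; i++ {
		resp, err := client.Get("http://unix/")
		if err != nil {
			fmt.Println("invoke", i, "error:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("invoke %d: %s\n", i, body)
		time.Sleep(time.Second) // idle between invocations, past both timeouts
	}
}
```

With the short client-side IdleConnTimeout both invocations succeed on fresh connections; raising it above the server's IdleTimeout reintroduces the risk of reusing a half-closed socket (Go's own client often masks that by retrying, but the 502s above suggest the runner's path did not).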

How to verify it

I used a node-FDK-based "hello world" from prior to the fix fnproject/fdk-node#26 like this:

fn invoke jang node; sleep 6; fn invoke jang node

Before:

% fn invoke jang node; sleep 6; fn invoke jang node
{"message":"Hello World"}{"code":"StatusBadGateway","message":"error receiving function response"}
Fn: Error calling function: status 502

See 'fn <command> --help' for more information. Client version: 0.5.63

After:

% fn invoke jang node; sleep 6; fn invoke jang node
{"message":"Hello World"}{"message":"Hello World"}%

I'm a fan of tracking down any remaining duff timeouts in our FDKs (ideally, the FDK probably shouldn't time out connections at all), but we should give people time to adjust before shifting this value.

@reclaro reclaro self-requested a review April 25, 2019 14:06
Contributor

@reclaro reclaro left a comment


LGTM


@kmjohansen kmjohansen left a comment


I'm not convinced that we've done all the investigative work necessary to establish that this change will actually solve our problem. To be clear, any notion of connection caching that cannot survive one of the endpoints unexpectedly closing the connection will be problematic no matter what the length of the timeout is. We've fixed the FDKs where we've observed this problem, so I don't think haste is required for this PR; I'd like us to gather more information and debug further.

Based upon some of the other problems we've run into with concurrent invokes, I'm suspicious that there's a race between inserting an fd into the connection cache and having it closed from the remote side. I intend to dig into this more, but if that's the case, it would be worth considering how to disable the connection re-use, or harden it to cope with losing such a race.

For this PR, do we have any evidence that this helps solve problems beyond the node fdk? (Note, also, the node fdk has been fixed for days.)
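For reference, the "harden it to cope with losing such a race" option mentioned above could look something like the following hypothetical sketch (not actual Fn runner code; `withOneRetry` and the fake attempt are illustrative): retry exactly once when a pooled connection turns out to have been closed by the far side, so the retry dials fresh.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"syscall"
)

// staleConnErr reports whether err looks like reuse of a connection
// that the remote end had already closed (the race described above).
func staleConnErr(err error) bool {
	return errors.Is(err, io.EOF) || errors.Is(err, syscall.ECONNRESET)
}

// withOneRetry runs attempt, retrying exactly once on a stale-connection
// error; in a real client the retry would dial a fresh connection
// instead of pulling another entry from the cache.
func withOneRetry(attempt func() (string, error)) (string, error) {
	out, err := attempt()
	if err != nil && staleConnErr(err) {
		return attempt()
	}
	return out, err
}

func main() {
	calls := 0
	// Fake attempt: the first call simulates hitting a closed pooled
	// connection; the second succeeds, standing in for a fresh dial.
	out, err := withOneRetry(func() (string, error) {
		calls++
		if calls == 1 {
			return "", io.EOF
		}
		return `{"message":"Hello World"}`, nil
	})
	fmt.Println(out, err, calls) // → {"message":"Hello World"} <nil> 2
}
```

Note this is only safe when the request can be replayed (idempotent, or with a re-readable body), which is part of why disabling reuse outright is also on the table.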

@kmjohansen

Withdrawing my objection after an offline discussion. Need to temporarily revert part of the timeout change to unblock deployment downstream. Will continue to debug and look for a more comprehensive fix in parallel.

@kmjohansen kmjohansen merged commit d21a2f4 into master Apr 25, 2019
Contributor

@rdallman rdallman left a comment


i'm okay with this too until we figure out a better way to increase this
