Revert the runner-side idle timeout to 1s #1483
Conversation
We stumbled over a problem where *some* FDKs are idling their UDS HTTP connections at periods lower than the 120s that this was expecting. This was giving rise to spurious errors (from older versions of the node FDK, for instance), where invocations beyond the first were seeing 502 gateway errors from a prematurely closed UDS socket. The notion of an idle timeout here is good, but we should check that all FDKs have the appropriate behaviour and give users time to rev their functions before reintroducing this.
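For context, a minimal sketch of what the reverted setting looks like, assuming the runner pools its UDS connections through a `net/http` transport; the function name, socket path, and wiring here are illustrative, not the actual runner code:

```go
package runner

import (
	"context"
	"net"
	"net/http"
	"time"
)

// newUDSClient builds an HTTP client that dials the function container's
// unix socket and caps how long an idle pooled connection is kept around.
func newUDSClient(sockPath string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		// Dial the container's unix socket instead of TCP.
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", sockPath)
		},
		// Reverted from 120s back to 1s: the runner now drops an idle
		// connection before an FDK with a shorter idle window closes it.
		IdleConnTimeout: 1 * time.Second,
	}}
}
```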
Force-pushed from de7066c to b2dac56.
LGTM
I'm not convinced that we've done all the investigative work necessary to establish that this change will actually solve our problem. To be clear, any notion of connection caching that cannot survive one of the endpoints unexpectedly closing the connection will be problematic no matter what the length of the timeout is. We've fixed the FDKs where we've observed this problem, so I don't think haste is required for this PR. I'd like for us to gather more information and debug further. Based on some of the other problems we've run into with concurrent invokes, I'm suspicious that there's a race between inserting an fd into the connection cache and having it closed from the remote side. I intend to dig into this more, but if that's the case, it would be worth considering how to disable the connection re-use, or harden it to cope with losing such a race (both options are sketched below).
For this PR, do we have any evidence that this helps solve problems beyond the node FDK? (Note also that the node FDK has been fixed for days.)
Withdrawing my objection after an offline discussion. We need to temporarily revert part of the timeout change to unblock deployment downstream, and will continue to debug and look for a more comprehensive fix in parallel.
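A minimal sketch of those two options, assuming the runner's connection cache sits behind a `net/http` transport; the function names are hypothetical, not the runner's actual API:

```go
package mitigations

import (
	"errors"
	"net/http"
	"syscall"
)

// Option 1: disable connection re-use entirely, so every invoke dials the
// UDS afresh and there is no cached fd to race against.
func disableReuse(tr *http.Transport) {
	tr.DisableKeepAlives = true
}

// Option 2: keep re-use but tolerate losing the race: if the FDK closed the
// cached connection between checkout and write, retry once on a fresh one.
// Only safe when the request has no body or a replayable one.
func doWithRetry(c *http.Client, req *http.Request) (*http.Response, error) {
	resp, err := c.Do(req)
	if err != nil && errors.Is(err, syscall.ECONNRESET) {
		resp, err = c.Do(req)
	}
	return resp, err
}
```

The retry variant trades a duplicate delivery risk for availability, which is why it is restricted to replayable requests; disabling re-use sidesteps the race entirely at the cost of a fresh dial per invoke.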
I'm okay with this too until we figure out a better way to increase this.
I used a node-FDK-based "hello world" from prior to the fix fnproject/fdk-node#26 like this (a rough sketch of the reproduction follows the before/after results):
Before:
After:
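The before/after invocation output was not captured in this page. As an illustration only, a sketch of the failure mode under the same assumptions as above; the socket path and timings are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			// Hypothetical path to the function container's unix socket.
			return (&net.Dialer{}).DialContext(ctx, "unix", "/tmp/iofs/lsnr.sock")
		},
		IdleConnTimeout: 120 * time.Second, // the pre-revert runner setting
	}}

	for i := 0; i < 2; i++ {
		resp, err := client.Get("http://localhost/call")
		if err != nil {
			// With a pre-fix node FDK, the second call can land on a cached
			// connection the FDK already closed; at the gateway this
			// surfaced as a 502.
			fmt.Println("invoke", i, "failed:", err)
		} else {
			fmt.Println("invoke", i, "->", resp.Status)
			resp.Body.Close()
		}
		// Wait longer than the FDK's idle window but well under 120s, so
		// only the FDK side gives up on the cached connection.
		time.Sleep(10 * time.Second)
	}
}
```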
I'm a fan of tracking down any remaining duff timeouts in our FDKs (ideally, the FDK probably shouldn't time out connections at all), but we should give people time to adjust before shifting this value.
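For what that would mean in practice, a hedged sketch of an FDK-style UDS server that never idles out its connections and leaves the idle policy to the runner; this is illustrative only, not any FDK's actual code:

```go
package main

import (
	"net"
	"net/http"
)

func main() {
	// Hypothetical socket path; real FDKs take this from their environment.
	ln, err := net.Listen("unix", "/tmp/iofs/lsnr.sock")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("hello world"))
		}),
		// With IdleTimeout (and ReadTimeout) left at zero, the server never
		// closes a keep-alive connection for idleness; the runner's
		// IdleConnTimeout becomes the only idle policy in play.
		IdleTimeout: 0,
	}
	if err := srv.Serve(ln); err != nil {
		panic(err)
	}
}
```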