-
Notifications
You must be signed in to change notification settings - Fork 495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(@embark/utils): use timeouts to improve utils.pingEndpoint #1166
Conversation
6ce0971
to
1f955bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really like to see what others think about that PRs. I personally feel it is really complex and not normal we have to do such things.
I would be surprise there is a problem in node regarding curl
as it is used a lot.
I don't think we are really fixing the root cause here but a side effect of the problem
e.g if a library is a problem we should find a way to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @alaibe that this is really complex just for a ping. If the provider is capable of talking to the endpoint correctly but we can't ping it in a timely fashion, there must be a flaw in how we do it or the libs we use.
I saw you said that even under great CPU pressure, curl
calls still worked.
Maybe we should try to replace our current ping by a simple exec curl
. I saw that it is possible to do it with a websocket too: https://gist.github.com/htp/fbce19069187ec1cc486b594104f01d0
@alaibe, excellent points. I will try to make my case for the complexity. I believe the problem is related to node.js itself, but I can only describe what's happening, I don't know the real reason. What I've observed seems similar to what's described in nodejs/node#20256, but happening for the client when the server is not necessarily node.js (could be geth or ipfs). It is affecting clients created with built-in Bug Description
This problem is why when doing We can solve the problem with timeouts with the assumption that a request taking longer than The reason for the complexity in implementing the solution is that I discovered the socket timeouts don't always work, they can get stuck too — probably related to the same bug. So I doubled-up with |
@jrainville it would be possible to do the ping in a child process with an external tool, good idea. However, the downside of doing so is that In the process of trying to figure out the problem, I observed that when my CPU is busy it takes awhile for ips and/or geth processes to actually start, which adds to the perception that See my comment to @alaibe re: this being a problem with node itself — at this point, I'm fairly sure that's the case, though I would love to find out that it's something we're doing wrong or can do better. |
b04518c
to
3e6bd40
Compare
This PR doesn't work on Windows, changes needed. |
3e6bd40
to
c9ea017
Compare
ad5ad17
to
25520c7
Compare
This PR works together with #1181 to prevent hangs and crashes when running / stopping embark processes. When combining the code changes in this PR with those in #1181 and doing more experimentation, I found that it's unnecessary to have secondary timeouts, i.e. the ones created with Note that the maximum number of 1000ms connection attempts results in the same maximum |
_cb(); | ||
} | ||
} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we extend this out to all other service checks (swarm, webserver, whisper)? And in that case, maybe this can be abstracted a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea @emizzle. I'll create a pivotal task for that pending this PR being merged.
src/lib/utils/utils.js
Outdated
req.on('response', (_response) => { | ||
callback(); | ||
}); | ||
const doCallback = (v) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is v
? Can we have a better name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, seems like doCallback
returns a boolean about if it should callback, but the name implies that it will actually callback. We should change that name too to something like shouldCallback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a silent "should" in front of the variable name... You're right, shouldCallback
is better, will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/lib/utils/utils.js
Outdated
setImmediate(() => { | ||
pingEndpoint(host, port, type, protocol, origin, callback, ++count); | ||
}); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this if
with a return
to limit the depth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a few ideas when I woke up this morning about how to improve control flow, will be making those now and will keep in mind opportunities for early return
.
One thing, though, is that since pingEndpoint
is doing asynchronous stuff and the return value is effectively ignored, there's no stack growth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A number of changes have been made. They've been squashed into the one commit for this PR, but if you want to see what changed specifically you can check 8dfef2a — I was using that branch to test revised pingEndpoint
and the proxy revisions together.
I tested on macOS, Windows, and desktop Linux, with websocket proxy and HTTP-only proxy. For all cases, I tried with embark run
and blockchain
in separate terminals and with single-terminal embark run
, checking for blockchain process crashes when killing run
and for hangs during startup. I tried jamming up the load on the CPU to see if I could get an indefinite stall... all very tedious, but I wanted to double-check everything so we can wrap this up with as much confidence as possible.
Everything worked correctly in all cases. Interestingly, on Linux I did one time see Proxy: Network error 'read ECONNRESET'
reported in the output of embark blockchain
. So I guess it's not just Windows on which that can happen.
I introduced an exponential increase in the timeout, with a max timeout of 3 seconds. As before, it will keep retrying for up to 10 minutes, after which the behavior becomes under defined as described previously.
I eliminated an edge case where a retry might be ongoing but another event might fire on a previous request. With my latest changes, if a request timeout has triggered a retry then additional events on the previous request will be ignored.
Some branching logic for handling error/success events has been combined into maybeRetry
which is renamed handleEvent
. An early The return
could be used here and the else if
branch would become an if
; not sure if you have a strong preference on that change being made.cleanup
call needs to happen regardless of which branch executes, so there's not actually a case for early an return
in handleEvent
.
Finally, the changes re: the HTTP timeout were motivated by another look at the docs. The request.setTimeout()
method doesn't take affect until after a socket connection has been made, which is the behavior we want so it's cleaner to use it than to setup a waterfall of event handlers.
src/lib/utils/utils.js
Outdated
|
||
const handleError = (req, closeMethod) => { | ||
req.on('error', (err) => { | ||
if (!retrying) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would early return here with:
if (retrying) {
return cleanup(req, closeMethod);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: #1166 (comment)
25520c7
to
6db9def
Compare
For reasons unknown, sockets underlying http and websocket requests made by `utils.pingEndpoint` could become stuck and after a *long* time eventually hang up. This can happen even when the endpoint being pinged is perfectly responsive, e.g. when checking it with `curl` in another terminal. The root cause may be a bug or poorly documented behavior in node.js itself. The effects of the stuck sockets are experienced more often when the CPU is under moderate-heavy load, and also more often when pinging embark's blockchain proxy vs. pinging geth, ipfs, etc. directly. Use timeouts to cut short the stuck sockets. Usually, after one-to-several retries (which may get stuck) following a stuck socket, connections behave as expected (learned by experience). Assume that `T` time of inactivity on a "ping socket" indicates that socket is stuck; increase `T` exponentially on each retry until it reaches a max of 3000ms, and don't continue retrying beyond 600000ms (10 minutes). Make the initial `T` value 100ms. Once a retry is underway following a timeout event, ignore all further events on the previous ping attempt. Note that when the CPU is moderately busy, one of the slowest steps is launching embark's child processes, e.g. ipfs, geth, etc. However, even under very heavy load, with the revised `pingEndpoint` embark seems to always eventually start/detect services correctly. And under light to moderate load, the effect of the stuck sockets is much less noticeable. Use a weboscket client from the `ws` package instead of manually building websocket headers. Use `pingEndpoint` for the IPFS module's `_checkService` method to avoid the same stuck socket problem that was affecting blockchain detection.
6db9def
to
6824264
Compare
}; | ||
|
||
if (type === 'ws') { | ||
const req = new (require('ws'))( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll be better of having all necessary require()
statements at the top of the file instead of requiring within a function.
Especially if we happen to use (in this case) require('ws')
in multiple places.
Closing, will be replaced with a PR that uses a |
The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.
The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any additional `'error'` events that might fire after the first `'error'` event, but do report them.
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any `'error'` events that might fire after the first `'error'` event or a success event, but do report them.
The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any `'error'` events that might fire after the first `'error'` event or a success event, but do report them.
The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.
The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any `'error'` events that might fire after the first `'error'` event or a success event, but do report them.
For reasons unknown, sockets underlying http and websocket requests made by
utils.pingEndpoint
could become stuck and after a long time eventually hang up. This can happen even when the endpoint being pinged is perfectly responsive, e.g. when checking it withcurl
in another terminal. The root cause may be a bug or poorly documented behavior in node.js itself. The effects of the stuck sockets are experienced more often when the CPU is under moderate-heavy load, and also more often when pinging embark's blockchain proxy vs. pinging geth, ipfs, etc. directly.Use timeouts to cut short the stuck sockets. Usually, after one-to-several retries (which may get stuck) following a stuck socket, connections behave as expected (learned by experience). Assume that
T
time of inactivity on a "ping socket" indicates that socket is stuck; increaseT
exponentially on each retry until it reaches a max of 3000ms, and don't continue retrying beyond 600000ms (10 minutes). Make the initialT
value 100ms. Once a retry is underway following a timeout event, ignore all further events on the previous ping attempt.Note that when the CPU is moderately busy, one of the slowest steps is launching embark's child processes, e.g. ipfs, geth, etc. However, even under very heavy load, with the revised
pingEndpoint
embark seems to always eventually start/detect services correctly. And under light to moderate load, the effect of the stuck sockets is much less noticeable.Use a weboscket client from the
ws
package instead of manually building websocket headers.Use
pingEndpoint
for the IPFS module's_checkService
method to avoid the same stuck socket problem that was affecting blockchain detection.