Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(@embark/utils): use timeouts to improve utils.pingEndpoint #1166

Closed
wants to merge 1 commit into from

Conversation

michaelsbradleyjr
Copy link
Contributor

@michaelsbradleyjr michaelsbradleyjr commented Dec 7, 2018

For reasons unknown, sockets underlying http and websocket requests made by utils.pingEndpoint could become stuck and after a long time eventually hang up. This can happen even when the endpoint being pinged is perfectly responsive, e.g. when checking it with curl in another terminal. The root cause may be a bug or poorly documented behavior in node.js itself. The effects of the stuck sockets are experienced more often when the CPU is under moderate-heavy load, and also more often when pinging embark's blockchain proxy vs. pinging geth, ipfs, etc. directly.

Use timeouts to cut short the stuck sockets. Usually, after one-to-several retries (which may get stuck) following a stuck socket, connections behave as expected (learned by experience). Assume that T time of inactivity on a "ping socket" indicates that socket is stuck; increase T exponentially on each retry until it reaches a max of 3000ms, and don't continue retrying beyond 600000ms (10 minutes). Make the initial T value 100ms. Once a retry is underway following a timeout event, ignore all further events on the previous ping attempt.

Note that when the CPU is moderately busy, one of the slowest steps is launching embark's child processes, e.g. ipfs, geth, etc. However, even under very heavy load, with the revised pingEndpoint embark seems to always eventually start/detect services correctly. And under light to moderate load, the effect of the stuck sockets is much less noticeable.

Use a weboscket client from the ws package instead of manually building websocket headers.

Use pingEndpoint for the IPFS module's _checkService method to avoid the same stuck socket problem that was affecting blockchain detection.

@michaelsbradleyjr michaelsbradleyjr requested a review from a team December 7, 2018 02:51
@michaelsbradleyjr michaelsbradleyjr force-pushed the refactor/pingEndpoint branch 6 times, most recently from 6ce0971 to 1f955bd Compare December 7, 2018 05:13
Copy link
Contributor

@alaibe alaibe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really like to see what others think about that PRs. I personally feel it is really complex and not normal we have to do such things.

I would be surprise there is a problem in node regarding curl as it is used a lot.
I don't think we are really fixing the root cause here but a side effect of the problem
e.g if a library is a problem we should find a way to fix it.

Copy link
Collaborator

@jrainville jrainville left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @alaibe that this is really complex just for a ping. If the provider is capable of talking to the endpoint correctly but we can't ping it in a timely fashion, there must be a flaw in how we do it or the libs we use.

I saw you said that even under great CPU pressure, curl calls still worked.
Maybe we should try to replace our current ping by a simple exec curl. I saw that it is possible to do it with a websocket too: https://gist.github.com/htp/fbce19069187ec1cc486b594104f01d0

@michaelsbradleyjr
Copy link
Contributor Author

michaelsbradleyjr commented Dec 7, 2018

@alaibe, excellent points. I will try to make my case for the complexity.

I believe the problem is related to node.js itself, but I can only describe what's happening, I don't know the real reason. What I've observed seems similar to what's described in nodejs/node#20256, but happening for the client when the server is not necessarily node.js (could be geth or ipfs).

It is affecting clients created with built-in require('http') (so utils.pingEndpoint and utils.getJson are affected). It's also affecting clients created with require('ws') in the same way, probably because that package uses the built-in http.

Bug Description

  • Assume some endpoint is up and listening at http://localhost:12345, i.e. you can verify it's answering requests properly with some external tool like curl or a browser.
  • In node.js create a request with require('http').
  • Sometimes the underlying socket of the request is immediately stuck.
  • After a long time there will be an 'error' event on the request reporting a socket hang up.
  • If there is a successful retry, then afterward sockets don't seem to get stuck when making requests to the endpoint.

This problem is why when doing embark run it will sometimes seem to freeze during detection of the Ethereum node or the IPFS node (or both). It can also cause the blockchain proxy to take a long time to be ready, because the proxy uses pingEndpoint to detect the target. The bug seems to occur more frequently when the CPU is busy.

We can solve the problem with timeouts with the assumption that a request taking longer than T time to finish (with error or success) is probably a stuck socket. This allows the ECONNREFUSED error to work correctly for us, detecting that an endpoint isn't actually available. There is a risk that if T is too short then non-stuck sockets may not have time to receive a response. I chose 125ms, which seems like a lot of time for a request to a localhost process, but it may need to be longer.

The reason for the complexity in implementing the solution is that I discovered the socket timeouts don't always work, they can get stuck too — probably related to the same bug. So I doubled-up with setTimeout, and it works. BUT with potentially overlapped requests it gets tricky to handle events and cleanup correctly, and we have to make sure the callback isn't called multiple times too.

@michaelsbradleyjr
Copy link
Contributor Author

michaelsbradleyjr commented Dec 7, 2018

@jrainville it would be possible to do the ping in a child process with an external tool, good idea. However, the downside of doing so is that exec can be pretty slow, especially if the CPU is busy.

In the process of trying to figure out the problem, I observed that when my CPU is busy it takes awhile for ips and/or geth processes to actually start, which adds to the perception that embark run is hanging. So execing curl may have a similar problem, taking a relatively long time to provide node.js with the result.

See my comment to @alaibe re: this being a problem with node itself — at this point, I'm fairly sure that's the case, though I would love to find out that it's something we're doing wrong or can do better.

@michaelsbradleyjr michaelsbradleyjr force-pushed the refactor/pingEndpoint branch 2 times, most recently from b04518c to 3e6bd40 Compare December 8, 2018 04:20
@michaelsbradleyjr
Copy link
Contributor Author

This PR doesn't work on Windows, changes needed.

@michaelsbradleyjr
Copy link
Contributor Author

michaelsbradleyjr commented Dec 12, 2018

This PR works together with #1181 to prevent hangs and crashes when running / stopping embark processes.

When combining the code changes in this PR with those in #1181 and doing more experimentation, I found that it's unnecessary to have secondary timeouts, i.e. the ones created with setTimeout in an earlier version of this PR's commit. So, I removed that code and now the revised pingEndpoint is a little simpler, though it still suffers from complexity that I don't know how to further reduce at present.

Note that the maximum number of 1000ms connection attempts results in the same maximum waitOnTarget time of the proxy, i.e. 600000ms (10 minutes). The behavior when waitOnTarget and/or the retries goes beyond T time limit is under defined. In practice, things often work out because the blockchain is actually running and it was spurious network errors that resulted in the timeouts, or it wasn't and embark will fallback to launching it in another process. By setting T to a very high value, we allow pingEndpoint's retry behavior to (hopefully) eventually result in a genuine ECONNREFUSED error or a successful connection. 10 minutes is a long time to wait, most users would probably Ctrl-c if it's stuck for more than a few, helping to avoid the under defined behavior from becoming a frequent occurrence.

_cb();
}
}
);
Copy link
Collaborator

@emizzle emizzle Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we extend this out to all other service checks (swarm, webserver, whisper)? And in that case, maybe this can be abstracted a bit.

Copy link
Contributor Author

@michaelsbradleyjr michaelsbradleyjr Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea @emizzle. I'll create a pivotal task for that pending this PR being merged.

req.on('response', (_response) => {
callback();
});
const doCallback = (v) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is v? Can we have a better name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, seems like doCallback returns a boolean about if it should callback, but the name implies that it will actually callback. We should change that name too to something like shouldCallback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a silent "should" in front of the variable name... You're right, shouldCallback is better, will change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

setImmediate(() => {
pingEndpoint(host, port, type, protocol, origin, callback, ++count);
});
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this if with a return to limit the depth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few ideas when I woke up this morning about how to improve control flow, will be making those now and will keep in mind opportunities for early return.

One thing, though, is that since pingEndpoint is doing asynchronous stuff and the return value is effectively ignored, there's no stack growth.

Copy link
Contributor Author

@michaelsbradleyjr michaelsbradleyjr Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A number of changes have been made. They've been squashed into the one commit for this PR, but if you want to see what changed specifically you can check 8dfef2a — I was using that branch to test revised pingEndpoint and the proxy revisions together.

I tested on macOS, Windows, and desktop Linux, with websocket proxy and HTTP-only proxy. For all cases, I tried with embark run and blockchain in separate terminals and with single-terminal embark run, checking for blockchain process crashes when killing run and for hangs during startup. I tried jamming up the load on the CPU to see if I could get an indefinite stall... all very tedious, but I wanted to double-check everything so we can wrap this up with as much confidence as possible.

Everything worked correctly in all cases. Interestingly, on Linux I did one time see Proxy: Network error 'read ECONNRESET' reported in the output of embark blockchain. So I guess it's not just Windows on which that can happen.

I introduced an exponential increase in the timeout, with a max timeout of 3 seconds. As before, it will keep retrying for up to 10 minutes, after which the behavior becomes under defined as described previously.

I eliminated an edge case where a retry might be ongoing but another event might fire on a previous request. With my latest changes, if a request timeout has triggered a retry then additional events on the previous request will be ignored.

Some branching logic for handling error/success events has been combined into maybeRetry which is renamed handleEvent. An early return could be used here and the else if branch would become an if; not sure if you have a strong preference on that change being made. The cleanup call needs to happen regardless of which branch executes, so there's not actually a case for early an return in handleEvent.

Finally, the changes re: the HTTP timeout were motivated by another look at the docs. The request.setTimeout() method doesn't take affect until after a socket connection has been made, which is the behavior we want so it's cleaner to use it than to setup a waterfall of event handlers.


const handleError = (req, closeMethod) => {
req.on('error', (err) => {
if (!retrying) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would early return here with:

if (retrying) {
  return cleanup(req, closeMethod);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reasons unknown, sockets underlying http and websocket requests made by
`utils.pingEndpoint` could become stuck and after a *long* time eventually hang
up. This can happen even when the endpoint being pinged is perfectly
responsive, e.g. when checking it with `curl` in another terminal. The root
cause may be a bug or poorly documented behavior in node.js itself. The effects
of the stuck sockets are experienced more often when the CPU is under
moderate-heavy load, and also more often when pinging embark's blockchain proxy
vs. pinging geth, ipfs, etc. directly.

Use timeouts to cut short the stuck sockets. Usually, after one-to-several
retries (which may get stuck) following a stuck socket, connections behave as
expected (learned by experience). Assume that `T` time of inactivity on a "ping
socket" indicates that socket is stuck; increase `T` exponentially on each
retry until it reaches a max of 3000ms, and don't continue retrying beyond
600000ms (10 minutes). Make the initial `T` value 100ms. Once a retry is
underway following a timeout event, ignore all further events on the previous
ping attempt.

Note that when the CPU is moderately busy, one of the slowest steps is
launching embark's child processes, e.g. ipfs, geth, etc. However, even under
very heavy load, with the revised `pingEndpoint` embark seems to always
eventually start/detect services correctly. And under light to moderate load,
the effect of the stuck sockets is much less noticeable.

Use a weboscket client from the `ws` package instead of manually building
websocket headers.

Use `pingEndpoint` for the IPFS module's `_checkService` method to avoid the
same stuck socket problem that was affecting blockchain detection.
};

if (type === 'ws') {
const req = new (require('ws'))(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll be better of having all necessary require() statements at the top of the file instead of requiring within a function.

Especially if we happen to use (in this case) require('ws') in multiple places.

@michaelsbradleyjr
Copy link
Contributor Author

Closing, will be replaced with a PR that uses a ws client but doesn't involve the additional code for retrying on socket timeouts.

michaelsbradleyjr added a commit that referenced this pull request Dec 14, 2018
The problems described in embark PR #1166 can be resolved by implementing the
blockchain proxy with `http-proxy` directly instead of using `express` together
with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the
"stuck sockets" problems described in #1166) is unknown.

The need to swallow some errors as described in embark PR #1181 is also
eliminated by dropping `http-proxy-middleware` and `express`.
michaelsbradleyjr added a commit that referenced this pull request Dec 14, 2018
The problems described in embark PR #1166 can be resolved by implementing the
blockchain proxy with `http-proxy` directly instead of using `express` together
with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the
"stuck sockets" problems described in #1166) is unknown.

The need to swallow some errors as described in embark PR #1181 is also
eliminated by dropping `http-proxy-middleware` and `express`.
michaelsbradleyjr added a commit that referenced this pull request Dec 14, 2018
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there
is no longer a need to use timeouts. However a few aspects of the original PR
are still useful, and lessons learned from #1166, #1181, and #1195 can be put
to good use.

Use a websocket client from the `ws` package when pinging websocket endpoints
instead of manually building a request header. The `'upgrade'` event being
listened for was never actually firing; and though a response was received for
those pings, the response messages indicated problems with those requests. It
seems cleaner to use a proper websocket client and callback success upon the
`ws` client's `'open'` event.

Abstract error and success handling across websocket and http pings.

Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is
expected, as that genuinely indicates an endpoint isn't accepting connections
at the specified host and port. If other kinds of network errors are occurring,
it will be helpful to have a visual indicator to prompt investigation.

After success or the first error, cleanup the ping's request/connection
immediately since we're not awaiting `'data'` events on an http request and we
don't want to leave a websocket connection open. Don't callback any additional
`'error'` events that might fire after the first `'error'` event, but do report
them.
michaelsbradleyjr added a commit that referenced this pull request Dec 14, 2018
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there
is no longer a need to use timeouts. However a few aspects of the original PR
are still useful, and lessons learned from #1166, #1181, and #1195 can be put
to good use.

Use a websocket client from the `ws` package when pinging websocket endpoints
instead of manually building a request header. The `'upgrade'` event being
listened for was never actually firing; and though a response was received for
those pings, the response messages indicated problems with those requests. It
seems cleaner to use a proper websocket client and callback success upon the
`ws` client's `'open'` event.

Abstract error and success handling across websocket and http pings.

Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is
expected, as that genuinely indicates an endpoint isn't accepting connections
at the specified host and port. If other kinds of network errors are occurring,
it will be helpful to have a visual indicator to prompt investigation.

After success or the first error, cleanup the ping's request/connection
immediately since we're not awaiting `'data'` events on an http request and we
don't want to leave a websocket connection open. Don't callback any `'error'`
events that might fire after the first `'error'` event or a success event, but
do report them.
michaelsbradleyjr added a commit that referenced this pull request Dec 17, 2018
The problems described in embark PR #1166 can be resolved by implementing the
blockchain proxy with `http-proxy` directly instead of using `express` together
with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the
"stuck sockets" problems described in #1166) is unknown.

The need to swallow some errors as described in embark PR #1181 is also
eliminated by dropping `http-proxy-middleware` and `express`.
michaelsbradleyjr added a commit that referenced this pull request Dec 18, 2018
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there
is no longer a need to use timeouts. However a few aspects of the original PR
are still useful, and lessons learned from #1166, #1181, and #1195 can be put
to good use.

Use a websocket client from the `ws` package when pinging websocket endpoints
instead of manually building a request header. The `'upgrade'` event being
listened for was never actually firing; and though a response was received for
those pings, the response messages indicated problems with those requests. It
seems cleaner to use a proper websocket client and callback success upon the
`ws` client's `'open'` event.

Abstract error and success handling across websocket and http pings.

Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is
expected, as that genuinely indicates an endpoint isn't accepting connections
at the specified host and port. If other kinds of network errors are occurring,
it will be helpful to have a visual indicator to prompt investigation.

After success or the first error, cleanup the ping's request/connection
immediately since we're not awaiting `'data'` events on an http request and we
don't want to leave a websocket connection open. Don't callback any `'error'`
events that might fire after the first `'error'` event or a success event, but
do report them.
michaelsbradleyjr added a commit that referenced this pull request Dec 18, 2018
The problems described in embark PR #1166 can be resolved by implementing the
blockchain proxy with `http-proxy` directly instead of using `express` together
with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the
"stuck sockets" problems described in #1166) is unknown.

The need to swallow some errors as described in embark PR #1181 is also
eliminated by dropping `http-proxy-middleware` and `express`.
michaelsbradleyjr added a commit that referenced this pull request Dec 18, 2018
The problems described in embark PR #1166 can be resolved by implementing the
blockchain proxy with `http-proxy` directly instead of using `express` together
with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the
"stuck sockets" problems described in #1166) is unknown.

The need to swallow some errors as described in embark PR #1181 is also
eliminated by dropping `http-proxy-middleware` and `express`.
michaelsbradleyjr added a commit that referenced this pull request Dec 18, 2018
This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there
is no longer a need to use timeouts. However a few aspects of the original PR
are still useful, and lessons learned from #1166, #1181, and #1195 can be put
to good use.

Use a websocket client from the `ws` package when pinging websocket endpoints
instead of manually building a request header. The `'upgrade'` event being
listened for was never actually firing; and though a response was received for
those pings, the response messages indicated problems with those requests. It
seems cleaner to use a proper websocket client and callback success upon the
`ws` client's `'open'` event.

Abstract error and success handling across websocket and http pings.

Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is
expected, as that genuinely indicates an endpoint isn't accepting connections
at the specified host and port. If other kinds of network errors are occurring,
it will be helpful to have a visual indicator to prompt investigation.

After success or the first error, cleanup the ping's request/connection
immediately since we're not awaiting `'data'` events on an http request and we
don't want to leave a websocket connection open. Don't callback any `'error'`
events that might fire after the first `'error'` event or a success event, but
do report them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants