refactor(@embark/utils): use timeouts to improve utils.pingEndpoint #1166

michaelsbradleyjr · 2018-12-07T02:51:35Z

For reasons unknown, sockets underlying http and websocket requests made by utils.pingEndpoint could become stuck and after a long time eventually hang up. This can happen even when the endpoint being pinged is perfectly responsive, e.g. when checking it with curl in another terminal. The root cause may be a bug or poorly documented behavior in node.js itself. The effects of the stuck sockets are experienced more often when the CPU is under moderate-heavy load, and also more often when pinging embark's blockchain proxy vs. pinging geth, ipfs, etc. directly.

Use timeouts to cut short the stuck sockets. Usually, after one-to-several retries (which may get stuck) following a stuck socket, connections behave as expected (learned by experience). Assume that T time of inactivity on a "ping socket" indicates that socket is stuck; increase T exponentially on each retry until it reaches a max of 3000ms, and don't continue retrying beyond 600000ms (10 minutes). Make the initial T value 100ms. Once a retry is underway following a timeout event, ignore all further events on the previous ping attempt.

Note that when the CPU is moderately busy, one of the slowest steps is launching embark's child processes, e.g. ipfs, geth, etc. However, even under very heavy load, with the revised pingEndpoint embark seems to always eventually start/detect services correctly. And under light to moderate load, the effect of the stuck sockets is much less noticeable.

Use a weboscket client from the ws package instead of manually building websocket headers.

Use pingEndpoint for the IPFS module's _checkService method to avoid the same stuck socket problem that was affecting blockchain detection.

alaibe

I would really like to see what others think about that PRs. I personally feel it is really complex and not normal we have to do such things.

I would be surprise there is a problem in node regarding curl as it is used a lot.
I don't think we are really fixing the root cause here but a side effect of the problem
e.g if a library is a problem we should find a way to fix it.

jrainville

I agree with @alaibe that this is really complex just for a ping. If the provider is capable of talking to the endpoint correctly but we can't ping it in a timely fashion, there must be a flaw in how we do it or the libs we use.

I saw you said that even under great CPU pressure, curl calls still worked.
Maybe we should try to replace our current ping by a simple exec curl. I saw that it is possible to do it with a websocket too: https://gist.github.com/htp/fbce19069187ec1cc486b594104f01d0

michaelsbradleyjr · 2018-12-07T16:16:45Z

@alaibe, excellent points. I will try to make my case for the complexity.

I believe the problem is related to node.js itself, but I can only describe what's happening, I don't know the real reason. What I've observed seems similar to what's described in nodejs/node#20256, but happening for the client when the server is not necessarily node.js (could be geth or ipfs).

It is affecting clients created with built-in require('http') (so utils.pingEndpoint and utils.getJson are affected). It's also affecting clients created with require('ws') in the same way, probably because that package uses the built-in http.

Bug Description

Assume some endpoint is up and listening at http://localhost:12345, i.e. you can verify it's answering requests properly with some external tool like curl or a browser.
In node.js create a request with require('http').
Sometimes the underlying socket of the request is immediately stuck.
After a long time there will be an 'error' event on the request reporting a socket hang up.
If there is a successful retry, then afterward sockets don't seem to get stuck when making requests to the endpoint.

This problem is why when doing embark run it will sometimes seem to freeze during detection of the Ethereum node or the IPFS node (or both). It can also cause the blockchain proxy to take a long time to be ready, because the proxy uses pingEndpoint to detect the target. The bug seems to occur more frequently when the CPU is busy.

We can solve the problem with timeouts with the assumption that a request taking longer than T time to finish (with error or success) is probably a stuck socket. This allows the ECONNREFUSED error to work correctly for us, detecting that an endpoint isn't actually available. There is a risk that if T is too short then non-stuck sockets may not have time to receive a response. I chose 125ms, which seems like a lot of time for a request to a localhost process, but it may need to be longer.

The reason for the complexity in implementing the solution is that I discovered the socket timeouts don't always work, they can get stuck too — probably related to the same bug. So I doubled-up with setTimeout, and it works. BUT with potentially overlapped requests it gets tricky to handle events and cleanup correctly, and we have to make sure the callback isn't called multiple times too.

michaelsbradleyjr · 2018-12-07T16:26:34Z

@jrainville it would be possible to do the ping in a child process with an external tool, good idea. However, the downside of doing so is that exec can be pretty slow, especially if the CPU is busy.

In the process of trying to figure out the problem, I observed that when my CPU is busy it takes awhile for ips and/or geth processes to actually start, which adds to the perception that embark run is hanging. So execing curl may have a similar problem, taking a relatively long time to provide node.js with the result.

See my comment to @alaibe re: this being a problem with node itself — at this point, I'm fairly sure that's the case, though I would love to find out that it's something we're doing wrong or can do better.

michaelsbradleyjr · 2018-12-10T17:30:39Z

This PR doesn't work on Windows, changes needed.

michaelsbradleyjr · 2018-12-12T00:16:22Z

This PR works together with #1181 to prevent hangs and crashes when running / stopping embark processes.

When combining the code changes in this PR with those in #1181 and doing more experimentation, I found that it's unnecessary to have secondary timeouts, i.e. the ones created with setTimeout in an earlier version of this PR's commit. So, I removed that code and now the revised pingEndpoint is a little simpler, though it still suffers from complexity that I don't know how to further reduce at present.

Note that the maximum number of 1000ms connection attempts results in the same maximum waitOnTarget time of the proxy, i.e. 600000ms (10 minutes). The behavior when waitOnTarget and/or the retries goes beyond T time limit is under defined. In practice, things often work out because the blockchain is actually running and it was spurious network errors that resulted in the timeouts, or it wasn't and embark will fallback to launching it in another process. By setting T to a very high value, we allow pingEndpoint's retry behavior to (hopefully) eventually result in a genuine ECONNREFUSED error or a successful connection. 10 minutes is a long time to wait, most users would probably Ctrl-c if it's stuck for more than a few, helping to avoid the under defined behavior from becoming a frequent occurrence.

emizzle · 2018-12-12T01:12:21Z

src/lib/modules/ipfs/index.js

+          _cb();
+        }
+      }
+    );


Should we extend this out to all other service checks (swarm, webserver, whisper)? And in that case, maybe this can be abstracted a bit.

Good idea @emizzle. I'll create a pivotal task for that pending this PR being merged.

jrainville · 2018-12-12T13:34:05Z

src/lib/utils/utils.js

-  req.on('response', (_response) => {
-    callback();
-  });
+  const doCallback = (v) => {


What is v? Can we have a better name

Also, seems like doCallback returns a boolean about if it should callback, but the name implies that it will actually callback. We should change that name too to something like shouldCallback

It was a silent "should" in front of the variable name... You're right, shouldCallback is better, will change it.

jrainville · 2018-12-12T13:38:20Z

src/lib/utils/utils.js

+        setImmediate(() => {
+          pingEndpoint(host, port, type, protocol, origin, callback, ++count);
+        });
+      } else {


Could you remove this if with a return to limit the depth.

I had a few ideas when I woke up this morning about how to improve control flow, will be making those now and will keep in mind opportunities for early return.

One thing, though, is that since pingEndpoint is doing asynchronous stuff and the return value is effectively ignored, there's no stack growth.

A number of changes have been made. They've been squashed into the one commit for this PR, but if you want to see what changed specifically you can check 8dfef2a — I was using that branch to test revised pingEndpoint and the proxy revisions together.

I tested on macOS, Windows, and desktop Linux, with websocket proxy and HTTP-only proxy. For all cases, I tried with embark run and blockchain in separate terminals and with single-terminal embark run, checking for blockchain process crashes when killing run and for hangs during startup. I tried jamming up the load on the CPU to see if I could get an indefinite stall... all very tedious, but I wanted to double-check everything so we can wrap this up with as much confidence as possible.

Everything worked correctly in all cases. Interestingly, on Linux I did one time see Proxy: Network error 'read ECONNRESET' reported in the output of embark blockchain. So I guess it's not just Windows on which that can happen.

I introduced an exponential increase in the timeout, with a max timeout of 3 seconds. As before, it will keep retrying for up to 10 minutes, after which the behavior becomes under defined as described previously.

I eliminated an edge case where a retry might be ongoing but another event might fire on a previous request. With my latest changes, if a request timeout has triggered a retry then additional events on the previous request will be ignored.

Some branching logic for handling error/success events has been combined into maybeRetry which is renamed handleEvent. ~~An early return could be used here and the else if branch would become an if; not sure if you have a strong preference on that change being made.~~ The cleanup call needs to happen regardless of which branch executes, so there's not actually a case for early an return in handleEvent.

Finally, the changes re: the HTTP timeout were motivated by another look at the docs. The request.setTimeout() method doesn't take affect until after a socket connection has been made, which is the behavior we want so it's cleaner to use it than to setup a waterfall of event handlers.

jrainville · 2018-12-12T13:39:08Z

src/lib/utils/utils.js

+
+  const handleError = (req, closeMethod) => {
+    req.on('error', (err) => {
+      if (!retrying) {


Would early return here with:

if (retrying) { return cleanup(req, closeMethod); }

See: #1166 (comment)

For reasons unknown, sockets underlying http and websocket requests made by `utils.pingEndpoint` could become stuck and after a *long* time eventually hang up. This can happen even when the endpoint being pinged is perfectly responsive, e.g. when checking it with `curl` in another terminal. The root cause may be a bug or poorly documented behavior in node.js itself. The effects of the stuck sockets are experienced more often when the CPU is under moderate-heavy load, and also more often when pinging embark's blockchain proxy vs. pinging geth, ipfs, etc. directly. Use timeouts to cut short the stuck sockets. Usually, after one-to-several retries (which may get stuck) following a stuck socket, connections behave as expected (learned by experience). Assume that `T` time of inactivity on a "ping socket" indicates that socket is stuck; increase `T` exponentially on each retry until it reaches a max of 3000ms, and don't continue retrying beyond 600000ms (10 minutes). Make the initial `T` value 100ms. Once a retry is underway following a timeout event, ignore all further events on the previous ping attempt. Note that when the CPU is moderately busy, one of the slowest steps is launching embark's child processes, e.g. ipfs, geth, etc. However, even under very heavy load, with the revised `pingEndpoint` embark seems to always eventually start/detect services correctly. And under light to moderate load, the effect of the stuck sockets is much less noticeable. Use a weboscket client from the `ws` package instead of manually building websocket headers. Use `pingEndpoint` for the IPFS module's `_checkService` method to avoid the same stuck socket problem that was affecting blockchain detection.

0x-r4bbit · 2018-12-14T14:06:50Z

src/lib/utils/utils.js

+  };
+
+  if (type === 'ws') {
+    const req = new (require('ws'))(


I think we'll be better of having all necessary require() statements at the top of the file instead of requiring within a function.

Especially if we happen to use (in this case) require('ws') in multiple places.

michaelsbradleyjr · 2018-12-14T17:14:32Z

Closing, will be replaced with a PR that uses a ws client but doesn't involve the additional code for retrying on socket timeouts.

The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.

This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any additional `'error'` events that might fire after the first `'error'` event, but do report them.

This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any `'error'` events that might fire after the first `'error'` event or a success event, but do report them.

The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.

This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any `'error'` events that might fire after the first `'error'` event or a success event, but do report them.

The problems described in embark PR #1166 can be resolved by implementing the blockchain proxy with `http-proxy` directly instead of using `express` together with `http-proxy-middleware`. The ultimate cause of the buggy behavior (the "stuck sockets" problems described in #1166) is unknown. The need to swallow some errors as described in embark PR #1181 is also eliminated by dropping `http-proxy-middleware` and `express`.

This PR replaces #1166. The "stuck sockets" bug is addressed in #1195 so there is no longer a need to use timeouts. However a few aspects of the original PR are still useful, and lessons learned from #1166, #1181, and #1195 can be put to good use. Use a websocket client from the `ws` package when pinging websocket endpoints instead of manually building a request header. The `'upgrade'` event being listened for was never actually firing; and though a response was received for those pings, the response messages indicated problems with those requests. It seems cleaner to use a proper websocket client and callback success upon the `ws` client's `'open'` event. Abstract error and success handling across websocket and http pings. Report network errors other than `ECONNREFUSED`. Only `ECONNREFUSED` is expected, as that genuinely indicates an endpoint isn't accepting connections at the specified host and port. If other kinds of network errors are occurring, it will be helpful to have a visual indicator to prompt investigation. After success or the first error, cleanup the ping's request/connection immediately since we're not awaiting `'data'` events on an http request and we don't want to leave a websocket connection open. Don't callback any `'error'` events that might fire after the first `'error'` event or a success event, but do report them.

michaelsbradleyjr requested a review from a team December 7, 2018 02:51

michaelsbradleyjr force-pushed the refactor/pingEndpoint branch 6 times, most recently from 6ce0971 to 1f955bd Compare December 7, 2018 05:13

alaibe reviewed Dec 7, 2018

View reviewed changes

jrainville reviewed Dec 7, 2018

View reviewed changes

michaelsbradleyjr force-pushed the refactor/pingEndpoint branch 2 times, most recently from b04518c to 3e6bd40 Compare December 8, 2018 04:20

michaelsbradleyjr added the changes needed label Dec 10, 2018

michaelsbradleyjr force-pushed the refactor/pingEndpoint branch from 3e6bd40 to c9ea017 Compare December 11, 2018 00:17

michaelsbradleyjr removed the changes needed label Dec 11, 2018

michaelsbradleyjr force-pushed the refactor/pingEndpoint branch 2 times, most recently from ad5ad17 to 25520c7 Compare December 11, 2018 23:30

michaelsbradleyjr mentioned this pull request Dec 12, 2018

refactor(@embark/blockchain_process): swallow errors, revise streams #1181

Merged

emizzle reviewed Dec 12, 2018

View reviewed changes

jrainville reviewed Dec 12, 2018

View reviewed changes

michaelsbradleyjr force-pushed the refactor/pingEndpoint branch from 25520c7 to 6db9def Compare December 12, 2018 21:13

michaelsbradleyjr force-pushed the refactor/pingEndpoint branch from 6db9def to 6824264 Compare December 13, 2018 17:55

0x-r4bbit reviewed Dec 14, 2018

View reviewed changes

michaelsbradleyjr closed this Dec 14, 2018

michaelsbradleyjr mentioned this pull request Dec 14, 2018

refactor(@embark/blockchain_process): remove http-proxy-middleware #1195

Merged

michaelsbradleyjr mentioned this pull request Dec 14, 2018

refactor(@embark/utils): use a 'ws' websocket client in pingEndpoint #1196

Merged

refactor(@embark/utils): use timeouts to improve utils.pingEndpoint #1166

refactor(@embark/utils): use timeouts to improve utils.pingEndpoint #1166

Uh oh!

Conversation

michaelsbradleyjr commented Dec 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alaibe left a comment

Choose a reason for hiding this comment

Uh oh!

jrainville left a comment

Choose a reason for hiding this comment

Uh oh!

michaelsbradleyjr commented Dec 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

michaelsbradleyjr commented Dec 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

michaelsbradleyjr commented Dec 10, 2018

Uh oh!

michaelsbradleyjr commented Dec 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

emizzle Dec 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelsbradleyjr Dec 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelsbradleyjr Dec 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelsbradleyjr commented Dec 14, 2018

Uh oh!

Uh oh!

michaelsbradleyjr commented Dec 7, 2018 •

edited

Loading

michaelsbradleyjr commented Dec 7, 2018 •

edited

Loading

michaelsbradleyjr commented Dec 7, 2018 •

edited

Loading

michaelsbradleyjr commented Dec 12, 2018 •

edited

Loading

emizzle Dec 12, 2018 •

edited

Loading

michaelsbradleyjr Dec 12, 2018 •

edited

Loading

michaelsbradleyjr Dec 12, 2018 •

edited

Loading