
Conversation

@justin808 (Member) commented Sep 16, 2021

  • Add ssr_timeout configuration so the Rails server will not wait more than this many seconds for an SSR request to return once issued.
  • Change default for renderer_use_fallback_exec_js to false.
  • Change default log level to info.
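
For illustration, a minimal sketch of how these settings might look in an initializer, assuming the standard ReactOnRailsPro.configure block; the timeout value shown is illustrative, not a recommended default:

    # config/initializers/react_on_rails_pro.rb
    # Sketch only; numeric value is illustrative.
    ReactOnRailsPro.configure do |config|
      # Give up on an SSR request if the node renderer has not responded
      # within this many seconds.
      config.ssr_timeout = 5

      # New default per this PR: do not fall back to ExecJS rendering.
      config.renderer_use_fallback_exec_js = false
    end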


.circleci/config.yml, line 53 at r2:

    command: |
      gem install bundler
      echo "Bundler version: "; bundle --version
      bundle config set --local disable_checksum_validation true
justin808 (Member Author) commented:

@Judahmeek are we sure this is the right fix?

A contributor asked:

Is it secure?

Judahmeek (Contributor) replied:

It's a security vulnerability, and the only reason I did it is that this is for CI only.

A better alternative is probably finding a libv8 version that doesn't fail checksum validation.

@dzirtusss (Contributor) left a comment:

Overall, you terminate here at the request level, so most probably the request backend (node/execjs) will continue to run, maybe even for quite a long time, which is not great in general. So maybe it is worth a second iteration of this PR where a termination signal is sent to the renderer and the renderer kills the process. The JS backend might be smart enough to terminate on connection close (in which case all is fine), but most probably it is not.

:lgtm:

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Judahmeek)


.circleci/config.yml, line 53 at r2 (raw file):

Previously, justin808 (Justin Gordon) wrote…

@Judahmeek are we sure this is the right fix?

I am not very sure why it is needed; sorry, maybe I just don't know the reason. It is also not obvious why, so I would definitely put a NOTE: comment with some explanation.

@justin808 (Member Author) commented:

@dzirtusss

Overall, you terminate here at the request level, so most probably the request backend (node/execjs) will continue to run, maybe even for quite a long time, which is not great in general. So maybe it is worth a second iteration of this PR where a termination signal is sent to the renderer and the renderer kills the process. The JS backend might be smart enough to terminate on connection close (in which case all is fine), but most probably it is not.

@Romex91 and I discussed this at length.

There are 2 main types of problems:

  1. An infinite loop leading to out-of-memory, which crashes the process; it is currently restarted, and this PR provides that information to Sentry. PM2 can't do much about this other than restart the process.
  2. A slower memory leak, which also leads to out-of-memory. This is not currently handled, other than that an auto-restart interval can be specified. PM2 has a good solution: restart a process when its memory use grows past a limit.

So the longer-term solution is to switch to PM2.

Shorter-term, regarding your suggestion to tell the cluster to kill the process, the problem is that the Rails server does not know which cluster took the request.

However, we find that the bigger issue is retrying problematic requests. Eventually, we'd like to track the fingerprint of requests that cause a crash and block them until the server restarts.

In order to do that, we'd need some sort of shared data store, like Redis or Memcached, and some middleware calls to track (sketched below):

  1. Request came in: add a request fingerprint entry, maybe with a timestamp.
  2. Request was responded to: remove the request fingerprint entry.

For thread-starvation crashes, the "request was responded to" handling would never run.

However, given that such thread-starvation crashes caused by particular URLs are rare and often fixed quickly, this might not be a high priority.
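
A rough sketch of that shared-store tracking as a Rack middleware, assuming Redis; the class name, Redis key, and fingerprinting scheme are hypothetical, not the actual Pro implementation:

    # Hypothetical Rack middleware sketch for the shared-store idea above.
    # Assumes the redis gem; class and key names are illustrative only.
    require "digest"
    require "redis"

    class SsrRequestFingerprintTracker
      def initialize(app, redis: Redis.new)
        @app = app
        @redis = redis
      end

      def call(env)
        fingerprint = Digest::SHA256.hexdigest(env["PATH_INFO"].to_s)
        # 1. Request came in: record the fingerprint with a timestamp.
        @redis.hset("ssr_in_flight", fingerprint, Time.now.to_i)
        response = @app.call(env)
        # 2. Request was responded to: remove the fingerprint entry.
        #    A thread-starvation crash never reaches this line, so the
        #    entry stays behind as evidence of the problematic request.
        @redis.hdel("ssr_in_flight", fingerprint)
        response
      end
    end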


On the other hand, it would be relatively easy for the React on Rails Pro Ruby part to keep a Hash with the count of requests that caused timeouts, so it can prevent sending such requests to the node renderer.

request fingerprint => { path: , count: }

I'll add a config that specifies how many times a given request should have a timeout before being blocked.

config.ssr_timeouts_before_blocking

I'll add this to the configuration, and if a certain request times out more than the configured amount:

  1. The request information is logged to Sentry to notify that this type of request is being blocked.
  2. Any further requests matching this fingerprint are blocked.

If a tiny, random number of requests time out due to high traffic, any single one of them is unlikely to accumulate, say, 4 timeouts.
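
A rough sketch of that in-process tracking; the class, its methods, and the reset hook are hypothetical, with only the fingerprint => { path:, count: } shape and the blocking threshold taken from the proposal above:

    # Hypothetical sketch of the fingerprint => { path:, count: } idea.
    # Class and method names are illustrative, not the actual gem code.
    class SsrTimeoutBlocker
      def initialize(timeouts_before_blocking)
        @timeouts_before_blocking = timeouts_before_blocking
        @timeouts = Hash.new { |hash, key| hash[key] = { path: nil, count: 0 } }
        @mutex = Mutex.new
      end

      def record_timeout(fingerprint, path)
        @mutex.synchronize do
          entry = @timeouts[fingerprint]
          entry[:path] = path
          entry[:count] += 1
        end
      end

      # Block the request once it has timed out more than the configured amount.
      def blocked?(fingerprint)
        @mutex.synchronize { @timeouts[fingerprint][:count] > @timeouts_before_blocking }
      end

      # Manual escape hatch so functional pages can be un-blocked without a restart.
      def reset!
        @mutex.synchronize { @timeouts.clear }
      end
    end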

@justin808 (Member Author) commented:

Note: testing for timeouts can be done via #136 (comment).

@justin808 (Member Author) commented:

@dzirtusss @Romex91 Draft changes for the next level of this PR:

justin808/add-ssr-timeout...justin808/add-ssr-timout-plus-checks

Should this PR go in as-is? Or add the check for any requests that keep erroring?

justin808 force-pushed the justin808/add-ssr-timeout branch from 5419ae5 to 81369a0 on September 22, 2021 02:50
spec/dummy/config/initializers/react_on_rails_pro.rb, line 29 at r4:

    # Retry request in case of time out on the node-renderer side
    # 0 - no retry
    config.renderer_request_retry_limit = 5
    config.renderer_request_retry_limit = 1
Romex91 (Contributor) commented:

IMO this parameter should be deprecated. I would delete it right away, or add a warning:
"You are using renderer_request_retry_limit > 0, which is most likely a bad idea."

This option only starts working when a node renderer doesn't respond for a long time.
I can think of these potential explanations for why the renderer doesn't respond:

  1. There is a bug in the renderer (infinite loop, etc.). Must be reported to Sentry ASAP. Retries will make the problem worse by knocking out more workers.
  2. The page is extremely huge and takes a lot of time to render. Retries will make the situation worse by keeping workers busy. Must be reported to Sentry ASAP, so devs can address the problem and reduce the size of the page.
  3. The server is running out of resources. Retries will create MORE load. An overloaded server won't benefit from MORE WORK. Must be reported to Sentry ASAP so admins can upgrade the server.
  4. The node renderer is located on another machine, and there is a connection problem. This is not OK! The decision to split the server into separate locations led to connectivity issues. Must be reported to Sentry ASAP so admins can investigate why the machines have trouble reaching each other (maybe they should be forced to occupy the same data center or something like that).

All of these cases are problems. All of them need to be analyzed and addressed. Silent retries only mask the issues, and they often create domino effects like: some page is huge => retries => more server load => more pages time out => more retries => even more server load.

Retries are only good when dealing with expected and acceptable errors (e.g., electromagnetic bit flips on Ethernet). Node renderer failures are not such a case, because they are neither expected nor acceptable.
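
For illustration, a sketch of what the suggested warning might look like in the gem's configuration code; the surrounding code, logger call, and exact wording are hypothetical:

    # Hypothetical deprecation warning for renderer_request_retry_limit > 0.
    if config.renderer_request_retry_limit.to_i > 0
      Rails.logger.warn(
        "You are using renderer_request_retry_limit > 0, which is most likely a bad idea. " \
        "Retries mask renderer bugs, oversized pages, resource exhaustion, and connectivity " \
        "problems; report these to Sentry and fix them instead."
      )
    end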

@Romex91 (Contributor) commented Sep 22, 2021

Should this PR go in as-is? Or add the check for any requests that keep erroring?

About the "check for requests that keep erroring" idea: do we lose the possibility of quick-fixing problems by tweaking live DB values, as we did with the infinite loop in the gallery disaster?
If yes, it gets hard to explain what's going on to RORP users, and it will be equally hard to explain to future us. Imagine someone dealing with server downtime trying to quick-fix the problem by editing the DB. They edit the DB, BUT IT DOESN'T HELP. They panic and make more changes in the DB without realizing that the error page is cached.

Also, it seems a bit like solving the wrong problem.
The problem we had was that an infinite loop on one page can potentially affect other pages. So, instead of fixing the true reasons (workers are not memory-leak resistant, and retries result in more server load, hence the domino effect), we fix the cumulative consequences.

The fix is still good, BUT:

  • Will it reset the error counter when some DB values are changed?

@justisb commented Sep 22, 2021

We at least need a way to manually clear the error counter, document it thoroughly, and loudly announce when pages are being blocked from rendering, so that if the fix is applied ahead of another deployment/reboot, there's a very straightforward way to get the functional pages back out of the blocklist.

@dzirtusss (Contributor) left a comment:

@justin808 regarding your suggestion to tell the cluster to kill the process, the problem is that the Rails server does not know which cluster took the request

Overall, I was thinking that it's not Rails but the main.js process (or worker.js itself) that should do this. E.g., you have a first timeout of 5 sec in Rails and a second kill timeout of 10 sec in the worker.

Also, await sleep XXXX is not very good for timeout testing, as it gracefully gives CPU control to another task. E.g., while(1) ... would create a more realistic scenario and eat up all the CPU, like a real-life infinite loop.

Reviewed 1 of 3 files at r3, 1 of 1 files at r4, 8 of 9 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @justin808 and @Romex91)


.circleci/config.yml, line 53 at r2 (raw file):

Previously, Judahmeek (Judah Meek) wrote…

It's a security vulnerability, and the only reason I did it is that this is for CI only.

A better alternative is probably finding a libv8 version that doesn't fail checksum validation.

@Judahmeek why not put a comment? Then the next person will know the reason and can possibly remove it once it's no longer needed.


spec/dummy/config/initializers/react_on_rails_pro.rb, line 29 at r4 (raw file):

Previously, Romex91 (Roman Kuksin) wrote…

(full comment quoted above: renderer_request_retry_limit should be deprecated or trigger a warning, since silent retries only mask renderer bugs, oversized pages, resource exhaustion, and connectivity problems)

Agree. Plus, for real users this will be (or should be) browser-rendered without SSR in case of error, so it mostly affects bots.

- Add `ssr_timeout` configuration so the Rails server will not wait more than this many seconds for an SSR request to return once issued.
- Change default for `renderer_use_fallback_exec_js` to `false`.
- Change default log level to info.
Judahmeek force-pushed the justin808/add-ssr-timeout branch from ddd8126 to 17113ec on October 18, 2021 23:42
justin808 merged commit 1bb654c into master on Oct 18, 2021
justin808 deleted the justin808/add-ssr-timeout branch on October 18, 2021 23:58