
Conversation

@justin808 (Member) commented Sep 16, 2021

  • Add ssr_timeout configuration so the Rails server will not wait more than this many seconds for an SSR request to return once issued.
  • Change default for renderer_use_fallback_exec_js to false.
  • Change default log level to info.
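
For illustration, a minimal sketch of how these settings might look in an initializer, assuming the standard ReactOnRailsPro.configure block; the timeout value shown is illustrative, not a recommended default:

    # config/initializers/react_on_rails_pro.rb
    # Sketch only; numeric value is illustrative.
    ReactOnRailsPro.configure do |config|
      # Give up on an SSR request if the node renderer has not responded
      # within this many seconds.
      config.ssr_timeout = 5

      # New default per this PR: do not fall back to ExecJS rendering.
      config.renderer_use_fallback_exec_js = false
    end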


.circleci/config.yml, line 53 at r2:

    command: |
      gem install bundler
      echo "Bundler version: "; bundle --version
      bundle config set --local disable_checksum_validation true
justin808 (Member Author) commented:

@Judahmeek are we sure this is the right fix?

A contributor asked:

Is it secure?

Judahmeek (Contributor) replied:

It's a security vulnerability, and the only reason I did it is that this is for CI only.

A better alternative is probably finding a libv8 version that doesn't fail checksum validation.

@dzirtusss (Contributor) left a comment:

Overall, you terminate here at the request level, so most probably the request backend (node/execjs) will continue to run, maybe even for quite a long time, which is not great in general. So maybe it is worth a second iteration of this PR where a termination signal is sent to the renderer and the renderer kills the process. The JS backend might be smart enough to terminate on connection close (in which case all is fine), but most probably it is not.

:lgtm:

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Judahmeek)


.circleci/config.yml, line 53 at r2 (raw file):

Previously, justin808 (Justin Gordon) wrote…

@Judahmeek are we sure this is the right fix?

I am not very sure why it is needed; sorry, maybe I just don't know the reason. It is also not obvious why, so I would definitely put a NOTE: comment with some explanation.

@justin808 (Member Author) commented:

@dzirtusss

Overall, you terminate here at the request level, so most probably the request backend (node/execjs) will continue to run, maybe even for quite a long time, which is not great in general. So maybe it is worth a second iteration of this PR where a termination signal is sent to the renderer and the renderer kills the process. The JS backend might be smart enough to terminate on connection close (in which case all is fine), but most probably it is not.

@Romex91 and I discussed this at length.

There are 2 main types of problems:

  1. An infinite loop leading to out-of-memory, which crashes the process; it is currently restarted, and this PR provides that information to Sentry. PM2 can't do much about this other than restart the process.
  2. A slower memory leak, which also leads to out-of-memory. This is not currently handled, other than that an auto-restart interval can be specified. PM2 has a good solution: restart a process when its memory use grows past a limit.

So the longer-term solution is to switch to PM2.

Shorter-term, regarding your suggestion to tell the cluster to kill the process, the problem is that the Rails server does not know which cluster took the request.

However, we find that the bigger issue is retrying problematic requests. Eventually, we'd like to track the fingerprint of requests that cause a crash and block them until the server restarts.

In order to do that, we'd need some sort of shared data store, like Redis or Memcached, and some middleware calls to track (sketched below):

  1. Request came in: add a request fingerprint entry, maybe with a timestamp.
  2. Request was responded to: remove the request fingerprint entry.

For thread-starvation crashes, the "request was responded to" handling would never run.

However, given that such thread-starvation crashes caused by particular URLs are rare and often fixed quickly, this might not be a high priority.
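
A rough sketch of that shared-store tracking as a Rack middleware, assuming Redis; the class name, Redis key, and fingerprinting scheme are hypothetical, not the actual Pro implementation:

    # Hypothetical Rack middleware sketch for the shared-store idea above.
    # Assumes the redis gem; class and key names are illustrative only.
    require "digest"
    require "redis"

    class SsrRequestFingerprintTracker
      def initialize(app, redis: Redis.new)
        @app = app
        @redis = redis
      end

      def call(env)
        fingerprint = Digest::SHA256.hexdigest(env["PATH_INFO"].to_s)
        # 1. Request came in: record the fingerprint with a timestamp.
        @redis.hset("ssr_in_flight", fingerprint, Time.now.to_i)
        response = @app.call(env)
        # 2. Request was responded to: remove the fingerprint entry.
        #    A thread-starvation crash never reaches this line, so the
        #    entry stays behind as evidence of the problematic request.
        @redis.hdel("ssr_in_flight", fingerprint)
        response
      end
    end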


On the other hand, it would be relatively easy for the React on Rails Pro Ruby part to keep a Hash with the count of requests that caused timeouts, so it can prevent sending such requests to the node renderer.

request fingerprint => { path: , count: }

I'll add a config that specifies how many times a given request should have a timeout before being blocked.

config.ssr_timeouts_before_blocking

I'll add this to the configuration, and if a certain request times out more than the configured amount:

  1. The request information is logged to Sentry to notify that this type of request is being blocked.
  2. Any further requests matching this fingerprint are blocked.

If a tiny, random number of requests time out due to high traffic, any single one of them is unlikely to accumulate, say, 4 timeouts.
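
A rough sketch of that in-process tracking; the class, its methods, and the reset hook are hypothetical, with only the fingerprint => { path:, count: } shape and the blocking threshold taken from the proposal above:

    # Hypothetical sketch of the fingerprint => { path:, count: } idea.
    # Class and method names are illustrative, not the actual gem code.
    class SsrTimeoutBlocker
      def initialize(timeouts_before_blocking)
        @timeouts_before_blocking = timeouts_before_blocking
        @timeouts = Hash.new { |hash, key| hash[key] = { path: nil, count: 0 } }
        @mutex = Mutex.new
      end

      def record_timeout(fingerprint, path)
        @mutex.synchronize do
          entry = @timeouts[fingerprint]
          entry[:path] = path
          entry[:count] += 1
        end
      end

      # Block the request once it has timed out more than the configured amount.
      def blocked?(fingerprint)
        @mutex.synchronize { @timeouts[fingerprint][:count] > @timeouts_before_blocking }
      end

      # Manual escape hatch so functional pages can be un-blocked without a restart.
      def reset!
        @mutex.synchronize { @timeouts.clear }
      end
    end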

@justin808 (Member Author) commented:

Note: testing for timeouts can be done via #136 (comment).

@justin808 (Member Author) commented:

@dzirtusss @Romex91 Draft changes for the next level of this PR:

justin808/add-ssr-timeout...justin808/add-ssr-timout-plus-checks

Should this PR go in as-is? Or add the check for any requests that keep erroring?

justin808 force-pushed the justin808/add-ssr-timeout branch from 5419ae5 to 81369a0 on September 22, 2021 02:50
spec/dummy/config/initializers/react_on_rails_pro.rb, line 29 at r4:

    # Retry request in case of time out on the node-renderer side
    # 0 - no retry
    config.renderer_request_retry_limit = 5
    config.renderer_request_retry_limit = 1
Romex91 (Contributor) commented:

IMO this parameter should be deprecated. I would delete it right away, or add a warning:
"You are using renderer_request_retry_limit > 0, which is most likely a bad idea."

This option only starts working when a node renderer doesn't respond for a long time.
I can think of these potential explanations for why the renderer doesn't respond:

  1. There is a bug in the renderer (infinite loop, etc.). Must be reported to Sentry ASAP. Retries will make the problem worse by knocking out more workers.
  2. The page is extremely huge and takes a lot of time to render. Retries will make the situation worse by keeping workers busy. Must be reported to Sentry ASAP, so devs can address the problem and reduce the size of the page.
  3. The server is running out of resources. Retries will create MORE load. An overloaded server won't benefit from MORE WORK. Must be reported to Sentry ASAP so admins can upgrade the server.
  4. The node renderer is located on another machine, and there is a connection problem. This is not OK! The decision to split the server into separate locations led to connectivity issues. Must be reported to Sentry ASAP so admins can investigate why the machines have trouble reaching each other (maybe they should be forced to occupy the same data center or something like that).

All of these cases are problems. All of them need to be analyzed and addressed. Silent retries only mask the issues, and they often create domino effects like: some page is huge => retries => more server load => more pages time out => more retries => even more server load.

Retries are only good when dealing with expected and acceptable errors (e.g., electromagnetic bit flips on Ethernet). Node renderer failures are not such a case, because they are neither expected nor acceptable.
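
For illustration, a sketch of what the suggested warning might look like in the gem's configuration code; the surrounding code, logger call, and exact wording are hypothetical:

    # Hypothetical deprecation warning for renderer_request_retry_limit > 0.
    if config.renderer_request_retry_limit.to_i > 0
      Rails.logger.warn(
        "You are using renderer_request_retry_limit > 0, which is most likely a bad idea. " \
        "Retries mask renderer bugs, oversized pages, resource exhaustion, and connectivity " \
        "problems; report these to Sentry and fix them instead."
      )
    end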

@Romex91 (Contributor) commented Sep 22, 2021

Should this PR go in as-is? Or add the check for any requests that keep erroring?

About the "check for requests that keep erroring" idea: do we lose the possibility of quick-fixing problems by tweaking live DB values, as we did with the infinite loop in the gallery disaster?
If yes, it gets hard to explain what's going on to RORP users, and it will be equally hard to explain to future us. Imagine someone dealing with server downtime trying to quick-fix the problem by editing the DB. They edit the DB, BUT IT DOESN'T HELP. They panic and make more changes in the DB without realizing that the error page is cached.

Also, it seems a bit like solving the wrong problem.
The problem we had was that an infinite loop on one page can potentially affect other pages. So, instead of fixing the true reasons (workers are not memory-leak resistant, and retries result in more server load, hence the domino effect), we fix the cumulative consequences.

The fix is still good, BUT:

  • Will it reset the error counter when some DB values are changed?

@justisb commented Sep 22, 2021

We at least need a way to manually clear the error counter, document it thoroughly, and loudly announce when pages are being blocked from rendering, so that if the fix is applied ahead of another deployment/reboot, there's a very straightforward way to get the functional pages back out of the blocklist.

@dzirtusss (Contributor) left a comment:

@justin808 regarding your suggestion to tell the cluster to kill the process, the problem is that the Rails server does not know which cluster took the request

Overall, I was thinking that it's not Rails but the main.js process (or worker.js itself) that should do this. E.g., you have a first timeout of 5 sec in Rails and a second kill timeout of 10 sec in the worker.

Also, await sleep XXXX is not very good for timeout testing, as it gracefully gives CPU control to another task. E.g., while(1) ... would create a more realistic scenario and eat up all the CPU, like a real-life infinite loop.

Reviewed 1 of 3 files at r3, 1 of 1 files at r4, 8 of 9 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @justin808 and @Romex91)


.circleci/config.yml, line 53 at r2 (raw file):

Previously, Judahmeek (Judah Meek) wrote…

It's a security vulnerability, and the only reason I did it is that this is for CI only.

A better alternative is probably finding a libv8 version that doesn't fail checksum validation.

@Judahmeek why not put a comment? Then the next person will know the reason and can possibly remove it once it's no longer needed.


spec/dummy/config/initializers/react_on_rails_pro.rb, line 29 at r4 (raw file):

Previously, Romex91 (Roman Kuksin) wrote…

(full comment quoted above: renderer_request_retry_limit should be deprecated or trigger a warning, since silent retries only mask renderer bugs, oversized pages, resource exhaustion, and connectivity problems)

Agree. Plus, for real users this will be (or should be) browser-rendered without SSR in case of error, so it mostly affects bots.

- Add `ssr_timeout` configuration so the Rails server will not wait more than this many seconds for an SSR request to return once issued.
- Change default for `renderer_use_fallback_exec_js` to `false`.
- Change default log level to info.
Judahmeek force-pushed the justin808/add-ssr-timeout branch from ddd8126 to 17113ec on October 18, 2021 23:42
justin808 merged commit 1bb654c into master on Oct 18, 2021
justin808 deleted the justin808/add-ssr-timeout branch on October 18, 2021 23:58