-
Notifications
You must be signed in to change notification settings - Fork 0
Configurable timeout to render requests #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| command: | | ||
| gem install bundler | ||
| echo "Bundler version: "; bundle --version | ||
| bundle config set --local disable_checksum_validation true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Judahmeek are we sure this is the right fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it secure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a security vulnerability & the only reason I did it is because this is for CI only.
A better alternative is probably finding a libv8 version that doesn't fail checksum validation
dzirtusss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, you terminate here on "request" level, so most probably the "request-backend" (node/execjs) will continue to run. Maybe even for quite a long time. Which is not very ok in general. So, maybe it is worth second iteration of this PR where termination signal is sent to renderer and renderer kills the process. Might be as well that js-backend is smart enough to terminate on connection close (then all is ok), but most probably not.
Reviewed 9 of 9 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Judahmeek)
.circleci/config.yml, line 53 at r2 (raw file):
Previously, justin808 (Justin Gordon) wrote…
@Judahmeek are we sure this is the right fix?
I am not very sure why it is needed - sorry maybe just don't know the reason. As well this is not obvious why, so, I would definitely put a NOTE: with some explanations.
@Romex91 and I discussed this at length. There are 2 main types of problems:
So the longer-term solution is to switch to PM2. Shorter-term, regarding your suggestion to tell the cluster to kill the process, the problem is that the Rails server does not know which cluster took the request. However, we find that the bigger issues is to retry problematic requests. Eventually, we'd like to to track the fingerprint of such requests that cause a crash and block them until the server restarts. In order to do that, we'd need some sort of shared data, like Redis or Memcache, keep some sort of data, via some middleware calls to track:
For thread starvation crashes, there would be no request responded to handling. However, given that such thread starvation crashes by particular URLs are rare and often fixed quickly, this might not be a high priority. On the other hand, it would be relatively easy for the React on Rails Pro Ruby part to keep a Hash with the count of requests that caused timeouts, and it can prevent sending such requests to the node Renderer. request fingerprint => { path: , count: } I'll add a config that specifies how many times a given request should have a timeout before being blocked. config.ssr_timeouts_before_blockingI'll add this to the configuration, and if certain request times out more than the configured amount:
If a tiny, random number of requests get timeouts due to high traffic, then those are unlikely to have say 4 timeouts for any one of them. |
|
Note, testing for timeouts can be done via: #136 (comment). |
|
@dzirtusss @Romex91 Draft changes for the next level of this PR: justin808/add-ssr-timeout...justin808/add-ssr-timout-plus-checks Should this PR go in as-is? Or add the check for any requests that keep erroring? |
5419ae5 to
81369a0
Compare
| # Retry request in case of time out on the node-renderer side | ||
| # 0 - no retry | ||
| config.renderer_request_retry_limit = 5 | ||
| config.renderer_request_retry_limit = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this parameter should be deprecated. I would delete it right away. Or add a warning
You are using renderer_request_retry_limit > 0, which is most likely a bad idea.
This option starts working when a node renderer doesn't respond for a long time.
I can think of these potential explanations for why the renderer doesn't respond:
- There is a bug in the renderer (infinite loop, etc.). Must be reported to Sentry ASAP. Retries will make the problem worse by knocking out more workers.
- The page is extremely huge and takes a lot of time to render. Retries will make the situation worse by keeping workers busy. Must be reported to Sentry ASAP, so Devs can address the problem and reduce the size of the page.
- The server is running out of resources. Retries will create MORE Load. An overloaded server won't benefit from MORE WORK. Must be reported to Sentry ASAP so admins can upgrade the server.
- The node renderer is located on another machine, and there is a connection problem. This is not OK! The decision to split the server into separate locations lead to connectivity issues. Must be reported to Sentry ASAP so admins can investigate why machines have trouble reaching each other (maybe they should be forced to occupy the same data center or something like that).
All of these cases are problems. All of them need to be analyzed and addressed. Silent retries are only masquerading the issues. And often create problems introducing domino effects like: Some page is huge => retries => more server load => more pages get timed out => more retries => even more server load.
Retries are only good when dealing with expected and acceptable errors (E.G., electromagnetic bit switches in ethernet). Node renderer failures are not the case because they are neither expected nor acceptable.
About Also, it seems a little bit like a The fix is still good, BUT:
|
|
We at least need a way to manually clear the error counter & document that thoroughly/loudly announce that pages are being blocked from rendering, so that if the fix is applied head of another deployment/reboot, there's a very straightforward way to get the functional pages back out of the blocklist. |
dzirtusss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justin808 regarding your suggestion to tell the cluster to kill the process, the problem is that the Rails server does not know which cluster took the request
Overall, I was thinking that this is not rails, but main.js process should do this or worker.js itself. E.g. you have 1st timeout 5 sec in rails and 2nd kill timeout 10 sec in worker.
As well await sleep XXXX is not very good for timeout testing, as this is gracefully giving cpu control to other thread. E.g. while(1) ... will create more realistic scenario and eat out all CPU like in real life infinite loop.
Reviewed 1 of 3 files at r3, 1 of 1 files at r4, 8 of 9 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @justin808 and @Romex91)
.circleci/config.yml, line 53 at r2 (raw file):
Previously, Judahmeek (Judah Meek) wrote…
It's a security vulnerability & the only reason I did it is because this is for CI only.
A better alternative is probably finding a libv8 version that doesn't fail checksum validation
@Judahmeek why not put a comment? That next guy will know the reason and possibly fix it when it will be not needed.
spec/dummy/config/initializers/react_on_rails_pro.rb, line 29 at r4 (raw file):
Previously, Romex91 (Roman Kuksin) wrote…
IMO this parameter should be deprecated. I would delete it right away. Or add a warning
You are using renderer_request_retry_limit > 0, which is most likely a bad idea.This option starts working when a node renderer doesn't respond for a long time.
I can think of these potential explanations for why the renderer doesn't respond:
- There is a bug in the renderer (infinite loop, etc.). Must be reported to Sentry ASAP. Retries will make the problem worse by knocking out more workers.
- The page is extremely huge and takes a lot of time to render. Retries will make the situation worse by keeping workers busy. Must be reported to Sentry ASAP, so Devs can address the problem and reduce the size of the page.
- The server is running out of resources. Retries will create MORE Load. An overloaded server won't benefit from MORE WORK. Must be reported to Sentry ASAP so admins can upgrade the server.
- The node renderer is located on another machine, and there is a connection problem. This is not OK! The decision to split the server into separate locations lead to connectivity issues. Must be reported to Sentry ASAP so admins can investigate why machines have trouble reaching each other (maybe they should be forced to occupy the same data center or something like that).
All of these cases are problems. All of them need to be analyzed and addressed. Silent retries are only masquerading the issues. And often create problems introducing domino effects like:
Some page is huge => retries => more server load => more pages get timed out => more retries => even more server load.Retries are only good when dealing with expected and acceptable errors (E.G., electromagnetic bit switches in ethernet). Node renderer failures are not the case because they are neither expected nor acceptable.
Agree. Plus for real users this will be (or should be) browser-rendered w/o SSR in case of error. So only bot related mostly.
- Add `ssr_timeout` configuration so the Rails server will not wait more than this many seconds for a SSR request to return once issued. - Change default for `renderer_use_fallback_exec_js` to `false`. - Change default log level to info.
ddd8126 to
17113ec
Compare
ssr_timeoutconfiguration so the Rails server will not wait morethan this many seconds for a SSR request to return once issued.
renderer_use_fallback_exec_jstofalse.This change is