[RFC] Node.js clustering in Kibana #94057
Conversation
Pinging @elastic/kibana-core (Team:Core)
rfcs/text/0015_nodejs_clustering.md (Outdated)

![image](../images/15_clustering/perf-4-workers.png)

### Analysis
Maybe you could explain a bit more how our benchmark test works? My understanding is that it performs a single flow and measures the response times for completing that flow. This is a good smoke test for the performance a single user would observe on a mostly idle system. However, does our benchmarking run enough parallel requests to start maxing out a single core, let alone multiple cores?
Unless our benchmark is causing the Node process to consume so much CPU that we start seeing delays in the event loop (and some requests start hitting a timeout), we're unlikely to see the true performance gain clustering could achieve.
Although faster response time is a good sign that Kibana is able to handle more, we probably want to measure requests-per-second throughput.
Although faster response time is a good sign that Kibana is able to handle more, we probably want to measure requests-per-second throughput.
This is not how Gatling works, from what I understand. The best we can do would be to increase the number of requests performed per batch and observe the difference in response time.
cc @dmlemeshko wdyt?
From the posted screenshot I think @pgayvallet used the DemoJourney simulation, which runs a sequence of API calls for each virtual user:
- keeping 20 concurrent users for the first 3 minutes
- increasing from 20 to 50 concurrent users within the next 3 minutes
Gatling spins up virtual users based on the simulation and tracks request time for each API call. It has no APM-like functionality. The full HTML report has other charts that could give more information, but not Kibana req/sec bandwidth.
Running 200 concurrent users should give a clearer diff in the reports.
I don't know anything about our setup, but it seems like Gatling should be able to give us requests succeeded vs requests failed (OK vs KO): https://gatling.io/docs/current/general/reports/
If we can get this metric we should push concurrent users up until we start seeing at least some failures (maybe 1-5%).
We would then have to double the concurrent users when trying with two workers and compare the number of successful requests against the one-worker scenario.
FWIW, when doing performance testing for Fleet, I had to do what @rudolf suggested and manually increase the concurrent users to find the breaking point.
rfcs/text/0015_nodejs_clustering.md (Outdated)

any sense to have the coordinator recreate them indefinitely, as the error requires manual intervention.

Should we try to distinguish recoverable and non-recoverable errors, or are we good terminating the main Kibana process when any worker terminates unexpectedly for any reason (after all, this is already the behavior in non-cluster mode)?
The advantage of having the coordinator restart the worker is that it's much faster to recover from an unhandled exception than restarting all of Kibana (and redoing config validation, migration checks, etc.).
However, it feels like this could be a second phase, and we can start by simply killing all workers if one throws an unhandled exception, since this should be rare.
In my experience, there are different types of events, and they need to be handled from both ends:
- Sometimes the IPC channel is closed due to the host being overloaded. These usually auto-heal, but if the coordinator exits during that disconnection, the worker won't be stopped and we'll get a zombie process. If I'm recalling correctly, that's a `disconnect` event that needs to be handled by the worker so it can kill itself.
- On `exit`, there's the exit code and the `exitedAfterDisconnect` flag that could help with identifying whether it's a broken process or an intentional shutdown.
But I think it's worth considering what @rudolf says, probably in conjunction with graceful shutdowns: i.e. one worker fails, we send the kill signal so all the other workers gracefully stop before being killed.
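For reference, a rough sketch of how these events surface in Node's `cluster` module (illustrative only, not the proposed coordinator implementation; the teardown policy in the comments follows the proposal discussed in this thread, and a real implementation might tolerate transient disconnects rather than exiting immediately):

```ts
import cluster from 'cluster';

if (cluster.isMaster) {
  // Coordinator side (`isPrimary` on newer Node versions).
  cluster.fork();

  cluster.on('exit', (worker) => {
    // `exitedAfterDisconnect` is true for intentional shutdowns via
    // disconnect()/kill(); anything else means the worker died unexpectedly.
    if (!worker.exitedAfterDisconnect) {
      // Current proposal: tear everything down instead of restarting the worker.
      for (const other of Object.values(cluster.workers ?? {})) {
        other?.kill('SIGTERM'); // let the remaining workers stop gracefully
      }
      process.exit(1);
    }
  });
} else {
  // Worker side: if the IPC channel to the coordinator closes and never comes
  // back (e.g. the coordinator died), exit rather than lingering as a zombie.
  process.on('disconnect', () => {
    process.exit(0);
  });
}
```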
The current proposal is to kill all processes as Rudolf describes, with worker restarts happening in a follow-up phase, after we make a plan for identifying recoverable vs non-recoverable errors.
rfcs/text/0015_nodejs_clustering.md (Outdated)

Notes:
- What should be the default value for `clustering.workers`? We could go with `Max(1, os.cpus().length - 1)`, but do we really want to use all cpus by default, knowing that every worker is going to have its own memory usage.
Based on the benchmarks, do nodes use less memory when they split the load?
++ to this question. Also curious what Elasticsearch does here; it's possible we can follow a similar pattern. If we can minimize memory overhead (see comment above), it'd be great to have a sensible automatic scaling default.
The process' RSS memory will grow with the number of open requests. So, theoretically, for a given amount of requests per second, more workers will require less memory each. However, the garbage collector might not collect as aggressively if the RSS is below the maximum heap. And even if there's a trend in our benchmarks, it doesn't mean that there won't be spikes: if a given worker handles a request to export a large enough amount of saved objects, that worker will consume all of its heap, and adding more workers won't improve this.
So in practice I think we should ignore any memory benchmarks and set the expectation with users that each worker will use up to `--max-old-space-size` (or the default for their system).
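To illustrate the point above, a minimal sketch (not Kibana code; the 1024 MB value is an arbitrary example) showing that the V8 old-space limit applies per process, so a cluster of N workers can collectively use roughly N times that limit:

```ts
import cluster from 'cluster';
import v8 from 'v8';

if (cluster.isMaster) {
  // Workers inherit process.execArgv by default; setting it explicitly here
  // just makes the per-worker heap limit visible.
  cluster.setupMaster({ execArgv: ['--max-old-space-size=1024'] });
  cluster.fork();
  cluster.fork(); // two workers => up to ~2 GB of old-space heap in total
} else {
  // Each worker sees its own, independent heap limit.
  const limitMb = v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
  console.log(`worker ${cluster.worker?.id} heap limit ≈ ${Math.round(limitMb)} MB`);
}
```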
do we really want to use all cpus by default
It feels like there really isn't a good way to automatically choose the correct value here. Perhaps we should just always require that the user sets this in configuration if they want clustering enabled?
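For reference, the default discussed in the RFC expressed as a sketch; the `ClusteringConfig` shape and `resolveWorkerCount` name are assumptions, not the proposed API. Requiring users to set the value explicitly would simply mean dropping the CPU-based fallback:

```ts
import os from 'os';

interface ClusteringConfig {
  enabled: boolean;
  workers?: number; // when omitted, fall back to a CPU-based default
}

export function resolveWorkerCount(config: ClusteringConfig): number {
  if (!config.enabled) {
    return 1;
  }
  // Leave one CPU for the coordinator/OS, but never drop below one worker.
  return config.workers ?? Math.max(1, os.cpus().length - 1);
}
```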
rfcs/text/0015_nodejs_clustering.md (Outdated)

- Implementation cost is going to be significant, both in core and in plugins. Also, this will have to be a collaborative effort, as we can't enable the clustered mode in production until all the identified breaking changes have been addressed.
- Even if easier to deploy, it doesn't really provide anything more than a multi-instances Kibana setup.
Do we have any side-by-side comparisons of the performance of the two modes?
it doesn't really provide anything more than a multi-instances Kibana setup.
Kibana does a lot of background polling work which creates a significant amount of network traffic. If we centralize all this background polling to a single worker we can reduce this background work by a factor equal to the number of worker processes (maybe that would be a factor of 8 or 16 for most deployments?).
It's probably worth trying to quantify the impact, e.g. reducing 1Kb of traffic per month by a factor of 16 is negligible but reducing 10Gb by a factor of 16 would be a big win.
If we centralize all this background polling to a single worker we can reduce this background work by a factor equal to the number of worker processes
That sounds great, but it would also require more changes from plugins. E.g. we could have only one worker perform the license check calls and then broadcast the result to the other workers, but that means changing the `licensing` plugin implementation.
All these optimizations can easily be done later as follow-ups though.
It's probably worth trying to quantify the impact
We would need to list all the 'background' requests we are performing against ES. Not sure how to do that without help from the other teams though.
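As a sketch of what "one worker polls, the coordinator broadcasts" could look like using plain cluster IPC. The message shape, the `fetchLicense` placeholder, and picking worker 1 as the poller are all assumptions for illustration, not the licensing plugin's API:

```ts
import cluster from 'cluster';

interface LicenseMessage {
  type: 'license-updated';
  payload: unknown;
}

// Hypothetical stand-in for whatever call the licensing plugin performs today.
const fetchLicense = async (): Promise<unknown> => ({ type: 'basic' });

if (cluster.isMaster) {
  for (let i = 0; i < 3; i++) {
    cluster.fork();
  }
  // Coordinator: relay a license update received from any worker to all workers.
  cluster.on('message', (_sender, message: LicenseMessage) => {
    if (message?.type !== 'license-updated') {
      return;
    }
    for (const worker of Object.values(cluster.workers ?? {})) {
      worker?.send(message);
    }
  });
} else if (cluster.worker?.id === 1) {
  // Only the designated worker polls Elasticsearch.
  setInterval(async () => {
    const message: LicenseMessage = { type: 'license-updated', payload: await fetchLicense() };
    process.send?.(message);
  }, 30_000);
} else {
  // Every other worker just consumes the broadcast result.
  process.on('message', (message: LicenseMessage) => {
    if (message?.type === 'license-updated') {
      // update the local license state here
    }
  });
}
```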
Agreed, but I think it's worth adding this as a benefit to clustering.
This is listed as a potential future benefit of clustering, and the current RFC doesn't preclude us from doing this. However, the current proposal is to treat it as an enhancement and not address it in the initial implementation.
rfcs/text/0015_nodejs_clustering.md (Outdated)

- Between non-clustered and 2-worker cluster mode, we observe a 20/25% gain in the 50th percentile response time. Gain for the 75th and 95th are between 10% and 40%
- Between 2-worker and 4-workers cluster mode, the gain on 50th is negligible, but the 75th and the 95th are
it might be interesting to compare with a setup containing several Kibana instances with a load balancer in front of them.
@dmlemeshko please correct me if I mis-speak here as I'm just referring to your work. One testing issue we're still working on is getting Elasticsearch and Kibana monitoring data while running the Gatling tests. Some early results on Cloud seemed to indicate that Elasticsearch was the bottleneck once we hit a certain number of users with Dima's demo scenario. Once the search queue builds up and we start getting searches rejected, there's nothing Kibana can do to support more users. I'm sure there are ways to scale Elasticsearch up to handle more requests/sec from Kibana, but it wasn't as simple as doubling the number of Elasticsearch nodes.

Running a similar Gatling test on a Jenkins machine with Elasticsearch and Kibana on the same host had significantly higher throughput (less latency, maybe a higher-performance machine). But I'm still not sure whether Elasticsearch or Kibana is the limiting factor in this case. I think we really need a load test case where we know Kibana is the bottleneck by a good margin, and then compare Kibana with/without clustering.

Here's a visualization of the Gatling data during the same run. In this case I'm not seeing the response times get notably worse corresponding to the search rejections, but I think that's because Dima backed the number of users down. I think when he added more users you could see an obvious impact.
rfcs/text/0015_nodejs_clustering.md (Outdated)

One pragmatic solution could be, when clustering is enabled, to create a sub folder under path.data for each worker.

The data folder is not considered part of our public API, and the implementation and path already changed in previous
Really? How did it change in previous minor releases?
This statement is coming from @joshdover, I'll let him answer if he remembers
I'm not sure what I said, and I'm having a hard time finding an instance where we changed this; I must have misspoken!
However, I don't believe we consider the structure of the data folder to be part of our public API. I believe we should be able to use the sub-folder approach if needed.
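A minimal sketch of the sub-folder approach discussed above, assuming workers are keyed by their cluster worker id; `getWorkerDataPath` is a hypothetical helper, not an existing Kibana API:

```ts
import cluster from 'cluster';
import { mkdirSync } from 'fs';
import path from 'path';

export function getWorkerDataPath(pathData: string): string {
  // Coordinator (or non-cluster mode) keeps using path.data directly;
  // each worker gets an isolated sub-folder keyed by its cluster worker id.
  const dataPath =
    cluster.isWorker && cluster.worker
      ? path.join(pathData, `worker-${cluster.worker.id}`)
      : pathData;
  mkdirSync(dataPath, { recursive: true });
  return dataPath;
}
```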
What all is stored in the data folder? Is it just the `uuid` file? The docs are not really clear...

path.data: The path where Kibana stores persistent data not saved in Elasticsearch. Default: `data`

Source: https://www.elastic.co/guide/en/kibana/current/settings.html

And then they're contradicted by the package directory layout:

| Type | Description | Default Location | Setting |
|---|---|---|---|
| data | The location of the data files written to disk by Kibana and its plugins | /var/lib/kibana | path.data |
| logs | Logs files location | | path.logs |
| plugins | Plugin files location. Each plugin will be contained in a subdirectory. | | |

Source: https://www.elastic.co/guide/en/kibana/current/rpm.html#rpm-layout
The only plugin that I have found referencing this is the Reporting plugin; however, I can't find a place in the code that actually reads the configuration. @elastic/kibana-reporting-services could you provide any guidance on how the `path.data` config is used by the Reporting plugin? My guess is as some sort of temporary storage, however I can't find any usages.
Pinging @elastic/kibana-app-services for any guidance on what reporting does here.
I think we used to use it for the "user data directory" for Chromium, but as I look in the code now, we just use `tmpdir` from the `os` package: https://github.com/elastic/kibana/blob/master/x-pack/plugins/reporting/server/browsers/chromium/driver_factory/index.ts#L51
Do we want to differentiate between "clustered" and "non-clustered" modes? Could we go for the Cloud-first approach and only have "clustered" mode? If somebody does not need it (say, a Docker container), they just spin up a single Kibana worker.
@streamich The current proposal is to not differentiate between the two modes, so that (outside of the coordinator process) the rest of Kibana doesn't need to be aware of the mode it is running in. It just needs to know whether it is the "main" worker or not (which would always return true in non-cluster mode).
Yes, the idea is to get to a place where we are using clustering by default based on the number of CPUs available (currently this is Phase 3 in the proposed rollout). Then users who don't want it can opt out by modifying the config. It's still up for debate whether we make it the default from the beginning, or have a period of time where it is opt-in. EDIT: Realizing now I think you're asking the same question as Josh asked here -- see that thread for more discussion.
rfcs/text/0015_nodejs_clustering.md (Outdated)

An example would be Reporting's queueFactory pooling. As we want to only be running a single headless at a time per Kibana instance, only one worker should have pooling enabled.
we want to only be running a single headless at a time per Kibana instance
If we didn't do anything, the number of Chromium processes running would always be <= the number of cores. That seems OK to me.
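A sketch of how a plugin might gate this kind of background work on a single worker, under the assumption that the first forked worker is designated "main" (how the main worker is actually chosen is an open implementation detail); `isMainWorker` and `startQueuePolling` are hypothetical names, not existing APIs:

```ts
import cluster from 'cluster';

// Hypothetical placeholder for Reporting's queue polling loop.
const startQueuePolling = (): void => {
  // ...poll for pending reports here...
};

export function isMainWorker(): boolean {
  // Non-cluster mode: the single process is trivially the "main" worker.
  if (!cluster.isWorker) {
    return true;
  }
  // Cluster mode: treat the first forked worker as "main" for this sketch.
  return cluster.worker?.id === 1;
}

// Only the main worker starts the polling loop; the others skip it entirely.
if (isMainWorker()) {
  startQueuePolling();
}
```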
From a Core perspective, I think this RFC is ✅ once it's been updated based on the most recent discussions above.
@joshdover @pgayvallet @tsullivan @streamich @mikecote, et al: I've pushed a batch of updates based on the last round of feedback. Please take a look and let me know if anything important is missing.
Okay folks, I'm going to go ahead and move this into a final comment period. If you have any major concerns/objections, please speak up by the end of next week (Friday 25 June, 2021).
(Can't approve because I'm still the initial author of the PR, but) LGTM
Summary
This RFC(-ish) describes the proposed implementation and the required code changes to have Node.js clustering available in Kibana.
Original issue: #68626
POC PR: #93380
Rendered RFC: https://github.com/pgayvallet/kibana/blob/kbn-rfc-clustering/rfcs/text/0020_nodejs_clustering.md