-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796
Conversation
07ab8d4
to
ec5de3a
Compare
56b63f4
to
54018e8
Compare
54018e8
to
ee1ec45
Compare
Pinging @elastic/kibana-platform |
// (3.1.1) We can expose a convenient API by doing a lot of work | ||
adminClient: () => { | ||
callAsInternalUser: async (...args) => { | ||
adminClient = await coreSetup.elasticsearch.adminClient$.pipe(take(1)).toPromise(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this proposal, every call of callAsInternalUser
can have a different version of adminClient
. I suspect it contradicts RFC handler design https://github.com/elastic/kibana/pull/36509/files#diff-8e0211fd03975bea94cb15762f6c7eaaR92
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These examples show what the implementation would look like if we implemented proposal (1) and (2) but not (3). If we do not implement (3), I agree that we would contradict the handler design.
If we implement (3) the context handler will look like example (4.3) which I don't think contradicts the handler design.
I didn't put this in the RFC itself, but I think the handler pattern tried to solve two problems:
A. Improve the ergonomics of consuming API's that are strongly tied to short-lived state like an incoming request. This means not having to call .asScoped(req)
inside a request handler the whole time.
B. Improve the ergonomics of consuming Observable-based API's. This means not having to do await someApi$.pipe(first()).toPromise()
I believe proposal (3) solves (B) across all of Core's API's not just API's that are relevant in handlers. I think this is a better alternative to introducing context everywhere even if there isn't some kind of state to "fold" into the context. IMO the contexts should only be used to solve (A).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If context doesn't try to solve (B) because we solved it with (3) then functionality being "fixed" inside a context as described by the Context pattern becomes less important.
// async & await isn't necessary here, but makes example a bit clearer. | ||
// Note that the elasticsearch client no longer exposes an adminClient$ | ||
// observable. | ||
const result = await core.elasticsearch.adminClient.callAsInternalUser('ping', ...); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging service and Elasticsearch services have a different strategy for handling config updates.
Elasticsearch service re-creates cluster client, so dependent plugins know for sure that something has changed and they should use a newer version of the API.
Logging service provides non-reactive API. it handles all config updates internally and dependent plugins know nothing about internal changes.
@azasypkin I believe you mentioned that different design decisions were done deliberately. Does the proposed approach eliminate all the benefits of having explicit elasticsearch client updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think of it, but the logging service is a great example of an API that hides the reactivity 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@azasypkin I believe you mentioned that different design decisions were done deliberately.
Yeah, it was done on purpose, logging is something that is supposed to be used all over the place and adding Rx to the mix could turn into nightmare for the developers. I actually wanted to have the same for ES client, but we didn't get quorum on that in the past.
# Alternatives | ||
## 1. Introduce a lifecycle/context provider timeout | ||
Lifecycle methods and context providers would timeout after X seconds and any | ||
API's they expose would not be available if the timeout had been reached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have clear boundaries for every stage. We could monitor execution time and log that a plugin initialization lasts too long (kinda similar to the next approach). After that, a user can decide to disable slow plugin or not.
# Drawbacks | ||
Not being able to block on a lifecycle method also means plugins can no longer | ||
be certain that all setup is complete before they reach the start lifecycle. | ||
Plugins will have to manage such state internally. Core will still expose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes a plugin needs to get async API result. Does it mean that core API has to switch to observables? For example, we have a case for the security plugin
https://github.com/restrry/kibana/blob/48d34abd99ee17fa610e8e5d3ab8dd05e5e4d1d3/src/core/server/http/http_server.ts#L314-L332
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to implement this RFC for the createCookieSessionStorageFactory
API could look something like this:
Core Setup API becomes synchronous:
export inter HttpServerSetup {
createCookieSessionStorageFactory: <T>(
cookieOptions: SessionStorageCookieOptions<T>
) => SessionStorageFactory<T>;
Using the SessionStorageFactory becomes async:
export interface SessionStorageFactory<T> {
asScoped: (request: KibanaRequest) => Promise<SessionStorage<T>>;
}
So the "setup" API is completely synchronous, all registration happens synchronously, but using the API becomes async.
So this line in the security plugin will become
const sessionStorage = await this.options.sessionStorageFactory.asScoped(request);
Drawbacks: | ||
1. Since plugins load serially, even if they don't block startup, all the | ||
delays could add up and potentially make startup very slow. | ||
2. This opens up the potential for a third-party plugin to effectively "break" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if 3rd party plugin stucks on init step, it's hard to diagnose the problem
setup(core, deps) {
return {
async getData() {
return await deps.pluginA.search('data'); // Timeout error because pluginA is not ready or request timeout?
}
}
}
Currently, the security plugin effectively blocks startup until it's able to synchronize its privilege definition with Elasticsearch. If I'm following this proposal, this would no longer be possible. Instead, we should block individual operations which are dependent upon this, primarily all calls to Elasticsearch's _has_privileges API. In situations where we're unable to synchronize the privileges into Elasticsearch, we could retry a number of times; however, at some point we'd likely have to give-up this retry and fail the operation. In these situations, how should we surface the error to the user? Do we just log the error? We used to be able to use the status service for this purpose, but as far as I'm aware, this isn't possible in the NP. A similar situation exists within the Spaces plugin, where we effectively block start-up until we're able to insert the default space. We'd have to adopt a similar approach here of blocking space operations until we're able to insert the default space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this RFC jives with my thoughts on these lifecycle and contextual functions. Great job breaking this down.
If it isn't obvious from my comments, I think this RFC could benefit from a greater emphasis on two things:
- Plugin initialization began synchronous when plugins were introduced in v4.2. It's a relatively new concept that plugins can delay startup at all (they've always been able to prevent startup by throwing an exception). I think this is important because it establishes that we're not proposing that we unwind the fundamentals of how plugins have always worked. Of note: the original plugin system designs we had for the new platform had synchronous lifecycle functions.
- Asynchronous lifecycle functions paint a false picture about the nature of integrating with external services. All scenarios that wish to utilize async lifecycle functions ultimately involve an integration with an external service (almost universally Elasticsearch), but any APIs that are backed by such a service must be designed to handle unavailability and state changes in explicit ways anyway. We're not introducing an additional burden with this RFC, we're removing a feature that made it less obvious that this was necessary, which makes it easier to introduce bugs.
consumers have to handle the same error condition in two places. | ||
|
||
## 2. Treat anything that blocks Kibana from starting up as a bug | ||
Effectively do what we've always been doing. We can improve on the status quo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, legacy plugins cannot block Kibana from starting. They can change their status to red (the "recommended" approach today), or they can make Kibana exit entirely (which we've treated as a bug), but the ability for a plugin to delay the startup of Kibana until it deems it appropriate to continue is entirely new functionality in the new platform. I'm not sure the best way to express that here, but it seems like an important thing to clarify so people don't think we're talking about undoing an established behavior that Kibana has relied on since the beginning.
If I understand this correctly (and I hope I am otherwise it seems like we have a bug in the security plugin today), the security plugin is not blocking Kibana startup in order to synchronize its privilege definition. While the security plugin is doing its initialization, it sets it status to red. In the meantime, Kibana starts up successfully, which is to say all other plugins continue their initialization and Kibana binds to a port and starts serving HTTP traffic. Kibana the software starts. Kibana the product is made less available by nature of the status service, which is independent of setup - plugins can go into a red state long after setup completes. Edit: I think plugins should build their APIs to be resilient (even in the sense of throwing known errors) to their external service state, but if this is too hard of a thing to accomplish alongside new platform migration, we certainly can build a status service into core that behaves similarly to the legacy status behavior. That wouldn't be especially difficult, and it would be no worse than we have today. |
You're correct. I was conflating startup and the status service being green. I incorrectly assumed that in the NP since we didn't have a status service, we should be doing similar logic within the
For our usages, I don't think it's terrible hard for us to fail when external services aren't in the proper state and throw a descriptive error. My primary concern was around surfacing these errors to the appropriate end-users. |
Co-Authored-By: Court Ewing <court@epixa.com>
This does not mean we should remove all observables from Core's API's. When an | ||
API consumer is interested in the *state changes itself* it absolutely makes | ||
sense to expose this as an Observable. Good examples of this is exposing | ||
plugin config as this is state that changes over time to which a plugin should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that necessary? We can still consume async APIs from a synchronous method. We can even expose APIs from a plugin that depend on async APIs from Core. If we adopt this RFC, the only thing we can't do is delay Kibana from starting up due to an asynchronous operation.
Overall, I agree that this distinction is a good idea and that we should stop exposing Observables from Core when they are an implementation detail. However I'm not sure it's required to solve the problem this RFC aims to address. I can see why it would be more ergonomic if we make context providers synchronous, but as I mentioned above, I'm not sure that is necessary (or even desirable?).
Are there examples that demonstrate why removing these Observable APIs is necessary to make lifecycle methods synchronous? If not, I think that change should be a separate proposal.
Co-Authored-By: Josh Dover <me@joshdover.com>
This does not mean we should remove all observables from Core's API's. When an | ||
API consumer is interested in the *state changes itself* it absolutely makes | ||
sense to expose this as an Observable. Good examples of this is exposing | ||
plugin config as this is state that changes over time to which a plugin should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
console.log('about to subscribe');
configService.atPath('key').subscribe(value => {
console.log('got value');
});
console.log('after subscribe');
For me this is an implementation details that should not be relied on. Code should never make postulate that a callback on observable subscription method will execute synchronously IMHO. Definitely needs to provide an sync accessor.
Once this RFC has been merged we should educate teams on the implications to | ||
allow existing New Platform plugins to remove any async lifecycle | ||
behaviour they're relying on before we change this in core. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main concern for me is (please correct me if I'm wrong), that to migrate a NP plugin to sync lifecycle methods, it will require all it's dependencies to have migrated first (more specifically, all dependencies that expose async APIs consumed by the plugin), therefor creating a strong migration order ?
If that's the case, with the increasing number of NP plugins, isn't there a risk that the RFC implementation will basically never be able to be integrated in master?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Migrating to sync lifecycle methods only affects API's which required an async operation to resolve before the API was returned from setup()
. This means that some API's might change, but if plugin A hasn't been migrated it wouldn't block an upstream plugin from migrating. It will however create more work later on. So even if Plugin B has finished migrating, once Plugin A changes it's API Plugin B will once again have to do some refactoring to adopt the new API.
// Dependency with unmigrated, async lifecycles
class PluginA {
async setup(core: CoreSetup) {
return {
create: async () => {return await core.savedObjects.getInternalRepository.create(...)}
}
}
}
// Migrated plugin with sync lifecycles
class PluginB {
setup(core: CoreSetup, deps) {
// const wasCreateSuccessful = (await deps.pluginA.create()) != null;
deps.pluginA.create().then(() => {
... // I can consume an async API in setup, and wait till it resolves before
... // doing further operations, I just can't block setup.
});
return {
// Plugins can continue to expose API's that make further async calls
// to dependencies
proxiedCreate: async () => {return await deps.pluginA.create(...)}
/* wasCreateSuccessful it's now impossible to expose this API */
}
}
}
Co-Authored-By: Josh Dover <me@joshdover.com>
Brandon brought up another area this RFC impacts in #52553. The security plugin registers a RightNavControl user profile button during setup. This user information lives in ES and is therefore async. Rendering this button asynchronously causes an intermediate state where the button is empty and a flicker as the user profile is rendered. With async lifecycles there would be two ways to prevent this:
Both solutions introduce a way for a single plugin to slow down or prevent the rendering of Kibana. As @joshdover mentioned, one way to mitigate this is through the use of skeleton screens (visual placeholders) |
Meeting minutes from 12 December 2019 Validated assumptionsThe core assumption behind this RFC is that Kibana needs to be resilient against plugin bugs or external conditions that could cause the entire application to be unavailable. We agreed that as Kibana will be used in increasingly mission critical conditions we cannot afford to introduce the risk that a single plugin takes Kibana down. Potential solutions
For the reasons given if 3(c) we decided to go ahead with option (3) with the eventual goal of dropping async lifecycles completely. UnknownsDo plugins ever need to read config values and pass these as parameters to Core API’s? If so we would have to expose synchronous config values to support sync lifecycles. Adoption strategy
Other: Create Plugin Status RFC |
This RFC has now entered into the final comment period. Unless any fatal flaws are raised, it will be accepted and merged on 17 December. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rubber stamp approval based on our meeting today
Summary
Prevent plugins from blocking Kibana startup, either through blocking lifecycle methods or blocking context providers from completing.
Edit history:
impact on the rest of Kibana.
for loops, so it's not a perfect guarantee (from @azasypkin).
whitelist, risky changes like the Elasticsearch host should still require a
complete restart. Added to (3) based on [RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796 (comment)
plugins"
and might only be surfaced when the associated API/UI gets used (from
@azasypkin)
Rendered RFC
Background
Started as a discussion in #18854
Fixes #45417