Allow Fleet to complete package upgrade before Kibana server is ready #108993

Closed
kobelb opened this issue Aug 17, 2021 · 48 comments
Assignees
Labels
enhancement (New value added to drive a business result), impact:high (Addressing this issue will have a high level of impact on the quality/strength of our product), loe:medium (Medium Level of Effort), NeededFor:Fleet (Needed for Fleet Team), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc), v8.2.0

Comments

@kobelb
Contributor

kobelb commented Aug 17, 2021

After certain Fleet packages have been successfully installed, the Fleet packages must be upgraded before Elastic Agent can be upgraded and begin ingesting data in the new format. We'd like to make this process easy for both system administrators and Cloud, and do so when Kibana is upgraded so that Elastic Agent can immediately be upgraded. We've conditioned our system administrators to wait until the Kibana server is ready to denote that Kibana has upgraded successfully. Additionally, Cloud has begun relying on this behavior to determine that it's safe to proceed with the hosted Elastic Agent upgrade.

However, Kibana plugins currently have no way to run custom logic at the point where saved-object migrations run and to keep the Kibana server from becoming ready until that logic completes.

In order to support the Fleet package upgrades, we should add the necessary extension points to allow plugins to run custom logic before the Kibana server is marked as ready.

Historically, we've refrained from adding this functionality since plugins should ideally only block their own functionality and not the entire Kibana startup process. We'll want to figure out how to ensure we don't misuse this extension point and inadvertently encourage arbitrary plugins to take advantage of it. Perhaps we should only allow a single consumer of this plugin API for the time being, or figure out a way to ensure that the Kibana Core team manually approves all new consumers. I'm interested to hear others' ideas on how to prevent misuse.

@kobelb kobelb added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc enhancement New value added to drive a business result NeededFor:Fleet Needed for Fleet Team labels Aug 17, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@kobelb
Contributor Author

kobelb commented Aug 17, 2021

/cc @alexh97 per our Slack discussion, this is the only task that I'm aware of that is dependent on a Kibana Platform team for the short-term solution for the Fleet package lifecycle management project

@rudolf
Contributor

rudolf commented Aug 18, 2021

I think the status service is the correct tool to use here. That way fleet doesn't block other plugins from finishing, or even Kibana from responding to requests, but Cloud has an API that can be used to know when it's safe to proceed with the hosted Elastic Agent upgrade.

I don't know the details well enough to say if this would all just work as is today, e.g. I'm not sure if Cloud is using the status API, but if this is an acceptable solution, we wouldn't need to do any further work on Core's side.
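For reference, a minimal sketch of what this could look like from Fleet's side, assuming the plugin status API (core.status.set()) that core exposes to plugins; the import paths and the upgradeManagedPackages() helper are illustrative stand-ins, not Fleet's actual code:

```ts
import { BehaviorSubject } from 'rxjs';
import type { CoreSetup, Plugin, ServiceStatus } from 'kibana/server';
import { ServiceStatusLevels } from 'kibana/server';

// Hypothetical helper standing in for Fleet's real package-upgrade logic.
declare function upgradeManagedPackages(): Promise<void>;

export class FleetStatusExamplePlugin implements Plugin {
  private readonly status$ = new BehaviorSubject<ServiceStatus>({
    level: ServiceStatusLevels.degraded,
    summary: 'Fleet package upgrades are still in progress',
  });

  public setup(core: CoreSetup) {
    // Fleet's status is rolled up into the overall status exposed by GET /api/status.
    core.status.set(this.status$);
  }

  public start() {
    upgradeManagedPackages()
      .then(() => {
        this.status$.next({
          level: ServiceStatusLevels.available,
          summary: 'Fleet packages are up to date',
        });
      })
      .catch((error) => {
        this.status$.next({
          level: ServiceStatusLevels.unavailable,
          summary: `Fleet package upgrade failed: ${error.message}`,
        });
      });
  }

  public stop() {}
}
```

Cloud (or any orchestrator) could then poll the status endpoint and wait for Fleet to report available before upgrading the hosted Elastic Agent.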

@kobelb
Contributor Author

kobelb commented Aug 18, 2021

I think the status service is the correct tool to use here. That way fleet doesn't block other plugins from finishing, or even Kibana from responding to requests, but Cloud has an API that can be used to know when it's safe to proceed with the hosted Elastic Agent upgrade.

Users have been conditioned to expect that Kibana is fully upgraded as soon as the server begins serving HTTP traffic and no longer responding with a 503. I don't think we should violate this expectation and now tell users that they have to hit an additional status endpoint to see if Kibana is upgraded. If there's a way to use the status service to prevent Kibana from serving HTTP traffic, that's great and I think we should use it; however, I don't believe that's the case.

I'm not sure if Cloud is using the status API, but if this is an acceptable solution, we wouldn't need to do any further work on Core's side.

As far as I'm aware, Cloud is just accessing the Kibana root URL / and looking for a status code < 400.

@mshustov
Contributor

After certain Fleet packages have been successfully installed, the Fleet packages must be upgraded before Elastic Agent can be upgraded and begin ingesting data in the new format.

I'm not familiar with the full Fleet setup, so my question is whether Fleet can leverage the preboot lifecycle. It's meant to be used to prepare an environment before the Kibana server starts. However, the Core API available at this stage is quite limited.

Users have been conditioned to expect that Kibana is fully upgraded as soon as the server begins serving HTTP traffic and no longer responding with a 503

It doesn't sound like a resilient approach: a plugin might go to a red state after start. So users still have to consult the status value, don't they?
Also, our planned changes #96626 might add further divergence between overall Kibana status and individual plugin status.

@kobelb
Contributor Author

kobelb commented Aug 18, 2021

I'm not familiar with the full Fleet setup, so my question is whether Fleet can leverage the preboot lifecycle. It's meant to be used to prepare an environment before the Kibana server starts. However, the Core API available at this stage is quite limited.

Fleet uses saved-objects to know what packages are currently installed. During the preboot lifecycle, can Fleet use the saved-objects subsystem?

Users have been conditioned to expect that Kibana is fully upgraded as soon as the server begins serving HTTP traffic and no longer responding with a 503

It doesn't sound like a resilient approach: a plugin might go to a red state after start. So users still have to consult the status value, don't they?
Also, our planned changes #96626 might add further divergence between overall Kibana status and individual plugin status.

What isn't a resilient approach? Using the status service, or adding a new extension point?

@mshustov
Contributor

What isn't a resilient approach? Using the status service, or adding a new extension point?

Sorry for being unclear: what isn't resilient is expecting that all the Kibana components are fully upgraded and functioning as soon as the server starts serving traffic. So I was trying to support @rudolf's suggestion to use the status endpoint as an indicator of Fleet readiness.

If there's a way to use the status service to prevent Kibana from serving HTTP traffic, that's great and I think we should use it; however, I don't believe that's the case.

You are right, it's not the case. Core was going to respond with a 503 when a user requests an endpoint registered by a plugin with a red status, but that wouldn't affect all the HTTP endpoints. #41983

@kobelb
Contributor Author

kobelb commented Aug 18, 2021

What isn't a resilient approach? Using the status service, or adding a new extension point?

Sorry for being unclear: what isn't resilient is expecting that all the Kibana components are fully upgraded and functioning as soon as the server starts serving traffic. So I was trying to support @rudolf's suggestion to use the status endpoint as an indicator of Fleet readiness.

No apology is needed :) However, our system administrators and orchestration software need a way to know whether an upgrade was successful or whether a rollback is required. I quite frankly don't see how this is possible with the changes proposed by #96626. I think it's going to be a super frustrating user experience if we give our users the impression that the upgrade was successful, allow them to log in and make changes to Kibana's internal data, and only later figure out that an eventual upgrade step was unsuccessful, leaving us with only two options:

  1. wait for a new patch version of Kibana where we fix the upgrade; until then, features in Kibana just won't work
  2. roll back to a previous version, losing all changes that were made in the meantime

At the moment, the only things that need to run on upgrade are saved-object migrations, and we've documented that the saved-object migration and transitively the upgrade is complete once Kibana begins serving HTTP traffic: https://www.elastic.co/guide/en/kibana/current/upgrade-migrations.html#upgrade-migrations-process. Changing this behavior would be a breaking change.

@pgayvallet
Contributor

pgayvallet commented Aug 18, 2021

So (just to summarize), if I understand correctly, the need would be to have a way to register arbitrary 'migration' functions that would block the Kibana server from effectively starting until they resolve, while, at the same time, having access to the core services such as savedObjects inside the executed handler?

@kobelb
Contributor Author

kobelb commented Aug 18, 2021

So (just to summarize), if I understand correctly, the need would be to have a way to register arbitrary 'migration' functions that would block the Kibana server from effectively starting until they resolve, while, at the same time, having access to the core services such as savedObjects inside the executed handler?

Correct.

@mshustov
Contributor

Fleet uses saved-objects to know what packages are currently installed. During the preboot lifecycle, can Fleet use the saved-objects subsystem?

No. In this case, maybe the simplest solution would be to keep async plugin lifecycles (they are deprecated) but add a lint rule or a runtime check so that only plugins from a special allow-list can use async methods. We might have to adjust the async lifecycle timeout as well: lifecycles currently have a maximum of 30 seconds to complete an operation.
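Roughly, such a runtime check could look like the following sketch; the allow-listed plugin IDs, the function name, and where it would live in the plugins service are all assumptions:

```ts
// Hypothetical allow-list of plugins permitted to keep (deprecated) async lifecycles.
const ASYNC_LIFECYCLE_ALLOW_LIST = new Set(['fleet']);

function isThenable(value: unknown): value is Promise<unknown> {
  return typeof (value as { then?: unknown })?.then === 'function';
}

// Called by the plugins service with whatever a plugin returned from setup()/start().
export function assertSyncLifecycle(pluginId: string, contract: unknown): void {
  if (isThenable(contract) && !ASYNC_LIFECYCLE_ALLOW_LIST.has(pluginId)) {
    throw new Error(
      `Plugin [${pluginId}] returned a Promise from a lifecycle method, but async ` +
        `lifecycles are deprecated and only permitted for allow-listed plugins.`
    );
  }
}
```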

I quite frankly don't see how this is possible with the changes proposed by #96626.

Would you mind adding your concerns to #96626?

@rudolf
Contributor

rudolf commented Aug 19, 2021

There are a couple of details I don't fully understand:

  1. What error scenarios could cause a fleet package upgrade to fail? I think these packages create saved objects so theoretically many of the intermittent Elasticsearch errors that we're handling with migrations could come into play here too.
  2. What's the impact of a fleet package upgrade failing? When a migration fails your data is in an inconsistent state, but perhaps if a package upgrade fails Kibana could continue using the outdated version? Does it justify downtime for all of Kibana? What's the relationship between package upgrades and upgrading the elastic agent?

I think of being able to roll back after a failed upgrade, knowing there will be no data loss, as a kind of infrastructure "transaction". This is a really nice feature; the challenge is balancing how much we include in this transaction. It was perhaps a coincidental decision, but we have, for instance, decided not to couple Elasticsearch and Kibana upgrades. This comes at the cost of having to ensure Kibana is compatible with newer versions of ES, but it massively reduces the impact of Kibana upgrade failures.

Has the Fleet team explored ways to decouple package upgrades from the Kibana "transaction" so that even if a user can't roll back Kibana, Fleet/Agent can keep on working? This might not be a viable short- or even medium-term goal, but I think it's worth working towards something like this.

@joshdover
Contributor

1. What error scenarios could cause a fleet package upgrade to fail? I think these packages create saved objects so theoretically many of the intermittent Elasticsearch errors that we're handling with migrations could come into play here too.

I believe that's an accurate assessment from my admittedly limited understanding at this point. As far as I can tell, the Fleet package installation code doesn't currently handle any transient Elasticsearch errors and seems to just give up if one is encountered. @elastic/fleet Please correct me if I am wrong.

I definitely think we have to improve this before we consider blocking Kibana startup on package upgrades. At the very minimum we should:

  • Handle common transient Elasticsearch errors and re-attempt the package upgrade if it fails during an upgrade (a retry sketch follows this list).
    • We probably can handle the same errors that migrations handle here:
      if (
        e instanceof EsErrors.NoLivingConnectionsError ||
        e instanceof EsErrors.ConnectionError ||
        e instanceof EsErrors.TimeoutError ||
        (e instanceof EsErrors.ResponseError &&
          (retryResponseStatuses.includes(e?.statusCode) ||
            // ES returns a 400 Bad Request when trying to close or delete an
            // index while snapshots are in progress. This should have been a 503
            // so once https://github.com/elastic/elasticsearch/issues/65883 is
            // fixed we can remove this.
            e?.body?.error?.type === 'snapshot_in_progress_exception'))
      ) {
  • If an upgrade fails after n retries, give up and throw an error to fail Kibana startup. I think we must fail startup because of the reasons in my answer listed below.
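A rough sketch of such a retry wrapper, reusing the same error classification quoted above; the helper name, retry counts, and status-code list are illustrative rather than Fleet's actual implementation:

```ts
import { errors as EsErrors } from '@elastic/elasticsearch';

// Status codes treated as retryable; the exact list is an assumption borrowed
// from the migrations logic quoted above.
const retryResponseStatuses = [503, 401, 403, 408, 410];

const isRetryableEsError = (e: any): boolean =>
  e instanceof EsErrors.NoLivingConnectionsError ||
  e instanceof EsErrors.ConnectionError ||
  e instanceof EsErrors.TimeoutError ||
  (e instanceof EsErrors.ResponseError &&
    (retryResponseStatuses.includes(e?.statusCode) ||
      e?.body?.error?.type === 'snapshot_in_progress_exception'));

// Hypothetical wrapper Fleet could use around each package-install step:
// retry transient Elasticsearch failures with exponential backoff, then give up.
export async function withRetries<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  initialDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (e) {
      if (!isRetryableEsError(e) || attempt >= maxAttempts) {
        throw e;
      }
      await new Promise((resolve) => setTimeout(resolve, initialDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```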

2. What's the impact of a fleet package upgrade failing? When a migration fails your data is in an inconsistent state, but perhaps if a package upgrade fails Kibana could continue using the outdated version? Does it justify downtime for all of Kibana? What's the relationship between package upgrades and upgrading the elastic agent?

It depends. If we're unable to upgrade an Elasticsearch asset, such as an index template or ingest pipeline, then some products will either not be able to ingest data at all or will ingest data in an outdated or invalid way. For the immediate future, we're only planning to upgrade the apm-server, system, elastic-agent, and endpoint packages. The apm-server, elastic-agent, and system packages are necessary for being able to upgrade the APM/Fleet server, since that server will be ingesting data into data streams that depend on these ES assets being upgraded or installed. If they're not up to date, we can't safely upgrade those components without risking data corruption or loss. It's basically the same scenario for endpoint, but in that case, the user can't upgrade any deployed endpoint binaries installed on user machines until the endpoint package has been upgraded.

This is especially important for Cloud users, where APM & Fleet are always present in the default cluster configuration.

It's possible we could split the package upgrade process so that we only depend on the ES assets being installed before unblocking Kibana startup, however in my profiling of the package install process, the Kibana assets are much faster to install than the ES assets, so I'm not sure it'd be worth the complexity.

In this case, maybe the simplest solution might be to keep async plugin lifecycles (they are deprecated) but create a lint rule or a runtime check that only plugins from a special allow-list can use async methods. We might have to adjust async lifecycle timeout as well: lifecycles have max. 30 sec to complete an operation.

This was my thought as well, though I would prefer long-term if we make this API more explicit so as to prevent plugins from accidentally opting in to this behavior. Maybe an API like this would make sense?

core.status.blockStartup(promise: Promise<void>);
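To make the shape concrete, a hypothetical usage sketch from Fleet's side; neither core.status.blockStartup() nor ensurePackagesUpToDate() exists today, they just illustrate the API proposed above:

```ts
// Hypothetical extension of the start contract, for illustration only.
interface CoreStartWithBlockStartup {
  status: {
    blockStartup(promise: Promise<void>): void;
  };
}

// Stand-in for Fleet's real package-upgrade routine.
declare function ensurePackagesUpToDate(): Promise<void>;

export class FleetPlugin {
  public start(core: CoreStartWithBlockStartup) {
    // Core would delay marking the server ready (and serving HTTP traffic)
    // until this promise settles.
    core.status.blockStartup(ensurePackagesUpToDate());
  }
}
```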

@nchaulet
Member

Yes, that's correct: the Fleet code does not do any retries, which is clearly something we can improve.

@afharo
Member

afharo commented Aug 19, 2021

My 2cents: allowing a plugin to block the execution of the rest of Kibana feels like prioritising its use case over anything else.

IMO, GET /api/status would show degraded/YELLOW (with a 503 response) while this upgrade process occurs, and it's up to the user to decide whether they want to use Kibana, knowing that some features might not work as expected (literally the definition of the degraded status), while others may still work (e.g. visualizing previous data).

I wonder why we should grant that privilege to Fleet and not to, for instance, Task Manager (which is a key service in many components in Kibana).

As a separate comment: what is the split of responsibilities in Fleet between the APM/Fleet server and Kibana? From an architectural POV, it feels to me like the package upgrades should occur in the Fleet server. I don't know the internals (sorry if I'm really off), but I would think of the Fleet server as the agents' main connection for all their interactions (pushing data, package installation/upgrades). Do the agents need connections to both the Fleet and Kibana servers?

@pgayvallet
Contributor

pgayvallet commented Aug 19, 2021

We've conditioned our system administrators to wait until the Kibana server is ready to denote that Kibana has upgraded successfully. Additionally, Cloud has begun relying on this behavior to determine that it's safe to proceed with the hosted Elastic Agent upgrade

Honest question here: Is this something we can challenge, or are we locked with this statement?

Like the other members, I don't really like the idea to allow specific plugins to block Kibana from starting. But I also don't think the status API suggestion is more than a workaround.

On one hand, this startup block proposal goes against pretty much all the decisions we previously made in that matter (making setup/start synchronous, refusing to implement a data injection/preload service that could block page load and so on).

In addition, locking up the whole Kibana instance just because a very specific plugin needs to perform some asynchronous tasks for its own reasons feels pretty wrong.

OTOH, I'm quite surprised the 'migration service' need arises so late. I mean, atm the only step of a Kibana upgrade is the saved objects migration. Maybe we should consider that, given the increasing interactions of Kibana with the rest of the stack, this is just not realistic anymore, and that we effectively do need a migration API for plugins to register per-version 'arbitrary' upgrade functions?

In short: I don't really like it, but if we have to do it, I would prefer for us to do it correctly, with a real migration service, with per-version support, access to core services and so on, rather than an ugly hack to just solve this very specific need.

@kobelb
Contributor Author

kobelb commented Aug 19, 2021

For what it's worth, I was extremely hesitant to propose that we allow the Fleet package upgrades to block the Kibana upgrade process. However, I think it's required for us to provide our system administrators a reasonable experience upgrading our software.

We can split hairs and say that we never explicitly documented that the upgrade is complete when Kibana begins serving HTTP traffic and no longer responding with a 503. The only documentation that we have around upgrades uses the following phrasing:

The first time a newer Kibana starts, it will first perform an upgrade migration before starting plugins or serving HTTP traffic

However, this is in the section about saved-object migrations, and we aren't talking about saved-object migrations, we're talking about Fleet packages. This leaves us with the leeway to make Fleet packages denote their upgrade completion using another mechanism, right? If we're being pedantic, yes. If we're being practical and taking the user experience into consideration, no.

Requiring system administrators to check the response status of a new HTTP endpoint to determine whether or not it's safe to proceed with upgrading Elastic Agent is a bad user experience. If Fleet packages only needed to be upgraded for certain UIs in Kibana to work properly, I'd agree that we shouldn't be blocking the entire Kibana UI. In general, a single plugin misbehaving should only affect that plugin, and shouldn't affect the system as a whole. However, we need to make it painfully obvious to our users when the upgrade is complete. If users upgrade Elastic Agent before they are supposed to do so, their data ingestion will grind to a halt. Whatever extension point we add to make this possible should not be frequently used, and it should only be used in exceptional circumstances, like this one.

Additionally, there are changes that we will need to make to the Fleet package installation process to make sure it's resilient and performant. Luckily, we have @joshdover on the Fleet team, who can help us ensure that this process is as bullet-proof as saved-object migrations and minimize the risk we incur.

@kobelb
Contributor Author

kobelb commented Aug 19, 2021

@ruflin and @joshdover - is it correct that only certain Fleet packages need to be upgraded for Elastic Agent to be upgraded? Could we only upgrade some of the Fleet packages as part of the Kibana upgrade process, and upgrade other Fleet packages in a way that they only block certain Kibana UIs from functioning?

@ruflin
Member

ruflin commented Aug 20, 2021

I'll describe the APM scenario on Cloud and what I think could be acceptable behaviour without having to block Kibana. I agree, blocking Kibana completely because some packages could not be upgraded is not ideal. At the same time, it has to be said that the packages that are automatically upgraded and could be blocking should only be the ones that are also versioned with the stack. @kobelb This also answers your question with a yes.

Mid term, ideally it is Elastic Agent itself or fleet-server that blocks the upgrade if the most recent package is not installed. The APM package should define, for example for 7.16, that it only works with 7.16. In this scenario, the hosted Elastic Agent should only upgrade when, for example, the .fleet-* indices indicate that a more recent policy is available. The problem in Cloud and all other deployments with Docker is that the upgrade is not done by the Elastic Agent itself but is triggered externally (the container is upgraded). The Elastic Agent could then refuse to accept the incompatible policy and just not run until the update is triggered, but that would mean a halt to data shipping for APM, which is not an acceptable user experience. One of the options would be to have a "flavor" to the health check: Kibana is fully upgraded as a system, and the extension points (packages) are upgraded and healthy. Then Cloud could check and wait for the second one before upgrading Elastic Agent.

The problem short term is that none of this is implemented yet. For the Cloud use case right now, we only need to "block" on the upgrade of the APM and elastic_agent packages; everything else can happen after Kibana is healthy. This makes it critical that we test all upgrade paths with these packages in detail so nothing goes wrong.

@joshdover
Contributor

joshdover commented Aug 20, 2021

I'm going to try to answer as much as I can given what I know about Fleet so far, here it goes! Please anyone correct me if I'm missing any details on the Fleet side.

allowing a plugin to block the execution of the rest of Kibana feels like prioritising its use case over anything else.

I wonder why we should grant that privilege to Fleet and not to, for instance, Task Manager (which is a key service in many components in Kibana).

What's special here is that this particular plugin (Fleet) is currently responsible for configuring the dependencies of other Stack components. Task Manager does not have the same responsibility or impact on other components outside of the Kibana server.

IMO, GET /api/status would show degraded/YELLOW (with a 503 response) while this upgrade process occurs, and it's up to the user to decide whether they want to use Kibana, knowing that some features might not work as expected (literally the definition of the degraded status), but others may work (i.e.: visualizing previous data).

Like the other members, I don't really like the idea to allow specific plugins to block Kibana from starting. But I also don't think the status API suggestion is more than a workaround.

On one hand, this startup block proposal goes against pretty much all the decisions we previously made in that matter (making setup/start synchronous, refusing to implement a data injection/preload service that could block page load and so on).

In addition, locking up the whole Kibana instance just because a very specific plugin needs to perform some asynchronous tasks for its own reasons feels pretty wrong.

OTOH, I'm quite surprised the 'migration service' need arises so late. I mean, atm the only step of a Kibana upgrade is the saved objects migration. Maybe we should consider that given the increasing interactions of Kibana with the rest of stack, this is just not realistic anymore, and that we effectively do need a migration API for plugins to register per-version 'arbitrary' upgrade functions?

As @ruflin also answered, I think we're aligned that what we're asking for here is less than ideal as a long-term solution. These dependencies between components are somewhat implicit today, and we're going to need to implement a more explicit mechanism that orchestrators integrate with directly to ensure a smooth upgrade of all the components of the Stack. However, this is going to take time to coordinate changes across Cloud, ECE, ECK, Elastic's Helm charts, on-prem documentation, etc.

Long-term we also plan to be moving some of the package installation components into Elasticsearch itself, though we may still require that Kibana initiate this process.

One reason this requirement is coming up now is that we're trying to bring the apm-server Fleet package up to feature parity with APM Server. APM Server installs its own ingest assets, but this is not the case with the apm-server package due to the security concerns I discuss below.

What is the split of responsibilities in Fleet between the APM/Fleet server and Kibana? From an architectural POV, it feels to me like the package upgrades should occur in the Fleet server? I don't know the internals (sorry if I'm really off), but I would think of the Fleet server as the main connection to the agents to all their interactions (push data, package installation/upgrades). Do the agents need connections to both: the Fleet and Kibana servers?

AFAIK, the reason we don't do these upgrades from Fleet Server today is that we support a "standalone mode" for Elastic Agent that doesn't require users to set up Fleet Server at all. In those cases, we still need to have these packages upgraded, or else upgrading Agent will result in breaking ingestion for the reasons we outlined above.

We also can't move the package installation / upgrade process to Elastic Agent itself, because it would require granting security privileges to the API keys that Agent uses that we definitely cannot permit, since these credentials are present on end-user machines. Whenever we move package installation to Elasticsearch, this may be something we can revisit.

Requiring system administrators to check the response status of a new HTTP endpoint to determine whether or not it's safe to proceed with upgrading Elastic Agent is a bad user experience.

+1 on this. We need to do everything we can to prevent users from causing accidental data loss. If Elastic Agent always upgraded itself, we would be able to prevent upgrading Agent before the packages it needs are also upgraded. But as @ruflin mentioned, Agent and Fleet Server can be upgraded by other components like Docker containers, Ansible, or Chef. Long story short, until we have a better orchestration story for the Stack, I don't see any other good options here right now.

is it correct that only certain Fleet packages need to be upgraded for Elastic Agent to be upgraded? Could we only upgrade some of the Fleet packages as part of the Kibana upgrade process, and upgrade other Fleet packages in a way that they only block certain Kibana UIs from functioning?

Yes, we plan only to upgrade the packages that are absolutely necessary to minimize the blast radius of this change.

@rudolf
Contributor

rudolf commented Aug 20, 2021

From my point of view we're all on the same page about the risk of introducing and using this API and although we might not have a solution soon, it's encouraging that we have some options to explore so that in the future we would not have to block Kibana anymore.

So I'm +1 for proceeding with this 👍

@kobelb
Contributor Author

kobelb commented Aug 30, 2021

During last week's sync about Fleet package lifecycle management, the decision was made that we wouldn't need the feature this issue is talking about for the immediate solution at hand. I defer to @ruflin and @joshdover on the specifics regarding timing.

@ruflin
Member

ruflin commented Aug 31, 2021

Having Kibana block on something that comes from a "remote source" (the package registry) is risky. In Fleet, we should add support for shipping the packages that block Kibana directly with Kibana (@joshdover @jen-huang). It is a feature we started to build but never completed.

At the moment, we only upgrade packages when the user first accesses Kibana. Now we have the option to upgrade directly after Kibana is upgraded, which is already a huge step forward without the risk of blocking Kibana. I think it is a good middle ground as a first step. In most (hopefully all) cases, by the time Elastic Agent with fleet-server is upgraded, Kibana will already have installed the most recent packages. Ideally, in parallel, we start to work on how Elastic Agent could wait on enrolling / upgrading until the packages are there.

In parallel, we should work towards the end goal of having the upgrade blocking, but with the packages bundled with Kibana.

@rudolf
Contributor

rudolf commented Aug 31, 2021

Good point, installing from disk removes one potential source of failures.

In most (hopefully all) cases by the time Elastic Agent with fleet-server is upgraded, Kibana has already installed the most recent packages. Ideally in parallel we start to work on how Elastic Agent could wait on enrolling / upgrading until the packages are there.
In parallel we should work to get to the end goal of having it blocking but with packaging the packages.

Can you clarify, should Core continue to add this API and if so by when? If not, would you please close the issue?

@ruflin
Member

ruflin commented Sep 1, 2021

My understanding is that we still need the API at some point, but the urgency is a bit lower. @joshdover Can you jump in here, as you likely know both "sides" better?

@joshdover
Contributor

Good point, installing from disk removes one potential source of failures.

It will remove a failure scenario, however the overall Fleet setup process is still slow. I did some investigation of this yesterday and unfortunately, bundling the packages is not going to make a significant improvement to performance. I believe we'll need to work with Elasticsearch to improve this: #110500 (comment)

Can you clarify, should Core continue to add this API and if so by when? If not, would you please close the issue?

We plan to only begin blocking Kibana upgrades on package upgrades once we can:

  1. Bundle a few key Stack packages with Kibana
  2. Have optimized the upgrade process enough to be acceptable (TBD on how fast this needs to be for "acceptable")

At the earliest, I think we may be able to do this in 8.1. Until then we can hold off on this API. I also believe continuing to support async lifecycles would be an option for our first iteration of this behavior.

@jasonrhodes
Member

@kobelb pointed this issue out to me because, for the "alerts as data" / "RAC (Rules, Alerts, and Cases)" projects, we are running into some issues related to a mappings update in a Kibana version that could potentially be incompatible, and to how we handle this failure. It sounds like we may want to borrow the solution from this thread eventually, so that we don't allow a user to upgrade to a Kibana version that has this issue...

@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Nov 1, 2021
@kobelb
Contributor Author

kobelb commented Nov 16, 2021

The conversation on the timeline for this extension point was forked into e-mail and Zoom meetings. To close the loop, the decision was made that this extension point is not needed for 7.16/8.0 and that adding it will be delayed until 8.1.

@lukeelmers lukeelmers added loe:medium Medium Level of Effort impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. and removed loe:small Small Level of Effort impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. labels Nov 16, 2021
@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort loe:medium Medium Level of Effort impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. and removed loe:medium Medium Level of Effort impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. labels Nov 17, 2021
@joshdover
Contributor

Just to reiterate, the Fleet team would like to use this API starting in 8.2, so we'll need this API available a few weeks before FF for that release. We don't expect there to be a lot of work on our end to integrate with it.

@jasonrhodes
Member

@joshdover / @kobelb Is #120616 the best summary of the plan forward that came out of this discussion?

@kobelb
Contributor Author

kobelb commented Dec 8, 2021

@jasonrhodes yep!

@lukeelmers
Member

Just dropping a note as I'm starting to look at this. Currently here's what I'm thinking:

  • We add a CoreStart.registerBlockingStartTask(task: Promise<unknown>) as Josh suggests above... (exact name/location within CoreStart still TBD, open to suggestions)
  • Any tasks registered here are awaited before http.start()
  • For the time being, to prevent misuse and require other teams to check with us before using this hook, we only allow one item to be registered (similar to what we've done with other APIs that we only want used once, e.g. http.registerAuth); a sketch of this single-registration guard follows this comment
  • If there are other legitimate use cases, we discuss ideas for how we can keep a high degree of visibility into who is consuming this API. But to keep the scope small/controlled here, we start with just allowing a single use.

@elastic/kibana-core Any opinions on this, or anything I've missed?
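For illustration, a rough sketch of the single-registration guard mentioned in the third bullet above; the service and method names are placeholders, not the final API:

```ts
// Hypothetical core-internal service backing registerBlockingStartTask().
export class BlockingStartTaskService {
  private task?: Promise<unknown>;

  public registerBlockingStartTask = (task: Promise<unknown>): void => {
    if (this.task !== undefined) {
      // Mirrors the approach taken for single-use APIs like http.registerAuth.
      throw new Error('A blocking start task has already been registered');
    }
    this.task = task;
  };

  // Awaited by core right before http.start().
  public async waitUntilReady(): Promise<void> {
    if (this.task !== undefined) {
      await this.task;
    }
  }
}
```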

@lukeelmers lukeelmers self-assigned this Feb 7, 2022
@pgayvallet
Contributor

This plan LGTM.

@joshdover
Contributor

@lukeelmers Thanks for the quick summary. No objections to this proposal. A couple other things I'm thinking about:

Should we have some sort of timeout that crashes Kibana?

  • I'm not sure if it's expected that the application retries forever and relies on failing readiness probes from orchestrators to reboot, or if the system should have a timeout.
  • I only expect network calls could be failing when using an advanced configuration, and still need to confirm this is even the case. I am sure this will not affect the default configuration on our Cloud or ECE offerings with the configuration we allow there because all packages we attempt to upgrade will be available on disk for fallback.
  • Related, I've proposed we add a configurable max_retries setting for Fleet's setup process in [Fleet] Block Kibana startup for Fleet setup completion #120616. If we decide that orchestrators should handle this instead, we can probably drop this and retry forever.

@afharo
Member

afharo commented Feb 10, 2022

  • I'm not sure if it's expected that the application retries forever and relies on failing readiness probes from orchestrators to reboot, or if the system should have a timeout.

I'd vote for the system imposing a timeout, but the timeout is provided by the registered task. I don't think a general timeout will fit all use cases. If it did, we wouldn't have to move away from the Async lifecycle steps in the first place.

What do you think?

I only expect network calls could be failing when using an advanced configuration, and still need to confirm this is even the case. I am sure this will not affect the default configuration on our Cloud or ECE offerings with the configuration we allow there because all packages we attempt to upgrade will be available on disk for fallback.

How about clusters with tons of packages to upgrade? Would the timeout be a problem for them?

@lukeelmers
Member

Should we have some sort of timeout that crashes Kibana?

I think we should to be safe, and I like @afharo's suggestion:

I'd vote for the system imposing a timeout, but the timeout is provided by the registered task. I don't think a general timeout will fit all use cases.

If the registered task provides its own timeout, then core can stay out of the business of trying to guess what a reasonable limit is. It would also keep core from having to manage retries or any other logic, as it would be up to the registered task to decide what its retry conditions are. This would also allow Fleet to make the timeout configurable which could help from a support perspective (otherwise I could imagine a support request where a user is continually unable to start Kibana because the timeout is exceeded).

I think there is some risk in allowing plugins to define their own timeout, but if in our initial implementation we are only allowing a single registrar (Fleet), I am less worried about this situation. Core could also enforce some arbitrary really-high timeout that cannot be exceeded in the task configuration.
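To make that concrete, a sketch of what a task-provided timeout could look like on the core side; the interface and names are assumptions, not an agreed API:

```ts
// Hypothetical task shape: the registrar supplies its own timeout (which Fleet
// could expose as a config setting), and core simply races the task against it.
interface BlockingStartTask {
  name: string;
  timeoutMs: number;
  run(): Promise<void>;
}

async function runBlockingStartTask(task: BlockingStartTask): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Blocking start task [${task.name}] timed out after ${task.timeoutMs}ms`)),
      task.timeoutMs
    );
  });
  try {
    // If the task's own timeout elapses first, this rejects and startup fails.
    await Promise.race([task.run(), timeout]);
  } finally {
    if (timer !== undefined) {
      clearTimeout(timer);
    }
  }
}
```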

@joshdover Is this still your best estimate of how long you think upgrades will take, or has it changed at all?

One thing to note is that these upgrades may take 10-20s in the average case, possibly up to several minutes (?) on a very overloaded cluster.

@joshdover
Contributor

joshdover commented Feb 11, 2022

I'd vote for the system imposing a timeout, but the timeout is provided by the registered task. I don't think a general timeout will fit all use cases. If it did, we wouldn't have to move away from the Async lifecycle steps in the first place.

Having the task define the timeout seems reasonable to me and will likely make this implementation and testing much simpler. We can generalize later if there are other needs for similar initialization tasks in other areas of Kibana.

That said, I'd still like to hear some feedback from those who have experience running our software before we add a timeout. I don't think we have a similar timeout on attempting to make the initial connection to Elasticsearch to begin saved object migrations; instead, we retry forever. I wonder if the same logic should apply here.

@elastic/cloud-applications-solutions @elastic/cloud-k8s Do you have any input on this? The situation is that Kibana has some required initialization that must complete before serving HTTP traffic. This initialization may fail due to issues with connecting to Elasticsearch or capacity problems in the cluster. Would you expect Kibana to continuously retry or to eventually give up and exit with an error? If we should eventually give up, what do you recommend for a default timeout? (we'd also make this configurable of course)

@joshdover Is this still your best estimate of how long you think upgrades will take, or has it changed at all?

One thing to note is that these upgrades may take 10-20s in the average case, possibly up to several minutes (?) on a very overloaded cluster.

Yes this is still accurate. We have one performance issue that is making this a bit worse today but we're considering it a blocker to making this setup logic blocking: #121639

@david-kow

I didn't read through the entire issue, but regarding the last comment:

@elastic/cloud-applications-solutions @elastic/cloud-k8s Do you have any input on this? The situation is that Kibana has some required initialization that must complete before serving HTTP traffic. This initialization may fail due to issues with connecting to Elasticsearch or capacity problems in the cluster. Would you expect Kibana to continuously retry or to eventually give up and exit with an error? If we should eventually give up, what do you recommend for a default timeout? (we'd also make this configurable of course)

In the case of ECK, Kubernetes will restart failing Pods on its own anyway, so this is not critical for us. But one thing we definitely care about is that in a simple, small-cluster, happy-case scenario, when ES and Kibana are started at the exact same time, we won't see Kibana erroring out before ES is up. That is, the default timeout for Kibana to error out should be higher than the time an ES cluster (again, a small, new cluster) needs to start up.

The reason for this "requirement" (or rather an ask) is that many ECK users' journeys start with the quickstart or other example manifests we provide. Through those manifests, they will deploy ES and Kibana simultaneously, and having users see Kibana Pod(s) error out is not a great first impression. Other ECK team members might have more to add.

@pgayvallet
Contributor

The situation has changed a lot since these discussions occurred, which makes me feel the whole thing is now outdated.

I'll go ahead and close this. Feel free to open a fresh issue, up to date with the latest requirements and constraints (e.g. serverless/ZDT), if necessary.

@pgayvallet pgayvallet closed this as not planned Won't fix, can't repro, duplicate, stale Jul 5, 2024