Allow Fleet to complete package upgrade before Kibana server is ready #108993
Pinging @elastic/kibana-core (Team:Core)
/cc @alexh97 per our Slack discussion, this is the only task that I'm aware of that is dependent on the Kibana Platform team for the short-term solution for the Fleet package lifecycle management project.
I think the status service is the correct tool to use here. That way Fleet doesn't block other plugins from finishing, or even Kibana from responding to requests, but Cloud has an API that can be used to know when it's safe to proceed with the hosted Elastic Agent upgrade. I don't know the details well enough to say whether this would all just work as-is today, e.g. I'm not sure if Cloud is using the status API, but if this is an acceptable solution, we wouldn't need to do any further work on Core's side.
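To make the status-service suggestion concrete, here is a rough sketch of how an orchestrator such as Cloud might poll Kibana's `/api/status` endpoint before proceeding with the hosted Elastic Agent upgrade. This is an illustration only: the response shape differs across versions (7.x exposes `status.overall.state`, 8.x exposes `status.overall.level`), and the polling parameters are arbitrary.

```ts
// Sketch: wait for Kibana to report an overall healthy status before upgrading Agent.
// Assumes Node 18+ (global fetch); the response fields checked here vary by Kibana version.
async function waitForKibanaReady(baseUrl: string, intervalMs = 5000, attempts = 60): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(`${baseUrl}/api/status`);
      if (res.ok) {
        const body: any = await res.json();
        const overall = body?.status?.overall ?? {};
        // 8.x reports `level: "available"`, 7.x reports `state: "green"`.
        if (overall.level === 'available' || overall.state === 'green') return;
      }
      // A non-2xx response (e.g. 503) typically means startup work is still in progress.
    } catch {
      // Server not accepting connections yet; keep retrying.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Kibana did not become available within the allotted time');
}
```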
Users have been conditioned to expect that Kibana is fully upgraded as soon as the server begins serving HTTP traffic and no longer responding with a `503`.
As far as I'm aware, Cloud is just accessing the Kibana root URL.
I'm not familiar with the full Fleet setup, so my question is whether Fleet can leverage the preboot lifecycle. It's meant to be used to prepare an environment before the Kibana server starts. However, the Core API available at this stage is quite limited.
It doesn't sound like a resilient approach: a plugin might go to a
Fleet uses saved objects to know what packages are currently installed. During the preboot lifecycle, can Fleet use the saved-objects subsystem?
What isn't a resilient approach? Using the status service, or adding a new extension point?
Sorry for being unclear: I meant "expect that all the Kibana components are fully upgraded and functioning". So, I tried to support @rudolf's suggestion to use
You are right, it's not the case. Core was going to respond with |
No apology is needed :) However, our system administrators and orchestration software need a way to know whether an upgrade was successful, or whether a rollback is required. I quite frankly don't see how this is possible with the changes proposed by #96626. I think this is going to be a super frustrating user experience if we give our users the impression that the upgrade was successful, allow them to log in and make changes to Kibana's internal data, and only later figure out that an eventual upgrade step was unsuccessful, at which point we only have two options:
At the moment, the only things that need to run on upgrade are saved-object migrations, and we've documented that the saved-object migration, and transitively the upgrade, is complete once Kibana begins serving HTTP traffic: https://www.elastic.co/guide/en/kibana/current/upgrade-migrations.html#upgrade-migrations-process. Changing this behavior would be a breaking change.
So (just to summarize), if I understand correctly, the need would be to have a way to register arbitrary 'migration' functions that would block the Kibana server from effectively starting until they resolve, while, at the same time, having access to the core services such as savedObjects inside the executed handler?
Correct.
No. In this case, maybe the simplest solution is to keep
Would you mind adding your concerns to #96626?
There's a couple of details I don't fully understand:
I think of being able to roll back after a failed upgrade, knowing there will be no data loss, as a kind of infrastructure "transaction". This is a really nice feature; the challenge is balancing how much we include in this transaction. It was perhaps a coincidental decision, but we have, for instance, decided not to couple Elasticsearch and Kibana; this comes at the cost of having to ensure Kibana is compatible with newer versions of ES, but it massively reduces the impact of Kibana upgrade failures. Has the Fleet team explored ways to decouple package upgrades from the Kibana "transaction" so that even if a user can't roll back Kibana, Fleet/agents can keep on working? This might not be a viable short- or even medium-term goal, but I think it's worth working towards something like this.
I believe that's an accurate assessment from my admittedly limited understanding at this point. As far as I can tell, the Fleet package installation code doesn't currently handle any transient Elasticsearch errors and seems to just give up if one is encountered. @elastic/fleet Please correct me if I am wrong. I definitely think we have to improve this before we consider blocking Kibana startup on package upgrades. At the very minimum we should:
It depends. If we're unable to upgrade an Elasticsearch asset, such as an index template or ingest pipeline, then some products will either not be able to ingest data at all or will ingest data in an outdated or invalid way. For the immediate future, we're only planning to upgrade the apm-server, system, elastic-agent, and endpoint packages. The apm-server, elastic-agent, and system packages are necessary for being able to upgrade the APM/Fleet server, since that server will be ingesting data into data streams that depend on these ES assets being upgraded or installed. If they're not up-to-date we can't safely upgrade those components without risking data corruption or loss. It's basically the same scenario for endpoint, but in that case, the user can't upgrade any deployed endpoint binaries installed on user machines until the endpoint package has been upgraded. This is especially important for Cloud users, where APM & Fleet are always present in the default cluster configuration. It's possible we could split the package upgrade process so that we only depend on the ES assets being installed before unblocking Kibana startup; however, in my profiling of the package install process, the Kibana assets are much faster to install than the ES assets, so I'm not sure it'd be worth the complexity.
This was my thought as well, though I would prefer long-term if we make this API more explicit so as to prevent plugins from accidentally opting in to this behavior. Maybe an API like this would make sense? `core.status.blockStartup(promise: Promise<void>)`
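For illustration, a minimal sketch of how a plugin like Fleet might consume an API shaped like the one proposed above. `blockStartup` does not exist in Kibana; the interfaces and the `upgradeManagedPackages` helper are hypothetical and only show the idea of handing core a promise that must settle before the server is marked as ready.

```ts
// Hypothetical shapes, not real Kibana APIs.
interface StatusSetup {
  blockStartup(promise: Promise<void>): void;
}
interface CoreSetup {
  status: StatusSetup;
}

class FleetPlugin {
  public setup(core: CoreSetup) {
    // Core would keep the server in a "not ready" state until this promise resolves,
    // and surface an error (or halt startup) if it rejects.
    core.status.blockStartup(this.upgradeManagedPackages());
  }

  private async upgradeManagedPackages(): Promise<void> {
    // Upgrade only the stack-critical packages (apm-server, elastic-agent, system,
    // endpoint) rather than everything available in the registry.
  }
}
```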
Yes, that's correct, the Fleet code does not do any retries; clearly something we can improve.
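As a sketch of the kind of retry handling being discussed (not Fleet's actual code), transient Elasticsearch failures could be retried with exponential backoff; the `isTransient` heuristic and the status codes it checks are assumptions for illustration.

```ts
// Generic retry wrapper for operations that may hit transient Elasticsearch errors.
async function withRetries<T>(
  operation: () => Promise<T>,
  { retries = 5, baseDelayMs = 1000 } = {}
): Promise<T> {
  // Assumed heuristic: timeouts, throttling, and temporary unavailability are retryable.
  const isTransient = (error: unknown) =>
    [408, 429, 502, 503, 504].includes((error as { statusCode?: number })?.statusCode ?? 0);

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= retries || !isTransient(error)) throw error;
      // Exponential backoff between attempts.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}

// Usage sketch: each ES asset installation step could be wrapped, e.g.
// await withRetries(() => installIndexTemplate(esClient, template));
```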
My 2 cents: allowing a plugin to block the execution of the rest of Kibana feels like prioritising its use case over everything else. IMO, I wonder why we should grant that privilege to Fleet and not to, for instance, Task Manager (which is a key service for many components in Kibana). As a separate comment: what is the split of responsibilities in Fleet between the APM/Fleet server and Kibana? From an architectural POV, it feels to me like the package upgrades should occur in the Fleet server. I don't know the internals (sorry if I'm really off), but I would think of the Fleet server as the main connection to the agents for all their interactions (pushing data, package installation/upgrades). Do the agents need connections to both the Fleet and Kibana servers?
Honest question here: Is this something we can challenge, or are we locked in with this statement? Like the other members, I don't really like the idea of allowing specific plugins to block Kibana from starting. But I also don't think the status API suggestion is more than a workaround. On one hand, this startup-block proposal goes against pretty much all the decisions we previously made in this area (making setup/start synchronous, refusing to implement a data injection/preload service that could block page load, and so on). In addition, locking up the whole Kibana instance just because a very specific plugin needs to perform some asynchronous tasks for its own reasons feels pretty wrong. OTOH, I'm quite surprised the 'migration service' need arises so late. I mean, atm the only step of a Kibana upgrade is the saved objects migration. Maybe we should consider that, given the increasing interactions of Kibana with the rest of the stack, this is just not realistic anymore, and that we effectively do need a real migration service. In short: I don't really like it, but if we have to do it, I would prefer for us to do it correctly, with a real migration service, with per-version support, access to core services and so on, rather than an ugly hack to just solve this very specific need.
For what it's worth, I was extremely hesitant to propose that we allow the Fleet package upgrades to block the Kibana upgrade process. However, I think it's required for us to provide our system administrators a reasonable experience upgrading our software. We can split hairs and say that we never explicitly documented that the upgrade is complete when Kibana begins serving HTTP traffic and no longer responding with a `503`.
However, this is in the section about saved-object migrations, and we aren't talking about saved-object migrations, we're talking about Fleet packages. This leaves us with the leeway to make Fleet packages denote their upgrade completion using another mechanism, right? If we're being pedantic, yes. If we're being practical and taking the user experience into consideration, no. Requiring system administrators to check the response status of a new HTTP endpoint to determine whether or not it's safe to proceed with upgrading Elastic Agent is a bad user experience. If Fleet packages only needed to be upgraded for certain UIs in Kibana to work properly, I'd agree that we shouldn't be blocking the entire Kibana UI. In general, a single plugin misbehaving should only affect that plugin, and shouldn't affect the system as a whole. However, we need to make it painfully obvious to our users when the upgrade is complete. If users upgrade Elastic Agent before they are supposed to do so, their data ingestion will grind to a halt. Whatever extension point we add to make this possible should not be frequently used, and it should only be used in exceptional circumstances, like this one. Additionally, there are changes that we will need to make to the Fleet package installation process to make sure it's resilient and performant. Luckily, we have @joshdover on the Fleet team, who can help us ensure that this process is as bullet-proof as saved-object migrations and minimize the risk we incur.
@ruflin and @joshdover - is it correct that only certain Fleet packages need to be upgraded for Elastic Agent to be upgraded? Could we only upgrade some of the Fleet packages as part of the Kibana upgrade process, and upgrade other Fleet packages in a way that they only block certain Kibana UIs from functioning?
I'll describe the APM scenario on Cloud and what I think could be an acceptable behaviour without having to block Kibana. I agree, blocking Kibana completely because some packages could not be upgraded is not ideal. At the same time it has to be said, the packages that are automatically upgraded and could be blocking should only be the ones that will also be versioned with the stack. @kobelb This also answers your question with a yes. Mid-term, ideally it is Elastic Agent itself or fleet-server that blocks the upgrade if the most recent package is not installed. The APM package should define, for example for 7.16, that it only works with 7.16. In this scenario, the hosted Elastic Agent should only upgrade when, for example, it is written in the .fleet-* indices that a more recent policy is available. The problem in Cloud and all other deployments with Docker is that the upgrade is not done by the Elastic Agent itself but triggered externally (the container is upgraded). The Elastic Agent could then refuse to accept the incompatible policy and just not run until the update is triggered, but that would mean a halt to data shipping for APM, which is not an acceptable user experience. One of the options would be that there is a "flavor" to the health check: Kibana is all upgraded as a system, and the extension points are upgraded and healthy (packages). Then Cloud could check and wait for the second one before upgrading Elastic Agent. The problem short term is that this is all not implemented yet. For the Cloud use case right now, we only need to "block" on the upgrade of the APM and elastic_agent packages; everything else can happen after Kibana is healthy. This makes it critical we test all upgrade paths with these packages in detail so nothing goes wrong.
I'm going to try to answer as much as I can given what I know about Fleet so far, here it goes! Please anyone correct me if I'm missing any details on the Fleet side.
What's special here is that this particular plugin (Fleet) is currently responsible for configuring the dependencies of other Stack components. Task Manager does not have the same responsibility or impact on other components outside of the Kibana server.
As @ruflin also answered, I think we're aligned that what we're asking for here is less than ideal as a long-term solution. These dependencies between components are somewhat implicit today, and we're going to need to implement a more explicit mechanism that orchestrators integrate with directly to ensure a smooth upgrade of all the components of the Stack. However, this is going to take time to coordinate changes across Cloud, ECE, ECK, Elastic's Helm charts, on-prem documentation, etc. Long-term we also plan to be moving some of the package installation components into Elasticsearch itself, though we may still require that Kibana initiate this process. One reason this requirement is coming up now is that we're trying to bring the apm-server Fleet package up to feature parity with APM Server. APM Server installs its own ingest assets, but this is not the case with the apm-server package due to the security concerns I discuss below.
AFAIK, today the reason that we don't do these upgrades from Fleet Server is because we support a "standalone mode" for Elastic Agent that doesn't require users to set up Fleet Server at all. In those cases, we still need to have these packages upgraded or else upgrading Agent will result in breaking ingestion for the reasons we outlined above. We also can't move the package installation / upgrade process to Elastic Agent itself because it would require granting security privileges to the API keys that Agent uses that we definitely cannot permit, since these credentials are present on end-user machines. Whenever we move package installation to Elasticsearch, this may be something we can revisit.
+1 on this. We need to do everything we can to prevent users from causing accidental data loss. If Elastic Agent always upgraded itself, we would be able to prevent upgrading Agent before whatever necessary packages it needs are also upgraded. But as @ruflin mentioned, Agent and Fleet Server can be upgraded by other components like Docker containers, Ansible, or Chef. Long story short, until we have a better orchestration story from the Stack, I don't see any other good options here right now.
Yes, we plan only to upgrade the packages that are absolutely necessary to minimize the blast radius of this change.
From my point of view we're all on the same page about the risk of introducing and using this API and although we might not have a solution soon, it's encouraging that we have some options to explore so that in the future we would not have to block Kibana anymore. So I'm +1 for proceeding with this 👍
During last week's sync about Fleet package lifecycle management, the decision was made that we wouldn't need the feature this issue is talking about for the immediate solution at hand. I defer to @ruflin and @joshdover on the specifics regarding timing.
Having Kibana block on something that comes from a "remote source" (the package registry) is risky. In Fleet we should add support for shipping the packages that are blocking Kibana directly with Kibana (@joshdover @jen-huang). It is a feature we started to build but never completed. At the moment, we only upgrade packages when the user first accesses Kibana. Now we have the option to upgrade directly after Kibana is upgraded, which already is a huge step forward, without the risk of blocking Kibana. I think it is a good middle ground as a first step. In most (hopefully all) cases, by the time Elastic Agent with fleet-server is upgraded, Kibana has already installed the most recent packages. Ideally, in parallel we start to work on how Elastic Agent could wait on enrolling / upgrading until the packages are there. In parallel we should also work to get to the end goal of having it blocking, but with the packages bundled.
Good point, installing from disk removes one potential source of failures.
Can you clarify: should Core continue to add this API, and if so, by when? If not, would you please close the issue?
My understanding is that we still need the API at some point; the urgency is just a bit lower. @joshdover Can you jump in here, as you likely know both "sides" better?
It will remove a failure scenario; however, the overall Fleet setup process is still slow. I did some investigation of this yesterday and unfortunately, bundling the packages is not going to make a significant improvement to performance. I believe we'll need to work with Elasticsearch to improve this: #110500 (comment)
We plan to only begin blocking Kibana upgrades on package upgrades once we can:
At the earliest, I think we may be able to do this in 8.1. Until then we can hold off on this API. I also believe continuing to support async lifecycles would be an option for our first iteration of this behavior.
@kobelb pointed this issue out to me because for the "alerts as data" / "RAC (Rules, Alerts, and Cases)" projects, we are running into some issues related to a mappings update in a Kibana version that could potentially be incompatible, and how we handle this failure. It sounds like we may want to borrow the solution from this thread eventually, so that we don't allow a user to upgrade to a Kibana version that has this issue...
The conversation on the timeline for this extension point was forked to an e-mail thread and Zoom meetings. To close the loop: the decision was made that this extension point is not needed for 7.16/8.0, and to delay adding it until 8.1.
Just to reiterate, the Fleet team would like to use this API starting in 8.2, so we'll need this API available a few weeks before FF for that release. We don't expect there to be a lot of work on our end to integrate with it. |
@joshdover / @kobelb Is #120616 the best summary of the plan forward that came out of this discussion?
@jasonrhodes yep!
Just dropping a note as I'm starting to look at this. Currently here's what I'm thinking:
@elastic/kibana-core Any opinions on this, or anything I've missed?
This plan LGTM.
@lukeelmers Thanks for the quick summary. No objections to this proposal. A couple other things I'm thinking about: Should we have some sort of timeout that crashes Kibana?
I'd vote for the system imposing a timeout, but with the timeout value provided by the registered task. I don't think a general timeout will fit all use cases. If it did, we wouldn't have to move away from the async lifecycle steps in the first place. What do you think?
How about clusters with tons of packages to upgrade? Would that be a problem with the timeouts?
I think we should, to be safe, and I like @afharo's suggestion:
If the registered task provides its own timeout, then core can stay out of the business of trying to guess what a reasonable limit is. It would also keep core from having to manage retries or any other logic, as it would be up to the registered task to decide what its retry conditions are. This would also allow Fleet to make the timeout configurable which could help from a support perspective (otherwise I could imagine a support request where a user is continually unable to start Kibana because the timeout is exceeded). I think there is some risk in allowing plugins to define their own timeout, but if in our initial implementation we are only allowing a single registrar (Fleet), I am less worried about this situation. Core could also enforce some arbitrary really-high timeout that cannot be exceeded in the task configuration. @joshdover Is this still your best estimate of how long you think upgrades will take, or has it changed at all?
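A rough sketch of what "the task provides its own timeout, core enforces a ceiling" could look like; none of these names are existing Kibana APIs, and the ceiling value is arbitrary.

```ts
// Hypothetical registration shape: the plugin supplies the timeout (and can make it
// configurable); core only enforces an absolute upper bound.
interface StartupTask {
  name: string;
  timeoutMs: number;
  run: () => Promise<void>;
}

const MAX_TIMEOUT_MS = 60 * 60 * 1000; // arbitrary ceiling core could impose

async function runStartupTask(task: StartupTask): Promise<void> {
  const timeoutMs = Math.min(task.timeoutMs, MAX_TIMEOUT_MS);
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Startup task "${task.name}" timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  try {
    // Whether a timeout should crash Kibana, log, or trigger retries is the open question above.
    await Promise.race([task.run(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```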
Having the task define the timeout seems reasonable to me and will likely make this implementation and testing much simpler. We can generalize later if there are other needs for similar initialization tasks in other areas of Kibana. That said, I'd still like to hear some feedback from those who have experience running our software before we add a timeout. I don't think we have a similar timeout on attempting to make the initial connection to Elasticsearch to begin Saved Object migrations; instead, we retry forever. I wonder if the same logic should apply here. @elastic/cloud-applications-solutions @elastic/cloud-k8s Do you have any input on this? The situation is that Kibana has some required initialization that must complete before serving HTTP traffic. This initialization may fail due to issues with connecting to Elasticsearch or capacity problems in the cluster. Would you expect Kibana to continuously retry or to eventually give up and exit with an error? If we should eventually give up, what do you recommend for a default timeout? (We'd also make this configurable, of course.)
Yes this is still accurate. We have one performance issue that is making this a bit worse today but we're considering it a blocker to making this setup logic blocking: #121639 |
I didn't read through the entire issue, but regarding the last comment:
In case of ECK, Kubernetes will restart failing Pods on its own anyway, so this is not critical to us. But one thing we definitely care about is that in a simple, small cluster, happy-case scenario, when ES and Kibana are started at the exact same time, we won't see Kibana erroring out before ES is up. That is, the default timeout for Kibana to error out should be higher than the time the ES cluster (and again, that is for a small, new cluster) needs to start up. The reason for this "requirement" (or rather, an ask) is that many ECK users' journeys start with the quickstart or other example manifests we provide. Through those manifests, they will deploy ES and Kibana simultaneously, and having users see Kibana Pod(s) error out is not a great first impression. Other ECK team members might have more to add.
The situation has changed a lot since those discussions occurred, making me feel the whole thing is now outdated. I'll go ahead and close this. Feel free to reopen a fresh issue up to date with the latest requirements and constraints (e.g. serverless/zdt) if necessary.
After certain Fleet packages have been installed, they must be upgraded before Elastic Agent can be upgraded and begin ingesting data in the new format. We'd like to make this process easy for both system administrators and Cloud, and do so when Kibana is upgraded so that Elastic Agent can immediately be upgraded. We've conditioned our system administrators to wait until the Kibana server is ready to denote that Kibana has upgraded successfully. Additionally, Cloud has begun relying on this behavior to determine that it's safe to proceed with the hosted Elastic Agent upgrade.
However, Kibana plugins do not currently have a way to run custom logic at the point where saved-object migrations run and prevent the Kibana server from becoming ready until that logic completes.
In order to support the Fleet package upgrades, we should add the necessary extension points to allow plugins to run custom logic before the Kibana server is marked as ready.
Historically, we've refrained from adding this functionality since plugins should ideally only be blocking their own functionality and not the entire Kibana startup process. We'll want to figure out how to ensure we don't misuse this extension point and inadvertently encourage arbitrary plugins to take advantage of it. Perhaps we should only allow there to be a single consumer of this plugin API for the time being, or figure out a way to ensure that the Kibana Core team manually approves all new consumers. I'm interested to hear others' ideas on how to prevent misuse.