agent-mode loads output from policy #3411
@@ -379,6 +381,7 @@ func TestServerConfigErrorReload(t *testing.T) {
cancel()
}).Return(nil)
mReporter.On("UpdateState", client.UnitStateStopping, mock.Anything, mock.Anything).Return(nil)
mReporter.On("UpdateState", client.UnitStateFailed, mock.MatchedBy(func(err error) bool { return errors.Is(err, context.Canceled) }), mock.Anything).Return(nil).Maybe()
I found that without this in the list of mocked calls the test will panic: the cancel() call made when the state updates to healthy interrupts something and returns the context cancelled error, which triggers a fail-state detection instead of a stopping call.
}}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
res := MergeElasticsearchFromPolicy(cfg, tc.pol)
@michel-laterman why do we need to merge the policy here? Can't we use what comes from the config? The proxy and tls settings should be configured there too, no? Otherwise there is no way for a user to change proxy and tls from Fleet, right?
Currently the elastic-agent injection will not replace keys that are already present in the local config.
So the agent can set tls.CAs from policy if it was not an enrollment arg (I'm not sure if it serializes locally), but the bigger issue is that the agent only ever sends a single host in the output block due to this behaviour.
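The behaviour described above can be sketched roughly as follows. This is not the elastic-agent implementation; injectMissing and the config keys are hypothetical and only show why a key set at enrollment (here the single host) survives while a key absent locally can come from the policy.

```go
package main

import "fmt"

// injectMissing returns a copy of local with keys from policy added only
// where local does not already define them. Hypothetical helper.
func injectMissing(local, policy map[string]any) map[string]any {
	out := make(map[string]any, len(local)+len(policy))
	for k, v := range policy {
		out[k] = v
	}
	// Local values are written last, so they always win over policy values.
	for k, v := range local {
		out[k] = v
	}
	return out
}

func main() {
	// Illustrative keys and hosts only.
	local := map[string]any{
		"hosts": []string{"https://es-bootstrap:9200"},
	}
	policy := map[string]any{
		"hosts":   []string{"https://es-1:9200", "https://es-2:9200"},
		"tls.CAs": []string{"/etc/ca.pem"},
	}

	merged := injectMissing(local, policy)
	fmt.Println(merged["hosts"])   // the enrollment host is kept, policy hosts are dropped
	fmt.Println(merged["tls.CAs"]) // not set locally, so it is taken from the policy
}
```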
changelog/fragments/1712108631-Use-policy-outputs-when-running-in-agent-mode.yaml
Fleet-server will retrieve and use the output from the policy when running in agent-mode.
This allows the fleet-server to connect to multiple Elasticsearch hosts if it is successful when
connecting to the host provided at enrolment/installation.
We expect that the host provided during enrollment/installation is never removed as a valid output.
Should we raise an ingest-docs issue to document this feature?
yes, we should
As long as agent has received a policy from Fleet server at least once, it will have persisted it locally on disk. This should mean that, except for the very first time agent checks in, there is a cached set of fleet server outputs that are going to arrive via the control protocol. Are the control protocol outputs also added to the set of Fleet server hosts when they are received? They are the same as the ones Fleet Server could query itself, they just come in via a different path. Removing the bootstrap broker is interesting; it might make sense to model the bootstrap broker as an explicit thing in Fleet if somebody asks us for this. Its purpose is to get the policy the very first time, before there are policy outputs to read from. At the same time, once you have gotten the policy at least once, there shouldn't ever be a need to rely only on the bootstrap output again.
Modelling the bootstrap broker explicitly as something separate from the outputs in the policy might also be a way to account for it in diagnostics.
@cmacknz, just to be clear, I have not adjusted any bootstrap behaviour in fleet-server or the elastic-agent. We know that the (elastic-agent) component modifier adjusts the ES hosts it sends to fleet-server to use only what is specified during enrollment when elastic-agent is not restarted, if the description of agent persistency is correct:
So we can have an edge case that occurs as follows:
I'm going to test what output is sent to a fleet-server after an agent restarts.
Ugh, right, we are overriding the outputs in the policy. We'd need fleet server to cache the outputs list to get around this. I don't want to blow up the scope of this change too much, but I wonder if all of this would be simpler if we defined Fleet server to always have two outputs: one is explicitly the bootstrap output and the other is the set of standard outputs in the policy (which may overlap with bootstrap). The bootstrap output only gets used if the standard outputs don't exist yet. That way you'd benefit from the agent's cache of the most recent policy. Basically we'd want agent to keep injecting an output; it would just be an additional output instead of overwriting the single output that exists today.
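A rough sketch of the two-output idea proposed above, assuming a hypothetical Output type and selectOutputs helper; nothing here exists in fleet-server or elastic-agent today. The bootstrap output is used only while no standard policy outputs are known.

```go
package main

import "fmt"

// Output is a hypothetical, simplified output definition.
type Output struct {
	Name  string
	Hosts []string
}

// selectOutputs returns the standard policy outputs when any exist and
// falls back to the bootstrap output otherwise.
func selectOutputs(bootstrap Output, standard []Output) []Output {
	if len(standard) > 0 {
		return standard
	}
	return []Output{bootstrap}
}

func main() {
	bootstrap := Output{Name: "bootstrap", Hosts: []string{"https://es-bootstrap:9200"}}

	// First check-in: no policy outputs are cached yet.
	fmt.Println(selectOutputs(bootstrap, nil)[0].Name)

	// Later: the agent's cached policy supplies the standard outputs.
	standard := []Output{{Name: "default", Hosts: []string{"https://es-1:9200", "https://es-2:9200"}}}
	fmt.Println(selectOutputs(bootstrap, standard)[0].Name)
}
```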
I tried the following for a test: Created 8.14-SNAPSHOT on qa
the new I also then tried uninstalling the agent, then re-installing without the
and the ca path appears in fleet-server's "initial server configuration" message. The main finding is that the elastic-agent will never replace an attribute that was present during enrolment with a value from a policy when sending config to fleet-server.
}

// getFleetOutputName returns the output name that the fleet-server input of the policy uses
func getFleetOutputName(p *model.Policy) (string, bool) {
@juliaElastic I've added this method so we use the use_output attribute of the fleet-server input to get the output name directly instead of scanning for the 1st elasticsearch type output
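A minimal sketch of that lookup, assuming a simplified policy shape. The Policy and Input types and the fleetOutputName helper are hypothetical; the real model.Policy layout and the body of getFleetOutputName are not shown in this conversation.

```go
package main

import "fmt"

// Input and Policy are hypothetical, simplified stand-ins for the policy model.
type Input struct {
	Type      string
	UseOutput string
}

type Policy struct {
	Inputs []Input
}

// fleetOutputName returns the use_output value of the first fleet-server
// input, and false when no such input (or no use_output) is present.
func fleetOutputName(p *Policy) (string, bool) {
	if p == nil {
		return "", false
	}
	for _, in := range p.Inputs {
		if in.Type == "fleet-server" && in.UseOutput != "" {
			return in.UseOutput, true
		}
	}
	return "", false
}

func main() {
	p := &Policy{Inputs: []Input{
		{Type: "system/metrics", UseOutput: "monitoring"},
		{Type: "fleet-server", UseOutput: "default"},
	}}
	name, ok := fleetOutputName(p)
	fmt.Println(name, ok)
}
```

Looking the name up this way avoids picking an unrelated elasticsearch output that merely happens to be listed first in the policy.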
@cmacknz I created #3464 to handle output bootstrap tracking separately to limit how large this pr grows. Should I make another issue in fleet server to discuss/track my findings above?
@lucabelluccini I can confirm that the
Hey team, in which version will we have this fixed? Is there a tentative release version? Are there other dependencies to be merged?
8.14.0 would be the first version with the fix.
This reverts commit fe7955b.
PLEASE NOTE: THIS FIX HAS BEEN REVERTED IN #3496
What is the problem this PR solves?
When running in agent-mode fleet-server can never connect to any additional Elasticsearch hosts.
From the discussions in the initial attempt to fix this issue in the elastic-agent (elastic/elastic-agent#4473) we have decided to treat the output config that the agent passes to the fleet-server more like a "bootstrap" block that the fleet-server can use to form an initial connection to ES and pull more output config from the associated policy.
This approach does not solve for the edge-case where the URL used in bootstrapping (specified when enrolling/installing) goes down and the fleet-server needs to be restarted.
In this case fleet-server will not be healthy as it can't gather output config from the bootstrap ES host.
Additionally, we still expect that the bootstrap ES host will never be removed as a valid output.
How does this PR solve the problem?
Allow the self-monitor that fleet-server starts when in agent mode to send a config.Config struct through the server's config channel. This struct only has the Output and (new) RevisionIdx attributes set from values retrieved from the policy output.
When fleet-server receives new config it handles the output-only config as a special case and merges it with the previous output config in order to get an up-to-date, complete config.
When merging config, non-default values from the policy are preferred.
For an output block to be used, at least one host must be reachable.
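The merge and reachability rules can be sketched as below. This is an illustration under stated assumptions, not the PR's code: the Output fields, mergeOutput, and anyReachable are hypothetical, and the probe function stands in for whatever connectivity check fleet-server actually performs.

```go
package main

import "fmt"

// Output is a hypothetical, trimmed-down output config.
type Output struct {
	Hosts        []string
	ProxyURL     string
	ServiceToken string
}

// mergeOutput starts from the previous output and overrides each field for
// which the policy supplies a non-default (non-zero) value.
func mergeOutput(prev, pol Output) Output {
	merged := prev
	if len(pol.Hosts) > 0 {
		merged.Hosts = pol.Hosts
	}
	if pol.ProxyURL != "" {
		merged.ProxyURL = pol.ProxyURL
	}
	if pol.ServiceToken != "" {
		merged.ServiceToken = pol.ServiceToken
	}
	return merged
}

// anyReachable reports whether probe succeeds for at least one host.
func anyReachable(hosts []string, probe func(string) bool) bool {
	for _, h := range hosts {
		if probe(h) {
			return true
		}
	}
	return false
}

func main() {
	prev := Output{Hosts: []string{"https://es-bootstrap:9200"}, ServiceToken: "token"}
	pol := Output{Hosts: []string{"https://es-1:9200", "https://es-2:9200"}}

	merged := mergeOutput(prev, pol)

	// Stand-in probe: pretend only es-2 answers.
	probe := func(h string) bool { return h == "https://es-2:9200" }

	active := prev
	if anyReachable(merged.Hosts, probe) {
		active = merged // use the policy output only if some host is reachable
	}
	fmt.Println(active.Hosts, active.ServiceToken)
}
```

In this sketch the service token set at bootstrap carries over because the policy leaves it at its zero value, while the host list is replaced by the policy's hosts.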
How to test this PR locally
fleet-server.yml in the bundle should have multiple hosts in the output block; logs in the bundle should also indicate that the output block from the latest revision is being used.
Design Checklist
I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.
Checklist
I have made corresponding change to the default configuration files
./changelog/fragments using the changelog tool
Related issues