Fleet-Server not starting up on 8.14-SNAPSHOT on cloud (intermittent) #3328

Closed
juliaElastic opened this issue Mar 7, 2024 · 32 comments · Fixed by #3508
Assignees
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@juliaElastic
Contributor

juliaElastic commented Mar 7, 2024

Reported on cloud deployments running version 8.14-SNAPSHOT: Fleet-Server is not starting up, and reports the following status:

Waiting on default policy with Fleet Server integration

When looking at the .fleet-policies index, it seems that the coordinator is not picking up the policy changes.

GET .fleet-policies/_search?q=policy_id:policy-elastic-agent-on-cloud
{
  "size": 10, 
  "_source": ["revision_idx", "coordinator_idx"],
  "sort": [
    {
      "revision_idx": {
        "order": "desc"
      }
    }
  ]
}

  "hits": [
      {
        "_index": ".fleet-policies-7",
        "_id": "413fe8ef-8abf-5a6b-bce7-6a530f665c65",
        "_score": null,
        "_source": {
          "revision_idx": 5,
          "coordinator_idx": 0
        },
        "sort": [
          5
        ]
      },

In fleet-server logs, there are no errors, but it seems the coordinator doesn't pick up the change.
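
For readers who want to repeat this check programmatically, here is a minimal sketch in Go that takes the `hits` array from a search like the one above and reports whether the highest `revision_idx` has a document with a non-zero `coordinator_idx`. The function name `coordinatorCaughtUp` and the interpretation of `coordinator_idx > 0` as "processed by the coordinator" are assumptions for illustration, not fleet-server code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// policyHit mirrors only the two _source fields requested in the query above.
type policyHit struct {
	Source struct {
		RevisionIdx    int64 `json:"revision_idx"`
		CoordinatorIdx int64 `json:"coordinator_idx"`
	} `json:"_source"`
}

// coordinatorCaughtUp (hypothetical helper) returns the highest revision_idx
// in the hits and whether a document for that revision has coordinator_idx > 0.
// Reading "coordinator_idx > 0" as "picked up by the coordinator" is an
// assumption made for this sketch.
func coordinatorCaughtUp(hitsJSON []byte) (latest int64, ok bool, err error) {
	var hits []policyHit
	if err = json.Unmarshal(hitsJSON, &hits); err != nil {
		return 0, false, err
	}
	for _, h := range hits {
		if h.Source.RevisionIdx > latest {
			latest = h.Source.RevisionIdx
			ok = false // a newer revision resets the verdict
		}
		if h.Source.RevisionIdx == latest && h.Source.CoordinatorIdx > 0 {
			ok = true
		}
	}
	return latest, ok, nil
}

func main() {
	// Same shape as the hit shown in the report above.
	hits := []byte(`[{"_source":{"revision_idx":5,"coordinator_idx":0}}]`)
	latest, ok, err := coordinatorCaughtUp(hits)
	if err != nil {
		panic(err)
	}
	fmt.Printf("latest revision %d, picked up by coordinator: %v\n", latest, ok)
}
```

With the single hit from the report (revision 5, coordinator index 0), the helper reports that the latest revision has not been picked up.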

image

Deployment where the issue is reproduced: https://admin.found.no/deployments/19ed6657cf3dccbba39c8b6faacb67f9

@juliaElastic juliaElastic added the bug Something isn't working label Mar 7, 2024
@michel-laterman
Contributor

I've been unable to recreate this locally with a self-managed cluster using the latest snapshots.

@juliaElastic
Contributor Author

Were you able to reproduce on cloud? I'm wondering if the issue is specific to cloud preconfiguration or some other cloud specific config.

@juliaElastic
Contributor Author

juliaElastic commented Mar 8, 2024

I found this warning in the logs, which is interesting because the fleet-server host points to an HTTPS host URL. Not sure if it has anything to do with the coordinator.
Comes from this code: https://github.com/elastic/fleet-server/blob/main/internal/pkg/api/server.go#L113
Admin url: https://admin.found.no/deployments/19ed6657cf3dccbba39c8b6faacb67f9/elasticsearch/console

Exposed over insecure HTTP; enablement of TLS is strongly recommended

@juliaElastic
Contributor Author

juliaElastic commented Mar 13, 2024

It seems that the issue is only reproducible with Terraform. For some reason the hardware template gcp-io-optimized-v3 has this issue, while changing the hardware template to gcp-storage-optimized works.

I tried to reproduce the issue with the template gcp-io-optimized-v3 with perf tests, but no luck so far.

@kuisathaverat

kuisathaverat commented Mar 21, 2024

@juliaElastic it started happening with gcp-storage-optimized template too

https://admin.found.no/deployments/3483523f271adc9936722a933979ad11

@juliaElastic
Contributor Author

In the latest cluster, I'm seeing that the coordinator was not started and is not picking up the policy changes.

This is what the logs should look like (taken from a healthy cluster):
image

In the unhealthy cluster it looks like this:
image

@juliaElastic
Contributor Author

I'm missing this from ES logs (other .fleet- indices are created):

[instance-0000000000] [.fleet-agents-7] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]

@juliaElastic
Contributor Author

Seeing this error in the logs, not sure if it is related:
image

@cmacknz
Member

cmacknz commented Mar 21, 2024

That error comes from the index monitor used to watch for fleet actions and policy updates.

Getting regular i/o timeouts from those index monitors seems like it could cause this.

@kuisathaverat

kuisathaverat commented Apr 23, 2024

It still happens in 8.14.0-BC1, with a different template.

@kpollich
Member

@kuisathaverat - Hey Ivan, this should be fixed by elastic/kibana#181624. Are you still seeing this behavior on the latest snapshot build?

@kuisathaverat

We have tested these Docker images that contain the fix; the issue persists.

https://artifacts-api.elastic.co/v1/versions/8.15.0-SNAPSHOT/builds/8.15.0-00251ce4
https://artifacts-api.elastic.co/v1/versions/8.15.0-SNAPSHOT/builds/8.15.0-81021969


{"log.level":"info","@timestamp":"2024-04-29T06:41:29.462Z","message":"Waiting on default policy with Fleet Server integration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","state":"STARTING","ecs.version":"1.6.0"}

@juliaElastic
Contributor Author

juliaElastic commented Apr 29, 2024

Looking at the latest instance logs, I'm seeing something strange: it says "HTTP":{"Enabled":false, and I don't see any SSL settings or certificates in the config (Inputs.Server.TLS: null). A healthy ECS instance, by comparison, has a valid TLS config.
@michel-laterman Do you think this could be the issue?

Admin link: https://admin.found.no/deployments/15e5bafefa50a01b902fb82c919495fa/integrations_server
Fleet server logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/4OkCu

{"log.level":"info","@timestamp":"2024-04-29T07:45:26.834Z","message":"initial server configuration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"new":{"Fleet":{"Agent":{"ID":"","Logging":{"Level":""},"Version":"8.14.0"},"Host":{"ID":"","Name":""}},"HTTP":{"Enabled":false,"Host":"localhost","Port":5066,"SecurityDescriptor":"","User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PolicyDebounceTime":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d","ScheduleInterval":3600000000000},"Host":"0.0.0.0","Instrumentation":{"APIKey":"","APIKeyPath":"","Enabled":true,"Environment":"","GlobalLabels":"deploymentId=15e5bafefa50a01b902fb82c919495fa,deploymentName=test-images-ess-gttya,organizationId=2870499056","Hosts":["https://7831d021b36b4d53b849aae988d3b6db.apm.us-west2.gcp.elastic-cloud.com:443"],"SecretToken":"m4KjsiPiBAz0w7wW81","SecretTokenPath":"","TLS":{"ServerCA":"","ServerCertificate":"","SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":8000,"Interval":250000,"Max":16000,"MaxBody":2097152},"ActionLimit":{"Burst":100,"Interval":250000,"Max":0,"MaxBody":0},"ArtifactLimit":{"Burst":8000,"Interval":250000,"Max":16000,"MaxBody":0},"CheckinLimit":{"Burst":8000,"Interval":250000,"Max":80000,"MaxBody":1048576},"DeliverFileLimit":{"Burst":40,"Interval":100000000,"Max":80,"MaxBody":0},"EnrollLimit":{"Burst":200,"Interval":10000000,"Max":400,"MaxBody":524288},"GetPGPKey":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"MaxAgents":0,"MaxConnections":0,"MaxHeaderByteSize":8192,"PolicyLimit":{"Burst":1,"Interval":250000,"Max":0,"MaxBody":0},"PolicyThrottle":0,"StatusLimit":{"Burst":200,"Interval":5000000,"Max":400,"MaxBody":0},"UploadChunkLimit":{"Burst":40,"Interval":3000000,"Max":80,"MaxBody":4194304},"UploadEndLimit":{"Burst":40,"Interval":2000000000,"Max":80,"MaxBody":1024},"UploadStartLimit":{"Burst":40,"Interval":2000000000,"Max":80,"MaxBody":5242880}},"PGP":{"Dir":"/usr/share/elastic-agent/data/elastic-agent-372976/components/elastic-agent-upgrade-keys","UpstreamURL":"https://artifacts.elastic.co/GPG-KEY-elastic-agent"},"Port":8220,"Profiler":{"Bind":"localhost:6060","Enabled":false},"Runtime":{"GCPercent":0,"MemoryLimit":0},"StaticPolicyTokens":{"Enabled":false,"PolicyTokens":null},"TLS":null,"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinMaxPoll":3600000000000,"CheckinTimestamp":30000000000,"Drain":10000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info","Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"Headers":{"X-Elastic-App-Auth":"eyJhbGciOiJSUzI1NiJ9.eyJpc3MiOiIyNzY3MzQ0NTg5MjA0Mzc0ODk4OTEzZTIwOGU5M2E4NyIsInN1YiI6IjcxNGQ5NWQ4ZWY3OTQ0ZjZiYmI5NmY1YzIwOTg1ODc5IiwiYXVkIjoiMjc2NzM0NDU4OTIwNDM3NDg5ODkxM2UyMDhlOTNhODciLCJpYXQiOjE3MTQzNzY3MTIsImtpbmQiOiJhcG0ifQ.lMGUgcKsyTVGcc6iC9AL_E0xaTnqyPgztxcK7DqArAsm8kYuYRsb5rXheGY7o6Uli9s03FsGYj4_4r8h3KUMXditUZQyk0ADmr7BiNnDDAdBfIwI6uElG0Mwn3pV4BWkb4vpJHjRb0qwz-sP1T3LXlhI18GREwCZ4-a-psgAVSu2_Kp8V8V_m0MdXn0-gfbB0sApCq4-iJAZeiLsaUvxr24PtZ2gf6xdZx9hXvIPGwMTK9VpYTYtGkwotYnVkNTVkhT8lJA4So-A5GlxD9UD4gdBBqxNAcjdFSd3v-TLO7wCsJhSxOjAEyJQeIp89zoUbSuSv8NC4eKGOzoZ1w-Rew"},"Hosts":["2767344589204374898913e208e93a87.containerhost:9244"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":"","Protocol":"http","ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":"","ServiceToken":"[redacted]","ServiceTokenPath":"","TLS":{"CASha256":null,"CATrustedFingerprint":"","CAs":["/app/config/certs/internal_tls_ca.crt"],"Certificate":{"Certificate":"","Key":"","Passphrase":"","PassphrasePath":""},"CipherSuites":null,"CurveTypes":null,"Enabled":null,"Renegotiation":"never","VerificationMode":"full","Versions":null},"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","ecs.version":"1.6.0"}

While also seeing this: Exposed over insecure HTTP; enablement of TLS is strongly recommended

@michel-laterman
Contributor

The top level http.* attributes are for the http metrics listener that fleet-server starts.
The TLS info (if passed by elastic-agent) should be in Inputs.Server.TLS

@juliaElastic
Contributor Author

juliaElastic commented Apr 29, 2024

One more thing I noticed: on the deployment where we have the issue the log says APM instrumentation enabled, while on the healthy one it says APM instrumentation disabled. Both are using the same oblt-cli ess template and the same 8.14.0-SNAPSHOT build.
So something seems flaky with the APM config, and something seems to prevent policy monitoring from starting.

Though the issue can be reproduced even when APM instrumentation is disabled, e.g. this cluster: https://admin.found.no/deployments/6c1958f4e689e8e8bc7740e0bc0afeed/elasticsearch/console

@juliaElastic
Contributor Author

juliaElastic commented Apr 30, 2024

I created a custom image from 8.14 branch with a lot of info logs, and can't reproduce the issue:

oblt-cli cluster create custom --slack-channel '#fleet-notifications' --username juliaElastic  \
    --template ess \
    --parameters '{
        "StackVersion": "8.14.0-SNAPSHOT",
        "ElasticAgentDockerImage": "docker.elastic.co/observability-ci/elastic-agent:8.14.0-SNAPSHOT-juliabardi-1714466020"
    }'

I'm wondering if this is some kind of concurrency issue, something like fleet-server not picking up config changes in some cases (apm or tls config)

@nchaulet
Member

I'm wondering if this is some kind of concurrency issue, something like fleet-server not picking up config changes in some cases (apm or tls config)

It seems to be a concurrency issue, as it does not happen every time and a restart seems to fix it. Maybe we could try to hardcode some delay to be able to reproduce the issue.

@nchaulet
Member

Looking at the fleet-server code, we should never log Waiting on default policy with Fleet Server integration unless we start fleet-server without a policy ID configured, and it looks like the healthy deployments never have that log line.

@michel-laterman how can we get an empty policy ID here?

sm = policy.NewSelfMonitor(cfg.Fleet, bulker, pim, cfg.Inputs[0].Policy.ID, f.reporter)

Do we have anything in fleet-server that could mutate the config? Or any ideas why this is not configured?

Also, it seems to be something we introduced in 8.14: searching all cloud logs for Waiting on default policy with Fleet Server integration while excluding 8.14+ versions returns no results.

@juliaElastic
Contributor Author

juliaElastic commented May 1, 2024

I think this log message comes because there is no fleet server policy with coordinator_idx:1 that fleet-server waits for.
Good idea to hardcode some delay, though I'm not sure where to put that in the code.
Added sleep in a few places but still can't reproduce the issue: https://github.com/elastic/fleet-server/pull/3507/files

It seems the issue started on Feb 22 on 8.14-SNAPSHOT: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/zvjg9
And it happens in all 3 us-west2-a/b/c zones.
This is the first image where it started happening: docker.elastic.co/observability-ci/elastic-agent-cloud:8.14.0-70f8c9bb

Though this message is misleading: it doesn't always indicate a problem with missing policies, as it is logged on healthy clusters as well, e.g. this one today: https://admin.found.no/deployments/a91a27b6929a1da13493ceaed4411c56/integrations_server

Interesting that the "APM instrumentation enabled" messages somewhat correlate, they started from Feb 23
https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/owW6x

Another correlation with the Exposed over insecure HTTP message that started from Feb 23, only in 8.14+
https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/cA8LF

I found one deployment today where the Policy.ID is empty, though it didn't occur together with the missing coordinator
https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/eXiJ7

Here is another cluster where this happens, still running: https://admin.found.no/deployments/fddc23512824237462861336b0cfe2ed

I found that the issue is reproducible in BC builds (version 8.14.0) and hard to reproduce on 8.14.0-SNAPSHOT or custom images.
Not sure if this change could be related: elastic/elastic-agent#4288 cc @rdner

@nchaulet
Member

nchaulet commented May 1, 2024

I think this log message comes because there is no fleet server policy with coordinator_idx:1 that fleet-server waits for.
Good idea to hardcode some delay, though I'm not sure where to put that in the code.
Added sleep in a few places but still can't reproduce the issue: https://github.com/elastic/fleet-server/pull/3507/files

I think the log message appears because Fleet Server is not correctly configured (no policy ID). If the policy ID were set correctly, it should log Waiting on policy with Fleet Server integration: %s, not Waiting on default...

m.reporter.UpdateState(client.UnitStateStarting, fmt.Sprintf("Waiting on policy with Fleet Server integration: %s", m.policyID), nil) //nolint:errcheck // not clear what to do in failure cases

Looking at the timeline you provided, it could be related to PR #3277. I added a test here, and it seems we are losing the policy ID: #3508
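
The distinction between the two status messages can be sketched as follows. This is a simplified illustration based only on the two log strings quoted in this thread; `waitingMessage` is a hypothetical helper and not the actual fleet-server self-monitor code.

```go
package main

import "fmt"

// waitingMessage is a hypothetical helper (not a fleet-server function).
// It returns the "default policy" message only when no policy ID is set,
// which is why that message hints at a lost Policy.ID in the config.
func waitingMessage(policyID string) string {
	if policyID == "" {
		return "Waiting on default policy with Fleet Server integration"
	}
	return fmt.Sprintf("Waiting on policy with Fleet Server integration: %s", policyID)
}

func main() {
	fmt.Println(waitingMessage(""))
	fmt.Println(waitingMessage("policy-elastic-agent-on-cloud"))
}
```

Seen this way, the "default policy" wording in the logs of the affected deployments is consistent with the policy ID being empty at the time the self-monitor was created.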

@juliaElastic
Contributor Author

juliaElastic commented May 1, 2024

Good catch. I think the cfgData.Merge overrides the new.Inputs.Policy.ID field, and it seems to occur on BC builds when APM instrumentation is enabled, so this code is triggered.
I'm not sure how to enable APM for snapshot/custom builds.

obj := map[string]interface{}{
	"inputs": []interface{}{map[string]interface{}{
		"server": map[string]interface{}{
			"instrumentation": instrumentationCfg,
		},
	},
	}}
err = cfgData.Merge(obj, config.DefaultOptions...)

@nchaulet I tested locally with your agent_test, and adding policy to the instrumentation config works. I'm not sure if there is a way to add all keys from input (like the spread operator in TS, ...input).
cc @michel-laterman

			obj := map[string]interface{}{
				"inputs": []interface{}{map[string]interface{}{
					"policy": input["policy"],
					"server": map[string]interface{}{
						"instrumentation": instrumentationCfg,
					},
				},
				}}
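
On the spread-operator question: Go has no direct equivalent of `...input`, but copying every key with a loop gives the same effect. The sketch below is a stdlib-only illustration; `withInstrumentation` is a hypothetical helper name, and the sample keys in `main` are made up for the example.

```go
package main

import "fmt"

// withInstrumentation (hypothetical helper) returns a copy of input that
// keeps every existing top-level key (such as "policy") and sets
// server.instrumentation, without modifying the original maps.
func withInstrumentation(input map[string]interface{}, instrumentationCfg interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(input)+1)
	for k, v := range input { // the loop plays the role of "...input"
		out[k] = v
	}
	// Copy the existing "server" section, if any, so its other keys survive.
	server := map[string]interface{}{}
	if existing, ok := input["server"].(map[string]interface{}); ok {
		for k, v := range existing {
			server[k] = v
		}
	}
	server["instrumentation"] = instrumentationCfg
	out["server"] = server
	return out
}

func main() {
	// Illustrative input only; real inputs carry many more keys.
	input := map[string]interface{}{
		"policy": map[string]interface{}{"id": "policy-elastic-agent-on-cloud"},
		"server": map[string]interface{}{"port": 8220},
	}
	merged := withInstrumentation(input, map[string]interface{}{"enabled": true})
	fmt.Println(merged["policy"], merged["server"])
}
```

This keeps the policy key without listing it explicitly, although changing the merge options may be the simpler fix.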

@nchaulet
Member

nchaulet commented May 1, 2024

@nchaulet I tested locally with your agent_test, and adding policy to the instrumentation config works. I'm not sure if there is a way to add all keys from input (like the spread operator in TS, ...input).
cc @michel-laterman

If we merge with different options than config.DefaultOptions, we could maybe have a proper merge.

@juliaElastic
Contributor Author

juliaElastic commented May 1, 2024

If we merge with different options than config.DefaultOptions, we could maybe have a proper merge.

You are right, the default options have this: ucfg.FieldReplaceValues("inputs"), so I guess this is why inputs was replaced, and not merged.

Using FieldMergeValues seems to work:

MergeOptions := []ucfg.Option{
	ucfg.PathSep("."),
	ucfg.ResolveEnv,
	ucfg.VarExp,
	ucfg.FieldMergeValues("inputs"),
}
err = cfgData.Merge(obj, MergeOptions...)
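
To make the difference concrete, here is a rough stdlib-only emulation of the two behaviours for the `inputs` list: replacing the whole list versus merging element by element. It does not call go-ucfg, and the helpers `replaceInputs`, `mergeInputs`, and `mergeMaps` are hypothetical names; the library's exact semantics may differ in details.

```go
package main

import "fmt"

type obj = map[string]interface{}

// mergeMaps deep-merges src into a copy of dst; nested maps are merged key
// by key, and any other value in src overwrites the one in dst.
func mergeMaps(dst, src obj) obj {
	out := obj{}
	for k, v := range dst {
		out[k] = v
	}
	for k, v := range src {
		d, dOK := out[k].(obj)
		s, sOK := v.(obj)
		if dOK && sOK {
			out[k] = mergeMaps(d, s)
		} else {
			out[k] = v
		}
	}
	return out
}

// replaceInputs emulates "replace": the overlay list wins wholesale, so keys
// that only exist in base (such as "policy") disappear.
func replaceInputs(base, overlay []obj) []obj { return overlay }

// mergeInputs emulates "merge": element i of overlay is merged into
// element i of base, so untouched keys in base survive.
func mergeInputs(base, overlay []obj) []obj {
	out := append([]obj{}, base...)
	for i, o := range overlay {
		if i < len(out) {
			out[i] = mergeMaps(out[i], o)
		} else {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	base := []obj{{"policy": obj{"id": "policy-elastic-agent-on-cloud"}, "server": obj{"port": 8220}}}
	overlay := []obj{{"server": obj{"instrumentation": obj{"enabled": true}}}}

	fmt.Println("replace keeps policy:", replaceInputs(base, overlay)[0]["policy"] != nil)
	fmt.Println("merge keeps policy:  ", mergeInputs(base, overlay)[0]["policy"] != nil)
}
```

In this emulation the replace path drops the policy entry while the merge path keeps it, which matches the symptom described above (an empty Policy.ID after the instrumentation config is merged in).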

@kpollich
Member

kpollich commented May 2, 2024

@juliaElastic - Can this issue be closed as the fix has been verified?

@juliaElastic
Contributor Author

We can close it, will verify again in cloud when BC3 is built.

@kpollich
Member

kpollich commented May 2, 2024

Added test issue here: #3516. Let me know if I've captured this issue accurately, and feel free to comment on that issue to clarify further.

@juliaElastic
Contributor Author

juliaElastic commented May 3, 2024

While trying to verify the fix in BC3 build, I'm noticing something strange.
This deployment has a healthy fleet-server, but APM instrumentation is not enabled, even though the monitoring config is added to the agent config.
The cloud elastic-agent.yml has this config:

agent.monitoring.traces: true
agent.monitoring.apm.hosts:
- "https://7831d021b36b4d53b849aae988d3b6db.apm.us-west2.gcp.elastic-cloud.com:443"
agent.monitoring.apm.secret_token: "XXX"
agent.monitoring.apm.global_labels.deploymentId: "3abb33f6b6a2fb5683d4ea73abf81e15"
agent.monitoring.apm.global_labels.deploymentName: "ess-igkfx"
agent.monitoring.apm.global_labels.organizationId: "2870499056"

But still in fleet-server logs tracing is not enabled.
I think there might still be a bug in merging the APM Config. Investigating.

Tried enabling traces manually in this deployment by adding this in Advanced Edit / Deployment config under integrations_server:

 "user_settings_yaml": "agent.monitoring.traces: true\n"

I'm seeing the logs that instrumentation is enabled with the right apm settings, but then it's immediately disabled again.
https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/GeOfW

At least the original issue seems resolved, but I'm not sure if the cloud APM instrumentation works correctly.

Continued testing on a snapshot build by adding the full APM config to user settings. I confirmed that it arrived in elastic-agent.yml, but it is not picked up by fleet-server.

 "user_settings_yaml": "agent.monitoring.enabled: true\nagent.monitoring.traces: true \nagent.monitoring.apm.hosts: \n- https://184993dcfa7f4a57bf3725cd956939f9.apm.us-central1.gcp.cloud.es.io:443 \nagent.monitoring.apm.secret_token: L6azzQBxERPYa73Ojs\n"

I added more logs around the merge config logic, not seeing any of that in the logs.

It looks as if agent doesn't pick up the APM config added by cloud, tried restarting integration server, but it didn't help.
Another deployment still running: https://admin.found.no/deployments/a5255d3b5b2d9ccce4745c2ffdd22d1f

@juliaElastic juliaElastic reopened this May 3, 2024
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label May 6, 2024
@michel-laterman
Contributor

@juliaElastic, How are you enabling instrumentation?

I've tried to use the terraform in dev-tools/cloud/terraform to deploy + update a cluster to include APM config.
I deploy a test cluster, then alter the deployment resource in main.tf to include

observability = {
  deployment_id = "self"
}

integrations_server = {
  config = {
    docker_image = local.docker_image_ea
    user_settings_yaml = "agent.monitoring.enabled: true\nagent.monitoring.traces: true\nagent.monitoring.apm.hosts: [\"https://f3bdc01d3e5b44c093ce34ad3f4ad006.apm.us-west2.gcp.elastic-cloud.com:443\"]\nagent.monitoring.apm.secret_token: REDACTED"
  }
}

Where the host in the list is the cluster's own address; then update the deployment.

The logs you posted had one entry where tracing (in fleet-server) was enabled; however, I don't see that in my deployment.
I'm trying to use a custom image that has additional logging around loading APM instrumentation: #3523
link to deployment
link to logs

@juliaElastic
Contributor Author

juliaElastic commented May 7, 2024

There is a default APM config added to non-snapshot version deployments (documented here). Alex Piggott confirmed that the APM config is present in elastic-agent.yml but is somehow not reaching the agent. I don't know if this is a new or existing bug, but previously, when we reproduced the bug in this issue, I saw APM instrumentation enabled in the agent logs.

It's possible that this is not a new issue: I created an 8.13.2 cluster with oblt-cli, and APM instrumentation is disabled there too: https://admin.found.no/deployments/801c633a8dd40de2d71470a5c5e0b01d/integrations_server

Also looked at monitoring-oblt and not seeing any fleet-server traces from the past 5 months on any version.
https://monitoring-oblt.kb.us-west2.gcp.elastic-cloud.com/app/apm/services?comparisonEnabled=false&environment=ENVIRONMENT_ALL&rangeFrom=now-5M&rangeTo=now&offset=13129200000ms&kuery=

Though it's possible the cloud config didn't work in fleet-server before because this change was needed: #3277

@kuisathaverat

kuisathaverat commented May 7, 2024

Also looked at monitoring-oblt and not seeing any fleet-server traces from the past 5 months

IIRC we have only 30 days of APM retention

I enabled it on edge-oblt and edge-lite-oblt. It works but seems flaky: it worked for about 50 minutes while I was adding labels. I am trying to reapply the configuration.

agent.monitoring.enabled: true
agent.monitoring.traces: true
agent.monitoring.apm.hosts: ["https://redacted.apm.us-west2.gcp.elastic-cloud.com:443"]
agent.monitoring.apm.secret_token: REDACTED
Screenshot 2024-05-07 at 11 16 59

@juliaElastic
Contributor Author

juliaElastic commented May 7, 2024

Okay, I was able to enable fleet-server traces with a config added to user settings here (8.14 cluster): https://admin.found.no/deployments/6bd9a5eb5220c666921a8f44501d938e/integrations_server

"user_settings_yaml": "agent.monitoring.enabled: true\nagent.monitoring.traces: true\nagent.monitoring.apm.hosts: [\"https://monitoring-oblt.apm.us-west2.gcp.elastic-cloud.com:443\"]\nagent.monitoring.apm.secret_token: VwTJMFNGvRaiwVdxnW\n"

Traces here

It is strange though that the log says APM instrumentation disabled, while it is enabled. When I added apm config to a local agent, I saw the correct enabled message.

So to summarize what seems to work in 8.14:

  • adding apm config to cloud user settings yml to enable fleet-server traces
  • adding apm config locally to elastic-agent.yml

What does not seem to work:

Following the thread mentioned here it seems that the default apm config sends the traces to the overview cloud cluster, I'm seeing traces from 8.14 here: https://overview.elastic-cloud.com/app/r/s/JIEzg
Though for a newly created 8.14 cluster I'm not seeing traces showing up here.

These are the deployments having fleet-server traces in the overview cloud prod cluster:
image

@juliaElastic
Contributor Author

Created a follow-up issue for the missing traces. Closing this as the original issue is resolved.


7 participants