
Panic from concurrent map write after system metrics package update #32467

Closed · cmacknz opened this issue Jul 22, 2022 · 26 comments · Fixed by elastic/elastic-agent-system-metrics#43
Labels: bug, Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.3.0

@cmacknz
Member

cmacknz commented Jul 22, 2022

Introduced by #32336 which incorporated the changes from elastic/elastic-agent-system-metrics#40.

We have one confirmed case of APM server crashing with a panic from a concurrent map write when it is run under agent with monitoring enabled.

Stack trace:

{"log.level":"error","@timestamp":"2022-07-22T06:24:18.918Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"apm-server stderr: "fatal error: concurrent map writes\n"","agent.console.name":"apm-server","agent.console.type":"stderr","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-07-22T06:24:18.922Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"apm-server stderr: "\ngoroutine 47 [running]:\nruntime.throw({0xaaaad510411d, 0x15})\n\truntime/panic.go:1198 +0x54 fp=0x400275eb10 sp=0x400275eae0 pc=0xaaaad3387814\nruntime.mapassign(0xaaaad6a97640, 0x40004a8db0, 0x400275ed88)\n\truntime/map.go:585 +0x4d0 fp=0x400275eb90 sp=0x400275eb10 pc=0xaaaad335e7a0\ngithub.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).GetSelf(0x40002024b0)\n\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/metric/system/process/process.go:137 +0x1d0 fp=0x400275f140 sp=0x400275eb90 pc=0xaaaad43fecf0\ngithub.com/elastic/elastic-agent-system-metrics/report.InstanceCPUReporter.func1(0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/report/report.go:68 +0x98 fp=0x400275f6f0 sp=0x400275f140 pc=0xaaaad4405668\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Func).Visit(0x4000011038, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/metrics.go:221 +0x44 fp=0x400275f720 sp=0x400275f6f0 pc=0xaaaad383a2a4\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0x40005a0cc0, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1bc fp=0x400275f7f0 sp=0x400275f720 pc=0xaaaad383b82c\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(0x40005a0cc0, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:65 +0x40 fp=0x400275f820 sp=0x400275f7f0 pc=0xaaaad383b650\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0x40000ac700, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1bc fp=0x400275f8f0 sp=0x400275f820 pc=0xaaaad383b82c\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(...)\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:65\ngithub.com/elastic/elastic-agent-libs/monitoring.CollectStructSnapshot(0x40000ac700, 0x1, 0x0)\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/snapshot.go:89 +0xd4 fp=0x400275f9c0 sp=0x400275f8f0 pc=0xaaaad383cda4\ngithub.com/elastic/beats/v7/libbeat/api.makeAPIHandler.func1({0xaaaad6e8c810, 0x400017b260}, 0x4000f1a300)\n\tgithub.com/elastic/beats/v7@v7.0.0-alpha2.0.20220722011543-0c1f1a945d3a/libbeat/api/routes.go:74 +0x118 fp=0x400275fa20 sp=0x400275f9c0 pc=0xaaaad426fab8\nnet/http.HandlerFunc.ServeHTTP(0x400007b5b0, {0xaaaad6e8c810, 0x400017b260}, 0x4000f1a300)\n\tnet/http/server.go:2047 +0x40 fp=0x400275fa50 sp=0x400275fa20 pc=0xaaaad370f560\nnet/http.(*ServeMux).ServeHTTP(0x40005a1980, {0xaaaad6e8c810, 0x400017b260}, 0x4000f1a300)\n\tnet/http/server.go:2425 +0x18c fp=0x400275fab0 sp=0x400275fa50 pc=0xaaaad3710ecc\nnet/http.serverHandler.ServeHTTP({0x400017a000}"","agent.console.name":"apm-server","agent.console.type":"stderr","ecs.version":"1.6.0"}

@cmacknz added the Team:Elastic-Agent-Data-Plane and v8.3.0 labels on Jul 22, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@LeeDr

LeeDr commented Jul 22, 2022

Does this meet any of these guidelines for a blocker? Is there a work-around? How many users would this impact?

Here are some rough guidelines for identifying bugs that block releases:

  • The bug causes users to lose data.
  • It causes a severe security issue that the Security team has approved.
  • It impacts functionality used by a majority of the users.
  • It's embarrassing. For example: users can't even start Logstash in its basic form.

@fearful-symmetry
Contributor

@LeeDr based on a quick 10-minute look over the bug, it's originating in the Beat's monitoring/reporting of its own resource usage, probably because self-monitoring is the only thing that accesses a single instance of the process-monitoring libs from multiple threads; I don't think this issue is arising in user-facing metricsets. That being said, it's a quick fix, and I should have a PR today.
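For illustration only (not the actual elastic-agent-system-metrics code), a minimal Go sketch of this class of failure: two goroutines writing to the same unsynchronized map trip the runtime's concurrent-map-write detector, which terminates the whole process exactly as seen in the stack trace above.

```go
package main

// Minimal sketch of the failure mode, not the library code: two goroutines
// write to the same map with no synchronization. The Go runtime detects this
// and aborts the process with "fatal error: concurrent map writes".
func main() {
	procs := map[int]string{} // hypothetical shared PID -> name cache

	for i := 0; i < 2; i++ {
		go func() {
			for pid := 0; ; pid++ {
				procs[pid] = "self" // unsynchronized write from multiple goroutines
			}
		}()
	}

	select {} // block forever; the runtime throw arrives almost immediately
}
```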

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

Does this meet any of these guidelines for a blocker? Is there a work-around? How many users would this impact?

@LeeDr The impact appears to be that any beat run in standalone mode or started by agent can crash unexpectedly due to a race condition when HTTP monitoring is enabled. This can impact any user of beats or agent (HTTP monitoring is a very common use case), and it is an embarrassing failure in that the process simply crashes without warning.

We do not have exact numbers. I agree with @fearful-symmetry that this should be a quick fix. It is faster to simply fix the problem than to evaluate possible workarounds. Apologies for not providing more detail; the simplicity of the fix here is leading us to focus on eliminating the issue as quickly as we can.

For an even lower-risk alternative, we could simply revert the change that caused the problem in 8.3 and ship the fix for the bug it was originally attempting to address (agent CPU reporting as zero in self-monitoring) in 8.4. Both this option and the probable fix require similar amounts of work.

@fearful-symmetry
Contributor

The impact appears to be that any beat run in standalone mode or started by agent can crash unexpectedly based on a race condition when HTTP monitoring is enabled

I'm kind of surprised this causes a whole crash of the application. Metricbeat will catch panics from metricset goroutines; I would have assumed that monitoring did something similar, although I guess it's not too surprising that it doesn't.
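As a side note, a hedged sketch of that kind of recover-style guard is shown below (illustrative of the pattern only, not the actual libbeat code). It also explains why no such guard could have helped here: a concurrent map write is reported via an unrecoverable runtime throw rather than an ordinary panic, so the process dies regardless.

```go
package main

import "log"

// guard runs fn and converts an ordinary panic into a logged error instead of
// crashing the process. Illustrative of the pattern only, not the libbeat code.
// Note: "fatal error: concurrent map writes" is a runtime throw, not a panic,
// so recover() cannot intercept it and the process still dies.
func guard(name string, fn func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered panic in %s: %v", name, r)
		}
	}()
	fn()
}

func main() {
	guard("example-metricset", func() {
		panic("boom") // an ordinary panic is caught and logged
	})
	log.Println("process is still alive")
}
```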

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

@fearful-symmetry the error above is coming from inside APM server, not any of our own beats. Likely we need to update libbeat in APM server for this fix to take effect.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

Is anyone from @elastic/apm-server available to help us evaluate the impact here and make sure we fix this in the correct place? CC @simitt @axw

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

The change was brought into APM server with elastic/apm-server#8612. We'll need a PR to update the beats dependency there and then ensure the APM server artifacts are staged for the DRA process to pick them up.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

Cloudbeat is not affected; its beats dependency in the 8.3 branch is pinned to cab8871124af from June, before this change was introduced on July 13 in main and July 14 in 8.3.

Agent is pinned to v0.3.0. Fleet server is pinned to v0.3.0. Both are unaffected as the change was introduced in elastic-agent-system-metrics@v0.4.3.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

To get the fastest and lowest-risk resolution for 8.3.3, I am going to revert the offending change from beats. APM server has automation in place to update beats, which we can hopefully trigger to resolve the problem there quickly.

cmacknz added a commit to cmacknz/beats that referenced this issue Jul 22, 2022
@cmacknz
Member Author

cmacknz commented Jul 22, 2022

I opened #32470 against main as the APM server automation appears to pick up changes from the beats main branch. The 8.3 backport PR will follow this one.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

The Jenkins jobs to update APM server can be found at https://apm-ci.elastic.co/job/apm-server/job/update-beats-mbp. We should be able to manually trigger the one for 8.3 once the backport is merged.

cmacknz added a commit to cmacknz/beats that referenced this issue Jul 22, 2022
@cmacknz
Member Author

cmacknz commented Jul 22, 2022

https://github.com/elastic/dev/issues/2071#issuecomment-1192799064

We are going to downgrade the elastic-agent-system-metrics package in the affected repositories to resolve this for the 8.3 release. We will continue fixing the root cause of the problem separately.

@b0le

b0le commented Aug 11, 2022

@LeeDr @cmacknz @fearful-symmetry No pressure, but we have hit this issue in production after upgrading to Elastic Stack 8.3.3 from 7.17.1. Out of the 8 Elastic Agents we are running with the APM integration enabled, about half have encountered this issue within the first 2-4 hours after the upgrade. Worse yet, there's no automatic restart of the APM server. Following the crash, Elastic Agent does not reap the child APM server process by waiting on it, leaving behind zombies.

In which version can we expect a fix, and are there any workarounds in the meantime?

@cmacknz
Member Author

cmacknz commented Aug 11, 2022

@b0le that is surprising; this specific problem should not occur in the released version of 8.3.3.

For beats and apm-server, the package downgrade that fixes this problem is included in the 8.3.3 release commits:

This problem should be fixed in 8.3.3 from what I can see.

@LeeDr

LeeDr commented Aug 11, 2022

@fearful-symmetry I think the v8.3.0 label on this PR is incorrect and might confuse people. I think this fix was in the 8.3.3 release?

@fearful-symmetry
Contributor

@LeeDr going through the version tags, the 8.3.3 release contains @cmacknz's rollback to a previous version, so either this bug is older than we think, or it's a different but similar-looking bug.

@cmacknz
Member Author

cmacknz commented Aug 11, 2022

I don't think this bug was ever released either; it would only have been affecting snapshot versions of 8.3.3.

Most likely this is an unrelated problem that requires more investigation.

@LeeDr

LeeDr commented Aug 11, 2022

Sorry, I think I commented in the wrong place. I don't care too much about the version label on the issue; it's the version labels on the PRs that are important. I'll comment instead on elastic/elastic-agent-system-metrics#43 (comment)

@b0le

b0le commented Aug 11, 2022

@cmacknz @fearful-symmetry We are afraid that the release with the problematic elastic-agent-system-metrics v0.4.3 has somehow made it into the wild, as we are seeing the following log lines with Elastic Agent 8.3.3. Note the third error message, whose stack trace includes the string elastic-agent-system-metrics@v0.4.3/metric/system/process/process.go:137:

"apm-server stderr: \"fatal error: concurrent map writes\\n\""
"apm-server stderr: \"\\ngoroutine 382 [running]:\\nruntime.throw({0x559feefb4b73?, 0x0?})\\n\\truntime/panic.go\""
"apm-server stderr: \":992 +0x71 fp=0xc0020d14c0 sp=0xc0020d1490 pc=0x559fecffe011\\nruntime.mapassign(0xc0029ac3e6?, 0xa?, 0x559feef97562?)\\n\\truntime/map.go:595 +0x4d6 fp=0xc0020d1540 sp=0xc0020d14c0 pc=0x559fecfd5f36\\ngithub.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).GetSelf(_)\\n\\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/metric/system/process/process.go:137 +0x239 fp=0xc0020d1ad0 sp=0xc0020d1540 pc=0x559fee1e9c99\\ngithub.com/elastic/elastic-agent-system-metrics/report.MemStatsReporter.func1(0x1, {0x559fefe57598, 0xc0027eaa80})\\n\\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/report/report.go:\""
"apm-server stderr: \"52 +0x1e5 fp=0xc0020d36e8 sp=0xc0020d1ad0 pc=0x559fee1f2785\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Func).Visit(0xc00297d770?, 0x1c?, {0x559fefe57598?, 0xc0027eaa80?})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/metrics.go\""
"apm-server stderr: \":221 +0x2e fp=0xc0020d3710 sp=0xc0020d36e8 pc=0x559fed55b64e\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0xc000361780, 0x1, {0x559fefe57598\""
"apm-server stderr: \", 0xc0027eaa80})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1c3 fp=0xc0020d37f0 sp=0xc0020d3710 pc=0x559fed55d023\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(0xc00297d880?, 0xa9\""
"apm-server stderr: \"?, {0x559fefe57598?, 0xc0027eaa80?})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:65 +0x25 fp=0xc0020d3820 sp=0xc0020d37f0 pc=0x559fed55ce25\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0xc00005c740, 0x1, {0x559fefe57598, 0xc0027eaa80\""
"apm-server stderr: \"})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1c3 fp=0xc0020d3900 sp=0xc0020d3820 pc=0x559fed55d023\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(...)\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:\""
"apm-server stderr: \"65\\ngithub.com/elastic/elastic-agent-libs/monitoring.CollectStructSnapshot(0xc00007e540?, 0xb0?, 0x0)\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/snapshot.go:89 +\""
"apm-server stderr: \"0xcc fp=0xc0020d39c8 sp=0xc0020d3900 pc=0x559fed55ea4c\\ngithub.com/elastic/beats/v7/libbeat/api.makeAPIHandler.func1({0x559fefe4f0a8, 0xc0007530a0}\""
"apm-server stderr: \", 0xc002901600)\\n\\tgithub.com/elastic/beats/v7@v7.0.0-alpha2.0.20220722214259-1755b5dd3127/libbeat/api/routes.go:74 +0xde fp=0xc0020d3a20 sp=0xc0020d39c8 pc=\"

Full logs from one of the instances that has crashed are attached as elastic-agent-diagnostics-2022-08-10T16-34-42Z-00-redacted.zip.

For any users encountering this: disabling Elastic Agent's metrics collection is a reliable workaround. We have not seen any crashes in the past 8 hours since disabling it.

@fearful-symmetry
Contributor

So, it looks like it did sneak into the 8.3.3 tag there during a bot's dependency update:

https://github.com/elastic/apm-server/blob/f39045c2366eefb537ebdc70125a1e8e6be0367c/go.mod#L85

However, the 8.3 and 8.4 branches currently have the correct version.

@joshdover
Contributor

joshdover commented Aug 12, 2022

For any users encountering this: disabling Elastic Agent's metrics collection is a reliable workaround. We have not seen any crashes in the past 8 hours since disabling it.

If this works, then my guess is the error only happens when APM's /stats endpoint is called.

One problem is that in our Cloud service, our own internal monitoring likely calls this endpoint, which we can't really disable. What I'm not sure about is whether this endpoint is only called in standalone (legacy) APM mode, since we don't run Agent monitoring directly on Cloud.

In my own cluster where APM is running in managed mode (but on 8.3.2 right now), it does seem that we're still collecting metrics, so I would suspect that this could be affecting all APM Server instances in Cloud. This makes me think we should move to release a new patch version to fix this. @cmacknz do you agree?

@cmacknz
Member Author

cmacknz commented Aug 12, 2022

So, it looks like it did sneak into the 8.3.3 tag there during a bot's dependency update:

What I think happened is that the automated apm-server PR that updated to the beats version with the fix did not actually update the elastic-agent-system-metrics transitive dependency.

You can see that elastic/apm-server#8694 only updates go.sum and not go.mod. We would have needed to manually run go get github.com/elastic/elastic-agent-system-metrics@v0.4.2 in the apm-server repository. Seeing it now, this is the expected behaviour from the Go tooling in this case (it prefers the highest version required by any dependency), but we trusted the automation here too much during a last-minute fix before a release.
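As a hedged illustration of the step that was missed (module name and versions taken from the comments above), the version that actually ends up in the build can be checked and pinned explicitly from the apm-server repository:

```sh
# Show which version of the module the build will actually use
go list -m github.com/elastic/elastic-agent-system-metrics

# Pin the lower version explicitly in go.mod instead of relying on the
# updated beats requirement alone; Go keeps the highest version required
# by any dependency unless told otherwise
go get github.com/elastic/elastic-agent-system-metrics@v0.4.2
go mod tidy
```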

This makes me think we should move to release a new patch version to fix this. @cmacknz do you agree?

Yes, the impact of this is a race condition that will randomly cause APM server to panic. We should never have released this; the original issue was a blocker for 8.3.3.

@cmacknz
Member Author

cmacknz commented Aug 12, 2022

The head of the 8.3 branch in apm-server is now on https://github.com/elastic/elastic-agent-system-metrics/releases/tag/v0.4.4, which adds a mutex around the map in question to fix the problem.

For beats we reverted to v0.4.2 in the 8.3 release, which removes the agent monitoring bug fix (CPU reporting as zero in the agent dashboard) that introduced the concurrency problem. v0.4.4 has both the agent CPU bug fix and the concurrency bug fix.
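A minimal sketch of the kind of fix described above (illustrative only, not the actual v0.4.4 implementation; the type and field names are hypothetical): the shared process map is guarded by a mutex so concurrent reporters serialize their access.

```go
package main

import "sync"

// procStats sketches the shape of the fix: a shared map of per-process state
// guarded by a mutex. Hypothetical names, not the v0.4.4 code.
type procStats struct {
	mu        sync.Mutex
	processes map[int]string // hypothetical PID -> process name cache
}

func (s *procStats) set(pid int, name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.processes[pid] = name // safe: writes are serialized by the mutex
}

func (s *procStats) get(pid int) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	name, ok := s.processes[pid]
	return name, ok
}

func main() {
	s := &procStats{processes: map[int]string{}}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(pid int) {
			defer wg.Done()
			s.set(pid, "self") // concurrent writers no longer race
		}(i)
	}
	wg.Wait()
}
```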

@cmacknz
Member Author

cmacknz commented Aug 12, 2022

Given that there is a probable workaround in a comment above, we may not need an 8.3.4 release, since 8.4.0 is planned for a little more than a week from now.

For any users encountering this: disabling Elastic Agent's metrics collection is a reliable workaround. We have not seen any crashes in the past 8 hours since disabling it.

@Mpdreamz
Member

Mpdreamz commented Aug 12, 2022

I believe that on ESS the user cannot disable all metrics collection, e.g. the following is baked into the Integration Server Docker container:

https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/ApmDockerContainer.scala#L285
