
Panic from concurrent map write after system metrics package update #32467

Closed · cmacknz opened this issue Jul 22, 2022 · 26 comments · Fixed by elastic/elastic-agent-system-metrics#43
Labels: bug, Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.3.0

@cmacknz
Member

cmacknz commented Jul 22, 2022

Introduced by #32336 which incorporated the changes from elastic/elastic-agent-system-metrics#40.

We have one confirmed case of APM server crashing with a panic from a concurrent map write when it is run under agent with monitoring enabled.

Stack trace:

{"log.level":"error","@timestamp":"2022-07-22T06:24:18.918Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"apm-server stderr: "fatal error: concurrent map writes\n"","agent.console.name":"apm-server","agent.console.type":"stderr","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-07-22T06:24:18.922Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"apm-server stderr: "\ngoroutine 47 [running]:\nruntime.throw({0xaaaad510411d, 0x15})\n\truntime/panic.go:1198 +0x54 fp=0x400275eb10 sp=0x400275eae0 pc=0xaaaad3387814\nruntime.mapassign(0xaaaad6a97640, 0x40004a8db0, 0x400275ed88)\n\truntime/map.go:585 +0x4d0 fp=0x400275eb90 sp=0x400275eb10 pc=0xaaaad335e7a0\ngithub.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).GetSelf(0x40002024b0)\n\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/metric/system/process/process.go:137 +0x1d0 fp=0x400275f140 sp=0x400275eb90 pc=0xaaaad43fecf0\ngithub.com/elastic/elastic-agent-system-metrics/report.InstanceCPUReporter.func1(0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/report/report.go:68 +0x98 fp=0x400275f6f0 sp=0x400275f140 pc=0xaaaad4405668\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Func).Visit(0x4000011038, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/metrics.go:221 +0x44 fp=0x400275f720 sp=0x400275f6f0 pc=0xaaaad383a2a4\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0x40005a0cc0, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1bc fp=0x400275f7f0 sp=0x400275f720 pc=0xaaaad383b82c\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(0x40005a0cc0, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:65 +0x40 fp=0x400275f820 sp=0x400275f7f0 pc=0xaaaad383b650\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0x40000ac700, 0x1, {0xaaaad6ed7598, 0x4000d2a000})\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1bc fp=0x400275f8f0 sp=0x400275f820 pc=0xaaaad383b82c\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(...)\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:65\ngithub.com/elastic/elastic-agent-libs/monitoring.CollectStructSnapshot(0x40000ac700, 0x1, 0x0)\n\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/snapshot.go:89 +0xd4 fp=0x400275f9c0 sp=0x400275f8f0 pc=0xaaaad383cda4\ngithub.com/elastic/beats/v7/libbeat/api.makeAPIHandler.func1({0xaaaad6e8c810, 0x400017b260}, 0x4000f1a300)\n\tgithub.com/elastic/beats/v7@v7.0.0-alpha2.0.20220722011543-0c1f1a945d3a/libbeat/api/routes.go:74 +0x118 fp=0x400275fa20 sp=0x400275f9c0 pc=0xaaaad426fab8\nnet/http.HandlerFunc.ServeHTTP(0x400007b5b0, {0xaaaad6e8c810, 0x400017b260}, 0x4000f1a300)\n\tnet/http/server.go:2047 +0x40 fp=0x400275fa50 sp=0x400275fa20 pc=0xaaaad370f560\nnet/http.(*ServeMux).ServeHTTP(0x40005a1980, {0xaaaad6e8c810, 0x400017b260}, 0x4000f1a300)\n\tnet/http/server.go:2425 +0x18c fp=0x400275fab0 sp=0x400275fa50 pc=0xaaaad3710ecc\nnet/http.serverHandler.ServeHTTP({0x400017a000}"","agent.console.name":"apm-server","agent.console.type":"stderr","ecs.version":"1.6.0"}

@cmacknz added the Team:Elastic-Agent-Data-Plane and v8.3.0 labels on Jul 22, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@LeeDr

LeeDr commented Jul 22, 2022

Does this meet any of these guidelines for a blocker? Is there a work-around? How many users would this impact?

Here are some rough guidelines for identifying bugs that block releases:

  • The bug causes users to lose data.
  • It causes a severe security issue that the Security team has approved.
  • It impacts functionality used by a majority of the users.
  • It's embarrassing. For example: users can't even start Logstash in its basic form.

@fearful-symmetry
Contributor

@LeeDr based on a quick 10-minute look over the bug, it's originating in the Beat's monitoring/reporting of its own resource usage, probably because self-monitoring is the only thing that accesses a single instance of the process-monitoring libs from multiple threads; I don't think this issue is arising in user-facing metricsets. That being said, it's a quick fix, and I should have a PR today.
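For illustration only (not the actual elastic-agent-system-metrics code), a minimal Go sketch of this class of failure: two goroutines writing to the same unsynchronized map trip the runtime's concurrent-map-write detector, which terminates the whole process exactly as seen in the stack trace above.

```go
package main

// Minimal sketch of the failure mode, not the library code: two goroutines
// write to the same map with no synchronization. The Go runtime detects this
// and aborts the process with "fatal error: concurrent map writes".
func main() {
	procs := map[int]string{} // hypothetical shared PID -> name cache

	for i := 0; i < 2; i++ {
		go func() {
			for pid := 0; ; pid++ {
				procs[pid] = "self" // unsynchronized write from multiple goroutines
			}
		}()
	}

	select {} // block forever; the runtime throw arrives almost immediately
}
```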

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

Does this meet any of these guidelines for a blocker? Is there a work-around? How many users would this impact?

@LeeDr The impact appears to be that any beat run in standalone mode or started by agent can crash unexpectedly due to a race condition when HTTP monitoring is enabled. This can impact any user of beats or agent (HTTP monitoring is a very common use case), and it is an embarrassing failure in that the process simply crashes without warning.

We do not have exact numbers. I agree with @fearful-symmetry that this should be a quick fix. It is faster to simply fix the problem than to evaluate possible workarounds. Apologies for not providing more detail; the simplicity of the fix here is leading us to focus on eliminating the issue as quickly as we can.

For an even lower-risk alternative, we could simply revert the change that caused the problem in 8.3 and ship the fix for the bug it was originally attempting to address (agent CPU reporting as zero in self-monitoring) in 8.4. Both this option and the probable fix require similar amounts of work.

@fearful-symmetry
Contributor

The impact appears to be that any beat run in standalone mode or started by agent can crash unexpectedly based on a race condition when HTTP monitoring is enabled

I'm kind of surprised this causes a whole crash of the application. Metricbeat will catch panics from metricset goroutines; I would have assumed that monitoring did something similar, although I guess it's not too surprising that it doesn't.
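As a side note, a hedged sketch of that kind of recover-style guard is shown below (illustrative of the pattern only, not the actual libbeat code). It also explains why no such guard could have helped here: a concurrent map write is reported via an unrecoverable runtime throw rather than an ordinary panic, so the process dies regardless.

```go
package main

import "log"

// guard runs fn and converts an ordinary panic into a logged error instead of
// crashing the process. Illustrative of the pattern only, not the libbeat code.
// Note: "fatal error: concurrent map writes" is a runtime throw, not a panic,
// so recover() cannot intercept it and the process still dies.
func guard(name string, fn func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered panic in %s: %v", name, r)
		}
	}()
	fn()
}

func main() {
	guard("example-metricset", func() {
		panic("boom") // an ordinary panic is caught and logged
	})
	log.Println("process is still alive")
}
```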

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

@fearful-symmetry the error above is coming from inside APM server, not any of our own beats. Likely we need to update libbeat in APM server for this fix to take effect.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

Is anyone from @elastic/apm-server available to help us evaluate the impact here and make sure we fix this in the correct place? CC @simitt @axw

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

The change was brought into APM server with elastic/apm-server#8612. We'll need a PR to update the beats dependency there and then ensure the APM server artifacts are staged for the DRA process to pick them up.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

Cloudbeat is not affected; its beats dependency in the 8.3 branch is pinned to cab8871124af from June, before this change was introduced on July 13 in main and July 14 in 8.3.

Agent is pinned to v0.3.0. Fleet server is pinned to v0.3.0. Both are unaffected as the change was introduced in elastic-agent-system-metrics@v0.4.3.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

To get the fastest and lowest-risk resolution for 8.3.3, I am going to revert the offending change from beats. APM server has automation in place to update beats, which we can hopefully trigger to resolve the problem there quickly.

cmacknz added a commit to cmacknz/beats that referenced this issue Jul 22, 2022
@cmacknz
Member Author

cmacknz commented Jul 22, 2022

I opened #32470 against main as the APM server automation appears to pick up changes from the beats main branch. The 8.3 backport PR will follow this one.

@cmacknz
Member Author

cmacknz commented Jul 22, 2022

The Jenkins jobs to update APM server can be found at https://apm-ci.elastic.co/job/apm-server/job/update-beats-mbp. We should be able to manually trigger the one for 8.3 once the backport is merged.

cmacknz added a commit to cmacknz/beats that referenced this issue Jul 22, 2022
@cmacknz
Member Author

cmacknz commented Jul 22, 2022

https://github.com/elastic/dev/issues/2071#issuecomment-1192799064

We are going to downgrade the elastic-agent-system-metrics package in the affected repositories to resolve this for the 8.3 release. We will continue fixing the root cause of the problem separately.

@b0le

b0le commented Aug 11, 2022

@LeeDr @cmacknz @fearful-symmetry No pressure, but we have hit this issue in production after upgrading to Elastic Stack 8.3.3 from 7.17.1. Out of the 8 Elastic Agents we are running with the APM integration enabled, about half have encountered this issue within the first 2-4 hours after the upgrade. Worse yet, there's no automatic restart of the APM server. Following the crash, Elastic Agent does not reap the child APM server process by waiting on it, leaving behind zombies.

In which version can we expect a fix, and are there any workarounds in the meantime?

@cmacknz
Member Author

cmacknz commented Aug 11, 2022

@b0le that is surprising; this specific problem should not occur in the released version of 8.3.3.

For beats and apm-server, the package downgrade that fixes this problem is included in the 8.3.3 release commits:

This problem should be fixed in 8.3.3 from what I can see.

@LeeDr

LeeDr commented Aug 11, 2022

@fearful-symmetry I think the v8.3.0 label on this PR is incorrect and might confuse people. I think this fix was in the 8.3.3 release?

@fearful-symmetry
Contributor

@LeeDr going through the version tags, the 8.3.3 release contains @cmacknz's rollback to a previous version, so either this bug is older than we think, or it's a different but similar-looking bug.

@cmacknz
Member Author

cmacknz commented Aug 11, 2022

I don't think this bug was ever released either; it would only have been affecting snapshot versions of 8.3.3.

Most likely this is an unrelated problem that requires more investigation.

@LeeDr

LeeDr commented Aug 11, 2022

Sorry, I think I commented in the wrong place. I don't care too much about the version label on the issue; it's the version labels on the PRs that are important. I'll comment instead on elastic/elastic-agent-system-metrics#43 (comment)

@b0le

b0le commented Aug 11, 2022

@cmacknz @fearful-symmetry We are afraid that the release with the problematic elastic-agent-system-metrics v0.4.3 has somehow made it into the wild, as we are seeing the following log lines with Elastic Agent 8.3.3. Note the third error message, whose stack trace includes the string elastic-agent-system-metrics@v0.4.3/metric/system/process/process.go:137:

"apm-server stderr: \"fatal error: concurrent map writes\\n\""
"apm-server stderr: \"\\ngoroutine 382 [running]:\\nruntime.throw({0x559feefb4b73?, 0x0?})\\n\\truntime/panic.go\""
"apm-server stderr: \":992 +0x71 fp=0xc0020d14c0 sp=0xc0020d1490 pc=0x559fecffe011\\nruntime.mapassign(0xc0029ac3e6?, 0xa?, 0x559feef97562?)\\n\\truntime/map.go:595 +0x4d6 fp=0xc0020d1540 sp=0xc0020d14c0 pc=0x559fecfd5f36\\ngithub.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).GetSelf(_)\\n\\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/metric/system/process/process.go:137 +0x239 fp=0xc0020d1ad0 sp=0xc0020d1540 pc=0x559fee1e9c99\\ngithub.com/elastic/elastic-agent-system-metrics/report.MemStatsReporter.func1(0x1, {0x559fefe57598, 0xc0027eaa80})\\n\\tgithub.com/elastic/elastic-agent-system-metrics@v0.4.3/report/report.go:\""
"apm-server stderr: \"52 +0x1e5 fp=0xc0020d36e8 sp=0xc0020d1ad0 pc=0x559fee1f2785\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Func).Visit(0xc00297d770?, 0x1c?, {0x559fefe57598?, 0xc0027eaa80?})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/metrics.go\""
"apm-server stderr: \":221 +0x2e fp=0xc0020d3710 sp=0xc0020d36e8 pc=0x559fed55b64e\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0xc000361780, 0x1, {0x559fefe57598\""
"apm-server stderr: \", 0xc0027eaa80})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1c3 fp=0xc0020d37f0 sp=0xc0020d3710 pc=0x559fed55d023\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(0xc00297d880?, 0xa9\""
"apm-server stderr: \"?, {0x559fefe57598?, 0xc0027eaa80?})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:65 +0x25 fp=0xc0020d3820 sp=0xc0020d37f0 pc=0x559fed55ce25\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).doVisit(0xc00005c740, 0x1, {0x559fefe57598, 0xc0027eaa80\""
"apm-server stderr: \"})\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:83 +0x1c3 fp=0xc0020d3900 sp=0xc0020d3820 pc=0x559fed55d023\\ngithub.com/elastic/elastic-agent-libs/monitoring.(*Registry).Visit(...)\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/registry.go:\""
"apm-server stderr: \"65\\ngithub.com/elastic/elastic-agent-libs/monitoring.CollectStructSnapshot(0xc00007e540?, 0xb0?, 0x0)\\n\\tgithub.com/elastic/elastic-agent-libs@v0.2.9/monitoring/snapshot.go:89 +\""
"apm-server stderr: \"0xcc fp=0xc0020d39c8 sp=0xc0020d3900 pc=0x559fed55ea4c\\ngithub.com/elastic/beats/v7/libbeat/api.makeAPIHandler.func1({0x559fefe4f0a8, 0xc0007530a0}\""
"apm-server stderr: \", 0xc002901600)\\n\\tgithub.com/elastic/beats/v7@v7.0.0-alpha2.0.20220722214259-1755b5dd3127/libbeat/api/routes.go:74 +0xde fp=0xc0020d3a20 sp=0xc0020d39c8 pc=\"

Full logs from one of the instances that has crashed are attached as elastic-agent-diagnostics-2022-08-10T16-34-42Z-00-redacted.zip.

For any users encountering this: disabling Elastic Agent's metrics collection is a reliable workaround. We have not seen any crashes in the past 8 hours since disabling it.

@fearful-symmetry
Contributor

So, it looks like it did sneak into the 8.3.3 tag there during a bot's dependency update:

https://github.com/elastic/apm-server/blob/f39045c2366eefb537ebdc70125a1e8e6be0367c/go.mod#L85

However, the 8.3 and 8.4 branches currently have the correct version.

@joshdover
Contributor

joshdover commented Aug 12, 2022

For any users encountering this: disabling Elastic Agent's metrics collection is a reliable workaround. We have not seen any crashes in the past 8 hours since disabling it.

If this works, then my guess is the error only happens when APM's /stats endpoint is called.

One problem is that in our Cloud service, our own internal monitoring likely calls this endpoint, which we can't really disable. What I'm not sure about is whether this endpoint is only called in standalone (legacy) APM mode, since we don't run Agent monitoring directly on Cloud.

In my own cluster where APM is running in managed mode (but on 8.3.2 right now), it does seem that we're still collecting metrics, so I would suspect that this could be affecting all APM Server instances in Cloud. This makes me think we should move to release a new patch version to fix this. @cmacknz do you agree?

@cmacknz
Member Author

cmacknz commented Aug 12, 2022

So, it looks like it did sneak into the 8.3.3 tag there during a bot's dependency update:

What I think happened is that the automated apm-server PR that updated to the beats version with the fix did not actually update the elastic-agent-system-metrics transitive dependency.

You can see that elastic/apm-server#8694 only updates go.sum and not go.mod. We would have needed to manually run go get github.com/elastic/elastic-agent-system-metrics@v0.4.2 in the apm-server repository. Seeing it now, this is the expected behaviour from the Go tooling in this case (it prefers the highest version required by any dependency), but we trusted the automation here too much during a last-minute fix before a release.
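As a hedged illustration of the step that was missed (module name and versions taken from the comments above), the version that actually ends up in the build can be checked and pinned explicitly from the apm-server repository:

```sh
# Show which version of the module the build will actually use
go list -m github.com/elastic/elastic-agent-system-metrics

# Pin the lower version explicitly in go.mod instead of relying on the
# updated beats requirement alone; Go keeps the highest version required
# by any dependency unless told otherwise
go get github.com/elastic/elastic-agent-system-metrics@v0.4.2
go mod tidy
```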

This makes me think we should move to release a new patch version to fix this. @cmacknz do you agree?

Yes, the impact of this is a race condition that will randomly cause APM server to panic. We should never have released this; the original issue was a blocker for 8.3.3.

@cmacknz
Member Author

cmacknz commented Aug 12, 2022

The head of the 8.3 branch in apm-server is now on https://github.com/elastic/elastic-agent-system-metrics/releases/tag/v0.4.4, which adds a mutex around the map in question to fix the problem.

For beats we reverted to v0.4.2 in the 8.3 release, which removes the agent monitoring bug fix (CPU reporting as zero in the agent dashboard) that introduced the concurrency problem. v0.4.4 has both the agent CPU bug fix and the concurrency bug fix.
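A minimal sketch of the kind of fix described above (illustrative only, not the actual v0.4.4 implementation; the type and field names are hypothetical): the shared process map is guarded by a mutex so concurrent reporters serialize their access.

```go
package main

import "sync"

// procStats sketches the shape of the fix: a shared map of per-process state
// guarded by a mutex. Hypothetical names, not the v0.4.4 code.
type procStats struct {
	mu        sync.Mutex
	processes map[int]string // hypothetical PID -> process name cache
}

func (s *procStats) set(pid int, name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.processes[pid] = name // safe: writes are serialized by the mutex
}

func (s *procStats) get(pid int) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	name, ok := s.processes[pid]
	return name, ok
}

func main() {
	s := &procStats{processes: map[int]string{}}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(pid int) {
			defer wg.Done()
			s.set(pid, "self") // concurrent writers no longer race
		}(i)
	}
	wg.Wait()
}
```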

@cmacknz
Member Author

cmacknz commented Aug 12, 2022

Given that there is a probable workaround in a comment above, we may not need an 8.3.4 release, since 8.4.0 is planned for a little more than a week from now.

For any users encountering this: disabling Elastic Agent's metrics collection is a reliable workaround. We have not seen any crashes in the past 8 hours since disabling it.

@Mpdreamz
Member

Mpdreamz commented Aug 12, 2022

I believe that on ESS the user cannot disable all metrics collection, e.g. the following is baked into the Integration Server Docker container:

https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/ApmDockerContainer.scala#L285
