Panic from concurrent map write after system metrics package update #32467
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
Does this meet any of these guidelines for a blocker? Is there a work-around? How many users would this impact? Here are some rough guidelines for identifying bugs that block releases:
|
@LeeDr based on a quick 10-minute lookover of the bug, it's originating in the beat's monitoring/reporting of its own resource usage, probably because the self-monitoring is the only thing that accesses a single instance of the process monitoring libs from multiple threads; I don't think this issue is arising in user-facing metricsets. That being said, it's a quick fix, and I should have a PR today. |
@LeeDr The impact appears to be that any beat run in standalone mode or started by agent can crash unexpectedly based on a race condition when HTTP monitoring is enabled. This can impact any user of beats or agent (HTTP monitoring is a very common use case), and it is an embarrassing failure in that the process simply crashes without warning. We do not have exact numbers. I agree with @fearful-symmetry that this should be a quick fix. It is faster to simply fix the problem than to evaluate possible workarounds. Apologies for not providing more detail; the simplicity of the fix here is causing us to focus on eliminating the issue as quickly as we can. For an even lower risk alternative we could simply revert the change that caused the problem in 8.3, and ship the fix for the bug it was originally attempting to address (agent CPU reporting as zero in self-monitoring) in 8.4. Both this option and the probable fix are similar amounts of work. |
I'm kind of surprised this causes a whole crash of the application. Metricbeat will catch panics of threads from metricsets, so I would have assumed that monitoring did something similar, although I guess it's not too surprising that it doesn't. |
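For readers wondering why a panic guard would not necessarily help here: the sketch below shows the general recover-in-goroutine pattern, not Metricbeat's actual code. `runGuarded` is a hypothetical helper named for illustration. One caveat that matters for this issue: the Go runtime reports unsynchronized map writes as `fatal error: concurrent map writes`, and `recover()` cannot intercept a fatal error, so a guard of this kind would likely not have prevented the crash.

```go
package main

import "fmt"

// runGuarded (hypothetical helper) runs fn in its own goroutine and converts
// an ordinary panic into an error instead of letting it kill the process.
// It does NOT protect against runtime fatal errors such as
// "fatal error: concurrent map writes", which recover() cannot catch.
func runGuarded(fn func()) error {
	errCh := make(chan error, 1)
	go func() {
		defer func() {
			// recover() returns nil when fn finished without panicking.
			if r := recover(); r != nil {
				errCh <- fmt.Errorf("recovered panic: %v", r)
				return
			}
			errCh <- nil
		}()
		fn()
	}()
	return <-errCh
}

func main() {
	err := runGuarded(func() { panic("boom") })
	fmt.Println(err)
}
```

The only robust fix for the map race is synchronizing access to the map itself.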
@fearful-symmetry the error above is coming from inside APM server, not any of our own beats. Likely we need to update libbeat in APM server for this fix to take effect. |
The change was brought into APM server with elastic/apm-server#8612. We'll need a PR to update the beats dependency there and then ensure the APM server artifacts are staged for the DRA process to pick them up. |
Cloudbeat is not affected; their beats dependency in the 8.3 branch is pinned to cab8871124af. Agent is pinned to v0.3.0. Fleet server is pinned to v0.3.0. Both are unaffected as the change was introduced in elastic-agent-system-metrics@v0.4.3. |
To get the fastest and lowest risk resolution for 8.3.3 I am going to revert the offending change from beats. APM server has automation in place to update beats, which we can hopefully trigger to resolve the problem there quickly. |
elastic#32467 was introduced in v0.4.3.
I opened #32470 against main as the APM server automation appears to pick up changes from the beats main branch. The 8.3 backport PR will follow this one. |
The Jenkins jobs to update APM server can be found at https://apm-ci.elastic.co/job/apm-server/job/update-beats-mbp. We should be able to manually trigger the one for 8.3 once the backport is merged. |
elastic#32467 was introduced in v0.4.3. Manual backport of elastic#32470
https://github.com/elastic/dev/issues/2071#issuecomment-1192799064 We are going to downgrade the elastic-agent-system-metrics package in the affected repositories to resolve this for the 8.3 release. We will continue fixing the root cause of the problem separately. |
@LeeDr @cmacknz @fearful-symmetry No pressure, but we have hit this issue in production after upgrading to Elastic Stack 8.3.3 from 7.17.1. Out of the 8 Elastic Agents we are running with the APM integration enabled, about half have encountered this issue within the first 2-4 hours after the upgrade. Worse yet, there's no automatic restart of the APM server. Following the crash Elastic Agent does not reap the child APM server process by In which version can we expect a fix and are there any workarounds in the meantime? |
@b0le that is surprising, this specific problem should not occur in released versions of 8.3.3. For beats and apm-server the package downgrade that fixes this problem is included in the 8.3.3 release commits:
This problem should be fixed in 8.3.3 from what I can see. |
@fearful-symmetry I think the v8.3.0 label on this PR is incorrect and might confuse people. I think this fix was in the 8.3.3 release? |
I don't think this bug was ever released either, it would only have been affecting snapshot versions of 8.3.3. Most likely this is an unrelated problem that requires more investigation. |
Sorry, I think I commented in the wrong place. I really don't care too much on the version label on the issue. It's the version labels on the PRs that are important. I'll comment instead on elastic/elastic-agent-system-metrics#43 (comment) |
@cmacknz @fearful-symmetry We are afraid that the release with the problematic elastic-agent-system-metrics v0.4.3 has somehow made it into the wild as we are seeing the following log lines with Elastic Agent 8.3.3, note the third error message with the stack trace which includes the string
Full logs from one of the instances that has crashed are attached as elastic-agent-diagnostics-2022-08-10T16-34-42Z-00-redacted.zip. For any users encountering this: disabling Elastic Agent's metrics collection is a reliable workaround. We have not seen any crashes in the past 8 hours since disabling it. |
So, it looks like it did sneak into the 8.3.3 tag there during a bot's dependency update: However, the |
If this works, then my guess is the error only happens when APM's One problem is that in our Cloud service, our own internal monitoring likely calls this endpoint, which we can't really disable. What I'm not sure about is whether this endpoint is only called in standalone (legacy) APM mode, since we don't run Agent monitoring directly on Cloud. In my own cluster where APM is running in managed mode (but on 8.3.2 right now), it does seem that we're still collecting metrics, so I would suspect that this could be affecting all APM Server instances in Cloud. This makes me think we should move to release a new patch version to fix this. @cmacknz do you agree? |
What I think happened is that in the automated apm-server PR that updated to the beats version with the fix, it did not actually update the elastic-agent-system-metrics package transitive dependency. You can see that elastic/apm-server#8694 only updates go.sum and not go.mod. We would have needed to manually
Yes, the impact of this is a race condition that will randomly cause APM server to panic. We should never have released this, the original issue was a blocker for 8.3.3. |
The head of the 8.3 branch in apm-server is at https://github.com/elastic/elastic-agent-system-metrics/releases/tag/v0.4.4 which adds a mutex to the map in question to fix the problem. For beats we reverted to v0.4.2 in the 8.3 release which removes a bug fix for agent monitoring (CPU reporting as zero in the agent dashboard) that introduced the concurrency problem. v0.4.4 has both this agent CPU bug fix and the concurrency bug fix. |
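As a rough illustration of the kind of fix described above (guarding a shared map with a mutex), here is a minimal sketch. It is not the actual v0.4.4 change: `pidCache`, its fields, and `fillConcurrently` are hypothetical names chosen for this example, and the real types in elastic-agent-system-metrics may look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// pidCache is a hypothetical stand-in for the shared per-process state map.
// The mutex serializes all access so concurrent callers cannot race.
type pidCache struct {
	mu    sync.Mutex
	procs map[int]string
}

func newPidCache() *pidCache {
	return &pidCache{procs: make(map[int]string)}
}

// Set writes an entry while holding the lock.
func (c *pidCache) Set(pid int, state string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.procs[pid] = state
}

// Len reads the map size, also under the lock.
func (c *pidCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.procs)
}

// fillConcurrently writes n distinct keys from n goroutines, the access
// pattern that would crash the process with an unguarded map.
func fillConcurrently(c *pidCache, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(pid int) {
			defer wg.Done()
			c.Set(pid, "running")
		}(i)
	}
	wg.Wait()
}

func main() {
	c := newPidCache()
	fillConcurrently(c, 100)
	fmt.Println(c.Len()) // prints 100
}
```

Running a sketch like this with `go run -race` is a cheap way to confirm that no unsynchronized access remains.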
Given that there is a probable workaround in a comment above, we may not need an 8.3.4 release, since 8.4.0 is planned for a little more than a week from now.
|
I believe on ESS the user cannot disable all metrics collection, e.g. the following is baked into the Integration Server docker container: |
#32467 was introduced in v0.4.3.
Introduced by #32336 which incorporated the changes from elastic/elastic-agent-system-metrics#40.
We have one confirmed case of APM server crashing with a panic from a concurrent map write when it is run under agent with monitoring enabled.
Stack trace: