[APM][Stack Monitoring] Changes for integrating APM with Elastic Agent #90157
Comments
What if there is more than one APM server? Wouldn't this show the same data as the |
I believe it would show the same data if there is only one instance - but not if there are multiple, as I expect the Overview to show the aggregated values, while the single instance to only show its data. IMO showing the same values for an aggregated and a per node view if there is only one node sounds fine. |
I'm not sure I fully understand. If there is a single APM server, the panels will show the same thing, as the aggregated and single instance values will be the same. If there are multiple APM servers, the Overview panel will show the aggregated values, but what does the second panel show? How do I pick one instance to show? Like: |
@simitt FYI I met with @katrin-freihofner and we are heading toward a style where we use "gravity" on dashboards containing observability data coming from multiple layers (application layer above the infrastructure layer). In the context of Elastic Agent monitoring, it probably means that we want the system graphs (e.g. sysload) at the bottom of the screen and the graphs related to the internals of the shippers (e.g. processing queue...) at the top. |
@cyrille-leclerc and @katrin-freihofner thanks for looking at this. The motivation for moving the system metrics on top is that they are the most significant metrics indicating when the deployment should be scaled. Followed by the number of requests and response statuses and processed events. If that doesn't fit the general design direction - it's fine ofc to keep the order as is. @chrisronline you are right, sorry for the confusion. What do you think about also conditionally showing |
Sure, I can do that |
Update: after chatting with @cyrille-leclerc I added a Problems to solve section in the description, to make it clearer why it is important to keep the Stack Monitoring UI working. |
@simitt what does |
@jasonrhodes nothing really in this context, I updated the description and removed it. |
Reopening as per request: #90873 (comment) Currently blocked by reason mentioned here: #90873 (review) |
Is this still technically blocked by #90873 (review)? And is it still on track for |
Yes, this is still on track for |
for reference - the rest of the conversation was taking place directly in the open PR #95129 (comment). |
This has been implemented for |
Motivation and Overview
Integrating APM Server with Elastic Agent has some impact on the collected metrics. Continuing to provide users with useful insights into running deployments requires some changes to the APM Stack Monitoring UI. The focus will stay on APM Server specific metrics where an isolated view of APM Server makes sense (processed events, number of requests, etc.), and on Elastic Agent aggregated metrics otherwise (system metrics when running inside a container). The system related metrics are the most important metrics for scaling decisions, so showing them for the overall group seems the most useful when running inside a container.
There is an existing issue to switch to using cgroups data for system metrics, #79050 (planned for 7.12). Container resource limits are reflected in the cgroup data, giving better insight into how much of the actually available resources is used. When running inside a container and as an Elastic Agent integration, potential resource limits will be set for the whole group (Elastic Agent + sub processes). To be clear about the semantics of the system resource metrics, it is important to use correct and precise terminology.
Adding other information specific to Elastic Agent or integrations to the Stack Monitoring UI is not in scope for this issue, and is not generally planned. For more details on Elastic Agent related visualisations, refer to kibana#81872.
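As a rough illustration of why cgroup data gives a better picture of the actually available resources, the following minimal sketch computes memory utilization against the container limit when one is set, and against the host total otherwise. The type and field names (`MemorySample`, `usageBytes`, `hostTotalBytes`, `cgroupLimitBytes`) are hypothetical and do not reflect the actual Stack Monitoring schema or metric names.

```typescript
// Hypothetical shape, for illustration only; not the real monitoring document.
interface MemorySample {
  usageBytes: number;        // memory used by the process or group
  hostTotalBytes: number;    // total memory of the host
  cgroupLimitBytes?: number; // container limit, if one is set
}

// Returns the fraction (0..1) of the effectively available memory in use.
function memoryUtilization(sample: MemorySample): number {
  const limit = sample.cgroupLimitBytes ?? 0;
  // A cgroup limit only matters when it is set and tighter than the host total.
  const available =
    limit > 0 && limit < sample.hostTotalBytes ? limit : sample.hostTotalBytes;
  return available > 0 ? sample.usageBytes / available : 0;
}
```

Under these assumptions, 1 GiB used on a 16 GiB host with a 2 GiB container limit is reported as 50% utilization rather than 6.25%, which is the more relevant signal for scaling decisions.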
Problems to Solve with the Stack Monitoring UI
503 Queue is Full
Response Errors Intake
Changes mostly concern renaming and moving around components, but also involve some conditional logic for deciding on the right terminology and metrics to show.
Break up per View
Cluster Listing (no changes required)
No changes are required for the Cluster Listing.
Cluster Overview
This overview is designed to act as a high level health indicator for the APM Server instances. Currently it shows Processed Events and Last Events for the APM Server overview (all instances combined) and Memory Usage for a concrete APM Server instance.
When running as an Elastic Agent sub process, the system resources might be shared with other Agent sub processes. Showing the Memory Usage of APM Server would still be possible, but seems less important. The suggested change is to keep this overview focused on APM Server and also show the Processed Events and Last Events for the concrete APM Server instance. See mock up below.
APM server overview
* Move resource related metrics (CPU, memory, load) up in the page into a dedicated section (between Alerts and Response Count metrics)
In case this can be added to the Stack Monitoring UI, it requires some small additional changes on the metrics collection, so it would be good to know if this will be planned or not.
Conditional Logic to distinguish between apm-server and elastic-agent-group:
For the detection of whether or not cgroup values should be used, @chrisronline mentioned that other apps set a flag in the Kibana config options. We could do something similar for APM. I am wondering how this works when using a dedicated monitoring cluster, to which data from multiple other clusters are shipped, where other clusters could partially be running inside containers, partially directly on a host system?
For the Elastic Agent detection let's follow a similar approach as for the cgroup/container decision.
APM server instances (no changes required)
No changes are required.
APM server instance xyz
Same changes should be made as for the APM Server overview page (moving system resource usage up and into dedicated section, conditionally change title)
Timeline
7.13: APM Server integration with Elastic Agent (beta)
7.14: APM Server integration with Elastic Agent (GA)
It would be great to get the changes in for 7.13.
@cyrille-leclerc could you review the proposed changes, with a focus on the terminology used and the design changes involved?
cc @ruflin and @elastic/apm-server
cc @jasonrhodes @ravikesarwani @chrisronline