[APM][Stack Monitoring] Changes for integrating APM with Elastic Agent #90157

simitt · 2021-02-03T12:43:36Z

Motivation and Overview

Integrating APM Server with Elastic Agent has some impact on the collected metrics. Continuing to provide users with useful insights into running deployments requires some changes to the APM Stack Monitoring UI. The focus will stay on APM Server specific metrics where an isolated view of APM Server makes sense (processed events, number of requests, etc.), and on aggregated Elastic Agent metrics otherwise (system metrics when running inside a container). The system metrics are the most important ones for scaling decisions, so showing them for the overall group seems most useful when running inside a container.
There is an existing issue to switch to using cgroup data for system metrics, #79050 (planned for 7.12). Container resource limits are reflected in the cgroup data, which gives better insight into how much of the actually available resources is used. When running inside a container as an Elastic Agent integration, any resource limits are set for the whole group (Elastic Agent + sub processes). To make the semantics of the system resource metrics clear, using correct and precise terminology is important.

Adding other information specific to Elastic Agent or integrations to the Stack Monitoring UI is out of scope for this issue and not generally planned. For more details on Elastic Agent related visualisations, refer to kibana#81872.

Problems to Solve with the Stack Monitoring UI

When to scale up APM Server, or in the future Elastic Agent?
- system resource usage: CPU, memory
- Response Errors Intake - 503 Queue is Full
When to change the internal memory queue settings?
- system resource usage: CPU, memory - not using 100% CPU, while seeing 503 Queue is Full Response Errors Intake
Identify potential issues between APM Agents and Server, some examples:
- Response Errors Intake - Validate (e.g. version incompatibility)
- Response Errors Intake - Unauthorized (invalid secret_token or API Key configured in APM Agent)
- Response Errors Intake - Too large (events are larger than usually allowed, users can customize configuration)
- Response Errors Intake - Rate limit (more RUM requests than expected, rate limiting settings can be customized by users)
Identify potential APM Server internal issues (e.g. invalid events filling up the queue)
- Output Events Rate, Output Failed Events Rate, Processed Events
Identify potential issues with agent remote configuration (APM Server querying information from Kibana)
- Response Count Agent Config Management, Response Errors Agent Config Management
Troubleshoot APM Server and APM Agents even when the Observability Cluster is severely damaged
- Use a monitoring system different from the Elastic Observability cluster to monitor this Elastic Observability cluster
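As a rough illustration of how the first two questions relate, the sketch below classifies a snapshot of the metrics listed above. The field names, the function name, and the `cpuSaturatedPct` threshold are assumptions made for this example; they are not values or APIs used by the Stack Monitoring UI, and memory usage is left out for brevity.

```typescript
// Hypothetical sketch; field names and the threshold are illustrative assumptions.
interface IntakeSnapshot {
  cpuPct: number; // CPU usage of the process or cgroup, 0-100
  queueFullErrors: number; // count of "503 Queue is Full" intake responses in the window
}

type Suggestion = 'scale-up' | 'review-queue-settings' | 'none';

function suggestAction(s: IntakeSnapshot, cpuSaturatedPct = 90): Suggestion {
  const cpuSaturated = s.cpuPct >= cpuSaturatedPct;
  if (cpuSaturated) {
    // High resource usage is a scaling signal, with or without queue errors.
    return 'scale-up';
  }
  // Queue is full although CPU is not saturated: look at the memory queue settings first.
  return s.queueFullErrors > 0 ? 'review-queue-settings' : 'none';
}
```

For example, `suggestAction({ cpuPct: 40, queueFullErrors: 12 })` falls into the second question (queue settings), while the same error count at 95% CPU points towards scaling up.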

Changes mostly concern renaming and moving around components, but also involve some conditional logic for deciding on the right terminology and metrics to show.

Breakdown per View

Cluster Listing (no changes required)

No changes are required for the Cluster Listing.

Cluster Overview

This overview is designed to act as a high level health indicator for the APM Server instances. Currently it shows Processed Events and Last Events for the APM Server overview (all instances combined) and Memory Usage for a concrete APM Server instance.

When running as an Elastic Agent sub process, the system resources might be shared with other Agent sub processes. Showing the Memory Usage of APM Server would still be possible, but seems less important. The suggested change is to keep this overview focused on APM Server and also show the Processed Events and Last Events for the concrete APM Server instance. See mock up below.

APM server overview

* Move resource related metrics (CPU, memory, load) up in the page into a dedicated section (between Alerts and Response Count metrics).
* Show APM Server - Resource Usage or Elastic Agent Group - Resource Usage in the title of the resource usage section, based on the conditional logic described below.
* Show the rest of the metrics in a dedicated section with the header APM Server - Custom Metrics.

Nice to have: if available, calculate the relative resource usage for CPU and memory per process and show it inside the CPU and Memory graphs. The data would be distinguishable by beat.type (e.g. fleet-server, apm, ...).
If this can be added to the Stack Monitoring UI, it requires some small additional changes to the metrics collection, so it would be good to know whether it will be planned.

Conditional Logic to distinguish between apm-server and elastic-agent-group:

not running inside a container -> apm-server
running inside a container but not detecting Elastic Agent integration -> apm-server
running inside a container and detecting Elastic Agent integration -> elastic-agent-group
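The three rules above can be sketched as a small decision function. This is only an illustration that assumes both inputs are already known; the names `MonitoredEnv`, `isContainerized`, `isAgentIntegration`, `resolveResourceScope`, and `resourceSectionTitle` are hypothetical and not part of Kibana. The titles are the ones proposed for the resource usage section.

```typescript
// Hypothetical sketch of the rules above; none of these names exist in Kibana.
type ResourceScope = 'apm-server' | 'elastic-agent-group';

interface MonitoredEnv {
  // Assumed to be known already, e.g. from a config flag or from the monitoring data.
  isContainerized: boolean;
  isAgentIntegration: boolean;
}

function resolveResourceScope(env: MonitoredEnv): ResourceScope {
  // Only a containerized Elastic Agent integration has limits set for the whole group.
  return env.isContainerized && env.isAgentIntegration ? 'elastic-agent-group' : 'apm-server';
}

function resourceSectionTitle(env: MonitoredEnv): string {
  return resolveResourceScope(env) === 'elastic-agent-group'
    ? 'Elastic Agent Group - Resource Usage'
    : 'APM Server - Resource Usage';
}
```

Keeping the decision in one place means the overview page and the instance page can share it, so the title and the values shown cannot drift apart.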

For detecting whether or not cgroup values should be used, @chrisronline mentioned that other apps set a flag in the Kibana config options. We could do something similar for APM. I am wondering how this works with a dedicated monitoring cluster that receives data from multiple other clusters, some of which run inside containers and some directly on a host system?
For the Elastic Agent detection, let's follow an approach similar to the one used for the cgroup/container decision.

APM server instances (no changes required)

No changes are required.

APM server instance xyz

The same changes should be made as for the APM Server overview page (move the system resource usage up into a dedicated section and conditionally change the title).

Timeline

7.13: APM Server integration with Elastic Agent (beta)
7.14: APM Server integration with Elastic Agent (GA)

It would be great to get the changes in for 7.13.

@cyrille-leclerc could you review the proposed changes, with a focus on the terminology used and the design changes involved?
cc @ruflin and @elastic/apm-server
cc @jasonrhodes @ravikesarwani @chrisronline

chrisronline · 2021-02-03T15:56:25Z

> The suggested change is to keep this overview focused on APM Server and also show the Processed Events and Last Events for the concrete APM Server instance.

What if there is more than one APM server? Wouldn't this show the same data as the overview panel?

simitt · 2021-02-03T16:06:46Z

I believe it would show the same data if there is only one instance, but not if there are multiple: I expect the Overview to show the aggregated values, while the single instance view shows only its own data.

IMO showing the same values for an aggregated and a per-node view when there is only one node sounds fine.

chrisronline · 2021-02-03T18:42:19Z

I'm not sure I fully understand.

If there is a single APM server, the panels will show the same thing, as the aggregated and single instance values will be the same.

If there are multiple APM servers, the Overview panel will show the aggregated values, but what does the second panel show? How do I pick one instance to show?

Like:

cyrille-leclerc · 2021-02-04T22:00:50Z

@simitt FYI I met with @katrin-freihofner and we are heading toward a style where we use "gravity" on dashboards containing observability data coming from multiple layers (application layer above the infrastructure layer).

In the context of Elastic Agent monitoring, it probably means that we want the system graphs (e.g. sysload) at the bottom of the screen and the graphs related to the internals of the shippers (e.g. processing queue...) at the top.
I'm wondering if your exploration here proposes the alternative model of showing the infrastructure graphs at the top.

simitt · 2021-02-05T06:41:24Z

@cyrille-leclerc and @katrin-freihofner thanks for looking at this. The motivation for moving the system metrics to the top is that they are the most significant metrics indicating when the deployment should be scaled, followed by the number of requests, response statuses, and processed events. If that doesn't fit the general design direction, it's fine ofc to keep the order as is.
The most important part is that the system metrics should be grouped separately because, depending on the deployment type, they might belong to the whole Elastic Agent group and not only to the APM Server.

@chrisronline you are right, sorry for the confusion. What do you think about also conditionally showing Elastic Agent Group or APM Server here and then keep showing the memory usage? This would require the same conditional logic to decide on the title and the values used as mentioned for the detail page.

chrisronline · 2021-02-08T14:08:23Z

@simitt

> What do you think about also conditionally showing Elastic Agent Group or APM Server here and then keep showing the memory usage

Sure, I can do that

simitt · 2021-02-09T15:21:48Z

update: after chatting with @cyrille-leclerc I added a Problems to Solve section to the description, to make it clearer why it is important to keep the Stack Monitoring UI working.

jasonrhodes · 2021-02-22T13:54:08Z

@simitt what does [user, elastic] mean in that Problems to Solve section?

simitt · 2021-02-22T15:09:31Z

@jasonrhodes nothing really in this context, I updated the description and removed it.

cyrille-leclerc · 2021-02-23T13:33:35Z

> @simitt what does [user, elastic] mean in that Problems to Solve section?

@simitt I assumed you were identifying the persona who needs to solve the problem and I felt it was convenient :-)

igoristic · 2021-02-23T17:41:16Z

Reopening as per request: #90873 (comment)

Currently blocked by reason mentioned here: #90873 (review)

igoristic · 2021-03-15T18:07:40Z

@simitt

Is this still technically blocked by #90873 (review)?

And is it still on track for 7.13?

simitt · 2021-03-15T18:51:36Z

Yes, this is still on track for 7.13. I haven't updated the information, as there were still conversations going on around the monitoring changes (in a dedicated mail thread). Will come back to this with more information asap.

simitt · 2021-04-19T11:35:35Z

For reference: the rest of the conversation took place directly in the open PR #95129 (comment).

simitt · 2021-05-27T06:23:52Z

This has been implemented for 7.13. @igoristic and @jasonrhodes, if there are no concerns from your side, I suggest closing this issue now.

simitt added apm Feature:Stack Monitoring labels Feb 3, 2021

chrisronline mentioned this issue Feb 3, 2021

[Monitoring] Use cgroup for APM #90022

Closed

sgrodzicki added this to the Stack Monitoring UI 7.12 milestone Feb 8, 2021

simitt mentioned this issue Feb 9, 2021

[meta] APM Server managed by Elastic Agent with Fleet (GA) elastic/apm-server#4636

Closed

16 tasks

This was referenced Feb 9, 2021

[Monitoring] Added cgroup option for APM cpu usage #90873

Merged

[monitoring] Document all elasticsearch settings #18448

Closed

igoristic closed this as completed in #90873 Feb 23, 2021

igoristic reopened this Feb 23, 2021

igoristic modified the milestones: Stack Monitoring UI 7.12, Stack Monitoring UI 7.13 Feb 23, 2021

igoristic mentioned this issue Mar 23, 2021

[Monitoring] Added ability to possibly distinguish between Agent type metrics in APM #95129

Merged

sgrodzicki assigned igoristic Mar 29, 2021

jasonrhodes mentioned this issue Apr 21, 2021

[Stack Monitoring] Adjust APM server headings when we detect a container + agent env #97879

Closed

jasonrhodes closed this as completed May 27, 2021
