[APM][Stack Monitoring] Changes for integrating APM with Elastic Agent #90157

Closed
simitt opened this issue Feb 3, 2021 · 15 comments

Comments

@simitt
Contributor

simitt commented Feb 3, 2021

Motivation and Overview

Integrating APM Server with Elastic Agent has some impact on collected metrics. Continuing to provide users with useful insights into running deployments requires some changes to the APM Stack Monitoring UI. The focus will stay on APM Server specific metrics where an isolated view on APM Server makes sense (processed events, number of requests, etc.), and on Elastic Agent aggregated metrics otherwise (system metrics when running inside a container). The system related metrics are the most important metrics for scaling decisions, so showing them for the overall group seems the most useful when running inside a container.
There is an existing issue to switch to using cgroups data for system metrics, #79050 (planned for 7.12). Container resource limits are reflected in the cgroup data, giving better insights into how much of the actually available resources are used. When running inside a container and as an Elastic Agent integration, potential resource limits will be set for the whole group (Elastic Agent + sub processes). To be clear about the semantics of the system resource metrics, using correct and precise terminology is important.
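
To illustrate why the cgroup limit matters for the resource graphs, here is a minimal sketch of computing memory utilization against the resources that are actually available. The `MemorySample` shape and its field names are hypothetical, not the Stack Monitoring schema, and treating a limit above the host total as "no limit" is an assumption of this sketch.

```typescript
// Hypothetical sample shape; field names are illustrative only.
interface MemorySample {
  usageBytes: number;        // memory used by the process (or process group)
  hostTotalBytes: number;    // total memory of the host
  cgroupLimitBytes?: number; // container limit, if one is set
}

// Returns usage as a fraction of the memory that is actually available.
// Assumption: a cgroup limit larger than the host total means "unlimited",
// so the host total is used as the denominator in that case.
function memoryUtilization(sample: MemorySample): number {
  const available = Math.min(
    sample.cgroupLimitBytes ?? Infinity,
    sample.hostTotalBytes
  );
  if (available <= 0) {
    return 0;
  }
  return sample.usageBytes / available;
}
```

With a 1024 byte limit on a 4096 byte host, 512 bytes of usage reads as 50% instead of 12.5%, which is the difference the cgroup data is meant to surface.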

Adding other Elastic Agent or integration specific information to the Stack Monitoring UI is not in scope for this issue, and is not generally planned. For more details on Elastic Agent related visualisations, refer to kibana#81872.

Problems to Solve with the Stack Monitoring UI

  • When to scale up APM Server, or in the future Elastic Agent?
    • system resource usage: CPU, memory
    • Response Errors Intake - 503 Queue is Full
  • When to change the internal memory queue settings
    • system resource usage: CPU, memory - not using 100% CPU, while seeing 503 Queue is Full Response Errors Intake
  • Identify potential issues between APM Agents and Server, some examples:
    • Response Errors Intake - Validate (e.g. version incompatibility)
    • Response Errors Intake - Unauthorized (invalid secret_token or API Key configured in APM Agent)
    • Response Errors Intake - Too large (events are larger than usually allowed, users can customize configuration)
    • Response Errors Intake - Rate limit (more RUM requests than expected, rate limiting settings can be customized by users)
  • Identify potential APM Server internal issues (e.g. invalid events filling up the queue)
    • Output Events Rate, Output Failed Events Rate, Processed Events
  • Identify potential issues with agent remote configuration (APM Server querying information from Kibana)
    • Response Count Agent Config Management, Response Errors Agent Config Management
  • Troubleshoot APM Server and APM Agents even when the Observability Cluster is severely damaged
    • Use a monitoring system different from the Elastic Observability cluster to monitor this Elastic Observability cluster
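
The first two questions above (scale up vs. change queue settings) can be read as a small decision rule. The following is only a sketch of that reading: the `IntakeSnapshot` shape, the function name, and the 0.9 CPU threshold are assumptions made for illustration, and memory usage is left out for brevity.

```typescript
type Recommendation = 'scale-up' | 'tune-queue' | 'none';

// Hypothetical snapshot of the metrics named in the list above.
interface IntakeSnapshot {
  cpuUtilization: number;  // fraction between 0 and 1
  queueFullErrors: number; // count of "503 Queue is Full" responses in the window
}

// Assumed threshold for "CPU is saturated"; not taken from the issue.
const CPU_SATURATED = 0.9;

function recommend(snapshot: IntakeSnapshot): Recommendation {
  if (snapshot.cpuUtilization >= CPU_SATURATED) {
    // Resources are exhausted: more capacity is needed.
    return 'scale-up';
  }
  if (snapshot.queueFullErrors > 0) {
    // Queue is full although CPU has headroom: queue settings are the suspect.
    return 'tune-queue';
  }
  return 'none';
}
```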

Changes mostly concern renaming and moving around components, but also involve some conditional logic for deciding on the right terminology and metrics to show.

Break up per View

Cluster Listing (no changes required)

No changes are required for the Cluster Listing.
Screenshot 2021-02-03 at 10 57 48

Cluster Overview

This overview is designed to act as a high level health indicator for the APM Server instances. Currently it shows Processed Events and Last Events for the APM Server overview (all instances combined) and Memory Usage for a concrete APM Server instance.

When running as Elastic Agent sub process, the system resources might be shared with other Agent sub processes. Showing the Memory Usage of APM Server would still be possible, but seems less important. The suggested change is to keep this overview focused on APM Server and also show the Processed Events and Last Events for the concrete APM Server instance. See mock up below.
Screenshot 2021-01-29 at 21 43 19

APM server overview

  • Move resource related metrics (CPU, memory, load) up in the page into a dedicated section (between Alerts and Response Count metrics)
Screenshot 2021-02-03 at 12 01 39
  • Show APM Server - Resource Usage or Elastic Agent Group - Resource Usage in the title of the resource usage section, based on the logic described below
Screenshot 2021-02-03 at 12 04 25
  • Show the rest of the metrics in a dedicated section with the header APM Server - Custom Metrics
Screenshot 2021-02-03 at 12 08 07
  • Nice to have: if available, calculate the relative resource usage for CPU and memory per process and show it inside the CPU and Memory graphs. The data would be distinguishable by beat.type (e.g. fleet-server, apm, ..).
    If this can be added to the Stack Monitoring UI, it requires some small additional changes to the metrics collection, so it would be good to know whether this will be planned or not.
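
As a rough sketch of the nice-to-have item above, relative CPU usage per process could be derived by grouping samples on `beat.type` and dividing by the group total. The `ProcessSample` shape and the function name are hypothetical; only the idea of distinguishing by `beat.type` comes from this issue.

```typescript
// Hypothetical per-process sample; `beatType` stands in for the beat.type field.
interface ProcessSample {
  beatType: string; // e.g. "apm", "fleet-server"
  cpu: number;      // CPU usage in any consistent unit
}

// Returns each beat type's share (0..1) of the group's total CPU usage.
function cpuShareByBeatType(samples: ProcessSample[]): Record<string, number> {
  const totals: Record<string, number> = {};
  let groupTotal = 0;
  for (const sample of samples) {
    totals[sample.beatType] = (totals[sample.beatType] ?? 0) + sample.cpu;
    groupTotal += sample.cpu;
  }
  const shares: Record<string, number> = {};
  for (const beatType of Object.keys(totals)) {
    // Avoid dividing by zero when the whole group is idle.
    shares[beatType] = groupTotal > 0 ? totals[beatType] / groupTotal : 0;
  }
  return shares;
}
```

The same grouping would apply to memory by swapping the summed field.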

Conditional Logic to distinguish between apm-server and elastic-agent-group:

  • not running inside a container -> apm-server
  • running inside a container but not detecting Elastic Agent integration -> apm-server
  • running inside a container and detecting Elastic Agent integration -> elastic-agent-group
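
The three rules above reduce to a single condition. A minimal sketch, assuming both facts are already available as booleans (how they are detected is discussed below and is not settled here); the type and function names are hypothetical:

```typescript
type ResourceScope = 'apm-server' | 'elastic-agent-group';

// Hypothetical inputs; how these are detected is outside this sketch.
interface DeploymentInfo {
  inContainer: boolean;
  elasticAgentIntegration: boolean;
}

// Only a containerized Elastic Agent integration reports group-level resources.
function resourceScope(info: DeploymentInfo): ResourceScope {
  return info.inContainer && info.elasticAgentIntegration
    ? 'elastic-agent-group'
    : 'apm-server';
}

// Title of the resource usage section, using the wording proposed in this issue.
function resourceSectionTitle(info: DeploymentInfo): string {
  return resourceScope(info) === 'elastic-agent-group'
    ? 'Elastic Agent Group - Resource Usage'
    : 'APM Server - Resource Usage';
}
```

Note that, following the first rule, a deployment outside a container maps to apm-server regardless of whether an Elastic Agent integration is detected.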

For the detection of whether or not cgroup values should be used, @chrisronline mentioned that other apps set a flag in the Kibana config options. We could do something similar for APM. I am wondering how this works when using a dedicated monitoring cluster to which data from multiple other clusters is shipped, where some of those clusters run inside containers and others directly on a host system?
For the Elastic Agent detection let's follow a similar approach as for the cgroup/container decision.

APM server instances (no changes required)

No changes are required.
Screenshot 2021-02-03 at 12 22 50

APM server instance xyz

The same changes should be made as for the APM Server overview page (moving system resource usage up and into a dedicated section, conditionally changing the title).

Timeline

7.13: APM Server integration with Elastic Agent (beta)
7.14: APM Server integration with Elastic Agent (GA)

It would be great to get the changes in for 7.13.

@cyrille-leclerc could you review the proposed changes, and also have a focus on the used terminology and involved design changes.
cc @ruflin and @elastic/apm-server
cc @jasonrhodes @ravikesarwani @chrisronline

@chrisronline
Contributor

The suggested change is to keep this overview focused on APM Server and also show the Processed Events and Last Events for the concrete APM Server instance.

What if there is more than one APM server? Wouldn't this show the same data as the overview panel?

@simitt
Contributor Author

simitt commented Feb 3, 2021

I believe it would show the same data if there is only one instance - but not if there are multiple, as I expect the Overview to show the aggregated values, while the single instance to only show its data.

IMO showing the same values for an aggregated and a per node view if there is only one node sounds fine.

@chrisronline
Contributor

I'm not sure I fully understand.

If there is a single APM server, the panels will show the same thing as the aggregated and single instance values will be the same

If there are multiple APM servers, the Overview panel will show the aggregated values, but what does the second panel show? How do I pick one instance to show?

Like:

Screen Shot 2021-02-03 at 1 39 50 PM

@cyrille-leclerc
Contributor

cyrille-leclerc commented Feb 4, 2021

@simitt FYI I met with @katrin-freihofner and we are heading toward a style where we use "gravity" on dashboards containing observability data coming from multiple layers (application layer above the infrastructure layer).

In the context of Elastic Agent monitoring, it probably means that we want the system graphs (e.g. sysload) at the bottom of the screen and the graphs related to the internals of the shippers (e.g. processing queue...) at the top.
I'm wondering if your exploration here proposes the alternative model to show the infrastructure graphs at the top.

@simitt
Contributor Author

simitt commented Feb 5, 2021

@cyrille-leclerc and @katrin-freihofner thanks for looking at this. The motivation for moving the system metrics to the top is that they are the most significant metrics indicating when the deployment should be scaled, followed by the number of requests, response statuses, and processed events. If that doesn't fit the general design direction, it's fine of course to keep the order as is.
The most important part is that the system metrics should be grouped separately as, depending on the deployment type, they might belong to the whole Elastic Agent group and not only the APM Server.

@chrisronline you are right, sorry for the confusion. What do you think about also conditionally showing Elastic Agent Group or APM Server here and then keep showing the memory usage? This would require the same conditional logic to decide on the title and the values used, as mentioned for the detail page.
Screenshot 2021-01-29 at 10 48 39

@chrisronline
Contributor

@simitt

What do you think about also conditionally showing Elastic Agent Group or APM Server here and then keep showing the memory usage

Sure, I can do that

@sgrodzicki added this to the Stack Monitoring UI 7.12 milestone Feb 8, 2021
@simitt
Contributor Author

simitt commented Feb 9, 2021

update: after chatting with @cyrille-leclerc I added a Problems to solve section in the description, to make it clearer why it is important to keep the Stack Monitoring UI working.

@jasonrhodes
Member

@simitt what does [user, elastic] mean in that Problems to Solve section?

@simitt
Contributor Author

simitt commented Feb 22, 2021

@jasonrhodes nothing really in this context, I updated the description and removed it.

@cyrille-leclerc
Contributor

cyrille-leclerc commented Feb 23, 2021

@simitt what does [user, elastic] mean in that Problems to Solve section?

@simitt I assumed you were identifying the persona who needs to solve the problem and I felt it was convenient :-)

@igoristic
Contributor

igoristic commented Feb 23, 2021

Reopening as per request: #90873 (comment)

Currently blocked by reason mentioned here: #90873 (review)

@igoristic
Contributor

@simitt

Is this still technically blocked by #90873 (review)?

And is it still on track for 7.13?

@simitt
Contributor Author

simitt commented Mar 15, 2021

Yes, this is still on track for 7.13. I haven't updated the information, as there were still conversations going on around the monitoring changes (in a dedicated mail thread). Will come back to this with more information asap.

@simitt
Contributor Author

simitt commented Apr 19, 2021

for reference - the rest of the conversation took place directly in the open PR #95129 (comment).

@simitt
Contributor Author

simitt commented May 27, 2021

This has been implemented for 7.13. @igoristic and @jasonrhodes, if there are no concerns from your side, I suggest closing this issue now.

6 participants