Stack monitoring isn't capturing APM Server metrics #7139

Closed
marclop opened this issue Jan 26, 2022 · 14 comments · Fixed by #7143

Comments

@marclop
Contributor

marclop commented Jan 26, 2022

APM Server version (apm-server version):

8.0.0-rc2

Description of the problem including expected versus actual behavior:

When enabling stack monitoring, most of the apm-server metrics aren't displayed in the UI. All the metrics appear to be 0.

[screenshots: Stack Monitoring UI showing all APM Server metrics at 0]

For reference, 10000 traces have been ingested:

[screenshot: 10000 ingested trace documents]

Steps to reproduce:

  1. Create a deployment in ESS using 8.0.0-rc2
  2. Enable stack monitoring through the UI in: Deployment > Logs and metrics > Enable (the same deployment can be used)
  3. Ingest some data into the APM Server by running the Go program below a few times (export the ELASTIC_APM_SERVER_URL and ELASTIC_APM_SECRET_TOKEN environment variables first; a sample invocation is shown after the program):
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"go.elastic.co/apm"
)

func main() {
	version := "undefined"
	if len(os.Args) > 1 {
		version = os.Args[1]
	}
	name := fmt.Sprintf("apm-server-%s", version)
	// apm.DefaultTracer.SetSampler(apm.NewRatioSampler(0.0))

	for i := 0; i < 1000; i++ {
		tx := apm.DefaultTracer.StartTransaction(name, "type")
		ctx := apm.ContextWithTransaction(context.Background(), tx)
		span, ctx := apm.StartSpan(ctx, name, "type")
		span.Duration = time.Second
		span.End()
		tx.Duration = 2 * time.Second
		tx.End()
		<-time.After(time.Millisecond)
	}
	<-time.After(time.Second)
	apm.DefaultTracer.Flush(nil)
	fmt.Printf("%s: %+v\n", name, apm.DefaultTracer.Stats())
}
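
For reference, a sample invocation could look like this (the URL and token values are placeholders for your deployment's values):

$ export ELASTIC_APM_SERVER_URL=https://<your-apm-server-host>:443
$ export ELASTIC_APM_SECRET_TOKEN=<your-secret-token>
$ go run main.go 8.0.0-rc2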

Provide logs (if relevant):

@simitt
Contributor

simitt commented Jan 26, 2022

cc @jasonrhodes and @ruflin
We are not yet clear on why Stack Monitoring is broken again; this has been successfully tested with previous snapshots. Any help in figuring out the root cause would be much appreciated.

@droberts195

Could it be the same problem as elastic/beats#29880? There is a PR open for that: elastic/beats#30018

@kvch

kvch commented Jan 26, 2022

@droberts195 To me it seems like a different issue. The documents are in the index, but for some reason the visualizations are not working. If I understood correctly, the Elasticsearch module issue is that the documents are not indexed at all.

@ruflin
Contributor

ruflin commented Jan 26, 2022

My understanding is that this is the data coming from Metricbeat inside the Elastic Agent container, but NOT run by Elastic Agent. This data is shipped to the .monitoring-* indices to be consumed by Stack Monitoring. I'm stating this to make sure we are all on the same page. Whether data is in the traces data stream only matters to confirm there was some load on the server, so the metrics should not be zero.

The issues linked above are about logs, not metrics, so they should not be related.

@marclop Have you looked at the data in the .monitoring-* indices? Can you share by chance some example docs? Are the values all just 0? Let's figure out whether the queries are wrong or the data is empty; this should bring us to the next step quickly.
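
For example, something like the following could be used to pull a recent example doc (the index pattern and field name here are assumptions and may need adjusting):

GET .monitoring-beats-*/_search
{
  "size": 1,
  "query": {
    "term": { "beats_stats.beat.type": "apm-server" }
  }
}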

@ph @cmacknz We need to resolve the above quickly. I'm not aware of any recent change in Cloud or Beats which should cause this.

@marclop
Contributor Author

marclop commented Jan 26, 2022

I've reproduced this on an ECE environment and manually checked what the APM Server /stats endpoint publishes after indexing a few events; the metrics in the registry do indeed appear to be 0:

$ curl --unix-socket /app/elastic-agent/data/tmp/es-containerhost/apm-server/apm-server.sock http:/localhost/stats
    "libbeat": {
        "output": {
            "events": {
                "acked": 0,
                "active": 0,
                "batches": 0,
                "dropped": 0,
                "duplicates": 0,
                "failed": 0,
                "toomany": 0,
                "total": 0
            },
            "read": {
                "bytes": 0,
                "errors": 0
            },
            "type": "elasticsearch",
            "write": {
                "bytes": 0,
                "errors": 0
            }
        },
...

This seems to indicate that the issue is not on the collection side, but rather within the APM Server.

I tried to reproduce this locally using the APM Server in standalone mode and it seemed to work just fine; after indexing a few thousand events I was able to get the metrics from the /stats endpoint:

$ curl -s http://localhost:5066/stats | jq '.libbeat.output'
{
  "events": {
    "acked": 3326,
    "active": 0,
    "batches": 3,
    "failed": 1,
    "toomany": 0,
    "total": 3327
  },
  "read": {
    "bytes": 0,
    "errors": 0
  },
  "type": "elasticsearch",
  "write": {
    "bytes": 0,
    "errors": 0
  }
}

I then tried to reproduce it running under the Elastic Agent, and it is easily reproducible when the APM Server is running in managed mode. From 8.0 onwards we no longer rely on libbeat, and instead inject the new output metrics when we construct a new instance of the modelindexer, as you can see below.

apm-server/beater/beater.go

Lines 686 to 704 in 535f709

// Remove libbeat output counters, and install our own callback which uses the modelindexer stats.
monitoring.Default.Remove("libbeat.output.events")
monitoring.NewFunc(monitoring.Default, "libbeat.output.events", func(_ monitoring.Mode, v monitoring.Visitor) {
	v.OnRegistryStart()
	defer v.OnRegistryFinished()
	stats := indexer.Stats()
	v.OnKey("acked")
	v.OnInt(stats.Indexed)
	v.OnKey("active")
	v.OnInt(stats.Active)
	v.OnKey("batches")
	v.OnInt(stats.BulkRequests)
	v.OnKey("failed")
	v.OnInt(stats.Failed)
	v.OnKey("toomany")
	v.OnInt(stats.TooManyRequests)
	v.OnKey("total")
	v.OnInt(stats.Added)
})

I'm not intimately familiar with how the registries work, or with how reloading the APM Server the way we do may affect the registry reference we use to register the modelindexer stats, but it may be that we're losing the reference at some point. The libbeat code seems not to have changed at all, so I'm puzzled as to how running under the Elastic Agent causes the metrics to not be updated.

EDIT: Oddly enough, I've injected a locally built binary from the latest 8.0 branch commit into the Elastic Agent container and the metrics started flowing into the /stats endpoint; here's a small guide: https://gist.github.com/marclop/7b8e63197f383e516ba2d1bd5e2c6c41.

@stuartnelson3
Contributor

What @marclop references in the previous message concerns only the "Output Events Rate" graph; I believe the other graphs always render 0 because of incorrect aliases, which requires changes external to the apm-server.

The requests made from Kibana to get stack monitoring data are available here, under "field": https://github.com/elastic/kibana/blob/main/x-pack/plugins/monitoring/server/lib/metrics/apm/metrics.js#L164

For example, field: 'beats_stats.metrics.apm-server.server.request.count'.

The problem is the alias to the real data is missing the metrics nesting: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/resources/monitoring-beats-mb.json#L1043-L1049

Kibana makes a request for beats_stats.metrics.apm-server.server.request.count, but the alias that we have is actually for beats_stats.apm-server.server.request.count (note the missing metrics).

"Output Events Rate" renders properly when it has data because it's actually nested under metrics: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/resources/monitoring-beats-mb.json#L1513

I think if the whole apm-server section were nested under metrics, it would start working.
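
Concretely, that would presumably mean moving the apm-server alias definitions one level down so they sit under beats_stats.metrics. A minimal sketch for a single field (the alias path shown here is illustrative; the real target is whatever field the Metricbeat beat module writes):

"beats_stats": {
  "properties": {
    "metrics": {
      "properties": {
        "apm-server": {
          "properties": {
            "server": {
              "properties": {
                "request": {
                  "properties": {
                    "count": {
                      "type": "alias",
                      "path": "beat.stats.apm-server.server.request.count"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}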

@klacabane @probakowski can you confirm if I'm reading and understanding the problem correctly?

@klacabane

klacabane commented Jan 26, 2022

@stuartnelson3 Good catch

I think if the whole apm-server section were nested under metrics, it would start working.

I'll give this a try


Updated the mappings to have the apm-server block nested under beats_stats.metrics, ran @marclop's program to generate traces, and got some data to show up in various graphs. Is there a way to generate errors to validate the related graphs?

[screenshot: Stack Monitoring APM Server graphs now showing data after the mapping change]

@stuartnelson3
Contributor

You should be able to unexport or provide a bogus ELASTIC_APM_SECRET_TOKEN to force an auth error on ingest.
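
For example (the token value is intentionally bogus):

$ export ELASTIC_APM_SECRET_TOKEN=not-a-real-token
$ go run main.go force-auth-errors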

@axw
Member

axw commented Jan 27, 2022

I believe the issues regarding libbeat metrics noted in #7139 (comment) are due to a race condition between apm-server replacing the libbeat metrics (see code in linked comment), and libbeat clearing/recreating the metrics registry when output config is reloaded (https://github.com/elastic/beats/blob/72a43be9ed23efc3dc2b371e2cadf5a7c575e429/libbeat/publisher/pipeline/module.go#L138-L146).
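
A tiny self-contained model of that ordering problem (this is deliberately not the real libbeat monitoring API; it only illustrates why the values read as zero whenever the output reload runs after apm-server has installed its callback):

package main

import "fmt"

// registry stands in for the monitoring registry: a metric name mapped to a
// function that reports the metric's current value.
type registry map[string]func() int64

func main() {
	indexed := int64(3327) // pretend the modelindexer has indexed some events

	reg := registry{}

	// apm-server startup: remove the libbeat counter and install a callback
	// that reads the modelindexer stats (mirrors the beater.go snippet above).
	reg["libbeat.output.events.acked"] = func() int64 { return indexed }
	fmt.Println("before output reload:", reg["libbeat.output.events.acked"]()) // 3327

	// libbeat output reload: the registry entry is cleared and a plain
	// zero-valued counter is recreated, silently discarding the callback.
	reg["libbeat.output.events.acked"] = func() int64 { return 0 }
	fmt.Println("after output reload:", reg["libbeat.output.events.acked"]()) // 0
}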

@matschaffer

I gave this a test today and was able to get everything except the response errors intake going.

[screenshot: Stack Monitoring graphs populated except the response errors intake graph]

I triggered the output errors by marking the indices as read-only:

PUT .ds-metrics-apm.internal-default-2022.01.31-000001/_settings
{
  "blocks.read_only": true
}

PUT .ds-traces-apm-default-2022.01.31-000001/_settings
{
  "blocks.read_only": true
}

Unfortunately I get a 403 when attempting to remove the block 😆 but I can just wipe the data instead.
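
For reference, the removal attempt that returns the 403 is presumably just resetting the same setting:

PUT .ds-traces-apm-default-2022.01.31-000001/_settings
{
  "blocks.read_only": false
}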

@matschaffer

Opened elastic/elasticsearch#83305 for the index blocking issue I ran into

@simitt
Contributor

simitt commented Jan 31, 2022

Verified that both of the issues are fixed with the BC-3 for rc2:
[screenshot: Stack Monitoring graphs showing data on the rc2 BC-3 build]

@matschaffer

matschaffer commented Feb 9, 2022

I think this might still be broken for the instance page. In testing 8.0.0-da04e908 today I noticed an output spike in the overview:

[screenshot: overview page showing an output events spike]

That seems to be missing on the instance graph:

[screenshot: instance page graph with no corresponding spike]

@matschaffer

The test script gets the "Active" rate to go up, but not the acked rate for instance graphs.

[screenshot: instance graphs showing the "Active" rate rising but not the acked rate]

The overview graph is a little odd. Total seems to trend higher than Active, even though all I did was run the test app, and "Acked" is just a constant 50-ish/s.

[screenshot: overview graph with Total trending above Active and Acked constant at roughly 50/s]

I'll open up a new issue just for the above oddities.
