Stack monitoring isn't capturing APM Server metrics #7139

Closed
marclop opened this issue Jan 26, 2022 · 14 comments · Fixed by #7143

Comments

@marclop
Contributor

marclop commented Jan 26, 2022

APM Server version (apm-server version):

8.0.0-rc2

Description of the problem including expected versus actual behavior:

When enabling stack monitoring, most of the apm-server metrics aren't displayed in the UI. All the metrics appear to be 0.

[screenshots: Stack Monitoring UI showing all APM Server metrics at 0]

For reference, 10000 traces have been ingested:

[screenshot: 10000 ingested trace documents]

Steps to reproduce:

  1. Create a deployment in ESS using 8.0.0-rc2
  2. Enable stack monitoring through the UI in: Deployment > Logs and metrics > Enable (the same deployment can be used)
  3. Ingest some data into the APM Server by running the Go program below a few times (export the ELASTIC_APM_SERVER_URL and ELASTIC_APM_SECRET_TOKEN environment variables first; a sample invocation is shown after the program):
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"go.elastic.co/apm"
)

func main() {
	version := "undefined"
	if len(os.Args) > 1 {
		version = os.Args[1]
	}
	name := fmt.Sprintf("apm-server-%s", version)
	// apm.DefaultTracer.SetSampler(apm.NewRatioSampler(0.0))

	for i := 0; i < 1000; i++ {
		tx := apm.DefaultTracer.StartTransaction(name, "type")
		ctx := apm.ContextWithTransaction(context.Background(), tx)
		span, ctx := apm.StartSpan(ctx, name, "type")
		span.Duration = time.Second
		span.End()
		tx.Duration = 2 * time.Second
		tx.End()
		<-time.After(time.Millisecond)
	}
	<-time.After(time.Second)
	apm.DefaultTracer.Flush(nil)
	fmt.Printf("%s: %+v\n", name, apm.DefaultTracer.Stats())
}
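
For reference, a sample invocation could look like this (the URL and token values are placeholders for your deployment's values):

$ export ELASTIC_APM_SERVER_URL=https://<your-apm-server-host>:443
$ export ELASTIC_APM_SECRET_TOKEN=<your-secret-token>
$ go run main.go 8.0.0-rc2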

Provide logs (if relevant):

@simitt
Contributor

simitt commented Jan 26, 2022

cc @jasonrhodes and @ruflin
We are not yet clear on why Stack Monitoring is broken again; this has been successfully tested with previous snapshots. Any help in figuring out the root cause would be much appreciated.

@droberts195

Could it be the same problem as elastic/beats#29880? There is a PR open for that: elastic/beats#30018

@kvch

kvch commented Jan 26, 2022

@droberts195 To me it seems like a different issue. The documents are in the index, but for some reason the visualizations are not working. If I understood correctly, the Elasticsearch module issue is that the documents are not indexed at all.

@ruflin
Contributor

ruflin commented Jan 26, 2022

My understanding is that this is the data coming from Metricbeat inside the Elastic Agent container, but NOT run by Elastic Agent. This data is shipped to the .monitoring-* indices to be consumed by Stack Monitoring. I'm stating this to make sure we are all on the same page. Whether data is in the traces data stream only matters to confirm there was some load on the server, so the metrics should not be zero.

The issues linked above are about logs, not metrics, so they should not be related.

@marclop Have you looked at the data in the .monitoring-* indices? Can you share by chance some example docs? Are the values all just 0? Let's figure out whether the queries are wrong or the data is empty; this should bring us to the next step quickly.
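
For example, something like the following could be used to pull a recent example doc (the index pattern and field name here are assumptions and may need adjusting):

GET .monitoring-beats-*/_search
{
  "size": 1,
  "query": {
    "term": { "beats_stats.beat.type": "apm-server" }
  }
}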

@ph @cmacknz We need to resolve the above quickly. I'm not aware of any recent change in Cloud or Beats which should cause this.

@marclop
Contributor Author

marclop commented Jan 26, 2022

I've reproduced this on an ECE environment and manually checked what the APM Server /stats endpoint publishes after indexing a few events; the metrics in the registry do indeed appear to be 0:

$ curl --unix-socket /app/elastic-agent/data/tmp/es-containerhost/apm-server/apm-server.sock http:/localhost/stats
    "libbeat": {
        "output": {
            "events": {
                "acked": 0,
                "active": 0,
                "batches": 0,
                "dropped": 0,
                "duplicates": 0,
                "failed": 0,
                "toomany": 0,
                "total": 0
            },
            "read": {
                "bytes": 0,
                "errors": 0
            },
            "type": "elasticsearch",
            "write": {
                "bytes": 0,
                "errors": 0
            }
        },
...

This seems to indicate that the issue is not on the collection side, but rather within the APM Server.

I tried to reproduce this locally using the APM Server in standalone mode and it seemed to work just fine; after indexing a few thousand events I was able to get the metrics from the /stats endpoint:

$ curl -s http://localhost:5066/stats | jq '.libbeat.output'
{
  "events": {
    "acked": 3326,
    "active": 0,
    "batches": 3,
    "failed": 1,
    "toomany": 0,
    "total": 3327
  },
  "read": {
    "bytes": 0,
    "errors": 0
  },
  "type": "elasticsearch",
  "write": {
    "bytes": 0,
    "errors": 0
  }
}

I then tried to reproduce it running under the Elastic Agent, and it is easily reproducible when the APM Server is running in managed mode. From 8.0 onwards we no longer rely on libbeat, and instead inject the new output metrics when we construct a new instance of the modelindexer, as you can see below.

apm-server/beater/beater.go

Lines 686 to 704 in 535f709

// Remove libbeat output counters, and install our own callback which uses the modelindexer stats.
monitoring.Default.Remove("libbeat.output.events")
monitoring.NewFunc(monitoring.Default, "libbeat.output.events", func(_ monitoring.Mode, v monitoring.Visitor) {
	v.OnRegistryStart()
	defer v.OnRegistryFinished()
	stats := indexer.Stats()
	v.OnKey("acked")
	v.OnInt(stats.Indexed)
	v.OnKey("active")
	v.OnInt(stats.Active)
	v.OnKey("batches")
	v.OnInt(stats.BulkRequests)
	v.OnKey("failed")
	v.OnInt(stats.Failed)
	v.OnKey("toomany")
	v.OnInt(stats.TooManyRequests)
	v.OnKey("total")
	v.OnInt(stats.Added)
})

I'm not intimately familiar with how the registries work, or with how reloading the APM Server the way we do may affect the registry reference we use to register the modelindexer stats, but it may be that we're losing the reference at some point. The libbeat code seems not to have changed at all, so I'm puzzled as to how running under the Elastic Agent causes the metrics to not be updated.

EDIT: Oddly enough, I've injected a locally built binary from the latest 8.0 branch commit into the Elastic Agent container and the metrics started flowing into the /stats endpoint; here's a small guide: https://gist.github.com/marclop/7b8e63197f383e516ba2d1bd5e2c6c41.

@stuartnelson3
Contributor

What @marclop references in the previous message concerns only the "Output Events Rate" graph; I believe the other graphs always render 0 because of incorrect aliases, which requires changes external to the apm-server.

The requests made from Kibana to get stack monitoring data are available here, under "field": https://github.com/elastic/kibana/blob/main/x-pack/plugins/monitoring/server/lib/metrics/apm/metrics.js#L164

For example, field: 'beats_stats.metrics.apm-server.server.request.count'.

The problem is the alias to the real data is missing the metrics nesting: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/resources/monitoring-beats-mb.json#L1043-L1049

Kibana makes a request for beats_stats.metrics.apm-server.server.request.count, but the alias that we have is actually for beats_stats.apm-server.server.request.count (note the missing metrics).

"Output Events Rate" renders properly when it has data because it's actually nested under metrics: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/resources/monitoring-beats-mb.json#L1513

I think if the whole apm-server section were nested under metrics, it would start working.
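
Concretely, that would presumably mean moving the apm-server alias definitions one level down so they sit under beats_stats.metrics. A minimal sketch for a single field (the alias path shown here is illustrative; the real target is whatever field the Metricbeat beat module writes):

"beats_stats": {
  "properties": {
    "metrics": {
      "properties": {
        "apm-server": {
          "properties": {
            "server": {
              "properties": {
                "request": {
                  "properties": {
                    "count": {
                      "type": "alias",
                      "path": "beat.stats.apm-server.server.request.count"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}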

@klacabane @probakowski can you confirm if I'm reading and understanding the problem correctly?

@klacabane

klacabane commented Jan 26, 2022

@stuartnelson3 Good catch

I think if the whole apm-server section were nested under metrics, it would start working.

I'll give this a try


Updated the mappings to have the apm-server block nested under beats_stats.metrics, ran @marclop's program to generate traces, and got some data to show up in various graphs. Is there a way to generate errors to validate the related graphs?

[screenshot: Stack Monitoring APM Server graphs now showing data after the mapping change]

@stuartnelson3
Contributor

You should be able to unexport or provide a bogus ELASTIC_APM_SECRET_TOKEN to force an auth error on ingest.
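
For example (the token value is intentionally bogus):

$ export ELASTIC_APM_SECRET_TOKEN=not-a-real-token
$ go run main.go force-auth-errors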

@axw
Member

axw commented Jan 27, 2022

I believe the issues regarding libbeat metrics noted in #7139 (comment) are due to a race condition between apm-server replacing the libbeat metrics (see code in linked comment), and libbeat clearing/recreating the metrics registry when output config is reloaded (https://github.com/elastic/beats/blob/72a43be9ed23efc3dc2b371e2cadf5a7c575e429/libbeat/publisher/pipeline/module.go#L138-L146).
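
A tiny self-contained model of that ordering problem (this is deliberately not the real libbeat monitoring API; it only illustrates why the values read as zero whenever the output reload runs after apm-server has installed its callback):

package main

import "fmt"

// registry stands in for the monitoring registry: a metric name mapped to a
// function that reports the metric's current value.
type registry map[string]func() int64

func main() {
	indexed := int64(3327) // pretend the modelindexer has indexed some events

	reg := registry{}

	// apm-server startup: remove the libbeat counter and install a callback
	// that reads the modelindexer stats (mirrors the beater.go snippet above).
	reg["libbeat.output.events.acked"] = func() int64 { return indexed }
	fmt.Println("before output reload:", reg["libbeat.output.events.acked"]()) // 3327

	// libbeat output reload: the registry entry is cleared and a plain
	// zero-valued counter is recreated, silently discarding the callback.
	reg["libbeat.output.events.acked"] = func() int64 { return 0 }
	fmt.Println("after output reload:", reg["libbeat.output.events.acked"]()) // 0
}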

@matschaffer

I gave this a test today and was able to get everything except the response errors intake going.

[screenshot: Stack Monitoring graphs populated except the response errors intake graph]

I triggered the output errors by marking the indices as read-only:

PUT .ds-metrics-apm.internal-default-2022.01.31-000001/_settings
{
  "blocks.read_only": true
}

PUT .ds-traces-apm-default-2022.01.31-000001/_settings
{
  "blocks.read_only": true
}

Unfortunately I get a 403 when attempting to remove the block 😆 but I can just wipe the data instead.
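
For reference, the removal attempt that returns the 403 is presumably just resetting the same setting:

PUT .ds-traces-apm-default-2022.01.31-000001/_settings
{
  "blocks.read_only": false
}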

@matschaffer

Opened elastic/elasticsearch#83305 for the index blocking issue I ran into

@simitt
Contributor

simitt commented Jan 31, 2022

Verified that both of the issues are fixed with the BC-3 for rc2:
[screenshot: Stack Monitoring graphs showing data on the rc2 BC-3 build]

@matschaffer

matschaffer commented Feb 9, 2022

I think this might still be broken for the instance page. In testing 8.0.0-da04e908 today I noticed an output spike in the overview:

[screenshot: overview page showing an output events spike]

That seems to be missing on the instance graph:

[screenshot: instance page graph with no corresponding spike]

@matschaffer

The test script gets the "Active" rate to go up, but not the acked rate for instance graphs.

[screenshot: instance graphs showing the "Active" rate rising but not the acked rate]

The overview graph is a little odd. Total seems to trend higher than Active, even though all I did was run the test app, and "Acked" is just a constant 50-ish/s.

[screenshot: overview graph with Total trending above Active and Acked constant at roughly 50/s]

I'll open up a new issue just for the above oddities.
