
Use staleness markers generated by prometheus, rather than making our own #5062

Merged
merged 4 commits into open-telemetry:main from the fix_staleness_markers branch on Sep 3, 2021

Conversation

dashpole
Contributor

@dashpole dashpole commented Sep 1, 2021

Description:

Commits 1 and 2 revert most of open-telemetry/opentelemetry-collector#3423 and open-telemetry/opentelemetry-collector#3414. I didn't revert the start timestamp changes because we had already used them for other purposes.

Commit 3 removes the check in our implementation of Append, where we drop metrics if they include NaN values. This effectively "enables" collection of staleness markers that had previously been disabled. I re-introduced some of the test changes required for handling staleness markers in our e2e tests from open-telemetry/opentelemetry-collector#3423.

Commit 4 reintroduces the same staleness end-to-end test used in the reverted PRs to demonstrate that this PR does, indeed, add staleness markers.

Overall, this replaces our custom implementation of staleness markers with prometheus's implementation. This results in less duplication of work (since prometheus is already tracking staleness), is more likely to stay compliant, and is less likely to have correctness problems.
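
For illustration only (not part of this PR's diff), the standalone Go snippet below shows why a blanket NaN drop also discards staleness markers: Prometheus encodes a marker as a specific NaN bit pattern, so math.IsNaN cannot distinguish it from an ordinary NaN sample, while value.IsStaleNaN from github.com/prometheus/prometheus/pkg/value can.

package main

import (
	"fmt"
	"math"

	"github.com/prometheus/prometheus/pkg/value"
)

func main() {
	// Prometheus encodes a staleness marker as a sample whose value is a specific
	// NaN bit pattern (value.StaleNaN).
	stale := math.Float64frombits(value.StaleNaN)
	ordinary := math.NaN()

	// A blanket NaN check cannot tell the marker apart from an ordinary NaN sample.
	fmt.Println(math.IsNaN(stale), math.IsNaN(ordinary)) // true true

	// value.IsStaleNaN identifies the marker specifically.
	fmt.Println(value.IsStaleNaN(stale), value.IsStaleNaN(ordinary)) // true false
}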

Link to tracking Issue:

This fixes an issue I've been debugging with GKE's use of the opentelemetry-collector. I believe it is the same underlying issue as #4907, but I haven't confirmed that.

Testing:

This passes the prometheus remote write compliance test suite.

I kept the same staleness e2e test that was previously used. I've also tested this with GKE's use of the collector: it fixed a problem there that I hadn't been able to fully root-cause. I was able to show that GKE's issue was caused by open-telemetry/opentelemetry-collector#3423, and then realized that we didn't need to implement staleness markers ourselves at all.

Documentation:

Changes apply only to internal implementation details.

cc @jmacd @Aneurysm9 @odeke-em

@dashpole dashpole marked this pull request as ready for review September 1, 2021 23:05
@dashpole dashpole requested review from a team and anuraaga September 1, 2021 23:05
@odeke-em
Member

odeke-em commented Sep 2, 2021

@dashpole thank you for working on this fix! What do you mean by Prometheus already emits staleness markers? AFAIK Prometheus wasn't issuing staleness markers and only the server handles that. A revert of this might bring us back to square one, but to alleviate the current problems, sure, it'll work.

@dashpole
Contributor Author

dashpole commented Sep 2, 2021

Verified that this passes the prometheus compliance suite:

--- PASS: TestRemoteWrite (40.04s)
    --- PASS: TestRemoteWrite/otelcollector (0.01s)
        --- PASS: TestRemoteWrite/otelcollector/Counter (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Retries500 (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Retries400 (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/RepeatedLabels (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Invalid (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/EmptyLabels (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Up (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/HonorLabels (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Summary (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Histogram (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Headers (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Ordering (17.78s)
        --- PASS: TestRemoteWrite/otelcollector/Staleness (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Timestamp (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/SortedLabels (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/InstanceLabel (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/NameLabel (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/JobLabel (10.01s)
        --- PASS: TestRemoteWrite/otelcollector/Gauge (10.01s)
PASS
ok  	github.com/prometheus/compliance/remote_write	40.097s

What do you mean by Prometheus already emits staleness markers?

By "emit", I mean that it calls Append with StaleNaN values on stale timeseries. Currently, we ignore calls to Append with NaN values, instead of sending them through the pipeline. This PR just removes that if math.IsNaN(v) { return 0, nil } check.

@odeke-em
Member

odeke-em commented Sep 2, 2021

Awesome news, thank you @dashpole, LGTM!

Member

@odeke-em odeke-em left a comment

Thank you and LGTM @dashpole!

@odeke-em
Member

odeke-em commented Sep 2, 2021

@dashpole I've crafted another end-to-end test for you that checks that multiple targets don't have staleness markers spuriously generated, as was reported in #4748, which we should also tag as fixed if this works:

package repro

import (
	"context"
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"testing"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/pkg/value"
	"github.com/prometheus/prometheus/prompb"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/processor/batchprocessor"
	"go.opentelemetry.io/collector/service"
	"go.opentelemetry.io/collector/service/parserprovider"
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"

	"github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter"
	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver"
)

// Test that staleness markers are NOT emitted for replicated services whose scrapes might alternate
// between intervals.
// The test runs the entire collector, scrapes both targets end to end, and then checks at the
// Prometheus remote-write exporter that NO staleness markers were emitted.
// See https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/4748
func TestStalenessMarkersReplicatedServicesDoNotProduceStalenessMarkers(t *testing.T) {
	if testing.Short() {
		t.Skip("This test can take a long time")
	}

	ctx, cancel := context.WithCancel(context.Background())

	createStalenessHandler := func() http.HandlerFunc {
		// 1. Setup the scrape target. Each scrape reports a strictly increasing gauge value.
		var scrapeCount int64
		return http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
			select {
			case <-ctx.Done():
				return
			default:
			}

			n := atomic.AddInt64(&scrapeCount, 1)
			fmt.Fprintf(rw, `
# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap"} %.1f`, float64(n))
		})
	}

	scrapeServer1 := httptest.NewServer(createStalenessHandler())
	defer scrapeServer1.Close()
	scrapeServer2 := httptest.NewServer(createStalenessHandler())
	defer scrapeServer2.Close()

	serverURL1, err := url.Parse(scrapeServer1.URL)
	require.Nil(t, err)
	serverURL2, err := url.Parse(scrapeServer2.URL)
	require.Nil(t, err)

	// 2. Set up the Prometheus RemoteWrite endpoint.
	prweUploads := make(chan *prompb.WriteRequest)
	prweServer := httptest.NewServer(http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
		// Snappy decode the uploads.
		payload, rerr := ioutil.ReadAll(req.Body)
		if rerr != nil {
			panic(rerr)
		}
		recv := make([]byte, len(payload))
		decoded, derr := snappy.Decode(recv, payload)
		if derr != nil {
			panic(derr)
		}

		writeReq := new(prompb.WriteRequest)
		if uerr := proto.Unmarshal(decoded, writeReq); uerr != nil {
			panic(uerr)
		}

		select {
		case <-ctx.Done():
			return
		case prweUploads <- writeReq:
		}
	}))
	defer prweServer.Close()

	// 3. Set the OpenTelemetry Prometheus receiver.
	config := fmt.Sprintf(`
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'test'
          scrape_interval: 2ms
          static_configs:
            - targets: [%q,%q]

processors:
  batch:
exporters:
  prometheusremotewrite:
    endpoint: %q
    insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]`, serverURL1.Host, serverURL2.Host, prweServer.URL)

	// 4. Run the OpenTelemetry Collector.
	receivers, err := component.MakeReceiverFactoryMap(prometheusreceiver.NewFactory())
	require.Nil(t, err)
	exporters, err := component.MakeExporterFactoryMap(prometheusremotewriteexporter.NewFactory())
	require.Nil(t, err)
	processors, err := component.MakeProcessorFactoryMap(batchprocessor.NewFactory())
	require.Nil(t, err)

	factories := component.Factories{
		Receivers:  receivers,
		Exporters:  exporters,
		Processors: processors,
	}

	appSettings := service.CollectorSettings{
		Factories:      factories,
		ParserProvider: parserprovider.NewInMemory(strings.NewReader(config)),
		BuildInfo: component.BuildInfo{
			Command:     "otelcol",
			Description: "OpenTelemetry Collector",
			Version:     "tests",
		},
		LoggingOptions: []zap.Option{
			// Turn off the verbose logging from the collector.
			zap.WrapCore(func(zapcore.Core) zapcore.Core {
				return zapcore.NewNopCore()
			}),
		},
	}
	app, err := service.New(appSettings)
	require.Nil(t, err)

	go func() {
		if err := app.Run(); err != nil {
			t.Error(err)
		}
	}()

	// Wait until the collector has actually started.
	stateChannel := app.GetStateChannel()
	for notYetStarted := true; notYetStarted; {
		switch state := <-stateChannel; state {
		case service.Running, service.Closed, service.Closing:
			notYetStarted = false
		}
	}

	// The OpenTelemetry collector has a data race on shutdown because it closes
	// a channel that can still be in use, so we deliberately skip Shutdown in this test.
	if false {
		defer app.Shutdown()
	}

	// 5. Let's wait on 10 fetches.
	var wReqL []*prompb.WriteRequest
	for i := 0; i < 10; i++ {
		wReqL = append(wReqL, <-prweUploads)
	}
	defer cancel()

	// 6. Assert that we do NOT encounter staleness markers (the special StaleNaN values) in the scraped time series.
	staleMarkerCount := 0
	totalSamples := 0
	for i, wReq := range wReqL {
		name := fmt.Sprintf("WriteRequest#%d", i)
		require.True(t, len(wReq.Timeseries) > 0, "Expecting at least 1 timeSeries for:: "+name)
		for j, ts := range wReq.Timeseries {
			fullName := fmt.Sprintf("%s/TimeSeries#%d", name, j)
			assert.True(t, len(ts.Samples) > 0, "Expected at least 1 Sample in:: "+fullName)

			// We only count series that come directly from the scrapes, excluding internal
			// time series such as "up", "scrape_seconds", etc.
			metricName := ""
			for _, label := range ts.Labels {
				if label.Name == "__name__" {
					metricName = label.Value
				}
			}
			if !strings.HasPrefix(metricName, "jvm") {
				continue
			}

			for _, sample := range ts.Samples {
				totalSamples++
				if value.IsStaleNaN(sample.Value) {
					staleMarkerCount++
				}
			}
		}
	}

	require.True(t, totalSamples > 0, "Expected at least 1 sample")
	require.False(t, staleMarkerCount > 0, fmt.Sprintf("Expected no stale markers, got: %d", staleMarkerCount))
}

@bogdandrutu
Member

@Aneurysm9 @rakyll please review

@bogdandrutu
Member

Actually already done :)

@bogdandrutu bogdandrutu merged commit 34afd06 into open-telemetry:main Sep 3, 2021
@dashpole dashpole deleted the fix_staleness_markers branch September 3, 2021 14:09
@dashpole
Contributor Author

dashpole commented Sep 3, 2021

@odeke-em feel free to open a PR with the additional test. I'll gladly provide a review.

@bogdandrutu would you accept cherry-picks of this to previous release branches (I realize I'd have to make the changes by hand in core for previous releases)? My preference would be for 0.30.0 and on (excl. 0.32, which was retracted).

@dashpole
Contributor Author

dashpole commented Sep 8, 2021

The plan for cherry-picks (discussed in the 9/8 SIG meeting) is:

  • Cherry-pick to the 0.30(1), 0.31, and 0.33 release branches in core; cherry-pick to the 0.34 release branch in contrib.
  • Update the corresponding releases in contrib for the core changes.

bogdandrutu pushed a commit that referenced this pull request Sep 9, 2021
… own (#5062) (#5170)

* Revert "receiver/prometheus: glue and complete staleness marking for disappearing metrics (#3423)"

This reverts commit 8b79380.

* Revert "receiver/prometheus: add store to track stale metrics (#3414)"

This reverts commit cdc1634.

* stop dropping staleness markers from prometheus, and fix tests

* add staleness end to end test from #3423
hex1848 pushed a commit to hex1848/opentelemetry-collector-contrib that referenced this pull request Jun 2, 2022
…pen-telemetry#5062)

* Change `UnmarshalJSON[Traces|Metrics|Logs][Request|Response]` to be a func `UnmarshalJSON` on the struct.
* Rename `[Traces|Metrics|Logs][Request|Response].Marshal` to `MarshalProto` consistent with JSON/Text standard interfaces.
* Change `UnmarshalJSON[Traces|Metrics|Logs][Request|Response]` to be a func `UnmarshalProto` on the struct.

Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>