
Stateless Ruler: Panic opening /alerts after alert state restored. Duplicate alertstate? #6060

Closed
ahurtaud opened this issue Jan 20, 2023 · 5 comments


@ahurtaud (Contributor)

Thanos, Prometheus and Golang version used:
v0.30.1

What happened:
I implemented alert state restore in the stateless ruler thanks to #5230. But after a restart, the ruler keeps the state of the firing alerts and also creates matching pending alerts.

This makes the Alerts webpage panic (see full logs in details below) because, I think, the alerts have two states at once.
[Screenshot 2023-01-20 10:27:40]

The pending alerts carry all the '--restore-ignored-label' labels on top of the matching firing alert: for example tenant_id here (and all the other labels I properly set as ignored).
[Screenshot 2023-01-20 11:04:12]

What you expected to happen:
A Go panic should not happen.
Restored alerts should not create corresponding pending alerts carrying the ignored labels.

After a stateless ruler restart, the firing alerts are (roughly) kept, but pending ones are created.
[Screenshot 2023-01-20 11:06:01]
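For context on why the extra labels produce a second pending alert: in Prometheus and Thanos, an alert's identity is its full label set, so a restored alert that still carries external labels (tenant_id, receive, ...) cannot match the ruler's own evaluation of the same rule. A minimal Go sketch of this, assuming a hypothetical `fingerprint` helper (illustrative only, not the actual Thanos code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fingerprint builds a stable identity string from an alert's label set.
// An alert's identity is its full label set, so any extra label yields a
// different fingerprint and hence a "different" alert.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	// The alert as evaluated by the stateless ruler itself.
	firing := map[string]string{"alertname": "HighLatency", "severity": "critical"}

	// The same alert restored from ALERTS_FOR_STATE, still carrying external
	// labels that --restore-ignored-label was supposed to make irrelevant.
	restored := map[string]string{
		"alertname": "HighLatency", "severity": "critical",
		"tenant_id": "argos", "receive": "true",
	}

	fmt.Println(fingerprint(firing) == fingerprint(restored)) // prints false: treated as two distinct alerts
}
```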

Full logs of the relevant components:

Logs

2023/01/20 09:06:26 http: panic serving 10.225.2.209:47096: merger not found for type:int
goroutine 1015242 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1850 +0xbf
panic({0x2184f60, 0xc00217d3c0})
	/usr/local/go/src/runtime/panic.go:890 +0x262
github.com/gogo/protobuf/proto.(*mergeInfo).computeMergeInfo(0xc006101e80)
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:662 +0xe85
github.com/gogo/protobuf/proto.(*mergeInfo).merge(0xc006101e80, {0xc0001980c0?}, {0x2145420?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:113 +0x58
github.com/gogo/protobuf/proto.(*mergeInfo).computeMergeInfo.func27({0xc005dfaec0?}, {0x3f5560?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:545 +0x165
github.com/gogo/protobuf/proto.(*mergeInfo).merge(0xc005dfaec0, {0xc000aaa310?}, {0x25bf1a0?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:139 +0x305
github.com/gogo/protobuf/proto.(*mergeInfo).computeMergeInfo.func30({0x40d95f?}, {0xc0006709e0?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:587 +0x8b
github.com/gogo/protobuf/proto.(*mergeInfo).merge(0xc005dfac00, {0xc005230108?}, {0x94ce46?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:139 +0x305
github.com/gogo/protobuf/proto.(*mergeInfo).computeMergeInfo.func30({0x40d95f?}, {0x596320?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:587 +0x8b
github.com/gogo/protobuf/proto.(*mergeInfo).merge(0xc005dfad00, {0xc00260b2c0?}, {0x1?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:139 +0x305
github.com/gogo/protobuf/proto.(*mergeInfo).computeMergeInfo.func29({0x1609a0?}, {0xc00252e9c0?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:567 +0xf2
github.com/gogo/protobuf/proto.(*mergeInfo).merge(0xc005dfac40, {0x2356140?}, {0x4?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:139 +0x305
github.com/gogo/protobuf/proto.(*InternalMessageInfo).Merge(0x40b8bd?, {0x2bee230, 0xc00183f2c0}, {0x2bee230, 0xc001c2cb40})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:50 +0xb6
github.com/thanos-io/thanos/pkg/rules/rulespb.(*Alert).XXX_Merge(0x3e42ea0?, {0x2bee230?, 0xc001c2cb40?})
	/app/pkg/rules/rulespb/rpc.pb.go:486 +0x3a
github.com/gogo/protobuf/proto.Merge({0x2bee230?, 0xc00183f2c0}, {0x2bee230?, 0xc001c2cb40})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/clone.go:95 +0x4a3
github.com/gogo/protobuf/proto.(*mergeInfo).computeMergeInfo.func32({0x40d95f?}, {0xc00062c560?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:652 +0x686
github.com/gogo/protobuf/proto.(*mergeInfo).merge(0xc005dfabc0, {0xc00217d2a0?}, {0xc0061cc5b8?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:139 +0x305
github.com/gogo/protobuf/proto.(*mergeInfo).computeMergeInfo.func29({0x25bf040?}, {0xc001cf71f0?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:567 +0xf2
github.com/gogo/protobuf/proto.(*mergeInfo).merge(0xc005dfab00, {0x25723e0?}, {0x8?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:139 +0x305
github.com/gogo/protobuf/proto.(*InternalMessageInfo).Merge(0x40b8bd?, {0x2bee2f0, 0xc000aaa2a0}, {0x2bee2f0, 0xc001cf7260})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_merge.go:50 +0xb6
github.com/thanos-io/thanos/pkg/rules/rulespb.(*RuleGroup).XXX_Merge(0x3e42ea0?, {0x2bee2f0?, 0xc001cf7260?})
	/app/pkg/rules/rulespb/rpc.pb.go:310 +0x3a
github.com/gogo/protobuf/proto.Merge({0x2bee2f0?, 0xc000aaa2a0}, {0x2bee2f0?, 0xc001cf7260})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/clone.go:95 +0x4a3
github.com/gogo/protobuf/proto.Clone({0x2bee2f0?, 0xc001cf7260?})
	/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/clone.go:52 +0x1a5
github.com/thanos-io/thanos/pkg/rules.(*Manager).Rules(0xc000cbfb60, 0xc00602be40, {0x2c024f0, 0xc0019daea0})
	/app/pkg/rules/manager.go:409 +0x219
github.com/thanos-io/thanos/pkg/rules.(*GRPCClient).Rules(0xc0001bbef0, {0x2bf4288?, 0xc0060d3bf0?}, 0xc00602be40)
	/app/pkg/rules/rules.go:60 +0x174
github.com/thanos-io/thanos/pkg/api/query.NewRulesHandler.func1.3({0x2bf4288?, 0xc0060d3bf0?})
	/app/pkg/api/query/v1.go:990 +0x58
github.com/thanos-io/thanos/pkg/tracing.DoInSpan({0x2bf4288?, 0xc0060d3b30?}, {0x26a1d36?, 0x7?}, 0xc0048b4c60, {0x0?, 0x0?, 0x7fa1c4b23a68?})
	/app/pkg/tracing/tracing.go:95 +0xa3
github.com/thanos-io/thanos/pkg/api/query.NewRulesHandler.func1(0xc0060cdb00)
	/app/pkg/api/query/v1.go:989 +0x485
github.com/thanos-io/thanos/pkg/api.GetInstr.func1.1({0x2be9a00, 0xc001cf6f50}, 0x20?)
	/app/pkg/api/api.go:211 +0x50
net/http.HandlerFunc.ServeHTTP(0xc0060c7680?, {0x2be9a00?, 0xc001cf6f50?}, 0x5?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/thanos-io/thanos/pkg/logging.(*HTTPServerMiddleware).HTTPMiddleware.func1({0x2be9a00?, 0xc001cf6f50}, 0xc0060cdb00)
	/app/pkg/logging/http.go:69 +0x3b8
net/http.HandlerFunc.ServeHTTP(0x2bf4288?, {0x2be9a00?, 0xc001cf6f50?}, 0x2bcecd8?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/thanos-io/thanos/pkg/server/http/middleware.RequestID.func1({0x2be9a00, 0xc001cf6f50}, 0xc0060cda00)
	/app/pkg/server/http/middleware/request_id.go:40 +0x542
net/http.HandlerFunc.ServeHTTP(0x2184f60?, {0x2be9a00?, 0xc001cf6f50?}, 0x4?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x2bedde0, 0xc00602be00}, 0x490001?)
	/go/pkg/mod/github.com/!n!y!times/gziphandler@v1.1.1/gzip.go:338 +0x26f
net/http.HandlerFunc.ServeHTTP(0x10?, {0x2bedde0?, 0xc00602be00?}, 0x1?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/thanos-io/thanos/pkg/extprom/http.httpInstrumentationHandler.func1({0x7fa19d434678?, 0xc006096f00}, 0xc0060cda00)
	/app/pkg/extprom/http/instrument_server.go:75 +0x10b
net/http.HandlerFunc.ServeHTTP(0x7fa19d434678?, {0x7fa19d434678?, 0xc006096f00?}, 0xc0060d3a10?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1({0x7fa19d434678?, 0xc006096eb0?}, 0xc0060cda00)
	/go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/instrument_server.go:288 +0xc5
net/http.HandlerFunc.ServeHTTP(0x7fa19d434678?, {0x7fa19d434678?, 0xc006096eb0?}, 0xc0048b5470?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0x7fa19d434678?, 0xc006096e60?}, 0xc0060cda00)
	/go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/instrument_server.go:146 +0xb8
net/http.HandlerFunc.ServeHTTP(0x22c9b80?, {0x7fa19d434678?, 0xc006096e60?}, 0x6?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/thanos-io/thanos/pkg/extprom/http.instrumentHandlerInFlight.func1({0x7fa19d434678, 0xc006096e60}, 0xc0060cda00)
	/app/pkg/extprom/http/instrument_server.go:162 +0x169
net/http.HandlerFunc.ServeHTTP(0x2bf13b0?, {0x7fa19d434678?, 0xc006096e60?}, 0xc0048b5698?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func1({0x2bf13b0?, 0xc000947180?}, 0xc0060cda00)
	/go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/instrument_server.go:238 +0xc5
net/http.HandlerFunc.ServeHTTP(0x2bf4288?, {0x2bf13b0?, 0xc000947180?}, 0x417d220?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/thanos-io/thanos/pkg/tracing.HTTPMiddleware.func1({0x2bf13b0, 0xc000947180}, 0xc0060cd900)
	/app/pkg/tracing/http.go:62 +0x9a2
github.com/prometheus/common/route.(*Router).handle.func1({0x2bf13b0, 0xc000947180}, 0xc005ca2e00, {0x0, 0x0, 0x478d4e?})
	/go/pkg/mod/github.com/prometheus/common@v0.37.1/route/route.go:83 +0x18d
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0001d1920, {0x2bf13b0, 0xc000947180}, 0xc005ca2e00)
	/go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0x81c
github.com/prometheus/common/route.(*Router).ServeHTTP(0xc0048b5af0?, {0x2bf13b0?, 0xc000947180?}, 0x0?)
	/go/pkg/mod/github.com/prometheus/common@v0.37.1/route/route.go:126 +0x26
net/http.(*ServeMux).ServeHTTP(0xc00242ea92?, {0x2bf13b0, 0xc000947180}, 0xc005ca2e00)
	/usr/local/go/src/net/http/server.go:2487 +0x149
net/http.serverHandler.ServeHTTP({0xc003d54510?}, {0x2bf13b0, 0xc000947180}, 0xc005ca2e00)
	/usr/local/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc001ea52c0, {0x2bf4288, 0xc0005a61b0})
	/usr/local/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3102 +0x4db

Anything else we need to know:
I hope I haven't made any configuration mistakes, as this is quite difficult to configure.
Below are some flags that may be interesting for the investigation:

Interesting ruler config flags

            - '--for-grace-period=10m'
            - '--for-outage-tolerance=1h'
            - '--restore-ignored-label=prometheus'
            - '--restore-ignored-label=prometheus_namespace'
            - '--restore-ignored-label=receive'
            - '--restore-ignored-label=stack'
            - '--restore-ignored-label=tenant_id'
            - '--query.default-step=1s'
            - '--label=stack="nld7"'
            - '--label=prometheus="big"'
            - '--label=replica="$(POD_NAME)"'
            - '--label=production_ready="false"'
            - '--label=prometheus_namespace="argos"'
            - '--alert.label-drop=replica'

Interesting receive config flag

            - '--label=receive_replica="$(NAME)"'
            - '--label=receive="true"'

Interesting query config flag

            - '--query.replica-label=replica'
            - '--query.replica-label=receive_replica'
            - '--query.replica-label=production_ready'

@ahurtaud (Contributor, Author)

Hmm, I found a misconfiguration where the alerts are also fired by Prometheus (and the alert state may be restored from the sidecar).
Closing this, as I will continue to investigate.

@TomHellier

@ahurtaud probably a long shot, but did you find anything from your investigations?

I'm seeing the exact same panic on my instance of Thanos Ruler.

@ahurtaud (Contributor, Author)

> @ahurtaud probably a long shot, but did you find anything from your investigations?
>
> I'm seeing the exact same panic on my instance of Thanos Ruler.

We are not seeing the Go panic anymore. If I remember correctly, it was definitely an issue where the same alerts were deployed on both the ruler and Prometheus, and the querier responsible for restoring the state via the ALERTS_FOR_STATE metric was plugged into both the Thanos Ruler and the sidecar + Prometheus. I guess this triggered some sort of "impossible" state not caught by any error. Hard to reproduce now, to be honest :/. Feel free to reopen an issue if that is not the case in your config.
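As I understand it, the intent of `--restore-ignored-label` is to drop the listed labels from the restored ALERTS_FOR_STATE series before matching them against the ruler's own alerts, so that external labels added by receive or the sidecar don't break the match. A minimal sketch of that stripping step, with a hypothetical `stripIgnored` helper (not the real Thanos function):

```go
package main

import "fmt"

// stripIgnored returns a copy of a restored alert's label set with the labels
// listed via --restore-ignored-label removed, leaving only the labels the
// ruler itself would produce. The input map is not mutated.
func stripIgnored(labels map[string]string, ignored []string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	for _, k := range ignored {
		delete(out, k)
	}
	return out
}

func main() {
	// Labels as they come back from the querier, including external labels.
	restored := map[string]string{
		"alertname": "HighLatency",
		"tenant_id": "argos",
		"receive":   "true",
	}
	cleaned := stripIgnored(restored, []string{"tenant_id", "receive"})
	fmt.Println(cleaned) // prints map[alertname:HighLatency]
}
```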

@TomHellier

Thanks for the update, @ahurtaud.

I deleted the PVC attached to the Thanos Ruler and that seems to have cleared the error, so I think something odd was going on in the persisted state.

I deleted it on Thursday and haven't had a recurrence since.

@sbeginCoveo

We've been hit by this. There is a fix for it in 0.32:
https://github.com/thanos-io/thanos/releases/tag/v0.32.0
#6189
