
Query: panic on api.Respond warn.Error() #4777

Closed
hanjm opened this issue Oct 14, 2021 · 6 comments · Fixed by #4805

Comments

@hanjm
Member

hanjm commented Oct 14, 2021

Thanos, Prometheus and Golang version used:
Thanos main-2021-10-13-f8927b9a

Object Storage Provider:

What happened:

Thanos Query emits a panic log.

Logs

2021-10-14T16:22:14.721126503+08:00 level=warn ts=2021-10-14T08:22:14.720882507Z caller=proxy.go:471 component=proxy request="min_time:1634196130000 max_time:1634199730000 matchers:**> max_resolution_window:9223372036854775807 aggregates:COUNT aggregates:SUM skip_chunks:true " err="failed to receive any data in 5s from Addr: ***:19090 LabelSets: {***} Mint: 1634162400003 Maxt: 9223372036854775807: context deadline exceeded" msg="returning partial response"
2021-10-14T16:27:48.777928363+08:00 2021/10/14 08:27:48 http: panic serving 11.167.14.3:60892: runtime error: invalid memory address or nil pointer dereference
2021-10-14T16:27:48.777963479+08:00 goroutine 1993995 [running]:
2021-10-14T16:27:48.777968207+08:00 net/http.(*conn).serve.func1(0xc0009b81e0)
2021-10-14T16:27:48.777971263+08:00 	/usr/local/go/src/net/http/server.go:1804 +0x153
2021-10-14T16:27:48.777974309+08:00 panic(0x1b80a80, 0x2f8b270)
2021-10-14T16:27:48.777978877+08:00 	/usr/local/go/src/runtime/panic.go:971 +0x499
2021-10-14T16:27:48.777981492+08:00 github.com/pkg/errors.(*fundamental).Error(0x0, 0xc002566090, 0x82)
2021-10-14T16:27:48.777986372+08:00 	/home/circleci/go/pkg/mod/github.com/pkg/errors@v0.9.1/errors.go:125 +0x5
2021-10-14T16:27:48.77799077+08:00 github.com/thanos-io/thanos/pkg/api.Respond(0x2180820, 0xc000450e00, 0x1aaa2e0, 0x2fee060, 0xc000b59000, 0x43, 0x80)
2021-10-14T16:27:48.777995919+08:00 	/home/circleci/project/pkg/api/api.go:245 +0x1d1
2021-10-14T16:27:48.777999456+08:00 github.com/thanos-io/thanos/pkg/api.GetInstr.func1.1(0x2180820, 0xc000450e00, 0xc00240e200)
2021-10-14T16:27:48.778003634+08:00 	/home/circleci/project/pkg/api/api.go:214 +0xa5
2021-10-14T16:27:48.7780069+08:00 net/http.HandlerFunc.ServeHTTP(0xc00000df68, 0x2180820, 0xc000450e00, 0xc00240e200)
2021-10-14T16:27:48.778011459+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.778014865+08:00 github.com/thanos-io/thanos/pkg/server/http/middleware.RequestID.func1(0x2180820, 0xc000450e00, 0xc00240e000)
2021-10-14T16:27:48.778018141+08:00 	/home/circleci/project/pkg/server/http/middleware/request_id.go:40 +0x20c
2021-10-14T16:27:48.778021427+08:00 net/http.HandlerFunc.ServeHTTP(0xc00000df80, 0x2180820, 0xc000450e00, 0xc00240e000)
2021-10-14T16:27:48.778023892+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.778026467+08:00 github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1(0x21842d0, 0xc001b3ccc0, 0xc00240e000)
2021-10-14T16:27:48.778029132+08:00 	/home/circleci/go/pkg/mod/github.com/!n!y!times/gziphandler@v1.1.1/gzip.go:338 +0x299
2021-10-14T16:27:48.778043438+08:00 net/http.HandlerFunc.ServeHTTP(0xc000701ec0, 0x21842d0, 0xc001b3ccc0, 0xc00240e000)
2021-10-14T16:27:48.778047024+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.77805+08:00 github.com/thanos-io/thanos/pkg/logging.(*HTTPServerMiddleware).HTTPMiddleware.func1(0x2183ca0, 0xc001b3cca0, 0xc00240e000)
2021-10-14T16:27:48.778052374+08:00 	/home/circleci/project/pkg/logging/http.go:72 +0x242
2021-10-14T16:27:48.77805525+08:00 net/http.HandlerFunc.ServeHTTP(0xc000701ef0, 0x2183ca0, 0xc001b3cca0, 0xc00240e000)
2021-10-14T16:27:48.778057564+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.77806095+08:00 github.com/thanos-io/thanos/pkg/extprom/http.(*defaultInstrumentationMiddleware).NewHandler.func1(0x7fabe6204c40, 0xc001b3b6d0, 0xc00240e000)
2021-10-14T16:27:48.778064587+08:00 	/home/circleci/project/pkg/extprom/http/instrument_server.go:108 +0x10f
2021-10-14T16:27:48.778068495+08:00 net/http.HandlerFunc.ServeHTTP(0xc000701fb0, 0x7fabe6204c40, 0xc001b3b6d0, 0xc00240e000)
2021-10-14T16:27:48.778072322+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.778074666+08:00 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1(0x7fabe6204c40, 0xc001b3b680, 0xc00240e000)
2021-10-14T16:27:48.778077161+08:00 	/home/circleci/go/pkg/mod/github.com/prometheus/client_golang@v1.11.0/prometheus/promhttp/instrument_server.go:198 +0xee
2021-10-14T16:27:48.778079425+08:00 net/http.HandlerFunc.ServeHTTP(0xc0007202d0, 0x7fabe6204c40, 0xc001b3b680, 0xc00240e000)
2021-10-14T16:27:48.778081659+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.778084945+08:00 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1(0x7fabe6204c40, 0xc001b3b630, 0xc00240e000)
2021-10-14T16:27:48.77808717+08:00 	/home/circleci/go/pkg/mod/github.com/prometheus/client_golang@v1.11.0/prometheus/promhttp/instrument_server.go:101 +0xdf
2021-10-14T16:27:48.778089324+08:00 net/http.HandlerFunc.ServeHTTP(0xc000720540, 0x7fabe6204c40, 0xc001b3b630, 0xc00240e000)
2021-10-14T16:27:48.778091377+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.778093501+08:00 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func1(0x2186070, 0xc000858380, 0xc00240e000)
2021-10-14T16:27:48.778097028+08:00 	/home/circleci/go/pkg/mod/github.com/prometheus/client_golang@v1.11.0/prometheus/promhttp/instrument_server.go:165 +0xee
2021-10-14T16:27:48.778099012+08:00 net/http.HandlerFunc.ServeHTTP(0xc000720840, 0x2186070, 0xc000858380, 0xc00240e000)
2021-10-14T16:27:48.778101446+08:00 	/usr/local/go/src/net/http/server.go:2049 +0x44
2021-10-14T16:27:48.77810361+08:00 github.com/thanos-io/thanos/pkg/tracing.HTTPMiddleware.func1(0x2186070, 0xc000858380, 0xc001b25f00)
2021-10-14T16:27:48.778105865+08:00 	/home/circleci/project/pkg/tracing/http.go:46 +0x54c
2021-10-14T16:27:48.778112136+08:00 github.com/prometheus/common/route.(*Router).handle.func1(0x2186070, 0xc000858380, 0xc001b25e00, 0x0, 0x0, 0x0)
2021-10-14T16:27:48.778114691+08:00 	/home/circleci/go/pkg/mod/github.com/prometheus/common@v0.30.0/route/route.go:83 +0x27f
2021-10-14T16:27:48.778116825+08:00 github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc000750780, 0x2186070, 0xc000858380, 0xc001b25e00)
2021-10-14T16:27:48.778118789+08:00 	/home/circleci/go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0xc7e
2021-10-14T16:27:48.778121153+08:00 github.com/prometheus/common/route.(*Router).ServeHTTP(0xc00073c600, 0x2186070, 0xc000858380, 0xc001b25e00)
2021-10-14T16:27:48.778133587+08:00 	/home/circleci/go/pkg/mod/github.com/prometheus/common@v0.30.0/route/route.go:121 +0x4c
2021-10-14T16:27:48.778139247+08:00 net/http.(*ServeMux).ServeHTTP(0xc0006155c0, 0x2186070, 0xc000858380, 0xc001b25e00)
2021-10-14T16:27:48.778141612+08:00 	/usr/local/go/src/net/http/server.go:2428 +0x1ad
2021-10-14T16:27:48.778143696+08:00 net/http.serverHandler.ServeHTTP(0xc00048a000, 0x2186070, 0xc000858380, 0xc001b25e00)
2021-10-14T16:27:48.778145789+08:00 	/usr/local/go/src/net/http/server.go:2867 +0xa3
2021-10-14T16:27:48.778148174+08:00 net/http.(*conn).serve(0xc0009b81e0, 0x218d710, 0xc004648040)
2021-10-14T16:27:48.778150308+08:00 	/usr/local/go/src/net/http/server.go:1932 +0x8cd
2021-10-14T16:27:48.778152282+08:00 created by net/http.(*Server).Serve
2021-10-14T16:27:48.778154356+08:00 	/usr/local/go/src/net/http/server.go:2993 +0x39b

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

It may take time to investigate further.

Full logs to relevant components:

Anything else we need to know:

ianwoolf added a commit to ianwoolf/thanos that referenced this issue Oct 15, 2021
Signed-off-by: ian woolf <btw515wolf2@gmail.com>
@ianwoolf
Contributor

@hanjm I'm investigating the root cause. I notice the project path is /home/circleci/project/. Are these CircleCI test logs?

@hanjm
Member Author

hanjm commented Oct 18, 2021

@hanjm I'm investigating the root cause. I notice the project path is /home/circleci/project/. Are these CircleCI test logs?

No. I use the official image, so the path is the CI build path.

hanjm added a commit to hanjm/thanos that referenced this issue Oct 28, 2021
hanjm added a commit to hanjm/thanos that referenced this issue Oct 28, 2021
Signed-off-by: Jimmiehan <hanjinming@outlook.com>
hanjm added a commit to hanjm/thanos that referenced this issue Oct 28, 2021
Signed-off-by: Jimmiehan <hanjinming@outlook.com>
@hanjm
Member Author

hanjm commented Oct 28, 2021

I found it.
Here are the steps I took:

  1. Add a debug log at pkg/api/api.go#L199:
// GetInstr returns a http HandlerFunc with the instrumentation middleware.
func GetInstr(
	tracer opentracing.Tracer,
	logger log.Logger,
	ins extpromhttp.InstrumentationMiddleware,
	logMiddleware *logging.HTTPServerMiddleware,
	disableCORS bool,
) InstrFunc {
	instr := func(name string, f ApiFunc) http.HandlerFunc {
		hf := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if !disableCORS {
				SetCORS(w)
			}
			if data, warnings, err := f(r); err != nil {
				RespondError(w, err, data)
			} else if data != nil {
				// TODO temp debug code
				var warningsNew []error
				for i, warn := range warnings {
					if warn == nil {
						level.Error(logger).Log("msg", fmt.Sprintf("warn(%d) is nil, path:%s,"+
							"data:%+v, warnings:%v",
							i, r.RequestURI, data, warnings))
					} else {
						warningsNew = append(warningsNew, warn)
					}
				}
				Respond(w, data, warningsNew)
			} else {
				w.WriteHeader(http.StatusNoContent)
			}
		})
  2. Find the error log:
{"caller":"api.go:218","level":"error","msg":"warn is nil, path:/api/v1/query_exemplars?query=sum(cur_distribute_num%7Bapp%3D%27diff%27%7D)&start=1635218940&end=1635305340,data:[], warnings:[receiving exemplars from exemplars client &{0xc000e8aa80}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:44\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:191\ngithub.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/Users/jimmiehan/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\t/usr/local/go117/src/runtime/asm_amd64.s:1581 receiving exemplars from exemplars client &{0xc000d70e00}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:44\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:191\ngithub.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/Users/jimmiehan/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\t/usr/local/go117/src/runtime/asm_amd64.s:1581 receiving exemplars from exemplars client &{0xc000dca700}: rpc error: code = Unimplemented desc = unknown service 
thanos.Exemplars\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:44\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:191\ngithub.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/Users/jimmiehan/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\t/usr/local/go117/src/runtime/asm_amd64.s:1581 receiving exemplars from exemplars client &{0xc0000dbc00}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:44\ngithub.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:191\ngithub.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1\n\t/Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/Users/jimmiehan/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\t/usr/local/go117/src/runtime/asm_amd64.s:1581 receiving exemplars from exemplars client &{0xc000cb0380}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars\ngithub.com/thanos-io/thanos/pkg/exemp
  3. Find the concurrency bug: many goroutines use the same server exemplarspb.Exemplars_ExemplarsServer https://github.com/thanos-io/thanos/blob/main/pkg/exemplars/proxy.go#L52 (implemented by *exemplarsServer https://github.com/thanos-io/thanos/blob/main/pkg/exemplars/exemplars.go#L44), so the append is unsafe.
  4. Write a unit test and run go test -race to reproduce it:
$ go test -race -run TestProxyDataRace -v
=== RUN   TestProxyDataRace
==================
WARNING: DATA RACE
Read at 0x00c0000b6570 by goroutine 22:
  github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send()
      /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:44 +0x1b6
  github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive()
      /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:191 +0xa5b
  github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1()
      /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133 +0x4c
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/jimmiehan/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 +0x96

Previous write at 0x00c0000b6570 by goroutine 21:
  github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send()
      /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:44 +0x276
  github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive()
      /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:191 +0xa5b
  github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1()
      /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133 +0x4c
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/jimmiehan/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 +0x96

  5. Add a sync.Mutex to fix it.
  6. Re-run the race test, then deploy the fix.

hanjm added a commit to hanjm/thanos that referenced this issue Oct 28, 2021
Signed-off-by: Jimmiehan <hanjinming@outlook.com>
hanjm added a commit to hanjm/thanos that referenced this issue Oct 28, 2021
Signed-off-by: Jimmiehan <hanjinming@outlook.com>
@ianwoolf
Contributor

👍

ianwoolf added a commit to ianwoolf/thanos that referenced this issue Oct 28, 2021
Signed-off-by: ian woolf <btw515wolf2@gmail.com>
ianwoolf added a commit to ianwoolf/thanos that referenced this issue Oct 28, 2021
Signed-off-by: ian woolf <btw515wolf2@gmail.com>
ianwoolf added a commit to ianwoolf/thanos that referenced this issue Oct 28, 2021
Signed-off-by: ian woolf <btw515wolf2@gmail.com>
GiedriusS pushed a commit that referenced this issue Oct 28, 2021
* Query: add exemplarsServer data race test (#4777)

Signed-off-by: Jimmiehan <hanjinming@outlook.com>

* Query: fix exemplarsServer data race (#4777)

Signed-off-by: Jimmiehan <hanjinming@outlook.com>
@GiedriusS GiedriusS reopened this Oct 28, 2021
@GiedriusS
Member

Stupid question but how is the warning being nil related to the data race? @hanjm @ianwoolf 🤔

@hanjm
Member Author

hanjm commented Oct 29, 2021

Stupid question but how is the warning being nil related to the data race? @hanjm @ianwoolf 🤔

@GiedriusS Appending to a slice concurrently is unsafe, so I believe it is related. I can reproduce it, but since it is a data race the failure is not deterministic; you may need to run go test -race -run TestProxyDataRace -v multiple times.
Here is the code:

// TestProxyDataRace find the concurrent data race bug ( go test -race -run TestProxyDataRace -v ).
func TestProxyDataRace(t *testing.T) {
	logger := log.NewLogfmtLogger(os.Stderr)
	p := NewProxy(logger, func() []*exemplarspb.ExemplarStore {
		es := &exemplarspb.ExemplarStore{
			ExemplarsClient: &testExemplarClient{
				recvErr: errors.New("err"),
			},
			LabelSets: []labels.Labels{labels.FromMap(map[string]string{"cluster": "A"})},
		}
		size := 1000000
		endpoints := make([]*exemplarspb.ExemplarStore, 0, size)
		for i := 0; i < size; i++ {
			endpoints = append(endpoints, es)
		}
		return endpoints
	}, labels.FromMap(map[string]string{"query": "foo"}))
	req := &exemplarspb.ExemplarsRequest{
		Query:                   `http_request_duration_bucket{query="foo"}`,
		PartialResponseStrategy: storepb.PartialResponseStrategy_WARN,
	}
	s := &exemplarsServer{
		ctx: context.Background(),
	}
	_ = p.Exemplars(req, s)
	t.Logf("warns size:%v", len(s.warnings))
	for _, v := range s.warnings {
		if v == nil {
			t.Logf("warn is nil:%v", v)
		}
	}
}

I searched Google and found this post https://groups.google.com/g/golang-nuts/c/pNuplOvIeWA, where Jan Mercl explains:

Slice is a multi word value. So writing it concurrently is not atomic
and it may eventually produce a value that's a mix of the concurrently
written values. After append it may happen that the slice's pointer to
the backing array points incorrectly to the old value but the len and
capacity are for the new value. Then it may happen that one can index
only to a certain extent, lower than would be using a non corrupted
slice value. Exact failure modes can be only guessed without more
info.

@hanjm hanjm closed this as completed Oct 29, 2021
akanshat pushed a commit to akanshat/thanos that referenced this issue Oct 30, 2021
* Query: add exemplarsServer data race test (thanos-io#4777)

Signed-off-by: Jimmiehan <hanjinming@outlook.com>

* Query: fix exemplarsServer data race (thanos-io#4777)

Signed-off-by: Jimmiehan <hanjinming@outlook.com>