Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing lock around rng access in rand.go #5814

Closed
pkwarren opened this issue Sep 12, 2024 · 2 comments · Fixed by #5819
Closed

Missing lock around rng access in rand.go #5814

pkwarren opened this issue Sep 12, 2024 · 2 comments · Fixed by #5819
Assignees
Labels
bug Something isn't working
Milestone

Comments

@pkwarren
Copy link
Contributor

Description

A rand.Rand instance is used without locking in rand.go, leading to potential panics.

Environment

  • OS: macOS
  • Architecture: arm64
  • Go Version: 1.23
  • opentelemetry-go version: 0.29.0

Steps To Reproduce

  1. Call Int64Histogram.Record from multiple goroutines.
  2. See error below.
panic: runtime error: index out of range [-1]
2024-09-12 13:11:53.296	

2024-09-12 13:11:53.296	
goroutine 1375207 [running]:
2024-09-12 13:11:53.296	
math/rand.(*rngSource).Uint64(...)
2024-09-12 13:11:53.296	
	math/rand/rng.go:249
2024-09-12 13:11:53.296	
math/rand.(*rngSource).Int63(0x2911fc8?)
2024-09-12 13:11:53.296	
	math/rand/rng.go:234 +0x85
2024-09-12 13:11:53.296	
math/rand.(*Rand).Int63(...)
2024-09-12 13:11:53.296	
	math/rand/rand.go:96
2024-09-12 13:11:53.296	
math/rand.(*Rand).Int63n(0x8c14e4ce7d6c0496?, 0x1?)
2024-09-12 13:11:53.296	
	math/rand/rand.go:125 +0x76
2024-09-12 13:11:53.296	
go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*randRes).Offer(0xc000822780, {0x2947a18, 0xc1c9c0d770}, {0xc1c9cfc160?, 0xc1b579d528?, 0x3dcfce0?}, {0x28?, 0x40810d?}, {0x0, 0x0, ...})
2024-09-12 13:11:53.296	
	go.opentelemetry.io/otel/sdk/metric@v1.29.0/internal/exemplar/rand.go:126 +0x96
2024-09-12 13:11:53.296	
go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*filteredReservoir[...]).Offer(0x291e2e0, {0x2947a18, 0xc1c9c0d770}, 0x6f0, {0x0, 0x0, 0x0})
2024-09-12 13:11:53.296	
	go.opentelemetry.io/otel/sdk/metric@v1.29.0/internal/exemplar/filtered_reservoir.go:45 +0x127
2024-09-12 13:11:53.296	
go.opentelemetry.io/otel/sdk/metric/internal/aggregate.(*valueMap[...]).measure(0x2930680, {0x2947a18, 0xc1c9c0d770}, 0x6f0, {{{0x202a300, 0xc1c9cf2d80?}}}, {0x0, 0x0, 0x0})
2024-09-12 13:11:53.296	
	go.opentelemetry.io/otel/sdk/metric@v1.29.0/internal/aggregate/sum.go:50 +0x31f
2024-09-12 13:11:53.296	
go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[...].filter.func9(0x0?, {{{0x202a300?, 0xc1c9cf2d80?}}})
2024-09-12 13:11:53.296	
	go.opentelemetry.io/otel/sdk/metric@v1.29.0/internal/aggregate/aggregate.go:72 +0x35
2024-09-12 13:11:53.296	
go.opentelemetry.io/otel/sdk/metric.(*int64Inst).aggregate(...)
2024-09-12 13:11:53.296	
	go.opentelemetry.io/otel/sdk/metric@v1.29.0/instrument.go:201
2024-09-12 13:11:53.296	
go.opentelemetry.io/otel/sdk/metric.(*int64Inst).Add(0xc000f615c0, {0x2947a18, 0xc1c9c0d770}, 0x6f0, {0xc1c9cfc180?, 0x4155ab?, 0x7a5a8d6b5c60?})
2024-09-12 13:11:53.296	
	go.opentelemetry.io/otel/sdk/metric@v1.29.0/instrument.go:191 +0x89

Expected behavior

No panics from code with concurrent calls to a histogram.

@pkwarren pkwarren added the bug Something isn't working label Sep 12, 2024
pkwarren added a commit to pkwarren/opentelemetry-go that referenced this issue Sep 12, 2024
Update rand.Rand usage to add locking around a global rand.Rand
instance, which is not safe for concurrent access by multiple
goroutines.

Fixes open-telemetry#5814.
pkwarren added a commit to pkwarren/opentelemetry-go that referenced this issue Sep 12, 2024
Update rand.Rand usage to add locking around a global rand.Rand
instance, which is not safe for concurrent access by multiple
goroutines.

Fixes open-telemetry#5814.
@MrAlias
Copy link
Contributor

MrAlias commented Sep 13, 2024

For those also looking into this, this looks to be cause by an Int64Counter not an Int64Histogram.

@MrAlias
Copy link
Contributor

MrAlias commented Sep 13, 2024

Reproduction:

package metric // import "go.opentelemetry.io/otel/sdk/metric"

import (
	"context"
	"sync"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func TestRandomExemplarConcurrentSafe(t *testing.T) {
	t.Setenv("OTEL_METRICS_EXEMPLAR_FILTER", "always_on")

	r := NewManualReader()
	p := NewMeterProvider(WithReader(r))
	m := p.Meter("exemplar-concurrency")
	// Use two instruments to get concurrent access to any shared globals.
	i0, err := m.Int64Counter("counter.0")
	require.NoError(t, err)
	i1, err := m.Int64Counter("counter.1")
	require.NoError(t, err)

	ctx := context.Background()

	const goRoutines = 10

	var wg sync.WaitGroup
	wg.Add(goRoutines)
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	go func() {
		select {
		case <-done:
			return
		default:
		}
		var out metricdata.ResourceMetrics
		require.NotPanics(t, func() {
			require.NoError(t, r.Collect(ctx, &out))
		})
	}()

	for n := 0; n < goRoutines; n++ {
		go func() {
			defer wg.Done()
			require.NotPanics(t, func() {
				i0.Add(ctx, 1)
				i1.Add(ctx, 2)
			})
		}()
	}

	<-done

	var out metricdata.ResourceMetrics
	require.NoError(t, r.Collect(ctx, &out))
	p.Shutdown(ctx)
}
go test -race -count=100 -failfast -run="TestRandomExemplarConcurrentSafe" -v
=== RUN   TestRandomExemplarConcurrentSafe
--- PASS: TestRandomExemplarConcurrentSafe (0.00s)
=== RUN   TestRandomExemplarConcurrentSafe
==================
WARNING: DATA RACE
Read at 0x00c000199500 by goroutine 31:
  math/rand.(*rngSource).Uint64()
      /usr/lib/go/src/math/rand/rng.go:239 +0x30
  math/rand.(*rngSource).Int63()
      /usr/lib/go/src/math/rand/rng.go:234 +0x1cd
  math/rand.(*Rand).Int63()
      /usr/lib/go/src/math/rand/rand.go:96 +0xa7
  math/rand.(*Rand).Int63n()
      /usr/lib/go/src/math/rand/rand.go:125 +0xad
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*randRes).Offer()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:126 +0x3ac
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*filteredReservoir[go.shape.int64]).Offer()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/filtered_reservoir.go:45 +0x282
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*filteredReservoir[int64]).Offer()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/filtered_reservoir.go:42 +0x84
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.(*valueMap[go.shape.int64]).measure()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/sum.go:50 +0x4eb
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:119 +0xdb
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.Builder[go.shape.int64].filter.func9()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:72 +0x79
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).aggregate()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:201 +0xe2
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).Add()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:191 +0x76
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3.1()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:56 +0x7b
  github.com/stretchr/testify/assert.didPanic()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1201 +0xdc
  github.com/stretchr/testify/assert.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1272 +0xa4
  github.com/stretchr/testify/require.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/require/require.go:1605 +0xb3
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:55 +0x205

Previous write at 0x00c000199500 by goroutine 32:
  math/rand.(*rngSource).Uint64()
      /usr/lib/go/src/math/rand/rng.go:239 +0x44
  math/rand.(*rngSource).Int63()
      /usr/lib/go/src/math/rand/rng.go:234 +0x1cd
  math/rand.(*Rand).Int63()
      /usr/lib/go/src/math/rand/rand.go:96 +0xd8
  math/rand.(*Rand).Float64()
      /usr/lib/go/src/math/rand/rand.go:207 +0xd9
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.random()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:49 +0x7f
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*randRes).advance()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:181 +0xfc
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*randRes).reset()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:152 +0x144
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.FixedSize()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:62 +0x124
  go.opentelemetry.io/otel/sdk/metric.reservoirFunc[go.shape.int64].func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar.go:75 +0x125
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.(*valueMap[go.shape.int64]).measure()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/sum.go:45 +0x46d
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:119 +0xdb
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.Builder[go.shape.int64].filter.func9()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:72 +0x79
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).aggregate()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:201 +0xe2
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).Add()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:191 +0x76
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3.1()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:57 +0xa5
  github.com/stretchr/testify/assert.didPanic()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1201 +0xdc
  github.com/stretchr/testify/assert.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1272 +0xa4
  github.com/stretchr/testify/require.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/require/require.go:1605 +0xb3
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:55 +0x205

Goroutine 31 (running) created at:
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:53 +0x45b
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/lib/go/src/testing/testing.go:1743 +0x44

Goroutine 32 (finished) created at:
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:53 +0x45b
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/lib/go/src/testing/testing.go:1743 +0x44
==================
==================
WARNING: DATA RACE
Read at 0x00c000199508 by goroutine 31:
  math/rand.(*rngSource).Uint64()
      /usr/lib/go/src/math/rand/rng.go:244 +0xa4
  math/rand.(*rngSource).Int63()
      /usr/lib/go/src/math/rand/rng.go:234 +0x1cd
  math/rand.(*Rand).Int63()
      /usr/lib/go/src/math/rand/rand.go:96 +0xa7
  math/rand.(*Rand).Int63n()
      /usr/lib/go/src/math/rand/rand.go:125 +0xad
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*randRes).Offer()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:126 +0x3ac
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*filteredReservoir[go.shape.int64]).Offer()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/filtered_reservoir.go:45 +0x282
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*filteredReservoir[int64]).Offer()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/filtered_reservoir.go:42 +0x84
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.(*valueMap[go.shape.int64]).measure()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/sum.go:50 +0x4eb
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:119 +0xdb
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.Builder[go.shape.int64].filter.func9()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:72 +0x79
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).aggregate()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:201 +0xe2
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).Add()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:191 +0x76
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3.1()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:56 +0x7b
  github.com/stretchr/testify/assert.didPanic()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1201 +0xdc
  github.com/stretchr/testify/assert.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1272 +0xa4
  github.com/stretchr/testify/require.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/require/require.go:1605 +0xb3
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:55 +0x205

Previous write at 0x00c000199508 by goroutine 32:
  math/rand.(*rngSource).Uint64()
      /usr/lib/go/src/math/rand/rng.go:244 +0xbc
  math/rand.(*rngSource).Int63()
      /usr/lib/go/src/math/rand/rng.go:234 +0x1cd
  math/rand.(*Rand).Int63()
      /usr/lib/go/src/math/rand/rand.go:96 +0xd8
  math/rand.(*Rand).Float64()
      /usr/lib/go/src/math/rand/rand.go:207 +0xd9
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.random()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:49 +0x7f
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*randRes).advance()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:181 +0xfc
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.(*randRes).reset()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:152 +0x144
  go.opentelemetry.io/otel/sdk/metric/internal/exemplar.FixedSize()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/exemplar/rand.go:62 +0x124
  go.opentelemetry.io/otel/sdk/metric.reservoirFunc[go.shape.int64].func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar.go:75 +0x125
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.(*valueMap[go.shape.int64]).measure()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/sum.go:45 +0x46d
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:119 +0xdb
  go.opentelemetry.io/otel/sdk/metric/internal/aggregate.Builder[go.shape.int64].Sum.Builder[go.shape.int64].filter.func9()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/internal/aggregate/aggregate.go:72 +0x79
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).aggregate()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:201 +0xe2
  go.opentelemetry.io/otel/sdk/metric.(*int64Inst).Add()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/instrument.go:191 +0x76
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3.1()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:57 +0xa5
  github.com/stretchr/testify/assert.didPanic()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1201 +0xdc
  github.com/stretchr/testify/assert.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:1272 +0xa4
  github.com/stretchr/testify/require.NotPanics()
      /home/tyler/go/pkg/mod/github.com/stretchr/testify@v1.9.0/require/require.go:1605 +0xb3
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe.func3()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:55 +0x205

Goroutine 31 (running) created at:
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:53 +0x45b
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/lib/go/src/testing/testing.go:1743 +0x44

Goroutine 32 (finished) created at:
  go.opentelemetry.io/otel/sdk/metric.TestRandomExemplarConcurrentSafe()
      /home/tyler/go/src/go.opentelemetry.io/otel/sdk/metric/exemplar_test.go:53 +0x45b
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/lib/go/src/testing/testing.go:1743 +0x44
==================
    testing.go:1399: race detected during execution of test
--- FAIL: TestRandomExemplarConcurrentSafe (0.00s)
FAIL
exit status 1
FAIL	go.opentelemetry.io/otel/sdk/metric	0.017s

The key here is the duplicate instruments (i.e. i0 and i1) that share access to the global. This does not fail with only one instrument.

This does not fail reliably with only one test run. Need to build multiple collections into this test to cause it to do so.

MrAlias added a commit to MrAlias/opentelemetry-go that referenced this issue Sep 13, 2024
@MrAlias MrAlias self-assigned this Sep 13, 2024
@MrAlias MrAlias added this to the v1.31.0 milestone Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants