
WIP: Replace atomic StateLocker approach in MMSC with Mutex approach #600

Conversation

@bogdandrutu (Member) commented Mar 28, 2020

In my example I did not use core.Number because I don't need operations to be atomic under the lock.

go test -benchmem -run=^$ go.opentelemetry.io/otel/sdk/metric/aggregator/minmaxsumcount -bench=.
goos: darwin
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/metric/aggregator/minmaxsumcount
BenchmarkCurrentMinMaxSumCount-16                       31125422                38.2 ns/op             0 B/op          0 allocs/op
BenchmarkCurrentMinMaxSumCountRunParallel-16             7957776               171 ns/op               0 B/op          0 allocs/op
BenchmarkCurrentMinMaxSumCountMutex-16                  80215216                13.1 ns/op             0 B/op          0 allocs/op
BenchmarkCurrentMinMaxSumCountMutexRunParallel-16       19457298                60.1 ns/op             0 B/op          0 allocs/op
PASS
ok      go.opentelemetry.io/otel/sdk/metric/aggregator/minmaxsumcount   8.256s

Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>
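
For reference, a minimal sketch of the mutex-based approach being benchmarked (names and types are illustrative, not the exact code in this PR): the whole update runs under a single sync.Mutex, and the four values are plain fields with no atomics.

package minmaxsumcount

import "sync"

// mutexMMSC guards all four values with one mutex instead of
// per-field atomics coordinated by a StateLocker.
type mutexMMSC struct {
	lock  sync.Mutex
	min   float64
	max   float64
	sum   float64
	count int64
}

func (a *mutexMMSC) Update(value float64) {
	a.lock.Lock()
	// Plain loads and stores: the mutex provides both mutual
	// exclusion and the necessary memory barriers.
	if a.count == 0 || value < a.min {
		a.min = value
	}
	if a.count == 0 || value > a.max {
		a.max = value
	}
	a.sum += value
	a.count++
	a.lock.Unlock()
}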
bogdandrutu force-pushed the atomics_are_not_always_the_answer branch from b408e44 to c402556 on March 28, 2020 02:30
bogdandrutu added the prototype (Feature to prototype a spec-level decision) label on Mar 28, 2020
@evantorrie (Contributor)

I presume this probably applies to Histogram -- maybe even Sum under concurrency > 4 or so?

@bogdandrutu (Member, Author)

@evantorrie, there are a couple of issues here:

  1. core.Number uses unnecessary atomic ops when reading the input argument. This adds needless memory barriers and prevents the compiler from optimizing, since it cannot reorder operations across them.
    • Suggestion: make core.Number non-atomic and add an sdk.internal.Number that uses atomics for these kinds of operations. That way we limit atomics usage to only the places that need it.
  2. In the case of MinMaxSumCount we issue a CAS many times; this operation is almost as expensive as acquiring a Mutex and can cause more trouble than a Mutex (see the sketch after this comment).
  3. Cache false sharing: all four variables that we load/store in MinMaxSumCount most likely sit in the same cache line, so concurrent operations on different CPUs invalidate that cache line for the other CPUs.

As a suggestion, I think we can start by fixing the first issue and see where we are. I have a feeling that for MinMaxSumCount a Mutex will be better, but for Sum or Histogram (probably two atomic adds, one for the sum and one for the bucket count), where we only need atomic add operations and never read the result, the CPU/compiler will do a much better job.
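
To illustrate points (2) and (3), here is a hypothetical sketch of the CAS retry loop that an atomic minimum update requires; every attempt issues a full memory barrier, and under contention the loop may spin:

package minmaxsumcount

import (
	"math"
	"sync/atomic"
)

// atomicMin lowers *minBits (a float64 stored as bits) via
// compare-and-swap. A failed CAS means another goroutine raced us,
// so we reload and retry; each attempt costs roughly as much as an
// uncontended mutex lock.
func atomicMin(minBits *uint64, value float64) {
	for {
		old := atomic.LoadUint64(minBits)
		if value >= math.Float64frombits(old) {
			return // current min is already smaller
		}
		if atomic.CompareAndSwapUint64(minBits, old, math.Float64bits(value)) {
			return
		}
	}
}

// Point (3): if min, max, sum, and count are laid out contiguously,
// they share a cache line, and an atomic store to any one of them
// from one CPU invalidates the line for every other CPU.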

@jmacd (Contributor) commented Mar 30, 2020

I'm not sure I understand point (1) about core.Number forcing atomics. It's just an int64 and you only get atomic operations when you ask for them. Which statement are you referring to?

Point (3) is a good one.

For point (2) I'm not sure -- we were following the Prometheus library in using this design. I wonder why they decided to use a lock-free histogram bucket if your argument is true.

That said, I remember situations in the past where mutexes were problematic. We choose lock-free structures not because they have the best performance but because they have the most consistent performance. I'd like to contribute one more benchmark to this debate.

@bogdandrutu (Member, Author)

> For point (2) I'm not sure -- we were following the Prometheus library in using this design. I wonder why they decided to use a lock-free histogram bucket if your argument is true.

For histogram you don't need CAS because you just do adds. Only in MinMaxSumCount do you need CAS; see also the last sentence of my earlier comment, which says that for Sum and Histogram this is better.
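
A hypothetical illustration of the difference (assuming an integer-valued instrument, since Go has no atomic float add): a histogram recording is two unconditional atomic adds with no retry loop.

package histogram

import "sync/atomic"

// atomicHistogram records a measurement with exactly two atomic
// adds: one for the running sum and one for the matching bucket.
type atomicHistogram struct {
	sum        int64
	counts     []int64
	boundaries []int64 // sorted upper bounds; len(counts) == len(boundaries)+1
}

func (h *atomicHistogram) Update(value int64) {
	atomic.AddInt64(&h.sum, value)

	// The bucket search needs no synchronization: boundaries are
	// immutable after construction.
	bucket := 0
	for bucket < len(h.boundaries) && value >= h.boundaries[bucket] {
		bucket++
	}
	atomic.AddInt64(&h.counts[bucket], 1)
}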

@bogdandrutu (Member, Author) commented Mar 30, 2020

> I'd like to contribute one more benchmark to this debate.

Please do, I would like to see this.

@jmacd (Contributor) commented Mar 30, 2020

> For histogram you don't need CAS because you just do adds.

Prometheus uses the lock-free approach to maintain a consistent sum and counter per bucket without a mutex. Also note that the original MMSC implementation used no locking or synchronization, and some users raised flags about this (including @evantorrie).

I'll write another benchmark, just to add to this discussion. I specifically remember a case where, due to large fan-out, a number of simultaneous RPC responses would arrive at once; if they all have to grab the same mutex to finish their RPCs at the same moment, performance suffers. That is the benchmark I will write.

@evantorrie (Contributor)

> I'll write another benchmark, just to add to this discussion. I specifically remember a case where, due to large fan-out, a number of simultaneous RPC responses would arrive at once; if they all have to grab the same mutex to finish their RPCs at the same moment, performance suffers.

There was some work that went into Go 1.14 to address issues with sync.Mutex under high concurrency (particularly on high-core-count machines). See golang/go#33747

@jmacd (Contributor) commented Apr 1, 2020

I ran this benchmark. I suspect I don't have enough CPUs on my machine to test the regime where mutexes are expected to underperform. Does this mean we should remove StateLocker entirely? (Sorry @paivagustavo)


// Imports assumed for this snippet (package paths as they existed
// in the repository at the time of this PR):
//
//	import (
//		"context"
//		"fmt"
//		"runtime"
//		"sync"
//		"testing"
//
//		"go.opentelemetry.io/otel/api/core"
//		"go.opentelemetry.io/otel/api/metric"
//		export "go.opentelemetry.io/otel/sdk/export/metric"
//		"go.opentelemetry.io/otel/sdk/metric/aggregator/test"
//	)

func BenchmarkMinMaxSumCountConcurrentLockFree(b *testing.B) {
	descriptor := test.NewAggregatorTest(metric.MeasureKind, core.Float64NumberKind)
	agg := New(descriptor)
	benchmarkMinMaxSumCountConcurrent(b, descriptor, agg)
}

// newTestMMSCFloat64 constructs the mutex-based implementation
// proposed in this PR.
func BenchmarkMinMaxSumCountConcurrentMutex(b *testing.B) {
	descriptor := test.NewAggregatorTest(metric.MeasureKind, core.Float64NumberKind)
	agg := newTestMMSCFloat64()
	benchmarkMinMaxSumCountConcurrent(b, descriptor, agg)
}

// benchmarkMinMaxSumCountConcurrent measures the worst case for a
// mutex: NumCPU-1 workers released simultaneously, all updating the
// same aggregator at the same moment.
func benchmarkMinMaxSumCountConcurrent(b *testing.B, descriptor *metric.Descriptor, agg export.Aggregator) {
	ctx := context.Background()

	cpus := runtime.NumCPU()
	stop := make(chan struct{})
	cond := sync.NewCond(new(sync.Mutex))

	// wait is replaced under cond.L for each trial.
	var wait *sync.WaitGroup

	current := 0

	for i := 0; i < cpus-1; i++ {
		go func() {
			for trial := 1; ; trial++ {
				select {
				case <-stop:
					return
				default:
				}

				// Block until the driver releases this trial.
				cond.L.Lock()
				for current < trial {
					cond.Wait()
				}
				cond.L.Unlock()

				// Ten updates per worker per trial, all racing
				// against the other workers.
				for i := 0; i < 10; i++ {
					if err := agg.Update(ctx, core.NewFloat64Number(float64(i)), descriptor); err != nil {
						fmt.Print(err)
					}
				}

				wait.Done()
			}
		}()
	}

	// once releases every worker for a single trial and waits for
	// all of them to finish their updates.
	once := func() {
		cond.L.Lock()
		wait = new(sync.WaitGroup)
		wait.Add(cpus - 1)
		current++
		cond.L.Unlock()
		cond.Broadcast()
		wait.Wait()
	}

	// Warm-up trial, excluded from the timing.
	once()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		once()
	}

	// Workers parked in cond.Wait never observe this close; they
	// leak until the test binary exits, which is acceptable here.
	close(stop)
}

@paivagustavo (Member) commented Apr 1, 2020

I've replicated this for histogram and it is indeed an improvement. We can find the histogram bucket before locking the mutex, which reduces the critical section to three simple number operations, just like MMSC (see the sketch at the end of this comment).

BenchmarkCurrentHistogram-12             	19110954	        63.6 ns/op
BenchmarkCurrentHistogramParallel-12     	 9365832	       132 ns/op
BenchmarkHistogramMutex-12               	44798426	        25.7 ns/op
BenchmarkHistogramMutexRunParallel-12    	15714463	        77.8 ns/op

@jmacd No need to be sorry, it was a valid attempt and I've learned a lot from it; we probably should have benchmarked it sooner. After these benchmarks, I'm +1 on removing StateLocker.
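
A minimal sketch of the approach described above (illustrative names, not the PR's actual code): the bucket search runs before the lock is taken, so the critical section shrinks to three plain operations.

package histogram

import "sync"

type mutexHistogram struct {
	lock       sync.Mutex
	sum        float64
	count      int64
	counts     []int64
	boundaries []float64 // sorted upper bounds, immutable after construction
}

func (h *mutexHistogram) Update(value float64) {
	// Find the bucket outside the critical section; boundaries
	// never change, so this read is safe without the lock.
	bucket := 0
	for bucket < len(h.boundaries) && value >= h.boundaries[bucket] {
		bucket++
	}

	// Only three simple operations happen under the mutex.
	h.lock.Lock()
	h.sum += value
	h.count++
	h.counts[bucket]++
	h.lock.Unlock()
}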

@jmacd (Contributor) commented Apr 23, 2020

We are going to accept this change, and we'll keep an eye on the performance of the MMSC and Histogram aggregators. We think there is a possibility that high-CPU environments will notice a degradation, but new aggregators could be added in the future specifically for those cases (e.g., AtomicMMSC, AtomicHistogram, ...).

jmacd changed the title from "if always(atomics) { fmt.println("Not best performance") }" to "Replace atomic StateLocker approach in MMSC with Mutex approach" on Apr 23, 2020
@jmacd (Contributor) commented Apr 23, 2020

Actually, this PR is just a demonstration.
We are looking for a complete PR that replaces the implementation in MMSC and Histogram.

jmacd changed the title from "Replace atomic StateLocker approach in MMSC with Mutex approach" to "WIP: Replace atomic StateLocker approach in MMSC with Mutex approach" on Apr 23, 2020
jmacd marked this pull request as a draft on April 23, 2020 17:35
@jmacd (Contributor) commented Apr 27, 2020

Closing this as it's not a complete change. This has been documented and is linked from #657.

We discussed in the last OTel-Go SIG call that a Mutex is probably the best default, and that, should a need for lockless aggregators come along, we can add new implementations at that point.

jmacd closed this on Apr 27, 2020
bogdandrutu deleted the atomics_are_not_always_the_answer branch on November 15, 2021 18:36
Labels: area:metrics (Part of OpenTelemetry Metrics), prototype (Feature to prototype a spec-level decision)