
WIP: Replace atomic StateLocker approach in MMSC with Mutex approach #600

Conversation

@bogdandrutu (Member) commented Mar 28, 2020

In my example I did not use core.Number because I don't need operations to be atomic under the lock.

go test -benchmem -run=^$ go.opentelemetry.io/otel/sdk/metric/aggregator/minmaxsumcount -bench=.
goos: darwin
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/metric/aggregator/minmaxsumcount
BenchmarkCurrentMinMaxSumCount-16                       31125422                38.2 ns/op             0 B/op          0 allocs/op
BenchmarkCurrentMinMaxSumCountRunParallel-16             7957776               171 ns/op               0 B/op          0 allocs/op
BenchmarkCurrentMinMaxSumCountMutex-16                  80215216                13.1 ns/op             0 B/op          0 allocs/op
BenchmarkCurrentMinMaxSumCountMutexRunParallel-16       19457298                60.1 ns/op             0 B/op          0 allocs/op
PASS
ok      go.opentelemetry.io/otel/sdk/metric/aggregator/minmaxsumcount   8.256s

Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>
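
For reference, a minimal sketch of the mutex-based approach being benchmarked (names and types are illustrative, not the exact code in this PR): the whole update runs under a single sync.Mutex, and the four values are plain fields with no atomics.

package minmaxsumcount

import "sync"

// mutexMMSC guards all four values with one mutex instead of
// per-field atomics coordinated by a StateLocker.
type mutexMMSC struct {
	lock  sync.Mutex
	min   float64
	max   float64
	sum   float64
	count int64
}

func (a *mutexMMSC) Update(value float64) {
	a.lock.Lock()
	// Plain loads and stores: the mutex provides both mutual
	// exclusion and the necessary memory barriers.
	if a.count == 0 || value < a.min {
		a.min = value
	}
	if a.count == 0 || value > a.max {
		a.max = value
	}
	a.sum += value
	a.count++
	a.lock.Unlock()
}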
bogdandrutu force-pushed the atomics_are_not_always_the_answer branch from b408e44 to c402556 on March 28, 2020 02:30
bogdandrutu added the prototype (Feature to prototype a spec-level decision) label on Mar 28, 2020
@evantorrie (Contributor)

I presume this probably applies to Histogram -- maybe even Sum under concurrency > 4 or so?

@bogdandrutu (Member, Author)

@evantorrie, there are a couple of issues here:

  1. core.Number uses unnecessary atomic ops when reading the input argument. This adds needless memory barriers and prevents the compiler from optimizing, since it cannot reorder operations across them.
    • Suggestion: make core.Number non-atomic and add an sdk.internal.Number that uses atomics for these kinds of operations. That way we limit atomics usage to only the places that need it.
  2. In the case of MinMaxSumCount we issue a CAS many times; this operation is almost as expensive as acquiring a Mutex and can cause more trouble than a Mutex (see the sketch after this comment).
  3. Cache false sharing: all four variables that we load/store in MinMaxSumCount most likely sit in the same cache line, so concurrent operations on different CPUs invalidate that cache line for the other CPUs.

As a suggestion, I think we can start by fixing the first issue and see where we are. I have a feeling that for MinMaxSumCount a Mutex will be better, but for Sum or Histogram (probably two atomic adds, one for the sum and one for the bucket count), where we only need atomic add operations and never read the result, the CPU/compiler will do a much better job.
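
To illustrate points (2) and (3), here is a hypothetical sketch of the CAS retry loop that an atomic minimum update requires; every attempt issues a full memory barrier, and under contention the loop may spin:

package minmaxsumcount

import (
	"math"
	"sync/atomic"
)

// atomicMin lowers *minBits (a float64 stored as bits) via
// compare-and-swap. A failed CAS means another goroutine raced us,
// so we reload and retry; each attempt costs roughly as much as an
// uncontended mutex lock.
func atomicMin(minBits *uint64, value float64) {
	for {
		old := atomic.LoadUint64(minBits)
		if value >= math.Float64frombits(old) {
			return // current min is already smaller
		}
		if atomic.CompareAndSwapUint64(minBits, old, math.Float64bits(value)) {
			return
		}
	}
}

// Point (3): if min, max, sum, and count are laid out contiguously,
// they share a cache line, and an atomic store to any one of them
// from one CPU invalidates the line for every other CPU.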

@jmacd (Contributor) commented Mar 30, 2020

I'm not sure I understand point (1) about core.Number forcing atomics. It's just an int64 and you only get atomic operations when you ask for them. Which statement are you referring to?

Point (3) is a good one.

For point (2) I'm not sure -- we were following the Prometheus library in using this design. I wonder why they decided to use a lock-free histogram bucket if your argument is true.

That said, I remember situations in the past where mutexes were problematic. We choose lock-free structures not because they have the best performance but because they have the most consistent performance. I'd like to contribute one more benchmark to this debate.

@bogdandrutu (Member, Author)

> For point (2) I'm not sure -- we were following the Prometheus library in using this design. I wonder why they decided to use a lock-free histogram bucket if your argument is true.

For histogram you don't need CAS because you just do adds. Only in MinMaxSumCount do you need CAS; see also the last sentence of my earlier comment, which says that for Sum and Histogram this is better.
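
A hypothetical illustration of the difference (assuming an integer-valued instrument, since Go has no atomic float add): a histogram recording is two unconditional atomic adds with no retry loop.

package histogram

import "sync/atomic"

// atomicHistogram records a measurement with exactly two atomic
// adds: one for the running sum and one for the matching bucket.
type atomicHistogram struct {
	sum        int64
	counts     []int64
	boundaries []int64 // sorted upper bounds; len(counts) == len(boundaries)+1
}

func (h *atomicHistogram) Update(value int64) {
	atomic.AddInt64(&h.sum, value)

	// The bucket search needs no synchronization: boundaries are
	// immutable after construction.
	bucket := 0
	for bucket < len(h.boundaries) && value >= h.boundaries[bucket] {
		bucket++
	}
	atomic.AddInt64(&h.counts[bucket], 1)
}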

@bogdandrutu (Member, Author) commented Mar 30, 2020

> I'd like to contribute one more benchmark to this debate.

Please do, I would like to see this.

@jmacd (Contributor) commented Mar 30, 2020

> For histogram you don't need CAS because you just do adds.

Prometheus uses the lock-free approach to maintain a consistent sum and counter per bucket without a mutex. Also note that the original MMSC implementation used no locking or synchronization, and some users raised flags about this (including @evantorrie).

I'll write another benchmark, just to add to this discussion. I specifically remember a case where, due to large fan-out, a number of simultaneous RPC responses would arrive at once; if they all have to grab the same mutex to finish their RPCs at the same moment, performance suffers. That is the benchmark I will write.

@evantorrie (Contributor)

> I'll write another benchmark, just to add to this discussion. I specifically remember a case where, due to large fan-out, a number of simultaneous RPC responses would arrive at once; if they all have to grab the same mutex to finish their RPCs at the same moment, performance suffers.

There was some work that went into Go 1.14 to address issues with sync.Mutex under high concurrency (particularly on high-core-count machines). See golang/go#33747

@jmacd (Contributor) commented Apr 1, 2020

I ran this benchmark. I suspect I don't have enough CPUs on my machine to test the regime where mutexes are expected to underperform. Does this mean we should remove StateLocker entirely? (Sorry @paivagustavo)


// Imports assumed for this snippet (package paths as they existed
// in the repository at the time of this PR):
//
//	import (
//		"context"
//		"fmt"
//		"runtime"
//		"sync"
//		"testing"
//
//		"go.opentelemetry.io/otel/api/core"
//		"go.opentelemetry.io/otel/api/metric"
//		export "go.opentelemetry.io/otel/sdk/export/metric"
//		"go.opentelemetry.io/otel/sdk/metric/aggregator/test"
//	)

func BenchmarkMinMaxSumCountConcurrentLockFree(b *testing.B) {
	descriptor := test.NewAggregatorTest(metric.MeasureKind, core.Float64NumberKind)
	agg := New(descriptor)
	benchmarkMinMaxSumCountConcurrent(b, descriptor, agg)
}

// newTestMMSCFloat64 constructs the mutex-based implementation
// proposed in this PR.
func BenchmarkMinMaxSumCountConcurrentMutex(b *testing.B) {
	descriptor := test.NewAggregatorTest(metric.MeasureKind, core.Float64NumberKind)
	agg := newTestMMSCFloat64()
	benchmarkMinMaxSumCountConcurrent(b, descriptor, agg)
}

// benchmarkMinMaxSumCountConcurrent measures the worst case for a
// mutex: NumCPU-1 workers released simultaneously, all updating the
// same aggregator at the same moment.
func benchmarkMinMaxSumCountConcurrent(b *testing.B, descriptor *metric.Descriptor, agg export.Aggregator) {
	ctx := context.Background()

	cpus := runtime.NumCPU()
	stop := make(chan struct{})
	cond := sync.NewCond(new(sync.Mutex))

	// wait is replaced under cond.L for each trial.
	var wait *sync.WaitGroup

	current := 0

	for i := 0; i < cpus-1; i++ {
		go func() {
			for trial := 1; ; trial++ {
				select {
				case <-stop:
					return
				default:
				}

				// Block until the driver releases this trial.
				cond.L.Lock()
				for current < trial {
					cond.Wait()
				}
				cond.L.Unlock()

				// Ten updates per worker per trial, all racing
				// against the other workers.
				for i := 0; i < 10; i++ {
					if err := agg.Update(ctx, core.NewFloat64Number(float64(i)), descriptor); err != nil {
						fmt.Print(err)
					}
				}

				wait.Done()
			}
		}()
	}

	// once releases every worker for a single trial and waits for
	// all of them to finish their updates.
	once := func() {
		cond.L.Lock()
		wait = new(sync.WaitGroup)
		wait.Add(cpus - 1)
		current++
		cond.L.Unlock()
		cond.Broadcast()
		wait.Wait()
	}

	// Warm-up trial, excluded from the timing.
	once()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		once()
	}

	// Workers parked in cond.Wait never observe this close; they
	// leak until the test binary exits, which is acceptable here.
	close(stop)
}

@paivagustavo (Member) commented Apr 1, 2020

I've replicated this for histogram and it is indeed an improvement. We can find the histogram bucket before locking the mutex, which reduces the critical section to three simple number operations, just like MMSC (see the sketch at the end of this comment).

BenchmarkCurrentHistogram-12             	19110954	        63.6 ns/op
BenchmarkCurrentHistogramParallel-12     	 9365832	       132 ns/op
BenchmarkHistogramMutex-12               	44798426	        25.7 ns/op
BenchmarkHistogramMutexRunParallel-12    	15714463	        77.8 ns/op

@jmacd No need to be sorry, it was a valid attempt and I've learned a lot from it; we probably should have benchmarked it sooner. After these benchmarks, I'm +1 on removing StateLocker.
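
A minimal sketch of the approach described above (illustrative names, not the PR's actual code): the bucket search runs before the lock is taken, so the critical section shrinks to three plain operations.

package histogram

import "sync"

type mutexHistogram struct {
	lock       sync.Mutex
	sum        float64
	count      int64
	counts     []int64
	boundaries []float64 // sorted upper bounds, immutable after construction
}

func (h *mutexHistogram) Update(value float64) {
	// Find the bucket outside the critical section; boundaries
	// never change, so this read is safe without the lock.
	bucket := 0
	for bucket < len(h.boundaries) && value >= h.boundaries[bucket] {
		bucket++
	}

	// Only three simple operations happen under the mutex.
	h.lock.Lock()
	h.sum += value
	h.count++
	h.counts[bucket]++
	h.lock.Unlock()
}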

@jmacd (Contributor) commented Apr 23, 2020

We are going to accept this change, and we'll keep an eye on the performance of the MMSC and Histogram aggregators. We think there is a possibility that high-CPU environments will notice a degradation, but new aggregators could be added in the future specifically for those cases (e.g., AtomicMMSC, AtomicHistogram, ...).

jmacd changed the title from "if always(atomics) { fmt.println("Not best performance") }" to "Replace atomic StateLocker approach in MMSC with Mutex approach" on Apr 23, 2020
@jmacd (Contributor) commented Apr 23, 2020

Actually, this PR is just a demonstration.
We are looking for a complete PR that replaces the implementation in MMSC and Histogram.

jmacd changed the title from "Replace atomic StateLocker approach in MMSC with Mutex approach" to "WIP: Replace atomic StateLocker approach in MMSC with Mutex approach" on Apr 23, 2020
jmacd marked this pull request as a draft on April 23, 2020 17:35
@jmacd (Contributor) commented Apr 27, 2020

Closing this as it's not a complete change. This has been documented and is linked from #657.

We discussed in the last OTel-Go SIG call that a Mutex is probably the best default, and that, should a need for lockless aggregators come along, we can add new implementations at that point.

jmacd closed this on Apr 27, 2020
bogdandrutu deleted the atomics_are_not_always_the_answer branch on November 15, 2021 18:36
Labels: area:metrics (Part of OpenTelemetry Metrics), prototype (Feature to prototype a spec-level decision)