Single profiler goroutine #655
Conversation
Codecov Report

Patch coverage:

```
@@            Coverage Diff             @@
##           master     #655      +/-  ##
==========================================
- Coverage   80.80%   80.78%   -0.02%
==========================================
  Files          45       45
  Lines        4474     4574     +100
==========================================
+ Hits         3615     3695      +80
- Misses        729      742      +13
- Partials      130      137       +7
```

☔ View full report in Codecov by Sentry.
Force-pushed from 0489ab7 to 4605763.
Looking at profiles generated by this branch, I find the samples quite inconsistent, especially during HTTP requests. We'd see regular sampling every 10ms, sometimes only at the beginning of the request, then a 500ms gap until the end of the request. Can you look into why this is happening please?
From your description it looks like the wait time is not being sampled. I will have a look. What did you test this on?
Thanks. We run this branch on https://github.com/getsentry/gib-potato/tree/main/potal.
Thanks for the sample app. However, I wasn't able to reproduce the issue, likely because I don't know how to run the app in a realistic environment. I've tried updating one of the handlers to run for about 90 ms and the sample times look good to me. Can you send me a dump from the 🥔 app profile that exhibits the behaviour? I don't have access to that project in Sentry.
Force-pushed from e000133 to c5f1055.
@phacops I've managed to have a look at a sample dump that has the gaps you're describing. In the attached sheet, you can see that for some time, the ticks were coming at the expected interval. Then there are periods when ticks are missing. Interestingly, the gaps are often whole multiples of ten tick intervals (i.e. the gap is ~100ms if the column says "10"). I will have one more go at reproducing this locally, but so far I think it is a scheduling issue:
Can you verify what the […]? As for an alternative way that wouldn't rely on a profiler goroutine being scheduled at a specific interval, I am not sure there's a feasible alternative at the moment. There's […]
It's running on a server with 1 vCPU, so I think we can still go ahead with this and eventually address it if we see it being a problem. I actually don't think we can do much better than this if the […]. We'll keep an eye on it, since data quality is very important for extracting value from profiling, so we pay a lot of attention to these issues; but other SDKs aren't exempt from this type of issue either.
Follow-up on #626: run a single profiler continuously (instead of a new profiler for each profiled transaction) and slice its data as needed.

The `profileRecorder` starts when the first transaction would be sampled and then runs until the app is stopped. I was thinking about stopping it while no transaction is running, but that would get more complicated and I'm not sure it's worth it, because it would also make overall app performance less predictable (it would differ depending on whether a transaction is currently being profiled). On the other hand, for very low sampling rates it would be beneficial to only run when needed. In any case, making the profiler start & pause on demand would be an incremental change that can be done in the future when the need arises (based on user feedback).

Example of a captured profile: https://sentry-sdks.sentry.io/profiling/profile/sentry-go/7f41d10eeb8749a68d760e89c9ae8a8a/flamechart/?colorCoding=by+system+vs+application+frame&query=&sorting=call+order&tid=16&view=top+down
Additional remarks
The changes in this PR are not going to be faster for an app with a low number of concurrent transactions; the continuous profiler simply has to do more work to serve the same purpose. On the other hand, you should see a difference as the number of parallel profiled transactions grows. Unless there's a way to tune this in the demo, it may not be suitable for the evaluation.
For example, the `TestProfilerOverhead` test would fail on the `main` branch after adapting it to run parallel goroutines. The overhead on this branch for 100 parallel goroutines is about 3 %, while the "same" code running on the original implementation (the current state of the `master` branch), i.e. the one that launches a separate profiler for each goroutine, ended up failing with an overhead of over 30 % for the same number of goroutines.

Benchmark code from this branch adapted to `master`