
Single profiler goroutine #655

Merged: 35 commits merged into master from feat/single-profiler on Aug 18, 2023
Conversation

@vaind (Collaborator) commented Jun 18, 2023

Follow-up on #626: run a single profiler continuously (instead of a new profiler for each profiled transaction) and slice its data as needed.

The profileRecorder starts when the first transaction would be sampled and then runs until the app stops. I considered starting/stopping it whenever no transaction is running, but that would get more complicated and I'm not sure it's worth it, because it would also make overall app performance less predictable (it would differ depending on whether a transaction is currently being profiled). On the other hand, for very low sampling rates it would be beneficial to run only when needed. In any case, making the profiler start & pause on demand would be an incremental change that can be done in the future when the need arises (based on user feedback). See the sketch below for the general idea.
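For illustration only, here is a minimal sketch of the idea rather than the code in this PR: one long-running goroutine samples goroutine stacks on a ticker into a shared buffer, and each finished transaction slices out the samples that fall within its start/end window. The names (sample, profileRecorder, startProfileRecorder, sliceSamples), the unbounded slice, and the use of runtime.Stack are simplifications for the sketch; the actual implementation differs (e.g. it bounds the buffer like a ring buffer).

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// sample is one snapshot of all goroutine stacks at a given point in time.
type sample struct {
	time   time.Time
	stacks []byte
}

// profileRecorder owns the single background sampling goroutine.
type profileRecorder struct {
	mu      sync.Mutex
	samples []sample
	stop    chan struct{}
}

func startProfileRecorder(interval time.Duration) *profileRecorder {
	r := &profileRecorder{stop: make(chan struct{})}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		buf := make([]byte, 1<<16)
		for {
			select {
			case <-r.stop:
				return
			case t := <-ticker.C:
				// Capture the stacks of all goroutines at this tick.
				n := runtime.Stack(buf, true)
				r.mu.Lock()
				r.samples = append(r.samples, sample{time: t, stacks: append([]byte(nil), buf[:n]...)})
				r.mu.Unlock()
			}
		}
	}()
	return r
}

// sliceSamples returns the samples recorded within one transaction's lifetime.
func (r *profileRecorder) sliceSamples(start, end time.Time) []sample {
	r.mu.Lock()
	defer r.mu.Unlock()
	var out []sample
	for _, s := range r.samples {
		if !s.time.Before(start) && !s.time.After(end) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	rec := startProfileRecorder(10 * time.Millisecond)
	txStart := time.Now()
	time.Sleep(100 * time.Millisecond) // stand-in for a profiled transaction
	fmt.Println("samples in window:", len(rec.sliceSamples(txStart, time.Now())))
	close(rec.stop)
}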

Example of a captured profile: https://sentry-sdks.sentry.io/profiling/profile/sentry-go/7f41d10eeb8749a68d760e89c9ae8a8a/flamechart/?colorCoding=by+system+vs+application+frame&query=&sorting=call+order&tid=16&view=top+down

Additional remarks

The changes this PR makes are not going to be faster for an app with a low number of concurrent transactions; the profiler simply has to do more work to serve the same purpose. On the other hand, you should see a difference as the number of parallel profiled transactions grows. Unless there's a way to tune this in the demo, it may not be suitable for the evaluation.

For example, TestProfilerOverhead fails on the main branch after being adapted to run parallel goroutines. With 100 parallel goroutines, the overhead on this branch is about 3 %, while the "same" code running on the original implementation (the current master, which launches a separate profiler for each goroutine) fails with an overhead of over 30 % for the same number of goroutines.

Benchmark code from this branch adapted to `master` (package clause and imports added here for completeness, assuming the snippet lives next to the SDK's profiler test files; findPrimeNumber, startProfiling, isCI, and profilerResult are helpers from those files):

package sentry

import (
	"sync"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

func profilerBenchmark(t *testing.T, b *testing.B, withProfiling bool, arg int) {
	var wg sync.WaitGroup
	const numRoutines = 100
	wg.Add(numRoutines)
	for i := 0; i < numRoutines; i++ {
		go func() {
			var stopFn func() *profilerResult
			if withProfiling {
				stopFn = startProfiling(time.Now())
			}
			_ = findPrimeNumber(arg)
			if withProfiling {
				stopFn()
			}
			wg.Done()
		}()
	}
	wg.Wait()
}

func TestProfilerOverhead(t *testing.T) {
	if testing.Short() {
		t.Skip("Skipping overhead benchmark in short mode.")
	}
	if isCI() {
		t.Skip("Skipping on CI because the machines are too overloaded to run the test properly - they show between 3 and 30 %% overhead....")
	}

	// First, find a large-enough argument so that findPrimeNumber(arg) takes more than 100ms.
	var arg = 10000
	for {
		start := time.Now()
		_ = findPrimeNumber(arg)
		end := time.Now()
		if end.Sub(start) > 100*time.Millisecond {
			t.Logf("Found arg = %d that takes %d ms to process.", arg, end.Sub(start).Milliseconds())
			break
		}
		arg += 10000
	}

	var assert = assert.New(t)
	var baseline = testing.Benchmark(func(b *testing.B) { profilerBenchmark(t, b, false, arg) })
	var profiling = testing.Benchmark(func(b *testing.B) { profilerBenchmark(t, b, true, arg) })

	t.Logf("Without profiling: %v\n", baseline.String())
	t.Logf("With profiling:    %v\n", profiling.String())

	var overhead = float64(profiling.NsPerOp())/float64(baseline.NsPerOp())*100 - 100
	var maxOverhead = 5.0
	t.Logf("Profiling overhead: %f percent\n", overhead)
	assert.Less(overhead, maxOverhead)
}
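For a concrete sense of the threshold: if the baseline run takes 100 ms/op and the profiled run takes 103 ms/op, the computed overhead is 103/100 * 100 - 100 = 3 %, which is under the 5 % limit asserted above. The test is skipped in short mode and on CI, so it has to be run locally, e.g. with go test -run TestProfilerOverhead -v.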

@codecov codecov bot commented Jun 18, 2023

Codecov Report

Patch coverage: 89.30% and project coverage change: -0.02% ⚠️

Comparison is base (8b3833f) 80.80% compared to head (240d519) 80.78%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #655      +/-   ##
==========================================
- Coverage   80.80%   80.78%   -0.02%     
==========================================
  Files          45       45              
  Lines        4474     4574     +100     
==========================================
+ Hits         3615     3695      +80     
- Misses        729      742      +13     
- Partials      130      137       +7     
Files Changed          Coverage            Δ
traces_profiler.go     68.25% <63.63%>     -2.44% ⬇️
profiler.go            90.03% <90.50%>     -3.82% ⬇️
profiler_windows.go    100.00% <100.00%>   ø
tracing.go             85.76% <100.00%>    -0.19% ⬇️


@vaind vaind changed the title wip: single profiler goroutine Single profiler goroutine Jun 20, 2023
@vaind vaind marked this pull request as ready for review June 20, 2023 16:45
@vaind vaind requested a review from tonyo June 20, 2023 16:45
@tonyo tonyo requested a review from phacops June 22, 2023 10:41
@phacops (Contributor) commented Aug 9, 2023

Looking at profiles generated by this branch, I find the samples quite inconsistent, especially during HTTP requests. We'd see regular sampling every 10 ms up until some point, sometimes at the beginning of the request, and then a ~500 ms gap until the end of the request.

Can you look into why this is happening, please?

@vaind (Collaborator, Author) commented Aug 9, 2023

> Looking at profiles generated by this branch, I find the samples quite inconsistent, especially during HTTP requests. We'd see regular sampling every 10 ms up until some point, sometimes at the beginning of the request, and then a ~500 ms gap until the end of the request.
>
> Can you look into why this is happening, please?

From your description it looks like the wait time is not being sampled. I will have a look. What did you test this on?

@phacops (Contributor) commented Aug 9, 2023

> From your description it looks like the wait time is not being sampled. I will have a look. What did you test this on?

Thanks. We run this branch on https://github.com/getsentry/gib-potato/tree/main/potal.

@vaind (Collaborator, Author) commented Aug 10, 2023

> > From your description it looks like the wait time is not being sampled. I will have a look. What did you test this on?
>
> Thanks. We run this branch on getsentry/gib-potato@main/potal.

Thanks for the sample app. However, I wasn't able to reproduce the issue, likely because I don't know how to run it in a realistic environment. I've tried updating one of the handlers to run for about 90 ms, and the sample times look good to me.

Can you send me a dump from the 🥔 app profile that exhibits the behaviour? I don't have access to that project in Sentry.

@vaind (Collaborator, Author) commented Aug 15, 2023

> > > From your description it looks like the wait time is not being sampled. I will have a look. What did you test this on?
> >
> > Thanks. We run this branch on getsentry/gib-potato@main/potal.
>
> Thanks for the sample app. However, I wasn't able to reproduce the issue, likely because I don't know how to run it in a realistic environment. I've tried updating one of the handlers to run for about 90 ms, and the sample times look good to me.
>
> Can you send me a dump from the 🥔 app profile that exhibits the behaviour? I don't have access to that project in Sentry.

@phacops I've managed to have a look at a sample dump that has the gaps you're describing. In the attached sheet, you can see that for some time the ticks were coming at the expected interval, and then there are periods where ticks are missing. Interestingly, the gaps often span tens of tick intervals (i.e. the gap is ~100 ms when the column says "10"). I will have one more look at reproducing this locally, but so far I think it is a scheduling issue:

  1. either the profiler goroutine isn't getting scheduled because other running goroutines are taking up all the slots up to GOMAXPROCS - this is the most likely cause IMO
  2. or the whole process isn't being scheduled to run due to host machine overload. Although not completely related, the first three paragraphs of this comment may be relevant reading.
  3. or there's a bug in the profiler code which somehow loses samples (I'll try to rule this out tomorrow; so far I couldn't find any).

Can you verify what GOMAXPROCS is set to in the app? (A quick way to check is sketched below.)
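For reference, here is an illustrative snippet (not part of the SDK) to print the current setting:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("NumCPU:", runtime.NumCPU())
}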

As for an approach that wouldn't rely on a profiler goroutine being scheduled at a specific interval, I am not sure there's a feasible alternative at the moment. There's pprof.StartCPUProfile(), but it collects aggregate data - the number of executions for each stack - so we don't get the individual stacks in the order and with the times they were sampled.
We could, in theory, reimplement what cpuprof does and set up our own signal handlers (I don't have the source at hand for that), but that doesn't look compelling to maintain, TBH.
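For reference, a minimal sketch (not SDK code) of how pprof.StartCPUProfile is typically used; it writes an aggregated pprof-format profile to a writer when stopped, without exposing the individual timestamped samples, which is why it can't back the per-transaction slicing described above. The busy loop is only a placeholder workload.

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The profile is aggregated (stack -> sample counts) and written on stop;
	// there is no API to read individual samples with their wall-clock times.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Placeholder workload so the profiler has something to sample.
	sum := 0
	for i := 0; i < 100_000_000; i++ {
		sum += i
	}
	_ = sum
}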

@phacops (Contributor) commented Aug 18, 2023

It's running on a server with 1 vCPU so GOMAXPROCS is 1.

I think we can still go ahead with this and eventually address it if we see it becoming a problem. I actually don't think we can do much better than this when GOMAXPROCS is 1, so I'd call that working as expected.

We'll keep an eye on it, since data quality is very important for extracting value from profiling and we pay a lot of attention to these issues, but other SDKs aren't exempt from this type of issue either.

@vaind vaind enabled auto-merge (squash) August 18, 2023 13:45
@vaind vaind merged commit 82a00ab into master Aug 18, 2023
17 checks passed
@vaind vaind deleted the feat/single-profiler branch August 18, 2023 13:52