Single profiler goroutine #655
Conversation
Codecov Report

Patch coverage:

```
@@            Coverage Diff             @@
##           master     #655      +/-  ##
==========================================
- Coverage   80.80%   80.78%   -0.02%
==========================================
  Files          45       45
  Lines        4474     4574     +100
==========================================
+ Hits         3615     3695      +80
- Misses        729      742      +13
- Partials      130      137       +7
```

☔ View full report in Codecov by Sentry.
Force-pushed from 0489ab7 to 4605763.
Looking at profiles generated by this branch, I find the samples quite inconsistent, especially during HTTP requests. We'd see regular sampling every 10ms, sometimes only at the beginning of the request, then a 500ms gap until the end of the request. Can you look into why this is happening please?
From your description it looks like the wait time is not being sampled. I will have a look. What did you test this on?
Thanks. We run this branch on https://github.com/getsentry/gib-potato/tree/main/potal.
Thanks for the sample app. However, I wasn't able to reproduce the issue, likely because I don't know how to run the app in a realistic environment. I've tried updating one of the handlers to run for about 90 ms and the sample times look good to me. Can you send me a dump from the 🥔 app profile that exhibits the behaviour? I don't have access to that project in Sentry.
Force-pushed from e000133 to c5f1055.
@phacops I've managed to have a look at a sample dump that has the gaps you're describing. In the attached sheet, you can see that for some time, the ticks were coming at the expected interval. Then there are periods when ticks are missing. Interestingly, the gaps are often whole multiples of ten tick intervals (i.e. the gap is ~100ms if the column says "10"). I will have one more go at reproducing this locally, but so far I think it is a scheduling issue:
Can you verify what the […]? As for an alternative way that wouldn't rely on a profiler goroutine being scheduled at a specific interval, I am not sure there's a feasible alternative at the moment. There's […]
It's running on a server with 1 vCPU, so I think we can still go ahead with this and eventually address it if we see it being a problem. I actually don't think we can do much better than this if the […]. We'll keep an eye on it, since data quality is very important for extracting value from profiling, so we pay a lot of attention to these issues; but other SDKs aren't exempt from this type of issue either.
Follow-up on #626: run a single profiler continuously (instead of a new profiler for each profiled transaction) and slice its data as needed.

The `profileRecorder` starts when the first transaction would be sampled and then runs until the app is stopped. I was thinking about stopping it while no transaction is running, but that would get more complicated and I'm not sure it's worth it, because it would also make overall app performance less predictable (it would differ depending on whether a transaction is currently being profiled). On the other hand, for very low sampling rates it would be beneficial to only run when needed. In any case, making the profiler start & pause on demand would be an incremental change that can be done in the future when the need arises (based on user feedback).

Example of a captured profile: https://sentry-sdks.sentry.io/profiling/profile/sentry-go/7f41d10eeb8749a68d760e89c9ae8a8a/flamechart/?colorCoding=by+system+vs+application+frame&query=&sorting=call+order&tid=16&view=top+down
Additional remarks
The changes in this PR are not going to be faster for an app with a low number of concurrent transactions; the continuous profiler simply has to do more work to serve the same purpose. On the other hand, you should see a difference as the number of parallel profiled transactions grows. Unless there's a way to tune this in the demo, it may not be suitable for the evaluation.
For example, the `TestProfilerOverhead` test would fail on the `main` branch after adapting it to run parallel goroutines. The overhead on this branch for 100 parallel goroutines is about 3 %, while the "same" code running on the original implementation (the current state of the `master` branch), i.e. the one that launches a separate profiler for each goroutine, ended up failing with an overhead of over 30 % for the same number of goroutines.

Benchmark code from this branch adapted to `master`