fix: profiler timeout scheduling and data preservation #3135
Conversation
Performance metrics 🚀

| Revision | Plain | With Sentry | Diff |
| --- | --- | --- | --- |
| 0d95a04 | 1229.83 ms | 1244.94 ms | 15.11 ms |
| 411db4a | 1263.90 ms | 1270.30 ms | 6.40 ms |
| e6bd377 | 1242.76 ms | 1252.59 ms | 9.84 ms |
| 2fa2a0d | 1261.69 ms | 1270.29 ms | 8.59 ms |
App size

| Revision | Plain | With Sentry | Diff |
| --- | --- | --- | --- |
| 0d95a04 | 22.84 KiB | 402.13 KiB | 379.28 KiB |
| 411db4a | 22.84 KiB | 401.96 KiB | 379.11 KiB |
| e6bd377 | 22.84 KiB | 402.12 KiB | 379.28 KiB |
| 2fa2a0d | 22.84 KiB | 402.13 KiB | 379.29 KiB |
Sources/Sentry/SentryProfiler.mm (outdated)

```objc
// from NSTimer.h: Timers scheduled in an async context may never fire.
dispatch_async(dispatch_get_main_queue(), ^{ [self scheduleTimeoutTimer]; });
```
TIL, that's super weird. If you are sure about this and your PR fixes the memory growth issue, we also have to fix it here:
sentry-cocoa/Sources/Sentry/SentryTracer.m, lines 213 to 225 in 6c31077:

```objc
- (void)startDeadlineTimer
{
    __weak SentryTracer *weakSelf = self;
    self.deadlineTimer =
        [_configuration.timerFactory scheduledTimerWithTimeInterval:SENTRY_AUTO_TRANSACTION_DEADLINE
                                                            repeats:NO
                                                              block:^(NSTimer *_Nonnull timer) {
                                                                  if (weakSelf == nil) {
                                                                      return;
                                                                  }
                                                                  [weakSelf deadlineTimerFired];
                                                              }];
}
```
Looks like it, yeah! See the updated implementation, where I check `+[NSThread isMainThread]`: if you `dispatch_async` to the main queue when you're already on it, and the caller blocks the main queue, the timer will never even get scheduled!
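To illustrate the pattern being discussed, here is a minimal sketch (not the SDK's actual code; the `scheduleTimeoutTimer` call is the only name taken from this thread):

```objc
// Sketch: NSTimer needs a thread with a running run loop, so hop to the main
// queue only when we are NOT already on the main thread. Dispatching async
// while already on a blocked main queue would mean the timer is never
// scheduled at all.
- (void)scheduleTimeoutTimerOnMainThread
{
    if ([NSThread isMainThread]) {
        // Already on the main thread: schedule directly.
        [self scheduleTimeoutTimer];
    } else {
        __weak typeof(self) weakSelf = self;
        dispatch_async(dispatch_get_main_queue(), ^{
            [weakSelf scheduleTimeoutTimer];
        });
    }
}
```

The key design point is that the dispatch is conditional; an unconditional `dispatch_async` reintroduces the hang described above when the caller blocks the main queue.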
Done in #3138
```diff
-    SentryHub *hub)
+    SentryHub *hub
+#    if SENTRY_HAS_UIKIT
+    ,
+    SentryScreenFrames *gpuData
+#    endif // SENTRY_HAS_UIKIT
+    )
```
This is ugly, I know. I have a future refactor planned to move this entire function into SentryProfilerState.
```diff
@@ -231,28 +236,25 @@
     auto metrics = serializedMetrics;

 #    if SENTRY_HAS_UIKIT
     const auto framesTracker = SentryDependencyContainer.sharedInstance.framesTracker;
```
In this hunk, we now use a private copy of the data in `SentryDependencyContainer.sharedInstance.framesTracker.currentFrames`, copied and stored per profiler instance. That way, when the profiler running for the last in-flight transaction is stopped, we can reset the `SentryFramesTracker`'s version of the data while keeping the copies in profiler instances that may still be waiting in memory for their transactions to finish.
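A minimal sketch of that per-instance copy, assuming the profiler snapshots the frames data when it stops (the `_screenFrameData` ivar and `captureScreenFrameData` method are illustrative names, not the SDK's):

```objc
// Illustrative sketch: copy, don't reference, the shared tracker's data, so a
// later reset of SentryFramesTracker can't invalidate data that an in-flight
// transaction still needs.
@interface SentryProfiler () {
    SentryScreenFrames *_screenFrameData; // hypothetical private per-instance copy
}
@end

@implementation SentryProfiler
- (void)captureScreenFrameData
{
    // Snapshot the shared state into this instance.
    _screenFrameData =
        [SentryDependencyContainer.sharedInstance.framesTracker.currentFrames copy];
}
@end
```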
```objc
    [_gCurrentProfiler start];

    _gCurrentProfiler->_timeoutTimer = [SentryDependencyContainer.sharedInstance.timerFactory
        scheduledTimerWithTimeInterval:kSentryProfilerTimeoutInterval
                                target:self
                              selector:@selector(timeoutAbort)
                              userInfo:nil
                               repeats:NO];
#    if SENTRY_HAS_UIKIT
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(backgroundAbort)
                                                 name:UIApplicationWillResignActiveNotification
                                               object:nil];
#    endif // SENTRY_HAS_UIKIT
```
Lines 313-326 were moved from this class method into the (preexisting) instance method `-[SentryProfiler initWithHub:]`, a bit above this in the diff.
Sources/Sentry/SentryProfiler.mm (outdated)

```objc
    [self start];
    [self scheduleTimeoutTimer];

#    if SENTRY_HAS_UIKIT
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(backgroundAbort)
                                                 name:UIApplicationWillResignActiveNotification
                                               object:nil];
#    endif // SENTRY_HAS_UIKIT
```
Lines 293-301 were originally in the class method `+[SentryProfiler startWithHub:]`.
```objc
#    if SENTRY_HAS_UIKIT
    [SentryDependencyContainer.sharedInstance.framesTracker resetProfilingTimestamps];
#    endif // SENTRY_HAS_UIKIT
```
This is moved from here to SentryTracerConcurrency, where we know how many profilers are being held waiting for their associated tracers to finish. Only when there are no profilers left do we reset this data.
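A sketch of that reset-when-empty bookkeeping; the `_gProfilersToTracers` dictionary name comes from this PR, but its key type and the surrounding function are assumptions for illustration:

```objc
// Illustrative sketch: shared frames-tracker state is only reset once no
// profiler is still waiting on a tracer to finish.
static void
stopTrackingProfiler(SentryProfiler *profiler)
{
    // Hypothetical: remove this profiler from the global bookkeeping map.
    [_gProfilersToTracers removeObjectForKey:profiler];

    if (_gProfilersToTracers.count == 0) {
        // No outstanding profilers remain, so resetting shared state can't
        // lose data an in-flight transaction still needs.
#if SENTRY_HAS_UIKIT
        [SentryDependencyContainer.sharedInstance.framesTracker resetProfilingTimestamps];
#endif // SENTRY_HAS_UIKIT
    }
}
```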
```objc
+ (SentryEnvelopeItem *)envelopeItemForProfileData:(NSDictionary<NSString *, id> *)profile
                                         profileID:(SentryId *)profileID
{
    const auto JSONData = [SentrySerialization dataWithJSONObject:profile];
    if (JSONData == nil) {
        SENTRY_LOG_DEBUG(@"Failed to encode profile to JSON.");
        return nil;
    }

    const auto header = [[SentryEnvelopeItemHeader alloc] initWithType:SentryEnvelopeItemTypeProfile
                                                                length:JSONData.length];
    return [[SentryEnvelopeItem alloc] initWithHeader:header data:JSONData];
}
```
`+[envelopeItemForProfileData:profileID:]` was inlined at the sole callsite on new lines 379-387.
```objc
- (void)stop
{
    if (_profiler == nullptr) {
        SENTRY_LOG_WARN(@"No profiler instance found.");
        return;
    }
    if (!_profiler->isSampling()) {
        SENTRY_LOG_WARN(@"Profiler is not currently sampling.");
        return;
    }

    _profiler->stopSampling();
    [_metricProfiler stop];
    SENTRY_LOG_DEBUG(@"Stopped profiler %@.", self);
}
```
The preexisting instance method `-[stop]` was combined with the eliminated class method `+[stopProfilerForReason:]` to create the only instance method now needed to stop a profiler: `-[stopForReason:]` on new lines 432-442.
```objc
@@ -151,8 +151,7 @@ - (instancetype)initWithTransactionContext:(SentryTransactionContext *)transacti
    if (_configuration.profilesSamplerDecision.decision == kSentrySampleDecisionYes) {
        _isProfiling = YES;
        _startSystemTime = SentryCurrentDate.systemTime;
        [SentryProfiler startWithHub:hub];
        trackTracerWithID(self.traceId);
```
We now just call the bookkeeping function `trackTracerWithID` from `SentryProfiler`'s implementation (the function was renamed to `trackProfilerForTracer(SentryProfiler *profiler, SentryTracer *tracer)`).
```objc
    if (!profileEnvelopeItem) {
        [_hub captureTransaction:transaction withScope:_hub.scope];
        return;
    }

    stopTrackingTracerWithID(self.traceId, ^{ [SentryProfiler stop]; });
```
This is now handled automatically from `SentryTracerConcurrency.profilerForFinishedTracer` (which is only called from `+[SentryProfiler createProfilingEnvelopeItemForTransaction:]`, called on line 517 above).
```objc
 * Schedule a timeout timer on the main thread.
 * @warning from NSTimer.h: Timers scheduled in an async context may never fire.
 */
- (void)scheduleTimeoutTimer
```
This is one of the cruxes of the change. By not ensuring we were scheduling the timeout timer from the main thread, we created a situation where the profiler may never stop, leading to unbounded memory growth.
Any reason why we wouldn't use a dispatch timer that lets you explicitly specify which queue it fires on?
There was no specific reason we didn't use a dispatch timer AFAIK. Do you think we should change it in this PR?
It can be addressed separately 👍🏽
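For reference, the reviewer's suggestion would look roughly like this; a GCD timer fires on an explicitly chosen queue and has no run-loop requirement, unlike `NSTimer`. The function name and parameters here are hypothetical, not the SDK's actual API:

```objc
// Illustrative sketch of a one-shot dispatch timer bound to a specific queue.
// The caller must keep the returned source alive and eventually cancel it.
static dispatch_source_t
makeTimeoutTimer(NSTimeInterval interval, dispatch_queue_t queue, dispatch_block_t handler)
{
    dispatch_source_t timer
        = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue);
    dispatch_source_set_timer(timer,
        dispatch_time(DISPATCH_TIME_NOW, (int64_t)(interval * NSEC_PER_SEC)),
        DISPATCH_TIME_FOREVER, // one-shot: never repeats
        (uint64_t)(0.1 * NSEC_PER_SEC)); // leeway for power efficiency
    dispatch_source_set_event_handler(timer, handler);
    dispatch_resume(timer);
    return timer;
}
```

Because the queue is explicit, this sidesteps both the "async context may never fire" caveat and the main-queue dispatch problem discussed above.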
First pass, I still need to have a close look at SentryTracerConcurrency.m.
```objc
{
    const auto profileID = [[SentryId alloc] init];
```
That makes sense.
lgtm, just a question
I maybe found one important memory growth issue, apart from that, LGTM.
```objc
    } else {
        SENTRY_LOG_DEBUG(@"Waiting on %lu other tracers to complete: %@.",
            _gInFlightTraceIDs.count, _gInFlightTraceIDs);
```

```objc
SentryProfiler *_Nullable profilerForFinishedTracer(SentryTracer *tracer)
```
h: This method only gets called in `SentryProfiler.createProfilingEnvelopeItemForTransaction`, which is called by `SentryTracer.captureTransactionWithProfile`, which is called by `SentryTracer.finishInternal`. There is no guarantee that `SentryTracer.finishInternal` will actually call `SentryTracer.captureTransactionWithProfile`. Therefore, it can happen that `_gProfilersToTracers` and `_gTracersToProfilers` keep references to `SentryTracer` and `SentryProfiler` instances which should have been deallocated. This can lead to infinite memory growth, or am I missing something? We could solve this by keeping weak references here, or we need some other way of cleaning up the two dictionaries. I think this problem already existed before, so we can also address it in an extra PR.
We can't use weak refs, because there won't be anything else keeping valid profilers alive. The simplest solution I can think of is to call some cleanup function that removes them from the dicts at every early return in the codepath that would otherwise lead to capturing a transaction.

Are there any early returns in `finishInternal` that shouldn't do such cleanup, because it will be called again later? And conversely, are there any callers of `finishInternal` with early returns where we'd also need to do this cleanup, because there won't be another call to `finishInternal`?

Happy to tackle this in a separate PR as stated.
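The cleanup being proposed could be sketched like this; the dictionary names come from this PR, but the function name, key choices, and call sites are assumptions for illustration only:

```objc
// Hypothetical cleanup helper: every early return that abandons the path to
// captureTransactionWithProfile would call this, so the global maps never
// retain a tracer/profiler pair that can no longer be consumed.
static void
discardProfilerForTracer(SentryTracer *tracer)
{
    // Look up and drop both directions of the association.
    SentryProfiler *profiler = _gTracersToProfilers[tracer.traceId];
    if (profiler == nil) {
        return; // nothing tracked for this tracer
    }
    [_gTracersToProfilers removeObjectForKey:tracer.traceId];
    [_gProfilersToTracers removeObjectForKey:profiler];
}
```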
📜 Description
Fixes an unbounded memory growth issue when a transaction, and therefore the profiler, is started from a non-main context, where the timeout timer needs to be scheduled on the main thread.
Making that small change led me to find the design flaw that was causing data loss (originally a crash) in #3082. Because we didn't keep instances of profilers that had timed out at 30 s, while their associated transactions could go on for up to 430 more seconds before timing out, those transactions could query for data that had been lost. I think there was also a race between the moment a profiler was queried for its data and the moment another one was started, losing the data being queried while it was being processed.
So, `SentryTracerConcurrency` and `SentryProfiler` are refactored here to remove the assumption that only one profiler instance would ever need to be kept in `SentryProfiler->_gCurrentProfiler`, and to instead hold references to all outstanding profilers in `SentryTracerConcurrency._gProfilersToTracers`.

💡 Motivation and Context
Customer reported issue.
💚 How did you test it?
Reproduced it manually by starting many transactions on a non-main queue and never stopping them, so that the profiler should time out. Validated that dispatching the timeout-timer scheduling to the main queue fixed the issue.
Wrote a regression test in #3145 (the new base for this PR) so we can see it fail there and pass here. It's not a new test, but rather, fixes a preexisting test that didn't actually function correctly until gaining the capability to mock backtraces in #3133.
📝 Checklist
You have to check all boxes before merging.

🔮 Next steps