Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: profiler timeout scheduling and data preservation #3135

Conversation

armcknight
Copy link
Member

@armcknight armcknight commented Jul 7, 2023

📜 Description

Fixes an unbounded memory growth issue if a transaction, and therefore the profiler, is started from a non-main context, where the timeout timer needs to be scheduled on the main thread.

Making that small change led me to finding the design flaw that was causing data loss (originally a crash) in #3082. Because we didn't keep instances of profilers that had timed out at 30s, while their associated transactions would go on for up to 430 more seconds until timing out, those transactions could query for data that had been lost. I think there was a race at the moment a profiler was queried for its data and when another one was started, thus losing the data being queried as it was being processed.

So, SentryTracerConcurrency and SentryProfiler are refactored here to remove the assumption that only one profiler instance would ever need to be kept in SentryProfiler->_gCurrentProfiler, and instead hold references to all outstanding profilers in SentryTracerConcurrency._gProfilersToTracers.

💡 Motivation and Context

Customer reported issue.

💚 How did you test it?

Reproduced it manually, by starting many txs on a non-main queue and never stopping them, so that the profiler should time out. Validated that the dispatch to schedule the timeout timer fixed the issue.

Wrote a regression test in #3145 (the new base for this PR) so we can see it fail there and pass here. It's not a new test, but rather, fixes a preexisting test that didn't actually function correctly until gaining the capability to mock backtraces in #3133.

📝 Checklist

You have to check all boxes before merging:

  • I reviewed the submitted code.
  • I added tests to verify the changes.
  • No new PII added or SDK only sends newly added PII if sendDefaultPII is enabled.
  • I updated the docs if needed.
  • Review from the native team if needed.
  • No breaking change or entry added to the changelog.
  • No breaking change for hybrid SDKs or communicated to hybrid SDKs.

🔮 Next steps

@github-actions
Copy link

github-actions bot commented Jul 7, 2023

Performance metrics 🚀

  Plain With Sentry Diff
Startup time 1237.31 ms 1242.66 ms 5.35 ms
Size 22.84 KiB 402.12 KiB 379.28 KiB

Previous results on branch: armcknight/fix/profiler-timeout-failure

Startup times

Revision Plain With Sentry Diff
0d95a04 1229.83 ms 1244.94 ms 15.11 ms
411db4a 1263.90 ms 1270.30 ms 6.40 ms
e6bd377 1242.76 ms 1252.59 ms 9.84 ms
2fa2a0d 1261.69 ms 1270.29 ms 8.59 ms

App size

Revision Plain With Sentry Diff
0d95a04 22.84 KiB 402.13 KiB 379.28 KiB
411db4a 22.84 KiB 401.96 KiB 379.11 KiB
e6bd377 22.84 KiB 402.12 KiB 379.28 KiB
2fa2a0d 22.84 KiB 402.13 KiB 379.29 KiB

@armcknight armcknight changed the title extract implementation of SentryProfilerState; extract SentrySample fix: profiler timeout timer from non-main contexts Jul 7, 2023
Comment on lines 315 to 316
// from NSTimer.h: Timers scheduled in an async context may never fire.
dispatch_async(dispatch_get_main_queue(), ^{ [self scheduleTimeoutTimer]; });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL; that's super weird. If you are sure about this and your PR fixes the memory growth issue, we also have to fix it here

- (void)startDeadlineTimer
{
__weak SentryTracer *weakSelf = self;
self.deadlineTimer =
[_configuration.timerFactory scheduledTimerWithTimeInterval:SENTRY_AUTO_TRANSACTION_DEADLINE
repeats:NO
block:^(NSTimer *_Nonnull timer) {
if (weakSelf == nil) {
return;
}
[weakSelf deadlineTimerFired];
}];
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it, yeah! See the updated implementation where I check +[NSThread isMainThread], because if you dispatch_async to the main queue when you're already on it, and the caller blocks the main queue, it'll never even get scheduled!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #3138

@armcknight armcknight force-pushed the armcknight/fix/profiler-timeout-failure branch from 740e21b to 1371ba6 Compare July 11, 2023 01:58
Comment on lines -158 to +164
SentryHub *hub)
SentryHub *hub
# if SENTRY_HAS_UIKIT
,
SentryScreenFrames *gpuData
# endif // SENTRY_HAS_UIKIT
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ugly, I know. I have a future refactor planned to move this entire function into SentryProfilerState.

@@ -231,28 +236,25 @@
auto metrics = serializedMetrics;

# if SENTRY_HAS_UIKIT
const auto framesTracker = SentryDependencyContainer.sharedInstance.framesTracker;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this hunk, we now use a private copy of the data in SentryDependencyContainer.sharedInstance.framesTracker.currentFrames that is copied/stored per profiler instance. That way, when the profiler running for the last in-flight transaction is stopped, we can reset the SentryFramesTracker's version of the data, while keeping the copies in instances of profilers that may still be waiting in memory for their transactions to finish.

Comment on lines -313 to -326
[_gCurrentProfiler start];

_gCurrentProfiler->_timeoutTimer = [SentryDependencyContainer.sharedInstance.timerFactory
scheduledTimerWithTimeInterval:kSentryProfilerTimeoutInterval
target:self
selector:@selector(timeoutAbort)
userInfo:nil
repeats:NO];
# if SENTRY_HAS_UIKIT
[[NSNotificationCenter defaultCenter] addObserver:self
selector:@selector(backgroundAbort)
name:UIApplicationWillResignActiveNotification
object:nil];
# endif // SENTRY_HAS_UIKIT
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 313-326 were moved from this class method into the (preexisting) instance method -[SentryProfiler initWithHub:] a bit above this in the diff..

Comment on lines 293 to 304
[self start];
[self scheduleTimeoutTimer];

# if SENTRY_HAS_UIKIT
[[NSNotificationCenter defaultCenter] addObserver:self
selector:@selector(backgroundAbort)
name:UIApplicationWillResignActiveNotification
object:nil];
# endif // SENTRY_HAS_UIKIT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 293-301 were originally in the class method + [SentryProfiler startWithHub:].

Comment on lines -309 to -311
# if SENTRY_HAS_UIKIT
[SentryDependencyContainer.sharedInstance.framesTracker resetProfilingTimestamps];
# endif // SENTRY_HAS_UIKIT
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed from here to SentryTracerConcurrency, where we can know how many profilers are being held waiting for their associated tracers to finish. Only when there are no profilers left do we reset this data.

Comment on lines -509 to -521
+ (SentryEnvelopeItem *)envelopeItemForProfileData:(NSDictionary<NSString *, id> *)profile
profileID:(SentryId *)profileID
{
const auto JSONData = [SentrySerialization dataWithJSONObject:profile];
if (JSONData == nil) {
SENTRY_LOG_DEBUG(@"Failed to encode profile to JSON.");
return nil;
}

const auto header = [[SentryEnvelopeItemHeader alloc] initWithType:SentryEnvelopeItemTypeProfile
length:JSONData.length];
return [[SentryEnvelopeItem alloc] initWithHeader:header data:JSONData];
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+[envelopeItemForProfileData:profileID:] was inlined at the sole callsite on new lines 379-387

Comment on lines -493 to -507
- (void)stop
{
if (_profiler == nullptr) {
SENTRY_LOG_WARN(@"No profiler instance found.");
return;
}
if (!_profiler->isSampling()) {
SENTRY_LOG_WARN(@"Profiler is not currently sampling.");
return;
}

_profiler->stopSampling();
[_metricProfiler stop];
SENTRY_LOG_DEBUG(@"Stopped profiler %@.", self);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preexisting instance method -[stop] was combined with the eliminated class method +[stopProfilerForReason:] to create the only instance method now needed to stop a profiler: -[stopForReason:] on new lines 432-442.

@@ -151,8 +151,7 @@ - (instancetype)initWithTransactionContext:(SentryTransactionContext *)transacti
if (_configuration.profilesSamplerDecision.decision == kSentrySampleDecisionYes) {
_isProfiling = YES;
_startSystemTime = SentryCurrentDate.systemTime;
[SentryProfiler startWithHub:hub];
trackTracerWithID(self.traceId);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now just call the bookkeeping function trackTracerWithID from SentryProfiler's implementation (the function was renamed to trackProfilerForTracer(SentryProfiler *profiler, SentryTracer *tracer)

if (!profileEnvelopeItem) {
[_hub captureTransaction:transaction withScope:_hub.scope];
return;
}

stopTrackingTracerWithID(self.traceId, ^{ [SentryProfiler stop]; });
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now handled automatically from SentryTracerConcurrency.profilerForFinishedTracer (which is only called from +[SentryProfiler createProfilingEnvelopeItemForTransaction:] called on line 517 above).

* Schedule a timeout timer on the main thread.
* @warning from NSTimer.h: Timers scheduled in an async context may never fire.
*/
- (void)scheduleTimeoutTimer
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the cruxes of the change. By not ensuring we were scheduling the timeout timer from the main thread, we created a situation where the profiler may never stop, leading to unbounded memory growth.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we wouldn't use a dispatch timer that lets you explicitly specify which queue it fires on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no specific reason we didn't use a dispatch timer AFAIK. Do you think we should change it in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be addressed separately 👍🏽

@armcknight armcknight changed the title fix: profiler timeout timer from non-main contexts fix: profiler timeout scheduling and data preservation Jul 11, 2023
Base automatically changed from armcknight/ref/profiler-mocks to main July 11, 2023 05:05
Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, I still need to have a close look at SentryTracerConcurrency.m.

Sources/Sentry/SentryProfiler.mm Show resolved Hide resolved
{
const auto profileID = [[SentryId alloc] init];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

Sources/Sentry/SentryScreenFrames.m Outdated Show resolved Hide resolved
Sources/Sentry/SentryProfiler.mm Outdated Show resolved Hide resolved
Copy link
Member

@indragiek indragiek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just a question

* Schedule a timeout timer on the main thread.
* @warning from NSTimer.h: Timers scheduled in an async context may never fire.
*/
- (void)scheduleTimeoutTimer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we wouldn't use a dispatch timer that lets you explicitly specify which queue it fires on?

Sources/Sentry/SentryTracerConcurrency.mm Outdated Show resolved Hide resolved
Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I maybe found one important memory growth issue, apart from that, LGTM.

Sources/Sentry/SentryTracerConcurrency.mm Outdated Show resolved Hide resolved
Sources/Sentry/SentryTracerConcurrency.mm Outdated Show resolved Hide resolved
} else {
SENTRY_LOG_DEBUG(@"Waiting on %lu other tracers to complete: %@.", _gInFlightTraceIDs.count,
_gInFlightTraceIDs);
SentryProfiler *_Nullable profilerForFinishedTracer(SentryTracer *tracer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h: This method only gets called in SentryProfiler.createProfilingEnvelopeItemForTransaction, which is called by, SentryTracer.captureTransactionWithProfile, which is called by SentryTracer.finishInternal. There is no guarantee that SentryTracer.finishInternal will actually call SentryTracer.captureTransactionWithProfile. Therefore, it can happen that _gProfilersToTracers and _gTracersToProfilers keep references to SentryTracer and SentryProfile which should have been deallocated. This can lead to infinite memory growth, or am I missing something? We could solve this by keeping weak references here or we need some other way of cleaning up the two dictionaries. I think this problem already existed before, so we can also address in an extra PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use weak refs because there won't be anything else keeping valid profilers alive. The simplest solution I can think of is to call some cleanup function that removes them from the dicts at every early return in the codepath that would otherwise lead to capturing a transaction.

Are there any early returns in finishInternal that shouldn't do such cleanup, because it will be called again later?

And conversely, are there any callers to finishInternal that have early returns where we'd also need to do this cleanup because there won't be another call to finishInternal?

Happy to tackle this in a separate PR as stated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this in two change: #3154 to clean up profilers for discarded transactions, and weak references in #3155

@armcknight armcknight force-pushed the armcknight/fix/profiler-timeout-failure branch from 1f5e025 to be1808c Compare July 13, 2023 00:06
@armcknight armcknight changed the base branch from main to armcknight/fix/memory-growth-regression-test July 13, 2023 00:06
@armcknight armcknight marked this pull request as ready for review July 13, 2023 00:10
@armcknight armcknight force-pushed the armcknight/fix/profiler-timeout-failure branch from be1808c to 998cd6b Compare July 13, 2023 00:22
@armcknight armcknight merged commit f24e9a7 into armcknight/fix/memory-growth-regression-test Jul 14, 2023
@armcknight armcknight deleted the armcknight/fix/profiler-timeout-failure branch July 14, 2023 00:10
philipphofmann added a commit that referenced this pull request Jul 19, 2023
Explain that not symbolicating locally speeds up certain SDK actions.
philipphofmann added a commit that referenced this pull request Jul 19, 2023
Explain that not symbolicating locally speeds up certain SDK actions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants