fix: profiler timeout scheduling and data preservation #3135

armcknight · 2023-07-07T07:00:16Z

📜 Description

Fixes unbounded memory growth that occurs when a transaction, and therefore the profiler, is started from a non-main context. In that case the timeout timer needs to be scheduled on the main thread, otherwise it may never fire.

Making that small change led me to find the design flaw that was causing data loss (originally a crash) in #3082. Because we didn't keep instances of profilers that had timed out at 30s, while their associated transactions could go on for up to 430 more seconds until timing out, those transactions could query for data that had already been lost. I think there was a race between querying a profiler for its data and starting another one, so the data being queried was lost while it was being processed.

So, SentryTracerConcurrency and SentryProfiler are refactored here to remove the assumption that only one profiler instance would ever need to be kept in SentryProfiler->_gCurrentProfiler, and instead hold references to all outstanding profilers in SentryTracerConcurrency._gProfilersToTracers.
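To illustrate the bookkeeping described above, here is a minimal C++ sketch of a registry that keeps every outstanding profiler alive until its last tracer has finished. It is not the SDK's implementation: `Profiler`, `ProfilerRegistry`, the integer tracer IDs and `outstandingProfilers` are hypothetical stand-ins, and only the names `trackProfilerForTracer` and `profilerForFinishedTracer` are borrowed from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-in for SentryProfiler: just the data a tracer will ask for.
struct Profiler {
    std::string samples;
    bool timedOut = false;
};

// Hypothetical registry playing the role of the bookkeeping in SentryTracerConcurrency.
// It owns every profiler that at least one unfinished tracer may still query.
class ProfilerRegistry {
public:
    void trackProfilerForTracer(std::shared_ptr<Profiler> profiler, int tracerID)
    {
        tracersToProfilers_[tracerID] = profiler;
        profilersToTracers_[profiler].insert(tracerID);
    }

    // Hands back the profiler for a finished tracer and forgets the pairing.
    // The profiler leaves the registry once its last tracer has finished.
    std::shared_ptr<Profiler> profilerForFinishedTracer(int tracerID)
    {
        const auto found = tracersToProfilers_.find(tracerID);
        if (found == tracersToProfilers_.end()) {
            return nullptr;
        }
        const auto profiler = found->second;
        tracersToProfilers_.erase(found);

        const auto tracers = profilersToTracers_.find(profiler);
        tracers->second.erase(tracerID);
        if (tracers->second.empty()) {
            profilersToTracers_.erase(tracers);
        }
        return profiler;
    }

    std::size_t outstandingProfilers() const { return profilersToTracers_.size(); }

private:
    std::map<int, std::shared_ptr<Profiler>> tracersToProfilers_;
    std::map<std::shared_ptr<Profiler>, std::set<int>> profilersToTracers_;
};
```

In this sketch a profiler that timed out stays in the registry with its data intact, so a tracer that finishes much later can still retrieve it, even if a newer profiler has been started in the meantime.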

💡 Motivation and Context

Customer reported issue.

💚 How did you test it?

Reproduced it manually, by starting many txs on a non-main queue and never stopping them, so that the profiler should time out. Validated that the dispatch to schedule the timeout timer fixed the issue.

Wrote a regression test in #3145 (the new base for this PR) so we can see it fail there and pass here. It's not a new test; rather, it fixes a preexisting test that didn't actually function correctly until gaining the capability to mock backtraces in #3133.

📝 Checklist

You have to check all boxes before merging:

I reviewed the submitted code.
I added tests to verify the changes.
No new PII added or SDK only sends newly added PII if sendDefaultPII is enabled.
I updated the docs if needed.
Review from the native team if needed.
No breaking change or entry added to the changelog.
No breaking change for hybrid SDKs or communicated to hybrid SDKs.

🔮 Next steps

github-actions · 2023-07-07T07:16:00Z

Performance metrics 🚀

	Plain	With Sentry	Diff
Startup time	1237.31 ms	1242.66 ms	5.35 ms
Size	22.84 KiB	402.12 KiB	379.28 KiB

Previous results on branch: armcknight/fix/profiler-timeout-failure

Startup times

Revision	Plain	With Sentry	Diff
`0d95a04`	1229.83 ms	1244.94 ms	15.11 ms
`411db4a`	1263.90 ms	1270.30 ms	6.40 ms
`e6bd377`	1242.76 ms	1252.59 ms	9.84 ms
`2fa2a0d`	1261.69 ms	1270.29 ms	8.59 ms

App size

Revision	Plain	With Sentry	Diff
`0d95a04`	22.84 KiB	402.13 KiB	379.28 KiB
`411db4a`	22.84 KiB	401.96 KiB	379.11 KiB
`e6bd377`	22.84 KiB	402.12 KiB	379.28 KiB
`2fa2a0d`	22.84 KiB	402.13 KiB	379.29 KiB

philipphofmann · 2023-07-07T09:17:13Z

Sources/Sentry/SentryProfiler.mm

+    // from NSTimer.h: Timers scheduled in an async context may never fire.
+    dispatch_async(dispatch_get_main_queue(), ^{ [self scheduleTimeoutTimer]; });


TIL; that's super weird. If you are sure about this and your PR fixes the memory growth issue, we also have to fix it here

sentry-cocoa/Sources/Sentry/SentryTracer.m

Lines 213 to 225 in 6c31077

- (void)startDeadlineTimer
{
    __weak SentryTracer *weakSelf = self;
    self.deadlineTimer =
        [_configuration.timerFactory scheduledTimerWithTimeInterval:SENTRY_AUTO_TRANSACTION_DEADLINE
                                                            repeats:NO
                                                              block:^(NSTimer *_Nonnull timer) {
                                                                  if (weakSelf == nil) {
                                                                      return;
                                                                  }
                                                                  [weakSelf deadlineTimerFired];
                                                              }];
}

Looks like it, yeah! See the updated implementation where I check +[NSThread isMainThread], because if you dispatch_async to the main queue when you're already on it, and the caller blocks the main queue, it'll never even get scheduled!

Done in #3138
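The main-thread check discussed in this thread can be sketched in plain C++ as follows. This is only a model of the idea, not the code from the PR: `MainQueue` is a hypothetical stand-in for the main dispatch queue, and `scheduleOnMain` is a hypothetical helper name. The point is that work is run directly when the caller is already on the owning thread, and only enqueued otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the main dispatch queue. Enqueued work only runs
// when the owning thread calls drain(), so a blocked owner never runs it.
class MainQueue {
public:
    explicit MainQueue(std::thread::id owner) : owner_(owner) {}

    bool isCurrentThread() const { return std::this_thread::get_id() == owner_; }

    void async(std::function<void()> work)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(work));
    }

    // Runs everything enqueued so far; meant to be called by the owning thread.
    void drain()
    {
        std::deque<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (auto &work : batch) {
            work();
        }
    }

private:
    std::thread::id owner_;
    std::mutex mutex_;
    std::deque<std::function<void()>> pending_;
};

// Schedule directly when already on the main thread, otherwise hop to it.
inline void scheduleOnMain(MainQueue &mainQueue, std::function<void()> scheduleTimer)
{
    if (mainQueue.isCurrentThread()) {
        scheduleTimer();
    } else {
        mainQueue.async(std::move(scheduleTimer));
    }
}
```

With the direct call on the owning thread, the timer gets scheduled even if the caller goes on to block that thread, which is the failure mode described above for an unconditional async dispatch.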

armcknight · 2023-07-11T02:02:47Z

Sources/Sentry/SentryProfiler.mm

-    SentryHub *hub)
+    SentryHub *hub
+#    if SENTRY_HAS_UIKIT
+    ,
+    SentryScreenFrames *gpuData
+#    endif // SENTRY_HAS_UIKIT
+)


This is ugly, I know. I have a future refactor planned to move this entire function into SentryProfilerState.

armcknight · 2023-07-11T02:03:54Z

Sources/Sentry/SentryProfiler.mm

@@ -231,28 +236,25 @@
    auto metrics = serializedMetrics;

 #    if SENTRY_HAS_UIKIT
-    const auto framesTracker = SentryDependencyContainer.sharedInstance.framesTracker;


In this hunk, we now use a private copy of the data in SentryDependencyContainer.sharedInstance.framesTracker.currentFrames that is copied/stored per profiler instance. That way, when the profiler running for the last in-flight transaction is stopped, we can reset the SentryFramesTracker's version of the data, while keeping the copies in instances of profilers that may still be waiting in memory for their transactions to finish.
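As a rough illustration of that copy-then-reset idea, here is a small C++ sketch. The types and the `stopProfiler` helper are hypothetical, the timestamps are plain doubles rather than the SDK's frame data, and the caller is assumed to know whether other profilers are still running.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the shared SentryFramesTracker data.
struct FramesTracker {
    std::vector<double> frameTimestamps;
    void resetProfilingTimestamps() { frameTimestamps.clear(); }
};

// Hypothetical stand-in for a profiler instance holding its own copy.
struct Profiler {
    std::vector<double> frameTimestamps;
};

// Copies the shared data into the profiler being stopped, then resets the
// shared data only if no other profiler is still running.
inline void stopProfiler(Profiler &profiler, FramesTracker &tracker, bool otherProfilersRunning)
{
    profiler.frameTimestamps = tracker.frameTimestamps;
    if (!otherProfilersRunning) {
        tracker.resetProfilingTimestamps();
    }
}
```

Because each stopped profiler owns its copy, resetting the shared tracker later does not affect profilers that are still waiting for their transactions to finish.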

armcknight · 2023-07-11T02:05:42Z

Sources/Sentry/SentryProfiler.mm

-    [_gCurrentProfiler start];
-
-    _gCurrentProfiler->_timeoutTimer = [SentryDependencyContainer.sharedInstance.timerFactory
-        scheduledTimerWithTimeInterval:kSentryProfilerTimeoutInterval
-                                target:self
-                              selector:@selector(timeoutAbort)
-                              userInfo:nil
-                               repeats:NO];
-#    if SENTRY_HAS_UIKIT
-    [[NSNotificationCenter defaultCenter] addObserver:self
-                                             selector:@selector(backgroundAbort)
-                                                 name:UIApplicationWillResignActiveNotification
-                                               object:nil];
-#    endif // SENTRY_HAS_UIKIT


Lines 313-326 were moved from this class method into the (preexisting) instance method -[SentryProfiler initWithHub:] a bit above this in the diff.

armcknight · 2023-07-11T02:06:09Z

Sources/Sentry/SentryProfiler.mm

+    [self start];
+    [self scheduleTimeoutTimer];
+
+#    if SENTRY_HAS_UIKIT
+    [[NSNotificationCenter defaultCenter] addObserver:self
+                                             selector:@selector(backgroundAbort)
+                                                 name:UIApplicationWillResignActiveNotification
+                                               object:nil];
+#    endif // SENTRY_HAS_UIKIT
+


Lines 293-301 were originally in the class method +[SentryProfiler startWithHub:].

armcknight · 2023-07-11T02:07:01Z

Sources/Sentry/SentryProfiler.mm

-#    if SENTRY_HAS_UIKIT
-    [SentryDependencyContainer.sharedInstance.framesTracker resetProfilingTimestamps];
-#    endif // SENTRY_HAS_UIKIT


This is moved from here to SentryTracerConcurrency, where we can know how many profilers are being held waiting for their associated tracers to finish. Only when there are no profilers left do we reset this data.

armcknight · 2023-07-11T02:20:35Z

Sources/Sentry/SentryProfiler.mm

-+ (SentryEnvelopeItem *)envelopeItemForProfileData:(NSDictionary<NSString *, id> *)profile
-                                         profileID:(SentryId *)profileID
-{
-    const auto JSONData = [SentrySerialization dataWithJSONObject:profile];
-    if (JSONData == nil) {
-        SENTRY_LOG_DEBUG(@"Failed to encode profile to JSON.");
-        return nil;
-    }
-
-    const auto header = [[SentryEnvelopeItemHeader alloc] initWithType:SentryEnvelopeItemTypeProfile
-                                                                length:JSONData.length];
-    return [[SentryEnvelopeItem alloc] initWithHeader:header data:JSONData];
-}


+[envelopeItemForProfileData:profileID:] was inlined at the sole callsite on new lines 379-387

armcknight · 2023-07-11T02:21:58Z

Sources/Sentry/SentryProfiler.mm

- (void)stop
-{
-    if (_profiler == nullptr) {
-        SENTRY_LOG_WARN(@"No profiler instance found.");
-        return;
-    }
-    if (!_profiler->isSampling()) {
-        SENTRY_LOG_WARN(@"Profiler is not currently sampling.");
-        return;
-    }
-
-    _profiler->stopSampling();
-    [_metricProfiler stop];
-    SENTRY_LOG_DEBUG(@"Stopped profiler %@.", self);
-}


The preexisting instance method -[stop] was combined with the eliminated class method +[stopProfilerForReason:] to create the only instance method now needed to stop a profiler: -[stopForReason:] on new lines 432-442.

armcknight · 2023-07-11T02:23:18Z

Sources/Sentry/SentryTracer.m

@@ -151,8 +151,7 @@ - (instancetype)initWithTransactionContext:(SentryTransactionContext *)transacti
    if (_configuration.profilesSamplerDecision.decision == kSentrySampleDecisionYes) {
        _isProfiling = YES;
        _startSystemTime = SentryCurrentDate.systemTime;
-        [SentryProfiler startWithHub:hub];
-        trackTracerWithID(self.traceId);


We now just call the bookkeeping function trackTracerWithID from SentryProfiler's implementation (the function was renamed to trackProfilerForTracer(SentryProfiler *profiler, SentryTracer *tracer)).

armcknight · 2023-07-11T02:24:39Z

Sources/Sentry/SentryTracer.m

    if (!profileEnvelopeItem) {
        [_hub captureTransaction:transaction withScope:_hub.scope];
        return;
    }

-    stopTrackingTracerWithID(self.traceId, ^{ [SentryProfiler stop]; });


This is now handled automatically from SentryTracerConcurrency.profilerForFinishedTracer (which is only called from +[SentryProfiler createProfilingEnvelopeItemForTransaction:] called on line 517 above).

armcknight · 2023-07-11T02:27:53Z

Sources/Sentry/SentryProfiler.mm

+ * Schedule a timeout timer on the main thread.
+ * @warning from NSTimer.h: Timers scheduled in an async context may never fire.
+ */
+- (void)scheduleTimeoutTimer


This is one of the cruxes of the change. By not ensuring we were scheduling the timeout timer from the main thread, we created a situation where the profiler may never stop, leading to unbounded memory growth.

Any reason why we wouldn't use a dispatch timer that lets you explicitly specify which queue it fires on?

There was no specific reason we didn't use a dispatch timer AFAIK. Do you think we should change it in this PR?

It can be addressed separately 👍🏽

philipphofmann

First pass, I still need to have a close look at SentryTracerConcurrency.m.


philipphofmann · 2023-07-11T15:45:20Z

Sources/Sentry/SentryProfiler.mm

 {
-    const auto profileID = [[SentryId alloc] init];


That makes sense.


indragiek

lgtm, just a question

indragiek · 2023-07-12T03:43:59Z

Sources/Sentry/SentryProfiler.mm

+ * Schedule a timeout timer on the main thread.
+ * @warning from NSTimer.h: Timers scheduled in an async context may never fire.
+ */
+- (void)scheduleTimeoutTimer


Any reason why we wouldn't use a dispatch timer that lets you explicitly specify which queue it fires on?


philipphofmann

I may have found one important memory growth issue; apart from that, LGTM.


philipphofmann · 2023-07-12T15:34:22Z

Sources/Sentry/SentryTracerConcurrency.mm

-    } else {
-        SENTRY_LOG_DEBUG(@"Waiting on %lu other tracers to complete: %@.", _gInFlightTraceIDs.count,
-            _gInFlightTraceIDs);
+SentryProfiler *_Nullable profilerForFinishedTracer(SentryTracer *tracer)


h: This method only gets called in SentryProfiler.createProfilingEnvelopeItemForTransaction, which is called by SentryTracer.captureTransactionWithProfile, which is called by SentryTracer.finishInternal. There is no guarantee that SentryTracer.finishInternal will actually call SentryTracer.captureTransactionWithProfile. Therefore, it can happen that _gProfilersToTracers and _gTracersToProfilers keep references to SentryTracer and SentryProfiler instances that should have been deallocated. This can lead to unbounded memory growth, or am I missing something? We could solve this by keeping weak references here, or we need some other way of cleaning up the two dictionaries. I think this problem already existed before, so we can also address it in a separate PR.

We can't use weak refs because there won't be anything else keeping valid profilers alive. The simplest solution I can think of is to call some cleanup function that removes them from the dicts at every early return in the codepath that would otherwise lead to capturing a transaction.

Are there any early returns in finishInternal that shouldn't do such cleanup, because it will be called again later?

And conversely, are there any callers to finishInternal that have early returns where we'd also need to do this cleanup because there won't be another call to finishInternal?

Happy to tackle this in a separate PR as stated.

I did this in two changes: #3154 to clean up profilers for discarded transactions, and weak references in #3155.
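For the cleanup-at-every-early-return idea raised in this thread, one possible shape is a scope guard, sketched below in C++. This is an assumption about how such cleanup could be structured, not what #3154 or #3155 actually do: `ProfilerMap`, `DiscardProfilerGuard`, `finishTracer` and its parameters are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>

struct Profiler { }; // hypothetical stand-in for SentryProfiler

using ProfilerMap = std::map<int, std::shared_ptr<Profiler>>;

// Removes the tracer's profiler from the map on scope exit unless dismissed.
class DiscardProfilerGuard {
public:
    DiscardProfilerGuard(ProfilerMap &profilers, int tracerID)
        : profilers_(profilers), tracerID_(tracerID)
    {
    }
    ~DiscardProfilerGuard()
    {
        if (armed_) {
            profilers_.erase(tracerID_);
        }
    }
    DiscardProfilerGuard(const DiscardProfilerGuard &) = delete;
    DiscardProfilerGuard &operator=(const DiscardProfilerGuard &) = delete;

    void dismiss() { armed_ = false; }

private:
    ProfilerMap &profilers_;
    int tracerID_;
    bool armed_ = true;
};

// Hypothetical finish path. Returns the profiler to serialize, or nullptr.
inline std::shared_ptr<Profiler> finishTracer(
    ProfilerMap &profilers, int tracerID, bool waitingForChildren, bool shouldCapture)
{
    DiscardProfilerGuard guard(profilers, tracerID);
    if (waitingForChildren) {
        guard.dismiss(); // finish will be attempted again later, keep the profiler
        return nullptr;
    }
    if (!shouldCapture) {
        return nullptr; // discarded transaction: the guard drops the profiler
    }
    const auto found = profilers.find(tracerID);
    if (found == profilers.end()) {
        return nullptr;
    }
    return found->second; // copied out before the guard erases the entry
}
```

The `dismiss()` call covers the case asked about above, where an early return should not clean up because the finish path will run again.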


armcknight changed the title ~~extract implementation of SentryProfilerState; extract SentrySample~~ fix: profiler timeout timer from non-main contexts Jul 7, 2023

philipphofmann reviewed Jul 7, 2023


armcknight force-pushed the armcknight/fix/profiler-timeout-failure branch from 740e21b to 1371ba6 Compare July 11, 2023 01:58

armcknight commented Jul 11, 2023


armcknight changed the title ~~fix: profiler timeout timer from non-main contexts~~ fix: profiler timeout scheduling and data preservation Jul 11, 2023

Base automatically changed from armcknight/ref/profiler-mocks to main July 11, 2023 05:05

philipphofmann reviewed Jul 11, 2023


indragiek approved these changes Jul 12, 2023


philipphofmann reviewed Jul 12, 2023


armcknight mentioned this pull request Jul 13, 2023

fix: profiling memory growth fix #3145

Merged

armcknight force-pushed the armcknight/fix/profiler-timeout-failure branch from 1f5e025 to be1808c Compare July 13, 2023 00:06

armcknight changed the base branch from main to armcknight/fix/memory-growth-regression-test July 13, 2023 00:06

armcknight marked this pull request as ready for review July 13, 2023 00:10

armcknight requested a review from brustolin as a code owner July 13, 2023 00:10

armcknight added 13 commits July 12, 2023 16:22

schedule timeout timer in main context

a8926b8

changelog

5e0e683

wip moving stoppage methods to instance level

47e26cf

wip managing tracer/profiler concurrency bookkeeping

f1bfd01

track existing profilers as well as new ones

d653223

fix test to not block the dispatched async timeout timer from being scheduled

b1d060e

copy off and reset gpu data at the right times; remove +[SentryProfiler stop] method impl no longer needed

67bc184

undo the file rename so the diff is better

16aee94

cleanup

220e182

conform to nscopying

6b606f0

use notification center wrapper

5a3dc0b

fix build

12cf3d0

use own state

998cd6b

armcknight force-pushed the armcknight/fix/profiler-timeout-failure branch from be1808c to 998cd6b Compare July 13, 2023 00:22

pr feedback

b8e11f7

fix test build

6a4e1a7

armcknight merged commit f24e9a7 into armcknight/fix/memory-growth-regression-test Jul 14, 2023

armcknight deleted the armcknight/fix/profiler-timeout-failure branch July 14, 2023 00:10

This was referenced Jul 14, 2023

ref: rename SentryTracerConcurrency -> SentryProfiledTracerConcurrency #3153

Merged

fix: clean up profilers for discarded transactions #3154

Merged

philipphofmann added a commit that referenced this pull request Jul 19, 2023

chore: Add note on improvement of #3135

a86cad8

Explain that not symbolicating locally speeds up certain SDK actions.

philipphofmann added a commit that referenced this pull request Jul 19, 2023

chore: Add note on improvement of #3135 (#3170)

afd1a08

Explain that not symbolicating locally speeds up certain SDK actions.
