Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Potential deadlock in app hang detection #4063

Merged
merged 4 commits into from
Jun 14, 2024

Conversation

philipphofmann
Copy link
Member

📜 Description

The app hang detection retrieves the stacktraces of all threads when it detects an app hang. The logic for retrieving the stacktrace uses sentrycrashdl_dladdr, which isn't async-safe, although it claims to be because it accesses _dyld_get_image_header, which acquires a lock. Therefore, in rare cases, retrieving the stacktrace for all threads can lead to a deadlock. We fix this by skipping symbolication, which calls sentrycrashdl_dladdr when debug is false, which we we thought we did in #3079 but somehow missed for all threads.

💡 Motivation and Context

Fixes GH-4056

💚 How did you test it?

Setting debug=false for the iOS-Swift sample app to see if the symbolication still works. I wasn't able to reproduce the deadlock mentioned in GH-4056, but according to the description this fix should fix the problem.

📝 Checklist

You have to check all boxes before merging:

  • I reviewed the submitted code.
  • I added tests to verify the changes.
  • No new PII added or SDK only sends newly added PII if sendDefaultPII is enabled.
  • I updated the docs if needed.
  • Review from the native team if needed.
  • No breaking change or entry added to the changelog.
  • No breaking change for hybrid SDKs or communicated to hybrid SDKs.

🔮 Next steps

The app hang detection retrieves the stacktraces of all threads when it
detects an app hang. The logic for retrieving the stacktrace uses
sentrycrashdl_dladdr, which isn't async-safe, although it claims to be
because it accesses _dyld_get_image_header, which acquires a lock.
Therefore, in rare cases, retrieving the stacktrace for all threads can
lead to a deadlock. We fix this by skipping symbolication, which calls
sentrycrashdl_dladdr when debug is false, which we we thought we did in

Fixes GH-4056
Copy link

codecov bot commented Jun 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.207%. Comparing base (2093e05) to head (b4ccc53).

Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##              main     #4063       +/-   ##
=============================================
+ Coverage   91.158%   91.207%   +0.049%     
=============================================
  Files          611       611               
  Lines        48100     48130       +30     
  Branches     17273     17272        -1     
=============================================
+ Hits         43847     43898       +51     
+ Misses        4162      4139       -23     
- Partials        91        93        +2     
Files Coverage Δ
Sources/Sentry/SentryCrashStackEntryMapper.m 100.000% <100.000%> (ø)
Sources/Sentry/SentryStacktraceBuilder.m 77.777% <100.000%> (ø)
Sources/Sentry/SentryThreadInspector.m 99.009% <100.000%> (+0.030%) ⬆️
...tryCrash/Recording/Tools/SentryCrashSymbolicator.c 82.758% <ø> (ø)
...SentryCrash/SentryCrashStackEntryMapperTests.swift 100.000% <100.000%> (ø)
...Tests/SentryCrash/SentryThreadInspectorTests.swift 97.142% <100.000%> (+11.046%) ⬆️
.../SentryTests/SentryCrash/TestThreadInspector.swift 100.000% <100.000%> (ø)

... and 6 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2093e05...b4ccc53. Read the comment docs.

Copy link

github-actions bot commented Jun 13, 2024

Performance metrics 🚀

  Plain With Sentry Diff
Startup time 1222.58 ms 1237.38 ms 14.79 ms
Size 21.58 KiB 669.99 KiB 648.41 KiB

Baseline results on branch: main

Startup times

Revision Plain With Sentry Diff
2af280d 1246.22 ms 1253.10 ms 6.88 ms
4977fbc 1217.26 ms 1239.82 ms 22.56 ms
cf97209 1211.20 ms 1233.31 ms 22.10 ms
9e6665f 1233.33 ms 1257.14 ms 23.81 ms
ef5821b 1253.18 ms 1265.46 ms 12.28 ms
3912b16 1230.54 ms 1249.65 ms 19.11 ms
59ea421 1258.83 ms 1272.55 ms 13.72 ms
1437c68 1244.86 ms 1254.18 ms 9.32 ms
9ef729b 1228.79 ms 1245.36 ms 16.57 ms
e2abb0d 1235.08 ms 1257.00 ms 21.92 ms

App size

Revision Plain With Sentry Diff
2af280d 20.76 KiB 435.22 KiB 414.46 KiB
4977fbc 20.76 KiB 419.85 KiB 399.10 KiB
cf97209 21.58 KiB 632.16 KiB 610.58 KiB
9e6665f 22.84 KiB 403.52 KiB 380.67 KiB
ef5821b 21.58 KiB 414.96 KiB 393.37 KiB
3912b16 21.58 KiB 625.82 KiB 604.24 KiB
59ea421 22.85 KiB 413.45 KiB 390.60 KiB
1437c68 22.85 KiB 410.96 KiB 388.11 KiB
9ef729b 20.76 KiB 432.88 KiB 412.12 KiB
e2abb0d 20.76 KiB 434.72 KiB 413.96 KiB

Previous results on branch: fix/dont-symbolicate-locally-for-app-hangs

Startup times

Revision Plain With Sentry Diff
0c326fd 1207.29 ms 1221.37 ms 14.08 ms

App size

Revision Plain With Sentry Diff
0c326fd 21.58 KiB 669.85 KiB 648.26 KiB

Copy link
Contributor

@brustolin brustolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I just dont see how this could help in an AppHang deadlock.
AFAIK being async-safe is not a requirement to use in a background thread when the main thread is busy. We use objc during ANR detection which is also not async-safe

@philipphofmann
Copy link
Member Author

This looks good. I just dont see how this could help in an AppHang deadlock. AFAIK being async-safe is not a requirement to use in a background thread when the main thread is busy. We use objc during ANR detection which is also not async-safe

Looking at the stracktrace posted in #4056 you can see that _dyld_get_image_header requires a lock. When we're not calling symbolicate_internal, we prevent acquiring a lock, which you shouldn't do when suspending all threads. Acquiring a lock with symbolicate_internal doesn't always end up in a deadlock but in rare cases it can.

+ 7566 -[SentryThreadInspector getCurrentThreadsWithStackTrace]  (in MyApp) + 232  [0x1116f7d64]
+ 7566 getStackEntriesFromThread  (in MyApp) + 108  [0x1116f7718]
+ 7566 symbolicate_internal  (in MyApp) + 108  [0x11171a0d8]
+ 7566 sentrycrashdl_dladdr  (in MyApp) + 108  [0x11174f70c]
+ 7566 dyld4::APIs::_dyld_get_image_header(unsigned int)  (in dyld_sim) + 112  [0x10482f8ec]
+ 7566 dyld4::RuntimeLocks::withLoadersReadLock(void () block_pointer)  (in dyld_sim) + 56  [0x104816010]
+ 7566 _os_unfair_lock_lock_slow  (in libsystem_platform.dylib) + 204  [0x1051199c4]
+ 7566 __ulock_wait  (in libsystem_kernel.dylib) + 8  [0x10509e7dc]

Is it clearer now @brustolin or am I missing something?

@brustolin
Copy link
Contributor

which you shouldn't do when suspending all threads

Ow, this is the problem! Now I remember, thanks!!

@philipphofmann philipphofmann merged commit e145ca1 into main Jun 14, 2024
64 of 66 checks passed
@philipphofmann philipphofmann deleted the fix/dont-symbolicate-locally-for-app-hangs branch June 14, 2024 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hang detection causes app to hang
2 participants