Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent ANR during SDK initialization #709

Merged

Conversation

ansman
Copy link
Contributor

@ansman ansman commented Nov 22, 2024

Description

When initializing the OpenTelemetry Android SDK with disk buffering enabled, we discovered that synchronous disk space checks were causing ANRs in production. These checks occur during the creation of disk buffering exporters, specifically in DiskManager.getMaxFolderSize(), which makes blocking IPC calls through StorageManager.getAllocatableBytes() on the main thread. The issue manifests in the following ANR stacktrace:

android.os.BinderProxy.transact (BinderProxy.java:662)
android.os.storage.IStorageManager$Stub$Proxy.getAllocatableBytes (IStorageManager.java:2837)
android.os.storage.StorageManager.getAllocatableBytes (StorageManager.java:2414)
android.os.storage.StorageManager.getAllocatableBytes (StorageManager.java:2404)
io.opentelemetry.android.internal.services.CacheStorage.getAvailableSpace (CacheStorage.java:66)
io.opentelemetry.android.internal.services.CacheStorage.ensureCacheSpaceAvailable (CacheStorage.java:50)
io.opentelemetry.android.internal.features.persistence.DiskManager.getMaxFolderSize (DiskManager.kt:58)
io.opentelemetry.android.OpenTelemetryRumBuilder.createStorageConfiguration (OpenTelemetryRumBuilder.java:338)
io.opentelemetry.android.OpenTelemetryRumBuilder.build (OpenTelemetryRumBuilder.java:286)

Solution

To fix this we moved initialization to run on a background executor and buffer the data in memory until it completes.

The process works like this:

  1. Initialize the SDK with BufferDelegatingExporter instances that can immediately accept telemetry data.
  2. Move exporter initialization off the main thread.
  3. Once async initialization completes, flush buffered signals to initialized exporters and delegate all future signals.

The primary goal of this solution is to be unobtrusive and prevent ANRs caused by initialization of disk exporters, while preventing signals from being dropped.

Testing

We have added unit tests to cover the buffering, delevation, and RUM building. We've also verified this with both disk enabled and disk disabled.

@ansman ansman requested a review from a team as a code owner November 22, 2024 17:35
Copy link

linux-foundation-easycla bot commented Nov 22, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@ansman
Copy link
Contributor Author

ansman commented Nov 22, 2024

FYI we are working on getting the CLA signed.

@ansman ansman force-pushed the fix/improve-startup-performance branch from 6f1e90f to 52018e3 Compare December 6, 2024 16:25
@ansman
Copy link
Contributor Author

ansman commented Dec 6, 2024

@breedx-splk CLA signed

Copy link
Contributor

@breedx-splk breedx-splk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ansman for taking this on, I think this is going to be really helpful...and there are bound to be a few other areas that are problematic at startup, so I appreciate you both shining some light on this and for going the distance in submitting this PR.

Copy link
Contributor

@LikeTheSalad LikeTheSalad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to contribute 🙏 I like the overall idea. I left some questions and suggestions, so please have a look when you get a chance.

Description
When initializing the OpenTelemetry Android SDK with disk buffering enabled, we
discovered that synchronous disk space checks were causing ANRs in production.
These checks occur during the creation of disk buffering exporters, specifically
in `DiskManager.getMaxFolderSize()`, which makes blocking IPC calls through
`StorageManager.getAllocatableBytes()` on the main thread. The issue manifests
in the following ANR stacktrace:
```
android.os.BinderProxy.transact (BinderProxy.java:662)
android.os.storage.IStorageManager$Stub$Proxy.getAllocatableBytes (IStorageManager.java:2837)
android.os.storage.StorageManager.getAllocatableBytes (StorageManager.java:2414)
android.os.storage.StorageManager.getAllocatableBytes (StorageManager.java:2404)
io.opentelemetry.android.internal.services.CacheStorage.getAvailableSpace (CacheStorage.java:66)
io.opentelemetry.android.internal.services.CacheStorage.ensureCacheSpaceAvailable (CacheStorage.java:50)
io.opentelemetry.android.internal.features.persistence.DiskManager.getMaxFolderSize (DiskManager.kt:58)
io.opentelemetry.android.OpenTelemetryRumBuilder.createStorageConfiguration (OpenTelemetryRumBuilder.java:338)
io.opentelemetry.android.OpenTelemetryRumBuilder.build (OpenTelemetryRumBuilder.java:286)
```

Our Solution
To fix this we moved initialization to run on a background executor and
buffer the data in memory until it completes.

The process works like this:
1. Initialize the SDK with `BufferDelegatingExporter` instances that can
   immediately accept telemetry data.
2. Move exporter initialization off the main thread.
3. Once async initialization completes, flush buffered signals to initialized
   exporters and delegate all future signals.

The primary goal of this solution is to be unobtrusive and prevent ANRs caused
by initialization of disk exporters, while preventing signals from being
dropped.

Testing
We have added unit tests to cover the buffering, delevation, and RUM
building. We've also verified this with both disk enabled and disk
disabled.
@ansman ansman force-pushed the fix/improve-startup-performance branch from ff505fc to 5c4f0e5 Compare December 16, 2024 17:42
@ansman
Copy link
Contributor Author

ansman commented Dec 16, 2024

I've updated the PR with the feedback and everything except the logging has been addressed.

In addition the returned result will not be success any longer but rather resolve when the delegate is set.

Copy link
Contributor

@breedx-splk breedx-splk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ansman for sticking this out, I like how it's come together! Would still like to see the log warning emitted, and I'll wait to let @LikeTheSalad do the merge with his approval, but I think things are looking solid now. Thanks for the contribution.

Copy link
Contributor

@LikeTheSalad LikeTheSalad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution and patience, @ansman! 🙏 I'm looking forward to getting your PR merged, though it looks like there are some formatting issues that are making the build fail, if you could please update this PR after running ./gradlew spotlessApply locally, I think that should do. I'm happy to proceed with the merge once the GH build passes :)

@LikeTheSalad LikeTheSalad merged commit 5f38747 into open-telemetry:main Dec 17, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants