-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ANR detection to bugsnag-android #442
Conversation
Adds ANR detection to the Client by default, when a debugger is not attached. A handler posts a runnable to the main thread, which updates a timestamp periodically. If the timestamp has not been updated within the blocked threshold, an unhandled ANRError is reported.
… BlockedThreadDetector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of this looks good. I tested it on the example app and it behaves as expected. I would like to go through the logic with you line by line just to explore the potential edge cases there too though.
My only comment is that when the device is online, the error report from a detected ANR get saved to disk and then immediately sent. This seems like a whole bunch of unnecessary file operations when the app is in a potentially very busy state. I suspect it's because the error is considered "unhandled" and so it goes down the route "the app is crashing, serialise it asap". Do you think it would be possible and/or a good idea to check the network state and attempt to send first if the device is online?
The ANR dialog allows the user to kill the app process by clicking "Close". Therefore we write the report to disk immediately, which is the same processing used for when an uncaught exception occurs and the process is about to die. |
sdk/src/main/java/com/bugsnag/android/BlockedThreadDetector.java
Outdated
Show resolved
Hide resolved
…ower value passed
sdk/src/main/java/com/bugsnag/android/BlockedThreadDetector.java
Outdated
Show resolved
Hide resolved
sdk/src/main/java/com/bugsnag/android/BlockedThreadDetector.java
Outdated
Show resolved
Hide resolved
We've already implemented this in Square apps, and there are a few gotchas. ANRs aren't actually triggered by the main thread being blocked for 5 seconds. They are triggered when a touch event has been enqueued to an app and hasn't been consumed for 5 seconds. AFAIK there's no easy way to read the event queue and see how long events have been waiting in there (if someone knows, I'd love to learn more). So the closest proxy is usually "main thread has been blocked for 5 seconds AND the app is in foreground (foreground being a proxy for "can receive touch events). Even foreground is hard to capture as an activity might be stuck on onCreate or onStop so activity lifecycle callbacks don't have the right information in due time. A likely better way is to capture the running process importance, which indicates whether the system considers the app to be foreground or not. Also, posting to the main thread works fine in general but when the queue gets a lot of messages I believe the system will prioritize consuming touch events based on vsync, so you could have a really large message queue and yet no ANR. So you don't want to be too eager when detecting ANRs, and one way to do that is, if you're getting close to the ANR deadline, use Handler.sendMessageAtFrontOfQueue on the main thread. |
Add unhandled_events field to native payload
Thanks for the detailed review comments @pyricau - they've been really helpful. We plan on making the following changes to our implementation:
The intent of this would be to change our definition of an ANR as something that only occurs when the main thread is blocked for 5s and the app is in the foreground. |
…s, and only reports ANRs in the Using a HandlerThread with postDelayed and uptimeMillis will pause ANR detection when the system enters a deep sleep. As the ANR dialog is only shown when the user dispatches a touch event, we use whether the app is in the foreground or not as a proxy to determine whether we should count an ANR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the updates to this PR. It still works broadly as expected, but there is one sequence of events that produces behaviour that I wouldn't expect:
- load the example app
- hit the anr button
- close the app in before 5 seconds have elapsed
- wait until at least 5 seconds have elapsed since hitting the button
- reopen app
- observe ANR error is sent immediately
This seems surprising to me as the time that the app spent being unresponsive while being in the foreground was the time between steps 2 and 3, which was less than 5 seconds.
@bengourley I was able to reproduce that behaviour when I put the app in the background, then quickly brought it to the foreground again, but if I left it for >10 seconds an ANR was not detected. The example app's ANR blocks the main thread for 10s, so I believe what we're seeing is expected behaviour. |
An immediate ANR when the app has been in the foreground for less than the ANR threshold doesn't make sense though. This edge case could be avoided though by checked that the time in foreground is greater than the threshold instead of only whether in the foreground at all here though. |
sdk/src/main/java/com/bugsnag/android/BlockedThreadDetector.java
Outdated
Show resolved
Hide resolved
As a reminder, an ANR is not shown every time the main thread is blocked for 5 seconds. Blocking of the main thread can cause an ANR, but not always. ANRs are triggered when touch events in the input queue aren't consumed after 5 seconds. That's why a common trick is to ignore time the main thread spends being blocked when the app is in background, because you can't get touch events in background. The main thread is a serial queue but some things (like rendering & touch events) get prioritized ahead of the rest. Another idea: listen to touch events. If a touch event was received within the last 5 seconds, then that touch event was definitely consumed and therefore an ANR won't show. To listen to all touch events you need an activity lifecycle listener, then you get the window and you wrap the existing window callback with your own and replace. |
Thanks for all the helpful review comments @pyricau. I've made one change to the implementation so that the Process Info is collected on API <16. After internal discussion we've decided that we'd like to stick with the current implementation, as wrapping the existing window and listening to all touch events could be considered quite invasive for a 3rd party library. |
Makes sense. One last thing: on twitter Matthijs Mullender mentioned the OS sends a sigquit to the app. If you already have a native signal handler in place for bugsnag maybe there's something to look at here. |
Using a signal handler sounds like a very interesting approach as it would be a bit more definitive that the app actually crashed. We're going to investigate using that longer-term, and go with the current changes for now. |
The idea is like in /~https://github.com/SalomonBrys/ANR-WatchDog |
Goal
ANRs occur when an Android application has been blocking the main thread for a substantial amount of time, and typically means the user is about to close the app. Bugsnag should capture and report these events as unhandled errors.
Changeset
config.detectAnrs
, which is enabled by default if the app does not have a debugger attached (this would cause false positives)config.anrThresholdMs
which allows developers to control the threshold at which ANRs are detected (5000ms by default)Client
initialisation, which reports an unhandled error when the main thread is blocked for 5000ms+ANR detection implementation
The
BlockedThreadDetector
uses aHandler
to periodically post aRunnable
on the main thread. ThisRunnable
updates a timestamp obtained fromSystemClock.elapsedRealTime()
.A separate watcher thread checks whether the timestamp has been updated in the last 5000ms. If this is not the case, then the main thread has been unable to process messages for 5 seconds and this is interpreted as an ANR.
Discussion
ANR detection requires additional work that will be addressed in separate PRs:
unhandledCount
is serialised.Tests