Why I Use Crashlytics - Part 2
In Part 1, I discussed how to set up Crashlytics to capture detailed data about crashes and devices, and how to easily view this data from within Android Studio. In this article I’m going to discuss how this data can help you fix the most critical crashes with minimal effort and keep your users and your business’s customers happy. I’ll also talk about how I use Crashlytics in a very large-scale application.
The Beauty is in the Details
One of the great things about Crashlytics is the level of detail you’re given when a crash occurs. As a product manager, dev lead or engineer you can quickly assess the risk a crash poses to your overall install base. With the device and operating system breakdown you can determine whether a particular crash occurs only on Android 2.x, only on some other version, or everywhere. You can also determine whether the crash affects only a particular device manufacturer. This is a tremendous help.
One of the things I’ve done with my clients to help assess the risk of a particular issue is to compare the issue statistics with the statistics from Answers by Crashlytics, Crashlytics’ new mobile analytics service (as shown below). Using this information I can determine whether the majority of my users are exposed to an issue simply by looking at the user count from Crashlytics’ crash reporting alongside the daily active user counts by OS version and by device from Answers. This data gives me the insight to decide whether the issue is severe or just a rarely hit edge case (the issue count really helps here and is covered in more detail in the next section).
A sample screenshot of Answers by Crashlytics providing the Daily Active Users by Operating System (OS).
A sample screenshot of Daily Active Users by Device Type.
These charts provide plenty of great information to compare against the issues in Crashlytics. If your app generates an error only occasionally, it’s easy to determine what to fix, how to fix it, and when to fix it. However, when your app is at scale (millions of users on thousands of types of devices), this process gets more involved. I’ve learned to rely on Crashlytics to help determine which issues need to be resolved quickly, and along the way we developed a set of steps that help us mitigate risk when new issues arise in the MyFitnessPal app.
App Scale and Implementing Post Release Risk Mitigation
Crashlytics is good for small and large apps alike, but it really shines in an app at scale (millions of users/installs), because that is when you start seeing issues you’ve never experienced or even imagined could occur in your app. Along the way you will also get crash reports from third-party frameworks (such as ad frameworks) and other libraries.
At MyFitnessPal, Crashlytics has helped us catch an enormous number of issues. Because Crashlytics exposes issues quickly and accurately, with a lot of detail, we have been able to assess them sooner and with greater confidence. Since it is a free product, the tool has also cut costs dramatically. Our risk analysis process helps us determine whether an issue needs to be fixed immediately via a hotfix or is mild enough to wait for the next dot release (a dot release would be moving from 3.0.1 to 3.0.2 or from 3.0 to 3.1; it varies from company to company).
The Dashboard and Risk Review
I feel it's important to see the Crashlytics dashboard for an app that has millions of installs because that level of detail helps determine our next course of action. The screenshot below is the Crashlytics Dashboard for the MyFitnessPal Android App as of the latter half of May 2014.
Crashlytics dashboard for the MyFitnessPal Android Application
The dashboard view provides a great number of details, as outlined in the table below.
| # | Item | Description |
|---|---|---|
| 1 | App / Package Name | The application name and package name. |
| 2 | Version Selector | Select a version to inspect, or view all versions of your app in aggregate. I have selected version 3.1.1 (4685): the version name (3.1.1) and the version code (4685) from your AndroidManifest.xml. The selected version relates to #12 below. |
| 3 | Open / Closed Issues | Toggle between open and closed issues, or display all. |
| 4 | Crashes / Non-Fatals | Toggle between crashes and non-fatal issues. |
| 5 | Date/Time Selector | Select a time or date range for the issues that you want to inspect. |
| 6 | Search | Search for logs, keys, issues, etc. |
| 7 | Issue Total | The total number of issues created from crashes and non-fatals for the selected version, or for the entire application if ‘All’ is selected. |
| 8 | Total Crashes and Total Non-Fatals with User Counts | The total number of non-fatals and the number of users affected by them. Directly beneath this are the total number of crashes and the number of users affected by them. |
| 9 | Time / Date Range Chart | A chart of the number of crashes over the selected time range. |
| 10 | Issue List Count for Severity / Select All | The total number of crashes for the given severity. Crashlytics ranks issue severity on a 1-5 scale; in this screenshot all 5 bars are blue, which means the issue is severity level 5 (the highest). The checkbox next to the issue count selects every issue in the list so you can edit them in bulk. |
| 11 | Issue | The issue line. It shows the issue ID (#26909) and other information about the issue, including the source file and line number of the root cause. Click it to open the issue detail page as shown earlier in this article. |
| 12 | Version Number | The version of the application when the crash occurred. With ‘All’ selected in the version selector, this column shows various values; since I have selected ‘3.1.1 (4685)’, every issue in the list shows that value. |
| 13 | Crash Count | The total number of crashes this issue has caused. |
| 14 | Affected Users Count | The total number of users this issue has affected. |
Using the data above along with the data in Crashlytics Answers, you can gauge risk quite easily. An additional free bonus is that Crashlytics’ severity ranking is fairly accurate and is a good place to start when assessing risk. I’m not sure exactly how their severity algorithm works, but it seems to combine crash frequency and persistence (does it keep happening?), the breadth of the crash across Android OS versions and devices, and the number of affected users.
The risk analysis is pretty simple. Once you release a new version of your app, check the Crashlytics dashboard daily and perform the following review:
Look for the most frequent crash across your install base and across the various OS versions.
Example: Suppose a crash only affects Droid Razr devices running Android 2.3.9, and our install data shows that 0.03% of our users have this device and that the ones running this OS version on it account for only 0.001% of our total users ... well ... it’s probably not that big of an issue. However, if we see a crash occurring on all Samsung Galaxy S3 and Galaxy S4 devices running Android 4.x, and our install numbers show that these account for 60% of our install and/or user base, then we have a higher-priority issue that probably needs a hotfix. We’ll need to ship the hotfix as soon as possible to mitigate the risk of losing existing and new users.
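The comparison in this example comes down to adding up the share of your user base that falls in the device/OS segments where the issue appears. Here is a minimal Java sketch of that arithmetic; `IssueExposure` and the segment labels are hypothetical, and the shares are whatever your analytics report, expressed as fractions (0.6 for 60%).

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: estimates how much of the user base an issue can reach. */
final class IssueExposure {

    private IssueExposure() {}

    /**
     * Sums the user-base share of every segment (e.g. "Galaxy S4 / Android 4.x")
     * on which the issue has been seen. Shares are fractions of all users (0.0 to 1.0)
     * taken from your analytics; segments missing from the map count as 0.
     */
    static double exposedShare(Map<String, Double> userShareBySegment,
                               Set<String> affectedSegments) {
        double total = 0.0;
        for (String segment : affectedSegments) {
            total += userShareBySegment.getOrDefault(segment, 0.0);
        }
        return total;
    }
}
```

With the numbers from the example, the Razr segment contributes 0.00001 (0.001%), while Galaxy S3/S4 segments totaling 0.60 mean 60% of users are exposed: the difference between an edge case and a hotfix.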
Below is how I perform a risk analysis for a crash. Please note, the first few times you do this it will take a few extra minutes because you are cross-referencing data, but after a while the process becomes VERY quick and you probably won’t need to check Crashlytics Answers for statistics as often. I do still recommend reviewing your statistics in Crashlytics Answers at least once a month to keep your heuristics up to date (not to mention all the other great info in there, which deserves a post of its own).
Crashlytics Issue Risk Analysis Steps
- Review the issue list and the following metrics:
  - Severity (Crashlytics usually nails this). I always look at the 5/5s first.
  - Total number of crashes this issue has caused.
  - Total number of users affected.
- Choose an issue you want to drill down into. Once the issue is open in the issue details screen (as shown in the sections above), review the following metrics:
  - Android OS exposure: how many different OS versions are affected?
  - Android device exposure: what percentage of the issue’s crashes occur on each device?
- Now cross-reference these metrics with your current Crashlytics Answers statistics to find the most popular Android devices and OS versions for your app. Do not skip this step! These numbers are different for every app and every market.
  - Record (or remember) the Android OS versions that make up the top 80% of your users. Usually the first 3-5 versions represent that 80%.
  - Record (or remember) your top 10 Android devices. As your install base grows (especially into the millions), your top devices will typically have anywhere from 100K to 1M+ installs each. Pay close attention to these devices so you can quickly tell whether an issue affects them.
- Determine whether it’s a hotfix issue, i.e., whether it needs to be fixed right now or can wait for the next release. Please note, every company’s risk tolerance differs, so adjust these numbers to your own. MyFitnessPal has a different risk tolerance than my news reading app. The numbers below are generalizations.
  - If the issue is affecting more than 1% of your users, it’s definitely a hotfix issue.
    - Again, this is highly subjective. If 10 users out of 1,000 are getting a really obscure error that rarely happens, you may want to hold off on fixing it. This is up to you, but at least you now have the data to reach that conclusion.
  - If more than 10% of the issue’s crashes occur on one of your top devices (based on the issue’s per-device breakdown), it’s probably a candidate for a hotfix release.
- If a hotfix is necessary, fix the source, ship the update, and start this process again.
Caveat: If you feel an issue needs to be fixed immediately (metrics aside), then by all means, fix it. This is only a rough, general guideline I use to decide whether to fix an issue now or later; feel free to adjust it to your needs.
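To make the triage steps above repeatable, they can be sketched in Java as below. This is a hypothetical helper, not part of Crashlytics: the inputs are numbers you copy from the dashboard and from Answers, the thresholds are the rough 1% and 10% generalizations from the list, and the top-device rule uses one reading of it (more than 10% of the issue’s crashes landing on a single top device).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the risk analysis steps; tune thresholds to your own risk tolerance. */
final class HotfixTriage {

    static final double USER_SHARE_THRESHOLD = 0.01;       // more than 1% of users affected
    static final double TOP_DEVICE_ISSUE_THRESHOLD = 0.10; // more than 10% of the issue on one top device

    private HotfixTriage() {}

    /**
     * Returns the OS versions, most popular first, that together reach {@code coverage}
     * of your users (e.g. 0.80 for the "top 80%").
     */
    static List<String> topVersionsCovering(Map<String, Double> userShareByOs, double coverage) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(userShareByOs.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue())); // descending share
        List<String> result = new ArrayList<>();
        double cumulative = 0.0;
        for (Map.Entry<String, Double> entry : entries) {
            if (cumulative >= coverage) {
                break;
            }
            result.add(entry.getKey());
            cumulative += entry.getValue();
        }
        return result;
    }

    /**
     * @param affectedUsers   users affected by the issue (Crashlytics issue list)
     * @param totalUsers      total active users (from your analytics)
     * @param crashesByDevice the issue's crash count per device model
     * @param topDevices      your top devices by install base
     */
    static boolean needsHotfix(long affectedUsers, long totalUsers,
                               Map<String, Long> crashesByDevice, List<String> topDevices) {
        if (totalUsers > 0 && (double) affectedUsers / totalUsers > USER_SHARE_THRESHOLD) {
            return true;
        }
        long issueTotal = 0;
        for (long count : crashesByDevice.values()) {
            issueTotal += count;
        }
        if (issueTotal == 0) {
            return false;
        }
        for (String device : topDevices) {
            long onDevice = crashesByDevice.getOrDefault(device, 0L);
            if ((double) onDevice / issueTotal > TOP_DEVICE_ISSUE_THRESHOLD) {
                return true;
            }
        }
        return false;
    }
}
```

The result is only a starting point for the conversation; as the caveat says, you can still override it when an issue simply feels too risky to leave.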
Crashes at Scale
Another thing you’ll notice when your app reaches scale is that you start getting crash reports from parts of the code that you do not own, such as ad frameworks, open source libraries, or commercial third-party libraries (charting, etc.). When you encounter issues like this you have a few options:
- Extend the class that is causing the issue (if possible) and fix the issue.
- If the code is open source, fork it and fix it, submit a pull request to the open source project and ship the app with your forked code until the fix is in the open source project.
- Contact the vendor and have them fix it.
I’ve used all three of these with mixed success. However, sometimes there is an error you simply cannot fix, for example a bug in the Android framework (yes, this does happen). Your best course of action is to catch the error in your own code and handle it gracefully, because Android OS updates reach devices infrequently (unless the user is fortunate enough to be running a Nexus device, and even those updates are limited and slow).
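One way to apply that advice is to catch only the exception type the platform bug is known to throw, so that unexpected failures still crash and get reported. The helper below is a hypothetical sketch; `PlatformBugGuard` and `callOrFallback` are illustrative names, not an Android or Crashlytics API.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper for working around a bug you can't fix (for example, in the
 * Android framework): catch only the known exception type and fall back to a safe value.
 */
final class PlatformBugGuard {

    private PlatformBugGuard() {}

    static <T> T callOrFallback(Supplier<T> call,
                                Class<? extends RuntimeException> knownBugType,
                                T fallback) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            if (knownBugType.isInstance(e)) {
                return fallback; // known, unfixable bug: degrade gracefully
            }
            throw e; // anything else is a real problem, so don't swallow it
        }
    }
}
```

In practice you might also report the caught exception as a non-fatal before returning the fallback, so you can see how often the workaround is actually used.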
Logging Exceptions with Crashlytics
Ok, the reports and data are great, but how do I log these exceptions!?!
Crash monitoring in Crashlytics is built in. You simply add it to your application’s entry point (as the plugin did during installation in Part 1), and from then on crash reporting is handled for you. Reports are delivered over SSL, so you don’t have to worry about the security of the transmitted information. The Crashlytics library is also quite small (~45 KB) and has minimal impact on your application’s startup time. But crash reporting is not the end of Crashlytics’ reach; it can do much more.
Logging Caught Exceptions
When writing an app you sometimes have to wrap your code in a try/catch. Maybe it’s a network failure you’re expecting, maybe it’s a file read/write issue, or maybe it’s a null that sometimes appears when the app is in a certain state. Ideally these exceptions would never happen, but sometimes they do. You and your team do your best to handle them gracefully so that the app doesn’t crash, but at the end of the day you don’t really know how many times a particular exception is being caught. Crashlytics lets you log caught exceptions by passing them to Crashlytics.logException(Throwable) from inside the catch block.
This logs the exception as a non-fatal in Crashlytics, and you can review it just as you would a normal crash, with all of the same data: error line number, stack trace, OS, device type, etc. The screenshot below, from QONQR, a cross-platform game I helped develop, shows an example.
Screenshot of the non-fatal exception details for the QONQR Android app. Here we have a problem in the onPause event of our QonqrMap activity. We have since fixed it (the date chart shows no recent data), but we’re leaving it open for review while we’re still making changes.
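To keep caught exceptions visible without tying every class to the SDK, you could route them through a small reporter interface. This is a hypothetical sketch: `NonFatalReporter`, `RecordingReporter`, and `SettingsParser` are illustrative names, and only the commented production implementation calls the real `Crashlytics.logException(Throwable)` method.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical seam between app code and the crash reporter. */
interface NonFatalReporter {
    void logException(Throwable t);
}

// In the app, the production implementation could simply forward to the SDK:
//   NonFatalReporter reporter = e -> Crashlytics.logException(e);

/** Test/debug implementation that just remembers what was reported. */
final class RecordingReporter implements NonFatalReporter {
    final List<Throwable> reported = new ArrayList<>();

    @Override
    public void logException(Throwable t) {
        reported.add(t);
    }
}

/** Hypothetical example of a caught, handled failure that is still reported. */
final class SettingsParser {
    private final NonFatalReporter reporter;

    SettingsParser(NonFatalReporter reporter) {
        this.reporter = reporter;
    }

    /** Parses a numeric setting, falling back to a default when it is malformed or null. */
    int parseIntOrDefault(String raw, int defaultValue) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            reporter.logException(e); // shows up as a non-fatal, so you can see how often it happens
            return defaultValue;
        }
    }
}
```

The user never sees a crash, yet every fallback is counted, which answers the "how many times is this being caught?" question.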
At times in the debugging lifecycle you’ll find that particular crashes occur only for particular users. These issues can be especially hard to squash. To help you track users and their crashes, Crashlytics lets you attach user information to crash reports with methods such as setUserIdentifier, setUserName, and setUserEmail.
You can use the setUserIdentifier method to set an ID, number, or hashed value that uniquely identifies the user of your app without disclosing or transmitting any of their personal information. You can also use the other methods to send up additional data about the user.
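Here is a minimal sketch of producing such a hashed value with the JDK’s `MessageDigest`. `UserIdHasher` and the salt are hypothetical; you would pass the result to `setUserIdentifier`. Adding an app-specific salt makes it harder for anyone viewing the dashboard to reverse a hashed email by hashing guesses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Hypothetical helper: derives a stable, non-reversible identifier to pass to
 * Crashlytics.setUserIdentifier(...) instead of a raw email or account name.
 */
final class UserIdHasher {

    private UserIdHasher() {}

    /** Returns the lowercase hex SHA-256 of salt + rawId. */
    static String hashedId(String rawId, String salt) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((salt + rawId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b)); // negative bytes format as unsigned
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

The same input always yields the same identifier, so you can still group crashes by user and, when a user reports a problem, hash their ID the same way to find their crashes.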
On top of user information, it is also useful to keep track of context when a crash occurs, since context can help you pinpoint the root cause of an issue fairly quickly. Crashlytics lets you log contextual data with custom keys, using methods such as:
Crashlytics.setBool(String key, boolean value);
Crashlytics.setDouble(String key, double value);
Crashlytics.setFloat(String key, float value);
Crashlytics.setInt(String key, int value);
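For example, you could record the screen the user was on, or the state of a setting, just before a risky operation, so that the value travels with any crash that follows.
This data is sent with the crash to Crashlytics, and you can view it in the issue details screen. Please note, Crashlytics limits you to 64 key/value pairs, though support may be able to raise that limit if you contact them.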
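If you set custom keys from many places, a thin wrapper can keep track of which keys are in use so you notice before running past the 64-pair limit. This is a hypothetical sketch: `CrashContext` is an illustrative name, and the sink is where the real SDK call goes, for example `(k, v) -> Crashlytics.setString(k, v)`, assuming your SDK version provides `setString` alongside the setters listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical wrapper around custom keys that respects a fixed key budget. */
final class CrashContext {
    static final int MAX_KEYS = 64;

    private final Map<String, String> current = new LinkedHashMap<>();
    private final BiConsumer<String, String> sink; // e.g. forwards to Crashlytics

    CrashContext(BiConsumer<String, String> sink) {
        this.sink = sink;
    }

    /** Records a key/value; returns false (and drops it) if it would be a 65th distinct key. */
    boolean put(String key, String value) {
        if (!current.containsKey(key) && current.size() >= MAX_KEYS) {
            return false;
        }
        current.put(key, value); // updating an existing key never counts against the budget
        sink.accept(key, value);
        return true;
    }

    int size() {
        return current.size();
    }
}
```

A call such as `context.put("last_screen", "Settings")` before a risky operation would then attach that value to any crash that follows.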
Tips and Tricks
Below are a few tips and tricks I usually share with folks when they first start using Crashlytics. I hope they help you out.
Gradle Build Variants for Debug and Release Builds
Crashlytics adds apps to your organization automatically, keyed by package name. If your debug and release builds share a package name, crashes from development show up in your release dashboard as false positives. One option is to add logic around the Crashlytics initialization so that it only runs in release builds. I prefer to use Gradle build variants instead: in debug mode my package name is com.donnfelker.myapp.debug and in release mode it is com.donnfelker.myapp. This keeps my development crashes out of my release crashes.
You can implement this with a debug build type in Gradle that appends ‘.debug’ to the package name (in current versions of the Android Gradle plugin, via applicationIdSuffix). Edit or remove any ProGuard configuration in your release build type as you need to.
Now when you build, ‘.debug’ will be appended to the package name for debug builds. If you use ProGuard, note that the Crashlytics plugin automatically uploads the ProGuard mapping to Crashlytics so that your stack traces are deobfuscated. Awesome!
Fragment getActivity() Causes at Least 50% of the Errors at First
This is an easy one. Over 50% (I’d go as far as to say 80%) of the apps that use Fragments do not check for null when calling getActivity(). This results in a ton of NullPointerExceptions, each of which crashes the app.
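Your new rule of thumb should be: getActivity() will ALWAYS be null. It returns null whenever the Fragment is not attached to an Activity, for example when a network callback finishes after the user has navigated away. If you think this way, you will change how you write code that uses getActivity(), and you will always check for null. Doing that will save you a ton of crash reports.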
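A small helper can make that null check hard to forget. This is a hypothetical sketch, not an Android API: `HostGuard.ifAttached` reads the host once and runs the action only when it is non-null. In a Fragment you might call it as `HostGuard.ifAttached(this::getActivity, activity -> activity.setTitle("Done"))`.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper for the "getActivity() will ALWAYS be null" rule of thumb. */
final class HostGuard {

    private HostGuard() {}

    /** Runs the action only if the host is currently available; returns whether it ran. */
    static <T> boolean ifAttached(Supplier<T> host, Consumer<? super T> action) {
        T current = host.get(); // read once, so the value can't change between check and use
        if (current == null) {
            return false; // detached, e.g. an async callback finished after the user left
        }
        action.accept(current);
        return true;
    }
}
```

Reading the value into a local before checking it matters: calling getActivity() twice (once for the check, once for the use) leaves a window in which the Fragment can detach.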
NullPointerException is your Best Friend
Speaking of null pointer exceptions ... As I’ve said in many of my presentations, expect everything to be null, especially getActivity() in a Fragment. The NullPointerException is one of the easiest exceptions to fix (when it’s in your code). These are quick hits: simply fix it, test it and ship it. But please note, at scale you will see more NullPointerExceptions than you ever thought you would, and many of them will be in places you least expected. Happy hunting.
I’ve been developing apps for the Android platform since it was introduced to the public in 2008. Many of my original apps were among the first 1,000 published to the Android Market (now known as Google Play), and even then my major concern was crash reporting. I’ve used a slew of tools over the years, and while some have come close to the usability, cost, and feature set of Crashlytics, none has surpassed it yet. Crashlytics is the first library I install in any app that is going to be released on Google Play or the Amazon Appstore. It probably should be yours too.
Side Note: Crashlytics now offers beta distribution with Beta by Crashlytics (kind of like TestFlight, but by Crashlytics) and mobile analytics with Answers by Crashlytics. Like the crash reporting experience, these new services have been great tools to have in my toolbelt. If you use beta distribution or are looking for a strong mobile app analytics tool, you should definitely check them out.
Learn more about Crashlytics for Android.