Instagram Engineering

Building an Open Source, Carefree Android Disk Cache


Caching various files on disk has always been an integral part of many mobile apps. At Instagram, we use caching to store and recover images, videos, and text files. As a media-heavy application, the Instagram Android app requires a lightweight but stable disk cache system. When we first built the app, we started with the open source DiskLruCache library. It served us well until we found one major issue with the cache’s design: the cache code’s exception handling logic is cumbersome and prone to developer error.

For example, the following code snippet shows how to properly handle a simple write-to-disk operation using DiskLruCache:

// Writing to Cache before using IgDiskCache
if (mDiskLruCache != null) {
    final String key = hashKeyForDisk(data);
    DiskLruCache.Editor editor = null;
    OutputStream out = null;
    try {
        editor = mDiskLruCache.edit(key);
        if (editor != null) {
            out = editor.newOutputStream(DISK_CACHE_INDEX);
            writeFileToOutputStream(out);
            out.close();
            editor.commit();
        }
    } catch (IOException e) {
        if (out != null) {
            try {
                out.close();
            } catch (IOException closeException) {
                Log.d(LOG_TAG, "can't close output stream", closeException);
            }
        }
        if (editor != null) {
            try {
                // This is an Instagram modification to DiskLruCache.
                // Making sure the cache will be in a good state even if an IOException is thrown.
                editor.removeEntryAndAbort();
            } catch (IOException abortException) {
                Log.d(LOG_TAG, "can't abort editor", abortException);
            }
        }
    }
} else {
    // Handle the disk cache not available case.
}

As you can see, because DiskLruCache doesn’t support stub instances, when file storage is not available (either the cache directory is not accessible, or there is not enough storage space left), we have no choice but to let mDiskLruCache fall back to null. This seemingly harmless fallback requires every engineer to explicitly check that the cache is not null before each use. After confirming that the cache is available, the disk caching code also needs two extra steps to get to the OutputStream: retrieving the Editor object for the cache entry using the cache key, and then getting the OutputStream from the Editor. Both of these steps might throw IOExceptions, and retrieving the Editor from the disk cache could also return null. If any of these failure cases occurs, engineers need to figure out on their own how to handle the failure gracefully, properly close all the streams/editors/snapshots, and make sure partial files won’t corrupt the cache.

If you think this is already complicated, just imagine how complicated it gets when handling two editors in the same code block, or implementing a read-process-write case inside a single method. Missing any one of those null checks or mishandling any of the IOExceptions results in many crashes daily on client devices. Over time, as our app became more complex and more engineers joined and worked on the same code base, the disk caching code became extremely flaky and hard to maintain. For over a year, cache-related NPEs (NullPointerExceptions) and IOExceptions topped our crash list. After several small patches, we realized that small fixes wouldn’t solve the problem. They made the code look even worse, and new crashes kept coming.

To fix the issue completely, we knew we had to rethink what a disk cache is, and redesign it to make the whole thing easier to use and maintain. A cache, by definition, can always tell the developer “I don’t have this item.” We use this principle to simplify the cases where the cache can’t even be opened, or where there are disk errors: we simply report that we don’t have the item, and let writes fail silently. And for cases like IOExceptions, developers ideally shouldn’t have to guess what’s happening inside the cache or handle every possible scenario themselves. The cache should be smart enough to handle most failure cases itself, and guarantee that no incomplete file will be cached and that all cache entries get closed properly.
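To make the "I don't have this item" principle concrete, here is a minimal sketch of a stub-backed cache. It is not the IgDiskCache API: the names SimpleCache, StubCache, FileCache, and Caches are hypothetical and exist only for this illustration, and keys are assumed to already be safe file names (for example, hashed). If the cache directory can't be used, callers get a stub whose reads always miss and whose writes are dropped, so no call site ever needs a null check.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Optional;

/** Hypothetical minimal cache interface, for illustration only. */
interface SimpleCache {
    Optional<byte[]> get(String key);

    /** Returns true only if the value was stored. */
    boolean put(String key, byte[] value);
}

/** Stub used when storage is unusable: every read misses, every write is dropped. */
final class StubCache implements SimpleCache {
    @Override
    public Optional<byte[]> get(String key) {
        return Optional.empty();
    }

    @Override
    public boolean put(String key, byte[] value) {
        return false;
    }
}

/** File-backed cache that reports disk errors as misses or failed writes. */
final class FileCache implements SimpleCache {
    private final File dir;

    FileCache(File dir) {
        this.dir = dir;
    }

    @Override
    public Optional<byte[]> get(String key) {
        File file = new File(dir, key);
        try {
            return file.isFile()
                    ? Optional.of(Files.readAllBytes(file.toPath()))
                    : Optional.empty();
        } catch (IOException e) {
            return Optional.empty(); // a disk error looks like a cache miss
        }
    }

    @Override
    public boolean put(String key, byte[] value) {
        File file = new File(dir, key);
        try {
            Files.write(file.toPath(), value);
            return true;
        } catch (IOException e) {
            file.delete(); // best effort: don't leave a partial entry behind
            return false;
        }
    }
}

final class Caches {
    /** Never returns null: falls back to a stub if the directory can't be used. */
    static SimpleCache open(File dir) {
        if (dir == null || (!dir.isDirectory() && !dir.mkdirs()) || !dir.canWrite()) {
            return new StubCache();
        }
        return new FileCache(dir);
    }
}
```

With this shape, a caller writes `Caches.open(dir).put(key, bytes)` and treats a `false` result or an empty `Optional` as an ordinary miss, whether the cause is a missing entry, a full disk, or an unavailable directory.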

When we decided to build IgDiskCache, we focused on four main changes:

  • Simplify cache initialization and null-checking: support a stub cache instance when the disk cache is not available or accessible, so that we don’t need to check mDiskLruCache != null every time we want to use it.
  • Handle the IOExceptions smartly, as most of the exception handling logic (e.g. close cache entry, close input/output stream, discard the incomplete file) is reusable and there is no need to make the programmers handle all these edge cases themselves.
  • Flatten the cache, and remove the unnecessary level of Editors/Snapshots. This makes the cache entry’s commit/abort/close logic much cleaner and easier to read.
  • Prevent engineers from misusing the cache. This includes requiring null checks on cache entries after retrieving them from the cache, and ensuring that all time-consuming tasks (like cache initialization and close) are executed only on non-UI threads.
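One way to picture the flattened entry and the reusable exception handling is a single output stream that owns its own commit/abort logic. The sketch below is an assumption-laden illustration, not the real EditorOutputStream: the class name SketchEntryStream and the ".tmp" suffix are hypothetical. Data goes to a temporary file, commit() moves it into place, and abort() (or closing without committing) deletes it, so a partial file never becomes a cache entry.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

/** Hypothetical stand-in for a flat cache entry stream; not the IgDiskCache class. */
final class SketchEntryStream extends FilterOutputStream {
    private final File tmp;
    private final File target;
    private boolean finished;

    private SketchEntryStream(File tmp, File target) throws IOException {
        super(new FileOutputStream(tmp));
        this.tmp = tmp;
        this.target = target;
    }

    /** Returns an empty Optional instead of throwing when the entry can't be opened. */
    static Optional<SketchEntryStream> open(File target) {
        try {
            File tmp = new File(target.getPath() + ".tmp"); // assumed naming scheme
            return Optional.of(new SketchEntryStream(tmp, target));
        } catch (IOException e) {
            return Optional.empty();
        }
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // bulk write instead of FilterOutputStream's byte-by-byte default
    }

    /** Publishes the entry; returns false (and discards the data) on any failure. */
    boolean commit() {
        if (finished) {
            return false;
        }
        finished = true;
        try {
            out.close();
            Files.move(tmp.toPath(), target.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            tmp.delete();
            return false;
        }
    }

    /** Discards everything written so far. Safe to call more than once. */
    void abort() {
        if (finished) {
            return;
        }
        finished = true;
        try {
            out.close();
        } catch (IOException ignored) {
            // The partial file is deleted below either way.
        }
        tmp.delete();
    }

    /** Closing without committing is treated as an abort. */
    @Override
    public void close() {
        abort();
    }
}
```

Because the stream itself closes the file and cleans up the temporary copy, the call site shrinks to one presence check plus a try/catch that calls commit() or abort().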

From initial design to implementation, it took us about a month to build the first version of IgDiskCache, and a few more weeks to update all the call sites and test the module thoroughly. After we launched it in production, we were able to dramatically reduce the number of crashes in the app. Also, because of its stricter built-in checks, IgDiskCache helped us identify quite a few race conditions in our app that were extremely hard to detect otherwise. The UI thread checking also prevents engineers from executing slow disk IO operations on the main thread. The code is also much simpler and easier to reason about.

// Writing to cache using IgDiskCache
OptionalStream<EditorOutputStream> output = mIgDiskCache.edit(key);
if (output.isPresent()) {
    EditorOutputStream outputStream = output.get();
    try {
        writeFileToOutputStream(outputStream);
        outputStream.commit();
    } catch (IOException e) {
        outputStream.abort();
    }
}

Our story with IgDiskCache is a good example of how we tackle app reliability issues, and make our code cleaner and easier to maintain. We hope you’ll find it useful too!

We recently moved our mobile infrastructure engineering teams (iOS and Android) to New York City. If this blog post got you excited about what we’re doing, we’re hiring — check out our careers page.

Jimmy (He) Zhang is a software engineer at Instagram.




