Demo application for caching library

Introduction

Today's applications frequently operate on remote data sources, while at the same time users expect greater interactivity than last century's web-based applications offered. Bandwidth and server capacity are always increasing, but not fast enough to keep up with our users' needs. To make a program seem responsive to the user, it's often impractical to pay the cost of sending a query to a server and waiting for processing on every little user action. Fortunately, many applications don't require such treatment; once a request has been processed, the result can be saved and reused the next time it's needed.

This article presents an assembly that solves all of these problems, including support for multiple parameterized copies of the related data and versioning, in a multi-threaded, multi-user environment. The stored data can be either a DataSet, or anything that supports serializing to/from a stream.

Contents

The need for caching on the client side
Caching is more complex than it seems
Using the LocalCaching library
- Creating a CacheManager
- Storing and retrieving simple data
- Understanding parameters
- Understanding Date and Version
- Managing the cache
- Handling non-DataSet data
Designing the solution
Version History

The need for caching on the client side

Microsoft saw half of this problem, and provided us with the Caching Application Block. This is useful for taking computational or database retrieval load off of servers, but doesn't solve any bandwidth concerns. Using this, if your user wants to see, for example, an entire product catalog over and over again, it won't put undue strain on your servers. But if you've got a good-sized catalog it can still fill up your bandwidth quite easily.

Kirby Turner addressed this problem in his Code Project article Using Cache in Your WinForms Applications. He builds on Microsoft's solution to allow it to work on the client side. By storing the cached data on the client, no bandwidth at all needs to be used once the cache is populated. Unfortunately, the model he presents provides only a simplified concept of data in a cache.

Caching is more complex than it seems

When I first started working on this, I knew that I needed a way to put "a thing" into a cache, and later retrieve it. I soon discovered that my thinking of "a thing" was too fuzzy, and I needed to define exactly what would identify a cached item. I needed to find ways to distinguish between multiple instances of the same data, and also to track the version of the data "container" to handle upgrades to the server software.

Consider an e-commerce system in which a user wants to see product details. A naive implementation would simply store an item named "ProductDetail" when the user clicks the Details button for ProductX. That way, if the user clicks the button again, we can pull the data straight out of the cache. But what happens when the user changes to ProductQ? Clearly, saving ProductQ in the cache must not overwrite the cached details for ProductX. The LocalCaching solution develops a means of parameterization to address this.

Now suppose that the server-side code is changed, and as a result the "shape" of the data returned to the client is different. Perhaps it's a DataSet and a new column has been added to one of its DataTables. Or maybe there was a bug in the database code, and some rows were being incorrectly omitted. Such a change ought to invalidate any cached data having the old version. The LocalCaching solution provides a way to clear out all obsolete versions of "ProductDetail" when a newer version is detected.

If you're really concerned about the responsiveness of your application, you're probably already using multithreading, at least to keep the user interface from freezing while data is being retrieved. It's necessary to prevent corruption in the cache, so the cache manager must be thread safe. Moreover, if there are multiple instances of the application running simultaneously, we need to prevent them from corrupting each other's cache.

It's also possible for multiple users to share the same machine. In our hypothetical e-commerce application, we want to be sure that each user gets his or her own private answer to "AccountDetails". Partitioning the data by users also allows one user to clear out the cache completely without disturbing the cache of another user.

In my environment, most of the stuff to be cached is in the form of DataSets. However, there are also instances when data in other forms, such as XML, needs to be stored as well. The LocalCaching solution works most easily with DataSets, but it supports anything that can serialize itself to and from a stream.

Using the LocalCaching library

The library consists of a single assembly, LocalCaching.dll. You'll need to add this to your project's References. This assembly defines a single namespace, LocalCaching, for which you may want to add a using directive.

Creating a CacheManager

The first thing you've got to do is get yourself a CacheManager object. There are two constructors defined; calling the default one will place the cache in the folder C:\Documents and Settings\username\Local Settings\Application Data\LocalCaching.

private LocalCaching.ICacheService mCache;
// ...
mCache = new LocalCaching.CacheManager();
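
The default folder shown above looks like the per-user local application data folder with a LocalCaching subfolder. Assuming that is how it is derived (the article does not show the constructor's code), a path of that kind could be computed as in the following sketch. The class and method names here are hypothetical and are not part of the library.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of LocalCaching.dll.
public static class SketchCacheFolder
{
    // Builds a per-user cache folder path under the local application data folder.
    public static string DefaultFor(string libraryFolderName)
    {
        // Resolves to the current user's own folder, so two users sharing
        // a machine end up with separate cache locations.
        string perUserRoot = Environment.GetFolderPath(
            Environment.SpecialFolder.LocalApplicationData);
        return Path.Combine(perUserRoot, libraryFolderName);
    }
}
```

Calling SketchCacheFolder.DefaultFor("LocalCaching") yields a different folder for each logged-in user, which is what gives each user a private cache.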

Storing and retrieving simple data

Before we get into parameters and versions and stuff, let's take a look at how you can store and retrieve simple data. The following lines create a minimal date and version entry and an empty placeholder for data parameterization. Notice the first parameter of the storeDataset() method. This gives a name to the cache item, like "ProductDetail" and "AccountDetails" in the examples above. I usually use the type name of the DataSet I'm storing. The fourth parameter is the DataSet to be stored.

LocalCaching.DateAndVersion dav =
    new LocalCaching.DateAndVersion( DateTime.UtcNow, "no version" );
mCache.storeDataset( "MyCacheItemName", "", dav, dsDatasetToCache );

Pulling the data back out of the cache is slightly more complicated because, in order to support typed DataSets, we need to tell the library what type to load the data into.

DSDogOwners myds =
    mCache.retrieveDataset( "MyCacheItemName", "", typeof(DSDogOwners) )
    as DSDogOwners;

In the retrieveDataset() method we pass in the same name we gave when we stored the item, the parameters (empty in this case), and the type of object that the library should instantiate to load the data into. Because the library can't know ahead of time what type to return, the method just returns a DataSet object, so you need to cast it to the correct type.

You may want to simply check for the presence of a record without actually retrieving it. This is useful because it does return the Date and Version of that record. You can send this off to your Web Service when making a request; this gives the server the option of saying that "your data is still current; keep on using what you've got in your cache".

LocalCaching.DateAndVersion dav = 
        mCache.getDateAndVersion( "MyCacheItemName", "" );
if (dav == null)
{
   System.Windows.Forms.MessageBox.Show("Not found in cache");
}

Understanding parameters

It's frequently not possible to simply save and retrieve the cached details. When the data is "ProductDetail", as discussed above, we need to distinguish which product's details are being stored. I call this distinguishing data "Parameters" because it generally corresponds to the parameters of the web method that retrieves the data.

Tagging a type of data with the parameter values quickly turns into a complicated problem because the library can't know ahead of time what the types of the parameters are, or even how many to expect. The library addresses this by providing a ParamHash class. You can put your parameters into an instance of this object, and then use its ToString() method to extract a string representing a hash of all the parameters, usable for passing into the storeDataset() and retrieveDataset() methods.

Here's an example of using ParamHash to store a parameterized DataSet.

LocalCaching.DateAndVersion dav =
     new LocalCaching.DateAndVersion(dtTimestamp.Value,txtVersion.Text);
LocalCaching.ParamHash ph = new LocalCaching.ParamHash();
ph.Append( txtMyParameter.Text );
mCache.storeDataset( "MyCacheItemName", ph.ToString(), dav, dsDogOwners1 );

Once you've created an empty ParamHash object, you Append() your actual parameter values into it. You can do this one at a time by repeatedly calling Append(). You can also pass them in all at once by putting them into an object array and passing that.

One caveat is that ParamHash internally relies on your parameter's ToString() method. For this to work, any object you pass in must provide a meaningful implementation of this method. By meaningful, I mean that it must give a value that represents the state of that object. Returning, say, the object's type name wouldn't allow ParamHash to tell one value apart from another.
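
To make the idea concrete, here is a minimal sketch of how a parameter key of this kind could be built from ToString() values. It is not the library's ParamHash: the class name SketchParamKey, the AppendAll method, and the length-prefix scheme are all assumptions made for illustration.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for illustration only; not the library's ParamHash.
public class SketchParamKey
{
    private readonly StringBuilder mKey = new StringBuilder();

    // Appends one parameter, relying on its ToString() to describe its state.
    public SketchParamKey Append(object value)
    {
        string text = (value == null) ? "<null>" : value.ToString();
        // Length-prefix each value so that ("ab", "c") and ("a", "bc")
        // do not collapse into the same key.
        mKey.Append(text.Length).Append(':').Append(text).Append('|');
        return this;
    }

    // Appends several parameters at once, in order.
    public SketchParamKey AppendAll(object[] values)
    {
        foreach (object value in values)
        {
            Append(value);
        }
        return this;
    }

    // Returns the combined key for all parameters appended so far.
    public override string ToString()
    {
        return mKey.ToString();
    }
}
```

With this sketch, an object whose ToString() returns only its type name would contribute the same text no matter what it holds, so two different parameter values would map to the same key. That is the failure the caveat above warns about.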

Understanding Date and Version

The LocalCaching library includes a class DateAndVersion which is just a way to bundle these two pieces of information about a particular cached datum. Both of these help track the "freshness" of the data. To understand how they're useful, you need to step back and look at the big picture of both client and server.

Both the date and version are intended to be the values received from the server. This should be obvious for the version: since the server is the one building the data, it's the one that should have the authoritative versioning information for the code implementing that process. The date should also be supplied by the server because of the potential for (a) time zone differences, and (b) incorrect clock settings on the client.

The intended usage pattern for this follows:

Call getDateAndVersion() to check if we've already got a value; we discover we do have something cached.
Send this date and version to the server.
The server checks the version to see if it matches its current version; if not, get the requested data and send it to the client.
The server checks the date to see if the underlying data has changed; if newer data is available, send it to the client.
Otherwise, since the version and date are both still current, the server just sends back to the client a message saying that their cached data is still acceptable.
If the client received new data, store it in the cache along with a new DateAndVersion; otherwise, just retrieve the existing data from the cache.
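
The server's part of this exchange can be sketched as a small decision function. The article does not show the server code, so everything here (the CacheVerdict enum, the FreshnessCheck class, and the parameter names) is hypothetical and only illustrates the order of the checks.

```csharp
using System;

// Hypothetical names throughout; the article does not show the server side.
public enum CacheVerdict
{
    SendFreshData,   // the client's copy is obsolete or stale
    KeepCachedCopy   // the client's copy is still acceptable
}

public static class FreshnessCheck
{
    // clientVersion and clientDateUtc are the values the client read from its cache.
    // serverVersion is the version of the code currently producing the data.
    // lastChangedUtc is when the underlying data last changed on the server.
    public static CacheVerdict Decide(
        string clientVersion, DateTime clientDateUtc,
        string serverVersion, DateTime lastChangedUtc)
    {
        // A version mismatch always invalidates the cached copy.
        if (clientVersion != serverVersion)
        {
            return CacheVerdict.SendFreshData;
        }
        // Same version, but the data changed after the client cached it.
        if (lastChangedUtc > clientDateUtc)
        {
            return CacheVerdict.SendFreshData;
        }
        return CacheVerdict.KeepCachedCopy;
    }
}
```

Both dates are server-supplied values in this sketch, so the comparison does not depend on the client's clock.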

Note that we frequently cheat on the server when checking to see if the data has changed. In some cases we don't track when the data has changed, and even if we do, sometimes it's just too expensive to check. In these cases we just assign a time threshold based on how volatile the data is; if the client's cache is younger than that, we just assume that it's still fresh.

Similarly, we use this on the client. I've noticed a phenomenon of users quickly clicking back and forth between two pages. When they're doing this, even the overhead of the simple call described above can be excessive. To handle this scenario I define a very small threshold, usually between 3 and 60 seconds. If the client-side data is younger than that small threshold (you can use the Age property to check), then we simply assume that the data is still good. Note that this can only work if you have confidence in the client clock.
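
A threshold test of this kind might look like the sketch below. It does not use the library's Age property, whose exact type the article does not state; instead it computes the age from two timestamps. The ClientFreshness class and its parameter names are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration; not part of LocalCaching.dll.
public static class ClientFreshness
{
    // storedUtc is the date saved with the cache entry.
    // nowUtc is the current time as seen by whoever is doing the check.
    public static bool IsYoungEnough(DateTime storedUtc, DateTime nowUtc, TimeSpan threshold)
    {
        TimeSpan age = nowUtc - storedUtc;
        // A negative age means the clocks disagree, so the entry is not trusted.
        return age >= TimeSpan.Zero && age < threshold;
    }
}
```

For the back-and-forth clicking case, a call such as IsYoungEnough(stored, DateTime.UtcNow, TimeSpan.FromSeconds(30)) would let the client skip the server round trip entirely while the entry is under 30 seconds old.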

Managing the cache

Knowing what to clear out of the cache, and when, is just as important as knowing how to add and retrieve the data. There are at least four reasons that you'll want to clear portions of the cache:

The server tells you that the date for the current data is too old because a change has been made since. In this case you want to surgically remove just the one item that's old by using the method unStoreCache(string name,string paramhash).
The client has updated data. When you know an update is occurring, you should retrieve a fresh copy. This ensures that anything added or changed by the server is reflected, anything that didn't update due to errors can be discovered, and any optimistic locking tokens are updated. For this scenario use unStoreCache(string name,string paramhash) to clear out just the affected instance of the specific data. If the change was more significant, that is, it may affect more than just what's indicated in the parameters, you may want to call clearCache(string name) instead, to clear out all of that type of data.
The cache is getting old. You may want to clear out any dead wood just because it's taking up space. For this purpose use the method clearCache(DateTime minbirth) to specify the oldest date that you want to keep. Everything older than that, regardless of name and parameters, will be cleared out. Or, if you really want to clear out everything, use clearCache(), the nuclear option.
The server tells you that the version you've got stored is obsolete. In this case you want to get rid of everything with the tuple [name, version], so call clearCache( string name, string GoodVersion ) to clear out everything having that name but a version different from what the server says is current.
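
The selection rules behind the date-based and version-based clearing can be illustrated with an in-memory list. This is only a sketch of the filtering logic under assumed names (SketchEntry, SketchPurge); the library itself keeps its index on disk and its internal code is not shown in this article.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index entry for illustration; not the library's own type.
public class SketchEntry
{
    public string Name = "";
    public string ParamHash = "";
    public string Version = "";
    public DateTime StoredUtc;
}

public static class SketchPurge
{
    // Removes every entry with this name whose version differs from goodVersion.
    // Returns the number of entries removed.
    public static int ClearObsoleteVersions(
        List<SketchEntry> index, string name, string goodVersion)
    {
        return index.RemoveAll(e => e.Name == name && e.Version != goodVersion);
    }

    // Removes every entry older than minBirthUtc, regardless of name or parameters.
    // Returns the number of entries removed.
    public static int ClearOlderThan(List<SketchEntry> index, DateTime minBirthUtc)
    {
        return index.RemoveAll(e => e.StoredUtc < minBirthUtc);
    }
}
```

Note that the version rule keeps entries of other names untouched even if their versions differ, while the date rule ignores names entirely.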

Handling non-DataSet data

While this library has been used most frequently for caching DataSets, it's equally happy handling anything that can serialize itself to and from a Stream. This works just like storing and retrieving DataSets, except you use

getStoreStream() when you want to store something; it gives you a stream object that you can write to.
retrieveStream() to retrieve it again; it returns a stream object that you can read from.
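
As an example of "anything that can serialize itself to and from a Stream", here is a sketch that round-trips an XmlDocument. Because the exact signatures of getStoreStream() and retrieveStream() are not given above, a MemoryStream stands in for the streams the library would hand you. The StreamRoundTrip class is hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; a MemoryStream stands in for the
// streams that the cache would supply.
public static class StreamRoundTrip
{
    // Writes the document to any writable stream.
    public static void Save(XmlDocument doc, Stream target)
    {
        doc.Save(target);
    }

    // Reads a document back from any readable stream.
    public static XmlDocument Load(Stream source)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(source);
        return doc;
    }

    // Saves and reloads a small document, returning the reloaded copy.
    public static XmlDocument Demo()
    {
        XmlDocument original = new XmlDocument();
        original.LoadXml("<dogs><dog name=\"Rex\" /></dogs>");
        using (MemoryStream buffer = new MemoryStream())
        {
            Save(original, buffer);
            buffer.Position = 0; // rewind before reading back
            return Load(buffer);
        }
    }
}
```

With the real library, the write would presumably go to the stream obtained for storing and the read would come from the retrieved stream, so the rewind step would not be needed.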

Designing the solution

I hope that the discussion above is sufficient for using the library. Still, if you understand some of what's going on inside, you may find it easier to solve the problems that inevitably arise.

What actually gets stored?

There are two things that need to get stored: the actual data to be cached, and an index to the cache so that we can figure out what's been stored where.

I elected to store each record in its own file. This makes it easy to store any serializable object. It also makes it easy to see exactly what's in the cache with normal tools. The name of the file that the data will go into starts with the name given for the record, but to ensure unique names for every entry I add a GUID onto that. You'll note that it's possible to pass "nasty" names to make this scheme fail. Don't do that.
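
A naming scheme along these lines could be sketched as follows. The underscore separator, the ".dat" extension, and the SketchFileNames class are assumptions; the article only says that the file name starts with the record name and has a GUID added. The second method shows one way a caller might detect a "nasty" name before passing it in.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not the library's naming code.
public static class SketchFileNames
{
    // Builds a unique data file name that starts with the record name.
    public static string NewDataFileName(string recordName)
    {
        // Guid format "N" is 32 hex digits with no punctuation.
        return recordName + "_" + Guid.NewGuid().ToString("N") + ".dat";
    }

    // True if the name contains characters that are not legal in a file name.
    public static bool IsNastyName(string recordName)
    {
        return recordName.IndexOfAny(Path.GetInvalidFileNameChars()) >= 0;
    }
}
```

Because each call generates a fresh GUID, storing "ProductDetail" for ProductX and again for ProductQ gives two distinct files that are still easy to recognize by their shared prefix.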

The cache index goes into the same directory, with a special name "ContentIndex.xml". You can open this file up yourself to see what's being cached.

Maintaining the index

The index to the cache is itself a DataSet. In some ways a purpose-built collection of objects might have been preferable. However, this way let me build it fast, mostly by giving me the ability to filter with ease.

Thread safety

There's no finesse in the way I handle threading. I have one static lock object that I use any time something might affect the index file, so it's more like a simple semaphore. Only one thread at a time can do something that would change the cache; the others have to wait. In practice this doesn't seem to cause any performance issues.

Multi-process safety

In our environment it's quite possible for multiple instances of the application to run simultaneously. When this happens I want all of them to benefit from the same cache, but at the same time they have to be careful not to step on each other's toes. This is achieved by not keeping the index in memory. Every time it's used, it's read in from disk, and every time it's changed, it's rewritten. In this way each process is always refreshing itself with changes from the others.
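
The combination of a single static lock and a reread-then-rewrite index can be sketched as below. This is not the library's code: the SketchIndex class, the "Entry" table, and its two columns are assumptions chosen to keep the example short. It uses the standard DataSet.ReadXml and DataSet.WriteXml methods.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical index wrapper for illustration; not the library's implementation.
public static class SketchIndex
{
    // One lock for every operation that touches the index file.
    private static readonly object sLock = new object();

    // Adds one row, rereading the file first and rewriting it afterwards.
    public static void AddEntry(string indexPath, string name, string fileName)
    {
        lock (sLock)
        {
            DataSet index = Load(indexPath);
            index.Tables["Entry"].Rows.Add(name, fileName);
            index.WriteXml(indexPath, XmlWriteMode.WriteSchema);
        }
    }

    // Counts the rows for one name, using the DataTable's built-in filtering.
    public static int CountEntries(string indexPath, string name)
    {
        lock (sLock)
        {
            DataSet index = Load(indexPath);
            string filter = "Name = '" + name.Replace("'", "''") + "'";
            return index.Tables["Entry"].Select(filter).Length;
        }
    }

    // Always reads from disk so that changes made by other processes are seen.
    private static DataSet Load(string indexPath)
    {
        DataSet index = new DataSet("ContentIndex");
        if (File.Exists(indexPath))
        {
            index.ReadXml(indexPath, XmlReadMode.ReadSchema);
        }
        else
        {
            DataTable table = index.Tables.Add("Entry");
            table.Columns.Add("Name", typeof(string));
            table.Columns.Add("FileName", typeof(string));
        }
        return index;
    }
}
```

The lock only serializes threads inside one process. Two processes can still interleave between the read and the write, which is the small lost-update window discussed next.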

There is one potential hole in this. There is a very small window for lost updates, allowing one process to overwrite a change made by another. I've never seen this actually happen in the real world. Even if it did, it's not worth worrying about here. Since all we're implementing here is a performance enhancement, losing a cache entry will only result in losing the performance enhancement for that record.

Stuff I'm not satisfied with

I don't like what needs to be done to retrieve typed DataSets. The need to both pass in the type of the DataSet (so that the library knows what type to instantiate) and to cast it when it's returned (because the method must be defined as returning the base DataSet class) is ugly. Unfortunately I don't see any less-ugly alternatives.

Version History

1.0.0.0 - 2005-01-26
Initial release