Introduction

A set is a collection where each value appears at most once. It is a very useful data structure in many different types of applications, and yet unlike the Java libraries, the .NET Framework provides no built-in Set interface or implementations. I find that in almost every project I work on, I need a set, and I usually base it on the existing .NET Hashtable data structure.

In this article, I present a different type of set implementation, the "range set", which I have also seen called the "extent set". Instead of actually storing each element, it stores ranges, where every value between the low and high values of the range inclusive is considered a member of this set. Thus, it is only applicable to integral data types. The range set has the interesting property that adding more elements can actually -shrink- the amount of memory used by the set as longer consecutive runs of integers begin to appear.

Background

This article assumes a basic familiarity with C# programming as well as some knowledge of data structures. You might also find this article interesting if you'd like an example of a non-trivial IEnumerator implementation.

RangeSet Usage

The RangeSet class is quite easy to use. It supports the standard Add(int) and Remove(int) operations to add and remove elements from the set, as well as a Contains(int) method to check for set membership. There is also an Add() overload that accepts an array or param array of integers which you should use when adding multiple values to the set at the same time because it enables a special-case performance optimization discussed in more detail below. The Count property gets the number of elements in the set, and the Clear() method removes all elements. The TrimToSize() method removes all extra space allocated by the RangeSet's internal data structures; call it after you populate the set and intend to leave it unmodified for some period of time. The RangeSet is also serializable. It is not, however, thread-safe under modification.

RangeSet Implementation

The underlying data structure of the RangeSet is an ArrayList. This ArrayList stores one of two different types of structures, Range and SingleElementRange, that implement the IRange interface:

/// <summary>
/// Interface for a range in the rangeset.
/// </summary>
private interface IRange
{
    /// <summary>
    /// The low end of the range.
    /// </summary>
    int Low
    {
        get;
    }

    /// <summary>
    /// The high end of the range.
    /// </summary>
    int High
    {
        get;
    }

    /// <summary>
    /// Whether or not the range contains the given element.
    /// </summary>
    /// <param name="element"></param>
    /// <returns></returns>
    bool Contains( int element );
}

The Range class stores two integers, namely the low and high values (inclusive) of the range. When the range only contains one element (i.e. Low == High), we have an additional opportunity to save space by using the SingleElementRange class. SingleElementRange only stores one integer which is the low, high, and only value in the range.

The IRanges are always disjoint and maintained in ascending sorted order in the underlying ArrayList so that we can efficiently search on them. To check to see if a given element is a member of the set, we perform a binary search of the IRanges in the ArrayList. We cannot use the ArrayList.BinarySearch method but rather must do this manually since, by definition, the element we are looking for is not necessarily -in- the set but rather may be implied by an overlapping range. The relevant code is in FindIndexOfRangeFor:

/// <summary>
/// Find the index of the range containing
/// the given element currently in the list, if any.
/// </summary>
/// <param name="element">The element we're looking for.</param>
/// <returns>The index of the containing range,
///     or the appropriate insertion point (a negative number)
///     if there isn't any.</returns>
private int FindIndexOfRangeFor( int element )
{
    if ( rangeList.Count == 0 )
    {
        return -1;
    }

    int low = 0;
    int high = rangeList.Count - 1;
    while ( low <= high )
    {
        int middle = (low == high) ? low : ((low + high) / 2);
        IRange curRange = (IRange) rangeList[ middle ];

        if( curRange.Contains( element ) )
        {
            return middle;
        }
        else if( curRange.Low > element )
        {
            high = middle - 1;
        }
        else if( curRange.High < element )
        {
            low = middle + 1;
        }
        else
        {
            return -middle;
        }
    }
    
    // Standard data structures hack to indicate
    // the appropriate insertion point as a part of this call.
    return -(low+1);
}

When elements are added to and removed from the set, we must find, adjust the bounds, potentially split an existing range or convert between a Range and a SingleElementRange, or create a new range. I refer you to the relevant code in RangeSet.cs to see how each of these cases is handled.

RangeSet Performance Characteristics

My primary goal in the implementation of the RangeSet was to minimize the space it took to store and check for membership in large, dense sets of integers in memory. This benefit comes at the expense of certain other performance characteristics. In particular, the Add() and Remove() operations are not in general constant-time because they may require copying blocks of array elements up and down. An implementation backed by a linked list instead of an array list would not have this issue, but I did not want to incur the overhead of linked list pointers, which would be significant in less and moderately packed sets. The Contains() method is efficient as it uses a binary search across the ranges.

Thus, it is quite possible that if you need a set, the RangeSet may not be the right tool for the job. If you are frequently adding or removing elements, a hashtable-backed set, a tree-backed set, or a bitset (assuming you know the range of possible values and it is not prohibitively large) may be more appropriate. However, the RangeSet can excel in cases where you load up a large number of integers of arbitrary size and primarily call the Contains() method, with relatively infrequent Add()s and Remove()s.

One special case where adding elements can be optimized is when the set is first created or initialized. In that case, we sort the elements, which allows us to create the correct ranges in ascending order without the possibility of having to expand or shrink existing ranges. Thus, when you know in advance all or a large portion of the elements you want to add to the set (as is often the case when loading a set of IDs from a database, for example), you should use the RangeSet constructor that accepts an int param array or, alternatively, call the param-array-accepting Add() overload before adding any other element.

A naive implementation of the Count property would require a scan of all of the ranges in the set. To avoid this inefficiency, we keep a cached count of all of the elements in the set which is maintained by all of the methods that add and remove elements.

Note that an implementation using a standard .NET array of a single struct containing the low and high values as the underlying data structure rather than an ArrayList would cut down the memory requirements further due to elimination of struct boxing. I have prototyped that code and in the future may include it in this distribution.

Using The Code

The associated code download contains the RangeSet implementation (RangeSet.cs) in a library that includes a suite of NUnit-based unit tests. If you would like to compile this assembly and/or run the tests, you will need a recent version of NUnit, available here, and you will also have to delete and re-add the nunit.framework reference in the RangeSetLib project. If you don't want to build the library project or run the tests, just add RangeSet.cs to your own projects.

The test application is a console application that I used to evaluate the memory performance. It creates sets of random numbers in the range 0 to 199,999 with a known percentage density. The density is varied from 5% (10,000 elements) to 95% (190,000 elements). These numbers are added to both a hashtable-based set and a RangeSet, and the total memory consumption as reported by the garbage collector is collected for each set. The results of this experiment appear in the image at the top of the page. As expected, the RangeSet's memory usage decreases for fill percentages over about 50%, and it is consistently less than that of the hashtable-based set -- almost a full order of magnitude less for sets consisting of 95% of the possible elements.

Future Directions and Possible Enhancements

A RangeSet implementation that makes use of .NET 2.0 generics to enable integral data types other than Int32.
Explore further optimizations when adding multiple elements to the set at the same time.
A synchronized wrapper to permit thread-safety under modification.
Array-backed storage of ranges as discussed earlier.

History

Initial release: 04/20/2005.