## Introduction

A set is a collection where each value appears at most once. It is a very useful data structure in many different types of applications, and yet unlike the Java libraries, the .NET Framework provides no built-in Set interface or implementations. I find that in almost every project I work on, I need a set, and I usually base it on the existing .NET `Hashtable`

data structure.

In this article, I present a different type of set implementation, the "range set", which I have also seen called the "extent set". Instead of actually storing each element, it stores ranges, where every value between the low and high values of the range inclusive is considered a member of this set. Thus, it is only applicable to integral data types. The range set has the interesting property that adding more elements can actually -shrink- the amount of memory used by the set as longer consecutive runs of integers begin to appear.

## Background

This article assumes a basic familiarity with C# programming as well as some knowledge of data structures. You might also find this article interesting if you'd like an example of a non-trivial `IEnumerator`

implementation.

## RangeSet Usage

The `RangeSet`

class is quite easy to use. It supports the standard `Add(int)`

and `Remove(int)`

operations to add and remove elements from the set, as well as a `Contains(int)`

method to check for set membership. There is also an `Add()`

overload that accepts an array or param array of integers which you should use when adding multiple values to the set at the same time because it enables a special-case performance optimization discussed in more detail below. The `Count`

property gets the number of elements in the set, and the `Clear()`

method removes all elements. The `TrimToSize()`

method removes all extra space allocated by the `RangeSet`

's internal data structures; call it after you populate the set and intend to leave it unmodified for some period of time. The `RangeSet`

is also serializable. It is **not**, however, thread-safe under modification.

## RangeSet Implementation

The underlying data structure of the `RangeSet`

is an `ArrayList`

. This `ArrayList`

stores one of two different types of structures, `Range`

and `SingleElementRange`

, that implement the `IRange`

interface:

private interface IRange
{
int Low
{
get;
}
int High
{
get;
}
bool Contains( int element );
}

The `Range`

class stores two integers, namely the low and high values (inclusive) of the range. When the range only contains one element (i.e. `Low == High`

), we have an additional opportunity to save space by using the `SingleElementRange`

class. `SingleElementRange`

only stores one integer which is the low, high, and only value in the range.

The `IRange`

s are always disjoint and maintained in ascending sorted order in the underlying `ArrayList`

so that we can efficiently search on them. To check to see if a given element is a member of the set, we perform a binary search of the `IRange`

s in the `ArrayList`

. We cannot use the `ArrayList.BinarySearch`

method but rather must do this manually since, by definition, the element we are looking for is not necessarily -in- the set but rather may be implied by an overlapping range. The relevant code is in `FindIndexOfRangeFor`

:

private int FindIndexOfRangeFor( int element )
{
if ( rangeList.Count == 0 )
{
return -1;
}
int low = 0;
int high = rangeList.Count - 1;
while ( low <= high )
{
int middle = (low == high) ? low : ((low + high) / 2);
IRange curRange = (IRange) rangeList[ middle ];
if( curRange.Contains( element ) )
{
return middle;
}
else if( curRange.Low > element )
{
high = middle - 1;
}
else if( curRange.High < element )
{
low = middle + 1;
}
else
{
return -middle;
}
}
return -(low+1);
}

When elements are added to and removed from the set, we must find, adjust the bounds, potentially split an existing range or convert between a `Range`

and a `SingleElementRange`

, or create a new range. I refer you to the relevant code in *RangeSet.cs* to see how each of these cases is handled.

## RangeSet Performance Characteristics

My primary goal in the implementation of the `RangeSet`

was to minimize the space it took to store and check for membership in large, dense sets of integers in memory. This benefit comes at the expense of certain other performance characteristics. In particular, the `Add()`

and `Remove()`

operations are not in general constant-time because they may require copying blocks of array elements up and down. An implementation backed by a linked list instead of an array list would not have this issue, but I did not want to incur the overhead of linked list pointers, which would be significant in less and moderately packed sets. The `Contains()`

method is efficient as it uses a binary search across the ranges.

Thus, it is quite possible that if you need a set, the `RangeSet`

may not be the right tool for the job. If you are frequently adding or removing elements, a hashtable-backed set, a tree-backed set, or a bitset (assuming you know the range of possible values and it is not prohibitively large) may be more appropriate. However, the `RangeSet`

can excel in cases where you load up a large number of integers of arbitrary size and primarily call the `Contains()`

method, with relatively infrequent `Add()`

s and `Remove()`

s.

One special case where adding elements can be optimized is when the set is first created or initialized. In that case, we sort the elements, which allows us to create the correct ranges in ascending order without the possibility of having to expand or shrink existing ranges. Thus, when you know in advance all or a large portion of the elements you want to add to the set (as is often the case when loading a set of IDs from a database, for example), you should use the `RangeSet`

constructor that accepts an `int`

param array or, alternatively, call the param-array-accepting `Add()`

overload before adding any other element.

A naive implementation of the `Count`

property would require a scan of all of the ranges in the set. To avoid this inefficiency, we keep a cached count of all of the elements in the set which is maintained by all of the methods that add and remove elements.

Note that an implementation using a standard .NET array of a single `struct`

containing the low and high values as the underlying data structure rather than an `ArrayList`

would cut down the memory requirements further due to elimination of `struct`

boxing. I have prototyped that code and in the future may include it in this distribution.

## Using The Code

The associated code download contains the `RangeSet`

implementation (*RangeSet.cs*) in a library that includes a suite of NUnit-based unit tests. If you would like to compile this assembly and/or run the tests, you will need a recent version of NUnit, available here, and you will also have to delete and re-add the *nunit.framework* reference in the RangeSetLib project. If you don't want to build the library project or run the tests, just add *RangeSet.cs* to your own projects.

The test application is a console application that I used to evaluate the memory performance. It creates sets of random numbers in the range 0 to 199,999 with a known percentage density. The density is varied from 5% (10,000 elements) to 95% (190,000 elements). These numbers are added to both a hashtable-based set and a `RangeSet`

, and the total memory consumption as reported by the garbage collector is collected for each set. The results of this experiment appear in the image at the top of the page. As expected, the `RangeSet`

's memory usage decreases for fill percentages over about 50%, and it is consistently less than that of the hashtable-based set -- almost a full order of magnitude less for sets consisting of 95% of the possible elements.

## Future Directions and Possible Enhancements

- A
`RangeSet`

implementation that makes use of .NET 2.0 generics to enable integral data types other than `Int32`

.
- Explore further optimizations when adding multiple elements to the set at the same time.
- A synchronized wrapper to permit thread-safety under modification.
- Array-backed storage of ranges as discussed earlier.

## History

- Initial release: 04/20/2005.