A set is a collection where each value appears at most once. It is a very useful data structure in many different types of applications, and yet unlike the Java libraries, the .NET Framework provides no built-in Set interface or implementations. I find that in almost every project I work on, I need a set, and I usually base it on the existing .NET
Hashtable data structure.
In this article, I present a different type of set implementation, the "range set", which I have also seen called the "extent set". Instead of actually storing each element, it stores ranges, where every value between the low and high values of the range inclusive is considered a member of this set. Thus, it is only applicable to integral data types. The range set has the interesting property that adding more elements can actually -shrink- the amount of memory used by the set as longer consecutive runs of integers begin to appear.
This article assumes a basic familiarity with C# programming as well as some knowledge of data structures. You might also find this article interesting if you'd like an example of a non-trivial
RangeSet class is quite easy to use. It supports the standard
Remove(int) operations to add and remove elements from the set, as well as a
Contains(int) method to check for set membership. There is also an
Add() overload that accepts an array or param array of integers which you should use when adding multiple values to the set at the same time because it enables a special-case performance optimization discussed in more detail below. The
Count property gets the number of elements in the set, and the
Clear() method removes all elements. The
TrimToSize() method removes all extra space allocated by the
RangeSet's internal data structures; call it after you populate the set and intend to leave it unmodified for some period of time. The
RangeSet is also serializable. It is not, however, thread-safe under modification.
The underlying data structure of the
RangeSet is an
ArrayList stores one of two different types of structures,
SingleElementRange, that implement the
private interface IRange
bool Contains( int element );
Range class stores two integers, namely the low and high values (inclusive) of the range. When the range only contains one element (i.e.
Low == High), we have an additional opportunity to save space by using the
SingleElementRange only stores one integer which is the low, high, and only value in the range.
IRanges are always disjoint and maintained in ascending sorted order in the underlying
ArrayList so that we can efficiently search on them. To check to see if a given element is a member of the set, we perform a binary search of the
IRanges in the
ArrayList. We cannot use the
ArrayList.BinarySearch method but rather must do this manually since, by definition, the element we are looking for is not necessarily -in- the set but rather may be implied by an overlapping range. The relevant code is in
private int FindIndexOfRangeFor( int element )
if ( rangeList.Count == 0 )
int low = 0;
int high = rangeList.Count - 1;
while ( low <= high )
int middle = (low == high) ? low : ((low + high) / 2);
IRange curRange = (IRange) rangeList[ middle ];
if( curRange.Contains( element ) )
else if( curRange.Low > element )
high = middle - 1;
else if( curRange.High < element )
low = middle + 1;
When elements are added to and removed from the set, we must find, adjust the bounds, potentially split an existing range or convert between a
Range and a
SingleElementRange, or create a new range. I refer you to the relevant code in RangeSet.cs to see how each of these cases is handled.
RangeSet Performance Characteristics
My primary goal in the implementation of the
RangeSet was to minimize the space it took to store and check for membership in large, dense sets of integers in memory. This benefit comes at the expense of certain other performance characteristics. In particular, the
Remove() operations are not in general constant-time because they may require copying blocks of array elements up and down. An implementation backed by a linked list instead of an array list would not have this issue, but I did not want to incur the overhead of linked list pointers, which would be significant in less and moderately packed sets. The
Contains() method is efficient as it uses a binary search across the ranges.
Thus, it is quite possible that if you need a set, the
RangeSet may not be the right tool for the job. If you are frequently adding or removing elements, a hashtable-backed set, a tree-backed set, or a bitset (assuming you know the range of possible values and it is not prohibitively large) may be more appropriate. However, the
RangeSet can excel in cases where you load up a large number of integers of arbitrary size and primarily call the
Contains() method, with relatively infrequent
One special case where adding elements can be optimized is when the set is first created or initialized. In that case, we sort the elements, which allows us to create the correct ranges in ascending order without the possibility of having to expand or shrink existing ranges. Thus, when you know in advance all or a large portion of the elements you want to add to the set (as is often the case when loading a set of IDs from a database, for example), you should use the
RangeSet constructor that accepts an
int param array or, alternatively, call the param-array-accepting
Add() overload before adding any other element.
A naive implementation of the
Count property would require a scan of all of the ranges in the set. To avoid this inefficiency, we keep a cached count of all of the elements in the set which is maintained by all of the methods that add and remove elements.
Note that an implementation using a standard .NET array of a single
struct containing the low and high values as the underlying data structure rather than an
ArrayList would cut down the memory requirements further due to elimination of
struct boxing. I have prototyped that code and in the future may include it in this distribution.
Using The Code
The associated code download contains the
RangeSet implementation (RangeSet.cs) in a library that includes a suite of NUnit-based unit tests. If you would like to compile this assembly and/or run the tests, you will need a recent version of NUnit, available here, and you will also have to delete and re-add the nunit.framework reference in the RangeSetLib project. If you don't want to build the library project or run the tests, just add RangeSet.cs to your own projects.
The test application is a console application that I used to evaluate the memory performance. It creates sets of random numbers in the range 0 to 199,999 with a known percentage density. The density is varied from 5% (10,000 elements) to 95% (190,000 elements). These numbers are added to both a hashtable-based set and a
RangeSet, and the total memory consumption as reported by the garbage collector is collected for each set. The results of this experiment appear in the image at the top of the page. As expected, the
RangeSet's memory usage decreases for fill percentages over about 50%, and it is consistently less than that of the hashtable-based set -- almost a full order of magnitude less for sets consisting of 95% of the possible elements.
Future Directions and Possible Enhancements
RangeSet implementation that makes use of .NET 2.0 generics to enable integral data types other than
- Explore further optimizations when adding multiple elements to the set at the same time.
- A synchronized wrapper to permit thread-safety under modification.
- Array-backed storage of ranges as discussed earlier.
- Initial release: 04/20/2005.