
Introduction
A set is a collection where each value appears at most once. It is a very useful data structure in many different types of applications, and yet unlike the Java libraries, the .NET Framework provides no built-in Set interface or implementations. I find that in almost every project I work on, I need a set, and I usually base it on the existing .NET Hashtable data structure.
In this article, I present a different type of set implementation, the "range set", which I have also seen called the "extent set". Instead of actually storing each element, it stores ranges, where every value between the low and high values of the range inclusive is considered a member of this set. Thus, it is only applicable to integral data types. The range set has the interesting property that adding more elements can actually -shrink- the amount of memory used by the set as longer consecutive runs of integers begin to appear.
Background
This article assumes a basic familiarity with C# programming as well as some knowledge of data structures. You might also find this article interesting if you'd like an example of a non-trivial IEnumerator implementation.
RangeSet Usage
The RangeSet class is quite easy to use. It supports the standard Add(int) and Remove(int) operations to add and remove elements from the set, as well as a Contains(int) method to check for set membership. There is also an Add() overload that accepts an array or param array of integers which you should use when adding multiple values to the set at the same time because it enables a special-case performance optimization discussed in more detail below. The Count property gets the number of elements in the set, and the Clear() method removes all elements. The TrimToSize() method removes all extra space allocated by the RangeSet's internal data structures; call it after you populate the set and intend to leave it unmodified for some period of time. The RangeSet is also serializable. It is not, however, thread-safe under modification.
RangeSet Implementation
The underlying data structure of the RangeSet is an ArrayList. This ArrayList stores one of two different types of structures, Range and SingleElementRange, that implement the IRange interface:
private interface IRange
{
int Low
{
get;
}
int High
{
get;
}
bool Contains( int element );
}
The Range class stores two integers, namely the low and high values (inclusive) of the range. When the range only contains one element (i.e. Low == High), we have an additional opportunity to save space by using the SingleElementRange class. SingleElementRange only stores one integer which is the low, high, and only value in the range.
The IRanges are always disjoint and maintained in ascending sorted order in the underlying ArrayList so that we can efficiently search on them. To check to see if a given element is a member of the set, we perform a binary search of the IRanges in the ArrayList. We cannot use the ArrayList.BinarySearch method but rather must do this manually since, by definition, the element we are looking for is not necessarily -in- the set but rather may be implied by an overlapping range. The relevant code is in FindIndexOfRangeFor:
private int FindIndexOfRangeFor( int element )
{
if ( rangeList.Count == 0 )
{
return -1;
}
int low = 0;
int high = rangeList.Count - 1;
while ( low <= high )
{
int middle = (low == high) ? low : ((low + high) / 2);
IRange curRange = (IRange) rangeList[ middle ];
if( curRange.Contains( element ) )
{
return middle;
}
else if( curRange.Low > element )
{
high = middle - 1;
}
else if( curRange.High < element )
{
low = middle + 1;
}
else
{
return -middle;
}
}
return -(low+1);
}
When elements are added to and removed from the set, we must find, adjust the bounds, potentially split an existing range or convert between a Range and a SingleElementRange, or create a new range. I refer you to the relevant code in RangeSet.cs to see how each of these cases is handled.
RangeSet Performance Characteristics
My primary goal in the implementation of the RangeSet was to minimize the space it took to store and check for membership in large, dense sets of integers in memory. This benefit comes at the expense of certain other performance characteristics. In particular, the Add() and Remove() operations are not in general constant-time because they may require copying blocks of array elements up and down. An implementation backed by a linked list instead of an array list would not have this issue, but I did not want to incur the overhead of linked list pointers, which would be significant in less and moderately packed sets. The Contains() method is efficient as it uses a binary search across the ranges.
Thus, it is quite possible that if you need a set, the RangeSet may not be the right tool for the job. If you are frequently adding or removing elements, a hashtable-backed set, a tree-backed set, or a bitset (assuming you know the range of possible values and it is not prohibitively large) may be more appropriate. However, the RangeSet can excel in cases where you load up a large number of integers of arbitrary size and primarily call the Contains() method, with relatively infrequent Add()s and Remove()s.
One special case where adding elements can be optimized is when the set is first created or initialized. In that case, we sort the elements, which allows us to create the correct ranges in ascending order without the possibility of having to expand or shrink existing ranges. Thus, when you know in advance all or a large portion of the elements you want to add to the set (as is often the case when loading a set of IDs from a database, for example), you should use the RangeSet constructor that accepts an int param array or, alternatively, call the param-array-accepting Add() overload before adding any other element.
A naive implementation of the Count property would require a scan of all of the ranges in the set. To avoid this inefficiency, we keep a cached count of all of the elements in the set which is maintained by all of the methods that add and remove elements.
Note that an implementation using a standard .NET array of a single struct containing the low and high values as the underlying data structure rather than an ArrayList would cut down the memory requirements further due to elimination of struct boxing. I have prototyped that code and in the future may include it in this distribution.
Using The Code
The associated code download contains the RangeSet implementation (RangeSet.cs) in a library that includes a suite of NUnit-based unit tests. If you would like to compile this assembly and/or run the tests, you will need a recent version of NUnit, available here, and you will also have to delete and re-add the nunit.framework reference in the RangeSetLib project. If you don't want to build the library project or run the tests, just add RangeSet.cs to your own projects.
The test application is a console application that I used to evaluate the memory performance. It creates sets of random numbers in the range 0 to 199,999 with a known percentage density. The density is varied from 5% (10,000 elements) to 95% (190,000 elements). These numbers are added to both a hashtable-based set and a RangeSet, and the total memory consumption as reported by the garbage collector is collected for each set. The results of this experiment appear in the image at the top of the page. As expected, the RangeSet's memory usage decreases for fill percentages over about 50%, and it is consistently less than that of the hashtable-based set -- almost a full order of magnitude less for sets consisting of 95% of the possible elements.
Future Directions and Possible Enhancements
- A
RangeSet implementation that makes use of .NET 2.0 generics to enable integral data types other than Int32.
- Explore further optimizations when adding multiple elements to the set at the same time.
- A synchronized wrapper to permit thread-safety under modification.
- Array-backed storage of ranges as discussed earlier.
History
- Initial release: 04/20/2005.
| You must Sign In to use this message board. |
|
|
 |
|
 |
An interesting comparison would be with a simple Bool array of max size, with index representing the value. Range inserts would perform pretty well when things are continuous because they would reduce to a simple for loop.
for (i=min;i<max;i++) { bools[i] = true; }
This should go really fast(especially if SIMD gets a hold of it), though not as fast as setting 2 values as in your data structure. Add, removes, and searches will have the speed of direct array access. It should also perform fairly well on non continuous numbers.
Compression should be around 96%; instead of 32bit values we know have 1 bit ones. The catch is the array is of fixed size and so consumes the full amount of memory at all times. You could dynamically resize the array but if you know your data set is large why bother. There are many options to consider when dynamically resizing. You could do it like array list does internally for one.
Another downside is iteration; your data structure will be able to skip values during iteration where the bool one will need to iterate over every element.
The speed of Iteration could be improved at the cost of memory. You could use 2 bits per value. The first bit representing if the value exists and the second if the next N values can be skipped. Choosing increasing values of N will basically trade add and remove performance for iteration performance. Compression rate would not drop by too much either by adding another bit. Though actual size would double you would still be getting a bit over 93% compression.
This would place compression about halfway between the two methods you tested. The access times would likely be the best out of the three.
For float types the range set would be the only viable option.
Most people already know about this I just thought it would be good alternative for similiar use cases as this data structure.
-- modified at 9:07 Monday 10th April, 2006
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Hi Scott,
I like your submission for its clarity and useability. Thank you very much!
I'm having problems with the internal count property of the RangeSet class. It seems to me that it is not updated when adding ranges. Furthermore, AddRange(low,high) doesn't do any checking for overlaps at all.
Maintaining count is no trivial task if you intend to keep an accurate track of the elements (=int values) stored within the RangeSet. Imagine calling AddRange(1,12) followed by a call to AddRange(8,10). Or ,worse, AddRange(10,15).
Do you incidentially have a improved version of your code at hand that addesses these issues and that you would like to share with us?
Meanwhile, here's a roundtrip string representation of the RangeSet, parsing a RangeSet from "1,3,17..32,6" and converting an existing RangeSet to a similar string:
public static RangeSet Parse(string s) { RangeSet ResultSet = new RangeSet(); string[] rangeToken = s.Split(new char[1] {','}); foreach (string Token in rangeToken) { int sepPos = Token.IndexOf(".."); if (sepPos != -1) { int lowEnd = Int32.Parse(Token.Substring(0,sepPos)); int highEnd = Int32.Parse(Token.Substring(sepPos+2)); ResultSet.AddRange(lowEnd, highEnd); } else { ResultSet.Add(Int32.Parse(Token)); } } return ResultSet; }
override public String ToString() { System.Text.StringBuilder ResultBuilder = new System.Text.StringBuilder(); if (rangeList.Count>0) { foreach (IRange MyRange in rangeList) { if (ResultBuilder.Length>0) { ResultBuilder.Append(","); } if (MyRange is Range) { ResultBuilder.Append(MyRange.Low.ToString() + ".." + MyRange.High.ToString() ); } else if (MyRange is SingleElementRange ) { ResultBuilder.Append(MyRange.Low); } } return ResultBuilder.ToString(); } else { return "#EMPTY"; } }
Greets, /|/|artin
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Hi. Thanks for your interest.
AddRange() is definitely problematic in these situations. That's why it's private, and the only ways to get it called are:
1. From the constructor that takes a param array of integers, or: 2. Calling the public Add() method that takes a param array *only when the set is currently empty*.
I imagine since you've been adding some enhancements that you've been working inside the RangeSet class and calling AddRange() outside of these circumstances.
Thanks for the string marshaling code -- I'll look at adding something along those lines in a subsequent release!
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
An interesting and useful data structure. It would be helpful to also have a run-time vs. percent packed graph, in addition to the memory-usage graph. It could give the time required to initially populate the RangeSet, which would be useful for evaluating the time/space tradeoff.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Thanks for your interest.
You seem to be referring to the runtime to populate the set. Per the discussion in the article, initially populating the rangeset from scratch with a set of integers known at creation time is algorithmically as efficient as sorting the integers (read: very good). Populating the rangeset by repeatedly calling the Add method is rather inefficient as it involves a lot of memory reallocation and subarray copying. But the latter is not a scenario I'm particularly concerned with for this data structure.
I'd argue that comparing either set population scenario to the memory usage of the set is apples-to-oranges. I envision rangeset instances as very large structures that stay resident in the application (and largely immutable) for relatively long periods of time. (Otherwise, one probably wouldn't be concerned enough with memory overhead to use it.) And therefore the population cost is amortized over many calls to Contains().
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Hi Scott, Very nice code.
Some time ago I prototyped a similar class based upon a linked list to hold the set of positive integers for which the hailstone function converge to 1 when called in a loop
long hailstone(long x) { if (x%2 =0) return x/2; else return 3*x +1; }
Just as Jeffrey Sax proposed I used only one rangetype with high and low and it worked quite well and kept the code simple. Due to specific behaviour of the 'hailstone-set' I was considering a BitmapRange object in your terms. Its memory layout looks like:
{ long low; long bitmap; // or a BitmapArray or bool[64] etc. }
It has a lower bound plus a bitmap of the following 64 longs. The higher bound is somewhere between low and low+64. Optionally one could cache the high value used at costs of some extra memory, a byte representing the offset wrt low would be sufficient. [this byte could also be embedded in the bitmap ==> 58 bits for the bitmap 6 bits for the high-offset; probably too complex]
The BitmapRange will use as much memory as a Range object but it is able to hold small clusters of integers that do not form a substantial range yet, e.g. all the odd numbers between 10001 and 10031. This would need 15 SingleValueRange or Range objects but only one BitmapRange object.
The bitmap part of the BMR object can be made as big as you like effectively transforming the RangeSet gradually into a Bitmap set. In essence this step would make the BitmapSet a specific case of a RangeSet.
Another kind of Range object could be the MultipleOffsetRange (awfull name) which holds a lower value + an array of 8 byte offsets. The effective range of such object would be max 8 values in the range low..low+256.
so far my 2 cents 
Rob Tillaart
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Rob --
That's a really slick approach for getting around the traditional range-limit restriction on bitsets. Will keep it in mind in the future.
Thanks!
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
First of all, nice work, Scott!
Second, you don't give yourself enough credit. Your benchmark measurements are biased in favor of the Hashtable solution. A big chunk of memory (0.8MB) is taken up by the test data and is completely separate from the solution itself, yet you include it in the memory consumption. The relative improvement is therefore more like a factor of 6 to 12 than 2 to 8.
Even so, this code can still be improved significantly.
You try to save memory by using two different range structures: one for a singleton, and one for a 'real' range. Unfortunately, this prevents you from treating the ranges as value types. All the structures need to be boxed, which implies an overhead of 8 bytes per range.
It also slows things down by requiring casts and adding additional branching logic to distinguish between the two cases.
You can avoid these problems if you only use one type of range, which can have identical lower and upper bounds. You can then use an array to store the ranges, (or a List<Range> in C# 2.0), which eliminates the boxing overhead.
The result: peak memory consumption is only 1.2MB (including the test data, 0.4MB without) with the fill factor at 60%. The net improvement for a fill factor of 95% is a factor of 82 (8.4MB / 102KB). For a fill factor of 10%, the net memory consumption is 144KB.
Jeffrey
Everything should be as simple as possible, but not simpler. -- Albert Einstein Numerical components for C# and VB.NET
|
| Sign In·View Thread·PermaLink | 5.00/5 |
|
|
|
 |
|
 |
Hi Jeffrey. Thanks for the positive feedback!
As I alluded to in the article, I did prototype a version that eliminated boxing. Two, actually -- one with a single Range value type and another that stored the ranges in parallel int arrays (the thinking being that the latter might be easier to efficiently port to Java). I did see the impressive additional memory overhead improvement you mention.
The reason I didn't go with this (for the first version anyway) was because without implementing array capacity redoubling, it's not possible to get any kind of usable Add and Remove performance. (Even though I stated that I didn't care so much about Add and Remove performance for RangeSet, in this case we're talking slow-beyond-the-pale.) ArrayList, of course, handles redoubling for you internally, so Add and Remove, while not optimized, at least work decently when you need them. I may take up the project of converting RangeSet to use arrays with smart dynamic resizing at some point in the future, and then update this article accordingly.
Thanks again!
_scott
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|