Introduction
It is said that indexing a multidimensional array that is laid out linearly in memory (like C/C++-arrays are) with indexing schemes other than the naive
'row-by-row, slice-by-slice' can result in a significant performance gain for some applications.
Especially applications that often need the neighbors of a data element after they have accessed it should profit from an indexing scheme that preserves this
locality of data better. There are numerous such schemes out there. Two of the most popular techniques are dividing the continuous memory into small, linearly laid
out blocks and using space-filling curves like the Hilbert or the Morton curve (aka Z-order or Z-curve). Usually it is said that the raised chances for pulling in more
potentially usable data with a single cache line compensates for the slightly more complex mapping of n-dimensional indices to one-dimensional memory addresses. (By the way, for the remainder
of this article, n will be two or three – no need to go over the top.)
However, I have found surprisingly few applications that actually use the indexing schemes mentioned above. Even in computer graphics and simulation, where data
structures for which these ideas were proposed in the first place are frequent, often the good(?)-old 'row-by-row, slice-by-slice' is used. So I decided to implement
a simple (okay, a very simple) framework to test the benefits and costs of apparently clever indexing schemes on today's consumer hardware - namely mine.
Please note that this framework will not be usable for production code as it is presented here. For example, it lacks the possibility to put an array-interface
over an arbitrary memory area, a STL-compatible interface with iterators, and – most importantly – some serious error checking. It was merely written to implement
some indexing schemes and analyze their performance.
Outline
In order to test several indexing techniques, developing a framework that allows easy integration of new techniques and management of existing ones makes sense.
The idea is to have something like a generic ArrayND class that takes a certain indexing scheme, made available by an Indexer class,
and the type of its elements as template parameters. This is a version of the Strategy pattern with static polymorphism. It has a downside: it makes changing
indexing schemes at runtime impossible, but performance is important here, so I wanted to avoid virtual function calls.
After the description of this framework, I'm going to provide code for Indexers based on power-of-two blocks (2d and 3d), the Morton curve (2d only), and (for comparison)
a linear 'row-by-row, slice-by-slice' Indexer. Finally, I will conduct a simple performance test and discuss its results.
Implementation
A simple array wrapper
The first building block of the framework is a simple wrapper for a one-dimensional array that will be indexed by the indexer classes. It is parameterized by two
template parameters: first the type of elements it holds, and second an allocator that is used to allocate memory for the elements. It has a constructor that takes
the number of elements, a value to initialize the elements with, and the allocator. Upon construction, memory for the elements is allocated and the elements
are copy-constructed from the default value using placement-new. (The elements could also be default-constructed, which would be more like a C-array and save
some time for types with expensive copy-constructors, but I found it more convenient this way.) The destructor frees the memory after performing the necessary
explicit-destructor calls. An operator[] is provided to access the data similarly to an ordinary C-array. Here is the code:
template
<
typename TElement,
class TAllocator = std::allocator< TElement >
>
class Array
{
public:
Array( index_t numElements, const TElement& defaultElement =
TElement(), const TAllocator& allocator = TAllocator() )
: m_data( 0 )
, m_numElements( numElements )
, m_allocator( allocator )
{
m_data = m_allocator.allocate( numElements );
for( index_t i=0; i<numElements; ++i )
new ( &m_data[i] ) TElement( defaultElement );
}
~Array()
{
for( index_t i=0; i<m_numElements; ++i )
( &m_data[i] )->TElement::~TElement();
m_allocator.deallocate( m_data, m_numElements );
}
const TElement& operator [] ( index_t index ) const { return m_data[ index ]; }
TElement& operator [] ( index_t index ) { return m_data[ index ]; }
private:
TElement* m_data;
index_t m_numElements;
TAllocator m_allocator;
};
index_t is a typedef for __int32. I experimented with different types and to my surprise, int performed slightly better than unsigned,
although I couldn't find any obvious optimizations in the generated assembly code.
Some Indexers (especially the BlockIndexer described below) can gain some performance if the allocated memory is aligned on multiples of the system's
cache line size (64 bytes for my machine). So here is the stripped down code for such an allocator (it does not implement the full STL-allocator interface!):
template
<
typename T,
std::size_t TByteAlignment = 64
>
class aligned_allocator
{
public:
T* allocate( std::size_t count, void* hint = 0 )
{
hint;
return (T*)_aligned_malloc( count*sizeof(T), TByteAlignment );
}
void deallocate( void* ptr, std::size_t )
{
_aligned_free( ptr );
}
};
Next comes the Array2D class. The Indexers are highly dependent on the dimension of the index-space they work in, and with them the arrays. So for simplicity and speed,
I didn't include the dimension as a template parameter or something like this, although it would have made sense for the BlockIndexer.
The Array2D class has two data members: An array that serves as a linear buffer and the Indexer that is used to access the buffer. When an Array2D is constructed,
the indexer's static GetNeededBufferCapacity() method is called to query how many elements the buffer should allocate for the given width and height.
This is necessary because some Indexers need more than width*height elements to ease their address calculations. operator()
simply uses the Indexer to access the buffer and retrieve the element of interest. Here is the code:
template
<
typename TElement,
class TIndexer = BlockIndexer2D< 4 >,
class TAllocator = std::allocator< TElement >
>
class Array2D
{
public:
Array2D( index_t sizeU, index_t sizeV, const TElement& defaultElement =
TElement(), const TAllocator& allocator = TAllocator() )
: m_data( TIndexer::GetNeededBufferCapacity( sizeU, sizeV ), defaultElement, allocator )
, m_indexer( sizeU, sizeV )
{}
Array2D( index_t sizeU, index_t sizeV, const TElement& defaultElement,
const TAllocator& allocator, const TIndexer& indexer )
: m_data( TIndexer::GetNeededBufferCapacity( sizeU, sizeV ), defaultElement, allocator )
, m_indexer( indexer )
{}
~Array2D() {}
const TElement& operator () ( index_t u, index_t v )
const { return m_data[ m_indexer.GetIndex(u, v) ]; }
TElement& operator () ( index_t u, index_t v ) { return m_data[ m_indexer.GetIndex(u, v) ]; }
private:
Array< TElement, TAllocator > m_data;
TIndexer m_indexer;
};
The Array3D class is implemented analogously.
The Indexers
I will first discuss the LinearIndexer2D to introduce the interface. The interface extents in the canonical way to 3d. The guarantees an Indexer must make are:
- It must provide a static function that returns the number of elements the buffer must be able to hold for an index range [0,sizeU]x[0,sizeV].
- It must provide a member function that returns the one-dimensional index of an element in the buffer given a two-dimensional index.
- It must be copy-constructable and constructable from a (width, height) – pair. The latter condition only helps to instantiate
Array2D
more conveniently and can easily be removed if necessary.
As mentioned earlier, LinearIndexer2D simply indexes an array row-by-row. The code is:
class LinearIndexer2D
{
public:
LinearIndexer2D( index_t sizeU, index_t )
: m_sizeU( sizeU )
{}
~LinearIndexer2D() {}
index_t GetIndex( index_t u, index_t v ) const
{
return v*m_sizeU + u;
}
static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
{
return sizeU * sizeV;
}
private:
index_t m_sizeU;
};
BlockIndexer2D divides the two-dimensional index space into blocks of a certain edge length. The elements inside a block are indexed row-by-row,
as are the blocks themselves at a higher level. Block sizes are restricted to be powers of two in order to speed up the calculations, so we take the log2 of the block size
as a template argument. There are only full blocks in the array, so the Indexer requires enough memory to hold an array with the extents rounded up to the next multiple
of block size. Hmm, it's probably best to let commented code speak:
template
<
index_t TLog2BlockSize = 4
>
class BlockIndexer2D
{
public:
BlockIndexer2D( index_t sizeU, index_t )
: m_numBlocksU( getNumBlocks( sizeU ) )
{}
~BlockIndexer2D()
{}
index_t GetIndex( index_t u, index_t v ) const
{
index_t blockU, blockV;
index_t relativeU, relativeV;
splitIndex( blockU, relativeU, u );
splitIndex( blockV, relativeV, v );
const index_t indexToBlock = ( m_numBlocksU*blockV + blockU ) * (ms_blockSize*ms_blockSize);
const index_t indexInBlock = ( relativeV * ms_blockSize ) + relativeU;
return indexToBlock + indexInBlock;
}
static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
{
const index_t numBlocksU = getNumBlocks( sizeU );
const index_t numBlocksV = getNumBlocks( sizeV );
return (numBlocksU*numBlocksV) << (2*TLog2BlockSize);
}
private:
static index_t getIndexInBlock( index_t x ) { return x & (ms_blockSize-1); }
static void splitIndex( index_t& block, index_t& relative, index_t x )
{ block = x >> TLog2BlockSize; relative = x & (ms_blockSize-1); }
static index_t getNumBlocks( index_t size ) { return ( size + ms_blockSize-1 ) >> TLog2BlockSize; }
};
The code is based on the implementation Matt Pharr and Greg Humphreys gave in [1].
MortonIndexer2D uses Morton numbers to access the data. These are obtained by interleaving the bits of the u and v index. This results in a recursive
Z-shaped curve. See the Wikipedia article for details ([2]).
MortonIndexer2D requires an indexing area that is as high as it is wide. Width and height must be a power of two, which wastes serious amounts
of memory in some cases. The bit twiddling tricks are taken from Sean Eron Anderson's excellent website ([3]). Here is the code:
class MortonIndexer2D
{
public:
MortonIndexer2D( index_t , index_t )
{}
~MortonIndexer2D()
{}
index_t GetIndex( index_t u, index_t v ) const
{
return spreadBits( u ) | (spreadBits( v ) << 1);
}
static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
{
index_t n = std::max( sizeU, sizeV ) - 1;
n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
++n;
return n*n;
}
private:
The code for LinearIndexer3D and BlockIndexer3D can be found in the zip-file of this article.
Performance test
The tests conducted were not very exhaustive and are certainly no replacement for profiling in your concrete environment, but they should give an estimate
of how much performance can be gained or lost in certain situations.
Performance of the Indexers was tested with different access patterns in 2d and 3d. In each pattern, after the element at (u,v) (or (u,v,w)) is accessed,
also the 4 edge-neighbors (or 6 face-neighbors in 3d) that are 'radius' elements away in each direction are accessed. So if radius is 0, the same
element is accessed; if radius is 1, the direct neighbors are accessed, and so on. The element (u,v) is picked with two fundamentally different schemes: linear (random start
location, then row-by-row), and completely random (using stdlib's rand() function – it should be good enough for this test). The code for the 2d functions is:
template< class T >
static inline
void AccessArray2D( T& a, index_t u, index_t v, index_t radius )
{
a(u,v) =
+ a(u, v-radius)
+ a(u-radius, v)
+ a(u, v)
+ a(u+radius, v)
+ a(u, v+radius)
;
}
template< class T >
double TestLinearAccessArray2D( T& a, index_t size, index_t radius )
{
const index_t uStart = (rand() % (size - 2*radius)) + radius;
const index_t vStart = (rand() % (size - 2*radius)) + radius;
index_t t1 = clock();
const index_t numLoops = size * 100;
index_t loopCounter = 0;
for( index_t v=vStart; v<size-radius; ++v )
{
for( index_t u=uStart; u<size-radius; ++u )
{
AccessArray2D( a, u, v, radius );
if( ++loopCounter >= numLoops )
{
v = size; break;
}
}
}
index_t t2 = clock();
return (double)(t2-t1) / CLOCKS_PER_SEC;
}
template< class T >
double TestRandomAccessArray2D( T& a, index_t size, index_t radius )
{
index_t t1 = clock();
const index_t numLoops = size * 100;
for( index_t i=0; i<numLoops; ++i )
{
const index_t u = (rand() % (size - 2*radius)) + radius;
const index_t v = (rand() % (size - 2*radius)) + radius;
AccessArray2D( a, u, v, radius );
}
index_t t2 = clock();
return (double)(t2-t1) / CLOCKS_PER_SEC;
}
The array size was chosen to be 64MB for 2d and 3d, in the hope not to have the complete array in cache after a few runs and not to provoke paging on the other hand.
The tests were repeated several times to get a better estimate. Code was compiled with MSVC9.0's default release settings. The machine was a Core2 Duo E6700, 2.66GHz,
4GB Dual Channel DDR2 RAM running with 333MHz. The Operating System was Vista x64. L1 data cache was 2x32KB with 64 byte lines (8-way set associative),
L2 cache was 4MB with 64 byte lines (16-way set associative). Here are the results:
LINEAR ACCESS TEST 2D (radius = 0, 1000 runs)
linear: 2.171 (100%)
block: 5.207 (239.843%)
morton: 10.515 (484.339%)
---
LINEAR ACCESS TEST 2D (radius = 1, 1000 runs)
linear: 2.295 (100%)
block: 4.69 (204.357%)
morton: 11.064 (482.092%)
---
LINEAR ACCESS TEST 2D (radius = 2, 1000 runs)
linear: 1.907 (100%)
block: 5.3 (277.923%)
morton: 10.827 (567.75%)
---
LINEAR ACCESS TEST 2D (radius = 3, 1000 runs)
linear: 2.016 (100%)
block: 4.97 (246.528%)
morton: 11.142 (552.679%)
---
LINEAR ACCESS TEST 2D (radius = 4, 1000 runs)
linear: 1.954 (100%)
block: 5.002 (255.988%)
morton: 11.03 (564.483%)
---
LINEAR ACCESS TEST 2D (radius = 5, 1000 runs)
linear: 2.172 (100%)
block: 5.146 (236.924%)
morton: 10.856 (499.816%)
===
RANDOM ACCESS TEST 2D (radius = 0, 1000 runs)
linear: 61.911 (100%)
block: 63.559 (102.662%)
morton: 72.104 (116.464%)
---
RANDOM ACCESS TEST 2D (radius = 1, 1000 runs)
linear: 91.758 (100%)
block: 75.676 (82.4735%)
morton: 79.015 (86.1124%)
---
RANDOM ACCESS TEST 2D (radius = 2, 1000 runs)
linear: 93.053 (100%)
block: 85.483 (91.8649%)
morton: 86.119 (92.5483%)
---
RANDOM ACCESS TEST 2D (radius = 3, 1000 runs)
linear: 94.045 (100%)
block: 92.768 (98.6421%)
morton: 92.646 (98.5124%)
---
RANDOM ACCESS TEST 2D (radius = 4, 1000 runs)
linear: 94.44 (100%)
block: 97.971 (103.739%)
morton: 98.108 (103.884%)
---
RANDOM ACCESS TEST 2D (radius = 5, 1000 runs)
linear: 93.62 (100%)
block: 97.261 (103.889%)
morton: 98.702 (105.428%)
===
And the results for 3d were:
LINEAR ACCESS TEST 3D (radius = 0, 1000 runs)
linear: 0.422 (100%)
block: 0.795 (188.389%)
---
LINEAR ACCESS TEST 3D (radius = 1, 1000 runs)
linear: 0.557 (100%)
block: 0.706 (126.75%)
---
LINEAR ACCESS TEST 3D (radius = 2, 1000 runs)
linear: 0.452 (100%)
block: 0.859 (190.044%)
---
LINEAR ACCESS TEST 3D (radius = 3, 1000 runs)
linear: 0.42 (100%)
block: 0.921 (219.286%)
---
LINEAR ACCESS TEST 3D (radius = 4, 1000 runs)
linear: 0.389 (100%)
block: 1 (257.069%)
---
LINEAR ACCESS TEST 3D (radius = 5, 1000 runs)
linear: 0.531 (100%)
block: 0.842 (158.569%)
===
RANDOM ACCESS TEST 3D (radius = 0, 1000 runs)
linear: 4.786 (100%)
block: 5.042 (105.349%)
---
RANDOM ACCESS TEST 3D (radius = 1, 1000 runs)
linear: 7.128 (100%)
block: 6.771 (94.9916%)
---
RANDOM ACCESS TEST 3D (radius = 2, 1000 runs)
linear: 7.394 (100%)
block: 6.989 (94.5226%)
---
RANDOM ACCESS TEST 3D (radius = 3, 1000 runs)
linear: 7.582 (100%)
block: 7.613 (100.409%)
---
RANDOM ACCESS TEST 3D (radius = 4, 1000 runs)
linear: 8.015 (100%)
block: 8.536 (106.5%)
---
RANDOM ACCESS TEST 3D (radius = 5, 1000 runs)
linear: 7.875 (100%)
block: 8.614 (109.384%)
---
Conclusion
So what are the conclusions to draw? First, I have to say again that the test cases were constructed to a high degree and cannot speak for your specific scenario.
Also, there was a serious variance among different runs of the same program (I have no statistical figures, but I'd guess 2-3%).
What can be said is that there are benefits in some scenarios, but also costs in others (as if we didn't know that from the beginning...). Especially block indexing
can save 10-20% of time in some cases (radius = 1,2), and it isn't too bad in cases that don't suit it well (radius = 0,4,5). In 3d,
it performed worse than I expected, probably the computation overhead for indices is simply too high there. MortonIndexer
performed worse than BlockIndexer with small radii, although its performance didn't degrade that much with rising radii. So, as one would have
expected, it seems to be the choice if elements 'roughly near' an already fetched element are accessed frequently. However, it also can waste an enormous
amount of memory compared to BlockIndexer. A drawback that both BlockIndexer and MortonIndexer share is that they
are significantly slower than LinearIndexer when you just 'plow through' an array.
They are much slower in debug builds, too. This might not sound too bad, and in most cases the gains in release builds may make up for it, but I have worked with debug
builds that were so terribly slow that – well - you couldn't really work with them, even as a developer. So I thought I should mention it here.
References
[1]: Matt Pharr, Greg Humphreys: Physically based rendering - from theory to implementation. Morgan Kaufmann Publishers, 2004. http://www.pbrt.org.
[2]: Wikipedia article about Z-order: http://en.wikipedia.org/wiki/Z-order_%28curve%29
[3]: Sean Eron Anderson: Bit twiddling hacks. http://graphics.stanford.edu/%7Eseander/bithacks.html
History
- 2007-12-27: First upload.