## Introduction

It is said that indexing a multidimensional array that is laid out linearly in memory (as C/C++ arrays are) with schemes other than the naive 'row-by-row, slice-by-slice' can yield a significant performance gain for some applications.

Applications that frequently need the neighbors of a data element right after accessing it should especially profit from an indexing scheme that better preserves this locality of data. There are numerous such schemes out there. Two of the most popular techniques are dividing the contiguous memory into small, linearly laid out blocks, and using space-filling curves like the Hilbert or the Morton curve (aka Z-order or Z-curve). The usual argument is that the raised chance of pulling in more potentially usable data with a single cache line compensates for the slightly more complex mapping of n-dimensional indices to one-dimensional memory addresses. (By the way, for the remainder of this article, n will be two or three – no need to go over the top.)

However, I have found surprisingly few applications that actually use the indexing schemes mentioned above. Even in computer graphics and simulation, where the data structures for which these ideas were proposed in the first place are frequent, the good(?)-old 'row-by-row, slice-by-slice' is often used. So I decided to implement a simple (okay, a very simple) framework to test the benefits and costs of apparently clever indexing schemes on today's consumer hardware – namely mine. Please note that this framework is not usable for production code as presented here. For example, it lacks the possibility to put an array interface over an arbitrary memory area, an STL-compatible interface with iterators, and – most importantly – some serious error checking. It was merely written to implement some indexing schemes and analyze their performance.

## Outline

In order to test several indexing techniques, it makes sense to develop a framework that allows easy integration of new techniques and management of existing ones. The idea is to have something like a generic `ArrayND` class that takes a certain indexing scheme, made available by an Indexer class, and the type of its elements as template parameters. This is a version of the Strategy pattern with static polymorphism. It has a downside: it makes changing indexing schemes at runtime impossible. But performance is important here, so I wanted to avoid virtual function calls.

After the description of this framework, I'm going to provide code for Indexers based on power-of-two blocks (2d and 3d), the Morton curve (2d only), and (for comparison) a linear 'row-by-row, slice-by-slice' Indexer. Finally, I will conduct a simple performance test and discuss its results.

## Implementation

### A simple array wrapper

The first building block of the framework is a simple wrapper for a one-dimensional array that will be indexed by the indexer classes. It is parameterized by two
template parameters: first the type of elements it holds, and second an allocator that is used to allocate memory for the elements. It has a constructor that takes
the number of elements, a value to initialize the elements with, and the allocator. Upon construction, memory for the elements is allocated and the elements
are copy-constructed from the default value using placement-new. (The elements could also be default-constructed, which would be more like a C-array and save
some time for types with expensive copy-constructors, but I found it more convenient this way.) The destructor frees the memory after performing the necessary
explicit destructor calls. An `operator[]` is provided to access the data similarly to an ordinary C-array. Here is the code:

```
/**
A simple wrapper for a 1d array.
*/
template
<
    typename TElement,
    class TAllocator = std::allocator< TElement >
>
class Array
{
public:
    Array( index_t numElements, const TElement& defaultElement = TElement(),
           const TAllocator& allocator = TAllocator() )
    : m_data( 0 )
    , m_numElements( numElements )
    , m_allocator( allocator )
    {
        m_data = m_allocator.allocate( numElements );
        for( index_t i=0; i<numElements; ++i )
            new ( &m_data[i] ) TElement( defaultElement );
        // THINKABOUTME: with copy-ctors I might as well use allocator::construct()...
    }
    ~Array()
    {
        for( index_t i=0; i<m_numElements; ++i )
            ( &m_data[i] )->TElement::~TElement();
        m_allocator.deallocate( m_data, m_numElements );
    }
    const TElement& operator [] ( index_t index ) const { return m_data[ index ]; }
    TElement& operator [] ( index_t index )             { return m_data[ index ]; }
private:
    TElement*  m_data;
    index_t    m_numElements;
    TAllocator m_allocator;
};
```

`index_t` is a typedef for `__int32`. I experimented with different types, and to my surprise, `int` performed slightly better than `unsigned`, although I couldn't find any obvious optimizations in the generated assembly code.

Some Indexers (especially the `BlockIndexer` described below) can gain some performance if the allocated memory is aligned on multiples of the system's cache line size (64 bytes for my machine). So here is the stripped-down code for such an allocator (it does not implement the full STL allocator interface!):

```
template
<
    typename T,
    std::size_t TByteAlignment = 64
>
class aligned_allocator
{
public:
    T* allocate( std::size_t count, void* hint = 0 )
    {
        (void)hint; // unused; silences compiler warnings
        return (T*)_aligned_malloc( count*sizeof(T), TByteAlignment );
    }
    void deallocate( void* ptr, std::size_t /*count*/ )
    {
        _aligned_free( ptr );
    }
};
```

Next comes the `Array2D` class. The Indexers are highly dependent on the dimension of the index space they work in, and so are the arrays. For simplicity and speed, I didn't make the dimension a template parameter or the like, although it would have made sense for the `BlockIndexer`.

The `Array2D` class has two data members: an array that serves as a linear buffer, and the Indexer that is used to access the buffer. When an `Array2D` is constructed, the Indexer's static `GetNeededBufferCapacity()` method is called to query how many elements the buffer should allocate for the given width and height. This is necessary because some Indexers need more than `width*height` elements to ease their address calculations. `operator()` simply uses the Indexer to access the buffer and retrieve the element of interest. Here is the code:

```
/**
A 2d array that can be used with variable indexing schemes and allocators.
*/
template
<
    typename TElement,
    class TIndexer = BlockIndexer2D< 4 >,
    class TAllocator = std::allocator< TElement >
>
class Array2D
{
public:
    Array2D( index_t sizeU, index_t sizeV, const TElement& defaultElement = TElement(),
             const TAllocator& allocator = TAllocator() )
    : m_data( TIndexer::GetNeededBufferCapacity( sizeU, sizeV ), defaultElement, allocator )
    , m_indexer( sizeU, sizeV )
    {}
    Array2D( index_t sizeU, index_t sizeV, const TElement& defaultElement,
             const TAllocator& allocator, const TIndexer& indexer )
    : m_data( TIndexer::GetNeededBufferCapacity( sizeU, sizeV ), defaultElement, allocator )
    , m_indexer( indexer )
    {}
    ~Array2D() {}
    const TElement& operator () ( index_t u, index_t v ) const
    { return m_data[ m_indexer.GetIndex(u, v) ]; }
    TElement& operator () ( index_t u, index_t v )
    { return m_data[ m_indexer.GetIndex(u, v) ]; }
private:
    Array< TElement, TAllocator > m_data;
    TIndexer m_indexer;
};
```

The `Array3D` class is implemented analogously.

### The Indexers

I will first discuss the `LinearIndexer2D` to introduce the interface; it extends in the canonical way to 3d. The guarantees an Indexer must make are:

- It must provide a static function that returns the number of elements the buffer must be able to hold for an index range [0, sizeU) x [0, sizeV).
- It must provide a member function that returns the one-dimensional index of an element in the buffer, given a two-dimensional index.
- It must be copy-constructible and constructible from a (width, height) pair. The latter condition only helps to instantiate `Array2D` more conveniently and can easily be removed if necessary.

As mentioned earlier, `LinearIndexer2D` simply indexes an array row-by-row. The code is:

```
/**
A straightforward row-by-row indexer that translates 2d indices to 1d.
*/
class LinearIndexer2D
{
public:
    /// Constructs an indexer to index a 2d array of specified size.
    LinearIndexer2D( index_t sizeU, index_t /*sizeV*/ )
    : m_sizeU( sizeU )
    {}
    /// Dtor.
    ~LinearIndexer2D() {}
    /// Returns the 1d index for the 2d (u,v) pair.
    index_t GetIndex( index_t u, index_t v ) const
    {
        return v*m_sizeU + u;
    }
    /// Returns the number of elements a buffer must be able to hold
    /// to be used with an indexer for the specified size.
    static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
    {
        return sizeU * sizeV;
    }
private:
    index_t m_sizeU;
};
```

`BlockIndexer2D` divides the two-dimensional index space into blocks of a certain edge length. The elements inside a block are indexed row-by-row, as are the blocks themselves at a higher level. Block sizes are restricted to powers of two in order to speed up the calculations, so the log2 of the block size is taken as a template argument. There are only full blocks in the array, so the Indexer requires enough memory to hold an array with the extents rounded up to the next multiple of the block size. Hmm, it's probably best to let commented code speak:

```
/**
An Indexer that uses blocks of a specific size to exploit locality of data.
*/
template
<
    index_t TLog2BlockSize = 4
>
class BlockIndexer2D
{
public:
    /// Ctor to index specified size.
    BlockIndexer2D( index_t sizeU, index_t /*sizeV*/ )
    : m_numBlocksU( getNumBlocks( sizeU ) )
    {}
    /// Dtor.
    ~BlockIndexer2D()
    {}
    /// Returns the index into the underlying 1d buffer of the given (u,v) pair.
    index_t GetIndex( index_t u, index_t v ) const
    {
        index_t blockU, blockV;
        index_t relativeU, relativeV;
        splitIndex( blockU, relativeU, u );
        splitIndex( blockV, relativeV, v );
        const index_t indexToBlock = ( m_numBlocksU*blockV + blockU ) * (ms_blockSize*ms_blockSize);
        const index_t indexInBlock = ( relativeV * ms_blockSize ) + relativeU;
        return indexToBlock + indexInBlock;
        // The code above can also be expressed with bitwise operators.
        // Interestingly, the less tricky version above performed slightly better on my machine,
        // although the compiler didn't optimize all arithmetic expressions to bit-twiddlings...
        //const index_t indexToBlock = ( m_numBlocksU*blockV + blockU ) << (2*TLog2BlockSize);
        //const index_t indexInBlock = ( relativeV << TLog2BlockSize ) | relativeU;
        //return indexToBlock | indexInBlock;
    }
    /// Returns the number of elements a buffer must be able to hold to be used with an indexer
    /// for the specified size. The sizes are rounded up to the next multiple of blocksize.
    static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
    {
        const index_t numBlocksU = getNumBlocks( sizeU );
        const index_t numBlocksV = getNumBlocks( sizeV );
        return (numBlocksU*numBlocksV) << (2*TLog2BlockSize);
    }
private:
    /// Blocksize as static const to enable compiler optimizations.
    static const index_t ms_blockSize = (1 << TLog2BlockSize);
    index_t m_numBlocksU; ///< number of blocks in U-direction, needed for faster indexing
    /// Returns the index of the block in which a component of a (u,v) index lies (x/ms_blockSize).
    static index_t getIndexOfBlock( index_t x ) { return x >> TLog2BlockSize; }
    /// Returns the relative index inside a block (x % ms_blockSize).
    static index_t getIndexInBlock( index_t x ) { return x & (ms_blockSize-1); }
    /// Combines getIndexOf/InBlock to reduce function call overhead in debug builds.
    static void splitIndex( index_t& block, index_t& relative, index_t x )
    { block = x >> TLog2BlockSize; relative = x & (ms_blockSize-1); }
    /// Get the number of blocks needed to cover an extent of size.
    static index_t getNumBlocks( index_t size ) { return ( size + ms_blockSize-1 ) >> TLog2BlockSize; }
};
```

The code is based on the implementation Matt Pharr and Greg Humphreys gave in [1].

`MortonIndexer2D` uses Morton numbers to access the data. These are obtained by interleaving the bits of the u and v index, which results in a recursive Z-shaped curve; see the Wikipedia article for details ([2]). `MortonIndexer2D` requires an indexing area that is as high as it is wide, and width and height must be a power of two, which wastes serious amounts of memory in some cases. The bit-twiddling tricks are taken from Sean Eron Anderson's excellent website ([3]). Here is the code:

```
/**
Indexes a 1d array with a Morton curve (Z-order).
*/
class MortonIndexer2D
{
public:
    /// Ctor for indexing an array of specified size.
    MortonIndexer2D( index_t /*sizeU*/, index_t /*sizeV*/ )
    {}
    /// Dtor.
    ~MortonIndexer2D()
    {}
    /// Translates (u,v) into a 1d index.
    /// Only works for 16 bit u and 15 bit v!
    index_t GetIndex( index_t u, index_t v ) const
    {
        // calculate morton number for (u,v)
        return spreadBits( u ) | (spreadBits( v ) << 1);
    }
    static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
    {
        // THINKABOUTME: try bsr-instruction, although this is not performance critical
        // round up the max of the two extents to the next power of two
        index_t n = std::max( sizeU, sizeV ) - 1;
        // set all bits lower than the highest set bit,
        // taken from http://graphics.stanford.edu/~seander/bithacks.html
        n |= n >> 1;
        n |= n >> 2;
        n |= n >> 4;
        n |= n >> 8;
        n |= n >> 16;
        // increment to get next power of two
        ++n;
        return n*n;
    }
private:
    /// Spreads out the lowest 16 bits in x so that
    /// they cover 32 bits (with 0-bits in between).
    /// Taken from http://graphics.stanford.edu/~seander/bithacks.html
    static index_t spreadBits( index_t x )
    {
        index_t r = x;
        r = (r | (r << 8)) & 0x00FF00FF;
        r = (r | (r << 4)) & 0x0F0F0F0F;
        r = (r | (r << 2)) & 0x33333333;
        r = (r | (r << 1)) & 0x55555555;
        return r;
    }
};
```

The code for `LinearIndexer3D` and `BlockIndexer3D` can be found in the zip file of this article.

## Performance test

The tests conducted were not very exhaustive and are certainly no replacement for profiling in your concrete environment, but they should give an estimate of how much performance can be gained or lost in certain situations.

Performance of the Indexers was tested with different access patterns in 2d and 3d. In each pattern, after the element at (u,v) (or (u,v,w)) is accessed, the 4 edge-neighbors (or 6 face-neighbors in 3d) that are `radius` elements away in each direction are accessed as well. So if `radius` is 0, the same element is accessed; if `radius` is 1, the direct neighbors are accessed, and so on. The element (u,v) is picked with two fundamentally different schemes: linear (random start location, then row-by-row) and completely random (using stdlib's `rand()` function – it should be good enough for this test). The code for the 2d functions is:

```
template< class T >
static inline
void AccessArray2D( T& a, index_t u, index_t v, index_t radius )
{
    a(u,v) =
          a(u, v-radius)
        + a(u-radius, v)
        + a(u, v)
        + a(u+radius, v)
        + a(u, v+radius)
        ;
}
template< class T >
double TestLinearAccessArray2D( T& a, index_t size, index_t radius )
{
    const index_t uStart = (rand() % (size - 2*radius)) + radius;
    const index_t vStart = (rand() % (size - 2*radius)) + radius;
    const clock_t t1 = clock();
    const index_t numLoops = size * 100;
    index_t loopCounter = 0;
    for( index_t v=vStart; v<size-radius; ++v )
    {
        for( index_t u=uStart; u<size-radius; ++u )
        {
            AccessArray2D( a, u, v, radius );
            if( ++loopCounter >= numLoops )
            {
                v = size; // to break the outer loop
                break;
            }
        }
    }
    const clock_t t2 = clock();
    return (double)(t2-t1) / CLOCKS_PER_SEC;
}
template< class T >
double TestRandomAccessArray2D( T& a, index_t size, index_t radius )
{
    const clock_t t1 = clock();
    const index_t numLoops = size * 100;
    for( index_t i=0; i<numLoops; ++i )
    {
        const index_t u = (rand() % (size - 2*radius)) + radius;
        const index_t v = (rand() % (size - 2*radius)) + radius;
        AccessArray2D( a, u, v, radius );
    }
    const clock_t t2 = clock();
    return (double)(t2-t1) / CLOCKS_PER_SEC;
}
```

The array size was chosen to be 64MB for both 2d and 3d, hoping not to have the complete array in cache after a few runs while also not provoking paging. The tests were repeated several times to get a better estimate. Code was compiled with MSVC 9.0's default release settings. The machine was a Core2 Duo E6700, 2.66GHz, with 4GB Dual Channel DDR2 RAM running at 333MHz. The operating system was Vista x64. The L1 data cache was 2x32KB with 64-byte lines (8-way set associative); the L2 cache was 4MB with 64-byte lines (16-way set associative). Here are the results:

```
LINEAR ACCESS TEST 2D (radius = 0, 1000 runs)
linear: 2.171 (100%)
block: 5.207 (239.843%)
morton: 10.515 (484.339%)
---
LINEAR ACCESS TEST 2D (radius = 1, 1000 runs)
linear: 2.295 (100%)
block: 4.69 (204.357%)
morton: 11.064 (482.092%)
---
LINEAR ACCESS TEST 2D (radius = 2, 1000 runs)
linear: 1.907 (100%)
block: 5.3 (277.923%)
morton: 10.827 (567.75%)
---
LINEAR ACCESS TEST 2D (radius = 3, 1000 runs)
linear: 2.016 (100%)
block: 4.97 (246.528%)
morton: 11.142 (552.679%)
---
LINEAR ACCESS TEST 2D (radius = 4, 1000 runs)
linear: 1.954 (100%)
block: 5.002 (255.988%)
morton: 11.03 (564.483%)
---
LINEAR ACCESS TEST 2D (radius = 5, 1000 runs)
linear: 2.172 (100%)
block: 5.146 (236.924%)
morton: 10.856 (499.816%)
===
RANDOM ACCESS TEST 2D (radius = 0, 1000 runs)
linear: 61.911 (100%)
block: 63.559 (102.662%)
morton: 72.104 (116.464%)
---
RANDOM ACCESS TEST 2D (radius = 1, 1000 runs)
linear: 91.758 (100%)
block: 75.676 (82.4735%)
morton: 79.015 (86.1124%)
---
RANDOM ACCESS TEST 2D (radius = 2, 1000 runs)
linear: 93.053 (100%)
block: 85.483 (91.8649%)
morton: 86.119 (92.5483%)
---
RANDOM ACCESS TEST 2D (radius = 3, 1000 runs)
linear: 94.045 (100%)
block: 92.768 (98.6421%)
morton: 92.646 (98.5124%)
---
RANDOM ACCESS TEST 2D (radius = 4, 1000 runs)
linear: 94.44 (100%)
block: 97.971 (103.739%)
morton: 98.108 (103.884%)
---
RANDOM ACCESS TEST 2D (radius = 5, 1000 runs)
linear: 93.62 (100%)
block: 97.261 (103.889%)
morton: 98.702 (105.428%)
===
```

And the results for 3d were:

```
LINEAR ACCESS TEST 3D (radius = 0, 1000 runs)
linear: 0.422 (100%)
block: 0.795 (188.389%)
---
LINEAR ACCESS TEST 3D (radius = 1, 1000 runs)
linear: 0.557 (100%)
block: 0.706 (126.75%)
---
LINEAR ACCESS TEST 3D (radius = 2, 1000 runs)
linear: 0.452 (100%)
block: 0.859 (190.044%)
---
LINEAR ACCESS TEST 3D (radius = 3, 1000 runs)
linear: 0.42 (100%)
block: 0.921 (219.286%)
---
LINEAR ACCESS TEST 3D (radius = 4, 1000 runs)
linear: 0.389 (100%)
block: 1 (257.069%)
---
LINEAR ACCESS TEST 3D (radius = 5, 1000 runs)
linear: 0.531 (100%)
block: 0.842 (158.569%)
===
RANDOM ACCESS TEST 3D (radius = 0, 1000 runs)
linear: 4.786 (100%)
block: 5.042 (105.349%)
---
RANDOM ACCESS TEST 3D (radius = 1, 1000 runs)
linear: 7.128 (100%)
block: 6.771 (94.9916%)
---
RANDOM ACCESS TEST 3D (radius = 2, 1000 runs)
linear: 7.394 (100%)
block: 6.989 (94.5226%)
---
RANDOM ACCESS TEST 3D (radius = 3, 1000 runs)
linear: 7.582 (100%)
block: 7.613 (100.409%)
---
RANDOM ACCESS TEST 3D (radius = 4, 1000 runs)
linear: 8.015 (100%)
block: 8.536 (106.5%)
---
RANDOM ACCESS TEST 3D (radius = 5, 1000 runs)
linear: 7.875 (100%)
block: 8.614 (109.384%)
===
```

## Conclusion

So what conclusions can be drawn? First, I have to stress again that the test cases were highly artificial and cannot speak for your specific scenario. Also, there was noticeable variance among different runs of the same program (I have no statistical figures, but I'd guess 2-3%).

What can be said is that there are benefits in some scenarios, but also costs in others (as if we didn't know that from the beginning...). Especially block indexing can save 10-20% of time in some cases (radius = 1, 2), and it isn't too bad in cases that don't suit it well (radius = 0, 4, 5). In 3d, it performed worse than I expected; probably the index-computation overhead is simply too high there. `MortonIndexer` performed worse than `BlockIndexer` with small radii, although its performance didn't degrade as much with rising radii. So, as one would have expected, it seems to be the choice if elements 'roughly near' an already fetched element are accessed frequently. However, it also can waste an enormous amount of memory compared to `BlockIndexer`. A drawback that both `BlockIndexer` and `MortonIndexer` share is that they are significantly slower than `LinearIndexer` when you just 'plow through' an array. They are also much slower in debug builds. This might not sound too bad, and in most cases the gains in release builds may make up for it, but I have worked with debug builds that were so terribly slow that – well – you couldn't really work with them, even as a developer. So I thought I should mention it here.

## References

[1]: Matt Pharr, Greg Humphreys: Physically based rendering - from theory to implementation. Morgan Kaufmann Publishers, 2004. http://www.pbrt.org.

[2]: Wikipedia article about Z-order: http://en.wikipedia.org/wiki/Z-order_%28curve%29

[3]: Sean Eron Anderson: Bit twiddling hacks. http://graphics.stanford.edu/%7Eseander/bithacks.html

## History

- 2007-12-27: First upload.