Click here to Skip to main content
Licence CPOL
First Posted 26 Dec 2007
Views 9,881
Downloads 49
Bookmarked 12 times

Implementation and Performance of Various Indexing Schemes for Multidimensional Arrays

By | 26 Dec 2007 | Article
This article presents implementations for a few cache-friendly indexing schemes for multidimensional arrays and a brief analysis of their performance in different scenarios.

Introduction

It is said that indexing a multidimensional array that is laid out linearly in memory (like C/C++-arrays are) with indexing schemes other than the naive 'row-by-row, slice-by-slice' can result in a significant performance gain for some applications.

Especially applications that often need the neighbors of a data element after they have accessed it should profit from an indexing scheme that preserves this locality of data better. There are numerous such schemes out there. Two of the most popular techniques are dividing the continuous memory into small, linearly laid out blocks and using space-filling curves like the Hilbert or the Morton curve (aka Z-order or Z-curve). Usually it is said that the raised chances for pulling in more potentially usable data with a single cache line compensates for the slightly more complex mapping of n-dimensional indices to one-dimensional memory addresses. (By the way, for the remainder of this article, n will be two or three – no need to go over the top.)

However, I have found surprisingly few applications that actually use the indexing schemes mentioned above. Even in computer graphics and simulation, where data structures for which these ideas were proposed in the first place are frequent, often the good(?)-old 'row-by-row, slice-by-slice' is used. So I decided to implement a simple (okay, a very simple) framework to test the benefits and costs of apparently clever indexing schemes on today's consumer hardware - namely mine. Please note that this framework will not be usable for production code as it is presented here. For example, it lacks the possibility to put an array-interface over an arbitrary memory area, a STL-compatible interface with iterators, and – most importantly – some serious error checking. It was merely written to implement some indexing schemes and analyze their performance.

Outline

In order to test several indexing techniques, developing a framework that allows easy integration of new techniques and management of existing ones makes sense. The idea is to have something like a generic ArrayND class that takes a certain indexing scheme, made available by an Indexer class, and the type of its elements as template parameters. This is a version of the Strategy pattern with static polymorphism. It has a downside: it makes changing indexing schemes at runtime impossible, but performance is important here, so I wanted to avoid virtual function calls.

After the description of this framework, I'm going to provide code for Indexers based on power-of-two blocks (2d and 3d), the Morton curve (2d only), and (for comparison) a linear 'row-by-row, slice-by-slice' Indexer. Finally, I will conduct a simple performance test and discuss its results.

Implementation

A simple array wrapper

The first building block of the framework is a simple wrapper for a one-dimensional array that will be indexed by the indexer classes. It is parameterized by two template parameters: first the type of elements it holds, and second an allocator that is used to allocate memory for the elements. It has a constructor that takes the number of elements, a value to initialize the elements with, and the allocator. Upon construction, memory for the elements is allocated and the elements are copy-constructed from the default value using placement-new. (The elements could also be default-constructed, which would be more like a C-array and save some time for types with expensive copy-constructors, but I found it more convenient this way.) The destructor frees the memory after performing the necessary explicit-destructor calls. An operator[] is provided to access the data similarly to an ordinary C-array. Here is the code:

/**
A simple wrapper for a 1d array.
*/
template
<
 typename TElement,
 class TAllocator = std::allocator< TElement >
>
class Array
{
public:
 Array( index_t numElements, const TElement& defaultElement = 
           TElement(), const TAllocator& allocator = TAllocator() )
  : m_data( 0 )
  , m_numElements( numElements )
  , m_allocator( allocator )
 {
  m_data = m_allocator.allocate( numElements );
  for( index_t i=0; i<numElements; ++i )
   new ( &m_data[i] ) TElement( defaultElement );
   // THINKABOUTME: with copy-ctors I might as well use allocator::construct()...
 }
 ~Array()
 {
  for( index_t i=0; i<m_numElements; ++i )
   ( &m_data[i] )->TElement::~TElement();
  m_allocator.deallocate( m_data, m_numElements );
 }
 const TElement& operator [] ( index_t index ) const { return m_data[ index ]; }
 TElement& operator [] ( index_t index ) { return m_data[ index ]; }
private:
 TElement* m_data;
 index_t m_numElements;
 TAllocator m_allocator;
};

index_t is a typedef for __int32. I experimented with different types and to my surprise, int performed slightly better than unsigned, although I couldn't find any obvious optimizations in the generated assembly code.

Some Indexers (especially the BlockIndexer described below) can gain some performance if the allocated memory is aligned on multiples of the system's cache line size (64 bytes for my machine). So here is the stripped down code for such an allocator (it does not implement the full STL-allocator interface!):

template
<
 typename T,
 std::size_t TByteAlignment = 64
>
class aligned_allocator
{
public:
 T* allocate( std::size_t count, void* hint = 0 )
 {
  hint;
  return (T*)_aligned_malloc( count*sizeof(T), TByteAlignment );
 }
 void deallocate( void* ptr, std::size_t /*count*/ )
 {
  _aligned_free( ptr );
 }
};

Next comes the Array2D class. The Indexers are highly dependent on the dimension of the index-space they work in, and with them the arrays. So for simplicity and speed, I didn't include the dimension as a template parameter or something like this, although it would have made sense for the BlockIndexer.

The Array2D class has two data members: An array that serves as a linear buffer and the Indexer that is used to access the buffer. When an Array2D is constructed, the indexer's static GetNeededBufferCapacity() method is called to query how many elements the buffer should allocate for the given width and height. This is necessary because some Indexers need more than width*height elements to ease their address calculations. operator() simply uses the Indexer to access the buffer and retrieve the element of interest. Here is the code:

/**
A 2d array that can be used with variable indexing schemes and allocators.
*/
template
<
 typename TElement,
 class TIndexer = BlockIndexer2D< 4 >,
 class TAllocator = std::allocator< TElement >
>
class Array2D
{
public:
 Array2D( index_t sizeU, index_t sizeV, const TElement& defaultElement = 
          TElement(), const TAllocator& allocator = TAllocator() )
  : m_data( TIndexer::GetNeededBufferCapacity( sizeU, sizeV ), defaultElement, allocator )
  , m_indexer( sizeU, sizeV )
 {}
 Array2D( index_t sizeU, index_t sizeV, const TElement& defaultElement, 
          const TAllocator& allocator, const TIndexer& indexer )
  : m_data( TIndexer::GetNeededBufferCapacity( sizeU, sizeV ), defaultElement, allocator )
  , m_indexer( indexer )
 {}
 ~Array2D() {}
 const TElement& operator () ( index_t u, index_t v ) 
       const { return m_data[ m_indexer.GetIndex(u, v) ]; }
 TElement& operator () ( index_t u, index_t v ) { return m_data[ m_indexer.GetIndex(u, v) ]; }
private:
 Array< TElement, TAllocator > m_data;
 TIndexer m_indexer;
};

The Array3D class is implemented analogously.

The Indexers

I will first discuss the LinearIndexer2D to introduce the interface. The interface extents in the canonical way to 3d. The guarantees an Indexer must make are:

  • It must provide a static function that returns the number of elements the buffer must be able to hold for an index range [0,sizeU]x[0,sizeV].
  • It must provide a member function that returns the one-dimensional index of an element in the buffer given a two-dimensional index.
  • It must be copy-constructable and constructable from a (width, height) – pair. The latter condition only helps to instantiate Array2D more conveniently and can easily be removed if necessary.

As mentioned earlier, LinearIndexer2D simply indexes an array row-by-row. The code is:

/**
A straightforward row-by-row indexer that translates 2d indices to 1d.
*/
class LinearIndexer2D
{
public:
 /// Constructs an indexer to index a 2d array of specified size.
 LinearIndexer2D( index_t sizeU, index_t /*sizeV*/ )
  : m_sizeU( sizeU )
 {}
 /// Dtor.
 ~LinearIndexer2D() {}
 /// Returns the 1d index for the 2d (u,v) pair.
 index_t GetIndex( index_t u, index_t v ) const
 {
  return v*m_sizeU + u;
 }
 /// Returns the number of elements a buffer must
 /// able to hold to be used with an indexer
 /// for the specified size.
 static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
 {
  return sizeU * sizeV;
 }
private:
 index_t m_sizeU;
};

BlockIndexer2D divides the two-dimensional index space into blocks of a certain edge length. The elements inside a block are indexed row-by-row, as are the blocks themselves at a higher level. Block sizes are restricted to be powers of two in order to speed up the calculations, so we take the log2 of the block size as a template argument. There are only full blocks in the array, so the Indexer requires enough memory to hold an array with the extents rounded up to the next multiple of block size. Hmm, it's probably best to let commented code speak:

/**
An Indexer that uses blocks of a specific size to exploit locality of data.
*/
template
<
 index_t TLog2BlockSize = 4
>
class BlockIndexer2D
{
public:
 /// Ctor to index specified size.
 BlockIndexer2D( index_t sizeU, index_t /*sizeV*/ )
  : m_numBlocksU( getNumBlocks( sizeU ) )
 {}
 /// Dtor.
 ~BlockIndexer2D()
 {}
 /// Returns the index into the underlying 1d buffer of the given (u,v) pair.
 index_t GetIndex( index_t u, index_t v ) const
 {
  index_t blockU, blockV;
  index_t relativeU, relativeV;
  splitIndex( blockU, relativeU, u );
  splitIndex( blockV, relativeV, v );
  const index_t indexToBlock = ( m_numBlocksU*blockV + blockU ) * (ms_blockSize*ms_blockSize);
  const index_t indexInBlock = ( relativeV * ms_blockSize ) + relativeU;
  return indexToBlock + indexInBlock;
  // The code above can also be expressed with bitwise operators.
  // Intersetingly, the less tricky version above performed slightly better on my machine,
  // although the compiler didn't optimize all arithmetic expressions to bit-twiddlings...
  //const index_t indexToBlock = ( m_numBlocksU*blockV + blockU ) << (2*TLog2BlockSize);
  //const index_t indexInBlock = ( relativeV << TLog2BlockSize ) | relativeU;
  //return indexToBlock | indexInBlock;
 }
 /// Returns the number of elements a buffer must able to hold to be used with an indexer
 /// for the specified size. The sizes are rounded up to the next multiple of blocksize.
 static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
 {
  const index_t numBlocksU = getNumBlocks( sizeU );
  const index_t numBlocksV = getNumBlocks( sizeV );
  return (numBlocksU*numBlocksV) << (2*TLog2BlockSize);
 }
private:
 ///< Blocksize as static const to enable compiler optimizations 
 static const index_t ms_blockSize = (1 << TLog2BlockSize);
 index_t m_numBlocksU; ///< number of blocks in U-direction, needed for faster indexing
 /// Returns the index of the block in which a component of a (u,v) index lies (x/ms_blockSize).
 static index_t getIndexOfBlock( index_t x ) { return x >> TLog2BlockSize; }
 /// Returns the relative index inside a block (x % ms_blockSize).
 static index_t getIndexInBlock( index_t x ) { return x & (ms_blockSize-1); }
 /// Combines getIndexOf/InBlock to reduce function call overhead in debug builds.
 static void splitIndex( index_t& block, index_t& relative, index_t x )
 { block = x >> TLog2BlockSize; relative = x & (ms_blockSize-1); }
 /// Get the number of blocks needed to cover an extent of size.
 static index_t getNumBlocks( index_t size ) { return ( size + ms_blockSize-1 ) >> TLog2BlockSize; }
};

The code is based on the implementation Matt Pharr and Greg Humphreys gave in [1].

MortonIndexer2D uses Morton numbers to access the data. These are obtained by interleaving the bits of the u and v index. This results in a recursive Z-shaped curve. See the Wikipedia article for details ([2]).

MortonIndexer2D requires an indexing area that is as high as it is wide. Width and height must be a power of two, which wastes serious amounts of memory in some cases. The bit twiddling tricks are taken from Sean Eron Anderson's excellent website ([3]). Here is the code:

/**
Indexes a 1d array with a Morton curve (Z-order).
*/
class MortonIndexer2D
{
public:
 /// Ctor for indexing an array of specified size.
 MortonIndexer2D( index_t /*sizeU*/, index_t /*sizeV*/ )
 {}
 /// Dtor.
 ~MortonIndexer2D()
 {}
 /// Translated (u,v) into 1d index.
 /// Only works for 16 bit u and 15 bit v!
 index_t GetIndex( index_t u, index_t v ) const
 {
  // calculate morton number for (u,v)
  return spreadBits( u ) | (spreadBits( v ) << 1);
 }
 static index_t GetNeededBufferCapacity( index_t sizeU, index_t sizeV )
 {
  // THINKABOUTME: try bsr-instruction,
  // although this is not performance critical
  // round up the max of the two extents to the next power of two
  index_t n = std::max( sizeU, sizeV ) - 1;
  // set all bits lower than the highest set bit
  // taken from <a href=""http://graphics.stanford.edu/~seander/bithacks.html"">http://graphics.stanford.edu/%7Eseander/bithacks.html</a>
  n |= n >> 1;
  n |= n >> 2;
  n |= n >> 4;
  n |= n >> 8;
  n |= n >> 16;
  // increment to get next power of two
  ++n;
  return n*n;
 }
private:
 /// Spreads out the lowest 16 bits in x so that
 /// they cover 32 bits (with 0-bits in between).
 /// Taken from <a href=""http://graphics.stanford.edu/~seander/bithacks.html"">http://graphics.stanford.edu/%7Eseander/bithacks.html</a>
 static index_t spreadBits( index_t x )
 {
  index_t r = x;
  r = (r | (r << 8)) & 0x00FF00FF;
  r = (r | (r << 4)) & 0x0F0F0F0F;
  r = (r | (r << 2)) & 0x33333333;
  r = (r | (r << 1)) & 0x55555555;
  return r;
 }
};

The code for LinearIndexer3D and BlockIndexer3D can be found in the zip-file of this article.

Performance test

The tests conducted were not very exhaustive and are certainly no replacement for profiling in your concrete environment, but they should give an estimate of how much performance can be gained or lost in certain situations.

Performance of the Indexers was tested with different access patterns in 2d and 3d. In each pattern, after the element at (u,v) (or (u,v,w)) is accessed, also the 4 edge-neighbors (or 6 face-neighbors in 3d) that are 'radius' elements away in each direction are accessed. So if radius is 0, the same element is accessed; if radius is 1, the direct neighbors are accessed, and so on. The element (u,v) is picked with two fundamentally different schemes: linear (random start location, then row-by-row), and completely random (using stdlib's rand() function – it should be good enough for this test). The code for the 2d functions is:

template< class T >
static inline
void AccessArray2D( T& a, index_t u, index_t v, index_t radius )
{
  a(u,v) =
   + a(u, v-radius)
   + a(u-radius, v)
   + a(u, v)
   + a(u+radius, v)
   + a(u, v+radius)
   ;
}
template< class T >
double TestLinearAccessArray2D( T& a, index_t size, index_t radius )
{
 const index_t uStart = (rand() % (size - 2*radius)) + radius;
 const index_t vStart = (rand() % (size - 2*radius)) + radius;
 index_t t1 = clock();
 const index_t numLoops = size * 100;
 index_t loopCounter = 0;
 for( index_t v=vStart; v<size-radius; ++v )
 {
  for( index_t u=uStart; u<size-radius; ++u )
  {
   AccessArray2D( a, u, v, radius );
   if( ++loopCounter >= numLoops )
   {
    v = size; // to break the outer loop
    break;
   }
  }
 }
 index_t t2 = clock();
 return (double)(t2-t1) / CLOCKS_PER_SEC;
}
template< class T >
double TestRandomAccessArray2D( T& a, index_t size, index_t radius )
{
 index_t t1 = clock();
 const index_t numLoops = size * 100;
 for( index_t i=0; i<numLoops; ++i )
 {
  const index_t u = (rand() % (size - 2*radius)) + radius;
  const index_t v = (rand() % (size - 2*radius)) + radius;
  AccessArray2D( a, u, v, radius );
 }
 index_t t2 = clock();
 return (double)(t2-t1) / CLOCKS_PER_SEC;
}

The array size was chosen to be 64MB for 2d and 3d, in the hope not to have the complete array in cache after a few runs and not to provoke paging on the other hand. The tests were repeated several times to get a better estimate. Code was compiled with MSVC9.0's default release settings. The machine was a Core2 Duo E6700, 2.66GHz, 4GB Dual Channel DDR2 RAM running with 333MHz. The Operating System was Vista x64. L1 data cache was 2x32KB with 64 byte lines (8-way set associative), L2 cache was 4MB with 64 byte lines (16-way set associative). Here are the results:

LINEAR ACCESS TEST 2D (radius = 0, 1000 runs)
linear: 2.171 (100%)
block: 5.207 (239.843%)
morton: 10.515 (484.339%)
---
LINEAR ACCESS TEST 2D (radius = 1, 1000 runs)
linear: 2.295 (100%)
block: 4.69 (204.357%)
morton: 11.064 (482.092%)
---
LINEAR ACCESS TEST 2D (radius = 2, 1000 runs)
linear: 1.907 (100%)
block: 5.3 (277.923%)
morton: 10.827 (567.75%)
---
LINEAR ACCESS TEST 2D (radius = 3, 1000 runs)
linear: 2.016 (100%)
block: 4.97 (246.528%)
morton: 11.142 (552.679%)
---
LINEAR ACCESS TEST 2D (radius = 4, 1000 runs)
linear: 1.954 (100%)
block: 5.002 (255.988%)
morton: 11.03 (564.483%)
---
LINEAR ACCESS TEST 2D (radius = 5, 1000 runs)
linear: 2.172 (100%)
block: 5.146 (236.924%)
morton: 10.856 (499.816%)
===
RANDOM ACCESS TEST 2D (radius = 0, 1000 runs)
linear: 61.911 (100%)
block: 63.559 (102.662%)
morton: 72.104 (116.464%)
---
RANDOM ACCESS TEST 2D (radius = 1, 1000 runs)
linear: 91.758 (100%)
block: 75.676 (82.4735%)
morton: 79.015 (86.1124%)
---
RANDOM ACCESS TEST 2D (radius = 2, 1000 runs)
linear: 93.053 (100%)
block: 85.483 (91.8649%)
morton: 86.119 (92.5483%)
---
RANDOM ACCESS TEST 2D (radius = 3, 1000 runs)
linear: 94.045 (100%)
block: 92.768 (98.6421%)
morton: 92.646 (98.5124%)
---
RANDOM ACCESS TEST 2D (radius = 4, 1000 runs)
linear: 94.44 (100%)
block: 97.971 (103.739%)
morton: 98.108 (103.884%)
---
RANDOM ACCESS TEST 2D (radius = 5, 1000 runs)
linear: 93.62 (100%)
block: 97.261 (103.889%)
morton: 98.702 (105.428%)
===

And the results for 3d were:

LINEAR ACCESS TEST 3D (radius = 0, 1000 runs)
linear: 0.422 (100%)
block: 0.795 (188.389%)
---
LINEAR ACCESS TEST 3D (radius = 1, 1000 runs)
linear: 0.557 (100%)
block: 0.706 (126.75%)
---
LINEAR ACCESS TEST 3D (radius = 2, 1000 runs)
linear: 0.452 (100%)
block: 0.859 (190.044%)
---
LINEAR ACCESS TEST 3D (radius = 3, 1000 runs)
linear: 0.42 (100%)
block: 0.921 (219.286%)
---
LINEAR ACCESS TEST 3D (radius = 4, 1000 runs)
linear: 0.389 (100%)
block: 1 (257.069%)
---
LINEAR ACCESS TEST 3D (radius = 5, 1000 runs)
linear: 0.531 (100%)
block: 0.842 (158.569%)
===
RANDOM ACCESS TEST 3D (radius = 0, 1000 runs)
linear: 4.786 (100%)
block: 5.042 (105.349%)
---
RANDOM ACCESS TEST 3D (radius = 1, 1000 runs)
linear: 7.128 (100%)
block: 6.771 (94.9916%)
---
RANDOM ACCESS TEST 3D (radius = 2, 1000 runs)
linear: 7.394 (100%)
block: 6.989 (94.5226%)
---
RANDOM ACCESS TEST 3D (radius = 3, 1000 runs)
linear: 7.582 (100%)
block: 7.613 (100.409%)
---
RANDOM ACCESS TEST 3D (radius = 4, 1000 runs)
linear: 8.015 (100%)
block: 8.536 (106.5%)
---
RANDOM ACCESS TEST 3D (radius = 5, 1000 runs)
linear: 7.875 (100%)
block: 8.614 (109.384%)
---

Conclusion

So what are the conclusions to draw? First, I have to say again that the test cases were constructed to a high degree and cannot speak for your specific scenario. Also, there was a serious variance among different runs of the same program (I have no statistical figures, but I'd guess 2-3%).

What can be said is that there are benefits in some scenarios, but also costs in others (as if we didn't know that from the beginning...). Especially block indexing can save 10-20% of time in some cases (radius = 1,2), and it isn't too bad in cases that don't suit it well (radius = 0,4,5). In 3d, it performed worse than I expected, probably the computation overhead for indices is simply too high there. MortonIndexer performed worse than BlockIndexer with small radii, although its performance didn't degrade that much with rising radii. So, as one would have expected, it seems to be the choice if elements 'roughly near' an already fetched element are accessed frequently. However, it also can waste an enormous amount of memory compared to BlockIndexer. A drawback that both BlockIndexer and MortonIndexer share is that they are significantly slower than LinearIndexer when you just 'plow through' an array. They are much slower in debug builds, too. This might not sound too bad, and in most cases the gains in release builds may make up for it, but I have worked with debug builds that were so terribly slow that – well - you couldn't really work with them, even as a developer. So I thought I should mention it here.

References

[1]: Matt Pharr, Greg Humphreys: Physically based rendering - from theory to implementation. Morgan Kaufmann Publishers, 2004. http://www.pbrt.org.

[2]: Wikipedia article about Z-order: http://en.wikipedia.org/wiki/Z-order_%28curve%29

[3]: Sean Eron Anderson: Bit twiddling hacks. http://graphics.stanford.edu/%7Eseander/bithacks.html

History

  • 2007-12-27: First upload.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Karsten Schwenk

Software Developer

Germany Germany

Member



Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board. (secure sign-in)
 
Search this forum  
 FAQ
    Layout  Per page   
  Refresh
-- There are no messages in this forum --
Permalink | Advertise | Privacy | Mobile
Web01 | 2.5.120517.1 | Last Updated 27 Dec 2007
Article Copyright 2007 by Karsten Schwenk
Everything else Copyright © CodeProject, 1999-2012
Terms of Use
Layout: fixed | fluid