Introduction
It is late at night. Very late. In fact it is so late, that I should say it’s almost early in the morning… but I´m immersed in a task I have delayed for too much time: extend my own Math library, directly written in C#.
Firstly, I must say that this article does not pretend to be anything too seriuos, just some ramblings about maths, C# and .Net. Some good ideas, and some bad ideas too... ;)
Background
Why to write my own library?
Because I think it’s good to depend on the Graphics API as less as possible. Why?
 Because APIs are deprecated from time to time (as happened with Managed DirectX), or evolve, or change, or reappear in some form (as it seems that is going to happen with some parts of Managed DirectX, via the Windows 7 Code Pack), or you just decide to use another API, or you want your code to be compatible both with XNA and with DirectX.
 Because I want to be able to debug the inners of my Maths.
 Because writing your own library (even if you copy most parts of it), you will learn a lot.
 Because seeing the internals of things, you will realize how much work some tasks take, and you will think it twice before calling them…
Do you need any more reasons?
Wait a minute. Faster code, using no API?
For CPUrelated stuff, YES, it is possible.
XNA (or DirectX) makes no magic. They just calculate things as you will do. But of course you must choose the best algorithms for each task.If you do so, you will end up with a code as fast as the one written in the API, but the point is, as you will know how things work in the inners, you will have a much more detailed idea of the cost of things, and you will optimize your code. In addition to that, you will be able to tune things a bit for your specific needs, increasing the speed a bit more.
However, keep in mind that companies like Microsoft have hundreds or thousands of the best engineers working in their products, so if there is a better method of doing something, the API will have it. You need to be that good too!
Let’s see an example of the kind of optimization you need to achieve to make your own Maths library as fast (or faster) than the ones included in the APIs. We will include some matrix operations and make some speed tests with different algorithms.
A note about Affine Transformation Matrices
In 3D graphics applications, transformations are typically built from the combination of elementary matrices (being this elementary matrices: translations, rotations, scales and the identity matrix). Matrices built this way, always result in Affine transformations (matrices with the right column of the matrix being 0,0,0,1 –if using row representation of matrices).
There are some algorithm optimizations in this article that are only valid for Affine Transformation matrices. You should avoid using them for nonaffine matrices. It is the case of Perspective Projection matrices, which DO NOT BELONG to the affine transformations group.
A note about JIT compilation and .NET maths performance
This is a very interesting question: Are local variables faster than member variables?
At first, one could say that they aren’t, as they have to be recreated each time the method is called. BUT, as this interesting blog post points out, while the JIT compiler can enregister local variables, it cannot do the same with member fields. So we should say that YES, THEY ARE, when talking about math calculations. As you can see the in the test, it makes a remarkable difference. So:
If you are developing in .Net and you have a performance critical math. calculation method, you should use local copy of variables to speed up calculations

In fact, if you get a glimpse inside the XNA code, every nonstatic method inside the Matrix class, which is performance critical, makes a local copy of each member field.
Part #1 Matrix Determinant Calculation
The general Method (Based on Graphics Gems I)
The following is a pretty straightforward algorithm for calculating the determinant of a matrix. It is mostly designed for nonrealtime applications, as it quite slow and uses doubleprecission. It calculates a 4x4 determinant basing on the 3x3 method, then in the 2x2 and so on. It is a based in an algorithm used in Graphics Gems I for matrix inversion.
public double Determinant4x4()
{
double ans;
double a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4;
a1 = this.M11; b1 = this.M12;
c1 = this.M13; d1 = this.M14;
a2 = this.M21; b2 = this.M22;
c2 = this.M23; d2 = this.M24;
a3 = this.M31; b3 = this.M32;
c3 = this.M33; d3 = this.M34;
a4 = this.M41; b4 = this.M42;
c4 = this.M43; d4 = this.M44;
ans = a1 * Determinant3x3(b2, b3, b4, c2, c3, c4, d2, d3, d4)
 b1 * Determinant3x3(a2, a3, a4, c2, c3, c4, d2, d3, d4)
+ c1 * Determinant3x3(a2, a3, a4, b2, b3, b4, d2, d3, d4)
 d1 * Determinant3x3(a2, a3, a4, b2, b3, b4, c2, c3, c4);
return ans;
}
public double Determinant3x3(double a1, double a2, double a3, double b1, double b2, double b3, double c1, double c2, double c3)
{
double ans;
ans = a1 * Determinant2x2(b2, b3, c2, c3)
 b1 * Determinant2x2(a2, a3, c2, c3)
+ c1 * Determinant2x2(a2, a3, b2, b3);
return ans;
}
public double Determinant2x2(double a, double b, double c, double d)
{
double ans;
ans = a * d  b * c;
return ans;
}
The general Method (realtime applications optimized)
This second version is also a full, general method, but optimized for realtime applications. It is much faster, because of the calculation and because of the grouping of duplicated mults in numXX variables, saving multiplications and adds.
public float Determinant()
{
float num18 = (M33 * M44)  (M34 * M43);
float num17 = (M32 * M44)  (M34 * M42);
float num16 = (M32 * M43)  (M33 * M42);
float num15 = (M31 * M44)  (M34 * M41);
float num14 = (M31 * M43)  (M33 * M41);
float num13 = (M31 * M42)  (M32 * M41);
return ((((M11 * (((M22 * num18)  (M23 * num17)) + (M24 * num16)))  (M12 * (((M21 * num18) 
(M23 * num15)) + (M24 * num14)))) + (M13 * (((M21 * num17)  (M22 * num15)) +
(M24 * num13))))  (M14 * (((M21 * num16)  (M22 * num14)) + (M23 * num13))));
}
This is the one used in XNA or SlimDX. It is the standard defacto.
Optimization I (faster Memory Accessing)
As stated in the intro, about the performance of local variables vs. member fields, we´ll make an optimization of the above method using this concept:
public float Determinant()
{
float L11 = this.M11;
float L12 = this.M12;
float L13 = this.M13;
float L14 = this.M14;
float L21 = this.M21;
float L22 = this.M22;
float L23 = this.M23;
float L24 = this.M24;
float L31 = this.M31;
float L32 = this.M32;
float L33 = this.M33;
float L34 = this.M34;
float L41 = this.M41;
float L42 = this.M42;
float L43 = this.M43;
float L44 = this.M44;
float C1 = (L33 * L44)  (L34 * L43);
float C2 = (L32 * L44)  (L34 * L42);
float C3 = (L32 * L43)  (L33 * L42);
float C4 = (L31 * L44)  (L34 * L41);
float C5 = (L31 * L43)  (L33 * L41);
float C6 = (L31 * L42)  (L32 * L41);
return ((((L11 * (((L22 * C1)  (L23 * C2)) + (L24 * C3)))  (L12 * (((L21 * C1) 
(L23 * C4)) + (L24 * C5)))) + (L13 * (((L21 * C2)  (L22 * C4)) + (L24 * C6)))) 
(L14 * (((L21 * C3)  (L22 * C5)) + (L23 * C6))));
}
See the test for performance results.
Optimization II (Based on Graphics Gems II, for Affine transformations only)
The following algorithm is a direct translation of a piece of code found in a Affine Matrix Inversion algorithm, inside Graphics Gems II (see in the next post for the full algorithm). It makes only 12 mults and 7 adds. Keep in mind that this method is only valid for affine transformations (see intro for further information):
public float Determinant()
{
return (this.M11 * this.M22 * this.M33) +
(this.M12 * this.M23 * this.M31) +
(this.M13 * this.M21 * this.M32) 
(this.M13 * this.M22 * this.M31) 
(this.M12 * this.M21 * this.M33) 
(this.M11 * this.M23 * this.M32);
}
The test
So, it’s time to measure this little boys.
Performance
The test was made calculation a 4x4 affine matrix determinant 100 Million times, in a Intel Core 2 Quad Q9450 2.66Ghz with 3 Gb (non multithreading).
General method (Graphics Gems I, non realtime) 
1 minute: 0.9 secs

General method (realtime) 
2.6832 secs

Optimization I (memory access) 
2.496 secs

Optimization II (affine transformations) 
1.404 secs

Reliability
The reliability test was made with two different matrices.There is no surprises here, all the algorithms work well. The following is the matrices used for the tests (affine transformations) and the results (the determinant) calculated by each algorithm.
Matrix 1:
0.7071068  0.235702276  0.6666667  0.0 
0.0  0.9428091  0.33333334  0.0 
0.7071068  0.235702276  0.6666667  0.0 
0.0  0.0  15.0  1 

General method (Graphics Gems I) 
1.0000001641611829

General method (realtime) 
1.00000012

Optimization I (memory access) 
1.00000012

Optimization II (affine transformations) 
1.00000012

Matrix 2:
9.848019  0.10061159  0.0  0.0 
3.50578356  0.282625765  0.0  0.0 
0.0  0.0  94.22  0.0 
0.0  0.0  0.0  1 

General method (Graphics Gems I) 
295.47639778127973

General method (realtime) 
295.4764

Optimization I (memory access) 
295.4764

Optimization II (affine transformations) 
295.4764

Part #2: Matrix Inverse Calculation
One of the most expensive Matrix operations (among the usual ones) is calculating the inverse of a matrix. Will start with the general algorithm for this, and study a couple of optimizations later:
The general Method
This algorithm (explained here), is something like this:
public Matrix Inverse()
{
float determinant = this.Determinant();
if (determinant == 0)
throw new System.InvalidOperationException("Matrix has zero determinant (singular matrix). Inverse doesn´t exist");
Matrix b = new Matrix();
b.M11 = M22 * M33 * M44 + M23 * M34 * M42 + M24 * M32 * M43  M22 * M34 * M43  M23 * M32 * M44  M24 * M33 * M42;
b.M12 = M12 * M34 * M43 + M13 * M32 * M44 + M14 * M33 * M42  M12 * M33 * M44  M13 * M34 * M42  M14 * M32 * M43;
b.M13 = M12 * M23 * M44 + M13 * M24 * M42 + M14 * M22 * M43  M12 * M24 * M43  M13 * M22 * M44  M14 * M23 * M42;
b.M14 = M12 * M24 * M33 + M13 * M22 * M34 + M14 * M23 * M32  M12 * M23 * M34  M13 * M24 * M32  M14 * M22 * M33;
b.M21 = M21 * M34 * M43 + M23 * M31 * M44 + M24 * M33 * M41  M21 * M33 * M44  M23 * M34 * M41  M24 * M31 * M43;
b.M22 = M11 * M33 * M44 + M13 * M34 * M41 + M14 * M31 * M43  M11 * M34 * M43  M13 * M31 * M44  M14 * M33 * M41;
b.M23 = M11 * M24 * M43 + M13 * M21 * M44 + M14 * M23 * M41  M11 * M23 * M44  M13 * M24 * M41  M14 * M21 * M43;
b.M24 = M11 * M23 * M34 + M13 * M24 * M31 + M14 * M21 * M33  M11 * M24 * M33  M13 * M21 * M34  M14 * M23 * M31;
b.M31 = M21 * M32 * M44 + M22 * M34 * M41 + M24 * M31 * M42  M21 * M34 * M42  M22 * M31 * M44  M24 * M32 * M41;
b.M32 = M11 * M34 * M42 + M12 * M31 * M44 + M14 * M32 * M41  M11 * M32 * M44  M12 * M34 * M41  M14 * M31 * M42;
b.M33 = M11 * M22 * M44 + M12 * M24 * M41 + M14 * M21 * M42  M11 * M24 * M42  M12 * M21 * M44  M14 * M22 * M41;
b.M34 = M11 * M24 * M32 + M12 * M21 * M34 + M14 * M22 * M31  M11 * M22 * M34  M12 * M24 * M31  M14 * M21 * M32;
b.M41 = M21 * M33 * M42 + M22 * M31 * M43 + M23 * M32 * M41  M21 * M32 * M43  M22 * M33 * M41  M23 * M31 * M42;
b.M42 = M11 * M32 * M43 + M12 * M33 * M41 + M13 * M31 * M42  M11 * M33 * M42  M12 * M31 * M43  M13 * M32 * M41;
b.M43 = M11 * M23 * M42 + M12 * M21 * M43 + M13 * M22 * M41  M11 * M22 * M43  M12 * M23 * M41  M13 * M21 * M42;
b.M44 = M11 * M22 * M33 + M12 * M23 * M31 + M13 * M21 * M32  M11 * M23 * M32  M12 * M21 * M33  M13 * M22 * M31;
return (1f / determinant) * b;
}
You can use any of the methods presented in Part 1 of this article for calculating the Determinant.
Optimization I (memory access and some operation savings)
This method applies the same concept, but using local variables and packing duplicated calculations in local variables too:
public static Matrix InvertXNALocalValues(Matrix matrix)
{
Matrix outMatrix = new Matrix();
float L1 = matrix.M11;
float L2 = matrix.M12;
float L3 = matrix.M13;
float L4 = matrix.M14;
float L5 = matrix.M21;
float L6 = matrix.M22;
float L7 = matrix.M23;
float L8 = matrix.M24;
float L9 = matrix.M31;
float L10 = matrix.M32;
float L11 = matrix.M33;
float L12 = matrix.M34;
float L13 = matrix.M41;
float L14 = matrix.M42;
float L15 = matrix.M43;
float L16 = matrix.M44;
float L17 = (L11 * L16)  (L12 * L15);
float L18 = (L10 * L16)  (L12 * L14);
float L19 = (L10 * L15)  (L11 * L14);
float L20 = (L9 * L16)  (L12 * L13);
float L21 = (L9 * L15)  (L11 * L13);
float L22 = (L9 * L14)  (L10 * L13);
float L23 = ((L6 * L17)  (L7 * L18)) + (L8 * L19);
float L24 = (((L5 * L17)  (L7 * L20)) + (L8 * L21));
float L25 = ((L5 * L18)  (L6 * L20)) + (L8 * L22);
float L26 = (((L5 * L19)  (L6 * L21)) + (L7 * L22));
float L27 = 1f / ((((L1 * L23) + (L2 * L24)) + (L3 * L25)) + (L4 * L26));
float L28 = (L7 * L16)  (L8 * L15);
float L29 = (L6 * L16)  (L8 * L14);
float L30 = (L6 * L15)  (L7 * L14);
float L31 = (L5 * L16)  (L8 * L13);
float L32 = (L5 * L15)  (L7 * L13);
float L33 = (L5 * L14)  (L6 * L13);
float L34 = (L7 * L12)  (L8 * L11);
float L35 = (L6 * L12)  (L8 * L10);
float L36 = (L6 * L11)  (L7 * L10);
float L37 = (L5 * L12)  (L8 * L9);
float L38 = (L5 * L11)  (L7 * L9);
float L39 = (L5 * L10)  (L6 * L9);
outMatrix.M11 = L23 * L27;
outMatrix.M21 = L24 * L27;
outMatrix.M31 = L25 * L27;
outMatrix.M41 = L26 * L27;
outMatrix.M12 = (((L2 * L17)  (L3 * L18)) + (L4 * L19)) * L27;
outMatrix.M22 = (((L1 * L17)  (L3 * L20)) + (L4 * L21)) * L27;
outMatrix.M32 = (((L1 * L18)  (L2 * L20)) + (L4 * L22)) * L27;
outMatrix.M42 = (((L1 * L19)  (L2 * L21)) + (L3 * L22)) * L27;
outMatrix.M13 = (((L2 * L28)  (L3 * L29)) + (L4 * L30)) * L27;
outMatrix.M23 = (((L1 * L28)  (L3 * L31)) + (L4 * L32)) * L27;
outMatrix.M33 = (((L1 * L29)  (L2 * L31)) + (L4 * L33)) * L27;
outMatrix.M43 = (((L1 * L30)  (L2 * L32)) + (L3 * L33)) * L27;
outMatrix.M14 = (((L2 * L34)  (L3 * L35)) + (L4 * L36)) * L27;
outMatrix.M24 = (((L1 * L34)  (L3 * L37)) + (L4 * L38)) * L27;
outMatrix.M34 = (((L1 * L35)  (L2 * L37)) + (L4 * L39)) * L27;
outMatrix.M44 = (((L1 * L36)  (L2 * L38)) + (L3 * L39)) * L27;
return outMatrix;
}
Optimization II (for Affine Transformations only Based on the Kevin Wu article, Graphics Gems II)
The Graphics Gems books series are a must in any graphics or game developer’s bookshelves. You can by the the book I’m talking about here, or get more info about the Graphics Gems series here. Google have some excerpts of this book in the following link, but one page out of each two is missing: Graphics gems II  Página 342.
So, the algorithm purposed by Mr. Wu looks like the following (ported to C#, original in C++):
public static bool Inverse(Matrix min, Matrix mout)
{
float det_1;
float pos, neg, temp;
double PRECISION_LIMIT = 1.0e15;
pos = neg = 0.0f;
temp = min.M11 * min.M22 * min.M33;
if (temp >= 0.0)
pos += temp;
else
neg += temp;
temp = min.M12 * min.M23 * min.M31;
if (temp >= 0.0)
pos += temp;
else
neg += temp;
temp = min.M13 * min.M21 * min.M32;
if (temp >= 0.0)
pos += temp;
else
neg += temp;
temp = min.M13 * min.M22 * min.M31;
if (temp >= 0.0)
pos += temp;
else
neg += temp;
temp = min.M12 * min.M21 * min.M33;
if (temp >= 0.0)
pos += temp;
else
neg += temp;
temp = min.M11 * min.M23 * min.M32;
if (temp >= 0.0)
pos += temp;
else
neg += temp;
det_1 = pos + neg;
if ((det_1 == 0.0)  (Math.Abs(det_1 / (pos  neg)) < PRECISION_LIMIT))
throw new System.InvalidOperationException("Matrix has zero determinant (singular matrix). Inverse doesn´t exist");
det_1 = 1.0f / det_1;
mout.M11 = (min.M22 * min.M33  min.M23 * min.M32) * det_1;
mout.M21 = (min.M21 * min.M33  min.M23 * min.M31) * det_1;
mout.M31 = (min.M21 * min.M32  min.M22 * min.M31) * det_1;
mout.M12 = (min.M12 * min.M33  min.M13 * min.M32) * det_1;
mout.M22 = (min.M11 * min.M33  min.M13 * min.M31) * det_1;
mout.M32 = (min.M11 * min.M32  min.M12 * min.M31) * det_1;
mout.M13 = (min.M12 * min.M23  min.M13 * min.M22) * det_1;
mout.M23 = (min.M11 * min.M23  min.M13 * min.M21) * det_1;
mout.M33 = (min.M11 * min.M22  min.M12 * min.M21) * det_1;
mout.M41 = (min.M41 * mout.M11 + min.M42 * mout.M21 + min.M43 * mout.M31);
mout.M42 = (min.M41 * mout.M12 + min.M42 * mout.M22 + min.M43 * mout.M32);
mout.M43 = (min.M41 * mout.M13 + min.M42 * mout.M23 + min.M43 * mout.M33);
mout.M14 = mout.M24 = mout.M34 = 0.0f;
mout.M44 = 1.0f;
return true;
}
Please keep in mind that this algorithm is only valid for affine transformations. Of course, some memory accessing optimization could be done here, using local variables as explained in Part 1.
Optimization III (for matrices composed by Rotations and Translations only)
If you know that your matrix is composed by rotations and translations only, you can apply this (even faster) method, which avoids calculating the determinant, and takes advantage of some properties of this kind of matrices. You can read all the info here.
public Matrix Inverse()
{
Matrix r = new Matrix();
r.M11 = this.M11;
r.M12 = this.M21;
r.M13 = this.M31;
r.M14 = 0f;
r.M21 = this.M12;
r.M22 = this.M22;
r.M23 = this.M32;
r.M24 = 0f;
r.M31 = this.M13;
r.M32 = this.M23;
r.M33 = this.M33;
r.M34 = 0f;
r.M41 = (r.M11 * this.M41 + r.M21 * this.M42 + r.M31 * this.M43);
r.M42 = (r.M12 * this.M41 + r.M22 * this.M42 + r.M32 * this.M43);
r.M43 = (r.M13 * this.M41 + r.M23 * this.M42 + r.M33 * this.M43);
r.M44 = 1f;
return r;
}
Additional Optimizations
If you need it, other optimizations could be included in this algorithms, detecting special matrices whose inverse can be calculated even faster. It is the case of diagonal matrices (where only the diagonal should be inverted using 1/value entries), or Orthogonal matrices (whose inverse is equal to its transpose).
The Test
We made a very simple test, with 50.000.000 (50 Million) inverse calculations of affine transformations, in an Intel Core 2 Quad Q9450 2.66Ghz with 3 Gb (non multithreading).
General method 
11.523 secs

Optimization I (memory accessing and op. savings) 
5.4912 secs

Optimization II (affine transformations only) 
2.7144 secs

Optimization III (rotations and translations only) 
1.326 secs

Conclusions
Searching a bit, it is indeed possible to reach the performance of your graphics API and as mentioned above, the advantages of having (and building) your own library are countless.
References
For further info about the mentioned algorithms or books, you can check the following links:
Take care!