Click here to Skip to main content
12,887,921 members (45,828 online)
Click here to Skip to main content
Add your own
alternative version

Stats

6K views
334 downloads
14 bookmarked
Posted 18 Apr 2017

Need for Speed - C++ versus Assembly Language

, 18 Apr 2017 CPOL
Rate this:
Please Sign up or sign in to vote.
Impossible to beat the C++ compiler?

Introduction

At first sight, we tend to believe that assembled code, i.e. machine code obtained from ASM (Assembly Language) source through an Assembler, should run faster than compiled code, i.e. machine code obtained from a High-Level Language source through a Compiler. The reason is that ASM has an almost one-to-one (isomorphic) mapping to machine code (ASM can be called symbolic machine code) while the Compiler works at least on a 2-stage process, the last one being a translation to ASM.
It used to be like that, but Compilers, in particular some C++ Compilers, are becoming increasingly ingenious at optimization and, in most cases, only very talented ASM programmers can equal them or do better. As always, there are many exceptions to the rule, but we will not elaborate on them here. A single example, tasks that deal with vectorized SIMD instructions are, at this time, better handled using ASM, although intrinsics (which are actually ASM in disguise) do the job almost as well.
In spite of that, ASM continues to be an important programming language to learn, not only because exist tasks that need to be handled as close to the hardware as possible but mainly because it provides the sort of insight impossible to acquire through High-Level Language study and practice.

For illustration purposes, I will present a program that will make some lengthy computation, first by calling a C++ function then by calling a linked assembled ASM object file. By lengthy computation I mean a few seconds, although the program can be easily modified to take less or more time. I had not done any steps to improve the C++ code; the Compiler had to do its best with what was there. On the other hand, I undertook a few optimization steps with the ASM source code and was able to improve the assembled performance by more than 30%.

Background

Our program will calculate a determinant under the Cramer's Rule, which is an explicit formula for the solution of a system of linear equations, with the same number of equations as unknowns.
Just consider that the calculation of determinants is central to the Cramer's Rule - this shall be enough for our purpose and we will not spend more time with the Math behind.
The Cramer's Rule is great for calculating determinants of small matrixes; however it becomes computationally very expensive for larger matrix sizes, because its time complexity is O(n.n!) - Calculating the determinant of a 13x13 matrix takes seconds, a 14x14 matrix will take minutes and be ready to wait one hour to know the determinant of a 15x15 matrix.

Using the code

1) How it was compiled?

The C++ source code was compiled with Visual Studio 2015. The Release build of the compilation has the following settings:
Maximize Speed, Use Streaming SIMD Extensions 2 (/arch:SSE2), Fast (/fp:fast) floating point model, Disable Security Check (/GS-), not Enable C++ Exception, No (/SAFESEH:NO) for Win32.

2) How it works

The program will calculate the determinant of a 13x13 matrix by using the Cramer's Rule. Most of the time is spend doing recursion and looping. The recursion is required because the determinant of a matrix is calculated from the determinants of lower order matrixes (called "minors"), which are obtained by eliminating the top row and a column at a time. Lower order matrixes determinants, are calculated the same way, and so on. In our program, the recursion ends when we reach a 4x4 matrix, whose determinant is then calculated directly. Matrixes of order 4x4, 3x3 and 2x2 have their determinants calculate directly and don't do any recursion. The program loops through three levels to prepare the minor matrixes. For a 13x13 matrix the outer loop will execute 321,408,685 times, the middle loop 1,359,245,472 times and the innermost loop will execute 7,191,246,668 times.
The calculation of 4x4 minor determinants executes comparatively much fewer times, only 259,459,200.

3) Preparing the matrix

The matrix is populated with random float numbers in the range -50.0 to +50.0 rounded to 1 decimal place.

int i, j;
int numRows, numCols;
float *inMat;

srand(time(NULL));
numRows = 13; /*Changes as you like for other matrix orders */
numCols = numRows;
inMat = (float*)malloc(numRows*numCols * sizeof(float));
for (i = 0;i < numRows;i++)
    for (j = 0;j < numCols;j++)
        /* Pointer to matrix, filled as one dimensional array */
        inMat[i*numCols + j] = roundf((((float)rand() / (float)(RAND_MAX)) * 100 - 50) * 10) / 10;

printf("%dx%d Matrix\n\n", numRows, numCols);
printf("Input Matrix\n");
for (int i = 0; i < numRows; i++)
{
    for (int j = 0; j < numCols; j++)
        printf("%.1f\t", inMat[i*numCols + j]);
    printf("\n");
}
printf("\n");

4) Measuring the time for the C++ execution:

void C_example(float* inMat, int rows)
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;

    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&t1);
    float det = determinant(inMat, rows);
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;

    printf("\nDeterminant of Matrix using C++ is %.2f.\n", det);
    printf("Elapsed time: %f miliseconds\n", elapsedTime);
}

5) Calculating the determinant in C++:

float determinant(float*inMat, int rows)
{    
    HANDLE heapPtr;
    float* minorMat;
    int i, j, k, s;
    float det = 0.0f, sign = 1.0f;

    if (4 == rows)
    {
        /*
        float A11 = inMat[0];
        float A12 = inMat[1];
        float A13 = inMat[2];
        float A14 = inMat[3];
        float A21 = inMat[rows];
        float A22 = inMat[rows + 1];
        float A23 = inMat[rows + 2];
        float A24 = inMat[rows + 3];
        float A31 = inMat[rows * 2];
        float A32 = inMat[rows * 2 + 1];
        float A33 = inMat[rows * 2 + 2];
        float A34 = inMat[rows * 2 + 3];
        float A41 = inMat[rows * 3];
        float A42 = inMat[rows * 3 + 1];
        float A43 = inMat[rows * 3 + 2];
        float A44 = inMat[rows * 3 + 3];
        (A13*A24-A14*A23) * ((A31*A42)-(A32*A41)) +
        (A14*A22-A12*A24) * ((A31*A43)-(A33*A41)) +
        (A13*A22-A12*A23) * ((A34*A41)-(A31*A44)) +
        (A11*A24-A14*A21) * ((A32*A43)-(A33*A42)) +
        (A11*A23-A13*A21) * ((A34*A42)-(A32*A44)) +
        (A11*A22-A12*A21) * ((A33*A44)-(A34*A43))
        */
        det = (inMat[2] * inMat[rows + 3] - inMat[3] * inMat[rows + 2]) * ((inMat[rows * 2] * inMat[rows * 3 + 1]) - (inMat[rows * 2 + 1] * inMat[rows * 3])) +
            (inMat[3] * inMat[rows + 1] - inMat[1] * inMat[rows + 3]) * ((inMat[rows * 2] * inMat[rows * 3 + 2]) - (inMat[rows * 2 + 2] * inMat[rows * 3])) +
            (inMat[2] * inMat[rows + 1] - inMat[1] * inMat[rows + 2]) * ((inMat[rows * 2 + 3] * inMat[rows * 3]) - (inMat[rows * 2] *inMat[rows * 3 + 3])) +
            (inMat[0] * inMat[rows + 3] - inMat[3] * inMat[rows]) * ((inMat[rows * 2 + 1] * inMat[rows * 3 + 2]) - (inMat[rows * 2 + 2] * inMat[rows * 3 + 1])) +
            (inMat[0] * inMat[rows + 2] - inMat[2] * inMat[rows]) * ((inMat[rows * 2 + 3] * inMat[rows * 3 + 1]) - (inMat[rows * 2 + 1] * inMat[rows * 3 + 3])) +
            (inMat[0] * inMat[rows + 1] - inMat[1] * inMat[rows]) * ((inMat[rows * 2 + 2] * inMat[rows * 3 + 3]) - (inMat[rows * 2 + 3] * inMat[rows * 3 + 2]));
    }
    else if (3 == rows)
    {
        det = inMat[0] * (inMat[rows + 1] * inMat[rows * 2 + 2] - inMat[rows + 2] * inMat[rows * 2 + 1]) -
            inMat[1] * (inMat[rows] * inMat[rows * 2 + 2] - inMat[rows + 2] * inMat[rows * 2]) +
            inMat[2] * (inMat[rows] * inMat[rows * 2 + 1] - inMat[rows + 1] * inMat[rows * 2]);
    }
    else if (2 == rows)
    {
        det = inMat[0] * inMat[rows + 1] - inMat[1] * inMat[rows];
    }

    else
    {
        heapPtr = GetProcessHeap();
        minorMat = (float*)HeapAlloc(heapPtr, 0, (rows - 1)*(rows - 1) * sizeof(float));
        for (i = 0;i < rows;i++, sign *= -1)
        {
            for (j = 1;j < rows;j++)
            {
                for (s = k = 0;k < rows;k++)
                {
                    if (i != k)
                    {
                        minorMat[((j - 1)*(rows - 1) + s)] = inMat[(j*rows + k)];
                        s++;
                    }
                }
            }

            det += (sign * inMat[i] * determinant(minorMat, rows - 1));
        }
        HeapFree(heapPtr, 0, minorMat);
    }

    return det;
}

 

6) Measuring the assembled ASM execution time:

void Asm_Example(float* inMat, int rows)
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;

    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&t1);
    float det = Determinant(inMat, rows);
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;

    printf("\nDeterminant of Matrix using Asm is %.2f\n", det);
    printf("Elapsed time: %f miliseconds\n", elapsedTime);
}

Note that the Determinant (with capital D) function is an external function:

extern "C"
{
    float Determinant(void* intMat, size_t rows);
}

7) The ASM source code.

We have 3 variations of source code: 1 for 32-bit and 2 for 64-bit.
The 32-bit version can be assembled as is with MASM.
For the 64-bit, we have one version ready to be assembled with JWASM (or HJWASM, which is a recent evolution of the first). JWASM has at least the same features under 64-bit that MASM has under 32-bit. In order words, MASM 64-bit lacks some of its own 32-bit functionality, but it works well and has its fans who never stop developing macros to compensate the shortcomings (or not, actually MASM means "macro assembler", so it’s in the name what you are expected to do!). Either way, I made as well a version that can be assembled with MASM 64-bit.

Below is the 64-bit version for JWASM/HJWASM (the 32-bit and the MASM 64-bit versions can be downloaded from above):

option casemap:none
option frame:auto

_MM_SHUFFLE MACRO fp3, fp2, fp1, fp0
    exitm <( ( ( fp3 ) shl 6)  or  ( ( fp2 ) shl 4 )  or  ( ( fp1 ) shl 2 ) or  ( ( fp0 ) ) ) >
ENDM

_PERMUTE_PS MACRO v, c
    shufps v, v, c
ENDM

_VECTORSWIZZLE MACRO v, fp0, fp1, fp2, fp3
    _PERMUTE_PS v, _MM_SHUFFLE (fp3, fp2, fp1, fp0)
ENDM

_REALSTOXMM MACRO par1, par2, par3, par4
Local xmmValue
  .const
  align 16
  xmmValue dd par1, par2, par3, par4
  .code
  exitm <xmmValue>
ENDM    

T4x4MATRIX struct 16
    r0 XMMWORD ?
    r1 XMMWORD ?
    r2 XMMWORD ?
    r3 XMMWORD ?
T4x4MATRIX ends

TFLOAT3X3 struct 4
    r0 REAL4 3 dup (?)
    r1 REAL4 3 dup (?)
    r2 REAL4 3 dup (?)
TFLOAT3X3 ends

HANDLE TYPEDEF PTR
HeapAlloc proto :HANDLE, : DWORD, :QWORD
HeapFree proto :HANDLE, : DWORD, : PTR
GetProcessHeap proto

.const
_MINUSONE REAL4 -1.0

.code

RecursiveDetCalc proc private FRAME uses rsi rdi r12 r13 r14 inMat : ptr, rows : qword
    LOCAL heapPtr : ptr
    LOCAL det : REAL4
    LOCAL sign : REAL4

    mov rsi, rcx
    mov rows, rdx
    mov r13, rdx
    mov det, 0.0
    mov sign, 1.0; sign

    .if rdx==4
        ASSUME rcx : ptr T4x4MATRIX
        movups xmm6, [rcx].r2
        movups xmm1, [rcx].r3
        movaps xmm2, xmm6
        _VECTORSWIZZLE xmm2, 1, 0, 0, 0
        movaps xmm3, xmm1
        _VECTORSWIZZLE xmm3, 2, 2, 1, 1
        movaps xmm4, xmm6
        _VECTORSWIZZLE xmm4, 1, 0, 0, 0
        movaps xmm5, xmm1
        _VECTORSWIZZLE xmm5, 3, 3, 3, 2
        movaps xmm0, xmm6
        _VECTORSWIZZLE xmm0, 2, 2, 1, 1
        movaps xmm7, xmm1
        _VECTORSWIZZLE xmm7, 3, 3, 3, 2

        mulps xmm2, xmm3
        movaps xmm8, xmm2
        mulps xmm4, xmm5
        movaps xmm9, xmm4
        mulps xmm0, xmm7
        movaps xmm10, xmm0
    
        movaps xmm2, xmm6
        _VECTORSWIZZLE xmm2, 2, 2, 1, 1
        movaps xmm3, xmm1
        _VECTORSWIZZLE xmm3, 1, 0, 0, 0
        movaps xmm4, xmm6
        _VECTORSWIZZLE xmm4, 3, 3, 3, 2
        movaps xmm5, xmm1
        _VECTORSWIZZLE xmm5, 1, 0, 0, 0
        movaps xmm0, xmm6
        _VECTORSWIZZLE xmm0, 3, 3, 3, 2
        movaps xmm7, xmm1
        _VECTORSWIZZLE xmm7, 2, 2, 1, 1

        mulps xmm2, xmm3
        movaps xmm3, xmm8
        subps xmm3, xmm2
        movaps xmm8, xmm3
    
        mulps xmm4, xmm5
        subps xmm9, xmm4

        mulps xmm0, xmm7
        subps xmm10, xmm0

        movups xmm6, [rcx].r1
        movaps xmm2, xmm6
        _VECTORSWIZZLE xmm2, 3, 3, 3, 2
        movaps xmm3, xmm6
        _VECTORSWIZZLE xmm3, 2, 2, 1, 1
        movaps xmm4, xmm6
        _VECTORSWIZZLE xmm4, 1, 0, 0, 0
    
        movups xmm0, [rcx].r0
        mulps xmm0, XMMWORD ptr _REALSTOXMM(1.0, -1.0, 1.0, -1.0)

        mulps xmm2, xmm8
    
        mulps xmm3, xmm9
        subps xmm2, xmm3
    
        mulps xmm4, xmm10
        addps xmm4, xmm2
    
        mulps xmm0, xmm4
        shufps xmm4, xmm0, _MM_SHUFFLE(1,0,0,0)
        addps xmm4, xmm0
        shufps xmm0, xmm4, _MM_SHUFFLE(0,3,0,0)
        addps xmm0, xmm4
        _PERMUTE_PS xmm0, _MM_SHUFFLE(2,2,2,2)

        ASSUME rcx : nothing
    .elseif rdx==3 ; Only used for 3x3 matrixes
        ASSUME rcx : ptr TFLOAT3X3
        movups xmm3, XMMWORD ptr [rcx].r0
        mulps xmm3, XMMWORD ptr _REALSTOXMM(1.0,-1.0,1.0,1.0)

        movups xmm0, XMMWORD ptr [rcx].r1+4
        movq xmm2, qword ptr [rcx].r2+4
        shufps xmm0, xmm2, _MM_SHUFFLE(1,0,1,0)
        ;calculate first determinant
        shufps xmm1, xmm0, _MM_SHUFFLE(0,1,0,0)
        mulps xmm0, xmm1
        shufps xmm1, xmm0, _MM_SHUFFLE(2,0,0,0)
        shufps xmm0, xmm0, _MM_SHUFFLE(3,3,3,3)
        shufps xmm1, xmm1, _MM_SHUFFLE(3,3,3,3)
        subps xmm0, xmm1
        movaps xmm4, xmm0
    
        movups xmm0, XMMWORD ptr [rcx].r1
        movups xmm2, XMMWORD ptr [rcx].r1+8
        shufps xmm0, xmm2, _MM_SHUFFLE(3,1,2,0)
        ;calculate second determinant
        shufps xmm1, xmm0, _MM_SHUFFLE(0,1,0,0)
        mulps xmm0, xmm1
        shufps xmm1, xmm0, _MM_SHUFFLE(2,0,0,0)
        shufps xmm0, xmm0, _MM_SHUFFLE(3,3,3,3)
        shufps xmm1, xmm1, _MM_SHUFFLE(3,3,3,3)
        subps xmm0, xmm1
        movaps xmm5, xmm0
    
        movups xmm0, XMMWORD ptr [rcx].r1
        movq xmm2, qword ptr [rcx].r2
        shufps xmm0, xmm2, _MM_SHUFFLE(1,0,1,0)
        ;calculate third determinant
        shufps xmm1, xmm0, _MM_SHUFFLE(0,1,0,0)
        mulps xmm0, xmm1
        shufps xmm1, xmm0, _MM_SHUFFLE(2,0,0,0)
        shufps xmm0, xmm0, _MM_SHUFFLE(3,3,3,3)
        shufps xmm1, xmm1, _MM_SHUFFLE(3,3,3,3)
        subps xmm0, xmm1

        unpckhps xmm4, xmm5
        shufps xmm4, xmm0, _MM_SHUFFLE(0,0,1,0)
        mulps xmm3, xmm4
    
        movaps xmm0, xmm3
        movhlps xmm0, xmm0
        addps xmm0, xmm3
        shufps xmm3, xmm3, 1
        addss xmm0, xmm3
        shufps xmm0, xmm0,0
        ASSUME rcx : nothing        
    .elseif rdx==2 ; Only used for 2x2 matrixes
        movss xmm0, dword ptr [rcx+12]
        movss xmm1, dword ptr [rcx+8]
        mulss xmm0, dword ptr [rcx]
        mulss xmm1, dword ptr [rcx+4]
        subss xmm0, xmm1
    .else
        INVOKE GetProcessHeap
        mov heapPtr, rax  
        mov rax, rows
        dec rax
        mul rax
        shl rax, 2
        INVOKE HeapAlloc, heapPtr, 0, rax ; space for new (n-1)x(n-1) matrix
        mov rdi, rax        
        
        mov r12,0

        .while r12<r13
            mov r10, 1
            .while r10<r13
                mov r11,0
                mov r9,0
            
                lea rax, [4*r13]
                mul  r10
                mov r14, rax

                lea rax, [4*r13-4]
                lea rcx, [r10-1]
                mul rcx
                mov r8, rax

                .while r11<r13
                    .if r12 != r11
                        lea rcx, [r14 + 4*r11]
                        mov eax, REAL4 ptr [rsi+rcx]
                        lea rcx, [r8+4*r9]
                        mov REAL4 ptr [rdi+rcx], eax
                        add r9, 1
                    .endif
                    add r11, 1
                .endw
                add r10, 1
            .endw
            
            lea rdx, [r13-1]
            INVOKE RecursiveDetCalc, rdi, rdx

            movss xmm1, dword ptr [rsi+4*r12]
            mulss xmm1, sign
            mulss xmm0, xmm1    
            
            addss xmm0, det
            
            movss xmm1, sign
            movss det, xmm0
            
            mulss xmm1, _MINUSONE
            movss sign, xmm1
            add r12, 1
        .endw
        INVOKE HeapFree, heapPtr, 0, rdi
    .endif
    ret
RecursiveDetCalc endp

Determinant proc public FRAME uses xmm6 xmm7 xmm8 xmm9 xmm10 inMat : ptr, rows : qword
    ; Save xmm registers here and do the calculation in the recursive function.
    
    INVOKE RecursiveDetCalc, rcx, rdx
    ret
Determinant endp

end

8) After compiling the 64-bit Release Build and running you will obtain results similar to the image below:

 


Disregard the difference in the determinant value, which is due to rounding errors under different variations of the calculation.

 

Why is ASM slower?

 

As mentioned, I made a number of optimizations to the ASM code but was not enough to beat the C++ compilation. The SIMD parts, namely the 4x4 minor determinants calculations are faster on the ASM code, but they do not weight that much in the total time taken by the whole.
I have seen the ASM listing produced by the C++ compiler and some parts are just mind blowing - nobody would believe a human would code that way (if he does the code would be almost impossible to maintain). The compiler uses every trick under the table in an automated way - difficult to beat. It knows about everything about how the pipelines and predictive branching work, it reorders of instructions, does loop tiling and uses cache-oblivious algorithms.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Jose A Pascoa
AtelierWeb Software
Portugal Portugal
Jose Pascoa is the owner of AtelierWeb Software (http://www.atelierweb.com). We produce security and network software and mixed utilities since 1999. The first program I published (in a BBS) was a MS-DOS utility, had the size of 21 KB and was done in Assembly Language. Nowadays, my low level languages are more likely to be "C", "C++" and "Delphi" rather than Assembly Language but I still like it. I have nothing against more fashionable languages like C# and technologies like WPF, actually I have played with them and published software with them.

You may also be interested in...

Pro
Pro

Comments and Discussions

 
QuestionC++ optimization potential: heap allocation Pin
Stefan_Lang24-Apr-17 4:09
memberStefan_Lang24-Apr-17 4:09 
GeneralThe compiler is not magic Pin
Yks19-Apr-17 20:58
memberYks19-Apr-17 20:58 
Questiondigging in a bit (or 2) Pin
Jan Heckman19-Apr-17 2:24
memberJan Heckman19-Apr-17 2:24 
AnswerRe: digging in a bit (or 2) Pin
Jose A Pascoa19-Apr-17 3:31
memberJose A Pascoa19-Apr-17 3:31 
GeneralRe: digging in a bit (or 2) Pin
Jan Heckman19-Apr-17 8:58
memberJan Heckman19-Apr-17 8:58 
GeneralRe: digging in a bit (or 2) Pin
bling20-Apr-17 20:54
memberbling20-Apr-17 20:54 
GeneralRe: digging in a bit (or 2) Pin
Jan Heckman20-Apr-17 22:04
memberJan Heckman20-Apr-17 22:04 
GeneralRe: digging in a bit (or 2) Pin
Jan Heckman24-Apr-17 6:29
memberJan Heckman24-Apr-17 6:29 
GeneralRe: digging in a bit (or 2) Pin
Jose A Pascoa24-Apr-17 18:00
memberJose A Pascoa24-Apr-17 18:00 
QuestionAssembly language is <i>possibly</i> dead (for humans) Pin
Daniel Pfeffer18-Apr-17 21:49
professionalDaniel Pfeffer18-Apr-17 21:49 
AnswerRe: Assembly language is <i>possibly</i> dead (for humans) Pin
Jose A Pascoa18-Apr-17 22:52
memberJose A Pascoa18-Apr-17 22:52 
QuestionNo assembly anymore Pin
Michael Chourdakis18-Apr-17 8:25
memberMichael Chourdakis18-Apr-17 8:25 
AnswerRe: No assembly anymore Pin
Nelek18-Apr-17 10:21
protectorNelek18-Apr-17 10:21 
GeneralRe: No assembly anymore Pin
Shao Voon Wong18-Apr-17 18:45
professionalShao Voon Wong18-Apr-17 18:45 
GeneralRe: No assembly anymore Pin
Nelek18-Apr-17 19:27
protectorNelek18-Apr-17 19:27 
GeneralRe: No assembly anymore Pin
Michael Chourdakis20-Apr-17 7:19
memberMichael Chourdakis20-Apr-17 7:19 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170424.1 | Last Updated 18 Apr 2017
Article Copyright 2017 by Jose A Pascoa
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid