Using Inline Assembly in C/C++

By jain.pk, 14 Oct 2006

Introduction

First of all, what does the term "inline" mean?

Generally the inline term is used to instruct the compiler to insert the code of a function into the code of its caller at the point where the actual call is made. Such functions are called "inline functions". The benefit of inlining is that it reduces function-call overhead.

With that in mind, inline assembly is easy to picture: it is a set of assembly instructions embedded directly in C/C++ code, much like the body of an inline function. Inline assembly is used for speed, and it is frequently used in system programming.

We can mix assembly statements into C/C++ programs using the asm keyword. Inline assembly is important because it can operate on C/C++ variables and make its output visible in them.

GCC Inline Assembly Syntax

Assembly language comes in two flavors: Intel style and AT&T style. The GNU C compiler, GCC, uses AT&T syntax, and this is what we will use. Let us look at some of the major differences between this style and the Intel style.

If you are wondering how you can use GCC on Windows, you can just download Cygwin from www.cygwin.com.

  1. Register Naming: Register names are prefixed with %, so that registers are %eax, %cl etc, instead of just eax, cl.
  2. Ordering of operands: Unlike Intel convention (first operand is destination), the order of operands is source(s) first, and destination last. For example, Intel syntax "mov eax, edx" will look like "mov %edx, %eax" in AT&T assembly.
  3. Operand Size: In AT&T syntax, the size of memory operands is determined from the last character of the op-code name. The suffix is b for (8-bit) byte, w for (16-bit) word, and l for (32-bit) long. For example, the above instruction written with an explicit size suffix is "movl %edx, %eax".
  4. Immediate Operand: Immediate operands are marked with a $ prefix, as in "addl $5, %eax" (which means add the immediate long value 5 to register %eax).
  5. Memory Operands: Missing operand prefix indicates it is a memory-address; hence "movl $bar, %ebx" puts the address of variable bar into register %ebx, but "movl bar, %ebx" puts the contents of variable bar into register %ebx.
  6. Indexing: Indexing or indirection is done by enclosing the index register or indirection memory cell address in parentheses. For example, "movl 8(%ebp), %eax" (moves the contents at offset 8 from the cell pointed to by %ebp into register %eax).
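
The points above can be seen together in a short sketch. It assumes GCC on an x86 or x86-64 target and uses the extended asm form described later in this article; the function name add_five is made up for this illustration. The comments show the Intel-style equivalent of each line.

```c
/* Sketch only: assumes GCC and an x86/x86-64 target.
   add_five is a hypothetical name used for illustration. */
int add_five(int x)
{
    int result;
    __asm__ ( "movl %1, %0;"   /* Intel: mov result, x  (AT&T puts the source first)  */
              "addl $5, %0;"   /* Intel: add result, 5  ($ marks an immediate value)  */
              : "=r" (result)  /* output operand, referred to as %0 */
              : "r" (x)        /* input operand, referred to as %1  */
              : "cc" );        /* addl changes the condition flags  */
    return result;
}
```

Calling add_five(10) should give 15; the l suffix on both instructions marks them as 32-bit operations.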

All our code targets Intel x86 processors. This matters because not all of these instructions will work on other processors.

Basic Inline Code

We can use either of the following formats for basic inline assembly.

asm("assembly code");

or

__asm__ ("assembly code");

Example:

asm("movl %ebx, %eax"); /* moves the contents of ebx register to eax */
__asm__("movb %ch, (%ebx)"); /* moves the byte from ch to the memory pointed by ebx */

If there is more than one assembly instruction, put a semicolon at the end of each instruction.

Please refer to the example below (available in basic_arithmetic.c in downloads).

#include <stdio.h>

int main() {
    /* Add 10 and 20 and store result into register %eax */
    __asm__ ( "movl $10, %eax;"
                "movl $20, %ebx;"
                "addl %ebx, %eax;"
    );

    /* Subtract 20 from 10 and store result into register %eax */
    __asm__ ( "movl $10, %eax;"
                    "movl $20, %ebx;"
                    "subl %ebx, %eax;"
    );

    /* Multiply 10 and 20 and store result into register %eax */
    __asm__ ( "movl $10, %eax;"
                    "movl $20, %ebx;"
                    "imull %ebx, %eax;"
    );

    return 0 ;
}

Compile it with the "-g" option of the GNU C compiler "gcc" to keep debugging information in the executable, then use the GNU Debugger "gdb" to inspect the contents of the CPU registers.

Extended Assembly

In extended assembly, we can also specify the operands. It allows us to specify the input registers, output registers and a list of clobbered registers.

asm ( "assembly code"
           : output operands                  /* optional */
           : input operands                   /* optional */
           : list of clobbered registers      /* optional */
);

If there are no output operands but there are input operands, we must place two consecutive colons surrounding the place where the output operands would go.
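
As a minimal sketch of the "inputs only" form, assuming GCC on x86 or x86-64: the statement below has an empty output section, so two colons appear back to back. The function name store_int is hypothetical. Because the instruction writes to memory that GCC cannot see from the operand list, "memory" is named in the clobber section.

```c
/* Sketch only: assumes GCC and an x86/x86-64 target.
   store_int is a hypothetical name used for illustration. */
void store_int(int *dest, int value)
{
    __asm__ __volatile__ ( "movl %0, (%1);"   /* write value to the address held in %1 */
                           :                    /* no output operands */
                           : "r" (value), "r" (dest)
                           : "memory" );        /* the asm writes to memory */
}
```

After store_int(&x, 42), the variable x should hold 42.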

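The clobber list is optional in the syntax, but GCC does not analyse the assembly text itself. Every register that the assembly code modifies and that is not listed as an output operand must appear in the clobber list; otherwise GCC may keep a live value in that register and the program can misbehave.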

Example (1)

asm ("movl %%eax, %0;" : "=r" ( val ));

In this example, the variable "val" is kept in a register, the value in register eax is copied onto that register, and the value of "val" is updated into the memory from this register.

When the "r" constraint is specified, gcc may keep the variable in any of the available General Purpose Registers. We can also specify the register names directly by using specific register constraints.

The register constraints are as follows :

+------------+--------------------+
| Constraint |    Register(s)     |
+------------+--------------------+
| a          |   %eax, %ax, %al   |
| b          |   %ebx, %bx, %bl   |
| c          |   %ecx, %cx, %cl   |
| d          |   %edx, %dx, %dl   |
| S          |   %esi, %si        |
| D          |   %edi, %di        |
+------------+--------------------+
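
Specific register constraints are most useful when an instruction has fixed registers. As a hedged sketch (GCC, x86 or x86-64 assumed; mul_wide is a made-up name), the unsigned multiply instruction mull always takes one factor from %eax and leaves the 64-bit product in %edx:%eax, so the "a" and "d" constraints tie the C variables to exactly those registers.

```c
/* Sketch only: assumes GCC and an x86/x86-64 target.
   mul_wide is a hypothetical name used for illustration. */
unsigned long long mul_wide(unsigned int x, unsigned int y)
{
    unsigned int lo, hi;
    __asm__ ( "mull %3;"                 /* %edx:%eax = %eax * operand */
              : "=a" (lo), "=d" (hi)     /* low half in %eax, high half in %edx */
              : "a" (x), "rm" (y)        /* x must be in %eax; y may be a register or memory */
              : "cc" );
    return ((unsigned long long) hi << 32) | lo;
}
```

For example, mul_wide(0xFFFFFFFFu, 2u) should produce 0x1FFFFFFFE, a value that does not fit in 32 bits.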

Example (2)

    int no = 100, val ;
    asm ("movl %1, %%ebx;"
         "movl %%ebx, %0;"
         : "=r" ( val )        /* output */
         : "r" ( no )          /* input */
         : "%ebx"              /* clobbered register */
    );

In the above example, "val" is the output operand, referred to by %0, and "no" is the input operand, referred to by %1. "r" is a constraint on the operands, which tells GCC to use any register for storing the operands.

An output operand constraint must carry the constraint modifier "=", which marks the operand as write-only. Register names are prefixed with two %’s, which helps GCC distinguish registers from operands; operands have a single % as prefix.

The clobbered register %ebx after the third colon informs GCC that the value of %ebx is modified inside the "asm" statement, so GCC will not keep any other value in this register across it.
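
The "=" modifier marks an operand that is only written. GCC also documents a "+" modifier for an operand that is both read and written, which the article's examples do not use. The following is a small sketch of it, assuming GCC on x86 or x86-64; increment is a hypothetical name.

```c
/* Sketch only: assumes GCC and an x86/x86-64 target.
   increment is a hypothetical name used for illustration. */
int increment(int value)
{
    __asm__ ( "incl %0;"          /* read %0, add one, write it back */
              : "+r" (value)      /* "+" means read-write operand    */
              :                   /* no separate input operands      */
              : "cc" );           /* incl changes the condition flags */
    return value;
}
```

With "+r" the same register carries the value in and out, so no separate input operand is needed.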

Example (3)

int arg1, arg2, add ;
__asm__ ( "addl %%ebx, %%eax;"
        : "=a" (add)
        : "a" (arg1), "b" (arg2) );

Here "add" is the output operand referred to by register eax. And arg1 and arg2 are input operands referred to by registers eax and ebx respectively.

Let us see a complete example using extended inline assembly statements. It performs simple arithmetic operations on integer operands and displays the result (available as arithmetic.c in downloads).

#include <stdio.h>

int main() {

    int arg1, arg2, add, sub, mul, quo, rem ;

    printf( "Enter two integer numbers : " );
    scanf( "%d%d", &arg1, &arg2 );

    /* Perform Addition, Subtraction, Multiplication & Division */
    __asm__ ( "addl %%ebx, %%eax;" : "=a" (add) : "a" (arg1) , "b" (arg2) );
    __asm__ ( "subl %%ebx, %%eax;" : "=a" (sub) : "a" (arg1) , "b" (arg2) );
    __asm__ ( "imull %%ebx, %%eax;" : "=a" (mul) : "a" (arg1) , "b" (arg2) );

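    /* cltd sign-extends %eax into %edx:%eax, the 64-bit dividend idivl expects.
       "=&d" stops GCC from placing the divisor in %edx before it is read. */
    __asm__ ( "cltd;"
              "idivl %3;" : "=a" (quo), "=&d" (rem) : "a" (arg1), "rm" (arg2) : "cc" );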

    printf( "%d + %d = %d\n", arg1, arg2, add );
    printf( "%d - %d = %d\n", arg1, arg2, sub );
    printf( "%d * %d = %d\n", arg1, arg2, mul );
    printf( "%d / %d = %d\n", arg1, arg2, quo );
    printf( "%d %% %d = %d\n", arg1, arg2, rem );

    return 0 ;
}

Volatile

If our assembly statement must execute where we put it (i.e. it must not be moved out of a loop or removed as an optimization), put the keyword "volatile" or "__volatile__" after "asm" or "__asm__" and before the parentheses.

asm volatile ( "...;"
        "...;" : ... );

or

__asm__ __volatile__ ( "...;"
            "...;" : ... );
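
A typical case where this matters is an instruction whose result changes on every execution. The sketch below assumes GCC on an x86 or x86-64 processor that provides the rdtsc instruction; read_tsc is a made-up name. Without the volatile keyword, GCC would be allowed to treat two identical statements as producing the same value and merge them.

```c
/* Sketch only: assumes GCC and an x86/x86-64 CPU with the rdtsc instruction.
   read_tsc is a hypothetical name used for illustration. */
unsigned long long read_tsc(void)
{
    unsigned int lo, hi;
    /* rdtsc places the time-stamp counter in %edx:%eax */
    __asm__ __volatile__ ( "rdtsc" : "=a" (lo), "=d" (hi) );
    return ((unsigned long long) hi << 32) | lo;
}
```

Calling read_tsc() before and after a piece of code gives a rough cycle count for it.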

Refer to the following example, which computes the Greatest Common Divisor using the well-known Euclid's Algorithm (often described as the first algorithm).

#include <stdio.h>

int gcd( int a, int b ) {
    int result ;
    /* Compute Greatest Common Divisor using Euclid's Algorithm */
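    /* Numeric local labels (1:, 2:) stay unique if GCC inlines or duplicates
       this statement; the clobber list names every register the code changes. */
    __asm__ __volatile__ ( "movl %1, %%eax;"
                          "movl %2, %%ebx;"
                          "1: cmpl $0, %%ebx;"
                          "je 2f;"
                          "cltd;"                /* sign-extend %eax into %edx:%eax */
                          "idivl %%ebx;"
                          "movl %%ebx, %%eax;"
                          "movl %%edx, %%ebx;"
                          "jmp 1b;"
                          "2: movl %%eax, %0;" : "=g" (result) : "g" (a), "g" (b)
                          : "%eax", "%ebx", "%edx", "cc"
    );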

    return result ;
}

int main() {
    int first, second ;
    printf( "Enter two integers : " ) ;
    scanf( "%d%d", &first, &second );

    printf( "GCD of %d & %d is %d\n", first, second, gcd(first, second) ) ;

    return 0 ;
}
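
For comparison, here is a minimal sketch of the same Euclid loop in plain C. The name gcd_c is chosen for this illustration so that it does not clash with gcd above. It follows the same steps as the assembly: divide, keep the remainder, and stop when the remainder reaches zero.

```c
/* Plain C sketch of the loop implemented in assembly above.
   gcd_c is a hypothetical name used for illustration. */
int gcd_c(int a, int b)
{
    while (b != 0) {
        int remainder = a % b;   /* what idivl leaves in %edx */
        a = b;                   /* movl %ebx, %eax */
        b = remainder;           /* movl %edx, %ebx */
    }
    return a;
}
```

Comparing the compiler's output for this function (for example with gcc -S) against the hand-written version is a simple way to check whether the inline assembly actually saves instructions.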

Here are some more examples that use the FPU (Floating Point Unit) instruction set.

An example program to perform simple floating point arithmetic:

#include <stdio.h>

int main() {

    float arg1, arg2, add, sub, mul, div ;

    printf( "Enter two numbers : " );
    scanf( "%f%f", &arg1, &arg2 );

    /* Perform floating point Addition, Subtraction, Multiplication & Division.
       flds/fstps need memory operands, so the "m" constraint is used. */
    __asm__ ( "flds %1;"
              "flds %2;"
              "fadd;"
              "fstps %0;" : "=m" (add) : "m" (arg1), "m" (arg2) ) ;

    __asm__ ( "flds %2;"
              "flds %1;"
              "fsub;"
              "fstps %0;" : "=m" (sub) : "m" (arg1), "m" (arg2) ) ;

    __asm__ ( "flds %1;"
              "flds %2;"
              "fmul;"
              "fstps %0;" : "=m" (mul) : "m" (arg1), "m" (arg2) ) ;

    __asm__ ( "flds %2;"
              "flds %1;"
              "fdiv;"
              "fstps %0;" : "=m" (div) : "m" (arg1), "m" (arg2) ) ;

    printf( "%f + %f = %f\n", arg1, arg2, add );
    printf( "%f - %f = %f\n", arg1, arg2, sub );
    printf( "%f * %f = %f\n", arg1, arg2, mul );
    printf( "%f / %f = %f\n", arg1, arg2, div );

    return 0 ;
}

Example program to compute trigonometrical functions like sin and cos:

#include <stdio.h>

float sinx( float degree ) {
    float result, two_right_angles = 180.0f ;
    /* Convert angle from degrees to radians and then calculate sin value */
    __asm__ __volatile__ ( "flds %1;"
                            "flds %2;"
                            "fldpi;"
                            "fmul;"
                            "fdiv;"
                            "fsin;"
                            "fstps %0;" : "=m" (result)
                            : "m" (two_right_angles), "m" (degree)
    ) ;
    return result ;
}

float cosx( float degree ) {
    float result, two_right_angles = 180.0f, radians ;
    /* Convert angle from degrees to radians and then calculate cos value */
    __asm__ __volatile__ ( "flds %1;"
                            "flds %2;"
                            "fldpi;"
                            "fmul;"
                            "fdiv;"
                            "fstps %0;" : "=m" (radians)
                            : "m" (two_right_angles), "m" (degree)
    ) ;
    __asm__ __volatile__ ( "flds %1;"
                            "fcos;"
                            "fstps %0;" : "=m" (result) : "m" (radians)
    ) ;
    return result ;
}

float square_root( float val ) {
    float result ;
    __asm__ __volatile__ ( "flds %1;"
                            "fsqrt;"
                            "fstps %0;" : "=m" (result) : "m" (val)
    ) ;
    return result ;
}

int main() {
    float theta ;
    printf( "Enter theta in degrees : " ) ;
    scanf( "%f", &theta ) ;

    printf( "sinx(%f) = %f\n", theta, sinx( theta ) );
    printf( "cosx(%f) = %f\n", theta, cosx( theta ) );
    printf( "square_root(%f) = %f\n", theta, square_root( theta ) ) ;

    return 0 ;
}
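
The examples above load and store the FPU stack by hand. As an alternative sketch, GCC on x86 also offers the "t" constraint, which stands for the top of the x87 register stack, so the compiler can do the loading and storing itself. This assumes GCC on x86 or x86-64; sqrt_x87 is a hypothetical name and is not part of the article's downloads.

```c
/* Sketch only: assumes GCC and an x86/x86-64 target with an x87 FPU.
   sqrt_x87 is a hypothetical name used for illustration. */
float sqrt_x87(float x)
{
    float result;
    __asm__ ( "fsqrt"
              : "=t" (result)   /* output: top of the x87 stack */
              : "0" (x) );      /* input: same location as operand 0 */
    return result;
}
```

Here the assembly text shrinks to the single instruction that does the work, and sqrt_x87(9.0f) should return 3.0.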

Summary

GCC uses AT&T style assembly statements, and we can use the asm keyword to specify basic as well as extended assembly instructions. Using inline assembly can reduce the number of instructions the processor has to execute. In our GCD example, a hand-written assembly loop can need fewer instructions than the code generated from normal C code using Euclid's Algorithm, although the gain depends on the compiler and its optimization settings and is worth measuring.

You can also visit Eduzine© - electronic technology magazine of EduJini, the company that I work with.

History

  • 15th October, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

jain.pk
India


Comments and Discussions

Question: Missing clobbers (member Timothy Baldwin, 12 May '12 - 6:56)
Many of the examples modify registers which are neither listed as outputs or in the clobber list, which results in undefined behaviour. For example variables or constants may mysteriously change, or even take 2 different values simultaneously.
Question: Having problem compiling (member Mizan Rahman, 9 Aug '11 - 0:48)
Hi,
 
Thank you for this informative article.
 
I created a MFC dialog based app in VS2008 in Windows7 64bit. I placed some asm code:
void CMfcAsmTestDlg::OnBnClickedButton1()
{
  char format[] = "%s %s\n";
  char hello[] = "Hello";
  char world[] = "world";
 
   __asm
   {
      mov  eax, offset world
      push eax
      mov  eax, offset hello
      push eax
      mov  eax, offset format
      push eax
      call printf
      //clean up the stack so that main can exit cleanly
      //use the unused register ebx to do the cleanup
      pop  ebx
      pop  ebx
      pop  ebx
   }
}
I get: error C2415: improper operand type. This happens on each line I used the word 'mov'.
 
Do I need to change any project settings or include any file?
Answer: Re: Having problem compiling (member Emilio Canizalez, 17 Dec '11 - 6:27)
in nasm code
call printf ->platform linux
call _printf -> platform windows
underscore i think that is the problem
General: Running the example in Visual Studio 2005. (member ntt broke, 10 Aug '09 - 4:51)
how can I run or build it in visual studio? which project type I need to choose? which template I need to choose?
If I paste the c file code (arithmetic for example) I'm getting errors like:
error C2143: syntax error : missing ')' before ':' arithmetic.c
regarding the asm commands.
So I guess I need to tell him or reference him to the assemble language compiler or something?
Please please be detailed and specific in your answers!
General: indexed notation (member cycologist, 20 Jan '09 - 17:07)
How would an intel instruction
mov dl,[bx + si] be written in GCC? I assume GCC does do other registers besides extended ones?
 
the "whatever it takes" attitude

Question: using __emit in unix (member vikramaditya234, 19 Nov '08 - 1:28)
I have to use '__emit' (porting it from Windows) in C in UNIX, but when i use it in the manner:
asm("__emit 0fh"); 
 
it gives me error:
/tmp/ccNpl4q8.s:14: Error: invalid character '_' in mnemonic
 
Please suggest me the equivalent for the same that will work in UNIX
 
think because thats what matters

General: using inline assembly in VS 2005 (member @run, 10 Sep '07 - 20:31)
When I write in Visual Studio 2005
void main()
{
 
char msg[] = "Hello World!";
_asm
{
lea eax, msg
push eax
call printf
line 20: add esp,4
}
}
 
I get exception
"Unhandled exception at 0x004182bc in TEST.exe: 0xC000001E: An attempt was made to execute an invalid lock sequence."
on line 20
 
I think this works in VS 2003 but not in VS 2005
Please help me to solve this.

By Arun

General: Still useful (member the developer, 17 Oct '06 - 22:04)
There are some rare cases (64bit integer arithmetic on 32bit hardware, MMX, SSE, etc.) where inline asm comes handy.
 
If you are developing really time critical code inline asm can save the day, too.
 
The Flash Systems developer
www.flashsystems.de

General: Re: Still useful (member umeca74, 20 Oct '06 - 23:56)
just continuing this academic discussion... :)
 
some assembly is good for you, if nothing for looking into the disassembly window during debugging
 
as for using inline assembly to "speed-up" code, let me give you a recent example from my coding life. I was trying to find a fast fuzzy-search algorithm and found a few pieces of code for Ratcliff/Obershelp algorithm
 
one was in C, using recursion, and another in assembly, looking really mean and lean. I tried them both. The compiler optimized C version turned out to be 5 times faster than inline assembly!
 
moral: unless you really know what you're doing, better leave assembly to the professionals (ie compilers) :)
General: Re: Still useful [modified] (member Duncan Mackay, 24 Oct '10 - 21:51)
Just before everyone rushes to learn assembly, don't forget that parts of the C library are already written in assembly, and some implementations (e.g. memcpy) even utilise SIMD instructions, when supported by your processor.
 
Quite recently, I came across a certain "need for speed", as follows.
 
I was trying to scan a file of several hundred megabytes - as quickly as possible - for the appearance and continuation of a synchronisation pattern. A given 3 byte pattern indicated the beginning of synchronisation, thereafter the file should contain similar sync patterns at fixed offsets. The code needed to cope with continued loss and gain of the sync pattern - some files might maintain sync perfectly throughout, others would contain several breaks and resynchronisations, and others might contain no sync at all. The code had to work fast in all cases.
 
My original implementation read large chunks of the file into a large circular buffer, and performed the search for the sync pattern on that circular buffer. The circular buffer conveniently provided an overloaded operator [], so that the user of the class dealt in the logical byte offsets rather than the physical locations of buffered bytes.
 
Under the particular file chunk size I was using at the time (perhaps 64K, I can't remember), each synchronisation check was taking 33 milliseconds - not exactly slow, but considerable given such huge files. I wanted to do much better.
 
After a brief study of the code, I noted the circular buffer's operator [], although user-friendly, was having to do arithmetic to translate the logical byte number into a physical byte in its internal array. The sync detection routine was working one byte at a time, as was the sync verification routine, and both were using the circular buffer's "user-friendly" operator [].
 
The first optimisation step was to add a method to the circular buffer class to get hold of a raw contiguous byte buffer. When the buffered data wraps around between the end and the beginning of the physical buffer, the circular buffer class uses memcpy to re-arrange its internal layout. The sync detection and verification code was then modified to make use of the raw byte buffer instead of relying on the more costly operator [].
 
The next step was to speed up the sync detection. Instead of checking each byte via circularbuffer::operator [], the code was changed to use wcschr - the wide-character version of strchr. This can be used to very quickly locate the first two bytes of our pattern. Ideally, we would have liked to use wcsstr for detecting the complete 3 byte pattern, but this routine is not implemented in assembly code, as stepping into the function will show.
 
The final step was to speed up the sync verification - a memcmp() was used in place of the 3 serial circular buffer operator [] calls.
 
After making these changes, the chunk processing time fell from 33 milliseconds to 1/3 of a millisecond!
 
I expect, with effort, dropping down to assembly could yield an even faster algorithm - but I doubt it would improve matters greatly. This optimisation using only C library calls and a small bit of refactoring is 100 times faster than the original code already.
 
The moral of the story here, then, is to be sure your C/C++ coding is as tight as possible in the first place, before deciding that a venture into assembly language is necessary. Don't forget that assembly, unlike C/C++, is not portable, and, that if you do make use of SSE instructions directly in your code, you'll be restricting your code to run only on processors supporting those instructions.
 
Happy coding!

modified on Monday, October 25, 2010 4:11 AM
