Download source - 1.5 KB

Introduction

In this article, I shall explore the endian conversion problem and give a set of assembly functions to solve it. They target x86 and ARM, and can be compiled with Visual Studio (VS) and GCC.

Background

In a cross platform project, we faced an age old problem – Endian conversion. If the file is generated on a little Endian machine, an integer 255 may be stored like:

ff 00 00 00

But when these bytes are read into memory, the resulting value differs between platforms, which causes a porting problem.

int a;
fread(&a, sizeof(int), 1, file);
// on little endian machine, a = 0xff;
// but on big endian machine, a = 0xff000000;
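The difference can be reproduced on any machine by decoding the same four bytes both ways. The following is a minimal sketch; the helper names load_le32 and load_be32 are made up for illustration and are not part of the downloadable source, and it uses the <stdint.h> types rather than the article's uint32 typedef.

```c
#include <stdint.h>

/* Interpret four bytes as a little-endian 32-bit integer
   (least significant byte first). */
static uint32_t load_le32(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Interpret the same four bytes as a big-endian 32-bit integer
   (most significant byte first). */
static uint32_t load_be32(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

For the bytes ff 00 00 00, load_le32 yields 0xff and load_be32 yields 0xff000000, matching the two comments above.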

First Approach

A very simple and effective way to solve it is as follows. We can write a function called readInt().

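void readInt(void* p, FILE* file)
{
    unsigned char buf[4]; // unsigned, so bytes >= 0x80 are not sign-extended
    fread(buf, 4, 1, file);
    // Assemble the value from big-endian (network order) bytes;
    // reverse the indices for a little-endian file.
    *((uint32*)p) = (uint32)buf[0] << 24 | (uint32)buf[1] << 16
                   | (uint32)buf[2] << 8 | (uint32)buf[3];
}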

The function has the advantage of working on both big-endian and little-endian platforms. But it robs us of the convenience of reading a whole structure at once:

fread(&header, sizeof(struct MyFileHeader), 1, file);

If MyFileHeader contains many integers, we will have to break it down into many fread() calls. That is not only cumbersome to code, but also runs slowly because of the increased number of I/O operations. So I shall propose another method.
We can leave the code unchanged, and use several macros to post-process the data.

fread(&header, sizeof(struct MyFileHeader), 1, file);
CQ_NTOHL(header.version);
CQ_NTOHL_ARRAY(&header.box, 4); // box is a RECT structure

If the endianness of the machine doesn't match that of the data file, these macros are defined to do the conversion; otherwise they are defined as empty:

#if defined(ENDIAN_CONVERSION)
#    define CQ_NTOHL(a) {(a) = ((uint32)(a) >> 24) | (((a) & 0xff0000) >> 8) | \
	(((a) & 0xff00) << 8) | ((uint32)(a) << 24); }
#    define CQ_NTOHL_ARRAY(arr, num) {uint32 i_; uint32* p_ = (uint32*)(arr); \
	for(i_ = 0; i_ < (num); i_++) {CQ_NTOHL(p_[i_]); }}
#else
#    define CQ_NTOHL(a)
#    define CQ_NTOHL_ARRAY(arr, num)
#endif

This approach has the advantage of wasting no CPU cycles if ENDIAN_CONVERSION is not defined. And the code keeps its natural form of reading a whole structure at a time.
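The same post-processing idea can also be written with plain functions, which avoids evaluating a macro argument several times. This is only a sketch under assumptions: struct DemoHeader, swap32, swap32_array and fix_header are hypothetical names invented for the example (the real MyFileHeader layout is not shown in this article), and box is modeled as four 32-bit integers.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v & 0x00ff0000u) >> 8) |
           ((v & 0x0000ff00u) << 8) | (v << 24);
}

/* Reverse the byte order of every element of an array, in place. */
static void swap32_array(uint32_t *arr, size_t num)
{
    size_t i;
    for (i = 0; i < num; i++) {
        arr[i] = swap32(arr[i]);
    }
}

/* Hypothetical header used only for this illustration. */
struct DemoHeader {
    uint32_t version;
    uint32_t box[4];   /* stands in for a RECT: left, top, right, bottom */
};

/* Post-process a header that was read with a single fread(). */
static void fix_header(struct DemoHeader *h)
{
    h->version = swap32(h->version);
    swap32_array(h->box, 4);
}
```

In a real build, the call to fix_header would be compiled in only when the file and the host disagree on byte order, just as the macros above compile to nothing otherwise.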

The Fall of C

This is the best we can achieve with the C language. But I recalled that the 80x86 series of CPUs (from the 80486 onward) has a dedicated instruction for endian conversion. A little Googling confirmed it to be BSWAP. And for the ARM series of CPUs, there is also a short instruction sequence that accelerates it.

Sadly, there is a limitation with the C language. Because the underlying machine structures are widely different, how can one high level language harness all their power?

In the simplest scenario of division, the 8086 has a DIV instruction which, for an 8-bit divisor, stores the quotient in AL and the remainder in AH. But in C, we will have to write:

Quotient = a / b; 
Remainder = a % b;

Seemingly, the calculation is made twice. But a good compiler will be able to optimize it.
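Standard C does offer one way to ask for both results in a single call: the div() function from <stdlib.h>, which returns a div_t holding quot and rem. The wrapper below is a small sketch; the name divide is made up for the example, and whether the call actually maps to a single DIV instruction still depends on the compiler and library.

```c
#include <stdlib.h>

/* Compute quotient and remainder with one library call.
   div() truncates the quotient toward zero, like the / operator. */
static void divide(int a, int b, int *quotient, int *remainder)
{
    div_t r = div(a, b);
    *quotient = r.quot;
    *remainder = r.rem;
}
```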

For another example, almost every architecture has a rotate-right instruction. I wonder why C doesn't even consider it. So in C, if we want to rotate a right by b bits, we will have to write:

// valid for 0 < b < 32
uint32 mask = (1u << b) - 1;
a = (a >> b) | ((a & mask) << (32 - b));

Instructions such as arithmetic shift left/right and rotate left/right have no direct, portable counterpart in C.
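A rotate written in C has to be careful about shift counts: shifting a 32-bit value by 32 is undefined behavior, so b = 0 needs attention. The following sketch masks both counts so that every b is safe; rotr32 is a name chosen for this example. Many compilers are reported to recognize this idiom and emit a single rotate instruction, but that is not guaranteed.

```c
#include <stdint.h>

/* Rotate a right by b bits. Both shift counts are reduced to the
   range 0..31, so b == 0 and b >= 32 never trigger an undefined shift. */
static uint32_t rotr32(uint32_t a, unsigned b)
{
    b &= 31u;
    return (a >> b) | (a << ((32u - b) & 31u));
}
```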

For RISC machines like ARM, with its interesting but brilliant design, there is a huge gap between the instruction set and C semantics, so a tremendous amount of effort must be put into the compiler. Can we still trust the compiler to generate the desired code? No. I tried the above CQ_NTOHL() with VS2008, and it failed to use BSWAP. I stepped through the function ntohl() from winsock.h in the debugger, and it doesn't use BSWAP either.

I think there is a possible explanation for this: the time spent on endian conversion is dwarfed by the preceding I/O, so there is little incentive to optimize it. It seems I'm getting a little carried away. But that's what I am. ;) I'm having fun pushing it to the limit.

How to Use the Code

I shall write assembly code for 80486 and ARM CPUs, and explain how to compile it with both VS and GCC in the following sections. But first, let's declare the functions we are going to write:

uint32 cq_ntohl(uint32 value);
void cq_ntohl_array(uint32* arr, uint32 num);
uint16 cq_ntohs(uint16 value);
void cq_ntohs_array(uint16* arr, uint32 num);
#define CQ_NTOHL(a) ((a) = cq_ntohl(a))
#define CQ_NTOHL_ARRAY(p, n) cq_ntohl_array(p, n)
#define CQ_NTOHS(a) ((a) = cq_ntohs(a))
#define CQ_NTOHS_ARRAY(p, n) cq_ntohs_array(p, n)

As for the implementations, I shall give you 4 versions of them:

Visual Studio. X86
Visual Studio. Smart device
GCC arm-linux-gcc Inline assembly
GCC arm-linux-as Assembler

The source code is packed into a zip file. Here I shall only explain how to use it in your project and point out a few things that need your attention.

Visual Studio. X86

Visual Studio supports inline x86 assembly, so simply adding ec_x86.c to your VS project will suffice. A 32-bit integer is converted with the 80486 instruction BSWAP, and a 16-bit integer with a simple right rotate of 8 bits.

uint32 cq_ntohl(uint32 a) {
    __asm{
        mov eax, a;
        bswap eax; 
    }
}

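// num must be non-zero: LOOP decrements ECX before testing it,
// so a count of 0 would wrap around instead of skipping the loop.
void cq_ntohl_array(uint32* p, uint32 num) {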
    __asm{
        mov eax, dword ptr [p];
        mov ecx, num;
next:
        mov ebx, [eax];
        bswap ebx;
        mov dword ptr[eax], ebx;
        add eax, 4;
        loop next;
    }
}

uint16 cq_ntohs(uint16 a) {
    __asm{
        mov ax, a;
        ror ax, 8;
    }
}

// As above, num must be non-zero.
void cq_ntohs_array(uint16* arr, uint32 num) {
    __asm{
        mov eax, dword ptr [arr];
        mov ecx, num;
next: 
        mov bx, word ptr [eax];
        ror bx, 8;
        mov word ptr[eax], bx;
        add eax, 2;
        loop next;
    }
}
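When testing the assembly routines, it helps to have a plain C reference to compare against. The sketch below mirrors the 16-bit pair; ref_swap16 and ref_swap16_array are hypothetical names, not part of the zip file. Unlike the LOOP-based assembly, the C loop treats a count of zero as a no-op.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value (same effect as ROR by 8). */
static uint16_t ref_swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Swap every element in place; num == 0 simply does nothing. */
static void ref_swap16_array(uint16_t *arr, size_t num)
{
    size_t i;
    for (i = 0; i < num; i++) {
        arr[i] = ref_swap16(arr[i]);
    }
}
```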

Visual Studio. Smart Device

Visual Studio doesn't support inline ARM assembly. So you need to assemble the ec_arm.s file using armasm.exe:

armasm.exe ec_arm.s ec_arm.obj

Then link your EXE against the .obj file just as you would against an ordinary static library. Alternatively, you can add the .s file to your project and set up a Custom Build Step, but I won't explain the details here.

The 32-bit endian conversion algorithm involves exclusive or, bit clear, right rotate and right shift.

    eor r1, r0, r0, ROR #16
    bic r1, r1, #0xFF, 16
    mov r0, r0, ror #8
    eor r0, r0, r1, lsr #8

It’s a little hard to understand. But we can verify it. Let:

r0 = 0xaabbccdd. 
r1 = r0 ^ (r0 rotate-right 16) 
   = 0xaabbccdd ^ 0xccddaabb 
   = 0x(aa^cc,bb^dd,cc^aa,dd^bb); 
r1 = r1 & not(0xff0000) 
   = r1 & 0xff00ffff 
   = 0x(aa^cc, 0, cc^aa, dd^bb);
r0 = r0 rotate-right 8 = 0xddaabbcc;
r0 = r0 ^ (r1 >> 8) 
   = 0xddaabbcc ^ 0x(0, aa^cc, 0, cc^aa) 
   = 0x(dd, aa^aa^cc, bb, cc^cc^aa)
   = 0x(dd,cc,bb,aa);
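The same verification can be run mechanically by transcribing the four instructions into C, one statement per instruction. This is a sketch for checking the algorithm on any host, not the code shipped in ec_arm.s; the names ror32 and arm_style_bswap32 are made up for the example.

```c
#include <stdint.h>

/* Rotate v right by n bits; only used here with n = 8 and n = 16. */
static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32u - n));
}

/* C model of the four-instruction ARM byte swap. */
static uint32_t arm_style_bswap32(uint32_t r0)
{
    uint32_t r1 = r0 ^ ror32(r0, 16);   /* eor r1, r0, r0, ROR #16 */
    r1 &= 0xff00ffffu;                  /* bic r1, r1, #0xFF, 16   */
    r0 = ror32(r0, 8);                  /* mov r0, r0, ror #8      */
    return r0 ^ (r1 >> 8);              /* eor r0, r0, r1, lsr #8  */
}
```

With r0 = 0xaabbccdd the function returns 0xddccbbaa, the same result as the hand calculation above.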

It’s amazing that these calculations can be packed into just four 32-bit ARM instructions.

GCC Arm-Linux-gcc

arm-linux-gcc supports both inline assembly and .s files. So you can choose to use either ec_arm_linux.c or ec_arm_linux.s.

I’m not experienced with assembly language or assembler. I only taught myself the things that are needed to do the conversion, nothing much else. So please excuse me if there is any error or omission.

Points of Interest

From this experience, I learned that there is a huge gap between machine instructions and the C language, and I feel all the more indebted to compiler writers. Theirs is indeed a very hard job. I also learned that although C is platform independent and compiler independent, sometimes writing assembly does the job more quickly and directly.

References

Basic concepts on Endianness by Juan Carlos Cobas

History

October 1, 2008 - First version