QOR - Architecture Module: ArchQOR
This article is No.3 in a series on the Querysoft Open Runtime (QOR), an open source, aspect oriented framework for C++ software. The introductory article can be found here. The source code associated with this article is designed to work independently; you don't need to combine it with the code from previous articles.
Introduction: Some Assembly Required
The QOR Architecture module abstracts the different hardware platforms which the QOR targets; to put it another way, it is the architecture aspect of the QOR. In principle this should be easy, because we already require a working C++ compiler targeting a platform in order for the QOR to support it. There shouldn't be much else to do, as the Operating System should take care of the rest of the hardware idiosyncrasies for us. However, as usual, it turns out to be a little more complex than that.
The QOR is a C++ framework, and we go out of our way in places to ensure that it remains purely a C++ framework. There are a few areas where this becomes difficult to do conventionally, especially where the target operating system or compiler practically requires bits of assembly language code in order to compile. For example, in order to support Structured Exception Handling, C++ exception handling and the additional debugging and security support available from Microsoft Visual C++ on 32bit Windows without including any Microsoft code, some assembly language elements are essential.
Here's an example function from our interim Win32 exception handling and SEH implementation which can't be implemented without the use of assembly:
__QCMP_DECLARE_NAKED void JumpToFunction( void* target, void* targetStack, void* targetEBP )
{
    __asm
    {
        mov eax,[esp + 0x4]  ; eax = target function address
        mov ebp,[esp + 0xC]  ; switch to the target frame pointer
        mov esp,[esp + 0x8]  ; switch to the target stack
        jmp eax              ; transfer control; this function never returns
    }
}
How can this be, you may ask, if you've never needed any inline assembly to get your C++ code to work? The answer is that it's usually supplied by the C and C++ libraries that come with the OS or the build tools. On Windows using MSVC it even gets sneaked into your native application as tiny static libraries that are linked without you asking for them. For the QOR this is unacceptable: we want to be 100% open source, contain no proprietary code, and keep our implementation completely open and transparent.
However we can't avoid assembly language completely, and in fact we'd be losing out even if we could, because there are a handful of things that still really ought to be done in assembly. For example, the mathematical operations at the heart of almost all image processing can benefit from the use of SIMD (Single Instruction Multiple Data) extensions on newer processors, making real world operations like scaling and converting images many times faster than they would otherwise be.
So why not the easy way?
Couldn't we just compromise and include a bit of inline assembly here and there, or some prebuilt static libraries? After all, even the Microsoft Platform SDK does that in places:
__inline PVOID GetCurrentFiber( void ) { __asm mov eax, fs:[0x10] }
The answer is no, for three reasons.
Which assembly language do we use?
This is not just a question of which kind of processor we're targeting; in the case of x86 the assembly source itself comes in at least two very different varieties, AT&T syntax and Intel syntax. As the following simple example shows, not only is the format different but the operand order is reversed between the two.
The GetCurrentFiber code above won't work with the GCC assembler GAS in its default AT&T mode, and this is one of the reasons why the MinGW environment trips over in a pile of bits if you try to use it with the Windows Platform SDK headers.
AT&T syntax:
movl %esp, %ebp
Intel syntax:
mov ebp, esp
Unlike differing C++ compilers, we can't overcome the differences between AT&T and Intel syntax with a library like CompilerQOR. To support both we'd have to maintain two sets of source for x86 systems, one for each syntax. We'd still need a whole support library to allow users to choose their assembler, NASM, FASM, TASM or HLASM, and not lock them into MASM or GAS as some other frameworks do.
How do we support SSE4 without requiring it?
The second challenge is taking into account the variations in target hardware at the time the QOR is compiled. For example how can we take full advantage of SSE4 on x86 machines that support it if a QOR compiled to use SSE4 won't work on machines that don't support it?
On a recent x86 CPU that supports SSE4 we can do
pmovsxbd xmm0, m32
but if we only have SSE2 we have to do this instead to achieve the same result.
movd xmm0, m32
punpcklbw xmm0, xmm0
punpcklwd xmm0, xmm0
psrad xmm0, 24
What do we do about asm-disabled 64bit compilers?
The third reason inline assembly specifically is ruled out is that it's not supported by Microsoft's 64bit compilers. If we want 64bit Windows support, which we surely do, then no inline assembler can be allowed in the source tree. We could get around this as Microsoft do by using a lot of conditional compilation and target specific intrinsics, but this would lock us in to using either the Microsoft or Intel compilers where those intrinsics are supported, and we've already gone to some trouble, as seen in the CompilerQOR, to break out of that sort of restriction.
To the rescue, Just-In-Time
The answer to these problems, and also to supporting non x86 hardware targets, is to create a Just-In-Time (JIT) assembler which doesn't rely on any existing assembler, be it GAS or MASM, AT&T or Intel.
A JIT Assembler, like a JIT compiler, is one that only assembles the code once it's already running on the target machine. It can therefore detect the presence of features on the target machine before assembly takes place, and it can take advantage of the best technology available while still being able to run on older hardware. Microsoft's .NET languages have a JIT Compiler, as does Java. We don't want to replace C++ with a JIT compiled language as they have, but rather to JIT assemble just the small amount of necessary assembly language code to make the C++ portable between different hardware, to integrate successfully with different operating systems and to take advantage of advanced processor features where they're available.
Not only would JIT compiling everything be a big performance hit, but it would mean the JIT compiler itself would be left out, becoming a separate dependency that could not take advantage of what makes the rest of the system portable. You can't JIT a JITer, to coin a phrase. This problem has been approached in the past by JVMs, which require installation before Java can be used, and by intermediate languages or PCode. I don't believe either of these is a complete solution; they really just move the problem around and introduce lots of new problems, even whole new languages for developers to learn in order to debug effectively.
The QOR will include a JIT Assembler forming the bulk of its architecture abstraction module allowing hardware dependency to be abstracted and enabling target specific assembler routines to be generated at runtime from C++ that takes full advantage of the QOR's portability.
Fortunately I don't have to write a JIT assembler from scratch because Petr Kobalicek has already created the awesome AsmJit project. I'll take this as a starting point and massage it gently to fit the principles and practices of the QOR.
I'll still walk through the design process as if I was doing this from scratch but that's not to say that any of this is the way Petr originally thought about it.
JIT Assembler Design
One appealingly simple way to do a JIT assembler would be to write a C++ function equivalent to each and every operation of the target processor. For x86 this would be a few hundred functions, not an unreasonable amount of code. To execute an assembly program we would then just call each of these functions in turn, and each function would execute a single assembly level instruction as it was called. For example, lea might have an implementation similar to:
__declspec( naked ) void lea( void )
{
    // Illustrative only: a real implementation would also need the instruction's operands.
    _asm { lea }
}
Unfortunately this would be a bad solution for two reasons. Firstly the performance would be very, very poor in comparison to real assembly because of the overhead of calling a C++ function for every instruction, perhaps 20 times slower. Secondly it wouldn't work because that calling overhead would also destroy the state necessary for the overall assembly function to work. Registers would not be preserved by the C++ compiler across the multiple function calls forming an assembly function and neither would the stack.
However, what if we modify this idea slightly by batching up the assembly instructions? We still call a C++ function for each assembly instruction, but instead of immediately executing that instruction on the hardware, the instruction is just saved up in a buffer in the exact form in which it would be executed. Once a whole batch of instructions has been written to the buffer we can convert the buffer address to a function pointer with the correct signature and call the assembly function as if it were a regular C++ function compiled along with the rest of the application. We can even hold onto the buffer so that the function can be called repeatedly without having to reassemble it each time.
This is the essence of a JIT assembler. We call a series of C++ functions equivalent to the assembly instructions we want to execute. The instructions are written into a buffer. Once complete our buffer is copied or changed to executable memory and can then be treated as a function. All the compiled code is still written in C++ including the functions which generate architecture specific assembly functions.
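As a minimal sketch of that principle, outside the QOR and assuming 32bit Windows with the Win32 VirtualAlloc API, the following hand-emits the bytes for a tiny function, moves them into executable memory and calls the result through a function pointer:
#include <windows.h>
#include <cstring>
#include <vector>

typedef int ( *ReturnIntFn )( void );

// Sketch only: emit "mov eax, 42 / ret", make it executable and call it.
int JitReturn42( void )
{
    std::vector< unsigned char > buffer;
    buffer.push_back( 0xB8 );                          // mov eax, imm32
    buffer.push_back( 42 ); buffer.push_back( 0 );
    buffer.push_back( 0 );  buffer.push_back( 0 );     // imm32 = 42, little endian
    buffer.push_back( 0xC3 );                          // ret

    // Negotiate executable memory with the OS and copy the code across.
    void* pExec = ::VirtualAlloc( 0, buffer.size(), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE );
    if( !pExec )
    {
        return -1;
    }
    std::memcpy( pExec, &buffer[ 0 ], buffer.size() );

    ReturnIntFn pFn = reinterpret_cast< ReturnIntFn >( pExec );
    int result = pFn();                                // call the buffer as a function
    ::VirtualFree( pExec, 0, MEM_RELEASE );
    return result;                                     // 42
}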
This process remains the same whatever architecture we're targeting so a lot of the code and external interfaces can be made generic, shared by current and future architecture specific implementations. I'll start
getting down to nuts and bolts then with these common base classes and shared code for an instruction level assembler.
A Generic Assembler Outline
I'm going to describe a series of simple classes which are initially unrelated to one another and then build these together from the bottom up to form the assembler outline.
We start with a simple buffer class capable of reading and writing bytes and various sizes of word into a growable contiguous buffer. This equates to the code buffer described in the design above during the incremental buffer writing stage.
Note the take
function which gives access to the buffer after it's been written.
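A minimal sketch of such a buffer, with hypothetical names rather than the actual ArchQOR class, might look like this:
#include <vector>

// Hypothetical sketch of a growable code buffer; the ArchQOR class has more to it.
class CCodeBuffer
{
public:
    void emitByte( unsigned char b ) { m_Data.push_back( b ); }
    void emitDWord( unsigned int dw )
    {
        for( int i = 0; i < 4; ++i ) { emitByte( (unsigned char)( dw >> ( i * 8 ) ) ); } // little endian
    }
    // take() hands the written bytes to the caller, leaving the buffer empty.
    std::vector< unsigned char > take()
    {
        std::vector< unsigned char > out;
        out.swap( m_Data );
        return out;
    }
private:
    std::vector< unsigned char > m_Data;
};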
Next we have a generic assembler class which is abstract and says only that assemblers should be able to report the size of their generated code and to relocate it to a given
destination.
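In outline the abstract interface need say little more than this; the names here are a hypothetical sketch, not the ArchQOR declarations:
#include <cstddef>

// Hypothetical outline of an abstract assembler base class.
class CAssemblerBase
{
public:
    virtual ~CAssemblerBase() {}
    virtual size_t getCodeSize() const = 0;          // bytes of code generated so far
    virtual size_t relocCode( void* dst ) const = 0; // copy the code to dst, fixing up addresses
};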
To improve the performance of our assembler we can avoid doing lots of small incremental memory allocs and reallocs from the operating system by using a Zone intermediate allocator which grabs larger chunks at a time from the OS and parcels them out as requested with minimal overhead.
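A zone allocator of this kind is essentially a bump allocator over large chunks; as a hedged sketch, assuming allocations never exceed the chunk size:
#include <cstddef>
#include <vector>

// Hypothetical sketch of a zone allocator; not the ArchQOR implementation.
class CZone
{
public:
    explicit CZone( size_t chunkSize ) : m_ChunkSize( chunkSize ), m_Used( chunkSize ) {}
    void* alloc( size_t size )  // assumes 0 < size <= chunkSize
    {
        if( m_Used + size > m_ChunkSize )
        {
            // Current chunk exhausted: grab another large block and start again.
            m_Chunks.push_back( std::vector< unsigned char >( m_ChunkSize ) );
            m_Used = 0;
        }
        void* p = &m_Chunks.back()[ m_Used ];
        m_Used += size;
        return p;
    }
private:
    size_t m_ChunkSize;
    size_t m_Used;
    std::vector< std::vector< unsigned char > > m_Chunks;
};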
Tracking the progress of any assembly or compilation process is important as mistakes will inevitably be made and errors will occur. For this purpose ArchQOR defines an abstract Logger class which can be implemented downstream to track the assembly process.
The specialist job of finalising our assembled buffer into executable code is done by a CodeGenerator. For the moment we want to immediately execute the code but in future we could use a different CCodeGeneratorBase
sub class to actually write an executable module to disk just like a regular assembler.
Turning a buffer of data bytes into executable code may seem trivial or more like black magic, depending on how much you've thought about it. In practice it doesn't mean doing anything to the bytes in the buffer, but it can mean special negotiation with the operating system to make that memory executable, or to transfer the buffer contents into memory that the OS will be happy to execute. These manipulations are carried out on behalf of the code generator by a memory manager which, due to its interaction with the OS, is split into OS generic and OS specific parts. The OS specific parts will end up in the SystemQOR library, but we haven't got one of those yet, so a little fudging is required for now.
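On Windows, for example, the OS specific step comes down to asking for pages with execute permission. The following is only a sketch of that negotiation, with hypothetical function names; a real memory manager would pool and track these allocations:
#include <cstddef>
#if defined( _WIN32 )
#include <windows.h>
// Sketch: obtain memory that Windows will allow us to execute.
void* AllocExecutableMemory( size_t size )
{
    return ::VirtualAlloc( 0, size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE );
}
#else
#include <sys/mman.h>
// Sketch: the equivalent negotiation on POSIX systems.
void* AllocExecutableMemory( size_t size )
{
    return ::mmap( 0, size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
}
#endif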
With the groundwork laid, we're now ready to create an abstract batch operation CPU tying together most of the classes we've described.
CCPUBase
is still completely generic but as the collaboration diagram shows it has the capability to interact with the low level buffer, memory manager, an abstract code generator and whatever logging we decide to put in place. These things are now taken care of for any and all architecture specific implementations until and unless we ever find we need to override them.
To manage optional extensions to CPU instruction sets, including built in Floating Point Units, we'll need a way to add such extensions. For now we'll just create an empty abstract CISetExtension class.
Assembler targets
Processors, like the computing devices built around them, exist in wondrous variety, but at least they can be classified into some useful categories.
- CPUs (Central Processing Units) are the main workhorse general purpose computing devices at which the majority of application code is targeted. These are the processors targeted by the Operating Systems on which we want to use the QOR.
- FPUs (Floating Point Units) are specialist math processors, often built into or tightly coupled with CPUs.
These processors speed up the execution of non integer arithmetic.
- GPUs (Graphics Processing Units) are the latest addition to the armory. These specialist units are designed to process video data in real-time; often massively parallel but less adaptable than CPUs or even FPUs, they have only recently become truly programmable with developments like CUDA and OpenCL.
For the QOR we'll focus on CPUs with a nod to FPUs and just the possibility to bring GPU programming in later. This will get us everything needed for a portable framework with the possibility of later expansion without making the ArchQOR too vast a project in its own right.
This generic structure of supported hardware configurations is reflected in the abstract CLogicBase
class and in those classes derived from it for specific architectures.
The FPU and GPU classes are trivial for now as we concentrate on the x86 CPU.
ArchQOR Package
Let's place the generic classes developed so far in the context of the library before we specialize them for the x86 architecture.
We'll create a master header for the ArchQOR library, ArchQOR.h.
...
#ifndef _QARCH_H_
#define _QARCH_H_
#include "CompilerQOR.h" //Source Compiler definition and framework config
#include "ArchQOR/Defs.h" //Basic definitions for architecture configuration
#include "ArchQOR/ArchitectureSelection.h" //Select and configure the architecture platform
#include "ArchQOR/Machine.h" //Define a Machine representative class
#endif//_QARCH_H_
A basic definitions header to enumerate the architectures we'd like to support, Defs.h.
...
#ifndef _QARCH_DEFS_H_
#define _QARCH_DEFS_H_
#define _QARCH_X86PC 1 //PC x86 based boxes, Intel, AMD etc
#define _QARCH_ARMMOBILE 2 //ARM based Smart phone and tablet type SOC platforms
#endif//_QARCH_DEFS_H_
A preprocessor inclusion control header to selectively include the headers for only the target architecture, ArchitectureSelection.h.
#ifndef _QARCH_ARCHITECTURESELECTION_H_
#define _QARCH_ARCHITECTURESELECTION_H_
#ifndef _QARCH_ARCHITECTURE
# define _QARCH_ARCHITECTURE _QARCH_X86PC
__QCMP_MESSAGE( "Target Architecture defaulted to x86 PC." )
#endif
#if ( _QARCH_ARCHITECTURE == _QARCH_X86PC )
__QCMP_MESSAGE( "Target Architecture x86 PC." )
# include "ArchQOR/x86/Config/x86PCSelection.h"
#elif ( _QARCH_ARCHITECTURE == _QARCH_ARMMOBILE )
__QCMP_MESSAGE( "Target Architecture ARM mobile device." )
# include "ArchQOR/ARM/Config/ARMSelection.h"
#endif
#endif//_QARCH_ARCHITECTURESELECTION_H_
A Machine object header to declare the CMachine type as the root public object exposed by ArchQOR, Machine.h.
...
#ifndef _QARCH_MACHINE_H_
#define _QARCH_MACHINE_H_
namespace nsArch
{
class __QOR_INTERFACE( __ARCHQOR ) CMachine : public CArchVPackage
{
public:
CMachine();
virtual ~CMachine();
};
}
__QCMP_LINKAGE_C __QOR_INTERFACE( __ARCHQOR ) nsArch::CMachine& TheMachine( void );
#endif//_QARCH_MACHINE_H_
That's it for the top level headers. Defining _QARCH_ARCHITECTURE
in the project allows us to select an architecture or it defaults to x86. The headers that choice causes to be included are responsible for defining the CArchVPackage
base class from which CMachine
is derived.
That completes the generic infrastructure on which we can build support for specific architectures.
An x86 Assembler
The dominant CPU families of the last 2 decades have been Intel's x86 processors in PCs and larger devices and the ARM based processors of various manufacturers in handheld and embedded devices. For now ArchQOR will support x86 architectures which contain sufficient variation amongst themselves and are certainly complex enough to thoroughly test the JIT Assembler concept and keep me busy for a few months.
x86 Assembly language consists largely of instructions identified by opcodes, combined with operands which can be registers, memory locations or immediate values, i.e. the value itself encoded as part of the instruction.
Registers can be general purpose or specialized. x86 family processors have had numerous registers added over the years. On top of the original 16bit 8086 registers and their 32bit extended versions there are now x87 registers for the built in FPU, Multi Media (MM) registers and eXtended Multi Media (XMM) registers for use with the massively enlarged instruction set that has grown and grown with each new generation of x86. We'll also need Labels in our assembler so that we can do jumps and loops.
One more thing we'll need in order to maintain the register state we couldn't maintain with the initial design is the concept of a Variable. This will also help to eventually integrate the assembled code with compiled C++ code. We'll need General Purpose, x87, MM and XMM Variables for x86 assembly to match up with the different kinds of registers.
These objects form the x86 assembler Operand class hierarchy:
To generate valid x86 instructions we need to be able to define what an instruction is and enumerate what valid opcode operand combinations exist. For this we define an InstructionDescription struct
and a table of instruction descriptions.
struct InstructionDescription
{
enum G
{
G_EMIT,
G_ALU,
G_BSWAP,
...
G_MMU_RM_IMM8,
G_MMU_RM_3DNOW };
enum F
{
F_NONE = 0x00, F_JUMP = 0x01, F_MOV = 0x02, F_FPU = 0x04, F_LOCKABLE = 0x08,
F_SPECIAL = 0x10, F_SPECIAL_MEM = 0x20 };
enum O
{
O_GB = 0x0001,
O_GW = 0x0002,
...
O_FM_4_8_10 = O_FM_4 | O_FM_8 | O_FM_10,
O_NOREX = 0x2000
};
Cmp_unsigned__int16 code;        // instruction code (INST_*)
Cmp_unsigned__int16 nameIndex;   // offset of the mnemonic in the instruction name table
Cmp_unsigned__int8 group;        // emitter group (G_*)
Cmp_unsigned__int8 flags;        // instruction flags (F_*)
Cmp_unsigned__int16 oflags[ 2 ]; // allowed operand types (O_*) for the two operands
Cmp_unsigned__int16 opCodeR;     // value for the ModR/M reg field where the opcode requires one
Cmp_unsigned__int32 opCode[ 2 ]; // primary and secondary opcodes
inline const char* getName() const
{
return instructionName + nameIndex;
}
...
};
# define MAKE_INST(code, name, group, flags, oflags0, oflags1, opReg, opCode0, opCode1) \
{ code, code##_INDEX, group, flags, { oflags0, oflags1 }, opReg, { opCode0, opCode1 } }
# define G(g) InstructionDescription::G_##g
# define F(f) InstructionDescription::F_##f
# define O(o) InstructionDescription::O_##o
const InstructionDescription instructionDescription[] =
{
MAKE_INST(INST_ADC , "adc" , G(ALU) , F(LOCKABLE) , O(GQDWB_MEM) , O(GQDWB_MEM)|O(IMM) , 2, 0x00000010, 0x00000080),
MAKE_INST(INST_ADD , "add" , G(ALU) , F(LOCKABLE) , O(GQDWB_MEM) , O(GQDWB_MEM)|O(IMM) , 0, 0x00000000, 0x00000080),
MAKE_INST(INST_ADDPD , "addpd" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0x66000F58, 0),
MAKE_INST(INST_ADDPS , "addps" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0x00000F58, 0),
MAKE_INST(INST_ADDSD , "addsd" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0xF2000F58, 0),
MAKE_INST(INST_ADDSS , "addss" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0xF3000F58, 0),
MAKE_INST(INST_ADDSUBPD , "addsubpd" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0x66000FD0, 0),
...
};
Now we can derive an x86 specific batch CPU from the abstract base and use the instruction description table to emit x86 instructions into the code buffer.
class __QOR_INTERFACE( __ARCHQOR ) Cx86CPUCore : public CCPUBase
{
friend class CInstEmitter;
public:
Cx86CPUCore( nsArch::CCodeGeneratorBase* codeGenerator ) __QCMP_THROW;
virtual ~Cx86CPUCore() __QCMP_THROW;
...
inline void _emitOpCode( Cmp_unsigned__int32 opCode) __QCMP_THROW
{
if( opCode & 0xFF000000 )
{
_emitByte( (Cmp_unsigned__int8)( ( opCode & 0xFF000000 ) >> 24 ) );
}
if( opCode & 0x00FF0000 )
{
_emitByte((Cmp_unsigned__int8)( ( opCode & 0x00FF0000 ) >> 16 ) );
}
if( opCode & 0x0000FF00 )
{
_emitByte((Cmp_unsigned__int8)(( opCode & 0x0000FF00 ) >> 8 ) );
}
_emitByte((Cmp_unsigned__int8)( opCode & 0x000000FF ) );
}
...
inline void _emitSib( Cmp_unsigned__int8 s, Cmp_unsigned__int8 i, Cmp_unsigned__int8 b ) __QCMP_THROW
{
_emitByte( ( ( s & 0x03 ) << 6 ) | ( ( i & 0x07 ) << 3 ) | ( b & 0x07 ) );
}
inline void _emitRexR( Cmp_unsigned__int8 w, Cmp_unsigned__int8 opReg,
Cmp_unsigned__int8 regCode, bool forceRexPrefix ) __QCMP_THROW
{
...
}
...
void _emitX86Inl(Cmp_unsigned__int32 opCode, Cmp_unsigned__int8 i16bit,
Cmp_unsigned__int8 rexw, Cmp_unsigned__int8 reg, bool forceRexPrefix) __QCMP_THROW;
...
void _emitMmu(Cmp_unsigned__int32 opCode, Cmp_unsigned__int8 rexw,
Cmp_unsigned__int8 opReg, const COperand& src, Cmp_int_ptr immSize) __QCMP_THROW;
LabelLink* _emitDisplacement(LabelData& l_data,
Cmp_int_ptr inlinedDisplacement, int size) __QCMP_THROW;
void _emitJmpOrCallReloc(Cmp_unsigned__int32 instruction, void* target) __QCMP_THROW;
void _emitInstruction(Cmp_unsigned__int32 code) __QCMP_THROW;
void _emitInstruction(Cmp_unsigned__int32 code, const COperand* o0) __QCMP_THROW;
...
void EmitInstructionG_ENTER( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2, Cmp_unsigned__int32& bLoHiUsed,
bool& assertIllegal, const COperand** _loggerOperands, const CImm* immOperand,
Cmp_unsigned__int32 immSize, Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
void EmitInstructionG_IMUL( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2, Cmp_unsigned__int32& bLoHiUsed,
bool& assertIllegal, const COperand** _loggerOperands, const CImm* immOperand,
Cmp_unsigned__int32 immSize, Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
void EmitInstructionG_INC_DEC( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2, Cmp_unsigned__int32& bLoHiUsed,
bool& assertIllegal, const COperand** _loggerOperands, const CImm* immOperand,
Cmp_unsigned__int32 immSize, Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
void EmitInstructionG_J( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2,
Cmp_unsigned__int32& bLoHiUsed, bool& assertIllegal,
const COperand** _loggerOperands, const CImm* immOperand, Cmp_unsigned__int32 immSize,
Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
...
inline Cmp_uint_ptr relocCode( void* dst ) const __QCMP_THROW
{
return relocCode( dst, (Cmp_uint_ptr)dst );
}
void embed( const void* data, Cmp_uint_ptr length ) __QCMP_THROW;
void embedLabel( const CLabel& label ) __QCMP_THROW;
...
inline void SetEmitOptions( Cmp_unsigned__int32 EmitOptions )
{
m_uiEmitOptions = EmitOptions;
}
...
protected:
Cmp_unsigned__int32 m_uiProperties;
Cmp_unsigned__int32 m_uiEmitOptions;
Cmp_int_ptr m_iTrampolineSize;
LabelLink* m_pUnusedLinks;
public:
nsCodeQOR::PodVector< LabelData > m_LabelData;
nsCodeQOR::PodVector< RelocData > m_RelocData;
};
With this batch processor design it would be possible to have a single function that generated any x86 instruction from its opcode and parameters rather than one function per instruction. In practice such a function would be a vast monstrosity if it did the entire task, like implementing
printf
in a single function. Instead we divide the instruction set into groups of instructions that require particular validation checks or particular logic in order to emit them and then have one function per group. All these functions work on the common set of data needed to generate a single instruction. This is packaged up in the instruction emitter class CInstEmitter
.
class CInstEmitter
{
public:
CInstEmitter( Cx86CPUCore& CPUParam, Cmp_unsigned__int32 codeParam,
const COperand* o0Param, const COperand* o1Param,
const COperand* o2Param ) __QCMP_THROW;
bool BeginInstruction( void ) __QCMP_THROW;
bool PrepareInstruction( void ) __QCMP_THROW;
void FinishImmediate( const COperand* pOperand, Cmp_unsigned__int32 immSize ) __QCMP_THROW;
void EndInstruction( void ) __QCMP_THROW;
void CleanupInstruction( void ) __QCMP_THROW;
bool LockInstruction( void ) __QCMP_THROW;
void InstructionImmediate( void ) __QCMP_THROW;
void InstructionIllegal( void ) __QCMP_THROW;
void InstructionG_EMIT( void ) __QCMP_THROW;
void InstructionG_ALU( void ) __QCMP_THROW;
void InstructionG_BSWAP( void ) __QCMP_THROW;
...
void InstructionG_MOV_PTR( void ) __QCMP_THROW;
void InstructionG_MOVSX_MOVZX( void ) __QCMP_THROW;
# if ( _QARCH_WORDSIZE == 64 )
void InstructionG_MOVSXD( void ) __QCMP_THROW;
# endif
void InstructionG_PUSH( void ) __QCMP_THROW;
void InstructionG_POP( void ) __QCMP_THROW;
void InstructionG_R_RM( void ) __QCMP_THROW;
void InstructionG_RM_B( void ) __QCMP_THROW;
...
void InstructionG_MMU_RM_3DNOW( void ) __QCMP_THROW;
protected:
Cmp_unsigned__int32 m_uiCode;
const COperand* m_pO0;
const COperand* m_pO1;
const COperand* m_pO2;
Cmp_unsigned__int32 m_bLoHiUsed;
bool m_bAssertIllegal;
const COperand* m_aLoggerOperands[ 3 ];
const CImm* m_pImmOperand;
Cmp_unsigned__int32 m_uiImmSize;
Cmp_uint_ptr m_uiBeginOffset;
const InstructionDescription* m_pId;
Cmp_unsigned__int32 m_uiForceRexPrefix;
Cx86CPUCore& m_CPU;
private:
__QCS_DECLARE_NONASSIGNABLE( CInstEmitter );
};
The Cx86CPUCore
function _emitInstruction
is used to wrap all this up and conduct the process of correctly emitting a single instruction.
void Cx86CPUCore::_emitInstruction( Cmp_unsigned__int32 code, const COperand* o0,
const COperand* o1, const COperand* o2 ) __QCMP_THROW
{
const InstructionDescription* id = &instructionDescription[ code ];
CInstEmitter Emitter( *this, code, o0, o1, o2 );
if( Emitter.BeginInstruction() && Emitter.PrepareInstruction() &&
Emitter.LockInstruction() )
{
switch( id->group )
{
case InstructionDescription::G_EMIT:
Emitter.InstructionG_EMIT();
break;
case InstructionDescription::G_ALU:
Emitter.InstructionG_ALU();
break;
...
Emitter.CleanupInstruction();
}
Now that we have the code to write x86 format instructions by their group, we need the functions that can be called to generate specific instructions. To do this Cx86CPUCore
is extended in
two ways. A hierarchy of derived classes adds the instructions associated with each generation of processor so that we can target a particular level of the instruction set. A parallel hierarchy of instruction set extension classes that attach to a particular Cx86CPUCore
instance implement the generations of multimedia extensions that have been added to the x86 instruction set.
At compile time we pick levels out of these derivation chains to constitute the minimum spec x86 target which we will require at runtime. Then at runtime we can check that the host machine for the executable is capable enough to meet those requirements. For example for the assembler functions needed to integrate with 32bit Windows we only need i386 level instructions although for practical purposes we won't support anything less than i486 as a target due to the lack of a cpuid
instruction on the i386.
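That runtime check boils down to issuing cpuid and testing feature bits. As a sketch using the MSVC __cpuid intrinsic, rather than the JIT assembled routine ArchQOR actually uses:
#include <intrin.h>

// Sketch: query a feature bit with the MSVC __cpuid intrinsic.
bool HostSupportsSSE4_1( void )
{
    int regs[ 4 ] = { 0 };                      // EAX, EBX, ECX, EDX
    __cpuid( regs, 1 );                         // leaf 1: processor info and feature bits
    return ( regs[ 2 ] & ( 1 << 19 ) ) != 0;    // ECX bit 19 = SSE4.1
}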
The Selection.h header, a section of which is below, will define the names CMainInstructionSet and CFloatingPointUnit to be the correct classes, for example Ci686CPU and CP6FPU.
...
#ifndef _QARCH_ISET_I786
# error ("_QARCH_ISET_I786 not defined")
#endif
#ifndef _QARCH_X86LEVEL
__QCMP_MESSAGE( "Target instruction set level not set. Defaulting to i686." )
# define _QARCH_X86LEVEL _QARCH_ISET_I686
#endif
#if ( _QARCH_X86LEVEL == _QARCH_ISET_I786 )
__QCMP_MESSAGE( "Target i786 instruction set." )
# include "ArchQOR/x86/Assembler/BatchCPU/i786CPU.h"
namespace nsArch
{
typedef nsx86::Ci786CPU CMainInstructionSet;
typedef nsx86::CP7FPU CFloatingPointUnit;
# define _QARCH_X87_FPU_EXTENSION_CLASS ,public CFloatingPointUnit
# define _QARCH_X87_FPU_EXTENSION_INIT ,CFloatingPointUnit( (Cx86CPUCore&)(*this) )
}
#elif ( _QARCH_X86LEVEL == _QARCH_ISET_I686 )
__QCMP_MESSAGE( "Target i686 instruction set." )
# include "ArchQOR/x86/Assembler/BatchCPU/i686CPU.h"
namespace nsArch
{
typedef nsx86::Ci686CPU CMainInstructionSet;
typedef nsx86::CP6FPU CFloatingPointUnit;
# define _QARCH_X87_FPU_EXTENSION_CLASS ,public CFloatingPointUnit
# define _QARCH_X87_FPU_EXTENSION_INIT ,CFloatingPointUnit( (Cx86CPUCore&)(*this) )
}
#elif ( _QARCH_X86LEVEL == _QARCH_ISET_I586 )
__QCMP_MESSAGE( "Target i586 instruction set." )
# include "ArchQOR/x86/Assembler/BatchCPU/i586CPU.h"
namespace nsArch
{
typedef nsx86::Ci586CPU CMainInstructionSet;
typedef nsx86::CPentiumFPU CFloatingPointUnit;
# define _QARCH_X87_FPU_EXTENSION_CLASS ,public CFloatingPointUnit
# define _QARCH_X87_FPU_EXTENSION_INIT ,CFloatingPointUnit( (Cx86CPUCore&)(*this) )
}
#elif ( _QARCH_X86LEVEL == _QARCH_ISET_I486 )
...
The CMainInstructionSet
and CFloatingPointUnit
classes selected are then used to derive the CCPU
class.
This gives us a low level assembler that can create an executable sequence of instructions valid for an x86 machine, but it can't yet easily create a function that is valid from the viewpoint of a C++ compiler. To go from low level instruction sequences to fully fledged functions with prologs and C++ calling conventions we need the set of constructs that make up a high level assembler.
High Level Assembler (HLA) Design
The high level assembler is based around the concept of an Emittable, an object that can be emitted into a stream of instructions and directives. The stream is processed through a number of stages before being used to drive the low level assembler to emit actual instructions to the buffer. This extra level of abstraction allows the creation of complete valid functions with control over the passing of parameters, calling conventions, return values and everything else that is needed to integrate with
existing compiled functions. It also allows for additional processing stages such as
aggressive optimization to be included in future. Just like the low level assembler the HLA is divided between abstract classes that can be reused across different architectures and architecture specific classes.
class __QOR_INTERFACE( __ARCHQOR ) Emittable
{
public:
Emittable( nsArch::CHighLevelAssemblerBase* c, Cmp_unsigned__int32 type ) __QCMP_THROW;
virtual ~Emittable() __QCMP_THROW;
virtual void prepare( CHLAssemblerContextBase& cc ) __QCMP_THROW;
virtual Emittable* translate( CHLAssemblerContextBase& cc ) __QCMP_THROW;
virtual void emit( CHighLevelAssemblerBase& a ) __QCMP_THROW;
virtual void post( CHighLevelAssemblerBase& a ) __QCMP_THROW;
virtual int getMaxSize() const __QCMP_THROW;
virtual bool _tryUnuseVar( CommonVarData* v ) __QCMP_THROW;
inline nsArch::CHighLevelAssemblerBase* getHLA() const __QCMP_THROW {
return m_pHLAssembler;
}
inline Cmp_unsigned__int32 getType() const __QCMP_THROW
{
return m_ucType;
}
inline Cmp_unsigned__int8 isTranslated() const __QCMP_THROW
{
return m_ucTranslated;
}
inline Cmp_unsigned__int32 getOffset() const __QCMP_THROW
{
return m_uiOffset;
}
inline Cmp_unsigned__int32 setOffset( Cmp_unsigned__int32 uiOffset ) __QCMP_THROW
{
m_uiOffset = uiOffset;
return m_uiOffset;
}
inline Emittable* getPrev() const __QCMP_THROW
{
return m_pPrev;
}
inline void setPrev( Emittable* pPrev ) __QCMP_THROW
{
m_pPrev = pPrev;
}
inline Emittable* getNext() const __QCMP_THROW
{
return m_pNext;
}
inline void setNext( Emittable* pNext ) __QCMP_THROW
{
m_pNext = pNext;
}
inline const char* getComment() const __QCMP_THROW
{
return m_szComment;
}
void setComment( const char* str ) __QCMP_THROW;
void setCommentF( const char* fmt, ... ) __QCMP_THROW;
protected:
inline Emittable* translated() __QCMP_THROW
{
m_ucTranslated = true;
return m_pNext;
}
nsArch::CHighLevelAssemblerBase* m_pHLAssembler;
Cmp_unsigned__int8 m_ucType;
Cmp_unsigned__int8 m_ucTranslated;
Cmp_unsigned__int8 m_ucReserved0;
Cmp_unsigned__int8 m_ucReserved1;
Cmp_unsigned__int32 m_uiOffset;
Emittable* m_pPrev;
Emittable* m_pNext;
const char* m_szComment;
private:
__QCS_DECLARE_NONCOPYABLE( Emittable );
};
First we wrap low level instructions in a high level Instruction emittable, CEInstruction
,
then Prolog, Epilog and Function Prototype emittables, Jump targets, Returns and Calls and finally a Function emittable CEFunction
to master all these and comply with the target ABI.
At the generic level Alignment, Comment, Data, Dummy, and FunctionEnd emittables and the base CEmittable
class detailed above are independent of which underlying assembly language we're using so they can be reused across platforms.
The CEFunction
class maintains the parameters and invariants for generating a single function but the developing state data that tracks the function generation process is managed by a High Level Assembler Context class specialized for x86 as Cx86HLAContext
. This keeps track of variables, register allocation, stream write point and scope as the function is created.
class __QOR_INTERFACE( __ARCHQOR ) Cx86HLAContext : public CHLAssemblerContextBase
{
public:
Cx86HLAContext( nsArch::CHighLevelAssemblerBase* compiler ) __QCMP_THROW;
~Cx86HLAContext() __QCMP_THROW;
void _clear() __QCMP_THROW;
void allocVar( VarData* vdata, Cmp_unsigned__int32 regMask, Cmp_unsigned__int32 vflags ) __QCMP_THROW;
...
CMem _getVarMem( VarData* vdata ) __QCMP_THROW;
VarData* _getSpillCandidateGP() __QCMP_THROW;
VarData* _getSpillCandidateMM() __QCMP_THROW;
VarData* _getSpillCandidateXMM() __QCMP_THROW;
VarData* _getSpillCandidateGeneric(VarData** varArray, Cmp_unsigned__int32 count) __QCMP_THROW;
inline bool _isActive( VarData* vdata) __QCMP_THROW
{
return vdata->nextActive != 0;
}
void _addActive( VarData* vdata ) __QCMP_THROW;
void _freeActive( VarData* vdata ) __QCMP_THROW;
void _freeAllActive() __QCMP_THROW;
void _allocatedVariable( VarData* vdata ) __QCMP_THROW;
inline void _allocatedGPRegister(Cmp_unsigned__int32 index) __QCMP_THROW
{
_state.usedGP |= nsCodeQOR::maskFromIndex(index);
_modifiedGPRegisters |= nsCodeQOR::maskFromIndex(index);
}
...
void translateOperands( COperand* operands, Cmp_unsigned__int32 count ) __QCMP_THROW;
...
void addBackwardCode( EJmp* from ) __QCMP_THROW;
void addForwardJump( EJmp* inst ) __QCMP_THROW;
StateData* _saveState() __QCMP_THROW;
void _assignState(StateData* state) __QCMP_THROW;
void _restoreState(StateData* state, Cmp_unsigned__int32 targetOffset = INVALID_VALUE) __QCMP_THROW;
VarMemBlock* _allocMemBlock(Cmp_unsigned__int32 size) __QCMP_THROW;
void _freeMemBlock(VarMemBlock* mem) __QCMP_THROW;
void _allocMemoryOperands() __QCMP_THROW;
void _patchMemoryOperands( nsArch::CEmittable* start, nsArch::CEmittable* stop ) __QCMP_THROW;
nsCodeQOR::Zone _zone;
nsArch::CHighLevelAssemblerBase* _compiler;
EFunction* _function;
nsArch::CEmittable* _start;
nsArch::CEmittable* _stop;
nsArch::CEmittable* _extraBlock;
StateData _state;
VarData* _active;
ForwardJumpData* _forwardJumps;
Cmp_unsigned__int32 _unrecheable;
Cmp_unsigned__int32 _modifiedGPRegisters;
Cmp_unsigned__int32 _modifiedMMRegisters;
Cmp_unsigned__int32 _modifiedXMMRegisters;
Cmp_unsigned__int32 _allocableEBP;
int _adjustESP;
Cmp_unsigned__int32 _argumentsBaseReg;
Cmp__int32 _argumentsBaseOffset;
Cmp__int32 _argumentsActualDisp;
Cmp_unsigned__int32 _variablesBaseReg;
Cmp__int32 _variablesBaseOffset;
Cmp__int32 _variablesActualDisp;
VarMemBlock* _memUsed;
VarMemBlock* _memFree;
Cmp_unsigned__int32 _mem4BlocksCount;
Cmp_unsigned__int32 _mem8BlocksCount;
Cmp_unsigned__int32 _mem16BlocksCount;
Cmp_unsigned__int32 _memBytesTotal;
bool _emitComments;
nsCodeQOR::PodVector< EJmp* > _backCode;
Cmp_uint_ptr _backPos;
};
In order to integrate JIT assembled functions into existing compiled C++ we need to be able to make calls in both directions. CFunctionPrototype
assists JIT assembled functions to call existing compiled functions and other JIT assembled functions.
The High Level Assembler class itself provides an interface for the creation of new variables and functions and the insertion of emittables into the stream.
The High Level Assembler Intrinsics class builds on the High Level Assembler to provide a virtual assembly language interface giving us the original one function per instruction design but within the current context of the emittable function under construction and in terms of variables rather than registers.
Now we have a high level x86 assembler capable of generating functions that integrate with our C++ codebase and a set of base classes to extend support to non x86 architectures in the future.
Just to put the icing on the cake and to make the High Level Assembler even easier to use, ArchQOR defines a set of functor templates. These enable JIT functions to be treated as objects for the purpose of generating them and as functions for the purpose of calling them. Below is the code for the three parameter template.
template< typename RET, typename P1, typename P2, typename P3 >
class CJITFunctor3 : public CJITFunctorBase
{
public:
typedef RET( *FP )( P1, P2, P3 );
CJITFunctor3( CHighLevelAssemblerBase* pHLA ) : CJITFunctorBase( pHLA )
, m_pFunc( 0 )
{
}
~CJITFunctor3()
{
}
RET operator()( P1 p1, P2 p2, P3 p3 )
{
if( !m_bGenerated )
{
m_pFunc = Generate();
}
if( !m_pFunc )
{
throw "Null function pointer exception";
}
return (m_pFunc)( p1, p2, p3 );
}
protected:
virtual FP Generate( void ) = 0;
FP m_pFunc;
};
Usage: The proof of the pudding
To demonstrate the usage of the high level JIT assembler we'll create a fast memcpy
function that works in units of 4 bytes instead of the 1 byte at a time of the standard memcpy
.
The signature for the function will be: void MemCpy32( Cmp_unsigned__int32* destination, const Cmp_unsigned__int32* source, Cmp_uint_ptr count );
Step 1 is to derive a new functor type from CJITFunctor3
matching the signature we want.
class CJITmemcpy32 : public nsArch::CJITFunctor3< void, Cmp_unsigned__int32*,
const Cmp_unsigned__int32*, Cmp_uint_ptr >
{
public:
CJITmemcpy32( CHighLevelAssemblerBase* pHLA ) : CJITFunctor3( pHLA ){}
protected:
virtual FP Generate( void );
};
Step 2 is to implement the Generate
function to construct MemCpy32
using the High Level Assembler.
CJITmemcpy32::FP CJITmemcpy32::Generate( void )
{
Cx86HLAIntrinsics& HLA( *( dynamic_cast< Cx86HLAIntrinsics* >( m_pHLA ) ) );
HLA.newFunction( CALL_CONV_DEFAULT, FunctionBuilder3< Void, Cmp_unsigned__int32*,
const Cmp_unsigned__int32*, Cmp_unsigned__int32 >() );
CLabel LoopLabel = HLA.newLabel();
CLabel ExitLabel = HLA.newLabel();
CGPVar dst( HLA.argGP( 0 ) );
CGPVar src( HLA.argGP( 1 ) );
CGPVar cnt( HLA.argGP( 2 ) );
HLA.alloc( dst );
HLA.alloc( src );
HLA.alloc( cnt );
HLA.test( cnt, cnt );
HLA.jz( ExitLabel );
HLA.bind( LoopLabel );
CGPVar tmp( HLA.newGP( VARIABLE_TYPE_GPD ) );
HLA.mov( tmp, dword_ptr( src ) );
HLA.mov( dword_ptr( dst ), tmp );
HLA.add( src, 4 );
HLA.add( dst, 4 );
HLA.dec( cnt );
HLA.jnz( LoopLabel );
HLA.bind( ExitLabel );
HLA.endFunction();
FP fn = reinterpret_cast< FP >( HLA.make() );
if( fn )
{
m_bGenerated = true;
}
return fn;
}
Step 3 is to create the CJITmemcpy32
instance and setup some data for it to work on.
CJITmemcpy32 MemCopy32( &TheMachine().HLAssembler() );
Cmp_unsigned__int32 dstBuffer[128];
Cmp_unsigned__int32 srcBuffer[128] = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89};
Step 4 is the fun part. Now we can 'magically' call a function that doesn't actually exist yet except as the 'source' from step 2 and yet it works.
MemCopy32( dstBuffer, srcBuffer, 128 );
Stepping into this call in the debugger will reveal that the functor checks if the function has been generated yet, finds it hasn't and so generates it just in time before calling it. Calling
MemCopy32
a second time will skip the function generation phase and therefore give much faster results.
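Carrying on from the snippet above, a quick check with the standard memcmp confirms the copy, and a second call reuses the already assembled code:
if( memcmp( dstBuffer, srcBuffer, sizeof( srcBuffer ) ) == 0 )
{
    MemCopy32( dstBuffer, srcBuffer, 128 );  // no reassembly this time, just a call
}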
Here's a disassembler view of the actual code of the final JIT generated function:
00030000 push ebp
00030001 mov ebp,esp
00030003 push ebx
00030004 push esi
00030005 sub esp,10h
00030008 mov ecx,dword ptr [ebp+8]
0003000B mov edx,dword ptr [ebp+0Ch]
0003000E mov ebx,dword ptr [ebp+10h]
00030011 test ebx,ebx
00030013 je 00030022
00030015 mov esi,dword ptr [edx]
00030017 mov dword ptr [ecx],esi
00030019 add edx,4
0003001C add ecx,4
0003001F dec ebx
00030020 jne 00030015
00030022 add esp,10h
00030025 pop esi
00030026 pop ebx
00030027 mov esp,ebp
00030029 pop ebp
0003002A ret
The code associated with this article implements the described test of the high level assembler and also makes use of the low level assembler to do a small but vital task which cannot be done except with the use of assembly language: the interrogation of the host CPU for its features, versioning and branding information. Here is a sample of the output this gives when running on my development laptop.
A learning experience
I am not and have never claimed to be any kind of expert in assembly language. Many years ago I could get by in Z80 assembly but x86 is a different game and I have always found it particularly difficult to read let alone write. Porting and polishing Petr Kobalicek's JIT assembler has taught me a great deal and I hope improved my x86 assembly to an almost competent level.
I think the biggest thing I had been missing was an appreciation of the importance of conventions in x86 assembly programming. With such a vast instruction set and so many registers there seem to be a thousand ways to do any given job. While the assembler itself won't complain whichever you choose, and the machine will execute whatever you give it, this is not enough. On top of the machine's requirements for correct assembly language there are a host of rules about how to use the stack, how to pass parameters, how to use and preserve registers, which 'general purpose' registers to actually use for return values, temporaries and parameters, which segment register is used for which purpose by the operating system, and a dozen other things I haven't discovered yet. Once you know these things, reading disassembly listings or other people's assembly code becomes much easier because you know what to expect. Similarly, writing assembly becomes much easier if you don't have to work out how to do the common things like passing parameters, because you just follow the convention.
I've never come across any good documentation on these conventions and I think that's why I struggled so much to get to grips with x86 assembly. It's not really surprising if there isn't much in the way of coherent documentation given that these conventions are hardware dependent, partially OS dependent, partially compiler dependent and partially just arbitrary tradition.
Future Directions
ArchQOR is up and running but there's a lot still to do. 64bit code generation is untested and will have issues. I also need to ensure that ArchQOR works reliably on the full range of compilers supported by CompilerQOR but I'm not going to burden this article's source code with half a dozen extra project files and associated debris.
There are many possible improvements and extensions to ArchQOR and its JIT assembler as well as no doubt a number of bugs to be run down and squished.
One is support for the more recent x86 instruction set extensions, AVX and AVX2 for Haswell. These will require extensive code changes as the maximum number of parameters goes from 3 to 5 and new instruction formats are introduced.
Support for ARM CPUs and NEON SIMD extensions is a must so that the QOR will work on ARM based mobile devices. The path to doing this is now very clear with the layout of the code in ArchQOR but the knowledge required especially to implement a high level assembler is extensive and I don't have most of it.
This version of ArchQOR is also tied to the Windows platform and that is not consistent with the concept of the QOR. 99% of the ArchQOR code is equally applicable to x86 Linux and other x86 operating systems. We'll see in future articles how to abstract ArchQOR and ultimately the entire framework from the operating system.
Now that we can target a specific level of x86 hardware, the CompilerQOR can support compiler intrinsics which are target hardware dependent, so I need to go back to CompilerQOR and update the MSVC support, at least to enable the use of intrinsics that require an i586 or i686 target.
If you carefully examine the source code associated with this article you'll notice a CodeQOR folder with a couple of small classes in it. These are dependencies that ArchQOR needs but which would normally live in the CodeQOR module. The next article in this series will be about the CodeQOR and will, I'm sure, be even more interesting than this one.
Acknowledgements
- The majority of the ArchQOR code is based completely on AsmJit, so credit for the code should go to Petr Kobalicek, without whose work none of this would ever have got here. Bob doesn't like offsite links but AsmJit is easy enough to Google.
- Thanks are due to my expert proof reader and best sister. Any remaining errors were put in by me after she'd finished.
- Thanks to Harold Aptroot for pointing out the lack of AVX/AVX2 support and the issues involved in adding it.
- Microsoft, Windows, Visual C++ and Visual Studio are trademarks of Microsoft.
- All other trademarks and trade names mentioned are acknowledged as the property of their respective owners who are in no way responsible for any of the content of this article or the associated source code.
History
- Initial version - 09/07/2013.