QOR - Architecture Module: ArchQOR
This article is No.3 in a series on the Querysoft Open Runtime (QOR), an open source, aspect oriented framework for C++ software. The introductory article can be found here. The source code associated with this article is designed to work independently; you don't need to combine it with the code from previous articles.
Introduction: Some Assembly Required
The QOR Architecture module abstracts the different hardware platforms which the QOR targets; to put it another way, it is the architecture aspect of the QOR. In principle this should be easy, because we already require a working C++ compiler targeting a platform in order for the QOR to support it. There shouldn't be much else to do, as the Operating System should take care of the rest of the hardware idiosyncrasies for us. However, as usual, it turns out to be a little more complex than that.
The QOR is a C++ framework, and we go out of our way in places to ensure that it remains purely a C++ framework. There are a few areas where this becomes difficult to do conventionally, especially where the target operating system or compiler practically requires bits of assembly language code in order to compile. For example, in order to support Structured Exception Handling, C++ exception handling and the additional debugging and security support available from Microsoft Visual C++ on 32bit Windows without including any Microsoft code, some assembly language elements are essential.
Here's an example function from our interim Win32 exception handling and SEH implementation which can't be implemented without the use of assembly:
__QCMP_DECLARE_NAKED void JumpToFunction( void* target, void* targetStack, void* targetEBP )
{
    __asm
    {
        mov eax,[esp + 0x4]  ; eax = target function address
        mov ebp,[esp + 0xC]  ; switch to the target frame pointer
        mov esp,[esp + 0x8]  ; switch to the target stack
        jmp eax              ; transfer control; this function never returns
    }
}
How can this be, you may ask, if you've never needed any inline assembly to get your C++ code to work? The answer is that it's usually supplied by the C and C++ libraries that come with the OS or the build tools. On Windows using MSVC it even gets sneaked into your native application as tiny static libraries that are linked without you asking for them. For the QOR this is unacceptable: we want to be 100% open source, contain no proprietary code, and keep our implementation completely open and transparent.
However we can't avoid assembly language completely, and in fact we'd be losing out even if we could, because there are a handful of things that still really ought to be done in assembly. For example, the mathematical operations at the heart of almost all image processing can benefit from the use of SIMD (Single Instruction Multiple Data) extensions on newer processors, making real world operations like scaling and converting images many times faster than they would otherwise be.
So why not the easy way?
Couldn't we just compromise and include a bit of inline assembly here and there, or some prebuilt static libraries? After all, even the Microsoft Platform SDK does that in places:
__inline PVOID GetCurrentFiber( void ) { __asm mov eax, fs:[0x10] }
The answer is no, for three reasons.
Which assembly language do we use?
This is not just a question of which kind of processor we're targeting; in the case of x86 the assembly source itself comes in at least two very different varieties, AT&T syntax and Intel syntax. As the following simple example shows, not only is the format different but the operand order is reversed between the two.
The GetCurrentFiber code above won't work with the GCC assembler GAS in its default AT&T mode, and this is one of the reasons why the MinGW environment trips over in a pile of bits if you try to use it with the Windows Platform SDK headers.
AT&T syntax:
movl %esp, %ebp
Intel syntax:
mov ebp, esp
Unlike differing C++ compilers, we can't overcome the differences between AT&T and Intel syntax with a library like CompilerQOR. To support both we'd have to maintain two sets of source for x86 systems, one for each syntax. We'd still need a whole support library to allow users to choose their assembler, NASM, FASM, TASM or HLASM, and not lock them into MASM or GAS as some other frameworks do.
How do we support SSE4 without requiring it?
The second challenge is taking into account the variations in target hardware at the time the QOR is compiled. For example how can we take full advantage of SSE4 on x86 machines that support it if a QOR compiled to use SSE4 won't work on machines that don't support it?
On a recent x86 CPU that supports SSE4 we can do
pmovsxbd xmm0, m32
but if we only have SSE2 we have to do this instead to achieve the same result.
movd xmm0, m32
punpcklbw xmm0, xmm0
punpcklwd xmm0, xmm0
psrad xmm0, 24
What do we do about asm-disabled 64bit compilers?
The third reason inline assembly specifically is ruled out is that it's not supported by Microsoft's 64bit compilers. If we want 64bit Windows support, which we surely do, then no inline assembler can be allowed in the source tree. We could get around this as Microsoft do by using a lot of conditional compilation and target specific intrinsics, but this would lock us in to using either the Microsoft or Intel compilers where those intrinsics are supported, and we've already gone to some trouble, as seen in the CompilerQOR, to break out of that sort of restriction.
To the rescue, Just-In-Time
The answer to these problems, and also to supporting non x86 hardware targets, is to create a Just-In-Time (JIT) assembler which doesn't rely on any existing assembler, be it GAS or MASM, AT&T or Intel.
A JIT Assembler, like a JIT compiler, is one that only assembles the code once it's already running on the target machine. It can therefore detect the presence of features on the target machine before assembly takes place, and it can take advantage of the best technology available while still being able to run on older hardware. Microsoft's .NET languages have a JIT Compiler, as does Java. We don't want to replace C++ with a JIT compiled language as they have, but rather to JIT assemble just the small amount of necessary assembly language code to make the C++ portable between different hardware, to integrate successfully with different operating systems and to take advantage of advanced processor features where they're available.
Not only would JIT compiling everything be a big performance hit, but it would mean the JIT compiler itself would be left out, becoming a separate dependency that could not take advantage of what makes the rest of the system portable. You can't JIT a JITer, to coin a phrase. This problem has been approached in the past by JVMs, which require installation before Java can be used, and by intermediate languages or PCode. I don't believe either of these is a complete solution; they really just move the problem around and introduce lots of new problems, even whole new languages for developers to learn in order to debug effectively.
The QOR will include a JIT Assembler forming the bulk of its architecture abstraction module allowing hardware dependency to be abstracted and enabling target specific assembler routines to be generated at runtime from C++ that takes full advantage of the QOR's portability.
Fortunately I don't have to write a JIT assembler from scratch because Petr Kobalicek has already created the awesome AsmJit project. I'll take this as a starting point and massage it gently to fit the principles and practices of the QOR.
I'll still walk through the design process as if I was doing this from scratch but that's not to say that any of this is the way Petr originally thought about it.
JIT Assembler Design
One appealingly simple way to do a JIT assembler would be to write a C++ function equivalent to each and every operation of the target processor. For x86 this would be a few hundred functions, not an unreasonable amount of code. To execute an assembly program we would then just call each of these functions in turn, and each function would execute a single assembly level instruction as it was called. For example, lea might have an implementation similar to:
__declspec( naked ) void lea( void )
{
    // Illustrative only: a real implementation would also need the instruction's operands.
    _asm { lea }
}
Unfortunately this would be a bad solution for two reasons. Firstly the performance would be very, very poor in comparison to real assembly because of the overhead of calling a C++ function for every instruction, perhaps 20 times slower. Secondly it wouldn't work because that calling overhead would also destroy the state necessary for the overall assembly function to work. Registers would not be preserved by the C++ compiler across the multiple function calls forming an assembly function and neither would the stack.
However, what if we modify this idea slightly by batching up the assembly instructions? We still call a C++ function for each assembly instruction, but instead of immediately executing that instruction on the hardware, the instruction is just saved up in a buffer in the exact form in which it would be executed. Once a whole batch of instructions has been written to the buffer we can convert the buffer address to a function pointer with the correct signature and call the assembly function as if it were a regular C++ function compiled along with the rest of the application. We can even hold onto the buffer so that the function can be called repeatedly without having to reassemble it each time.
This is the essence of a JIT assembler. We call a series of C++ functions equivalent to the assembly instructions we want to execute. The instructions are written into a buffer. Once complete our buffer is copied or changed to executable memory and can then be treated as a function. All the compiled code is still written in C++ including the functions which generate architecture specific assembly functions.
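As a minimal sketch of that principle, outside the QOR and assuming 32bit Windows with the Win32 VirtualAlloc API, the following hand-emits the bytes for a tiny function, moves them into executable memory and calls the result through a function pointer:
#include <windows.h>
#include <cstring>
#include <vector>

typedef int ( *ReturnIntFn )( void );

// Sketch only: emit "mov eax, 42 / ret", make it executable and call it.
int JitReturn42( void )
{
    std::vector< unsigned char > buffer;
    buffer.push_back( 0xB8 );                          // mov eax, imm32
    buffer.push_back( 42 ); buffer.push_back( 0 );
    buffer.push_back( 0 );  buffer.push_back( 0 );     // imm32 = 42, little endian
    buffer.push_back( 0xC3 );                          // ret

    // Negotiate executable memory with the OS and copy the code across.
    void* pExec = ::VirtualAlloc( 0, buffer.size(), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE );
    if( !pExec )
    {
        return -1;
    }
    std::memcpy( pExec, &buffer[ 0 ], buffer.size() );

    ReturnIntFn pFn = reinterpret_cast< ReturnIntFn >( pExec );
    int result = pFn();                                // call the buffer as a function
    ::VirtualFree( pExec, 0, MEM_RELEASE );
    return result;                                     // 42
}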
This process remains the same whatever architecture we're targeting so a lot of the code and external interfaces can be made generic, shared by current and future architecture specific implementations. I'll start
getting down to nuts and bolts then with these common base classes and shared code for an instruction level assembler.
A Generic Assembler Outline
I'm going to describe a series of simple classes which are initially unrelated to one another and then build these together from the bottom up to form the assembler outline.
We start with a simple buffer class capable of reading and writing bytes and various sizes of word into a growable contiguous buffer. This equates to the code buffer described in the design above during the incremental buffer writing stage.
Note the take
function which gives access to the buffer after it's been written.
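A minimal sketch of such a buffer, with hypothetical names rather than the actual ArchQOR class, might look like this:
#include <vector>

// Hypothetical sketch of a growable code buffer; the ArchQOR class has more to it.
class CCodeBuffer
{
public:
    void emitByte( unsigned char b ) { m_Data.push_back( b ); }
    void emitDWord( unsigned int dw )
    {
        for( int i = 0; i < 4; ++i ) { emitByte( (unsigned char)( dw >> ( i * 8 ) ) ); } // little endian
    }
    // take() hands the written bytes to the caller, leaving the buffer empty.
    std::vector< unsigned char > take()
    {
        std::vector< unsigned char > out;
        out.swap( m_Data );
        return out;
    }
private:
    std::vector< unsigned char > m_Data;
};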
Next we have a generic assembler class which is abstract and says only that assemblers should be able to report the size of their generated code and to relocate it to a given
destination.
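In outline the abstract interface need say little more than this; the names here are a hypothetical sketch, not the ArchQOR declarations:
#include <cstddef>

// Hypothetical outline of an abstract assembler base class.
class CAssemblerBase
{
public:
    virtual ~CAssemblerBase() {}
    virtual size_t getCodeSize() const = 0;          // bytes of code generated so far
    virtual size_t relocCode( void* dst ) const = 0; // copy the code to dst, fixing up addresses
};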
To improve the performance of our assembler we can avoid doing lots of small incremental memory allocs and reallocs from the operating system by using a Zone intermediate allocator which grabs larger chunks at a time from the OS and parcels them out as requested with minimal overhead.
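A zone allocator of this kind is essentially a bump allocator over large chunks; as a hedged sketch, assuming allocations never exceed the chunk size:
#include <cstddef>
#include <vector>

// Hypothetical sketch of a zone allocator; not the ArchQOR implementation.
class CZone
{
public:
    explicit CZone( size_t chunkSize ) : m_ChunkSize( chunkSize ), m_Used( chunkSize ) {}
    void* alloc( size_t size )  // assumes 0 < size <= chunkSize
    {
        if( m_Used + size > m_ChunkSize )
        {
            // Current chunk exhausted: grab another large block and start again.
            m_Chunks.push_back( std::vector< unsigned char >( m_ChunkSize ) );
            m_Used = 0;
        }
        void* p = &m_Chunks.back()[ m_Used ];
        m_Used += size;
        return p;
    }
private:
    size_t m_ChunkSize;
    size_t m_Used;
    std::vector< std::vector< unsigned char > > m_Chunks;
};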
Tracking the progress of any assembly or compilation process is important as mistakes will inevitably be made and errors will occur. For this purpose ArchQOR defines an abstract Logger class which can be implemented downstream to track the assembly process.
The specialist job of finalising our assembled buffer into executable code is done by a CodeGenerator. For the moment we want to immediately execute the code but in future we could use a different CCodeGeneratorBase
sub class to actually write an executable module to disk just like a regular assembler.
Turning a buffer of data bytes into executable code may seem trivial or more like black magic, depending on how much you've thought about it. In practice it doesn't mean doing anything to the bytes in the buffer, but it can mean special negotiation with the operating system to make that memory executable, or to transfer the buffer contents into memory that the OS will be happy to execute. These manipulations are carried out on behalf of the code generator by a memory manager which, due to its interaction with the OS, is split into OS generic and OS specific parts. The OS specific parts will end up in the SystemQOR library, but we haven't got one of those yet, so a little fudging is required for now.
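On Windows, for example, the OS specific step comes down to asking for pages with execute permission. The following is only a sketch of that negotiation, with hypothetical function names; a real memory manager would pool and track these allocations:
#include <cstddef>
#if defined( _WIN32 )
#include <windows.h>
// Sketch: obtain memory that Windows will allow us to execute.
void* AllocExecutableMemory( size_t size )
{
    return ::VirtualAlloc( 0, size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE );
}
#else
#include <sys/mman.h>
// Sketch: the equivalent negotiation on POSIX systems.
void* AllocExecutableMemory( size_t size )
{
    return ::mmap( 0, size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
}
#endif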
With the groundwork laid, we're now ready to create an abstract batch operation CPU tying together most of the classes we've described.
CCPUBase
is still completely generic but as the collaboration diagram shows it has the capability to interact with the low level buffer, memory manager, an abstract code generator and whatever logging we decide to put in place. These things are now taken care of for any and all architecture specific implementations until and unless we ever find we need to override them.
To manage optional extensions to CPU instruction sets, including built in Floating Point Units, we'll need a way to add such extensions. For now we'll just create an empty abstract CISetExtension class.
Assembler targets
Processors, like the computing devices built around them, exist in wondrous variety, but at least they can be classified into some useful categories.
- CPUs (Central Processing Units) are the main workhorse general purpose computing devices at which the majority of application code is targeted. These are the processors targeted by the Operating Systems on which we want to use the QOR.
- FPUs (Floating Point Units) are specialist math processors, often built into or tightly coupled with CPUs.
These processors speed up the execution of non integer arithmetic.
- GPUs (Graphics Processing Units) are the latest addition to the armory. These specialist units are designed to process video data in real-time; often massively parallel but less adaptable than CPUs or even FPUs, they have only recently become truly programmable with developments like CUDA and OpenCL.
For the QOR we'll focus on CPUs with a nod to FPUs and just the possibility to bring GPU programming in later. This will get us everything needed for a portable framework with the possibility of later expansion without making the ArchQOR too vast a project in its own right.
This generic structure of supported hardware configurations is reflected in the abstract CLogicBase
class and in those classes derived from it for specific architectures.
The FPU and GPU classes are trivial for now as we concentrate on the x86 CPU.
ArchQOR Package
Let's place the generic classes developed so far in the context of the library before we specialize them for the x86 architecture.
We'll create a master header for the ArchQOR library, ArchQOR.h.
...
#ifndef _QARCH_H_
#define _QARCH_H_
#include "CompilerQOR.h" //Source Compiler definition and framework config
#include "ArchQOR/Defs.h" //Basic definitions for architecture configuration
#include "ArchQOR/ArchitectureSelection.h" //Select and configure the architecture platform
#include "ArchQOR/Machine.h" //Define a Machine representative class
#endif//_QARCH_H_
A basic definitions header to enumerate the architectures we'd like to support, Defs.h.
...
#ifndef _QARCH_DEFS_H_
#define _QARCH_DEFS_H_
#define _QARCH_X86PC 1 //PC x86 based boxes, Intel, AMD etc
#define _QARCH_ARMMOBILE 2 //ARM based Smart phone and tablet type SOC platforms
#endif//_QARCH_DEFS_H_
A preprocessor inclusion control header to selectively include the headers for only the target architecture, ArchitectureSelection.h.
#ifndef _QARCH_ARCHITECTURESELECTION_H_
#define _QARCH_ARCHITECTURESELECTION_H_
#ifndef _QARCH_ARCHITECTURE
# define _QARCH_ARCHITECTURE _QARCH_X86PC
__QCMP_MESSAGE( "Target Architecture defaulted to x86 PC." )
#endif
#if ( _QARCH_ARCHITECTURE == _QARCH_X86PC )
__QCMP_MESSAGE( "Target Architecture x86 PC." )
# include "ArchQOR/x86/Config/x86PCSelection.h"
#elif ( _QARCH_ARCHITECTURE == _QARCH_ARMMOBILE )
__QCMP_MESSAGE( "Target Architecture ARM mobile device." )
# include "ArchQOR/ARM/Config/ARMSelection.h"
#endif
#endif//_QARCH_ARCHITECTURESELECTION_H_
A Machine object header to declare the CMachine type as the root public object exposed by ArchQOR, Machine.h.
...
#ifndef _QARCH_MACHINE_H_
#define _QARCH_MACHINE_H_
namespace nsArch
{
class __QOR_INTERFACE( __ARCHQOR ) CMachine : public CArchVPackage
{
public:
CMachine();
virtual ~CMachine();
};
}
__QCMP_LINKAGE_C __QOR_INTERFACE( __ARCHQOR ) nsArch::CMachine& TheMachine( void );
#endif//_QARCH_MACHINE_H_
That's it for the top level headers. Defining _QARCH_ARCHITECTURE
in the project allows us to select an architecture or it defaults to x86. The headers that choice causes to be included are responsible for defining the CArchVPackage
base class from which CMachine
is derived.
That completes the generic infrastructure on which we can build support for specific architectures.
An x86 Assembler
The dominant CPU families of the last 2 decades have been Intel's x86 processors in PCs and larger devices and the ARM based processors of various manufacturers in handheld and embedded devices. For now ArchQOR will support x86 architectures which contain sufficient variation amongst themselves and are certainly complex enough to thoroughly test the JIT Assembler concept and keep me busy for a few months.
x86 Assembly language consists largely of instructions identified by opcodes, combined with operands which can be registers, memory locations or immediate values, i.e. the value itself encoded as part of the instruction.
Registers can be general purpose or specialized. x86 family processors have had numerous registers added over the years. On top of the original 16bit 8086 registers and their 32bit extended versions there are now x87 registers for the built in FPU, Multi Media (MM) registers and eXtended Multi Media (XMM) registers for use with the massively enlarged instruction set that has grown and grown with each new generation of x86. We'll also need Labels in our assembler so that we can do jumps and loops.
One more thing we'll need in order to maintain the register state we couldn't maintain with the initial design is the concept of a Variable. This will also help to eventually integrate the assembled code with compiled C++ code. We'll need General Purpose, x87, MM and XMM Variables for x86 assembly to match up with the different kinds of registers.
These objects form the x86 assembler Operand class hierarchy:
To generate valid x86 instructions we need to be able to define what an instruction is and enumerate what valid opcode operand combinations exist. For this we define an InstructionDescription struct
and a table of instruction descriptions.
struct InstructionDescription
{
enum G
{
G_EMIT,
G_ALU,
G_BSWAP,
...
G_MMU_RM_IMM8,
G_MMU_RM_3DNOW };
enum F
{
F_NONE = 0x00, F_JUMP = 0x01, F_MOV = 0x02, F_FPU = 0x04, F_LOCKABLE = 0x08,
F_SPECIAL = 0x10, F_SPECIAL_MEM = 0x20 };
enum O
{
O_GB = 0x0001,
O_GW = 0x0002,
...
O_FM_4_8_10 = O_FM_4 | O_FM_8 | O_FM_10,
O_NOREX = 0x2000
};
Cmp_unsigned__int16 code;        // instruction code (INST_*)
Cmp_unsigned__int16 nameIndex;   // offset of the mnemonic in the instruction name table
Cmp_unsigned__int8 group;        // emitter group (G_*)
Cmp_unsigned__int8 flags;        // instruction flags (F_*)
Cmp_unsigned__int16 oflags[ 2 ]; // allowed operand types (O_*) for the two operands
Cmp_unsigned__int16 opCodeR;     // value for the ModR/M reg field where the opcode requires one
Cmp_unsigned__int32 opCode[ 2 ]; // primary and secondary opcodes
inline const char* getName() const
{
return instructionName + nameIndex;
}
...
};
# define MAKE_INST(code, name, group, flags, oflags0, oflags1, opReg, opCode0, opCode1) \
{ code, code##_INDEX, group, flags, { oflags0, oflags1 }, opReg, { opCode0, opCode1 } }
# define G(g) InstructionDescription::G_##g
# define F(f) InstructionDescription::F_##f
# define O(o) InstructionDescription::O_##o
const InstructionDescription instructionDescription[] =
{
MAKE_INST(INST_ADC , "adc" , G(ALU) , F(LOCKABLE) , O(GQDWB_MEM) , O(GQDWB_MEM)|O(IMM) , 2, 0x00000010, 0x00000080),
MAKE_INST(INST_ADD , "add" , G(ALU) , F(LOCKABLE) , O(GQDWB_MEM) , O(GQDWB_MEM)|O(IMM) , 0, 0x00000000, 0x00000080),
MAKE_INST(INST_ADDPD , "addpd" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0x66000F58, 0),
MAKE_INST(INST_ADDPS , "addps" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0x00000F58, 0),
MAKE_INST(INST_ADDSD , "addsd" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0xF2000F58, 0),
MAKE_INST(INST_ADDSS , "addss" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0xF3000F58, 0),
MAKE_INST(INST_ADDSUBPD , "addsubpd" , G(MMU_RMI) , F(NONE) , O(XMM) , O(XMM_MEM) , 0, 0x66000FD0, 0),
...
};
Now we can derive an x86 specific batch CPU from the abstract base and use the instruction description table to emit x86 instructions into the code buffer.
class __QOR_INTERFACE( __ARCHQOR ) Cx86CPUCore : public CCPUBase
{
friend class CInstEmitter;
public:
Cx86CPUCore( nsArch::CCodeGeneratorBase* codeGenerator ) __QCMP_THROW;
virtual ~Cx86CPUCore() __QCMP_THROW;
...
inline void _emitOpCode( Cmp_unsigned__int32 opCode) __QCMP_THROW
{
if( opCode & 0xFF000000 )
{
_emitByte( (Cmp_unsigned__int8)( ( opCode & 0xFF000000 ) >> 24 ) );
}
if( opCode & 0x00FF0000 )
{
_emitByte((Cmp_unsigned__int8)( ( opCode & 0x00FF0000 ) >> 16 ) );
}
if( opCode & 0x0000FF00 )
{
_emitByte((Cmp_unsigned__int8)(( opCode & 0x0000FF00 ) >> 8 ) );
}
_emitByte((Cmp_unsigned__int8)( opCode & 0x000000FF ) );
}
...
inline void _emitSib( Cmp_unsigned__int8 s, Cmp_unsigned__int8 i, Cmp_unsigned__int8 b ) __QCMP_THROW
{
_emitByte( ( ( s & 0x03 ) << 6 ) | ( ( i & 0x07 ) << 3 ) | ( b & 0x07 ) );
}
inline void _emitRexR( Cmp_unsigned__int8 w, Cmp_unsigned__int8 opReg,
Cmp_unsigned__int8 regCode, bool forceRexPrefix ) __QCMP_THROW
{
...
}
...
void _emitX86Inl(Cmp_unsigned__int32 opCode, Cmp_unsigned__int8 i16bit,
Cmp_unsigned__int8 rexw, Cmp_unsigned__int8 reg, bool forceRexPrefix) __QCMP_THROW;
...
void _emitMmu(Cmp_unsigned__int32 opCode, Cmp_unsigned__int8 rexw,
Cmp_unsigned__int8 opReg, const COperand& src, Cmp_int_ptr immSize) __QCMP_THROW;
LabelLink* _emitDisplacement(LabelData& l_data,
Cmp_int_ptr inlinedDisplacement, int size) __QCMP_THROW;
void _emitJmpOrCallReloc(Cmp_unsigned__int32 instruction, void* target) __QCMP_THROW;
void _emitInstruction(Cmp_unsigned__int32 code) __QCMP_THROW;
void _emitInstruction(Cmp_unsigned__int32 code, const COperand* o0) __QCMP_THROW;
...
void EmitInstructionG_ENTER( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2, Cmp_unsigned__int32& bLoHiUsed,
bool& assertIllegal, const COperand** _loggerOperands, const CImm* immOperand,
Cmp_unsigned__int32 immSize, Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
void EmitInstructionG_IMUL( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2, Cmp_unsigned__int32& bLoHiUsed,
bool& assertIllegal, const COperand** _loggerOperands, const CImm* immOperand,
Cmp_unsigned__int32 immSize, Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
void EmitInstructionG_INC_DEC( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2, Cmp_unsigned__int32& bLoHiUsed,
bool& assertIllegal, const COperand** _loggerOperands, const CImm* immOperand,
Cmp_unsigned__int32 immSize, Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
void EmitInstructionG_J( Cmp_unsigned__int32 code, const COperand*& o0,
const COperand*& o1, const COperand*& o2,
Cmp_unsigned__int32& bLoHiUsed, bool& assertIllegal,
const COperand** _loggerOperands, const CImm* immOperand, Cmp_unsigned__int32 immSize,
Cmp_uint_ptr beginOffset, const InstructionDescription* id,
Cmp_unsigned__int32 forceRexPrefix ) __QCMP_THROW;
...
inline Cmp_uint_ptr relocCode( void* dst ) const __QCMP_THROW
{
return relocCode( dst, (Cmp_uint_ptr)dst );
}
void embed( const void* data, Cmp_uint_ptr length ) __QCMP_THROW;
void embedLabel( const CLabel& label ) __QCMP_THROW;
...
inline void SetEmitOptions( Cmp_unsigned__int32 EmitOptions )
{
m_uiEmitOptions = EmitOptions;
}
...
protected:
Cmp_unsigned__int32 m_uiProperties;
Cmp_unsigned__int32 m_uiEmitOptions;
Cmp_int_ptr m_iTrampolineSize;
LabelLink* m_pUnusedLinks;
public:
nsCodeQOR::PodVector< LabelData > m_LabelData;
nsCodeQOR::PodVector< RelocData > m_RelocData;
};
With this batch processor design it would be possible to have a single function that generated any x86 instruction from its opcode and parameters rather than one function per instruction. In practice such a function would be a vast monstrosity if it did the entire task, like implementing
printf
in a single function. Instead we divide the instruction set into groups of instructions that require particular validation checks or particular logic in order to emit them and then have one function per group. All these functions work on the common set of data needed to generate a single instruction. This is packaged up in the instruction emitter class CInstEmitter
.
class CInstEmitter
{
public:
CInstEmitter( Cx86CPUCore& CPUParam, Cmp_unsigned__int32 codeParam,
const COperand* o0Param, const COperand* o1Param,
const COperand* o2Param ) __QCMP_THROW;
bool BeginInstruction( void ) __QCMP_THROW;
bool PrepareInstruction( void ) __QCMP_THROW;
void FinishImmediate( const COperand* pOperand, Cmp_unsigned__int32 immSize ) __QCMP_THROW;
void EndInstruction( void ) __QCMP_THROW;
void CleanupInstruction( void ) __QCMP_THROW;
bool LockInstruction( void ) __QCMP_THROW;
void InstructionImmediate( void ) __QCMP_THROW;
void InstructionIllegal( void ) __QCMP_THROW;
void InstructionG_EMIT( void ) __QCMP_THROW;
void InstructionG_ALU( void ) __QCMP_THROW;
void InstructionG_BSWAP( void ) __QCMP_THROW;
...
void InstructionG_MOV_PTR( void ) __QCMP_THROW;
void InstructionG_MOVSX_MOVZX( void ) __QCMP_THROW;
# if ( _QARCH_WORDSIZE == 64 )
void InstructionG_MOVSXD( void ) __QCMP_THROW;
# endif
void InstructionG_PUSH( void ) __QCMP_THROW;
void InstructionG_POP( void ) __QCMP_THROW;
void InstructionG_R_RM( void ) __QCMP_THROW;
void InstructionG_RM_B( void ) __QCMP_THROW;
...
void InstructionG_MMU_RM_3DNOW( void ) __QCMP_THROW;
protected:
Cmp_unsigned__int32 m_uiCode;
const COperand* m_pO0;
const COperand* m_pO1;
const COperand* m_pO2;
Cmp_unsigned__int32 m_bLoHiUsed;
bool m_bAssertIllegal;
const COperand* m_aLoggerOperands[ 3 ];
const CImm* m_pImmOperand;
Cmp_unsigned__int32 m_uiImmSize;
Cmp_uint_ptr m_uiBeginOffset;
const InstructionDescription* m_pId;
Cmp_unsigned__int32 m_uiForceRexPrefix;
Cx86CPUCore& m_CPU;
private:
__QCS_DECLARE_NONASSIGNABLE( CInstEmitter );
};
The Cx86CPUCore
function _emitInstruction
is used to wrap all this up and conduct the process of correctly emitting a single instruction.
void Cx86CPUCore::_emitInstruction( Cmp_unsigned__int32 code, const COperand* o0,
const COperand* o1, const COperand* o2 ) __QCMP_THROW
{
const InstructionDescription* id = &instructionDescription[ code ];
CInstEmitter Emitter( *this, code, o0, o1, o2 );
if( Emitter.BeginInstruction() && Emitter.PrepareInstruction() &&
Emitter.LockInstruction() )
{
switch( id->group )
{
case InstructionDescription::G_EMIT:
Emitter.InstructionG_EMIT();
break;
case InstructionDescription::G_ALU:
Emitter.InstructionG_ALU();
break;
...
Emitter.CleanupInstruction();
}
Now that we have the code to write x86 format instructions by their group, we need the functions that can be called to generate specific instructions. To do this Cx86CPUCore
is extended in
two ways. A hierarchy of derived classes adds the instructions associated with each generation of processor so that we can target a particular level of the instruction set. A parallel hierarchy of instruction set extension classes that attach to a particular Cx86CPUCore
instance implement the generations of multimedia extensions that have been added to the x86 instruction set.
At compile time we pick levels out of these derivation chains to constitute the minimum spec x86 target which we will require at runtime. Then at runtime we can check that the host machine for the executable is capable enough to meet those requirements. For example for the assembler functions needed to integrate with 32bit Windows we only need i386 level instructions although for practical purposes we won't support anything less than i486 as a target due to the lack of a cpuid
instruction on the i386.
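That runtime check boils down to issuing cpuid and testing feature bits. As a sketch using the MSVC __cpuid intrinsic, rather than the JIT assembled routine ArchQOR actually uses:
#include <intrin.h>

// Sketch: query a feature bit with the MSVC __cpuid intrinsic.
bool HostSupportsSSE4_1( void )
{
    int regs[ 4 ] = { 0 };                      // EAX, EBX, ECX, EDX
    __cpuid( regs, 1 );                         // leaf 1: processor info and feature bits
    return ( regs[ 2 ] & ( 1 << 19 ) ) != 0;    // ECX bit 19 = SSE4.1
}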
The Selection.h header, a section of which is below, will define the names CMainInstructionSet and CFloatingPointUnit to be the correct classes, for example Ci686CPU and CP6FPU.
...
#ifndef _QARCH_ISET_I786
# error ("_QARCH_ISET_I786 not defined")
#endif
#ifndef _QARCH_X86LEVEL
__QCMP_MESSAGE( "Target instruction set level not set. Defaulting to i686." )
# define _QARCH_X86LEVEL _QARCH_ISET_I686
#endif
#if ( _QARCH_X86LEVEL == _QARCH_ISET_I786 )
__QCMP_MESSAGE( "Target i786 instruction set." )
# include "ArchQOR/x86/Assembler/BatchCPU/i786CPU.h"
namespace nsArch
{
typedef nsx86::Ci786CPU CMainInstructionSet;
typedef nsx86::CP7FPU CFloatingPointUnit;
# define _QARCH_X87_FPU_EXTENSION_CLASS ,public CFloatingPointUnit
# define _QARCH_X87_FPU_EXTENSION_INIT ,CFloatingPointUnit( (Cx86CPUCore&)(*this) )
}
#elif ( _QARCH_X86LEVEL == _QARCH_ISET_I686 )
__QCMP_MESSAGE( "Target i686 instruction set." )
# include "ArchQOR/x86/Assembler/BatchCPU/i686CPU.h"
namespace nsArch
{
typedef nsx86::Ci686CPU CMainInstructionSet;
typedef nsx86::CP6FPU CFloatingPointUnit;
# define _QARCH_X87_FPU_EXTENSION_CLASS ,public CFloatingPointUnit
# define _QARCH_X87_FPU_EXTENSION_INIT ,CFloatingPointUnit( (Cx86CPUCore&)(*this) )
}
#elif ( _QARCH_X86LEVEL == _QARCH_ISET_I586 )
__QCMP_MESSAGE( "Target i586 instruction set." )
# include "ArchQOR/x86/Assembler/BatchCPU/i586CPU.h"
namespace nsArch
{
typedef nsx86::Ci586CPU CMainInstructionSet;
typedef nsx86::CPentiumFPU CFloatingPointUnit;
# define _QARCH_X87_FPU_EXTENSION_CLASS ,public CFloatingPointUnit
# define _QARCH_X87_FPU_EXTENSION_INIT ,CFloatingPointUnit( (Cx86CPUCore&)(*this) )
}
#elif ( _QARCH_X86LEVEL == _QARCH_ISET_I486 )
...
The CMainInstructionSet
and CFloatingPointUnit
classes selected are then used to derive the CCPU
class.
This gives us a low level assembler that can create an executable sequence of instructions valid for an x86 machine, but it can't yet easily create a function that is valid from the viewpoint of a C++ compiler. To go from low level instruction sequences to fully fledged functions with prologs and C++ calling conventions we need the set of constructs that make up a high level assembler.
High Level Assembler (HLA) Design
The high level assembler is based around the concept of an Emittable, an object that can be emitted into a stream of instructions and directives. The stream is processed through a number of stages before being used to drive the low level assembler to emit actual instructions to the buffer. This extra level of abstraction allows the creation of complete valid functions with control over the passing of parameters, calling conventions, return values and everything else that is needed to integrate with
existing compiled functions. It also allows for additional processing stages such as
aggressive optimization to be included in future. Just like the low level assembler the HLA is divided between abstract classes that can be reused across different architectures and architecture specific classes.
class __QOR_INTERFACE( __ARCHQOR ) Emittable
{
public:
Emittable( nsArch::CHighLevelAssemblerBase* c, Cmp_unsigned__int32 type ) __QCMP_THROW;
virtual ~Emittable() __QCMP_THROW;
virtual void prepare( CHLAssemblerContextBase& cc ) __QCMP_THROW;
virtual Emittable* translate( CHLAssemblerContextBase& cc ) __QCMP_THROW;
virtual void emit( CHighLevelAssemblerBase& a ) __QCMP_THROW;
virtual void post( CHighLevelAssemblerBase& a ) __QCMP_THROW;
virtual int getMaxSize() const __QCMP_THROW;
virtual bool _tryUnuseVar( CommonVarData* v ) __QCMP_THROW;
inline nsArch::CHighLevelAssemblerBase* getHLA() const __QCMP_THROW {
return m_pHLAssembler;
}
inline Cmp_unsigned__int32 getType() const __QCMP_THROW
{
return m_ucType;
}
inline Cmp_unsigned__int8 isTranslated() const __QCMP_THROW
{
return m_ucTranslated;
}
inline Cmp_unsigned__int32 getOffset() const __QCMP_THROW
{
return m_uiOffset;
}
inline Cmp_unsigned__int32 setOffset( Cmp_unsigned__int32 uiOffset ) __QCMP_THROW
{
m_uiOffset = uiOffset;
return m_uiOffset;
}
inline Emittable* getPrev() const __QCMP_THROW
{
return m_pPrev;
}
inline void setPrev( Emittable* pPrev ) __QCMP_THROW
{
m_pPrev = pPrev;
}
inline Emittable* getNext() const __QCMP_THROW
{
return m_pNext;
}
inline void setNext( Emittable* pNext ) __QCMP_THROW
{
m_pNext = pNext;
}
inline const char* getComment() const __QCMP_THROW
{
return m_szComment;
}
void setComment( const char* str ) __QCMP_THROW;
void setCommentF( const char* fmt, ... ) __QCMP_THROW;
protected:
inline Emittable* translated() __QCMP_THROW
{
m_ucTranslated = true;
return m_pNext;
}
nsArch::CHighLevelAssemblerBase* m_pHLAssembler;
Cmp_unsigned__int8 m_ucType;
Cmp_unsigned__int8 m_ucTranslated;
Cmp_unsigned__int8 m_ucReserved0;
Cmp_unsigned__int8 m_ucReserved1;
Cmp_unsigned__int32 m_uiOffset;
Emittable* m_pPrev;
Emittable* m_pNext;
const char* m_szComment;
private:
__QCS_DECLARE_NONCOPYABLE( Emittable );
};
First we wrap low level instructions in a high level Instruction emittable, CEInstruction
,
then Prolog, Epilog and Function Prototype emittables, Jump targets, Returns and Calls and finally a Function emittable CEFunction
to master all these and comply with the target ABI.
At the generic level Alignment, Comment, Data, Dummy, and FunctionEnd emittables and the base CEmittable
class detailed above are independent of which underlying assembly language we're using so they can be reused across platforms.
The CEFunction
class maintains the parameters and invariants for generating a single function but the developing state data that tracks the function generation process is managed by a High Level Assembler Context class specialized for x86 as Cx86HLAContext
. This keeps track of variables, register allocation, stream write point and scope as the function is created.
class __QOR_INTERFACE( __ARCHQOR ) Cx86HLAContext : public CHLAssemblerContextBase
{
public:
Cx86HLAContext( nsArch::CHighLevelAssemblerBase* compiler ) __QCMP_THROW;
~Cx86HLAContext() __QCMP_THROW;
void _clear() __QCMP_THROW;
void allocVar( VarData* vdata, Cmp_unsigned__int32 regMask, Cmp_unsigned__int32 vflags ) __QCMP_THROW;
...
CMem _getVarMem( VarData* vdata ) __QCMP_THROW;
VarData* _getSpillCandidateGP() __QCMP_THROW;
VarData* _getSpillCandidateMM() __QCMP_THROW;
VarData* _getSpillCandidateXMM() __QCMP_THROW;
VarData* _getSpillCandidateGeneric(VarData** varArray, Cmp_unsigned__int32 count) __QCMP_THROW;
inline bool _isActive( VarData* vdata) __QCMP_THROW
{
return vdata->nextActive != 0;
}
void _addActive( VarData* vdata ) __QCMP_THROW;
void _freeActive( VarData* vdata ) __QCMP_THROW;
void _freeAllActive() __QCMP_THROW;
void _allocatedVariable( VarData* vdata ) __QCMP_THROW;
inline void _allocatedGPRegister(Cmp_unsigned__int32 index) __QCMP_THROW
{
_state.usedGP |= nsCodeQOR::maskFromIndex(index);
_modifiedGPRegisters |= nsCodeQOR::maskFromIndex(index);
}
...
void translateOperands( COperand* operands, Cmp_unsigned__int32 count ) __QCMP_THROW;
...
void addBackwardCode( EJmp* from ) __QCMP_THROW;
void addForwardJump( EJmp* inst ) __QCMP_THROW;
StateData* _saveState() __QCMP_THROW;
void _assignState(StateData* state) __QCMP_THROW;
void _restoreState(StateData* state, Cmp_unsigned__int32 targetOffset = INVALID_VALUE) __QCMP_THROW;
VarMemBlock* _allocMemBlock(Cmp_unsigned__int32 size) __QCMP_THROW;
void _freeMemBlock(VarMemBlock* mem) __QCMP_THROW;
void _allocMemoryOperands() __QCMP_THROW;
void _patchMemoryOperands( nsArch::CEmittable* start, nsArch::CEmittable* stop ) __QCMP_THROW;
nsCodeQOR::Zone _zone;
nsArch::CHighLevelAssemblerBase* _compiler;
EFunction* _function;
nsArch::CEmittable* _start;
nsArch::CEmittable* _stop;
nsArch::CEmittable* _extraBlock;
StateData _state;
VarData* _active;
ForwardJumpData* _forwardJumps;
Cmp_unsigned__int32 _unrecheable;
Cmp_unsigned__int32 _modifiedGPRegisters;
Cmp_unsigned__int32 _modifiedMMRegisters;
Cmp_unsigned__int32 _modifiedXMMRegisters;
Cmp_unsigned__int32 _allocableEBP;
int _adjustESP;
Cmp_unsigned__int32 _argumentsBaseReg;
Cmp__int32 _argumentsBaseOffset;
Cmp__int32 _argumentsActualDisp;
Cmp_unsigned__int32 _variablesBaseReg;
Cmp__int32 _variablesBaseOffset;
Cmp__int32 _variablesActualDisp;
VarMemBlock* _memUsed;
VarMemBlock* _memFree;
Cmp_unsigned__int32 _mem4BlocksCount;
Cmp_unsigned__int32 _mem8BlocksCount;
Cmp_unsigned__int32 _mem16BlocksCount;
Cmp_unsigned__int32 _memBytesTotal;
bool _emitComments;
nsCodeQOR::PodVector< EJmp* > _backCode;
Cmp_uint_ptr _backPos;
};
In order to integrate JIT assembled functions into existing compiled C++ we need to be able to make calls in both directions. CFunctionPrototype
assists JIT assembled functions to call existing compiled functions and other JIT assembled functions.
The High Level Assembler class itself provides an interface for the creation of new variables and functions and the insertion of emittables into the stream.
The High Level Assembler Intrinsics class builds on the High Level Assembler to provide a virtual assembly language interface giving us the original one function per instruction design but within the current context of the emittable function under construction and in terms of variables rather than registers.
Now we have a high level x86 assembler capable of generating functions that integrate with our C++ codebase and a set of base classes to extend support to non x86 architectures in the future.
Just to put the icing on the cake and to make the High Level Assembler even easier to use, ArchQOR defines a set of functor templates. These enable JIT functions to be treated as objects for the purpose of generating them and as functions for the purpose of calling them. Below is the code for the three parameter template.
template< typename RET, typename P1, typename P2, typename P3 >
class CJITFunctor3 : public CJITFunctorBase
{
public:
typedef RET( *FP )( P1, P2, P3 );
CJITFunctor3( CHighLevelAssemblerBase* pHLA ) : CJITFunctorBase( pHLA )
, m_pFunc( 0 )
{
}
~CJITFunctor3()
{
}
RET operator()( P1 p1, P2 p2, P3 p3 )
{
if( !m_bGenerated )
{
m_pFunc = Generate();
}
if( !m_pFunc )
{
throw "Null function pointer exception";
}
return (m_pFunc)( p1, p2, p3 );
}
protected:
virtual FP Generate( void ) = 0;
FP m_pFunc;
};
Usage: The proof of the pudding
To demonstrate the usage of the high level JIT assembler we'll create a fast memcpy
function that works in units of 4 bytes instead of the 1 byte at a time of the standard memcpy
.
The signature for the function will be: void MemCpy32( Cmp_unsigned__int32* destination, const Cmp_unsigned__int32* source, Cmp_uint_ptr count );
Step 1 is to derive a new functor type from CJITFunctor3
matching the signature we want.
class CJITmemcpy32 : public nsArch::CJITFunctor3< void, Cmp_unsigned__int32*,
const Cmp_unsigned__int32*, Cmp_uint_ptr >
{
public:
CJITmemcpy32( CHighLevelAssemblerBase* pHLA ) : CJITFunctor3( pHLA ){}
protected:
virtual FP Generate( void );
};
Step 2 is to implement the Generate
function to construct MemCpy32
using the High Level Assembler.
CJITmemcpy32::FP CJITmemcpy32::Generate( void )
{
Cx86HLAIntrinsics& HLA( *( dynamic_cast< Cx86HLAIntrinsics* >( m_pHLA ) ) );
HLA.newFunction( CALL_CONV_DEFAULT, FunctionBuilder3< Void, Cmp_unsigned__int32*,
const Cmp_unsigned__int32*, Cmp_unsigned__int32 >() );
CLabel LoopLabel = HLA.newLabel();
CLabel ExitLabel = HLA.newLabel();
CGPVar dst( HLA.argGP( 0 ) );
CGPVar src( HLA.argGP( 1 ) );
CGPVar cnt( HLA.argGP( 2 ) );
HLA.alloc( dst );
HLA.alloc( src );
HLA.alloc( cnt );
HLA.test( cnt, cnt );
HLA.jz( ExitLabel );
HLA.bind( LoopLabel );
CGPVar tmp( HLA.newGP( VARIABLE_TYPE_GPD ) );
HLA.mov( tmp, dword_ptr( src ) );
HLA.mov( dword_ptr( dst ), tmp );
HLA.add( src, 4 );
HLA.add( dst, 4 );
HLA.dec( cnt );
HLA.jnz( LoopLabel );
HLA.bind( ExitLabel );
HLA.endFunction();
FP fn = reinterpret_cast< FP >( HLA.make() );
if( fn )
{
m_bGenerated = true;
}
return fn;
}
Step 3 is to create the CJITmemcpy32
instance and setup some data for it to work on.
CJITmemcpy32 MemCopy32( &TheMachine().HLAssembler() );
Cmp_unsigned__int32 dstBuffer[128];
Cmp_unsigned__int32 srcBuffer[128] = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89};
Step 4 is the fun part. Now we can 'magically' call a function that doesn't actually exist yet except as the 'source' from step 2 and yet it works.
MemCopy32( dstBuffer, srcBuffer, 128 );
Stepping into this call in the debugger will reveal that the functor checks if the function has been generated yet, finds it hasn't and so generates it just in time before calling it. Calling
MemCopy32
a second time will skip the function generation phase and therefore give much faster results.
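Carrying on from the snippet above, a quick check with the standard memcmp confirms the copy, and a second call reuses the already assembled code:
if( memcmp( dstBuffer, srcBuffer, sizeof( srcBuffer ) ) == 0 )
{
    MemCopy32( dstBuffer, srcBuffer, 128 );  // no reassembly this time, just a call
}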
Here's a disassembler view of the actual code of the final JIT generated function:
00030000 push ebp
00030001 mov ebp,esp
00030003 push ebx
00030004 push esi
00030005 sub esp,10h
00030008 mov ecx,dword ptr [ebp+8]
0003000B mov edx,dword ptr [ebp+0Ch]
0003000E mov ebx,dword ptr [ebp+10h]
00030011 test ebx,ebx
00030013 je 00030022
00030015 mov esi,dword ptr [edx]
00030017 mov dword ptr [ecx],esi
00030019 add edx,4
0003001C add ecx,4
0003001F dec ebx
00030020 jne 00030015
00030022 add esp,10h
00030025 pop esi
00030026 pop ebx
00030027 mov esp,ebp
00030029 pop ebp
0003002A ret
The code associated with this article implements the described test of the high level assembler and also makes use of the low level assembler to do a small but vital task which cannot be done except with the use of assembly language: the interrogation of the host CPU for its features, versioning and branding information. Here is a sample of the output this gives when running on my development laptop.
A learning experience
I am not and have never claimed to be any kind of expert in assembly language. Many years ago I could get by in Z80 assembly but x86 is a different game and I have always found it particularly difficult to read let alone write. Porting and polishing Petr Kobalicek's JIT assembler has taught me a great deal and I hope improved my x86 assembly to an almost competent level.
I think the biggest thing I had been missing was an appreciation of the importance of conventions in x86 assembly programming. With such a vast instruction set and so many registers there seem to be a thousand ways to do any given job. While the assembler itself won't complain whichever you choose, and the machine will execute whatever you give it, this is not enough. On top of the machine's requirements for correct assembly language there are a host of rules about how to use the stack, how to pass parameters, how to use and preserve registers, which 'general purpose' registers to actually use for return values, temporaries and parameters, which segment register is used for which purpose by the operating system, and a dozen other things I haven't discovered yet. Once you know these things, reading disassembly listings or other people's assembly code becomes much easier because you know what to expect. Similarly, writing assembly becomes much easier if you don't have to work out how to do the common things like passing parameters, because you just follow the convention.
I've never come across any good documentation on these conventions and I think that's why I struggled so much to get to grips with x86 assembly. It's not really surprising if there isn't much in the way of coherent documentation given that these conventions are hardware dependent, partially OS dependent, partially compiler dependent and partially just arbitrary tradition.
Future Directions
ArchQOR is up and running but there's a lot still to do. 64bit code generation is untested and will have issues. I also need to ensure that ArchQOR works reliably on the full range of compilers supported by CompilerQOR but I'm not going to burden this article's source code with half a dozen extra project files and associated debris.
There are many possible improvements and extensions to ArchQOR and its JIT assembler as well as no doubt a number of bugs to be run down and squished.
One is support for the more recent x86 instruction set extensions, AVX and AVX2 for Haswell. These will require extensive code changes as the maximum number of parameters goes from 3 to 5 and new instruction formats are introduced.
Support for ARM CPUs and NEON SIMD extensions is a must so that the QOR will work on ARM based mobile devices. The path to doing this is now very clear with the layout of the code in ArchQOR but the knowledge required especially to implement a high level assembler is extensive and I don't have most of it.
This version of ArchQOR is also tied to the Windows platform and that is not consistent with the concept of the QOR. 99% of the ArchQOR code is equally applicable to x86 Linux and other x86 operating systems. We'll see in future articles how to abstract ArchQOR and ultimately the entire framework from the operating system.
Now that we can target a specific level of x86 hardware, the CompilerQOR can support compiler intrinsics which are target hardware dependent, so I need to go back to CompilerQOR and update the MSVC support, at least to enable the use of intrinsics that require an i586 or i686 target.
If you carefully examine the source code associated with this article you'll notice a CodeQOR folder with a couple of small classes in it. These are dependencies that ArchQOR needs but which would normally live in the CodeQOR module. The next article in this series will be about the CodeQOR and will, I'm sure, be even more interesting than this one.
Acknowledgements
- The majority of the ArchQOR code is based completely on AsmJit, so credit for the code should go to Petr Kobalicek, without whose work none of this would ever have got here. Bob doesn't like offsite links but AsmJit is easy enough to Google.
- Thanks are due to my expert proof reader and best sister. Any remaining errors were put in by me after she'd finished.
- Thanks to Harold Aptroot for pointing out the lack of AVX/AVX2 support and the issues involved in adding it.
- Microsoft, Windows, Visual C++ and Visual Studio are trademarks of Microsoft.
- All other trademarks and trade names mentioned are acknowledged as the property of their respective owners who are in no way responsible for any of the content of this article or the associated source code.
History
- Initial version - 09/07/2013.