Click here to Skip to main content
11,412,834 members (73,374 online)
Click here to Skip to main content

GCN Assembler for AMD GPUs

, 17 Feb 2015 CPOL
Rate this:
Please Sign up or sign in to vote.
an assembler/compiler for AMD’s GCN (Generation Core Next Architecture) Assembly Language

Download Note: This project requires a specific hardware and driver. (see requirements for details)


This GCN GPU assembly language compiler converts human-readable assembly to binary machine code for AMD GCN equipped GPU processors. Assembly language is a friendly abstract way to view machine code. With some exceptions, assembly statements are directly mapped to individual machine instructions. Assembly is often used for applications that are time critical or when special hardware features might need to be accessed. An example use of GPU assembly might be for a bit-coin mining application. Bit-coin mining is compute intensive and benefits from small performance gains. Bit-coin mining using assembly can also take advantage of special hardware instructions or instruction features that couldn’t otherwise be accessed. Furthermore, writing in assembly can often be used to combine instructions in creative ways. Of course, writing assembly does have its drawbacks in the areas of maintainability and compatibility – see the disadvantages section for details.

Besides the core assembler, this project includes two additional Visual Studio projects. Each project wraps another in a shell like way. The inner core is the Asm4GCN assembler project. Wrapping that is the OpenCLwithGCN project that injects assembly binaries into dummy OpenCL kernels. And finally there is the Asm4GCNGUI project, the outer most layer. This is a windows application that pulls everything together in a friendly syntax-highlighting editor. This editor also doubles as an example of using OpenCLwithGCN.

Here is a quick introduction of each project starting with the assembler core and working out to the user interface:

Asm4GCN Assembler – The core is the Asm4GCN assembler/compiler itself. It basically converts assembly statement blocks into their binary representation. It handles variables, predefinitions, labels, and register management. It returns warnings and errors to the user.

OpenCLwithGCN – This basically takes an OpenCL program that has a mix of OpenCL and Asm4GCN kernels and then inserts (or patches-in) the binaries from the gncAsm kernels. It also runs the program.

Asm4GCNGUI – This is a simple windows app that implements the Asm4GCN for writing code in a small editor. The editor’s syntax highlighting and code completion make it handy for writing GCN assembly. The editor allows for quick testing of Asm4GCN kernels and maybe useful for testing or educational purposes. Just write some assembly and OpenCL kernels with c# based host code and see to what happens.


In times past, I wanted to take advantage of the extra performance benefits of writing in assembly as well as take advantage of special GPU hardware features that were only available through assembly. To satisfy this for NVidia GPUs I can inline some PTX using Cuda’s asm function. PTX is not true assembly but it does provide a very near experience. For AMD on the other hand, there is not really a straight forward way to "inline" assembly in OpenCL.

The optimal solution would be to have an asm function in OpenCL however this is not currently supported… at least not for the most part. There have been a couple posts online where users have partially got inline asm to work however there is one major problem– variables cannot be passed in/out. If variables cannot be passed in then an inline assembly is not so useful. Furthermore, any registers that are used might stomp on some in-use OpenCL variables.

Since there is nothing that I knew of that supports inline GCN assembly in OpenCL, I decided to try and attempt it myself. My original intentions were to have inline assembly however it did not pan out as I liked – at least not yet. The complexity of creating a dynamically sized dummy kernel and then capturing the used registers is complicated and my attempts were not reliable. After much effort, the original goal of inline assembly has been put on the back burner, and it was built so a cl::program can have a mix of OpenCL __kernel and GCN __asm4GCN kernels. The framework for doing inline assembly is half way in-place but I am still struggling with capturing the registers used for the dynamically sized dummy kernels. I am not sure if inline assembly will be added in the future. (see Inline assembly for more)

One note I should add here is there is currently a windows app, called HetPas, which generates elf-image files that can be loaded into an OpenCL program. This great app, created by Realhet, is a closed source windows project that takes GCN assembly kernels and Pascal-like host code and then runs them – similar to this project. I have been using HetPas for a while and it works well. The only part I struggled with is the Pascal side of things.

Asm4GCN Assembler Sub-Project

This project comprises of three Visual Studio projects. The Asm4GCN assembler, OpenCLwithGCN, and the Asm4GcnGUI. The Asm4GCN project is the core assembler for this project and converts a block of assembly statements into raw binary code.

For Example: [s_mov_b32 v1 s0] is converted into a hardware readable [7E020200]

Besides converting each statement to its hardware code counterpart, the Assembler also handles other items like labels and jump statements. I think the best way to show what the Assembler does through flowcharts. For simplicity, some steps have been left out.

  1. A block of GCN assembly is passed into the Asm4GCN assembler to be converted into binary.
  2. The first (or next) line is started.
  3. Remove any whitespace on this line. Comments, empty lines, and extra whitespace are removed.
  4. If the row starts with a label header (i.e. myLabel: ) it is recorded in the labels list here.
  5. #s/v_pool and #define
  6. Each line is checked for any #defines. For each match the keyword is replaced.
  7. Spit up line into multiple statements using “;”.
  8. Are there any (more) statements on this line? If Yes then lets parse the statement, if No then continue to next line.
  9. The statement is split up into a string array. V_Mov v2,v4 is translated into {“V_Mov”, “v2”, “v4”}
  10. Process any variable initialization or destruction. Here registers are reserved or returned back to the pool using free.
  11. If the current instruction uses any variables, they are replaced with their current register number.
  12. Convert hex, binary, octal, and scientific notation strings into constants.
  13. If the current instruction references any labels then we will hold off on converting the instruction to binary since we don’t know the constant value.
  14. The current instruction is converted into its binary version.
  15. The current instruction is added to a list for later processing. 


The next step is to fill the stmt.opSize for any statements that don’t have it filled in yet. After this is done, some final cleanup and then a binary is created.

16. Creates a list of Instructions with an unresolved size (opSize). These are instructions with labels where the distance is not known.
17. Processes each instruction one by one until we get to the end of the list.
18. If the min and max distance to the label are the same then the instruction size must be same.
19. Set the byte size to whatever the equal min/max was and then continue to the next instruction.
20. Is the byte size still 4 bytes even with the farthest jump distance? If so then it will always be 4 bytes.
21. Set OpSize to 4 bytes
22. We have finished enumerating the list. Are there any unresolved OpSizes? If so then we will need to enumerate again.
23. Convert any final instructions that had labels into binary form.
24. Join all the 4 and 8 byte instructions into one single byte array.
25. Return the finished binary. Other information is also returned like register usage as well.


One of the assembler’s main tasks is calculating the distance of jump labels and converting them into actual numbers. I found out doing this is one of the more complicated parts of GCN assembly - at least to do it correctly. It turns out that processing label distances, is much like the “chicken or the egg” dilemma. Instructions that have labels in them could be either 32-bit or 64-bit depending on the jump length but we don’t know the jump length until we know the sizes of the instructions in-between the jump and label. When there are no labels between a jump and the destination life is easy, however as soon as a jump passes over other jumps we start to have problems because we don’t know the exact jump distance any longer. 

To resolve this, each jump statement is tried with the shortest and longest possible distances. The longest distance is the sum of the sizes using a 64-bit size for the unresolved instructions. The shortest distance would be the same but using 32-bit for the unresolved. At any time, if the shortest and longest are both 32-bit or 64-bit, then we can lock in the size of that statement. We do this with each unresolved statement over and over until all the jump statements have a fixed size. After we know the size of each instruction, then the labels can be replaced with the appropriate constant jump length. There are a couple more steps but that’s the main idea.

As an alternative to the above, one easy, but not as good way to resolve this is to just use 64-bit version of the instruction when we are not sure about the label distance. A 64-bit instruction with a constant can hold any value a 32-bit instruction can hold.

Some features of the label system:

  • Labels can be used in place of most constants. A label is just the distance to another instruction so we can use a label wherever a constant would be in most cases. For example, a statement like “S_Mov S2, MyLabel” would simply move the distance from MyLabel to S2. Just remember “S_SUB S4, MyLabelA, MyLabelB” would fail in this case because “S_SUB” cannot take two constants.
  • Multiple labels with the same name are supported. For myself, I often want to re-use a label name or I just run out of my favorites. Basically the compiler will issue a warning but jumps will reference the nearest matching label. This can be handy for when you want to copy and paste a block also as the label names will not need to be changed in most cases.


Writing an assembly block with more than 10 to 20 statements gets complicated quickly. Some of the reasons for the complexity is trying to remember what each register does and whether it is free or not. The ability to checkout a register with a friendly name is extremely useful. For example, a variable named LoopCount is much more descriptive then S23:24. Also, the ability to checkout a register keeps a programmer from accidently over-writing an existing register. When the number free registers become low it can be time consuming to try and hunt down a free register. Variables abstract the programmer from having to remember what register numbers do and what registers are available.


v4i myVar1, myVar2, mySum;
v_mov_b32 myVar1, 10;
v_mov_b32 myVar2, 20;
v_add_i32 mySum, myVar1, myVar2
free myVar1, myVar2

The variable system for Asm4GCN basically has three parts – the declaration, using the variable, and freeing the variable.

Declaring a Variable

initializes a variable name and reserves a register. When the compiler runs across a declaration it will immediately reserve the first free register(s) from the allowed register pool. Instead of using a long declaration like scalar unsigned int or vector long long, I tried to simplify it using a short 3 digit declaration. The first digit determines vector or scalar. The second digit is the size in bytes. And the last digit is the intended data-type of the variable. Following the type information is the variable name itself. This first character must be a letter or underscore and the remaining characters can have numbers as well.


 v4i _my_int_vector;
 v8u myUnsignedLongVector;
 s8f myDoubleScalar1, myDoubleScalar2; //multiple items can be declared at once
 s4u myAddrForcedToS10 s10;     // forced to 10 (used for incoming params)
 v8f myDoubleForcedTo2 v[2:3];  // either v[2:3] or just the beginning, v2, can be specified

The declaration starts out with a type followed by one or more variable names. I wanted to keep the variable types simple so I came up three character variable type that holds type/size/datatype.

First Digit  - this specifies what memory space this variable will to belong to. This is either S for scalar or V for vector. This tells the register reservation system to reserve either a Scalar or Vector register.

Second Digit - specifies the size in bytes of the variable. The size tells the compiler to reserve 1 to 4 consecutive registers to hold the variable. A value of 2 would use 2 bytes but would take up a single 4-byte register. A value of 4 would also use one register. A value of 8 would be two full 32-bit registers. Currently there is no difference in what the compiler does for 1, 2, or 4 – this is currently just informational.  However, if 8 is used, this does affect the compiler in that it will reserve two consecutive registers - an even numbered register followed by an odd register.

Third Digit - this is the intended datatype. This is either F for float, I for integer, U for an unsigned integer, or B for bits. The “bits” format is really a ketch-all format. Bits can really represent any other formats including bool. Unfortunately this last format specifier is not used yet. My original intentions were to simplify GCN assembly by having the ability to do simple commands such as “R2 = R3 + R4” where type information would be needed. Also types could be used to throw warnings. If a float type is specified but used with a “v_add_i32” then this could throw a warning. These extra features may or may not be added in the future.

After the type and variable name, an optional register number can be specified to force a variable to use a given register number.   The reason this was added is because when the kernel starts some registers are pre-loaded with values like parameter addresses, block id/size, and lane id. These specific registers must be captured and reserved.  Variables with the force-register option should be used near the top of the block. If the register is used from a previous automatic assignment then the forced option will fail because the register would already be in use.

Using a variable

Just use a variable name as needed in your code instead of using register numbers(i.e. v17 or s[20:21]). Under the covers, whenever the assembler runs across a variable in the assembly it will do a dictionary lookup to retrieve its register number and type. It will then replace the variable name text with the register type and number. i.e. v_add_i32 v3, v4, myInt; to v_add_i32 v3, v4, v7;

Renaming a variable

A variable in Asm4GCN is a identifier given to an assigned register.  Sometimes the purpose of the register changes mid-flight and the identifier name should be updated accordingly.  For example, v0 might start out as a lane-id but is updated and becomes global-id instead.  Here would be an example of its use:  

v4u lane_ID v0
[use lane_ID here]
[lane_ID is changed and now becomes global_ID]
ren laneID global_ID
[use globalID here]
free global_ID

Free a variable

Use the free keyword followed to close out a variable and free the register(s) it had assigned to it. In the background, this will delete the variable name from a dictionary and mark the register number(s) as free. 

The free can be followed by several variable names separated by commas. i.e. free myVar1, myVar2

Flexible Constants

Inline constant, also known as literals, can be used with instructions that support them. Asm4GCN supports a large variety of constant types: Decimal (250), Hex (0xFA), Octal (0x 372), Binary (0b11111010), exponent (25E1, 25E+1), and Label (myLabel). You’re probably thinking why is label listed? Well in AsmGCN a label is just the distance to that label in bytes so it can be used in place of any constant.

Valid Examples:

s_add_i32    s3, s4, 12             // pos int  
s_add_i32    s3, s4, -12            // Small neg.  
s_min_u32    s5, s6, 0xabcd         // Hex  
s_min_u32    s5, s6, 10e2           // Exp  
s_min_u32    s5, s6, -10e2          // Neg Exp  
s_mov_b32    s4, 2.                 // float           
s_mov_b32    s4, -20.0              // neg float 
s_mov_b32    s4, .5                 // float 
s_mov_b32    s4, -.5                // float 
s_mov_b32    s4, 343.432            // float 
s_mov_b32    s4, 3.4e4              // exp float 
s_mov_b32    s4, -34.4e-4           // exp float 
s_mov_b32    s4, 0o7654             // Octal  (4000/FA0)
s_mov_b32    s4, 0b0011111111       // Binary (255/FF)

#define support

The assembler supports c-style #define statements. This is done by a simple find-and-replace for each #define in the code block. For example, #define _fe_ 54321 would replace all the _fe_ in the code with 54321. The reason for the surrounding underscores is to prevent accidental usage. If we just used fe we could accidently replace any fe with a 12345 so v_bfe_u32 would become v_b12345_u32. #defines without underscores can still be used however they will generate a compiler warning.

#defines with parameters are also supported, so something like #define _world_(AAA) 1AAA1; would search for items like int myNum = _world_(00); and would translate them into int myNum = 1001;.

Valid Examples:

#define _hw3_ Hello World!
<span style="font-size: 9pt;">#define _hw1_(opt0) Hello opt0 World!
</span><span style="font-size: 9pt;">#define _hw2_(opt0,opt1) Hello opt0 World from opt1!</span>

#s_pool / #v_pool command

Most of the time you might want to use registers 0 through 255 however sometimes there might be a need to compile using a particular set (or pool) of registers. #s_pool and #v_pool can be placed in the top of the assembler text to do this. When this compiler command is hit the register reservation system will use these registers for variables. Normally this would be put in the top of your code but you can actually use it anywhere – a warning will appear if it’s not near the top.

Valid Examples:

#S_POOL s22, s23, s24, s27, s29, s30, s31, s33, s34, s35, s36, s37
#V_POOL v11, v12, v13, v14, v15, v17, v19, v20, v21, v23, v24, v25

Semicolons and Multiple Statements per Line

In most cases, multiple statements can be used on a single line. If this is the case then a semicolon(;) can be used to separate the statements.  All instructions and most commands with the exception of #v_pool, #s_pool, and #define support multiple statements per line.  For lines with a single command or instruction a semicolon can optionally be used.


s8u myScalar; v8u myVector; v_add_i32 myVector, myVector, myScalar // this is okay
<span style="font-size: 9pt;">v_add_i32 v0, v1, v2  // this is okay
</span><span style="font-size: 9pt;">v_add_i32 v0, v1, v2; // this is okay
</span><span style="font-size: 9pt;">v_add_i32 v0, v1, v2; #define _myDef_ 12345678  // fail - #define must be on its own line</span>

Asm4GCN Project Files

The following is a list of the source files along with their description. The number of lines appears in parenthesis after the name of the file. The project has a total of about 10,000 lines (includes comments and whitespace).

  • GcnISA.cs (1631) - This file contains all the raw data about GCN. It has dictionaries, arrays, and enums of instruction information, register aliases information, etc.
  • Encoder.cs (1076) - This file contains the static GcnParcer class. This class contains methods for each GCN encoding format whose job is to convert a single statement line into its OpCode binary format. This is the core of the GCN assembler.
  • GcnBlock.cs (34) - GcnBlock.cs contains the GcnBlock class. This class is responsible for converting a block of instructions into a byte[] binary form.
  • DataStructs.cs (59) - Contains miscellaneous random structures like GcnStmt, Define, and AsmVar.
  • Labels.cs (75) - Holds the Label and Labels class. This class is for responsible for keeping track of labels and jump distances.
  • ParseOperand.cs (344) - This static tool-like class is responsible for parsing operands. It converts hex, octal, and binary strings to constants as well as verifies that the datatype is allowed.
  • Program.cs (220) - Contains functionality for when Asm4GCN is used in command line. Most projects however will link functions like in a dll. 
  • RegPool.cs (331) - Keeps track of Regs usage using the RegPool class . RegPool keeps an array of currently available registers. This class is initialized by a supplied a list of allowed registers or a range starting from zero.
  • RegUsage.cs (109) - The RegUsageCalc class keeps a usage count of each register size. It also remembers the maximums with what line they occurred on. This is for information usage or for knowing how many registers an inline statement might need. (inline not implemented)
  • Log.cs (121) - Contains a logging class for handing output. It can output to a StringBuilder or directly to a console.
  • Tools.cs (77) - Contains a static extension class that has some useful extensions like IsBetween().
  • TestInput.txt (167) - This file contains examples of GCN assembly. It also doubles for testing.

Point of Interest

Inline assembly – The inline assembly did not really pan out for this project but I wanted to include some notes about in case anyone wanted to give it a go. I made it a little more complicated originally (not shown here) and got burned out. The “inline” code is still there however it needs to be finished – either by someone else or myself. Here is an outline of the plan:

  1. Location and extract the inline Asm4GCN Blocks (into strings)
  2. Next, assemble the pulled Asm4GCN blocks to binary and note the byteSize, sReg and vReg counts. This is done by assembling with the “OutputUsageMode” option. In this mode, GCN assembly are compiled with temporary registers and the count of each size and type(S or V) of register is recorded. The maximum using points is important here because.
  3. In the OpenCL kernel we need to replace inline assembly with some generated dummy OpenCL code. The dummy code byte size would need to be the same size or slightly larger to fit the inline assembly bin. The dummy code would also need to use the same number of correctly sized sReg and vRegs as well as the same parameters. It also must be surrounded by barrier and mem_fence to prevent OpenCL from reordering the code or else the dummy code will be hard to identify later for replacement.
  4. Now compile the OpenCL dummy code to binary and note what sRegs and vRegs were used. Knowing what registers that were used is important because it tells us what registers we can use in the assembly. When coding the inline assembly, we will need to use variables instead of fixed register numbers since we do not know what pools of registers will be used later on.
  5. Next, re-assemble the inline assembly but this time we will prepend an allowed register pool to the top with the commands #S_POOL and #V_POOL. For example, #S_POOL s22,s23,s24,s27 would tell the assembler to use these registers for variables.
  6. The final step is to locate the dummy binary in the entire program binary and then replace it with the finished binary. RegEx can be used for this.

What was not finished? Steps (3) – This is kind of working but not as well as I hoped. It does create a kernel of close to the right size and the register counts are close also but not exact. The code for my attempts are in DummyFillerCode(). Step (6) – this was not finished. The start of the dynamically created DummyFillerCode binary seemed to be difficult to find.

Having the programmer use free vs out-of-scoping variable freeing - When making the variable reservation system I debated on whether I needed to use “free” to finish a variable or if I wanted it to automatically fall way when it is out of scope or used on the last line. Scopes in higher level languages are built-in and implicitly provide an end-of-scope for variables however most assemblers don’t have built-in scopes, they are usually handled by the programmer. So some kind of mechanism would need to detect the last time the register is used and then free it. After some back and forth, I found out that automatic freeing would not be strait forward without scopes and it all has to do with “goto”. There are instance when we might jump back into some code and expect a variable to still be active. Ultimately, I decided to allow the programmer to free the variable because it would be simpler to understand and lets the programmer decide when to free a variable.

OpenCLwithGCN Sub-Project

This project makes it possible to use Asm4GCN in C# projects. Its main function is to replace Asm4GCN kernels with dummy OpenCL kernels, and then after assembling, replace the dummy OpenCL binary with the Asm4GCN kernel binary. The OpenClwithGCN project allows the mixing of Asm4GCN and OpenCL kernels in the same cl::program. I think the best way to describe this is through a flowchart:

Features of OpenCLwithGCN

Built-in Text template engine

When programming in assembly it is almost mandatory to have some kind of mechanism to repeat text dynamically. One popular need for assembly programmers is to unroll loops. In higher level languages, compilers do this automatically but at the assembly level it is the programmer’s responsibility. 

This project uses a, C# based, text-template-transformation engine, that can be used to manipulate the assembly. It is comprised of C# code tags that are run on the assembly text before it being compiled – much like predefinitions. For the GCN text template mechanism, I chose to wrap the C# code in [[..]] style tags. To use it, just put in any C# code inside double brackets to control the flow of what gets displayed. Variables can be printed anywhere using [[=myVar]]. There are lots of features to this so please reference for details.

Text Template Engine Examples
Original Source s Expanded
[[for(int i=3; i<7; i++) {]]
  v_mov_b32 v[[=i]], v[[=i+4]] [[ } ]]
v_mov_b32 v3, v7
<span style="font-size: 9pt;">v_mov_b32 v4, v8
</span><span style="font-size: 9pt;">v_mov_b32 v5, v9</span>
// Created [[=DateTime.Now]]
// Created 1/24/2015 8:12 PM

Mixed OpenCL and Asm4GCN kernels in one cl::program

___kernel and __asm4GCN kernels can be mixed in the same openCL cl::program. Originally the goal was to have OpenCL kernels with inline functionality but this did not pan out – at least not yet. So I desided to go with separate ___asm4GCN and __kernel kernels. This works out well because both types of kernels can be combined in a cl::program with steams. Note: currently OpenCLwithGCN version only supports a single __asm4GCN. (see limitiations)

Using OpenCLwithGCN

To show how to use OpenCLwithGCN, I will walk through Example1.cs from the project.

The first step will be to add OpenCLwithGCN.exe and OpenCL.Net.dll as project references. Even though OpenCLwithGCN is an executable it still can be added as a reference. After the references are added then we can start writing some code.

We start out with creating some GCN assembly in one of a couple of ways. The first way we could do this is to just start writing assembly directly. Unless the programmer knows exactly what they are doing, this is very difficult because it is hard to determine what registers the parameters will use. Another way we can do this is to write a simple OpenCL kernel and then decompile it. After it is decompiled into assembly, we can copy and paste the code into our program and modify as needed.

For this example, I will use decompiled assembly from example1.cs (see below). The myAsmFunc(...) was originally created from the myOpenClFunc(...) below it and is not really used here. I just wanted to show where the GCN assembly came from. Also, while we are here, I would like to point out that we can have both normal OpenCL kernels and asm4GCN kernels in the same cl::program.

string source = @"
    __asm4GCN myAsmFunc ( float*, float* )
      #define _32Float_    0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]
      s_buffer_load_dword  s0, s[4:7], 0x04
      s_buffer_load_dword  s1, s[4:7], 0x18
      s_waitcnt            lgkmcnt(0)
      s_min_u32            s0, s0, 0x0000ffff
      s_buffer_load_dword  s4, s[8:11], 0x00
      v_mov_b32            v1, s0
      v_mul_i32_i24        v1, s12, v1 
      v_add_i32            v0, vcc, v0, v1
      v_add_i32            v0, vcc, s1, v0
      v_lshlrev_b32        v0, 2, v0
      s_load_dwordx4       s[12:15], s[2:3], 0x60 
      s_waitcnt            lgkmcnt(0)  
      v_add_i32            v1, vcc, s4, v0
      tbuffer_load_format_x v1, v1, s[12:15], _32Float_
      s_buffer_load_dword  s0, s[8:11], 0x04
      s_load_dwordx4       s[4:7], s[2:3], 0x68
      s_waitcnt            lgkmcnt(0) 
      v_add_i32            v0, vcc, s0, v0
      s_waitcnt            vmcnt(0) 
      v_add_f32            v1, v1, v1
      tbuffer_store_format_x v1, v0, s[4:7], _32Float_ 

     __kernel void myOpenClFunc ( __global float* cl_input, __global float* cl_output )
       size_t i = get_global_id(0);
       cl_output[i] = cl_input[i] + cl_input[i];
     }; ";

In the next step of our example, we initialize OpenClWithGCN and then compile the GCN. Also note that we grab the default environment created by OpenClWithGCN for later use.

OpenClWithGCN gprog = new OpenClWithGCN();
OpenClEnvironment env = gprog.env;
bool success = gprog.GcnCompile(source, out log);

At this point we are pretty much done with OpenClWithGCN. The rest of this example will use OpenCL.NET to setup OpenCL variables and then execute the kernels. I prefer OpenCL.Net because the wrappers are almost a line for line match for the c based OpenCL. Also, the rest of this example uses a modified OpenCL example written by Derek Gerstmann (UWA) in 2010. I tried to convert the example but I don't think I did it justice.

Next we create a Kernel from our modified Program. Most of the following functions have an ErrorCode output that is useful for debugging.

Kernel kernel = Cl.CreateKernel(env.program, "myAsmFunc", out err);
if (GetError(err, "CreateKernel()")) return;

Allocate the cl_input and cl_output device buffers, and fill with some random data…

IMem cl_input = Cl.CreateBuffer(env.context, MemFlags.ReadOnly, dataSz, out err);
IMem cl_output = (Mem)Cl.CreateBuffer(env.context, MemFlags.WriteOnly, dataSz, out err);

// create some random data for testing
var random = new Random();
const int count = 1024 * 1024;
const int dataSz = count * sizeof(float);
float[] data = (from i in Enumerable.Range(0, count)
               select (float)random.NextDouble()).ToArray();

// Copy our host buffer of random values to the cl_input device buffer...
err = Cl.EnqueueWriteBuffer(env.cmdQueue, cl_input, Bool.True, IntPtr.Zero,
    new IntPtr(dataSz), data, 0, null, out lastEvent);
InfoBuffer local = new InfoBuffer(new IntPtr(4));
err = Cl.GetKernelWorkGroupInfo(kernel, env.devices[0], KernelWorkGroupInfo.WorkGroupSize,
    new IntPtr(sizeof(int)), local, out nullPtr);

Set the arguments to our kernel, and enqueue it for execution

err = Cl.SetKernelArg(kernel, 0, new IntPtr(4), cl_input);
err = Cl.SetKernelArg(kernel, 1, new IntPtr(4), cl_output);
IntPtr[] workGroupSizePtr = new IntPtr[] { new IntPtr(count) };
IntPtr[] localGroupSizePtr = null;

Enqueue and run the kernel

err = Cl.EnqueueNDRangeKernel(env.cmdQueue, kernel, 1, null, workGroupSizePtr,
                              localGroupSizePtr, 0, null, out lastEvent);

Force command queue to get processed, wait until all commands finish

env.err = Cl.Finish(env.cmdQueue);

Read back the results

float[] results = new float[count];
env.err = Cl.EnqueueReadBuffer(env.cmdQueue, cl_output, Bool.True, IntPtr.Zero,
                              new IntPtr(dataSz), results, 0, null, out lastEvent);

That should be it. Now just print the results and verify them.

OpenCLwithGCN Project Files

The following is a list of the source files for this sub-project along with their description. 

  • OpenClEnvironment.cs (97) - This file contains classes related to holding theOpenCL environment.
  • OpenClWithGCN.cs (615) - This file contains the core functionality.
  • TextTemplate.cs (105) - This file contains a single static Expand() function the is responsible for all the Text Template Transformations.

Asm4GcnGUI Windows Interface Sub-Project

Asm4GcnGUI provides a graphical user interface (GUI) for working with GCN assembly. It is a better than a simple notepad in that it has syntax highlighting, code completion, and a few other extras. This app can be used for some quick kernels testing, playing around, or for educational purposes. The GUI also serves as an example on how to implement OpenCLwithGCN. When launched, it defaults to a simple example that can be used as a starting point for new projects. The interface has three windows: C# host code, kernel code, and a compiler output window at the bottom. 

In the host window, you will find the C# code that is run on the CPU.  The host window has everything that would make up a normal C# application so it can be dropped into a C# project.  To use OpenCL in C# a wrapper, OpenCL.Net is used. The wrapper wraps C based OpenCL functions in C# compatible functions.

The next window is the GCN Assembly code window. Normally __asm4GCN kernels would be with the rest of the host code, however the GUI splits this out so GCN specific syntax highlighting can be done. Behind the scenes, the text in the C# host and GCN Assembly windows are merged and then compiled together. The text in the GCN Assembly window is simply placed into a .CS file with namespace GCN_NS { static class Code{ public const string DevCode = [GCN assembly Text here]}}”. This allows the GCN Assembly text to be combined with the main host program code. 

The final window is the compiler output. This shows any errors or warnings that were generated while compiling the program. There are two sets of messages shown in here. First to show up are the Asm4GCN assembler messages followed by any C# host code errors. The output of the program does not show in this window – it shows up in a separate windows console window.

Features of the GUI Interface

  • Syntax Highlighting – For this project I used Pavel Torgashov’s Fast Colored Text box. It is a nice control that that can be customized to work with custom languages such as GCN Assembly. It provides syntax highlighting, cold folding, hotkey support, bracket highlighting, undo, printing support, and the list goes on and on. The syntax highlighting helps in readability and also with errors.  If a keyword is entered incorrectly it will not highlight.
  • Code completion – There is code completion for the GCN Assembly window. This helps with trying to remember what command you might be looking for. Again, thanks to Pavel Torgashov for this add-in.
  • Separate C# and GCN Assembly Code Windows – As mentioned already, this allows for language specific syntax highlighting. It also spits out the code in a logical way: CPU host code and a GPU code.

GPU Assembly Advantages/Disadvantages

Programming directly in GCN assembly has its advantages and disadvantages. These are my personal opinions based on my GPU programming experience.


  • Potential for efficient and fast code – if a programmer knows what they are doing they can often create much faster code with a smaller memory footprint. I would estimate that a skilled hand written kernel is between 2x-4x better performance.
  • Assembly can take advantage of more hardware instructions and features. Some hardware features and instructions are only accessible using assembly. Also, some features of instructions can only be utilized by using assembly.
  • Toying with GPU assembly helps programmers understanding the inner workings of a GPU. This will help when writing Cuda or OpenCL kernels.
  • When programming in assembly, a human programmer can often stuff items in memory more efficiently than a high-level compiler. Humans are more inventive than a compiler. =)


  • Low-level languages, like GCN Assembly or nVidia’s PTX, take much longer to program in then higher-level languages such as OpenCL and Cuda.
  • Difficult to maintain – following an assembly function is not as easy as higher level languages. Understanding what to change without breaking something can be time consuming. Usually smaller 5-20 liners don’t have this issue as much.
  • Low readability – Assembly by nature is not meant to help with high-level readability. The only readability it provides is allowing binary code to be human readable. Assembly has little refactoring and abstraction and this makes it really difficult to read and follow.
  • GPU Assembly can break with new generations of GPUs. Since it is designed for a particular chip it can break with future generations of GPUs.
  • Depending on how GPU assembly binaries are loaded, they can break on simple driver updates. The OpenCLwithGCN sub-project is very sensitive to driver versions.
  • OpenCL kernels work across different architectures including AMD GCN, nVidia, x86 and x64. Assembly kernels are locked into the AMD GCN 1.x architecture. Luckily, AMD GCN architectures have deep development cycles so GCN is not constantly changing.
  • Assembly is more bug prone - Higher level languages are designed to minimize common programming mistakes.
  • When programing in Assembly for performance, you’re competing against high-level compilers. Just because it’s written in Assembly does not mean it’s going to faster.

In general, based on personal experience, assembly is optimal for smaller time-critical sections of code. Large pieces of assembly can get complicated quickly if they are not well laid out. 1-50 lines of assembly is fine for critical sections but as a kernel gets more complicated the compiler starts to have the upper hand because it can keep track of things better.

GCN Assembly code writing tips:

  • Plan ahead – Maybe write your code in a higher-level language first (it’s almost like pseudo-code), Writing a function in a higher-level language accomplishes a few items:
    • It helps work out the details and helps you understand what you're writing fully.  For me, sometimes I think I know what is needed only to discover that I did not fully understand the problem. Or I come to discover that the function will not even work. Writing a function in assembly first is not a good place to be figuring out your code.  Coding your GPU function in OpenCL/Cuda first provides an outline to work off of. 
    • It gets the mind mentally ready. I often find having to writing a function a second time leads to a cleaner, more precise, and accurate function.
  • Know the GCN ISA manual - Go through the AMD GCN ISA programming manual to see what instructions are available that might aid in how things are done for your kernel.
  • Unroll your code to prevent jumps – jumps take extra compute cycles that can often be avoided.
  • Limit kernel size to fit in cache - Try to keep your kernels small enough to fit in the compute units share instruction cache, currently 32kb. (4000-8000 instructions) I learned this one from Realhet.
  • Limit register usage – The less registers that are used the more latency-hiding threads can run. One trick I have found to save on register usage is to move instructions up or down in your code to lower the overall register usage. Example: say on line 10, A and B are assigned values, then on line 20 they are first used with a “C=A+B“, and finally C is used on line 30. The register usage would be 2 variables from lines 10-20, and then 1 variable from lines 20-30. The total register usage can be written as (2 var * (20-10)) + (1 var * (30-20)) = 30. To save on registers we could move up “C=A+B” to line 11. This minimizes the time two registers are used. Then our usage would become (2 var * (11-10)) + (1 var * (30-11)) = 21. Basically we just freed up a single register from lines 11-20.  This trick only applies in assembly however because OpenCL and Cuda are smart enough to do this for you.
  • Just play and have fun with GCN assembly. The more you practice the better you will be.
  • Read online blogs, posts, and articles(like this one) =)


  • Does not support the VINTRP encoding format.
  • Does not work with many driver versions (see system requirements)
  • Only compatible with AMD display adapters with GCN technology (see system requirements)

Future wish list

  • Fix limitations
  • Setup unit testing / Additional testing
  • Add compatibility for additional GPU driver versions.
  • Inline Assembly – Inline assembly was an original goal but generating a dummy kernel for it was problematic.
  • Alterative friendly, easy to read, assembly statements. Instead of something like v_mul_i32_i24 varA, varB, varC maybe have something like varA = VarB * VarC. To do this we would need to know the types of the variable to select the correct instruction. This part is already in place.
  • Use variable type information (I,U,F,B) to throw warnings when used with non-matching instructions. i.e. using v4f with v_mul_i32_i24
  • Add and ability to access a particular register in a multi-register variable. For example, with s16b myVar there is not a way to access the second, third, or fourth register directly using the variable name.  In the future, index access might be used to access this register. i.e. myVar[2] to access second register.

System Requirements

  • An AMD Graphics card with GCN 1.x technology
  • AMD display driver versions 13.251.9001 or 14.501.1003 – most other versions will not work.

Feb 16 2015 - Initial Public Release
Mar 1 2015 - General Fixes and Changes

  • FixVariables would always use register 0
  • FixRemoved single __asm4GCN block limitation - There can now be many __asm4GCN kernels in one program.
  • ChangeParameter names removed - Since the Parameter names are not used having them there could be confusing.  Function headers are now in the form: __asm4GCN myAsmAddFunc (float*,float*){...}
  • Changemerged #ref command into normal variable declarations. Since the #ref command is almost identical to normal variable declarations except that it specifies a register it is best to combine these.  It is cleaner and less confusing. Instead of the format for a ref being #ref s8u myVar s[2:3]  is now just s8u myVar s[2:3].
  • Improvedenlarged the autocomplete box - it now fits the code a little better.
  • ImprovedCleaned up example code.
  • Improved: Sytax Highlighting - it now highlights, labels, registers, and defines.  It also highlights matching words.
  • RemovedAuto compile skip function.  This function would skip a re-compile if there were no changes in the code windows. It was removed because it added complexity in the code and there was hardly any performance benefit since the compile process is so fast anyway. 
  • Added: ren command - A rename command has been added.  This allows a variable to be renamed as its use changes.

Other Windows GUI GCN assemblers

HetPas Assembler - Written by Realhet, HetPas has been the only directly runnable GCN assembler for windows up to this point. I have used HetPas’s assembler for a couple of years now. It’s good for creating binary kernels or just playing around with GCN Assembly. The program implements a full and feature rich GCN Assembler using a Pascal like language for host code. It does require "Data Execution Prevention" to be turned off in windows and there are instructions on how to do this. HetPas has been updated several times and is getting more and more feature rich with each release. Recently variables were added as well. 

Other Open-Source GCN assemblers

GCN Assembler build by Daniel Bali. This is a C based CCN open-source assembler. Links:  GitHub,  OpenWall

CMINGCNASM by Sylvain Bertrand. C language minimal GCN assembler. Links:  GitHub,  GoogleCode

A special thanks too....

Daniel Bali, who built an excellent open source GCN assembler. This was my first compiler so I was looking for ideas on how to start. Daniel’s project gave me ideas on how I could tackle this. I took away some importent concepts on how to build an assembler from his project.

Realhet, who built a fully functional and feature rich assembler for windows called HetPas. Much of my GCN assembly skills came from toying with Realhet’s assembler and from reading his posts. I have been a Realhet fan for a while. He has made many insightful posts on AMD forums and on his WordPress site.

Ananth for providing OpenCL.Net, an excellent OpenCL wrapper for .Net. One reason I like this library is because the wrappers are very near plain OpenCL and also it is just OpenCL. 

Derek Gerstmann for his easy to follow and complete OpenCL example. Derek's OpenCL example is the default example that opens in Asm4GcnGUI. It has been coverted to C# OpenCL.NET.

Pavel Torgashov for the FastColoredTextBox editor and the Autocomplete menu. These controls add syntax highlighting and code completion richness to the GUI interface. 

AMD for making their GCN ISA manuals available. I spend many hours combing through the ISA documentation and referencing their data tables. 



This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Ryan S White
Help desk / Support
United States United States
Ryan White is an IT Coordinator, currently living in Pleasanton, California.

He earned his B.S. in Computer Science at California State University East Bay in 2012. Ryan has been writing lines of code since the age of 7 and continues to enjoy programming in his free time.

You can contact Ryan at s u n s e t q u e s t -A-T- h o t m a i l DOT com if you have any questions he can help out with.
Follow on   LinkedIn

Comments and Discussions

-- There are no messages in this forum --
| Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.150427.1 | Last Updated 17 Feb 2015
Article Copyright 2015 by Ryan S White
Everything else Copyright © CodeProject, 1999-2015
Layout: fixed | fluid