Requirements: Visual Studio 2010-2015 and Cuda 7.0/7.5/8.0.
CudaPAD helps you optimize and understand nVidia’s Cuda kernels by displaying an on-the-fly view of the PTX/SASS that makes up the GPU kernel. CudaPAD simply shows the PTX/SASS output, but it has several visual aids that help show how minor code tweaks or compiler options affect the PTX/SASS.
What is PTX or SASS anyway? NVidia’s PTX is an intermediate language for NVidia GPUs. It sits close to pure GPU assembly (SASS) but is slightly abstracted. PTX is less tied to specific hardware or a hardware generation, which makes it more useful than assembly in most cases. One item it abstracts is physical register numbers, which makes it easier to use than assembly. Each PTX instruction is usually translated into one or more actual SASS hardware instructions. SASS is hardcore assembly. It is what the GPU actually runs and is directly translated into machine code. Viewing SASS code is more difficult but it does show exactly what the GPU will do. As mentioned, SASS code also works with the physical registers directly, so there is more control over where values are stored, but it’s another item that the programmer needs to keep track of and it makes SASS more difficult to work with.
Often when programming in Cuda, there is a need to view what a kernel’s PTX/SASS might look like and CudaPAD helps with this. There might be a need to view PTX/SASS for debugging, understanding what’s happening, squeezing a little more performance out of a kernel, or just for curiosity. To use the application, simply type or paste a kernel in the left panel and the right panel will display the corresponding disassembly information. Visual aids like Cuda-to-PTX code matching lines, PTX cleanup, WinDiff, and quick register highlighting are built in to help make the PTX easy to follow. Other on-the-fly information is also displayed, like register counts, memory usage, and error information.
With any piece of code, there are often several ways to perform the same thing. Sometimes, just modifying a line or two will lead to different machine instructions with better register and memory usage. Have fun and make some changes to a kernel in the left window and watch how the PTX/SASS changes on the right.
Just as a quick note: CudaPAD does not run any code. It is only for viewing PTX, SASS, and register/memory usage.
Like most of my projects, this one grew out of a personal need. For some algorithms I develop, GPU efficiency is important. One way to help with this is by understanding the low-level mechanics and making any necessary adjustments. Before creating this app, I would often get in a loop where I would write a performance-critical kernel and then view the PTX/SASS over and over using command line tools. Doing this repetitively was time consuming, so I decided to build a quick C# app that would automate the process.
It started out as a simple app that would take a kernel in the left window and then output the PTX to the right side window. This was accomplished by basically running the same command line tools as before, mainly nvcc.exe, but now in an automated fashion in the background. I got carried away, however, and within a short period of time I started adding several features including automatic re-compiling, WinDiff, visual code line markers, compile errors, and register/memory usage.
AMD used to have a similar tool for Brook+ and this gave me the idea of having the two-window app back in 2009 when I first built it. Basically the tool had a left window where a Brook+ kernel could be added and a right window where the assembly would be output. A button could be clicked to update the output window. AMD has had a couple of these over the years but they have since been replaced with AMD’s CodeXL.
AMD’s CodeXL and NVidia’s NSight have since replaced many tools like these; however, CudaPAD still has its place for quick, on-the-fly viewing of low-level assembly and experimentation. Both CodeXL and NSight are professional-grade free tools and are a must-have for GPU developers.
Requirements (updated 1/2017)
CudaPAD is simple to use. But before running it, make sure these system requirements are met:
A dedicated GPU is not required since we are only compiling code and not running anything.
If the requirements are met, then simply launch the executable. When CudaPAD loads, it will have a sample kernel. The sample provides a quick place to start playing around or even a starting framework for a new kernel. Whenever the kernel on the left is edited, it will update the PTX or SASS on the right. If there is a compile error, it will show that near the bottom.
There are several features that can be enabled/disabled. All are on by default (also see Features section).
PTX/SASS View Modes
Change the drop down textbox between PTX, SASS or SOURCE views.
PTX view – Shows the PTX intermediate language output of the kernel. PTX is close to SASS hardware instructions but is slightly higher level and is less tied to a particular GPU generation. Usually a PTX instruction translates directly to one SASS instruction; however, sometimes there are multiple SASS instructions per PTX instruction.
SASS view – These are true assembly instructions. These types of instructions execute directly on the GPU. Less visual information is supplied when viewing SASS than when viewing PTX. For example, the visual code lines do not show.
Raw code view – This view is mostly for debugging CudaPAD itself. Behind the covers, this app does not re-compile after every change. It only re-compiles when the code itself is modified, not when only comments or whitespace change. The raw code is a stripped-down version of the real code. I added this because I did not want it to keep compiling when I was adding/editing comments or adding/removing whitespace. That would not be resource friendly and would also throw off the WinDiff feature.
In the background, CudaPAD simply compiles the kernels with Cuda tools. The Cuda compiler then in turn calls a C++ compiler like Visual Studio. So to run CudaPAD, Cuda needs to be installed and most likely a C++ compiler like Visual Studio.
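To make the "raw code" idea concrete, here is a minimal Python sketch of a comment/whitespace-insensitive comparison. It is not CudaPAD's actual C# code; the function names strip_for_compare and needs_recompile are hypothetical. It also assumes that string literals containing // or /* do not matter for kernels.

```python
import re

def strip_for_compare(source: str) -> str:
    """Return the source with comments removed and whitespace runs collapsed."""
    # Remove /* ... */ block comments first, then // line comments.
    no_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    no_line = re.sub(r"//[^\n]*", " ", no_block)
    # Collapse every run of spaces, tabs and newlines to a single space.
    return " ".join(no_line.split())

def needs_recompile(old_source: str, new_source: str) -> bool:
    """True only when something other than comments/whitespace changed."""
    return strip_for_compare(old_source) != strip_for_compare(new_source)
```

Collapsing whitespace to a single space (rather than deleting it) is a deliberately conservative choice: "int x" never becomes "intx", at the cost of an occasional unnecessary compile when a space is added where there was none.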
Disabling the auto-compile is useful for making multiple changes before a compile. This can help show the changes in the diff (differencing) output over several changes. To do a manual compile, just click the green ‘start’ in the top right corner.
Under the Hood
Let's take a look at how this application works. I will present what happens when the left window is edited. This triggers a recompile and then updates the right PTX/SASS window. Here it is in steps:
- User enters in some Cuda in the left window.
- The textbox change kicks off a short term timer. If the user should type in any more text before that timer finishes, then the timer is reset. This system prevents the compile process from firing on every keystroke and lets the user finish typing before it automatically starts.
- When the timer completes an event is raised. In this event, we check to see if there were any changes that would require a re-compile. Obviously, if a user is just editing some comments or adding/removing whitespace, then we don't need to recompile. If there are no "code" changes, then we stop here. In the dropdown box, CODE can be selected to see what this cleaned up code looks like.
- We save the Cuda textbox to a file. This will be needed later when the Cuda compiler compiles it.
- We then clear any lines on the screen as we are going to draw new ones soon.
- We then call a batch file that does most of the compiling. This batch file is generated based on the options selected in CudaPAD. If the user has the sm_35 architecture selected, then this option is appended to the nvcc line. If the user selects an optimization level of three, then -O3 is appended. If SASS output is requested, then the CuObjDump command is appended. Here is what the batch file does:
  - Performs some cleanup in the temp folder from the last time a compile was done.
  - Calls NVidia's Cuda compiler with some options:
    nvcc.exe -keep -cubin --generate-line-info ...
    This command compiles the Cuda file into a cubin file (device code). The -keep option keeps the PTX files as well, and --generate-line-info records the line numbers of the source file so we can draw the lines.
  - If SASS is selected from the dropdown, runs CuObjDump.exe to disassemble the cubin device file into SASS code.
  - Lastly, captures any output messages from these commands to info.txt.
- Next, we fill the info textbox that has the register and memory utilization information.
  - We extract this from the info.txt output log that the batch file created.
  - We then grab the global, constant, stack, and shared memory byte counts, register spill information, register usage and general log information using RegEx.
  - This info is then formatted and displayed in the informational window.
- Next, any errors/warnings are captured from the rtcof.dat file, formatted, and placed in the error window.
- We then grab the text from the outputted data.ptx (from nvcc.exe) and compare it to the PTX already in the window using a diff algorithm. The final result of the diff function is the new PTX with what changed in the form of comments. I chose to put the change information in comments so that if the text is copied to another program, it will still run.
- Next, we store the position of the scrollbars and caret location for the PTX/SASS window. This is needed because after we re-fill the output window with text, we are going to want to restore these.
- Next we grab the line information from the PTX and store that. The line numbers will be needed later to draw the connecting lines. The line information is in the form of ".loc # ## #" statements. Any line information is then deleted from the PTX so that it is not displayed.
- Do some cleanup on the PTX to make it look all nice and dandy.
- Next, we draw the visual code lines.
  - Previously, we saved the line number information for each location specified in the output PTX file. Example: On line 45 of the PTX we might have had a .loc 1 20 1. The 20 here would be the source line, so a line would be drawn from line 20 in the source to line 45 in the PTX window.
  - Next, we get the indentation for each line. This is done by counting the whitespace (spaces/tabs) before each word. This is needed so the lines start or end where the code starts instead of just at the beginning of the line.
  - Using the textbox height/width plus the current scroll positions for each window plus the indentation and line number of each line, we then draw the lines.
- Finally, we restore the scroll positions and caret location.
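The batch-file step above can be sketched as a small command builder. This Python sketch is only an illustration and not CudaPAD's generator, which is written in C#. The function name build_compile_commands, the file names data.cu and data.cubin, and the exact spelling of the appended options (-arch=..., -o, cuobjdump -sass) are assumptions. Only -keep, -cubin, --generate-line-info and -O3 come from the text.

```python
def build_compile_commands(cu_file, arch=None, opt_level=None,
                           want_sass=False, cubin_file="data.cubin"):
    """Return a list of command lines (each a list of arguments)."""
    # Fixed options named in the article.
    nvcc = ["nvcc.exe", "-keep", "-cubin", "--generate-line-info"]
    if arch:                      # e.g. "sm_35" (flag spelling is an assumption)
        nvcc.append(f"-arch={arch}")
    if opt_level is not None:     # e.g. 3 -> "-O3"
        nvcc.append(f"-O{opt_level}")
    nvcc += ["-o", cubin_file, cu_file]   # hypothetical file names
    commands = [nvcc]
    if want_sass:
        # Disassemble the cubin into SASS only when the SASS view is selected.
        commands.append(["cuobjdump.exe", "-sass", cubin_file])
    return commands
```

For example, build_compile_commands("data.cu", arch="sm_35", opt_level=3, want_sass=True) yields two command lines: the nvcc call followed by the cuobjdump call.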
Visual Code Lines
These lines match up the Cuda source code to the PTX output. They help the programmer quickly identify what Cuda code matches up with what PTX. This function can be enabled or disabled by clicking the lines icon in the top of the PTX window.
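As a rough illustration of where the line endpoints come from, the Python sketch below pulls ".loc file line column" statements out of a PTX listing and records which source line each one points to. It is not the C# code CudaPAD uses. The name split_loc_lines is hypothetical, and linking each .loc to the first non-blank line that follows it in the cleaned listing is an assumption.

```python
import re

# Matches lines such as ".loc 1 20 1" (file index, source line, column).
LOC_RE = re.compile(r"^\s*\.loc\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

def split_loc_lines(ptx: str):
    """Return (ptx_without_loc, links); links holds (source_line, ptx_line), 1-based."""
    kept, links = [], []
    pending_source_line = None
    for line in ptx.splitlines():
        match = LOC_RE.match(line)
        if match:
            # Remember the source line and drop the .loc statement itself.
            pending_source_line = int(match.group(2))
            continue
        kept.append(line)
        if pending_source_line is not None and line.strip():
            # Attach the remembered source line to this displayed PTX line.
            links.append((pending_source_line, len(kept)))
            pending_source_line = None
    return "\n".join(kept), links
```

Each (source_line, ptx_line) pair is one connecting line to draw. The pixel positions would still need the line height, scroll offsets, and indentation described in the steps above.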
Auto PTX Refresh
When needed, the application will automatically re-generate the PTX code. It does not do this on each text change in the source window but rather when the stuff that matters changes. Many items are stripped from the source text that do not impact the output such as comments or spaces. The Auto Update function can be enabled or disabled by clicking the auto update icon in the top of the PTX window.
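The restartable timer that delays the refresh until typing pauses is commonly called a debounce. The Python sketch below shows that idea with an explicit clock so it is easy to reason about. The Debouncer class and its method names are hypothetical, and CudaPAD itself presumably uses a UI timer in C# rather than this polling style.

```python
class Debouncer:
    """Fires once after `delay` time units with no further edits."""

    def __init__(self, delay):
        self.delay = delay
        self.deadline = None      # None means nothing is pending

    def touch(self, now):
        """Record an edit at time `now`; this restarts the countdown."""
        self.deadline = now + self.delay

    def poll(self, now):
        """Return True exactly once when the quiet period has elapsed."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            return True
        return False
```

With a 500 ms delay, edits at t=0 and t=300 produce a single firing at t=800. At that point the comment/whitespace check would decide whether a compile is really needed.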
Built-in Diff utility
Each time the PTX output changes, CudaPAD automatically runs a differencing algorithm against the previous output. The notes are added in such a way that they do not impact the runnability of the code. I decided to add the diff information inside of comments in the event the user wants to copy and paste the code. I came up with a system of using // style comments on deleted lines and a /*new*/ comment on new lines. The // comments disable the entire line while the /*new*/ does not.
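The comment-based marking can be sketched in a few lines of Python. This sketch swaps in the standard library's difflib.SequenceMatcher rather than the Myers-based Diff.cs that CudaPAD uses, so the exact grouping of changes may differ. The function name annotate_diff and the placement of /*new*/ at the start of the line are assumptions.

```python
import difflib

def annotate_diff(old_ptx: str, new_ptx: str) -> str:
    """Return the new PTX with removed lines as // comments and added lines tagged /*new*/."""
    old_lines = old_ptx.splitlines()
    new_lines = new_ptx.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # Old lines stay visible but are disabled by a line comment.
            out += ["// " + line for line in old_lines[i1:i2]]
        if tag in ("insert", "replace"):
            # New lines stay active; the block comment only marks them.
            out += ["/*new*/ " + line for line in new_lines[j1:j2]]
        if tag == "equal":
            out += new_lines[j1:j2]
    return "\n".join(out)
```

Because deleted lines are fully commented out and new lines carry only an inline block comment, the annotated text should still assemble if it is pasted elsewhere.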
Single-Click Multiple Highlighting (new in 2016)
Just click on any register or word in the PTX window and it will highlight all instances of that item. Click on another and it will highlight those as well with a different color. Click on any highlighted item and it will un-highlight all instances of that item. With just three clicks the following can be achieved:
Syntax Highlighting and Output Formatting
The ScintillaNET textbox control by Jacob Slusser has some convenient text highlighting abilities that visually helps when viewing code. Originally, this started out as a plain textbox, then moved to another 3rd party control and then finally to the ScintillaNET control. This results in more colorful and cleaner looking code.
Besides the text highlighting, the text in the output window is formatted so it’s a little cleaner. Things like compiler information and header information are removed:
- remove unneeded comment
- remove unneeded id: comment
- remove empty "//" comments
- shorten __cudaparam_
- shorten labels
- remove .loc 15 lines (e.g. “.loc 3 3431 3”)
- remove "%" in front of registers (New as of Jan. 2016)
- remove "// Inline" lines (New as of Jan. 2016)
- remove .file 1 "C:\\....." (New as of Jan. 2016)
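A few of the cleanup rules above are easy to express as line filters. The Python sketch below covers only the .loc, .file, empty comment, "// Inline" and "%" rules, and it is an approximation rather than CudaPAD's C# cleanup code. The name clean_ptx and the exact patterns are assumptions.

```python
import re

def clean_ptx(ptx: str) -> str:
    """Apply a subset of the display cleanup rules to a PTX listing."""
    cleaned = []
    for line in ptx.splitlines():
        stripped = line.strip()
        if re.match(r"\.loc\s+\d+\s+\d+\s+\d+", stripped):
            continue                      # line-number statements
        if stripped.startswith(".file"):
            continue                      # source file declarations
        if stripped == "//":
            continue                      # empty comments
        if stripped.lower().startswith("// inline"):
            continue                      # inline markers
        # Drop the "%" that prefixes register names such as %r1 or %tid.x.
        cleaned.append(re.sub(r"%(?=[A-Za-z_])", "", line))
    return "\n".join(cleaned)
```

Note that removing "%" makes the listing easier to read but means the displayed text is no longer valid PTX as-is.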
Example of highlighted and cleaned up output formatting is as follows:
Online Error/Warning Search
Often when running across an error, it is helpful to do a quick online search. I found I was often opening a browser and then copying and pasting the error into a search box. This was not efficient so I added a search online function. I think this was one of the first of its kind when it was released in 2009, but I have since seen other IDEs add this.
Points of Interest
I had a little fun creating this. This is probably why so much time was put into this.
Getting the code lines to work was exciting for me. I believe the visual code lines might have been one of the first of their kind when I built this in 2009 but I am not sure. This was a wild idea I had and I was not sure if I could get it working. Drawing moving lines on the screen is not that easy as I found out as there always seemed to be some side effects. Drawing the spline was the easy part but all the miscellaneous stuff like cleaning it up was more difficult. Another difficult part was calculating the location in the text box. The textbox line height and line number must be known for each spline drawn. I’m not a graphics developer so I am just happy to get it to work! The visual lines turned out better than expected and are fun to play with.
At the time, I dreamed up many different “line” ideas to help break down the assembly but none of the others have been implemented yet:
Note: These other features have NOT been added to CudaPAD. (at least not at this time)
- Draw curved lines that show jumps. Upward jumps are in a lighter color and downward jumps are in a darker color.
- Click on a register and it would display lines where a register impacts. Dark lines for the actual places the register is used. Gray for registers it impacts. And light gray for registers it impacts after two instructions. This would have been similar to Excel’s Trace Precedents / Trace Dependents function.
- One other feature that I wanted to create but never got a chance to would have been a registers used function. This really helps understand where a kernel is maxing out on the register usage and often limits a kernel. When a register is used for the last time, it is freed after that instruction.
Advantages of Viewing PTX/SASS
Here are some advantages of viewing PTX...
- Curiosity - This is what I use it most for. Sometimes I just want to see what is going on at the lower levels and how small changes impact the code. This can be a very useful tool for trying to learn PTX/SASS and the Cuda compiler.
- Software bug - Trying to figure out that annoying bug. Is it a compiler bug or is it something with my code? Sometimes viewing the machine instructions can aid in understanding an unexpected result.
- Changing up a line or two often produces different results. When a kernel needs some performance optimization, toying with different ways of doing the same thing can produce more efficient code. One example that comes to mind: I found that when using a union, the PTX would always go through local memory. This was a while ago so it might not be true anymore but here is the example:
local .align 4 .b8 someLocMem;
st.local.s32 [someLocMem], someIntReg; <--very expensive
ld.local.f32 someFloatReg, [someLocMem]; <--very expensive
However, when using something like:
"int strangeInt = *(int*) &somefloat;"
the output looks like this:
mov.b32 someFloatReg, someIntReg;
This is easily spotted in CudaPAD because of the quick feedback and visual markers.
- Does the code do nothing? Several times in the past, I realized that my kernel had a bug because when I changed or deleted some code nothing changed in the PTX output. I thought to myself, how could this be? The reason why PTX might not show up is that the compiler often removes useless code that does not do anything. As I found out, this is more common than I expected because I ran into it a couple of times. This is usually caused by a bug but it could also just be pointless code. In most cases, code that is optimized out should either be removed or fixed. Noticing this can help find some hidden errors in a program.
Just as a word of caution, try not to go optimization crazy. Optimization does have its place for particular functions that get run often; however, optimization can make code less readable, awkward, and more difficult to maintain. Also, time should only be spent on code where a performance increase would have a large impact. There is much more on this subject that I will not get into.
Videos (updated in 2016)
Below is a quick tutorial video. The sub-menu options did not show properly in the video but I explain what I am clicking on so hopefully you can still follow along.
CudaPAD won a poster spot at the 2016 GPU Technology Conference. Even better than that it was also selected as one of the top 20! At the conference, I gave a short presentation to about 100-150 people on April 4th 2016.
Here are some wish list items I have that may or may not be added in the future:
- Isolate the implementation code from interface code using the bridge pattern. While the GUI and code are somewhat split in different files right now, they are not really separable. It’s often good practice to split this up.
- Add the ability to execute the code for timing purposes. Right now PTX can be visually inspected but not benchmarked.
- Add a per-line register usage counter. Basically what this would require is to keep track of how many variables are being used on each PTX line. A GPU has a fixed number of registers and knowing where the register pressure is highest can help programmers balance their code. This is something I added in to my AMD GPU compiler, ASM4GCN, but have not added it here.
- Add jump lines to the PTX so one can easily see where a jump statement lands.
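The per-line register usage counter on this wish list could start from something like the Python sketch below. It is only a rough idea of the approach and not an existing CudaPAD feature. It assumes straight-line code with no branches, treats a register as live from its first to its last appearance, and recognises only nvcc-style virtual register names (r, rd, rs, f, fd, p, h followed by digits). The name live_register_counts is hypothetical.

```python
import re

# Register-like tokens, with or without a leading "%". The lookbehind keeps
# type suffixes such as ".f32" from being mistaken for registers.
REG_RE = re.compile(r"(?<![.\w])%?(?:rd|rs|fd|r|f|p|h)\d+\b")

def live_register_counts(ptx_lines):
    """Return, for each line, how many registers are live on that line."""
    first_use, last_use = {}, {}
    for index, line in enumerate(ptx_lines):
        for match in REG_RE.finditer(line):
            reg = match.group(0).lstrip("%")
            first_use.setdefault(reg, index)
            last_use[reg] = index
    counts = [0] * len(ptx_lines)
    for reg, start in first_use.items():
        # A register is counted on every line from first to last appearance.
        for index in range(start, last_use[reg] + 1):
            counts[index] += 1
    return counts
```

These counts describe virtual PTX registers only. The final physical register count is decided later when the PTX is assembled to SASS, so the numbers are best read as a hint about where pressure peaks rather than an exact figure.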
A Special Thanks to....
- Diff functionality - This is a nice drop-in C# file that provides quality diff functionality. The algorithm was originally published by Eugene Myers in 1986 and was converted into C# by Matthias Hertel. The mostly un-edited source is in the file Diff.cs.
- ScintillaNET - This nice tool provides the text highlighting for this project. It is a Windows Forms control, wrapper, and bindings for the versatile Scintilla source code editing component. It really adds a lot of life to this project.
- nVidia - In 2016, CudaPAD won a poster spot at the 2016 GPU Technology Conference. Moreover, it was selected as an honorable mention (top 20). I presented it to an audience of around 100-150 people on a super large projector screen. It was a wonderful experience - one of the best I ever had.
- Dec 2009 – Initially built
- Jan 2013 – Changed the code textbox to use ScintillaNET for better syntax highlighting
- Nov 2014 – Updated for NVidia Cuda 6.0/6.5
- June 2015 – Code released to the public; changed to MIT License; updated for Cuda 6.5/7.0
- Jan 2016 – Added a single-click multiple highlighting search feature; Updated for Cuda 7.0/7.5.
- Jan 2017 – Verified okay with Cuda 8.0