Introduction

So I wrote a fairly full-featured assembler for the Nvidia Maxwell architecture. This tool lets you code directly in the same "sass" language output by cuobjdump.

This all started when I was studying different sgemm implementations and trying to incorporate those techniques into some deep learning code I've been working on. I basically came to the conclusion that it was not possible to fully utilize the hardware I bought with the tools Nvidia provides. Nvidia, unfortunately, doesn't believe in eating their own dog food: they hand assemble their library routines rather than use ptxas like the rest of us have to. Ptxas badly manages register usage (especially when you're using vector memory ops), does a poor job of hiding memory latency with interleaved computation (particularly if you're trying to double-buffer your memory loads), and handles certain predicated memory operations badly (even when warp-uniform), among other things.

Anyway, the more I looked at the sass output of my code, the more I began to realize that it should be possible to figure out the op and control codes for all the instructions I was using and just assemble my own code. After a month or so of toil I have achieved that goal, and then some. I now find it far less frustrating to code in assembler and talk directly to the hardware than it is to code in CUDA C or PTX.

Here are the major features I put together (with more on the way):

  1. Register Allocation: You do this at the top of the file with a map of variable names to register numbers (see the mapping sketch after this list). This way you can write code that's easy to understand and not obscured by all the register numbers. But mainly it gives you absolute control over which registers are allocated, with zero register spilling. For performance code this is important because at the hardware level registers are banked, and some combinations give you higher throughput than others (and I'm talking hundreds of GFLOPS here). To help with this, the tool can automatically allocate registers to avoid bank conflicts, and it notifies you of conflicts that aren't averted so you can manually adjust your mapping. The tool also optimally manages .reuse flags for you during assembly (a new feature with Maxwell and CUDA 6.5). These flags further reduce bank conflicts, as well as register bank bandwidth and overall chip power draw.

  2. Scheduled Blocks: For a lot of your code you don't want to spend too much time optimizing the ordering and stalling of instructions, so I wrote a basic scheduler to do this for you (sketched after this list). This way you can focus on writing clear code that's easy to maintain. But for performance-critical blocks you can skip auto-scheduling and place your instructions very carefully to maximize throughput.

  3. Meta-Programming and Macros: I implemented this assembler in Perl and embedded the interpreter as a meta-programming tool (see the example after this list). This lets you keep your code nicely rolled up without a gazillion instructions to maintain, and makes it feel more like developing in a scripting language than in assembly. I've also added assembler macros for things like XMAD that need to be expanded into multiple instructions.

  4. Control Codes: Any instruction placed in a scheduled block has its required stall counts managed automatically to satisfy the pipeline depths of the particular instructions involved. But the other aspects of the control notation I deliberately don't manage for you: mainly the dependency barriers that memory operations use to signal when data is ready (see the notation sketch after this list). Managing these automatically is a hard problem, and one I feel is better left to the developer to work out. Setting these codes actually adds an interesting dimension to GPU programming that CUDA C or PTX doesn't expose.

  5. Disassembly: Sometimes you just want to slightly tweak a compiled program, and this tool makes that really easy to do. It can dump cubin code in an easy-to-edit format, and you can insert it right back in. In fact, the program isn't designed to work from scratch: I did not spend the time to completely dissect the cubin format, only enough to be able to edit a kernel in place. You need to start out with at least the shell of a kernel that defines the globals, shared memory, and parameters; the tool dumps that compiled cubin and you take it from there (see the workflow sketch after this list).
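
To make the register mapping concrete, here's a minimal sketch of what the section at the top of a file can look like. The variable names and register numbers are hypothetical, made up for illustration; see the included sgemm source for a real mapping:

```
<REGISTER_MAPPING>

    // illustrative only: accumulators pinned to fixed registers,
    // vector load targets kept contiguous and aligned
    0-7   : result<0-7>
    8-11  : loadA<0-3>
    12-15 : loadB<0-3>
    16    : count

</REGISTER_MAPPING>
```

After this, instructions refer to result0 or loadA2 instead of raw register numbers.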
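And a sketch of a scheduled block, using the same hypothetical names: you write the instructions in naive order, and the assembler reorders them and fills in the stall counts in the control fields for you:

```
<SCHEDULE_BLOCK>
// the control prefixes here are placeholders;
// the scheduler computes the real stall counts
--:-:-:-:1      FADD result0, result0, loadA0;
--:-:-:-:1      FADD result1, result1, loadA1;
--:-:-:-:1      IADD count, count, 0x1;
</SCHEDULE_BLOCK>
```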
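The meta-programming works by embedding plain Perl in CODE blocks; the string a block returns is spliced into the source before assembly. A small sketch, again with hypothetical register names:

```
<CODE>
    # emit four FFMAs at assembly time rather than writing them out by hand
    my $out;
    foreach my $i (0..3)
    {
        $out .= "--:-:-:-:1      FFMA result$i, loadA$i, loadB$i, result$i;\n";
    }
    return $out;
</CODE>
```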
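The control notation itself is a five-field prefix on every instruction: a wait-barrier mask, a read barrier, a write barrier, a yield flag, and a stall count. Here's a sketch of how the dependency barriers pair up (registers hypothetical):

```
// the load sets write barrier 1 to fire when its data arrives
--:-:1:-:1      LDG.E.128 loadA0, [trackA];
// the consumer waits on barrier 1 (mask 01) before touching loadA0
01:-:-:-:1      FFMA result0, loadA0, loadB0, result0;
```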
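Finally, a sketch of the round-trip workflow for editing a kernel in place: compile a shell kernel normally, extract it, edit, and insert it back. File and kernel names are placeholders; run maxas.pl with no arguments for the exact option list:

```
nvcc -arch sm_50 -cubin kernel.cu       # compile the shell kernel to a cubin
maxas.pl -l kernel.cubin                # list the kernels inside
maxas.pl -e kernel.cubin kernel.sass    # extract sass in an editable format
# ... edit kernel.sass ...
maxas.pl -i kernel.sass kernel.cubin    # assemble it back into the cubin
```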

There are lots of other little features to talk about, and you can explore them in the wiki. I wrote the tool in Perl, but I'll probably convert it to Python at some point (this seems like the perfect project for finally learning that language). As it is, I now find it pretty easy to write code that performs within 2% of theoretical throughput, which for GM204 is 4.9 TFLOPS (at default clocks). The best I was getting from bashing my head against ptxas was a VERY tenuous 75%.

The op code coverage is around 80% at this point: I can disassemble and reassemble all of cublas_device.lib with zero errors. But there's still more to do: more op codes (mainly surface, texture, and video instructions) and more microbenchmarks to fine-tune the scheduler. Anyone interested in contributing would be welcome.

Included is a sample sgemm implementation that runs at 98% of the theoretical throughput of Maxwell hardware, as much as 4.8% faster than Nvidia's hand-assembled cublas implementation. Also included is a simple framework for writing microbenchmarks to tune the scheduler. Here is a simple page to get you started. More documentation and features to come...

--Scott Gray
