Skip to content

Latest commit

 

History

History
180 lines (142 loc) · 8.3 KB

NVBIT_README.md

File metadata and controls

180 lines (142 loc) · 8.3 KB

NVBit (NVidia Binary Instrumentation Tool)

NVIDIA Corporation

NVBit is covered by the same End User License Agreement as that of the NVIDIA CUDA Toolkit. By using NVBit you agree to End User License Agreement described in the EULA.txt file.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing

Introduction

NVBit (NVidia Binary Instrumentation Tool) is a research prototype of a dynamic binary instrumentation library for NVIDIA GPUs.

NVBit provides a set of simple APIs that enable writing a variety of instrumentation tools. Example of instrumentation tools are: dynamic instruction counters, instruction tracers, memory reference tracers, profiling tools, etc.

NVBit allows writing instrumentation tools (which we call NVBit tools) that can inspect and modify the assembly code (SASS) of a GPU application without requiring recompilation, thus dynamic. NVBit allows instrumentation tools to inspect the SASS instructions of each function (__global__ or __device__) as it is loaded for the first time in the GPU. During this phase is possible to inject one or more instrumentation calls to arbitrary device functions before (or after) a SASS instruction. It is also possible to remove SASS instructions, although in this case NVBit does not guarantee that the application will continue to work correctly.

NVBit tries to be as low overhead as possible, although any injection of instrumentation function has an associated cost due to saving and restoring application state before and after jumping to/from the instrumentation function.

Because NVBit does not require application source code, any pre-compiled GPU application should work regardless of which compiler (or version) has been used (i.e. nvcc, pgicc, etc).

Requirements

  • SM compute capability: >= 3.5 && <= 8.6
  • Host CPU: x86_64, ppc64le, aarch64
  • OS: Linux
  • GCC version : >= 5.3.0 for x86_64; >= 7.4.0 for ppc64le and aarch64
  • CUDA version: >= 10.1
  • CUDA driver version: <= 510.xx

Currently no Embedded GPUs or ARMs host are supported.

Getting Started with NVBit

NVBit is provided in a .tgz file containing this README file and three folders:

  1. A core folder, which contains the main static library libnvbit.a and various headers files (among which the nvbit.h file which contains all the main NVBit APIs declarations).
  2. A tools folder, which contains various source code examples of NVBit tools. A new user of NVBit, after familiarizing with these pre-existing tools will typically make a copy of one of them and modify appropriately.
  3. A test-apps folder, which contains a simple application that can be used to test NVBit tools. There is nothing special about this application, it is a simple vector addition program.

To compile the NVBit tools simply type make from inside the tools folder (make sure nvcc is in your PATH). Compile the test application by typing make inside the test-apps folder. Note: if you are making your own tool, make sure you link it to c++ standard library, which is required by NVBit, otherwise, you might see missing symbol errors. nvcc does it by default, but if you specify your own host compiler using nvcc -ccbin=<compiler>, you need to point to a c++ compiler or add -lstdc++.

Using an NVBit tool

Before running an NVBit tool, make sure nvdisasm is in your PATH. In Ubuntu distributions this is typically done by adding /usr/local/cuda/bin or /usr/local/cuda-"version"/bin to the PATH environment variable.

To use an NVBit tool we simply LD_PRELOAD the tool before the application execution command. Alternatively, you can use CUDA_INJECTION64_PATH instead if LD_PRELOAD does not work for you. Because some workloads, such as pytorch would overwrite LD_PRELOAD internally, making the NVBit tool not loaded.

NOTE: NVBit uses the same mechanism as nvprof, nsight system, and nsight compute, thus they cannot be used together.

For instance if the application vector add runs natively as:

./test-apps/vectoradd/vectoradd

and produces the following output:

Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)

we would use the NVBit tool which performs instruction count as follow:

LD_PRELOAD=./tools/instr_count/instr_count.so ./test-apps/vectoradd/vectoradd

or

CUDA_INJECTION64_PATH=./tools/instr_count/instr_count.so ./test-apps/vectoradd/vectoradd

The output for this command should be the following:

------------- NVBit (NVidia Binary Instrumentation Tool) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
-----------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        KERNEL_BEGIN = 0 - Beginning of the kernel launch interval where to apply instrumentation
          KERNEL_END = 4294967295 - End of the kernel launch interval where to apply instrumentation
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
kernel 0 - vecAdd(double*, double*, double*, int) - #thread-blocks 98,  kernel instructions 50077, total instructions 50077
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)

As we can see, before the original output, there is a print showing the kernel call index "0", the kernel function prototype "vecAdd(double*, double*, double*, int)", total number of thread blocks launched in this kernel "98", the number of executed instructions in the kernel "50077", and for the all application "50077".

When the application starts, also two banners are printed showing the environment variables (and their current values) that can be used to control the NVBit core or the specific NVBit Tool. Mostly of the NVBit core environment variable are used for core debugging/development purposes. Set the environment value NOBANNER=1 to disable the core banner if that information is not wanted.

Examples of NVBit Tools

As explained above, inside the tools folder there are few example of NVBit tools. Rather than describing all of them in this README file we refer to comment in the source code of each one them.

The natural order (in terms of complexity) to learn these tools is:

  1. instr_count: Perform thread level instruction count. Specifically, a function is injected before each SASS instruction. Inside this function the total number of active threads in a warp is computed and a global counter is incremented.

  2. opcode_hist: Generate an histogram of all executed instructions.

  3. mov_replace: Replace each SASS instruction of type MOV with an equivalent function. This tool make use of the read/write register functionality within the instrumentation function.

  4. instr_countbb: Perform thread level instruction count by instrumenting basic blocks. The final result is the same as instr_count, but mush faster since less instructions are instrumented (only the first instruction in each basic block is instrumented and the counter).

  5. mem_printf: Print memory reference addresses for each global LOAD/STORE using the GPU side printf. This is accomplished by injecting an instrumentation function before each SASS instruction performing global LOAD/STORE, passing the register values and immediate used by that instruction (used to compute the resulting memory address) and performing the printf.

  6. mem_trace: Trace memory reference addresses. This NVBit tool works similarly to the above example but instead of using a GPU side printf it uses a communication channel (provided in utils/channel.hpp) to transfer data from GPU-to-CPU and it performs the printf on the CPU side.

We also suggest to take a look to nvbit.h (and comments in it) to get familiar with the NVBit APIs.