The `cuda_events` tool supports these options:

- `-p PATH`: Specify the path to the CUDA runtime library or application
- `-d PID`: Trace only the specified process ID

## Learning Objectives

Through this tutorial, you'll learn:

1. How CUDA applications interact with GPUs through the CUDA runtime API
2. How to use eBPF uprobes to trace user-space libraries
3. How to design efficient data structures for kernel-to-user communication
4. How to process and display traced events in a user-friendly format
5. How to filter events by process ID for focused debugging

## Next Steps

Once you're comfortable with this basic CUDA tracing tool, you could extend it to:

4. Create visualizations of CUDA operations for easier analysis
5. Add support for other GPU frameworks like OpenCL or ROCm

For more details about the CUDA example and tutorial, you can check out our repo and the code at <https://github.com/eunomia-bpf/basic-cuda-tutorial>.
## References

- CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- NVIDIA CUDA Runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/
- Linux uprobes Documentation: https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt

## Benchmarking Tracing Overhead

While tracing is an invaluable tool for debugging and understanding CUDA applications, it does introduce some overhead. We've included a benchmarking tool to help you measure this overhead.

### The Benchmark Tool

The `bench.cu` program performs several CUDA operations repeatedly and measures their execution time:

1. Memory allocation (`cudaMalloc`)
2. Memory transfers (host to device and device to host)
3. Kernel launches
4. Memory deallocation (`cudaFree`)
5. Full operations (the complete sequence)

Each operation is executed many times to get statistically significant results, and the average time per operation is reported in microseconds.

### Running the Benchmark
528
-
529
-
To build the benchmark tool:
530
-
531
-
```bash
532
-
make bench
533
-
```
534
-
535
-
To run a complete benchmark that compares performance with and without tracing:
536
-
537
-
```bash
538
-
make benchmark
539
-
```
540
-
541
-
This will run the benchmark twice:
542
-
1. First without any tracing
543
-
2. Then with the CUDA events tracer attached
544
-
545
-
You can also run individual benchmarks:
546
-
547
-
```bash
548
-
# Without tracing
549
-
make benchmark-no-trace
550
-
551
-
# With tracing
552
-
make benchmark-with-trace
553
-
```
554
-
555
-
### Interpreting the Results
556
-
557
-
The benchmark output shows the average time for each CUDA operation in microseconds. Compare the times with and without tracing to understand the overhead.
558
-
559
-
For example:
560
-
561
-
```
562
-
# Without tracing
563
-
cudaMalloc : 23.45 µs per operation
564
-
cudaMemcpyH2D : 42.67 µs per operation
565
-
cudaLaunchKernel : 15.89 µs per operation
566
-
cudaMemcpyD2H : 38.12 µs per operation
567
-
cudaFree : 10.34 µs per operation
568
-
Full Operation : 130.47 µs per operation
569
-
570
-
# With tracing
571
-
cudaMalloc : 25.12 µs per operation
572
-
cudaMemcpyH2D : 45.89 µs per operation
573
-
cudaLaunchKernel : 17.23 µs per operation
574
-
cudaMemcpyD2H : 41.56 µs per operation
575
-
cudaFree : 11.78 µs per operation
576
-
Full Operation : 141.58 µs per operation
577
-
```
578
-
579
-
In this example, tracing adds roughly 7-14% overhead to the individual CUDA operations, and about 8.5% to the full sequence. This is typically acceptable for debugging and profiling purposes, but it's important to be aware of this impact when interpreting the results.
### Optimization Opportunities

If you find the tracing overhead too high for your use case, there are several ways to reduce it:

1. Trace only specific CUDA functions that are relevant to your investigation
2. Filter by specific process IDs to minimize the number of events captured
3. Disable return probes using the `-r` flag if you don't need return values
4. Consider running eBPF in user-space with tools like [bpftime](https://github.com/eunomia-bpf/bpftime) to reduce context-switching overhead

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.