
[ThunderFX report] Provides modular Timer interface #1757

kiya00 opened this issue Feb 10, 2025 · 6 comments · May be fixed by #1759


kiya00 commented Feb 10, 2025

We'll want to focus here in the next PR. FYI @kshitij12345. A couple of notes:

  • We should probably improve our modularity so that we can just pass a callable to a timer class to generate timings, instead of timing both torch.compile and nvfuser simultaneously and explicitly
  • We should use @kshitij12345's learnings from his analysis of kernel times to implement a kernel timer class or function that's a subclass of either our own timer class or some generic timer class so it can accept arguments like warmups and # iterations
  • We should also have a wall timer class or function
  • We should probably report statistics of the non-warmup runs performed, like mean, median, and stdev of the distribution (which I suppose requires assuming the distribution is a normal distribution)

Originally posted by @mruberry in #1747 (comment)
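
A minimal sketch of the kind of interface these notes describe (all names here, like Timer, warmup_iters, and timed_iters, are hypothetical and only illustrate the shape, not an agreed-on API):

import statistics
import time

class Timer:
    # Hypothetical base class: times an arbitrary callable and reports
    # statistics over the non-warmup runs.
    def __init__(self, warmup_iters: int = 5, timed_iters: int = 20):
        self.warmup_iters = warmup_iters
        self.timed_iters = timed_iters

    def _time_once(self, fn) -> float:
        # Wall-clock measurement; a kernel timer subclass would override this.
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    def time(self, fn) -> dict:
        for _ in range(self.warmup_iters):
            fn()
        samples = [self._time_once(fn) for _ in range(self.timed_iters)]
        return {
            "mean": statistics.mean(samples),
            "median": statistics.median(samples),
            "stdev": statistics.stdev(samples),
        }

# Usage: stats = Timer(warmup_iters=5, timed_iters=20).time(lambda: fn(*args))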


kiya00 commented Feb 10, 2025

Hi @kshitij12345 @mruberry, I have a few questions regarding the requirements.
Currently we use:

  • torch.utils.benchmark.Timer(stmt='pass', setup='pass', global_setup='', timer=<built-in function perf_counter>, globals=None, label=None, sub_label=None, description=None, env=None, num_threads=1, language=Language.PYTHON) for wall time
  • triton.testing.do_bench(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, return_mode='mean') for kernel time

@kshitij12345, based on your experience analyzing kernel times, is this sufficient for our needs?

We should probably improve our modularity.

Regarding the modular timing interfaces, does this mean we should provide a wrapper class (or function) that internally uses the above tools and offers a unified, consistent interface?
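
For context, a minimal sketch of how the two tools are invoked directly today (fn and inputs stand for an already-compiled callable and its inputs; the argument values are illustrative, not a proposal):

import torch.utils.benchmark as benchmark
import triton.testing

# fn and inputs are assumed to already exist (e.g. a compiled graph and its args).

# Wall time: torch.utils.benchmark.Timer returns a Measurement with mean/median.
wall_timer = benchmark.Timer(stmt="fn(*inputs)", globals={"fn": fn, "inputs": inputs})
wall_measurement = wall_timer.timeit(number=100)

# Kernel time: triton.testing.do_bench returns milliseconds (mean over reps here).
kernel_ms = triton.testing.do_bench(
    lambda: fn(*inputs), warmup=25, rep=100, return_mode="mean"
)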

@kshitij12345

@kshitij12345, based on your experience analyzing kernel times, is this sufficient for our needs?

Regarding kernel times, I observed that the triton.testing.do_bench timings did not match the timings from nsys profile or nsys nvprof for fusions with a small runtime (less than 100 microseconds), and they also didn't seem to be consistent across runs for a few examples. (I've attached a couple of examples and can share more offline.)

Potential alternative: @kevinstephano has pitched the idea of using torch.profiler.profile to measure kernel times (example code: https://github.com/kevinstephano/qwen2_lora_blocks/blob/5fcb1b13ce74efbf876c99f5cc4f2b21cb4f8684/utils/runner.py#L146-L159). I have tried it on a few examples and it seems to work well even for low-runtime fusions, though it needs to be investigated more.

from torch.profiler import profile, ProfilerActivity

def kernel_time_with_torch_profiler(callable, inputs):
    # TODO - Add `warmup` and `rep` arguments to configure this if required.
    # Simple warm-up
    callable(*inputs)

    time = 0.0
    # ProfilerActivity.CUDA records the CUDA kernels launched by the call.
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        callable(*inputs)

    # Sum the device (GPU) time of all recorded events, in microseconds.
    for evt in prof.events():
        time += evt.device_time
    # Also return the profile object in case anyone wants to investigate (not required).
    return time, prof

Also pinging @ptrblck in case you have other ideas (we were planning to sync on this).

small_runtime_fusion_1.py.txt
small_runtime_fusion_2.py.txt

@mruberry

Regarding the modular timing interfaces, does this mean we should provide a wrapper class (or function) that internally uses the above tools and offers a unified, consistent interface?

It would be nice to create timing support for each of the three objects we're considering (FX graphs, Thunder FX graph segments, fusions): a base method that can accept some kind of timer object, which the derived classes can then call.

Ideally, all of these objects create an FX graph, and the generic base method allows for the specification of

  • a mechanism to compile the FX graph into a callable
  • a timer to time the execution of the callable

First, specifying a "mechanism to compile the FX graph" would generalize the current requests to compile the graph with torch.compile, thunder, or eager. We could create an object, like "CompilerSpecification", that describes how to call torch.compile or thunder (or anything else) to produce a callable. This should allow for compiling with torch.compile or thunder, and also for creating a thunder program and extracting an nvFuser fusion as a callable, which could be useful when testing nvFuser fusions. This object might require defining two methods:

  • a method in Python that produces the callable programmatically
  • a method that accepts some string arguments (like the names of the inputs) and produces the Python source string that will create the callable

Either method could be optional: if the programmatic method is not defined then the compiler specification can't be used programmatically, and if the source method is not defined then it can't be used to create a reproducer script.
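
One hypothetical way that optionality could look (the compile/to_source names follow the example further below; the supports_* helpers are just an illustration of how a caller might detect which paths a specification offers):

import collections.abc

class CompilerSpecification:
    # Hypothetical base class; a subclass may leave either method undefined.

    @staticmethod
    def compile(fx_graph) -> collections.abc.Callable:
        raise NotImplementedError

    @staticmethod
    def to_source(fx_graph_name: str) -> str:
        raise NotImplementedError

    @classmethod
    def supports_programmatic_use(cls) -> bool:
        # True only when a subclass overrides compile()
        return cls.compile is not CompilerSpecification.compile

    @classmethod
    def supports_reproducer(cls) -> bool:
        # True only when a subclass overrides to_source()
        return cls.to_source is not CompilerSpecification.to_source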

Does that make sense?

Second, maybe the timer should be defined similarly. We don't need to make the perfect timer for the next PR, either. It's completely OK to start with one timer for wall clock times that uses the torch.utils timer as specified above, and another timer for kernel times that uses triton.testing.do_bench. We may want to try other timers in the future, and we can swap the defaults later.

In summary, when we have an object representing an FX (sub)graph, it would be nice if we could write code like:

# obj is our FX (sub)graph object


# benchmarks.default_torch_compile just being an example name that points to the default "CompilerSpecification" for torch.compile
tc_times = obj.benchmark(benchmarks.default_torch_compile, benchmarks.default_wall_timer)

print(f"{tc_times.mean=}, {tc_times.median=} ...")

# Another example
tc_times = obj.benchmark(benchmarks.default_eager, benchmarks.triton_do_bench_timer)

# Another example
tc_times = obj.benchmark(benchmarks.default_thunder, benchmarks.default_kernel_timer)

# An example of a custom specification (which should use the same pattern we use to define compilers and timers)

class MyCompilerSpecification(benchmarks.CompilerSpecification):
  ...

  @staticmethod
  def compile(fx_graph) -> collections.abc.Callable:
    ...

  @staticmethod
  def to_source(fx_graph_name: str, ...) -> str:
    ...

class MyTimer(benchmarks.Timer):
  ...

  @staticmethod
  def time(callable, *args, **kwargs):
    ...

  @staticmethod
  def to_source(callable_name: str, ...) -> str:
    ...

my_times = obj.benchmark(MyCompilerSpecification, MyTimer)

I think if we did this then it would be relatively straightforward for us and other developers to extend the set of compiler and timer options. My example oversimplifies how we write reproducer scripts, however. For example, the to_source method for a timer that calls triton.testing.do_bench probably wants to import do_bench from triton.testing, so instead of just returning a string, maybe to_source should return a string for its additional imports and a string for how it's called. These strings could be specified using a tuple or a new type of object.
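
For instance (SourceSnippet is a hypothetical name, only meant to show the shape of such an object):

from dataclasses import dataclass

@dataclass
class SourceSnippet:
    # What a to_source method could return instead of a single string.
    imports: str  # e.g. "from triton.testing import do_bench"
    call: str     # the source line(s) that perform the timing

def do_bench_timer_to_source(callable_name: str) -> SourceSnippet:
    return SourceSnippet(
        imports="from triton.testing import do_bench",
        call=f"kernel_ms = do_bench(lambda: {callable_name}(*inputs), warmup=25, rep=100)",
    )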

How does that sound, @kiya00? Also happy to talk through this over a VC.

@mruberry

Just to be super clear, @kshitij12345's work is great and we'll have more information soon on which timers we'd like to use, but we can start implementing modular support for compilation and timers without knowing every timer we'll want to support in the future.


kevinstephano commented Feb 10, 2025

Just as a heads up, we have had some issues capturing the correct kernels under torch.profiler when running multiple runs: over successive runs, the number of kernels captured and the aggregate time go down. I have not figured out why this is happening. The only workaround I found was to add a 5-second delay prior to collecting the kernel stats. This seems specific to torch.profiler and does not affect other CUPTI-based measurement tools. I tried reducing the delay from 5 seconds, but empirically this is where things stabilized.

Something like this:

from torch.profiler import profile, ProfilerActivity
from time import sleep

def kernel_time_with_torch_profiler(callable, inputs):
    # TODO - Add `warmup` and `rep` arguments to configure this if required.
    # Simple warm-up
    callable(*inputs)

    time = 0.0
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        callable(*inputs)
        # Workaround: wait 5 seconds before the profiler collects the kernel
        # stats; without this, successive runs drop kernels from the trace.
        sleep(5)

    # Sum the device (GPU) time of all recorded events, in microseconds.
    for evt in prof.events():
        time += evt.device_time
    # Also return the profile object in case anyone wants to investigate (not required).
    return time, prof

kiya00 linked a pull request on Feb 10, 2025 that will close this issue.

kiya00 commented Feb 11, 2025

Hi @kevinstephano, thank you for the heads up! I'm trying to use the timer used in the nvFuser benchmarks, and I noticed there are some comments around https://github.com/NVIDIA/Fuser/blob/39bc83a3121561e8b61f81520fc036c040b87355/benchmarks/python/core.py#L159-L161. Is that the right way to measure kernel time, and does it also have the issue you mentioned?
