[ThunderFX report] Provides modular Timer interface #1757
Hi @kshitij12345 @mruberry, I have a few questions regarding the requirements.
Regarding the modular timing interfaces, does this mean we should provide a wrapper class (or function) that internally utilizes the above tools and offers a unified, consistent interface?
Regarding kernel times, the issue notes a potential alternative: @kevinstephano has pitched the idea of using torch.profiler, along these lines:

```python
from torch.profiler import profile, ProfilerActivity

def kernel_time_with_torch_profiler(callable, inputs):
    # TODO - Add `warmup` and `rep` arguments to configure this if required.
    # Simple warm-up
    callable(*inputs)
    time = 0.0
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        callable(*inputs)
    for evt in prof.events():
        time += evt.device_time
    # Also returning the profile object
    # in case anyone wants to investigate (not required).
    return time, prof
```

Also, pinging @ptrblck in case you have other ideas (as we were planning to sync on this).
It would be nice to create timer functions for each of the three objects we're considering (fx graphs, thunder fx graph segments, fusions), where there's a base method that can accept some kind of timer object, and then the derived classes can call that method. Ideally, all of these objects create an FX graph, and the generic base method allows for the specification of two things.
First, specifying a "mechanism to compile the FX graph" would generalize the current requests for torch.compile or thunder or eager to compile the graph. We could create an object, like "CompilerSpecification", that could describe how to call torch.compile or thunder or whatever to produce a callable. This should allow for compiling with torch.compile or thunder, and also for creating a thunder program and extracting an nvFuser fusion as a callable, which could be useful when testing nvFuser fusions. This object might require defining two methods: one that programmatically produces the callable from the FX graph, and one that produces the equivalent code as a string so it can be written into a reproducer script.
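As a rough sketch (the class name and method signatures below are illustrative assumptions, not an agreed-upon interface), such an object could look like:

```python
from typing import Callable, Optional

import torch

class CompilerSpecification:
    """Hypothetical sketch: describes how to turn an FX graph into a callable.

    `compile` produces the callable programmatically; `repro_source` returns
    the equivalent code as a string for a reproducer script. Either may be
    left unimplemented, with the limitations described above.
    """

    def compile(self, fx_graph_module: torch.fx.GraphModule) -> Callable:
        raise NotImplementedError

    def repro_source(self, fx_graph_module: torch.fx.GraphModule) -> Optional[str]:
        raise NotImplementedError


class TorchCompileSpecification(CompilerSpecification):
    """Example derived specification that uses torch.compile."""

    def compile(self, fx_graph_module):
        return torch.compile(fx_graph_module)

    def repro_source(self, fx_graph_module):
        return "compiled = torch.compile(fx_graph_module)"
```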
Either method could be optional; if one method is not defined, then the compiler specification can't be used programmatically, or can't be used to create a reproducer script. Does that make sense?

Second, maybe the timer should be defined similarly. We don't need to make the perfect timer for the next PR, either. It's completely OK to start with one timer for wall clock times.

In summary, it'd be nice, when we have an object representing an FX (sub)graph, if we could write code like:
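Something along these lines, perhaps. This sketch reuses the hypothetical `TorchCompileSpecification` from above and adds an equally hypothetical `WallClockTimer` and `benchmark_fx_graph` helper; none of these names exist in the codebase today:

```python
import time

import torch

class WallClockTimer:
    """Hypothetical timer object based on time.perf_counter."""

    def time(self, fn, inputs, repeats: int = 10) -> float:
        fn(*inputs)  # warm-up
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*inputs)
        return (time.perf_counter() - start) / repeats


def benchmark_fx_graph(fx_graph_module, inputs, compiler_spec, timer):
    """Hypothetical generic base method: compile the FX graph, then time it."""
    fn = compiler_spec.compile(fx_graph_module)
    return timer.time(fn, inputs)


def f(x):
    return torch.relu(x) + 1

gm = torch.fx.symbolic_trace(f)
avg_s = benchmark_fx_graph(gm, (torch.randn(32, 32),), TorchCompileSpecification(), WallClockTimer())
print(f"average wall-clock time: {avg_s:.6f} s")
```

Swapping in a different compiler specification or timer object would then not require touching the benchmarking logic itself.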
I think if we did this then it would be relatively straightforward for us and developers to extend the set of compile and timer options. My example oversimplifies how we write reproductions, however. How does that sound, @kiya00? Also happy to talk through this over a VC.
Just to be super clear, @kshitij12345's work is great and we'll have more information on what timers we'd like to use in the near future, but we can start implementing modular support for compilation and timers without knowing all the timers we want to support in the future.
Just as a heads up, we have had some issues capturing the correct kernels this way; the workaround we've used is to add a sleep inside the profiling context. Something like this:

```python
from torch.profiler import profile, ProfilerActivity
from time import sleep

def kernel_time_with_torch_profiler(callable, inputs):
    # TODO - Add `warmup` and `rep` arguments to configure this if required.
    # Simple warm-up
    callable(*inputs)
    time = 0.0
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        callable(*inputs)
        sleep(5)
    for evt in prof.events():
        time += evt.device_time
    # Also returning the profile object
    # in case anyone wants to investigate (not required).
    return time, prof
```
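A hypothetical way to exercise this helper (the toy model and shapes here are made up for illustration; the profiler reports event times in microseconds):

```python
import torch

# Assumes a CUDA device is available.
model = torch.nn.Linear(1024, 1024).cuda()
inputs = (torch.randn(64, 1024, device="cuda"),)

total_device_time_us, prof = kernel_time_with_torch_profiler(model, inputs)
print(f"accumulated device time: {total_device_time_us:.1f} us")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```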
Hi @kevinstephano, thank you for the heads up! I'm trying to use the timer from the nvFuser benchmarks, and I noticed there are some comments at https://github.com/NVIDIA/Fuser/blob/39bc83a3121561e8b61f81520fc036c040b87355/benchmarks/python/core.py#L159-L161. Is that the right way to measure kernel time, and does it also have the issue you mentioned?
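For reference, a common CUDA-event-based pattern for timing GPU work looks roughly like the following; this is a generic sketch and not necessarily what the linked nvFuser benchmark code does:

```python
import torch

def cuda_event_time_ms(fn, inputs, repeats: int = 10) -> float:
    """Average time per call in milliseconds, measured with CUDA events."""
    fn(*inputs)  # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeats  # elapsed_time returns milliseconds
```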
We'll want to focus here in the next PR. FYI @kshitij12345. A couple of notes:
Originally posted by @mruberry in #1747 (comment)