-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RUNTIME] Add device specific timers #7472
Conversation
f4351a2
to
8de820d
Compare
8de820d
to
3ff15d6
Compare
I think in the future i'd like to see us move to a log-based approach where each event of interest and timer used gets a unique id assigned to it at compile time. but I think this is a good change in the meantime, as that system is way more complicated and would need an RFC or two. |
python/tvm/runtime/profiling.py
Outdated
|
||
def start_timer(ctx): | ||
""" | ||
Start a low-overhead device specific timer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some documentation explaining the cases why you want to do this vs use time evaluator?
src/runtime/vm/profiler/vm.h
Outdated
@@ -51,7 +51,7 @@ class VirtualMachineDebug : public VirtualMachine { | |||
const std::vector<ObjectRef>& args) final; | |||
|
|||
std::unordered_map<Index, std::string> packed_index_map_; | |||
std::unordered_map<Index, std::vector<double>> op_durations_; | |||
std::unordered_map<Index, std::vector<TypedPackedFunc<int64_t()>>> op_durations_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't really durations anymore, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, it is. You just have to call the function to get the duration.
python/tvm/runtime/profiling.py
Outdated
timer_start: function | ||
A function that stops the device specific timer. Calling this function | ||
stops the timer and returns another function that returns the elapsed | ||
time in nanoseconds. This third function should be called as late as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain the data type of the return value. if it is floating point, why not seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, what will happen if a wraparound occurs? it would be great to describe it in this api doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt wrap around will occur. You need to time something for longer than 584.6 years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if underflow occurs because the system time changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You get a negative duration (int64_t is signed). Right now I'm using high_resolution_clock which maybe not be monotonic. We could switch to steady clock, but it maybe not be as high resolution.
timer_stop = tvm.runtime.start_timer(ctx) | ||
time.sleep(1e-3) | ||
nanosecs = timer_stop() | ||
assert nanosecs < 2e6 and nanosecs > 5e5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do this, it's inviting a flaky test. do we have a mock DLContext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've switch to a test that just checks that the time is > 0.
src/runtime/rocm/rocm_device_api.cc
Outdated
@@ -200,5 +200,28 @@ TVM_REGISTER_GLOBAL("device_api.rocm").set_body([](TVMArgs args, TVMRetValue* rv | |||
DeviceAPI* ptr = ROCMDeviceAPI::Global(); | |||
*rv = static_cast<void*>(ptr); | |||
}); | |||
|
|||
TVM_REGISTER_GLOBAL("profiling.timer.rocm").set_body_typed([](TVMContext ctx) { | |||
hipEvent_t start; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it safe to copy these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assumption here is that each function is called only once. If that constraint is violated, then its possible to double free.
src/runtime/cuda/cuda_device_api.cc
Outdated
return TypedPackedFunc<int64_t()>( | ||
[=]() -> int64_t { | ||
cudaEventSynchronize(stop); | ||
float milliseconds = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does it work accessing stop
from here when its copy was written to above? how might this work as a general mechanism? this does imply that starting and stopping a timer may load the dynamic memory allocator, which is bad for general timing API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop
is an opaque type copied from above. I am unsure if the allocator will be called. If so, it will be for allocating the closure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it matters a lot when the allocator is called relative to the timing calls, and being able to understand that just from reading the code is quite valuable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess chatting further--I think we already generally are not timing right inside the kernel functions. at a later time if we start doing that, i'd like us to consider things at this level, but we don't need to alter the whole approach. we may be able to help though by moving the std::function instantiation to the Start() call
include/tvm/runtime/profiling.h
Outdated
* Note that this timer performs synchronization between the device and CPU, | ||
* which can lead to overhead in the reported results. | ||
*/ | ||
TypedPackedFunc<TypedPackedFunc<int64_t()>()> DefaultTimer(TVMContext ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to have a different name as the return value is a stop function.
e.g. DefaultTimeStart?
Document the return value.
Alternatively, make a longer function, assuming we only calls into the StartTimer most of the time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly minor stuff now
* Note: this function should only be called once per object. | ||
*/ | ||
virtual void Stop() = 0; | ||
/*! \brief Synchronize timer state and return elapsed time between `Start` and `Stop`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe include a comment that this calls TVMSynchronize under the hood (I think?)? and update the declaration below too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not call TVMSynchronize under the hood. Only the DefaultTimer
does that.
include/tvm/runtime/profiling.h
Outdated
*/ | ||
inline TypedPackedFunc<TypedPackedFunc<int64_t()>()> StartTimer(TVMContext ctx) { | ||
inline Timer StartTimer(TVMContext ctx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make static on Timer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its ok to leave it as a separate function.
tests/cpp/profiling.cc
Outdated
std::this_thread::sleep_for(std::chrono::milliseconds(10)); | ||
t.Stop(); | ||
int64_t elapsed = t.SyncAndGetTime(); | ||
CHECK_GT(elapsed, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
theoretically could be >10ms, no? I guess there is the possibility of timer adjusting..thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this test to check if the time is greater than 9 milliseconds (just to leave some leeway).
src/runtime/vm/profiler/vm.cc
Outdated
for (auto kv : op_durations_) { | ||
std::vector<double> durations; | ||
for (auto t : kv.second) { | ||
durations.push_back(t.SyncAndGetTime() / 1e3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to cast to double before dividing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1e3
is double so this will always be a double.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @tkonolige !
@tqchen to review/merge
cc @junrushao1994@comaniac @hzfan @d-smirnov please help to review this PR |
* int64_t nanosecs = t.SyncAndGetElapsedNanos() // elapsed time in nanoseconds | ||
* \endcode | ||
* | ||
* To add a new device-specific timer, register a new function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do users register the new function in python or cpp? Could you provide code example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C++, I've added an example.
@@ -0,0 +1,47 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add tests in python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't use it from python right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks @tkonolige
include/tvm/runtime/profiling.h
Outdated
* | ||
* Note: this function should only be called once per object. | ||
*/ | ||
void Stop() { operator->()->Stop(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider remove these member functions.
Given we already have timer->Stop() and timer->SyncAndGetElapsedNanos() to avoid one level of indirection.
This is also the approaches we use in latest added objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like including these methods as they document the timer interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these methods are also presented in the node and documentations are there as well. Removing the methods indirection would make the code more consistent with the reference usage of the rest of the codebase.
It also reduces the confusion of two possible ways to do the same thing.
Adding a link \sa link, as well as code examples in the Timer class would resolved the usage problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I based all of this off of how Pass
does things. And it takes this same approach.
The API here is Timer
not TimerNode
. These methods are the interface. Having these convenience methods also provides autocompletion for people who rely on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tkonolige , the overall changes looks good. I just made one final comment about removing the indirection wrapping of Ref object as we can directly do timer->Stop()
Thanks @tkonolige @areusch @hzfan ! |
This PR adds device specific timers for use in profiling. I've included CPU (std::steady_clock), CUDA (cudaEvent), and ROCM timers (rocmEvent). On devices that do no have a timer supplied, we fall back to using the CPU timer with TVMSychronize.
@areusch @hlu1 @jwfromm @icemelon9