Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RUNTIME] Add device specific timers #7472

Merged
merged 26 commits into from
Mar 5, 2021
Merged

Conversation

tkonolige
Copy link
Contributor

This PR adds device specific timers for use in profiling. I've included CPU (std::steady_clock), CUDA (cudaEvent), and ROCM timers (rocmEvent). On devices that do no have a timer supplied, we fall back to using the CPU timer with TVMSychronize.

@areusch @hlu1 @jwfromm @icemelon9

@areusch
Copy link
Contributor

areusch commented Feb 19, 2021

I think in the future i'd like to see us move to a log-based approach where each event of interest and timer used gets a unique id assigned to it at compile time. but I think this is a good change in the meantime, as that system is way more complicated and would need an RFC or two.


def start_timer(ctx):
"""
Start a low-overhead device specific timer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some documentation explaining the cases why you want to do this vs use time evaluator?

@@ -51,7 +51,7 @@ class VirtualMachineDebug : public VirtualMachine {
const std::vector<ObjectRef>& args) final;

std::unordered_map<Index, std::string> packed_index_map_;
std::unordered_map<Index, std::vector<double>> op_durations_;
std::unordered_map<Index, std::vector<TypedPackedFunc<int64_t()>>> op_durations_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't really durations anymore, is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it is. You just have to call the function to get the duration.

timer_start: function
A function that stops the device specific timer. Calling this function
stops the timer and returns another function that returns the elapsed
time in nanoseconds. This third function should be called as late as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain the data type of the return value. if it is floating point, why not seconds?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, what will happen if a wraparound occurs? it would be great to describe it in this api doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt wrap around will occur. You need to time something for longer than 584.6 years.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if underflow occurs because the system time changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You get a negative duration (int64_t is signed). Right now I'm using high_resolution_clock which maybe not be monotonic. We could switch to steady clock, but it maybe not be as high resolution.

timer_stop = tvm.runtime.start_timer(ctx)
time.sleep(1e-3)
nanosecs = timer_stop()
assert nanosecs < 2e6 and nanosecs > 5e5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should do this, it's inviting a flaky test. do we have a mock DLContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switch to a test that just checks that the time is > 0.

@@ -200,5 +200,28 @@ TVM_REGISTER_GLOBAL("device_api.rocm").set_body([](TVMArgs args, TVMRetValue* rv
DeviceAPI* ptr = ROCMDeviceAPI::Global();
*rv = static_cast<void*>(ptr);
});

TVM_REGISTER_GLOBAL("profiling.timer.rocm").set_body_typed([](TVMContext ctx) {
hipEvent_t start;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it safe to copy these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption here is that each function is called only once. If that constraint is violated, then its possible to double free.

return TypedPackedFunc<int64_t()>(
[=]() -> int64_t {
cudaEventSynchronize(stop);
float milliseconds = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does it work accessing stop from here when its copy was written to above? how might this work as a general mechanism? this does imply that starting and stopping a timer may load the dynamic memory allocator, which is bad for general timing API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop is an opaque type copied from above. I am unsure if the allocator will be called. If so, it will be for allocating the closure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it matters a lot when the allocator is called relative to the timing calls, and being able to understand that just from reading the code is quite valuable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess chatting further--I think we already generally are not timing right inside the kernel functions. at a later time if we start doing that, i'd like us to consider things at this level, but we don't need to alter the whole approach. we may be able to help though by moving the std::function instantiation to the Start() call

* Note that this timer performs synchronization between the device and CPU,
* which can lead to overhead in the reported results.
*/
TypedPackedFunc<TypedPackedFunc<int64_t()>()> DefaultTimer(TVMContext ctx);
Copy link
Member

@tqchen tqchen Feb 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to have a different name as the return value is a stop function.

e.g. DefaultTimeStart?

Document the return value.

Alternatively, make a longer function, assuming we only calls into the StartTimer most of the time

@tkonolige
Copy link
Contributor Author

I've switch from a function based timing approach to an object based one. This means that the api is substantially different. @areusch and @tqchen you will probably have to re-review. Sorry about that.

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly minor stuff now

* Note: this function should only be called once per object.
*/
virtual void Stop() = 0;
/*! \brief Synchronize timer state and return elapsed time between `Start` and `Stop`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe include a comment that this calls TVMSynchronize under the hood (I think?)? and update the declaration below too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not call TVMSynchronize under the hood. Only the DefaultTimer does that.

*/
inline TypedPackedFunc<TypedPackedFunc<int64_t()>()> StartTimer(TVMContext ctx) {
inline Timer StartTimer(TVMContext ctx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make static on Timer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its ok to leave it as a separate function.

std::this_thread::sleep_for(std::chrono::milliseconds(10));
t.Stop();
int64_t elapsed = t.SyncAndGetTime();
CHECK_GT(elapsed, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

theoretically could be >10ms, no? I guess there is the possibility of timer adjusting..thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this test to check if the time is greater than 9 milliseconds (just to leave some leeway).

for (auto kv : op_durations_) {
std::vector<double> durations;
for (auto t : kv.second) {
durations.push_back(t.SyncAndGetTime() / 1e3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to cast to double before dividing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1e3 is double so this will always be a double.

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @tkonolige !

@tqchen to review/merge

@tqchen
Copy link
Member

tqchen commented Mar 1, 2021

cc @junrushao1994@comaniac @hzfan @d-smirnov please help to review this PR

@tqchen tqchen added the status: need update need update based on feedbacks label Mar 1, 2021
* int64_t nanosecs = t.SyncAndGetElapsedNanos() // elapsed time in nanoseconds
* \endcode
*
* To add a new device-specific timer, register a new function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do users register the new function in python or cpp? Could you provide code example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C++, I've added an example.

@@ -0,0 +1,47 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add tests in python?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't use it from python right now.

Copy link
Contributor

@hzfan hzfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @tkonolige

*
* Note: this function should only be called once per object.
*/
void Stop() { operator->()->Stop(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider remove these member functions.
Given we already have timer->Stop() and timer->SyncAndGetElapsedNanos() to avoid one level of indirection.

This is also the approaches we use in latest added objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like including these methods as they document the timer interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these methods are also presented in the node and documentations are there as well. Removing the methods indirection would make the code more consistent with the reference usage of the rest of the codebase.

It also reduces the confusion of two possible ways to do the same thing.

Adding a link \sa link, as well as code examples in the Timer class would resolved the usage problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I based all of this off of how Pass does things. And it takes this same approach.

The API here is Timer not TimerNode. These methods are the interface. Having these convenience methods also provides autocompletion for people who rely on it.

Copy link
Member

@tqchen tqchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tkonolige , the overall changes looks good. I just made one final comment about removing the indirection wrapping of Ref object as we can directly do timer->Stop()

@tqchen tqchen merged commit d6c0cea into apache:main Mar 5, 2021
@tqchen
Copy link
Member

tqchen commented Mar 5, 2021

Thanks @tkonolige @areusch @hzfan !

@tqchen tqchen added status: accepted and removed status: need review status: need update need update based on feedbacks labels Mar 5, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants