[RUNTIME] Add device specific timers #7472

tkonolige · 2021-02-18T17:27:48Z

This PR adds device specific timers for use in profiling. I've included CPU (std::steady_clock), CUDA (cudaEvent), and ROCM timers (rocmEvent). On devices that do not have a timer supplied, we fall back to using the CPU timer with TVMSynchronize.

@areusch @hlu1 @jwfromm @icemelon9

include/tvm/runtime/profiling.h

areusch · 2021-02-19T18:18:05Z

I think in the future i'd like to see us move to a log-based approach where each event of interest and timer used gets a unique id assigned to it at compile time. but I think this is a good change in the meantime, as that system is way more complicated and would need an RFC or two.

areusch · 2021-02-22T19:00:08Z

python/tvm/runtime/profiling.py

+
+def start_timer(ctx):
+    """
+    Start a low-overhead device specific timer.


Could you add some documentation explaining when you would want to use this instead of the time evaluator?

areusch · 2021-02-22T19:01:11Z

src/runtime/vm/profiler/vm.h

@@ -51,7 +51,7 @@ class VirtualMachineDebug : public VirtualMachine {
                    const std::vector<ObjectRef>& args) final;

  std::unordered_map<Index, std::string> packed_index_map_;
-  std::unordered_map<Index, std::vector<double>> op_durations_;
+  std::unordered_map<Index, std::vector<TypedPackedFunc<int64_t()>>> op_durations_;


this isn't really durations anymore, is it?

I mean, it is. You just have to call the function to get the duration.
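To illustrate the point about `op_durations_` holding callables rather than numbers, here is a minimal sketch of the function-based design under review. The aliases `ElapsedFn` and `StopFn` and the helpers `StartCpuTimer` and `TimeOps` are hypothetical stand-ins built on `std::function`, not the `TypedPackedFunc` types from the diff.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-ins for the nested TypedPackedFunc types in the diff.
using ElapsedFn = std::function<int64_t()>;  // returns elapsed nanoseconds
using StopFn = std::function<ElapsedFn()>;   // stops the timer

// Start a CPU timer; the returned function stops it.
inline StopFn StartCpuTimer() {
  auto start = std::chrono::steady_clock::now();
  return [start]() -> ElapsedFn {
    auto stop = std::chrono::steady_clock::now();
    // Reading the value is deferred until the caller asks for it.
    return [start, stop]() -> int64_t {
      return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    };
  };
}

// Time `n` dummy ops, storing the readers first and resolving them afterwards.
inline std::vector<int64_t> TimeOps(int n) {
  std::vector<ElapsedFn> pending;
  for (int i = 0; i < n; ++i) {
    StopFn stop = StartCpuTimer();
    volatile int sink = 0;
    for (int j = 0; j < 1000; ++j) sink = sink + j;  // stand-in for an operator call
    pending.push_back(stop());
  }
  std::vector<int64_t> durations;
  for (const auto& read : pending) durations.push_back(read());
  return durations;
}
```

On a device with an asynchronous queue, deferring the read like this is what would let the profiler avoid synchronizing after every operator; on the CPU the deferral is only structural.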

areusch · 2021-02-22T19:01:48Z

python/tvm/runtime/profiling.py

+    timer_start: function
+        A function that stops the device specific timer. Calling this function
+        stops the timer and returns another function that returns the elapsed
+        time in nanoseconds. This third function should be called as late as


explain the data type of the return value. if it is floating point, why not seconds?

also, what will happen if a wraparound occurs? it would be great to describe it in this api doc

I doubt wrap around will occur. You need to time something for longer than 584.6 years.

what happens if underflow occurs because the system time changed?

You get a negative duration (int64_t is signed). Right now I'm using high_resolution_clock, which may not be monotonic. We could switch to steady clock, but it may not be as high resolution.
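The wraparound and monotonicity questions can be checked with a little arithmetic. This is a sketch only: the 365.25-day year and the names `kSecondsPerYear` and `YearsUntilWrap` are assumptions made for illustration. Note that a figure near 585 years corresponds to the full unsigned 64-bit range; a signed `int64_t` nanosecond count covers roughly half of that.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Seconds in a 365.25-day year; an approximation chosen for this sketch.
constexpr double kSecondsPerYear = 365.25 * 24 * 60 * 60;

// Years of nanoseconds representable before a counter with this maximum wraps.
constexpr double YearsUntilWrap(double max_nanos) {
  return max_nanos / 1e9 / kSecondsPerYear;
}

constexpr double kSignedYears =
    YearsUntilWrap(static_cast<double>(std::numeric_limits<int64_t>::max()));
constexpr double kUnsignedYears =
    YearsUntilWrap(static_cast<double>(std::numeric_limits<uint64_t>::max()));

// steady_clock is required by the standard to be monotonic.
static_assert(std::chrono::steady_clock::is_steady, "steady_clock must be monotonic");

// high_resolution_clock may be an alias of system_clock or steady_clock,
// so whether it is monotonic depends on the standard library in use.
constexpr bool kHighResIsSteady = std::chrono::high_resolution_clock::is_steady;
```

If `kHighResIsSteady` is false on a given platform, a system time adjustment between start and stop could produce the negative duration described above.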

areusch · 2021-02-22T19:04:52Z

tests/python/unittest/test_runtime_profiling.py

+    timer_stop = tvm.runtime.start_timer(ctx)
+    time.sleep(1e-3)
+    nanosecs = timer_stop()
+    assert nanosecs < 2e6 and nanosecs > 5e5


I don't think we should do this, it's inviting a flaky test. do we have a mock DLContext?

I've switched to a test that just checks that the time is > 0.

areusch · 2021-02-22T19:05:42Z

src/runtime/rocm/rocm_device_api.cc

@@ -200,5 +200,28 @@ TVM_REGISTER_GLOBAL("device_api.rocm").set_body([](TVMArgs args, TVMRetValue* rv
  DeviceAPI* ptr = ROCMDeviceAPI::Global();
  *rv = static_cast<void*>(ptr);
 });
+
+TVM_REGISTER_GLOBAL("profiling.timer.rocm").set_body_typed([](TVMContext ctx) {
+  hipEvent_t start;


is it safe to copy these?

The assumption here is that each function is called only once. If that constraint is violated, then it's possible to double free.
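One general way to rule out a double free when a handle is passed around is to make its owner move-only. The sketch below is not what the PR does; `EventHandle`, `LiveEvents`, and `DemoMoveOnly` are hypothetical, and a plain counter stands in for the create and destroy calls of a real event API.

```cpp
#include <cassert>
#include <utility>

// Hypothetical counter standing in for event create/destroy calls.
inline int& LiveEvents() {
  static int n = 0;
  return n;
}

class EventHandle {
 public:
  EventHandle() : owns_(true) { ++LiveEvents(); }
  ~EventHandle() { Release(); }
  // Copying is disabled so two owners can never free the same event.
  EventHandle(const EventHandle&) = delete;
  EventHandle& operator=(const EventHandle&) = delete;
  EventHandle(EventHandle&& other) noexcept : owns_(std::exchange(other.owns_, false)) {}
  EventHandle& operator=(EventHandle&& other) noexcept {
    if (this != &other) {
      Release();
      owns_ = std::exchange(other.owns_, false);
    }
    return *this;
  }
  bool owns() const { return owns_; }

 private:
  void Release() {
    if (owns_) {
      --LiveEvents();
      owns_ = false;
    }
  }
  bool owns_ = false;
};

// Returns the number of live events after a move; 0 means each was freed once.
inline int DemoMoveOnly() {
  {
    EventHandle a;
    EventHandle b = std::move(a);  // ownership transfers; `a` no longer frees
    assert(!a.owns() && b.owns());
  }
  return LiveEvents();
}
```

With this shape, the "call only once" rule is enforced by the type rather than by convention, at the cost that the handle can no longer be captured by copy in a closure.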

areusch · 2021-02-22T19:07:23Z

src/runtime/cuda/cuda_device_api.cc

+        return TypedPackedFunc<int64_t()>(
+            [=]() -> int64_t {
+              cudaEventSynchronize(stop);
+              float milliseconds = 0;


how does it work accessing stop from here when its copy was written to above? how might this work as a general mechanism? this does imply that starting and stopping a timer may load the dynamic memory allocator, which is bad for a general timing API.

stop is an opaque type copied from above. I am unsure if the allocator will be called. If so, it will be for allocating the closure.

it matters a lot when the allocator is called relative to the timing calls, and being able to understand that just from reading the code is quite valuable

I guess chatting further--I think we already generally are not timing right inside the kernel functions. at a later time if we start doing that, i'd like us to consider things at this level, but we don't need to alter the whole approach. we may be able to help though by moving the std::function instantiation to the Start() call

tqchen · 2021-02-22T21:28:44Z

include/tvm/runtime/profiling.h

+ * Note that this timer performs synchronization between the device and CPU,
+ * which can lead to overhead in the reported results.
+ */
+TypedPackedFunc<TypedPackedFunc<int64_t()>()> DefaultTimer(TVMContext ctx);


We might want to have a different name as the return value is a stop function.

e.g. DefaultTimeStart?

Document the return value.

Alternatively, make a longer function, assuming we only call into the StartTimer most of the time

tkonolige · 2021-02-23T23:30:52Z

I've switched from a function based timing approach to an object based one. This means that the API is substantially different. @areusch and @tqchen you will probably have to re-review. Sorry about that.

areusch

mostly minor stuff now

areusch · 2021-02-24T23:50:06Z

include/tvm/runtime/profiling.h

+   * Note: this function should only be called once per object.
+   */
+  virtual void Stop() = 0;
+  /*! \brief Synchronize timer state and return elapsed time between `Start` and `Stop`.


maybe include a comment that this calls TVMSynchronize under the hood (I think?)? and update the declaration below too

It does not call TVMSynchronize under the hood. Only the DefaultTimer does that.
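The distinction between a device-specific timer and the synchronizing fallback can be sketched as follows. This is a simplified illustration, not the code in `profiling.h`: `TimerBase`, `CpuTimer`, and `FallbackTimer` are hypothetical names, the `sync` callback stands in for a device synchronize call, and the method name `SyncAndGetElapsedNanos` is the one adopted later in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>

// Simplified, hypothetical interface mirroring the method names in this PR.
class TimerBase {
 public:
  virtual ~TimerBase() = default;
  virtual void Start() = 0;
  virtual void Stop() = 0;
  virtual int64_t SyncAndGetElapsedNanos() = 0;
};

// Host-only timer: no device synchronization is involved.
class CpuTimer : public TimerBase {
 public:
  void Start() override { start_ = std::chrono::steady_clock::now(); }
  void Stop() override { stop_ = std::chrono::steady_clock::now(); }
  int64_t SyncAndGetElapsedNanos() override {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop_ - start_).count();
  }

 private:
  std::chrono::steady_clock::time_point start_, stop_;
};

// Fallback timer: waits for the device before each host clock reading.
class FallbackTimer : public CpuTimer {
 public:
  explicit FallbackTimer(std::function<void()> sync) : sync_(std::move(sync)) {}
  void Start() override {
    sync_();
    CpuTimer::Start();
  }
  void Stop() override {
    sync_();
    CpuTimer::Stop();
  }

 private:
  std::function<void()> sync_;  // stand-in for a device synchronize call
};
```

The extra synchronization in the fallback is what the header comment warns about: it is correct on any device but adds overhead that an event-based device timer would avoid.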

include/tvm/runtime/profiling.h

areusch · 2021-02-25T00:10:37Z

include/tvm/runtime/profiling.h

 */
-inline TypedPackedFunc<TypedPackedFunc<int64_t()>()> StartTimer(TVMContext ctx) {
+inline Timer StartTimer(TVMContext ctx) {


make static on Timer?

I think it's OK to leave it as a separate function.

include/tvm/runtime/profiling.h

src/runtime/vm/profiler/vm.h

src/runtime/vm/profiler/vm.cc

areusch · 2021-02-25T00:45:06Z

tests/cpp/profiling.cc

+  std::this_thread::sleep_for(std::chrono::milliseconds(10));
+  t.Stop();
+  int64_t elapsed = t.SyncAndGetTime();
+  CHECK_GT(elapsed, 0);


theoretically could be >10ms, no? I guess there is the possibility of timer adjusting..thoughts?

I've changed this test to check if the time is greater than 9 milliseconds (just to leave some leeway).
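A lower-bound-only check of this kind might look like the sketch below. It uses `std::chrono::steady_clock` directly rather than the `Timer` object from the PR, and `MeasureSleepNanos` is a hypothetical helper.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Sleep for `ms` milliseconds and return the measured time in nanoseconds.
inline int64_t MeasureSleepNanos(int ms) {
  auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::milliseconds(ms));
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Asserting only `MeasureSleepNanos(10) > 9000000` leaves a millisecond of leeway below the requested sleep and sets no upper bound, since scheduling delays on a loaded machine can make the sleep arbitrarily long.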

include/tvm/runtime/profiling.h

areusch · 2021-02-25T00:47:01Z

src/runtime/vm/profiler/vm.cc

      for (auto kv : op_durations_) {
+        std::vector<double> durations;
+        for (auto t : kv.second) {
+          durations.push_back(t.SyncAndGetTime() / 1e3);


I think you want to cast to double before dividing

1e3 is double so this will always be a double.
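The promotion rule behind this reply can be confirmed at compile time. In this sketch `NanosToMicros` is a hypothetical helper; the `static_assert`s check the result types of the two divisions.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Dividing an integer by a double literal converts the integer to double first.
static_assert(std::is_same<decltype(int64_t{1} / 1e3), double>::value,
              "int64_t / double yields double");

// Dividing by an integer literal stays in integer arithmetic and truncates.
static_assert(std::is_same<decltype(int64_t{1} / 1000), int64_t>::value,
              "int64_t / int yields int64_t");

// Convert nanoseconds to microseconds without losing the fractional part.
inline double NanosToMicros(int64_t nanos) { return nanos / 1e3; }
```

An explicit cast would only be needed if the divisor were written as the integer `1000`.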

areusch

thanks @tkonolige !

@tqchen to review/merge

include/tvm/runtime/profiling.h

tqchen · 2021-03-01T23:38:41Z

cc @junrushao1994 @comaniac @hzfan @d-smirnov please help to review this PR

hzfan · 2021-03-02T10:05:41Z

include/tvm/runtime/profiling.h

+   * int64_t nanosecs = t.SyncAndGetElapsedNanos() // elapsed time in nanoseconds
+   * \endcode
+   *
+   * To add a new device-specific timer, register a new function


Do users register the new function in python or cpp? Could you provide code example?

In C++, I've added an example.
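The registration-with-fallback idea can be sketched without TVM as a lookup table keyed by names of the form `profiling.timer.<device>`, the pattern visible in the ROCm hunk above. Everything else here is an assumption for illustration: `Registry`, `RegisterTimer`, and `MakeTimer` are hypothetical, and the factory returns a string label where a real one would return a timer object.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// In this sketch a "timer" is just a label naming the implementation chosen.
using TimerFactory = std::function<std::string()>;

// Hypothetical global registry keyed by "profiling.timer.<device>".
inline std::unordered_map<std::string, TimerFactory>& Registry() {
  static std::unordered_map<std::string, TimerFactory> registry;
  return registry;
}

inline void RegisterTimer(const std::string& device, TimerFactory factory) {
  Registry()["profiling.timer." + device] = std::move(factory);
}

// Look up a device-specific timer; fall back to the default when none exists.
inline std::string MakeTimer(const std::string& device) {
  auto it = Registry().find("profiling.timer." + device);
  if (it != Registry().end()) return it->second();
  return "default";  // stand-in for the synchronizing CPU fallback
}
```

A backend would call `RegisterTimer` once at load time; callers only ever go through `MakeTimer` and do not need to know which devices have their own timer.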

hzfan · 2021-03-02T10:14:53Z

tests/cpp/profiling.cc

@@ -0,0 +1,47 @@
+/*


Could you also add tests in python?

You can't use it from python right now.

src/runtime/cuda/cuda_device_api.cc

hzfan

Looks good to me. Thanks @tkonolige

tqchen · 2021-03-04T17:32:26Z

include/tvm/runtime/profiling.h

+   *
+   * Note: this function should only be called once per object.
+   */
+  void Stop() { operator->()->Stop(); }


Consider removing these member functions. We already have timer->Stop() and timer->SyncAndGetElapsedNanos(), so dropping the wrappers avoids one level of indirection.

This is also the approach we use in the most recently added objects.

I like including these methods as they document the timer interface.

These methods are also present on the node, and the documentation is there as well. Removing the forwarding methods would make the code more consistent with how references are used in the rest of the codebase.

It also reduces the confusion of two possible ways to do the same thing.

Adding a \sa link, as well as code examples in the Timer class, would resolve the usage problem.

I based all of this off of how Pass does things. And it takes this same approach.

The API here is Timer not TimerNode. These methods are the interface. Having these convenience methods also provides autocompletion for people who rely on it.
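The two call styles being debated can be shown side by side in a reduced form. `SketchTimerNode` and `SketchTimer` are hypothetical and use `std::shared_ptr` in place of TVM's object system; the sketch only shows that the forwarding method and `operator->` reach the same node.

```cpp
#include <cassert>
#include <memory>

// Node type holding the state and the real implementation.
struct SketchTimerNode {
  int stops = 0;
  void Stop() { ++stops; }
};

// Reference type wrapping a shared node.
class SketchTimer {
 public:
  SketchTimer() : node_(std::make_shared<SketchTimerNode>()) {}
  // Direct access to the node: timer->Stop().
  SketchTimerNode* operator->() const { return node_.get(); }
  // Convenience forwarding method: timer.Stop().
  void Stop() { operator->()->Stop(); }

 private:
  std::shared_ptr<SketchTimerNode> node_;
};
```

Both `t.Stop()` and `t->Stop()` increment the same counter, which is the duplication one side of the discussion wants to remove and the other wants to keep for discoverability.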

include/tvm/runtime/profiling.h

tqchen

Thanks @tkonolige, the overall changes look good. I just made one final comment about removing the indirection wrapping of Ref object as we can directly do timer->Stop()

tqchen · 2021-03-05T18:56:14Z

Thanks @tkonolige @areusch @hzfan !

tkonolige mentioned this pull request Feb 18, 2021

[FIX,Debug Graph Runtime] Insert TVMSynchronize for accurate profiling on GPUs #7444

Closed

tkonolige force-pushed the device_timers branch from f4351a2 to 8de820d Compare February 18, 2021 17:53

[RUNTIME] Add device specific timers

3ff15d6

tkonolige force-pushed the device_timers branch from 8de820d to 3ff15d6 Compare February 18, 2021 17:55

tqchen reviewed Feb 18, 2021

include/tvm/runtime/profiling.h

tkonolige added 2 commits February 18, 2021 13:30

improve docs around proper usage

7fa2c75

formatting

32f020f

tkonolige added 4 commits February 19, 2021 10:47

add seperate function to synchronize and read timing value

132e6fa

missed one timer

a78bd68

fix timer initialization in graph runtime debug

86c3786

switch to high resolution clock

0f04bf6

tqchen added the status: need review label Feb 22, 2021

areusch reviewed Feb 22, 2021

areusch requested changes Feb 22, 2021

tqchen reviewed Feb 22, 2021

tkonolige added 3 commits February 23, 2021 10:54

switch to object api

e339606

switch to object based approach

e2b3617

formatting

3a3ec38

tkonolige added 2 commits February 23, 2021 16:46

whoops

b764f53

forgot rocm

5f69c8d

altanh mentioned this pull request Feb 24, 2021

[Profiling] Unify profiling into standardized formats/APIs #7523

Closed

fixes

a37bf4a

areusch reviewed Feb 25, 2021


tkonolige added 4 commits February 25, 2021 09:24

missed include

b34847c

rename SyncAndGetTime to SyncAndGetElapsedNanos

15f8a7d

rewording

580d671

doc vs time_evaluator

4559634

areusch approved these changes Feb 25, 2021


missed a rename

88577cc

tqchen reviewed Mar 1, 2021

include/tvm/runtime/profiling.h

tqchen reviewed Mar 1, 2021

include/tvm/runtime/profiling.h

tqchen reviewed Mar 1, 2021

include/tvm/runtime/profiling.h

tqchen requested changes Mar 1, 2021

include/tvm/runtime/profiling.h

tqchen added the status: need update need update based on feedbacks label Mar 1, 2021

tkonolige added 2 commits March 1, 2021 15:45

move timer start method into Timer class

718556e

missed some StartTimers

0e2a4f4

hzfan reviewed Mar 2, 2021


tkonolige added 2 commits March 2, 2021 08:33

use CUDA_CALL and ROCM_CALL

78c8d14

example of registering a cpu timer

2702ed3

hzfan approved these changes Mar 4, 2021

tqchen reviewed Mar 4, 2021

include/tvm/runtime/profiling.h

tqchen reviewed Mar 4, 2021

tkonolige added 4 commits March 4, 2021 10:25

docs

71df7c9

more docs

83d711f

forgot to update timer usage

a12bd7d

update test

9a401fc

tqchen approved these changes Mar 5, 2021


tqchen merged commit d6c0cea into apache:main Mar 5, 2021

tqchen added status: accepted and removed status: need review status: need update need update based on feedbacks labels Mar 5, 2021

tkonolige mentioned this pull request Mar 10, 2021

[RUNTIME] Switch time evaluator to use device specific timing. #7631

Merged

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021

[RUNTIME] Add device specific timers (apache#7472)

2c25a05

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021

[RUNTIME] Add device specific timers (apache#7472)

dc86537
