
Refine profiler and expose to Python. #7576

Merged

qingqing01 merged 8 commits into PaddlePaddle:develop on Jan 24, 2018

Conversation

@qingqing01 (Contributor) commented on Jan 16, 2018

Fix #7577

  • Add a profiler in the executor to record the time spent in each op.
  • Expose the profiler to Python.
  • The output in the unit test is as follows:
------------------------->     Profiling Report <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                            Calls       Total       Min.        Max.        Ave.
thread0::mul_grad                24          3.9817      0.030622    2.19622     0.165904
thread0::mul                     24          2.43597     0.018009    0.257821    0.101499
thread0::momentum                48          1.65413     0.012138    0.128806    0.0344611
thread0::softmax                 8           1.04434     0.129704    0.131114    0.130543
thread0::elementwise_add_grad    24          0.427291    0.01508     0.023408    0.0178038
thread0::elementwise_add         24          0.35214     0.009682    0.020287    0.0146725
thread0::cast                    32          0.286981    0.006007    0.0149      0.00896816
thread0::relu_grad               16          0.268582    0.013828    0.02076     0.0167864
thread0::top_k                   8           0.251921    0.02915     0.036       0.0314901
thread0::softmax_grad            8           0.196288    0.024375    0.024798    0.024536
thread0::sum                     16          0.172279    0.009126    0.012375    0.0107674
thread0::feed                    16          0.1705      0.00402     0.017667    0.0106562
thread0::relu                    16          0.132929    0.007018    0.009637    0.00830806
thread0::accuracy                8           0.116111    0.01412     0.015031    0.0145139
thread0::cross_entropy_grad      8           0.106584    0.012811    0.014423    0.013323
thread0::cross_entropy           8           0.096511    0.011705    0.012282    0.0120639
thread0::elementwise_div         8           0.089527    0.010935    0.01219     0.0111909
thread0::mean_grad               8           0.079708    0.008923    0.015726    0.0099635
thread0::fetch                   24          0.067049    0.001737    0.003392    0.00279371
thread0::mean                    8           0.05091     0.006111    0.006513    0.00636375
thread0::fill_constant           8           0.038575    0.004709    0.004976    0.00482187

@qingqing01 (Contributor, Author) commented on Jan 17, 2018

  1. Use with profiler.profiler('CPU', 'total') as prof to wrap the code to be profiled.
  2. profiler.reset_profiler() can be called to clear the previous records.

A simple usage is as follows:

import numpy as np
import paddle.v2.fluid as fluid            # paddle.fluid in later releases
import paddle.v2.fluid.profiler as profiler

# Build the network; the cost, predict and label layers are elided here.
image = fluid.layers.data(name='x', shape=[784], dtype='float32')
# ...
avg_cost = fluid.layers.mean(x=cost)
optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)
opts = optimizer.minimize(avg_cost)
accuracy = fluid.evaluator.Accuracy(input=predict, label=label)

place = fluid.CPUPlace()  # or fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
accuracy.reset(exe)

# Profile 10 iterations; drop the records of the first two warm-up iterations.
with profiler.profiler('CPU', 'total') as prof:
    for iter in range(10):
        if iter == 2:
            profiler.reset_profiler()
        x = np.random.random((32, 784)).astype("float32")
        y = np.random.randint(0, 10, (32, 1)).astype("int64")

        outs = exe.run(fluid.default_main_program(),
                       feed={'x': x, 'y': y},
                       fetch_list=[avg_cost] + accuracy.metrics)
        acc = np.array(outs[1])
        pass_acc = accuracy.eval(exe)

@kuke (Contributor) left a comment:

Great. Thx.

and GPU program.

Args:
state (string) : The profiler state, It should be 'CPU' or 'GPU'.
A reviewer (Contributor) commented:

I feel it is necessary to add more explanation about the state parameter. Users may be confused about why a state is needed after the place is given.

@qingqing01 (Author) replied:

Added more explanation here.


Args:
state (string) : The profiler state, It should be 'CPU' or 'GPU'.
sorted_key (string) : If None, the profiler results will be printed
A reviewer (Contributor) commented:

If None, the profiler results will be printed without sorting. -> If None, the profiling results will be printed in the order of first end time of events.

profiler results -> profiling results, the same below.

@qingqing01 (Author) replied:

Done.

without sorting. Otherwise, the profiler results will be sorted
by the this flag. This flag should be one of 'calls', 'total',
'max', 'min' or 'ave'.
The `calls` means sorting by the calling counter.
A reviewer (Contributor) commented:

the calling counter -> number of calls

@qingqing01 (Author) replied:

Done.

@contextmanager
def profiler(state, sorted_key=None):
"""The profiler interface.
Different from cuda_profiler, this fuction can be used to profile both CPU
A reviewer (Contributor) commented:

function -> profiler
Maybe CPU and GPU program -> CPU and GPU operator kernels is better.

@qingqing01 (Author) replied:

Modified the comments.


platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
auto dev_ctx = const_cast<platform::DeviceContext*>(pool.Get(place_));
platform::RecordEvent record_event(op->Type(), dev_ctx);
@chengduoZH (Contributor) commented:

RecordEvent should not always be called, only when ProfilerState is not kDisabled.

@qingqing01 (Author) replied:

@chengduoZH Thanks for your review. Whether to record the timeline is judged in the constructor of RecordEvent: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/profiler.cc#L130

ParseEvents(all_events, sorted_key);
ResetProfiler();
}

A reviewer (Contributor) commented:

I think there should be a ProfilerState GetProfilerState(), and this function should be called in executor.cc.

@qingqing01 (Author) replied:

As in the reply above: the fewer the operations, the better.

@qingqing01 (Contributor, Author) left a comment:

@kuke @chengduoZH Thanks!


        self.profiler('CPU')

    def not_test_cuda_profiler(self):
        self.profiler('GPU')
A reviewer (Contributor) commented:

The profiler is not executed during the unit tests, right?

@qingqing01 (Author) replied:

Fixed. Thanks!
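To show how the not_test_ prefix in the hunk above keeps the CUDA case out of the regular unit-test run, here is a rough, hypothetical reconstruction of the test layout. Only the two method names shown in the diff are taken from the PR; the rest (including test_cpu_profiler, the class name, and the tiny program) is an illustrative guess:

import unittest
import numpy as np
import paddle.v2.fluid as fluid            # paddle.fluid in later releases
import paddle.v2.fluid.profiler as profiler

# A tiny program shared by both cases (illustrative only).
x = fluid.layers.data(name='x', shape=[784], dtype='float32')
hidden = fluid.layers.fc(input=x, size=10)

class TestProfiler(unittest.TestCase):
    def profiler(self, state):
        # Shared helper: run the tiny program a few times under the profiler.
        place = fluid.CPUPlace() if state == 'CPU' else fluid.CUDAPlace(0)
        exe = fluid.Executor(place)
        exe.run(fluid.default_startup_program())
        with profiler.profiler(state, 'total'):
            for i in range(5):
                data = np.random.random((32, 784)).astype('float32')
                exe.run(fluid.default_main_program(),
                        feed={'x': data},
                        fetch_list=[hidden])

    def test_cpu_profiler(self):        # hypothetical name for the CPU case
        self.profiler('CPU')

    def not_test_cuda_profiler(self):   # the not_ prefix hides it from unittest discovery
        self.profiler('GPU')

if __name__ == '__main__':
    unittest.main()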

@kuke (Contributor) left a comment:

Maybe the comments need a bit of improvement. Thx.

to add more records.

Args:
state (string) : The profiling state, It should be 'CPU' or 'GPU'.
A reviewer (Contributor) commented:

-> The profiling state, which should be CPU or GPU, telling the profiler to use CPU timer or GPU timer for profiling.

@qingqing01 (Author) replied:

Done.


Args:
state (string) : The profiling state, It should be 'CPU' or 'GPU'.
Although users may define CPUPlace or CUDAPlace when using Fluid,
@kuke (Contributor) commented on Jan 23, 2018:

Although users may define CPUPlace or CUDAPlace when using Fluid, the profiler doesn't get the state based on this Place. Since the implementation is an independent part from the Fluid.
->
Although users may have already specified the execution place (CPUPlace/CUDAPlace) in the beginning, for flexibility the profiler doesn't inherit this place.

@qingqing01 (Author) replied:

Done.

@kuke (Contributor) commented on Jan 23, 2018:

LGTM

@chengduoZH (Contributor) previously approved these changes on Jan 23, 2018, commenting: LGTM

@chengduoZH (Contributor) left a comment:

LGTM

@qingqing01 merged commit 750299f into PaddlePaddle:develop on Jan 24, 2018.
@qingqing01 deleted the profiling_py branch on November 14, 2019.