-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OpenCLML] CLML Profiling fixes corresponding to OpenCL Timer recent … #12711
Conversation
cc @echuraev |
d034839
to
efe2361
Compare
} | ||
ICHECK(queue_found != false) << "Device queue not found in OpenCL Workspace"; | ||
|
||
this->queue = GetCommadQueue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, probably I missed something. I have never looked at OpenCLML part of TVM before. Could you please explain why you didn't reuse code from OpenCL runtime?
Won't this pseudocode work in the same way?
this->queue = GetCommadQueue(); | |
void InitCLML() { | |
// Setup CLML Context | |
cl_int result = 0; | |
// Initialize Context and Command Queue | |
OpenCLWorkspace* workspace = OpenCLWorkspace::Global(); | |
workspace->Init(); | |
OpenCLThreadEntry* t = workspace->GetThreadEntry(); | |
cl_device_id did = workspace->devices[t->device.device_id]; | |
if (!ExtensionStringPresent(did)) { | |
LOG(WARNING) << "CLML Runtime Init: Qualcomm extn not present.\n"; | |
return; | |
} | |
this->queue = workspace->GetQueue(t->device); |
If it works, it won't be necessary to implement function GetCommadQueue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both approaches leads to the same issue.
this->queue = workspace->GetQueue(t->device);
The queue referenced here is getting recreated at https://github.com/apache/tvm/pull/11180/files#diff-783a8419915dfdfe242ef786bce4252f60f9d0decebea66d72af63d9cf9d0058R472
and OpenCLML runtime is unaware of this recreation and continues to use a released command queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with you. One thing that I wanted to say that probably we could unify and reuse existing code instead of writing new similar functions.
Speaking about recreation function. So, in function recreateCommandQueue
a new queue is creating and storing in the vector with queues. And when we call GetQueue
then a new (recreated) queue will be returned. I don't see any reasons why we cannot call cl::OpenCLWorkspace::Global()->GetQueue(device)
in all places where we'd like to use command queue and get a new queue. What do you think about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Let me cleanup a bit by holding ref. for workspace instead of context and queues. This makes CLML unaffected by any changes to workspace internals.
efe2361
to
03643db
Compare
03643db
to
8deb4be
Compare
cl_command_queue queue = NULL; | ||
std::vector<cl_event>* evts; | ||
cl::OpenCLWorkspace* workspace = NULL; | ||
cl::OpenCLThreadEntry* tentry = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do you really need a separate variable for ThreadEntry? You can get it anytime from OpenCLWorkspace
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for convenience (also it wont change anyway) to minimize long path of accessing queue.
cl_platform_id platform; | ||
cl_device_id device_id; | ||
result = clGetPlatformIDs(1, &platform, NULL); | ||
ICHECK(result == CL_SUCCESS) << "clGetPlatformIDs:" << result; | ||
uint32_t num_devices = 0; | ||
result = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices); | ||
ICHECK(result == CL_SUCCESS && num_devices == 1) << "clGetDeviceIDs:" << result; | ||
result = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL); | ||
ICHECK(device_id && result == CL_SUCCESS) << "clGetDeviceIDs:" << result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this code? You can get device_id
from the workspace
:
cl_device_id did = workspace->devices[workspace->GetThreadEntry()->device.device_id];
I think that the best solution is to extend class OpenCLWorkspace
with method cl_device_id GetClDeviceId(Device dev)
. This method also will do all necessary checks. I think, implementation of this method will be similar with implementation of GetQueue.
And after that you'll be able to get cl_device_id
by calling this new method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The device_id from workspace is a sequence number in our case it's always 0
. What we need here is the real id from driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The device_id from workspace is a sequence number in our case it's always 0.
I'm not agreed with you. You can also get real id from driver. In your code, you get the number of available GPU devices and then get id of one GPU device.
Please, take a look into the method Init. We do absolutely the same things and here we create a vector of real device ids from driver. And next we store these values in a member devices
. So you can extract the real device id from this vector.
When you call workspace->Init();
it will also create an OpenCL workspace for GPU device. In case if no GPU devices found, it will print a warning message. If you need only a GPU device, then you can add a check/assert for it in the CLML specific code.
Only one thing why I suggested adding a new method GetClDeviceId
because in this method we can do all necessary boundary and other checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, workspace has real id. Earlier I looked into ThreadEntry->Device, which is DLDevice and it's always zero.
We always use the first OpenCL device available and a new API GetClDeviceId
may not be required at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. Added new comment below. Let's discuss this moment there.
@@ -253,6 +230,8 @@ class CLMLRuntime : public JSONRuntimeBase { | |||
*/ | |||
void Run() override { | |||
cl_int result = 0; | |||
cl_command_queue queue = workspace->GetQueue(tentry->device); | |||
std::vector<cl_event>* evts = &(workspace->GetEventQueue(tentry->device)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need pointer on the vector with events instead of reference?
cl_event* evt = &(this->evts->back()); | ||
result = h_ClmlIntf->clEnqueueMLOpQCOM(queue, this->layer_.function[i], | ||
this->layer_.descriptorSet, 0, NULL, evt); | ||
if (getenv("CLML_PROFILING")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we will use method IsProfiling instead of this variable? Or it is a possible situation that we have an OpenCL queue created with profiling enable option, but we don't want to profile CLML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CLML profiling is about profiling the ML op's within CLML sub graph (within BYOC). isProfiling is controlled by OpenCLTimer when ever someone want to profile OpenCL kernels (Generated by OpenCL Codegen). CLML doesn't have any kernels (no clEnqueueNDRangeKernel calls here) instead it has extension API.
More details,
Ideally, CLML can have it's own workspace (context & queue) and operate. The only dependency on TVM's OpenCL workspace is to have the buffers allocated on same queue so that we can do hardware level copy while context switching from TVM's OpenCL sub graph to CLML subgraph. Too tight integration here may lead to unexpected functionality break as those who enhance OpenCL runtime may not pay attention to CLML component dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for clarification. In this case, I agree with you. It makes no sense to use IsProfiling
here.
6a978fb
to
110ff51
Compare
110ff51
to
8143a84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general LGTM. One point to discuss below.
size_t reqd_size = 0; | ||
result = clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, 0, NULL, &reqd_size); | ||
result = clGetDeviceInfo(workspace->devices[0], CL_DEVICE_EXTENSIONS, 0, NULL, &reqd_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you that in current implementation, device_id
is always zero. So, we can just get the first device in the vector.
But I have one point for a discussion. I'm afraid that if we change something in the OpenCLWorkspace
related to the list of the devices in the future, then we can break this piece of code. So, this is why I prefer to get the device in this way: workspace->devices[workspace->GetThreadEntry()->device.device_id]
. To make it more safe and readable, I proposed to introduce a new method. What do you think about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its very unlikely for Snapdragon to have multi GPU (at least near future). Anyway, I modified it to be generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
51ee37d
to
3f8131f
Compare
@masahi can you take a look and merge it ? |
apache#12711) * [OpenCLML] CLML Profiling fixes corresponding to OpenCL Timer recent changes. * [OpenCLML] Review comments. * * review comment
Dynamic creation of command queue (#11180) breaks OpenCLML profiling as OpenCLML stores ref. for command queue and reuse later.
Now, we access the command queue from work space always rather storing internally.
Thanks for contributing to TVM! Please refer to guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.