
[OpenCLML] CLML Profiling fixes corresponding to OpenCL Timer recent changes #12711

Merged: 3 commits merged into apache:main on Sep 8, 2022

Conversation

@srkreddy1238 (Contributor, Author)

Dynamic creation of the command queue (#11180) breaks OpenCLML profiling because OpenCLML stores a reference to the command queue and reuses it later.

Now we always access the command queue from the workspace rather than storing it internally.
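A minimal sketch of the pattern this change moves to (the workspace and tentry members follow the diff discussed below; the surrounding scaffolding is illustrative, not the exact patch):

// Instead of caching a cl_command_queue at init time, look it up from the
// OpenCL workspace on every invocation, so a queue recreated by the workspace
// (e.g. by the OpenCL timer) is always the one actually used.
void Run() {
  cl_command_queue queue = workspace->GetQueue(tentry->device);
  std::vector<cl_event>* evts = &(workspace->GetEventQueue(tentry->device));
  // ... enqueue CLML ops on `queue`, recording completion events into `evts` ...
}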


@masahi (Member) commented Sep 6, 2022

cc @echuraev

}
ICHECK(queue_found != false) << "Device queue not found in OpenCL Workspace";

this->queue = GetCommadQueue();
@echuraev (Contributor) commented Sep 6, 2022

Sorry, I have probably missed something; I have never looked at the OpenCLML part of TVM before. Could you please explain why you didn't reuse code from the OpenCL runtime? Wouldn't this pseudocode work the same way?

Suggested change (replacing this->queue = GetCommadQueue();):

void InitCLML() {
  // Setup CLML Context
  cl_int result = 0;
  // Initialize Context and Command Queue
  OpenCLWorkspace* workspace = OpenCLWorkspace::Global();
  workspace->Init();
  OpenCLThreadEntry* t = workspace->GetThreadEntry();
  cl_device_id did = workspace->devices[t->device.device_id];
  if (!ExtensionStringPresent(did)) {
    LOG(WARNING) << "CLML Runtime Init: Qualcomm extn not present.\n";
    return;
  }
  this->queue = workspace->GetQueue(t->device);

If it works, it won't be necessary to implement the GetCommadQueue function.

@srkreddy1238 (Contributor, Author)

Both approaches lead to the same issue.

this->queue = workspace->GetQueue(t->device);

The queue referenced here is getting recreated at https://github.com/apache/tvm/pull/11180/files#diff-783a8419915dfdfe242ef786bce4252f60f9d0decebea66d72af63d9cf9d0058R472, and the OpenCLML runtime is unaware of this recreation, so it continues to use a released command queue.
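To make the failure mode concrete, a simplified sketch (the local and member names are illustrative):

// At init time the CLML runtime caches the workspace's current queue:
this->queue = workspace->GetQueue(t->device);
// Later, the OpenCL timer (#11180) recreates the workspace queue with
// profiling enabled and releases the old one; the cached handle now dangles.
// Any subsequent enqueue on it targets a released command queue:
result = h_ClmlIntf->clEnqueueMLOpQCOM(this->queue, this->layer_.function[i],
                                       this->layer_.descriptorSet, 0, NULL, evt);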

@echuraev (Contributor)

Yes, I agree with you. What I wanted to say is that we could probably unify and reuse the existing code instead of writing new, similar functions.

Regarding the recreation: in recreateCommandQueue a new queue is created and stored in the vector of queues, and when we then call GetQueue, the new (recreated) queue is returned. I don't see any reason why we can't call cl::OpenCLWorkspace::Global()->GetQueue(device) in every place where we want to use the command queue, so that we always get the current queue. What do you think about that?

@srkreddy1238 (Contributor, Author)

Agreed. Let me clean this up a bit by holding a reference to the workspace instead of the context and queues. That keeps CLML unaffected by any changes to the workspace internals.

cl_command_queue queue = NULL;
std::vector<cl_event>* evts;
cl::OpenCLWorkspace* workspace = NULL;
cl::OpenCLThreadEntry* tentry = NULL;
@echuraev (Contributor)

nit: Do you really need a separate variable for ThreadEntry? You can get it anytime from OpenCLWorkspace.

@srkreddy1238 (Contributor, Author)

Just for convenience (it won't change anyway), to avoid the long path for accessing the queue.
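For illustration (a small sketch, using the members shown above):

// With the cached thread-entry pointer, a queue access is short:
cl_command_queue queue = workspace->GetQueue(tentry->device);
// Without it, every call spells out the full path through the workspace:
cl_command_queue queue2 = workspace->GetQueue(workspace->GetThreadEntry()->device);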

Comment on lines 489 to 497
cl_platform_id platform;
cl_device_id device_id;
result = clGetPlatformIDs(1, &platform, NULL);
ICHECK(result == CL_SUCCESS) << "clGetPlatformIDs:" << result;
uint32_t num_devices = 0;
result = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
ICHECK(result == CL_SUCCESS && num_devices == 1) << "clGetDeviceIDs:" << result;
result = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
ICHECK(device_id && result == CL_SUCCESS) << "clGetDeviceIDs:" << result;
@echuraev (Contributor)

Why do you need this code? You can get the device_id from the workspace:

cl_device_id did = workspace->devices[workspace->GetThreadEntry()->device.device_id];

I think the best solution is to extend the OpenCLWorkspace class with a method cl_device_id GetClDeviceId(Device dev). This method would also do all the necessary checks; I expect its implementation would be similar to that of GetQueue.

After that you'd be able to get the cl_device_id by calling this new method.
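For illustration only, a hypothetical GetClDeviceId modeled on the existing GetQueue (this method was ultimately not added by the PR; the exact checks, and the assumption that OpenCLWorkspace exposes Init, IsOpenCLDevice, and the devices vector this way, are not from the source):

// Hypothetical helper on OpenCLWorkspace, mirroring GetQueue:
cl_device_id OpenCLWorkspace::GetClDeviceId(Device dev) {
  this->Init();
  ICHECK(IsOpenCLDevice(dev)) << "Expected an OpenCL device, got device_type " << dev.device_type;
  ICHECK_LT(static_cast<size_t>(dev.device_id), devices.size())
      << "Invalid OpenCL device_id=" << dev.device_id;
  return devices[dev.device_id];
}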

@srkreddy1238 (Contributor, Author)

The device_id from the workspace is a sequence number; in our case it's always 0. What we need here is the real id from the driver.

@echuraev (Contributor)

The device_id from the workspace is a sequence number; in our case it's always 0.

I don't agree. You can also get the real id from the driver: in your code you query the number of available GPU devices and then get the id of one GPU device.

Please take a look at the Init method. We do exactly the same thing there and build a vector of real device ids from the driver, which we then store in the devices member, so you can extract the real device id from that vector.

When you call workspace->Init(), it also creates an OpenCL workspace for the GPU device. If no GPU devices are found, it prints a warning message. If you need only a GPU device, you can add a check/assert for it in the CLML-specific code.

The only reason I suggested adding a new GetClDeviceId method is that it could do all the necessary boundary and other checks.

@srkreddy1238 (Contributor, Author)

Yep, the workspace does hold the real id. Earlier I was looking at ThreadEntry->Device, which is a DLDevice and is always zero.
We always use the first available OpenCL device, so a new GetClDeviceId API may not be required at the moment.

@echuraev (Contributor)

I agree with you. I've added a new comment below; let's discuss this point there.

@@ -253,6 +230,8 @@ class CLMLRuntime : public JSONRuntimeBase {
*/
void Run() override {
cl_int result = 0;
cl_command_queue queue = workspace->GetQueue(tentry->device);
std::vector<cl_event>* evts = &(workspace->GetEventQueue(tentry->device));
@echuraev (Contributor)

Why do you need a pointer to the vector of events instead of a reference?

cl_event* evt = &(this->evts->back());
result = h_ClmlIntf->clEnqueueMLOpQCOM(queue, this->layer_.function[i],
                                       this->layer_.descriptorSet, 0, NULL, evt);
if (getenv("CLML_PROFILING")) {
@echuraev (Contributor)

What if we used the IsProfiling method instead of this environment variable? Or is it possible to have an OpenCL queue created with the profiling-enable option while we don't want to profile CLML?

@srkreddy1238 (Contributor, Author)

CLML profiling is about profiling the ML ops within the CLML subgraph (within BYOC). IsProfiling is controlled by OpenCLTimer whenever someone wants to profile OpenCL kernels (generated by the OpenCL codegen). CLML doesn't have any kernels (there are no clEnqueueNDRangeKernel calls here); instead it uses the extension API.

More details:

Ideally, CLML could have its own workspace (context & queue) and operate independently. The only dependency on TVM's OpenCL workspace is having the buffers allocated on the same queue, so that we can do a hardware-level copy when switching context from TVM's OpenCL subgraph to the CLML subgraph. Too tight an integration here could lead to unexpected functional breakage, since those who enhance the OpenCL runtime may not pay attention to CLML component dependencies.
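As a rough illustration of what CLML_PROFILING relies on (a sketch only, not the CLML runtime's actual reporting code; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE), the per-op time can be read back from the event recorded for each clEnqueueMLOpQCOM call using standard OpenCL event profiling:

#include <CL/cl.h>

// Sketch: nanoseconds spent executing one CLML op, read from its cl_event.
static cl_ulong OpDurationNs(cl_event evt) {
  cl_ulong start = 0, end = 0;
  clWaitForEvents(1, &evt);  // ensure the op has finished
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
  return end - start;
}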

@echuraev (Contributor)

Thank you for the clarification. In that case I agree with you; it makes no sense to use IsProfiling here.

@echuraev (Contributor) left a review comment

In general LGTM. One point to discuss below.

size_t reqd_size = 0;
result = clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, 0, NULL, &reqd_size);
result = clGetDeviceInfo(workspace->devices[0], CL_DEVICE_EXTENSIONS, 0, NULL, &reqd_size);
@echuraev (Contributor) commented Sep 7, 2022

I agree that in the current implementation device_id is always zero, so we can just take the first device in the vector.

But one point for discussion: I'm afraid that if we later change something in OpenCLWorkspace related to the list of devices, this piece of code could break. That's why I prefer to get the device as workspace->devices[workspace->GetThreadEntry()->device.device_id]. To make it safer and more readable, I proposed introducing a new method. What do you think?

@srkreddy1238 (Contributor, Author)

It's very unlikely for Snapdragon to have multiple GPUs (at least in the near future). Anyway, I've modified it to be generic.

@echuraev (Contributor) left a review comment

LGTM. Thanks!

@srkreddy1238 (Contributor, Author)

@masahi can you take a look and merge it?

@masahi merged commit 6be04d7 into apache:main on Sep 8, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request on Nov 25, 2022: [OpenCLML] CLML Profiling fixes corresponding to OpenCL Timer recent changes (apache#12711)

* [OpenCLML] CLML Profiling fixes corresponding to OpenCL Timer recent changes.
* [OpenCLML] Review comments.
* Review comment.