I would like to implement a filter on the GPU (namely `transpose`), but there already is a C implementation. I thought I could make C vs. OpenCL a property, but there is this `UFO_TASK_MODE_GPU` flag which I see in many OpenCL-based filters. Does it just mean that the filter won't be executed on OpenCL CPU devices? If so, I simply don't use the flag and I should be good to go. Anyway, can @matze tell me what the "best practices" are here?
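For context, this is roughly the pattern I mean — a minimal sketch from memory, so the exact vfunc and flag names should be checked against `ufo-task-iface.h` in ufo-core:

```c
/* Sketch only: my recollection of how a UFO task advertises where it may run.
 * The names get_mode, UFO_TASK_MODE_PROCESSOR and UFO_TASK_MODE_GPU should be
 * verified against ufo-core; the task name is a placeholder. */
static UfoTaskMode
ufo_transpose_task_get_mode (UfoTask *task)
{
    /* UFO_TASK_MODE_GPU seems to restrict scheduling to GPU devices; dropping
     * it (or using a CPU flag instead) would presumably allow CPU execution. */
    return UFO_TASK_MODE_PROCESSOR | UFO_TASK_MODE_GPU;
}
```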
Sure, one could say that if the filter can run on any device, I could implement it only in OpenCL, since it would run on the CPU as well, but then there is unnecessary host <-> device transfer overhead. Say I have a read -> transpose -> write pipeline: there I would of course use the pure C version, because otherwise I would need to transfer the data "up" and "down" again, and since the transpose is almost as fast as a plain copy, doing it on an OpenCL CPU device is pretty pointless. If I remember correctly, we could "map" the buffer or use `CL_MEM_USE_HOST_PTR` so that we don't need to transfer back and forth, but I don't know whether the framework supports that already.
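To illustrate the zero-copy idea I have in mind, here is a plain OpenCL sketch outside of UFO (names like `context`, `queue` and `host_data` are just placeholders): with `CL_MEM_USE_HOST_PTR`, mapping the buffer on a CPU device typically just hands back the original host allocation, so no transfer happens at all.

```c
#include <CL/cl.h>

/* Sketch only: wrap an existing host allocation with CL_MEM_USE_HOST_PTR and
 * map it. On an OpenCL CPU device this is usually zero-copy. Error handling
 * is kept minimal on purpose. */
static void
transpose_zero_copy_example (cl_context context, cl_command_queue queue,
                             float *host_data, size_t n)
{
    cl_int err;

    /* Wrap the existing host allocation instead of copying it to the device. */
    cl_mem buffer = clCreateBuffer (context,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    n * sizeof (float), host_data, &err);
    if (err != CL_SUCCESS)
        return;

    /* Map the buffer: on a CPU device this usually returns host_data itself,
     * so there is no "up" or "down" transfer. */
    float *mapped = clEnqueueMapBuffer (queue, buffer, CL_TRUE,
                                        CL_MAP_READ | CL_MAP_WRITE,
                                        0, n * sizeof (float),
                                        0, NULL, NULL, &err);
    if (err == CL_SUCCESS) {
        /* ... process the data through 'mapped' on the host here, or unmap
         * first and hand 'buffer' to a kernel ... */
        clEnqueueUnmapMemObject (queue, buffer, mapped, 0, NULL, NULL);
        clFinish (queue);
    }

    clReleaseMemObject (buffer);
}
```

Whether the UFO buffer layer can take this path internally is exactly what I am unsure about.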
It might also be interesting for optimizing graph execution to have only one task plus a property (or something similar) that the user can set to make their graph efficient. This property could be visible to the scheduler as well, which could then even optimize the graph for the user automatically (i.e., minimize data copying).