[METAL] Fix issue with GPU fails #7819
Added a first run to the auto scheduler. This run is needed to check that the generated kernel is correct. If we only run the time evaluator with an incorrect kernel, our application on an iOS device may be added to the ignore list because of the large number of committed incorrect kernels. A single run before auto scheduling helps us avoid this problem.

Added completion handlers to all command buffers in the Metal runtime. They let us handle GPU errors and report them to the host application. When an error happens, we have to create a new stream.

Added a mechanism for error handling and stream creation from the Python interface.
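The first-run check described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `FakeDevice`, `new_stream`, `run_kernel`, and `measure` are hypothetical names invented for the example, and the device is simulated in plain Python rather than going through the Metal runtime.

```python
import time


class KernelError(RuntimeError):
    """Raised when the (simulated) GPU reports a failed command buffer."""


class FakeDevice:
    """Hypothetical stand-in for a GPU device; not a TVM API."""

    def __init__(self):
        self.streams_created = 0
        self.stream = self.new_stream()

    def new_stream(self):
        # Each stream is just a numbered token in this simulation.
        self.streams_created += 1
        return self.streams_created

    def run_kernel(self, kernel):
        # A kernel is any callable; an exception models a GPU error
        # reported through a command-buffer completion handler.
        try:
            return kernel()
        except Exception as err:
            raise KernelError(str(err)) from err


def measure(device, kernel, repeat=3):
    """Run the kernel once to validate it, then time `repeat` runs.

    Returns (mean_seconds, None) on success or (None, error_message)
    if the validation run fails. On failure a fresh stream is created
    and the timed runs are skipped, so a broken kernel is committed
    only once.
    """
    try:
        device.run_kernel(kernel)  # validation run, not timed
    except KernelError as err:
        device.stream = device.new_stream()  # replace the failed stream
        return None, str(err)

    start = time.perf_counter()
    for _ in range(repeat):
        device.run_kernel(kernel)
    return (time.perf_counter() - start) / repeat, None
```

With this shape, a broken kernel reaches the device once instead of once per timed repetition, and the caller receives the error message together with a usable stream for the next candidate.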
Also cc @masahi @csullivan @ZihengJiang, please help review this PR.
@echuraev please kick CI again.
@tqchen blocked by your change request.
* [METAL] Fix issue with GPU fails
* Try to fix QEMU build
* Apply comment
* Apply comments and fix build
* Apply comments and fix lint
* Fix CI
Hmm, it seems this commit broke auto scheduling on Vulkan. Removing the change in …
@masahi this could be due to the stream management introduced in this PR (explicit calls to set stream and new stream/free stream). I believe that in Vulkan we should always allocate and return an indicator of the default stream.
ok I see tvm/src/runtime/vulkan/vulkan.cc Lines 397 to 399 in 46e0634
We can let new stream return nullptr, and implement set stream/free stream as no-ops for nullptr.
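The suggestion above can be sketched as follows. The class and method names are hypothetical stand-ins written in Python for illustration; they are not the actual TVM DeviceAPI C++ signatures. The idea is that a backend without streams returns a null handle from stream creation and treats setting or freeing that handle as a no-op. Rejecting any non-null handle is an extra assumption of this sketch, not something stated in the thread.

```python
class NoStreamDeviceAPI:
    """Hypothetical backend with no stream concept (illustration only)."""

    def create_stream(self):
        # None plays the role of nullptr: an indicator of the default stream.
        return None

    def set_stream(self, stream):
        # Only the default (null) stream exists, so there is nothing to switch.
        # Assumption of this sketch: any other handle is treated as an error.
        if stream is not None:
            raise ValueError("this backend only supports the default stream")

    def free_stream(self, stream):
        # Nothing was allocated for the default stream, so nothing to release.
        if stream is not None:
            raise ValueError("this backend only supports the default stream")


def run_with_fresh_stream(api, work):
    """Caller-side pattern: create, set, run, free.

    Works unchanged whether the backend has real streams or only
    the default one.
    """
    stream = api.create_stream()
    api.set_stream(stream)
    try:
        return work()
    finally:
        api.free_stream(stream)
```

Code that explicitly creates, sets, and frees streams can then run against such a backend without special cases.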
…rations (#7969) rpc_runner_run interacts with stream handlers following PR #7819. Vulkan currently adds everything into a single command buffer per CPU thread, so there is no corresponding concept of streams. Therefore, no-op implementations were added for these DeviceAPI methods. Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>