Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[METAL] Fix issue with GPU fails #7819

Merged
merged 6 commits into from
Apr 16, 2021

Conversation

echuraev
Copy link
Contributor

@echuraev echuraev commented Apr 9, 2021

Added first run to auto scheduler. This run is necessary for checking
that the generated kernel is correct. When we just run time evaluator
with incorrect kernel then it is possible that our application on iOS
device will be added to ignore list because of big number of committed
incorrect kernels. One run before running auto scheduling helps us to
avoid this problem.

Added complete handlers to all command buffers in Metal runtime. It
helps to handle GPU errors and report about this error to the host
application.

In case when error happened, we have to create a new stream. Added
mechanism for error handling and streams creating from python interface.

Thanks for contributing to TVM! Please refer to guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

@echuraev echuraev force-pushed the echuraev/fix_gpu_fails_add_stream branch from 14689c8 to 27cd682 Compare April 9, 2021 20:19
@echuraev echuraev force-pushed the echuraev/fix_gpu_fails_add_stream branch 3 times, most recently from 4bcf1c6 to 0a3243f Compare April 12, 2021 06:15
@jwfromm jwfromm requested a review from merrymercy April 12, 2021 16:04
@echuraev echuraev force-pushed the echuraev/fix_gpu_fails_add_stream branch 2 times, most recently from ee0e150 to 0c5c207 Compare April 13, 2021 06:17
Added first run to auto scheduler. This run is necessary for checking
that the generated kernel is correct. When we just run time evaluator
with incorrect kernel then it is possible that our application on iOS
device will be added to ignore list because of big number of committed
incorrect kernels. One run before running auto scheduling helps us to
avoid this problem.

Added complete handlers to all command buffers in Metal runtime. It
helps to handle GPU errors and report about this error to the host
application.

In case when error happened, we have to create a new stream. Added
mechanism for error handling and streams creating from python interface.
@echuraev echuraev force-pushed the echuraev/fix_gpu_fails_add_stream branch from 0c5c207 to 932c2cd Compare April 13, 2021 14:25
src/runtime/minrpc/rpc_reference.h Outdated Show resolved Hide resolved
python/tvm/_ffi/runtime_ctypes.py Outdated Show resolved Hide resolved
python/tvm/_ffi/runtime_ctypes.py Outdated Show resolved Hide resolved
python/tvm/_ffi/runtime_ctypes.py Outdated Show resolved Hide resolved
@tqchen
Copy link
Member

tqchen commented Apr 14, 2021

also cc @masahi @csullivan @ZihengJiang please help to review this PR

@tqchen tqchen assigned masahi and unassigned merrymercy Apr 14, 2021
@masahi
Copy link
Member

masahi commented Apr 16, 2021

@echuraev please kick CI again

@masahi
Copy link
Member

masahi commented Apr 16, 2021

@tqchen blocked by your change request

@masahi masahi merged commit 899bc06 into apache:main Apr 16, 2021
@masahi
Copy link
Member

masahi commented Apr 16, 2021

thanks @echuraev @tqchen

mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Apr 22, 2021
* [METAL] Fix issue with GPU fails

Added first run to auto scheduler. This run is necessary for checking
that the generated kernel is correct. When we just run time evaluator
with incorrect kernel then it is possible that our application on iOS
device will be added to ignore list because of big number of committed
incorrect kernels. One run before running auto scheduling helps us to
avoid this problem.

Added complete handlers to all command buffers in Metal runtime. It
helps to handle GPU errors and report about this error to the host
application.

In case when error happened, we have to create a new stream. Added
mechanism for error handling and streams creating from python interface.

* Try to fix QEMU build

* Apply comment

* Apply comments and fix build

* Apply comments and fix lint

* Fix CI
@masahi
Copy link
Member

masahi commented May 2, 2021

hmm it seems this commit broke auto scheduling on vulkan. Removing the change in auto_scheduler/measure.py fixes it. I'll take a look but @echuraev do you know what could be wrong?

@tqchen
Copy link
Member

tqchen commented May 2, 2021

@masahi this could due to the stream management introduced in this PR(explicit call of set stream and new stream/free stream). I believe in vk we should always allocate and return an indicator of default stream

@masahi
Copy link
Member

masahi commented May 2, 2021

ok I see

void SetStream(Device dev, TVMStreamHandle stream) final {
LOG(FATAL) << "Not implemented";
return;

@tqchen
Copy link
Member

tqchen commented May 2, 2021

We can let new stream return nullptr, and implement setstream/freestream for nullptr(nop)

Lunderberg added a commit to Lunderberg/tvm that referenced this pull request May 3, 2021
…rations

rpc_runner_run interacts with stream handlers following PR apache#7819.
Vulkan currently executes adds everything into a single command buffer
per CPU thread, so there isn't a corresponding concept of streams.
Therefore, added no-op implementations for these DeviceAPI methods.
masahi pushed a commit that referenced this pull request May 4, 2021
…rations (#7969)

rpc_runner_run interacts with stream handlers following PR #7819.
Vulkan currently executes adds everything into a single command buffer
per CPU thread, so there isn't a corresponding concept of streams.
Therefore, added no-op implementations for these DeviceAPI methods.

Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>
umangyadav pushed a commit to umangyadav/tvm that referenced this pull request May 5, 2021
…rations (apache#7969)

rpc_runner_run interacts with stream handlers following PR apache#7819.
Vulkan currently executes adds everything into a single command buffer
per CPU thread, so there isn't a corresponding concept of streams.
Therefore, added no-op implementations for these DeviceAPI methods.

Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
* [METAL] Fix issue with GPU fails

Added first run to auto scheduler. This run is necessary for checking
that the generated kernel is correct. When we just run time evaluator
with incorrect kernel then it is possible that our application on iOS
device will be added to ignore list because of big number of committed
incorrect kernels. One run before running auto scheduling helps us to
avoid this problem.

Added complete handlers to all command buffers in Metal runtime. It
helps to handle GPU errors and report about this error to the host
application.

In case when error happened, we have to create a new stream. Added
mechanism for error handling and streams creating from python interface.

* Try to fix QEMU build

* Apply comment

* Apply comments and fix build

* Apply comments and fix lint

* Fix CI
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…rations (apache#7969)

rpc_runner_run interacts with stream handlers following PR apache#7819.
Vulkan currently executes adds everything into a single command buffer
per CPU thread, so there isn't a corresponding concept of streams.
Therefore, added no-op implementations for these DeviceAPI methods.

Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
* [METAL] Fix issue with GPU fails

Added first run to auto scheduler. This run is necessary for checking
that the generated kernel is correct. When we just run time evaluator
with incorrect kernel then it is possible that our application on iOS
device will be added to ignore list because of big number of committed
incorrect kernels. One run before running auto scheduling helps us to
avoid this problem.

Added complete handlers to all command buffers in Metal runtime. It
helps to handle GPU errors and report about this error to the host
application.

In case when error happened, we have to create a new stream. Added
mechanism for error handling and streams creating from python interface.

* Try to fix QEMU build

* Apply comment

* Apply comments and fix build

* Apply comments and fix lint

* Fix CI
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…rations (apache#7969)

rpc_runner_run interacts with stream handlers following PR apache#7819.
Vulkan currently executes adds everything into a single command buffer
per CPU thread, so there isn't a corresponding concept of streams.
Therefore, added no-op implementations for these DeviceAPI methods.

Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
* [METAL] Fix issue with GPU fails

Added first run to auto scheduler. This run is necessary for checking
that the generated kernel is correct. When we just run time evaluator
with incorrect kernel then it is possible that our application on iOS
device will be added to ignore list because of big number of committed
incorrect kernels. One run before running auto scheduling helps us to
avoid this problem.

Added complete handlers to all command buffers in Metal runtime. It
helps to handle GPU errors and report about this error to the host
application.

In case when error happened, we have to create a new stream. Added
mechanism for error handling and streams creating from python interface.

* Try to fix QEMU build

* Apply comment

* Apply comments and fix build

* Apply comments and fix lint

* Fix CI
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…rations (apache#7969)

rpc_runner_run interacts with stream handlers following PR apache#7819.
Vulkan currently executes adds everything into a single command buffer
per CPU thread, so there isn't a corresponding concept of streams.
Therefore, added no-op implementations for these DeviceAPI methods.

Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
* [METAL] Fix issue with GPU fails

Added first run to auto scheduler. This run is necessary for checking
that the generated kernel is correct. When we just run time evaluator
with incorrect kernel then it is possible that our application on iOS
device will be added to ignore list because of big number of committed
incorrect kernels. One run before running auto scheduling helps us to
avoid this problem.

Added complete handlers to all command buffers in Metal runtime. It
helps to handle GPU errors and report about this error to the host
application.

In case when error happened, we have to create a new stream. Added
mechanism for error handling and streams creating from python interface.

* Try to fix QEMU build

* Apply comment

* Apply comments and fix build

* Apply comments and fix lint

* Fix CI
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
…rations (apache#7969)

rpc_runner_run interacts with stream handlers following PR apache#7819.
Vulkan currently executes adds everything into a single command buffer
per CPU thread, so there isn't a corresponding concept of streams.
Therefore, added no-op implementations for these DeviceAPI methods.

Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>
@echuraev echuraev deleted the echuraev/fix_gpu_fails_add_stream branch September 23, 2021 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants