Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove halide_abort_if_false() usage in runtime/metal #6398

Merged
merged 2 commits into from
Nov 9, 2021

Conversation

steven-johnson
Copy link
Contributor

This converts all the usage of halide_abort_if_false() in runtime/metal into either an explicit runtime check-and-return-error-code (if the check looks plausible), or halide_debug_assert() (if the check seems to be stating an invariant that shouldn't be possible in well-structured code). These changes are admittedly subjective, so feedback is especially welcome.

Also, driveby change to sync-common.h to use halide_debug_assert() rather than a local equivalent.

This converts all the usage of `halide_abort_if_false()` in runtime/metal into either an explicit runtime check-and-return-error-code (if the check looks plausible), or `halide_debug_assert()` (if the check seems to be stating an invariant that shouldn't be possible in well-structured code). These changes are admittedly subjective, so feedback is especially welcome.

Also, driveby change to sync-common.h to use `halide_debug_assert()` rather than a local equivalent.
@@ -340,7 +340,7 @@ extern "C" {
// previous call (if any) has not yet been released via halide_release_metal_context.
WEAK int halide_metal_acquire_context(void *user_context, mtl_device **device_ret,
mtl_command_queue **queue_ret, bool create) {
halide_abort_if_false(user_context, &thread_lock != nullptr);
halide_debug_assert(user_context, &thread_lock != nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should actually trigger an abort, if I understand the check correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this could ever fail, TBH -- we're taking the address of a (weak) global var. How can that fail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're probably right-- it just looks strange to potentially deref a null pointer, even if it can't be null. If it really can't happen, I wonder if we should elide the check.

Copy link
Contributor

@shoaibkamil shoaibkamil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo nits & the one place where I think we should fail hard, LGTM

halide_abort_if_false(user_context, (device == nullptr) || (queue != nullptr));
if (device != nullptr && queue == nullptr) {
error(user_context) << "halide_metal_acquire_context: device initialized but queue is not.\n";
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: do we have a set policy on returning -1 vs returning an error code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the prevalence (here and elsewhere) of just returning -1 or similar, apparently not :-/

(This is something we really should improve; at least making a typedef for the return codes and attempting to return error values. But that's probably a separate issue.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, runtime functions that are only ever called by other runtime functions often return anything non-zero back upwards, and then the calling function selects an appropriate error code (e.g. if this acquire context failed while trying to do a device copy, we ultimately return a device_copy_failed error). Sometimes you need to know the calling context to return the appropriate error.

This convention sort of makes sense, but it's not very clear in the code that there even is a convention happening here. There's no reason this couldn't return an failed_to_acquire_context error or something, and then the calling function could replace the error code if it wants to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime functions that are only ever called by other runtime functions

there's also not a clean demarcation here -- I will bet you a dollar that we can find 'public' runtime functions that shortcut out in this way too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The demarcation is supposed to be that the exposed functions in device_interface.cpp are the entrypoints, and return real error codes. The specific backend implementations, and the default implementations in device_interface.cpp don't need to.

@@ -542,10 +555,11 @@ WEAK int halide_metal_initialize_kernels(void *user_context, void **state_ptr, c
mtl_library *library{};
if (!compilation_cache.kernel_state_setup(user_context, state_ptr, metal_context.device, library,
new_library_with_source, metal_context.device,
source, source_size)) {
source, source_size) ||
library == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This what clang-format did, let me see if I can pacify it otherwise

halide_abort_if_false(user_context, from_host || src->device);
halide_abort_if_false(user_context, to_host || dst->device);
if (!(from_host || src->device)) {
error(user_context) << "halide_metal_buffer_copy: impossible copy source\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: impossible -> invalid

@steven-johnson
Copy link
Contributor Author

The change on line 343 seems to be the only controversial thing?

@shoaibkamil
Copy link
Contributor

I'm ok to merge as is, honestly. I just find it a weird check to do if it can't fail.

@steven-johnson steven-johnson merged commit 8e34a35 into master Nov 9, 2021
@steven-johnson steven-johnson deleted the srj/runtime-metal branch November 9, 2021 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants