Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_halide_buffer_crop() needs to check for runtime failures #6402

Closed
wants to merge 2 commits into from

Conversation

steven-johnson
Copy link
Contributor

We currently assume that _halide_buffer_crop() will never fail. This is a bad assumption, as it can call device_crop(), which can fail due to unexpected runtime errors, or from a backend simply leaving the device_crop field at the default (unimplemented) case (as is currently the case for the OGLC backend).

When this happens, the dst buffer was left in an inconsistent, invalid state (which was what led to the crashes fixed by #6401).

This change modifies _halide_buffer_crop() to return nullptr in the event of an error, and to wrap all callers in a require() clause to check the result. (This is not optimal, of course, since the specific error returned by device_crop is getting dropped on the floor, but the existence of an error is no longer ignored.)

This addresses at least some of the failure issues we are seeing in performance_async_gpu with the OpenGLCompute backend.

(Also: drive-by whitespace fix in CodegenC)

We currently assume that _halide_buffer_crop() will never fail. This is a bad assumption, as it can call device_crop(), which can fail due to unexpected runtime errors, or from a backend simply leaving the `device_crop` field at the default (unimplemented) case (as is currently the case for the OGLC backend).

When this happens, the `dst` buffer was left in an inconsistent, invalid state (which was what led to the crashes fixed by #6401).

This change modifies _halide_buffer_crop() to return nullptr in the event of an error, and to wrap all callers in a `require()` clause to check the result. (This is not optimal, of course, since the specific error returned by device_crop is getting dropped on the floor, but the existence of an error is no longer ignored.)

This addresses at least some of the failure issues we are seeing in performance_async_gpu with the OpenGLCompute backend.

(Also: drive-by whitespace fix in CodegenC)
@@ -103,7 +103,10 @@ class FoldStorageOfFunction : public IRMutator {
Expr new_mins = Call::make(type_of<int *>(), Call::make_struct, mins, Call::Intrinsic);
vector<Expr> new_args = op->args;
new_args[3] = new_mins;
expr = Call::make(op->type, op->name, new_args, op->call_type);
internal_assert(op->type == type_of<halide_buffer_t *>() &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mutates an existing buffer_crop call, so isn't this going to redundantly wrap it in another require?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yes, I suppose it could -- let me see if it will get optimized away or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gotcha here is that this code transmutes _halide_buffer_crop() -> _halide_buffer_set_bounds(_halide_buffer_crop()); if we leave out the require(), we can risk passing nullptr to the enclosing call. I suppose we could just make that call check for an incoming nullptr and bail early...

Copy link
Member

@abadams abadams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other runtime calls are checked using assertstmts after the lets, not by wrapping a require. For consistency, I think it would be better to add a non-null assertstmts for each entry in cropped_buffers to the pre_call stmt in the loop on line ~790 (edit: in ScheduleFunctions.cpp)

@steven-johnson
Copy link
Contributor Author

The other runtime calls are checked using assertstmts after the lets, not by wrapping a require.

Not so: look at the code you just commented on, StorageFolding.cpp:132

@abadams
Copy link
Member

abadams commented Nov 9, 2021

Fair enough, I guess both idioms are in use.

steven-johnson added a commit that referenced this pull request Nov 9, 2021
(Alternate to #6402)

We currently assume that _halide_buffer_crop() will never fail. This is a bad assumption, as it can call device_crop(), which can fail due to unexpected runtime errors, or from a backend simply leaving the device_crop field at the default (unimplemented) case (as is currently the case for the OGLC backend).

When this happens, the dst buffer was left in an inconsistent, invalid state (which was what led to the crashes fixed by #6401).

This change modifies _halide_buffer_crop() to return nullptr in the event of an error, and ensure that all cropped buffers are checked for null at the right point. (This is not optimal, of course, since the specific error returned by device_crop is getting dropped on the floor, but the existence of an error is no longer ignored.)

This addresses at least some of the failure issues we are seeing in performance_async_gpu with the OpenGLCompute backend.

(Also: drive-by whitespace fix in CodegenC)
@steven-johnson
Copy link
Contributor Author

The other runtime calls are checked using assertstmts after the lets, not by wrapping a require. For consistency, I think it would be better to add a non-null assertstmts for each entry in cropped_buffers to the pre_call stmt in the loop on line ~790 (edit: in ScheduleFunctions.cpp)

Please see #6403

@steven-johnson
Copy link
Contributor Author

Closing in favor of #6403

steven-johnson added a commit that referenced this pull request Nov 11, 2021
* _halide_buffer_crop() needs to check for runtime failures (v2)

(Alternate to #6402)

We currently assume that _halide_buffer_crop() will never fail. This is a bad assumption, as it can call device_crop(), which can fail due to unexpected runtime errors, or from a backend simply leaving the device_crop field at the default (unimplemented) case (as is currently the case for the OGLC backend).

When this happens, the dst buffer was left in an inconsistent, invalid state (which was what led to the crashes fixed by #6401).

This change modifies _halide_buffer_crop() to return nullptr in the event of an error, and ensure that all cropped buffers are checked for null at the right point. (This is not optimal, of course, since the specific error returned by device_crop is getting dropped on the floor, but the existence of an error is no longer ignored.)

This addresses at least some of the failure issues we are seeing in performance_async_gpu with the OpenGLCompute backend.

(Also: drive-by whitespace fix in CodegenC)

* Oops
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants