Bounds checking on input images #1

Closed
abadams opened this issue Jul 31, 2012 · 1 comment
abadams commented Jul 31, 2012

We don't currently do any bounds checking on input images, which can cause segfaults. One subtle way this triggers is when you vectorize something that accesses the input image but the input image's extent is not a multiple of the vector width.

See test/cpp/input_image_bounds_check/test.cpp for code that triggers this bug.

We should add asserts in the function preamble that check this (conservatively).
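
For illustration, a pipeline of roughly this shape exercises the problem. This is only a sketch using today's front-end names (ImageParam/Buffer), not the 2012 API; the actual repro is the test referenced above.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(8), 1);

    Func f("f");
    Var x("x");
    f(x) = input(x) + input(x + 1);  // small stencil on the input
    f.vectorize(x, 8);               // vectorized loads of the input

    Buffer<uint8_t> in_buf(16);
    input.set(in_buf);

    // Computing f over [0, 16) needs input over [0, 17), one element more
    // than in_buf holds. With bounds-checking asserts in the preamble this
    // fails with a clear error; without them the vectorized loads simply
    // run off the end of the allocation and can segfault.
    Buffer<uint8_t> out = f.realize({16});

    return 0;
}
```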

ghost assigned abadams Jul 31, 2012
jrk added a commit that referenced this issue Aug 2, 2012
Modified codegen to use explicit entrypoint/args

abadams commented Aug 5, 2012

Added. Seems to work.

abadams closed this as completed Aug 5, 2012
abadams pushed a commit that referenced this issue May 18, 2016
Tools::Image<T> constructs from array
abadams added a commit that referenced this issue May 4, 2017
Two changes:

1) Use 32 free bits in the IRNode to store the IRNodeType, so that it
can be read with a plain load instead of having to call a virtual function.

2) Change things to a style where any function that's going to make a
copy of an Expr takes it by value, and then does a std::move internally
at its last use. This avoids a bunch of atomic ops and conditional
branches in caller code in the case where you're passing in an rvalue
(see the sketch below).

With these changes, lowering local Laplacian gets about 12% faster (1.5s
-> 1.33s). Most of the win is from change #1.

These are being done in advance of a planned change to simplify the
simplifier to be more concise and use less stack space, so hopefully the
fact that this represents a partial reversion of #1810 won't bite us.
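
A sketch of both changes, using std::shared_ptr<const IRNode> as a stand-in for Halide's ref-counted Expr handle (not the actual Halide declarations):

```cpp
#include <memory>
#include <utility>

enum class IRNodeType { Add, Mul /* ... */ };

struct IRNode {
    // Change 1: the node type lives in a plain field of the base class,
    // so checking it is a load rather than a virtual call.
    IRNodeType node_type;
    explicit IRNode(IRNodeType t) : node_type(t) {}
    virtual ~IRNode() = default;
};

using Expr = std::shared_ptr<const IRNode>;  // stand-in for Halide's Expr

struct Add : IRNode {
    Expr a, b;
    // Change 2: take the Exprs by value and std::move them at their last
    // use, so an rvalue argument costs the caller no extra ref-count traffic.
    Add(Expr a_, Expr b_)
        : IRNode(IRNodeType::Add), a(std::move(a_)), b(std::move(b_)) {}
};

int main() {
    Expr x = std::make_shared<IRNode>(IRNodeType::Mul);  // arbitrary leaf stand-in
    Expr y = std::make_shared<IRNode>(IRNodeType::Mul);
    Add n(x, std::move(y));  // x is copied once; y is moved with no ref-count bump

    // Reading the type is now a load, not a virtual call:
    bool is_add = (n.node_type == IRNodeType::Add);
    (void)is_add;
    return 0;
}
```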
dkurt referenced this issue in dkurt/Halide Aug 30, 2017
Halide as OpenCL kernels generator
steven-johnson pushed a commit that referenced this issue Jul 17, 2018
frengels pushed a commit to frengels/Halide that referenced this issue Apr 30, 2021
Add support for unsigned tile operations
abadams added a commit that referenced this issue Dec 7, 2021
This lets it save a few instructions on x86 and arm.

`cast(UInt(16), lerp(some_u8s))` produces the following, before and after
this PR.
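
For concreteness, a pipeline of roughly this shape exercises that lowering path (a sketch: the buffer names are made up and today's front-end API is assumed):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam a(UInt(8), 1), b(UInt(8), 1), w(UInt(8), 1);

    Func f("f");
    Var x("x");
    // A u8 lerp whose result is immediately widened to u16; this PR lets
    // the lerp lowering absorb that final cast.
    f(x) = cast(UInt(16), lerp(a(x), b(x), w(x)));
    f.vectorize(x, 16);

    return 0;
}
```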

Before:

x86:

	vmovdqu	(%r15,%r13), %xmm4
	vpmovzxbw	-2(%r15,%r13), %ymm5
	vpxor	%xmm0, %xmm4, %xmm6
	vpmovzxbw	%xmm6, %ymm6
	vpmovzxbw	-1(%r15,%r13), %ymm7
	vpmullw	%ymm6, %ymm5, %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm7, %ymm4
	vpaddw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm1, %ymm4, %ymm4
	vpmulhuw	%ymm2, %ymm4, %ymm4
	vpsrlw	$7, %ymm4, %ymm4
	vpand	%ymm3, %ymm4, %ymm4
	vmovdqu	%ymm4, (%rbx,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10
arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	urshr	v1.8h, v4.8h, #8
	urshr	v2.8h, v0.8h, #8
	raddhn	v1.8b, v1.8h, v4.8h
	raddhn	v0.8b, v2.8h, v0.8h
	ushll	v0.8h, v0.8b, #0
	ushll	v1.8h, v1.8b, #0
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

After:

x86:

	vpmovzxbw	-2(%r15,%r13), %ymm3
	vmovdqu	(%r15,%r13), %xmm4
	vpxor	%xmm0, %xmm4, %xmm5
	vpmovzxbw	%xmm5, %ymm5
	vpmullw	%ymm5, %ymm3, %ymm3
	vpmovzxbw	-1(%r15,%r13), %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm4, %ymm3, %ymm3
	vpaddw	%ymm1, %ymm3, %ymm3
	vpmulhuw	%ymm2, %ymm3, %ymm3
	vpsrlw	$7, %ymm3, %ymm3
	vmovdqu	%ymm3, (%rbp,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10

arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	ursra	v4.8h, v4.8h, #8
	ursra	v0.8h, v0.8h, #8
	urshr	v1.8h, v4.8h, #8
	urshr	v0.8h, v0.8h, #8
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

So on x86 we skip a pointless `and` instruction, and on ARM we get a
rounding add and shift right instead of a rounding narrowing add-shift-right
followed by a widen.
abadams added a commit that referenced this issue Dec 10, 2021
* Let lerp lowering incorporate a final cast

* Add test

* Fix bug in test

* Don't produce out-of-range lerp values
steven-johnson added a commit that referenced this issue Dec 16, 2021
This makes the assumption that ~all of the calls to `error()` in the runtime have descriptive text that is generally only useful for developers, and that leaving this text in release builds is of limited to no use. On that assumption:
- Add a new "runtime internal" error code that is a catch-all for these cases
- Add a new `_halide_runtime_error()` error function, which is a smart wrapper that discards all its arguments in non-debug runtimes (sketched below)
- Convert runtime/cuda.cpp as a proof of concept of the possible code savings. On my OSX box, `bin/host-cuda/runtime.a` is 174128 bytes at current top-of-tree and 163160 bytes with this PR in place. Extending this to the rest of the runtime would likely get us down to ~140k if the estimates in #6474 are correct.

Note #1: You could achieve (nearly) the same thing by just changing `error()` to be special-cased for DEBUG_RUNTIME, but this formulation is (IMHO) slightly cleaner, since it also allows us to return the error result directly, rather than requiring two statements. It also provides a good excuse to do a once-over of all existing usage, which is probably worthwhile.

- As mentioned above, it basically drops useful text on the floor for release builds, on the assumption that developers can use a debug-runtime build for more details; this may be a terrible assumption. Thoughts?

- This PR makes no attempt to address the really-quite-loose bounds on what can be returned; e.g. there are lots of places we just return a Cuda error where (technically) a halide_error_code_t is expected; this doesn't seem to ever have been a real problem in practice, but it makes my spidey-sense tingle.
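
A sketch of the wrapper idea (stand-alone, not the actual runtime source; the function name and error-code value here are placeholders):

```cpp
#include <cstdio>

// Placeholder for the new catch-all "runtime internal" error code; the real
// value would live in the halide_error_code_t enum.
constexpr int error_code_runtime_internal = -1;

#ifdef DEBUG_RUNTIME
// Debug runtimes keep the descriptive text.
template<typename... Args>
int report_runtime_error(const char *fmt, Args... args) {
    std::fprintf(stderr, fmt, args...);
    std::fprintf(stderr, "\n");
    return error_code_runtime_internal;
}
#else
// Release runtimes ignore all the arguments; call sites still return the
// error result in a single statement.
template<typename... Args>
int report_runtime_error(const char *, Args...) {
    return error_code_runtime_internal;
}
#endif

int main() {
    // A call site in the style described: return the result directly.
    int err = report_runtime_error("cuCtxCreate failed: %d", 999);
    return err == 0 ? 0 : 1;
}
```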
brent-carmer referenced this issue in InteonCo/Halide Feb 28, 2022
Rioffe dev: Python schedule generation
RootButcher referenced this issue in RootButcher/Halide-rustbinding-old Apr 21, 2022
steven-johnson added a commit that referenced this issue Sep 23, 2022
This PR started out as a quick fix to add Python bindings for the `add_requirements` methods on Pipeline and Generator (which were missing), but expanded a bit to fix other issues as well:
- The implementation of `Generator::add_requirement` was subtly wrong, in that it only worked if you called the method after everything else in your `generate()` method. Now we accumulate requirements and insert them at the end, so you can call the method anywhere.
- We had C++ methods that took both an explicit `vector<Expr>` and a variadic-template version, but the former required a mutable vector... and fixing it to not require that created ambiguity about which overload to use. Added an ugly enable_if to resolve this (sketched below).

(Side note #1: overloading methods to have both templated and non-templated versions with the same name is probably something to avoid in the future.)

(Side note #2: we should probably think more carefully about using variadic templates in our public API in the future; we currently use them pretty heavily, but they tend to be messy and hard to reason about IMHO.)
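
A sketch of that enable_if disambiguation, with stand-in types rather than the real Halide declarations:

```cpp
#include <type_traits>
#include <utility>
#include <vector>

struct Expr { int v = 0; };  // stand-in for Halide::Expr

struct Pipeline {
    // Non-templated overload: the extra error args arrive as a const
    // vector (no longer a mutable one).
    void add_requirement(const Expr &condition,
                         const std::vector<Expr> &error_args = {}) {
        (void)condition;
        (void)error_args;  // record; inserted at the end of generate()
    }

    // Variadic version. Without the enable_if, passing a vector<Expr>
    // could match here too and become ambiguous; disabling it whenever the
    // first extra argument is itself a vector<Expr> keeps the overloads distinct.
    template<typename First, typename... Rest,
             typename = std::enable_if_t<
                 !std::is_same_v<std::decay_t<First>, std::vector<Expr>>>>
    void add_requirement(const Expr &condition, First &&first, Rest &&...rest) {
        add_requirement(condition,
                        std::vector<Expr>{std::forward<First>(first),
                                          std::forward<Rest>(rest)...});
    }
};

int main() {
    Pipeline p;
    Expr cond, a, b;
    p.add_requirement(cond, std::vector<Expr>{a, b});  // vector overload
    p.add_requirement(cond, a, b);                     // variadic overload
    p.add_requirement(cond);                           // defaulted empty vector
    return 0;
}
```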
steven-johnson added a commit that referenced this issue Sep 23, 2022
* add_requirement() maintenance

* tidy

* remove underscores
ardier pushed a commit to ardier/Halide-mutation that referenced this issue Mar 3, 2024
* add_requirement() maintenance

* tidy

* remove underscores