Bounds checking on input images #1
ghost assigned abadams on Jul 31, 2012
jrk added a commit that referenced this issue on Aug 2, 2012
Modified codegen to use explicit entrypoint/args
Added. Seems to work.
abadams pushed a commit that referenced this issue on May 18, 2016
Tools::Image<T> constructs from array
abadams added a commit that referenced this issue on May 4, 2017
Two changes:

1) Use 32 free bits in the IRNode to store the IRNodeType, so that it can be gotten at by a load instead of having to call a virtual function.

2) Change things to a style where any function that's going to make a copy of an Expr takes it by value, and then does a std::move internally at its last use. This avoids a bunch of atomic ops and conditional branches in caller code in the case where you're passing in an rvalue (see the sketch below).

With these changes, lowering local laplacian gets about 12% faster (1.5s -> 1.33s). Most of the win is from change #1.

These are being done in advance of a planned change to simplify the simplifier to be more concise and use less stack space, so hopefully the fact that this represents a partial reversion of #1810 won't bite us.
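A minimal sketch of the by-value-plus-move convention described in change 2. The `Expr` here is a toy stand-in for Halide's refcounted IR handle, not the real class:

```cpp
#include <memory>
#include <utility>

// Toy stand-in for a refcounted IR handle like Halide's Expr.
struct Expr {
    std::shared_ptr<const int> node;
};

// The function needs its own copy of `e` anyway (to store inside the
// new node), so it takes the parameter by value and moves from it at
// its last use. A caller passing an lvalue pays for exactly one
// refcount bump at the call site; a caller passing an rvalue pays
// nothing, because the argument is move-constructed from the temporary.
Expr wrap(Expr e) {
    Expr result;
    result.node = std::move(e.node);  // last use of e: move, don't copy
    return result;
}

int main() {
    Expr a{std::make_shared<const int>(1)};
    Expr b = wrap(a);                   // lvalue: one atomic increment
    Expr c = wrap(wrap(std::move(b)));  // rvalues: no refcount traffic
    (void)c;
    return 0;
}
```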
Closed
This was referenced Jan 28, 2021
frengels pushed a commit to frengels/Halide that referenced this issue on Apr 30, 2021
Add support for unsigned tile operations
abadams added a commit that referenced this issue on Dec 7, 2021
This lets it save a few instructions on x86 and arm. cast(UInt(16), lerp(some_u8s)) produces the following, before and after this PR.

Before:

x86:

    vmovdqu (%r15,%r13), %xmm4
    vpmovzxbw -2(%r15,%r13), %ymm5
    vpxor %xmm0, %xmm4, %xmm6
    vpmovzxbw %xmm6, %ymm6
    vpmovzxbw -1(%r15,%r13), %ymm7
    vpmullw %ymm6, %ymm5, %ymm5
    vpmovzxbw %xmm4, %ymm4
    vpmullw %ymm4, %ymm7, %ymm4
    vpaddw %ymm4, %ymm5, %ymm4
    vpaddw %ymm1, %ymm4, %ymm4
    vpmulhuw %ymm2, %ymm4, %ymm4
    vpsrlw $7, %ymm4, %ymm4
    vpand %ymm3, %ymm4, %ymm4
    vmovdqu %ymm4, (%rbx,%r13,2)
    addq $16, %r13
    decq %r10
    jne .LBB0_10

arm:

    ldr q0, [x17]
    ldur q2, [x17, #-1]
    ldur q1, [x17, #-2]
    subs x0, x0, #1 // =1
    mvn v3.16b, v0.16b
    umull v4.8h, v2.8b, v0.8b
    umull2 v0.8h, v2.16b, v0.16b
    umlal v4.8h, v1.8b, v3.8b
    umlal2 v0.8h, v1.16b, v3.16b
    urshr v1.8h, v4.8h, #8
    urshr v2.8h, v0.8h, #8
    raddhn v1.8b, v1.8h, v4.8h
    raddhn v0.8b, v2.8h, v0.8h
    ushll v0.8h, v0.8b, #0
    ushll v1.8h, v1.8b, #0
    add x17, x17, #16 // =16
    stp q1, q0, [x18, #-16]
    add x18, x18, #32 // =32
    b.ne .LBB0_10

After:

x86:

    vpmovzxbw -2(%r15,%r13), %ymm3
    vmovdqu (%r15,%r13), %xmm4
    vpxor %xmm0, %xmm4, %xmm5
    vpmovzxbw %xmm5, %ymm5
    vpmullw %ymm5, %ymm3, %ymm3
    vpmovzxbw -1(%r15,%r13), %ymm5
    vpmovzxbw %xmm4, %ymm4
    vpmullw %ymm4, %ymm5, %ymm4
    vpaddw %ymm4, %ymm3, %ymm3
    vpaddw %ymm1, %ymm3, %ymm3
    vpmulhuw %ymm2, %ymm3, %ymm3
    vpsrlw $7, %ymm3, %ymm3
    vmovdqu %ymm3, (%rbp,%r13,2)
    addq $16, %r13
    decq %r10
    jne .LBB0_10

arm:

    ldr q0, [x17]
    ldur q2, [x17, #-1]
    ldur q1, [x17, #-2]
    subs x0, x0, #1 // =1
    mvn v3.16b, v0.16b
    umull v4.8h, v2.8b, v0.8b
    umull2 v0.8h, v2.16b, v0.16b
    umlal v4.8h, v1.8b, v3.8b
    umlal2 v0.8h, v1.16b, v3.16b
    ursra v4.8h, v4.8h, #8
    ursra v0.8h, v0.8h, #8
    urshr v1.8h, v4.8h, #8
    urshr v0.8h, v0.8h, #8
    add x17, x17, #16 // =16
    stp q1, q0, [x18, #-16]
    add x18, x18, #32 // =32
    b.ne .LBB0_10

So on x86 we skip a pointless `and` instruction, and on ARM we get a rounding add and shift right instead of a rounding narrowing add shift right followed by a widen.
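For context, a sketch of the kind of expression under discussion, in current front-end syntax. The names and the tap positions (x, x-1, x-2) are guesses reconstructed from the loads in the listings above, not taken from the commit:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam in(UInt(8), 1);  // a u8 input
    Var x;
    Func out;
    // A u8 lerp whose result is immediately widened to u16. With this
    // change, lerp lowering folds the final cast in, rather than
    // narrowing to u8 (raddhn) and then widening again (ushll).
    out(x) = cast(UInt(16), lerp(in(x), in(x - 1), in(x - 2)));
    out.vectorize(x, 16);
    return 0;
}
```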
abadams added a commit that referenced this issue on Dec 10, 2021
* Let lerp lowering incorporate a final cast
* Add test
* Fix bug in test
* Don't produce out-of-range lerp values
steven-johnson added a commit that referenced this issue on Dec 16, 2021
This makes an assumption that ~all of the calls to `error()` in the runtime have descriptive text that is generally only useful for developers, and leaving this text in release builds is of limited-to-no use. On that assumption:

- Add a new "runtime internal" error code that is a catch-all for these cases
- Add a new `_halide_runtime_error()` error function, which is a smart wrapper that discards all its arguments in non-debug runtimes (see the sketch below)
- Convert runtime/cuda.cpp as a proof-of-concept of possible code savings. On my OSX box, `bin/host-cuda/runtime.a` is 174128 bytes at current top-of-tree, and 163160 bytes with this PR in place. Extending this to the rest of the runtime would likely get us down to ~140k if the estimates in #6474 are correct.

Note #1: You could achieve (nearly) the same thing by just changing `error()` to be special-cased for DEBUG_RUNTIME, but this formulation is (IMHO) slightly cleaner, since it also allows us to return the error result directly, rather than requiring two statements. It also provides a good excuse to do a once-over of all existing usage, which is probably worthwhile.

- As mentioned above, it basically drops useful text on the floor for release builds, on the assumption that developers can use a debug-runtime build for more details; this may be a terrible assumption. Thoughts?
- This PR makes no attempt to address the really-quite-loose bounds on what can be returned; e.g. there are lots of places we just return a Cuda error where (technically) a halide_error_code_t is expected. This doesn't seem to ever have been a real problem in practice, but it makes my spidey-sense tingle.
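A standalone sketch of the wrapper pattern being described. The function name, error-code value, and use of fprintf are hypothetical stand-ins, not the actual runtime declarations:

```cpp
#include <cstdio>

typedef int halide_error_code_t;
// Hypothetical catch-all code standing in for the new
// "runtime internal" error described above.
constexpr halide_error_code_t error_runtime_internal = -47;

#ifdef DEBUG_RUNTIME
// Debug runtimes keep the descriptive text.
template <typename... Args>
halide_error_code_t runtime_error(halide_error_code_t code,
                                  const char *fmt, Args... args) {
    fprintf(stderr, fmt, args...);
    fputc('\n', stderr);
    return code;
}
#else
// Release runtimes discard everything except the code, so the
// message strings can be dead-stripped from the binary.
template <typename... Args>
halide_error_code_t runtime_error(halide_error_code_t code,
                                  const char *, Args...) {
    return code;
}
#endif

// Because the wrapper returns the code, an error path is one statement
// instead of two (report, then return).
halide_error_code_t init_device(bool ok) {
    if (!ok) {
        return runtime_error(error_runtime_internal,
                             "cuda: device initialization failed (%d)", -3);
    }
    return 0;  // success
}

int main() {
    return init_device(true);
}
```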
brent-carmer referenced this issue in InteonCo/Halide on Feb 28, 2022
Rioffe dev: Python schedule generation
RootButcher referenced this issue in RootButcher/Halide-rustbinding-old on Apr 21, 2022
Auto make libruntime.a
steven-johnson added a commit that referenced this issue on Sep 23, 2022
This PR started out as a quick fix to add Python bindings for the `add_requirements` methods on Pipeline and Generator (which were missing), but expanded a bit to fix other issues as well:

- The implementation of `Generator::add_requirement` was subtly wrong, in that it only worked if you called the method after everything else in your `generate()` method. Now we accumulate requirements and insert them at the end, so you can call the method anywhere.
- We had C++ methods that took both an explicit `vector<Expr>` and also a variadic-template version, but the former required a mutable vector... and fixing this to not require that ended up creating ambiguity about which overloaded call to use. Added an ugly enable_if thing to resolve this (see the sketch after this list).

(Side note #1: overloading methods to have both templated and non-templated versions with the same name is probably something to avoid in the future.)

(Side note #2: we should probably think more carefully about using variadic templates in our public API in the future; we currently use them pretty heavily, but they tend to be messy and hard to reason about, IMHO.)
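A standalone sketch (C++17, with simplified types, not Halide's actual declarations) of the overload ambiguity and the enable_if fix:

```cpp
#include <type_traits>
#include <utility>
#include <vector>

// Simplified stand-in for Halide's Expr.
struct Expr {
    Expr() = default;
    Expr(int) {}
};

struct Pipeline {
    // Non-template overload: takes the requirements as a vector.
    void add_requirement(const std::vector<Expr> &reqs) {
        reqs_.insert(reqs_.end(), reqs.begin(), reqs.end());
    }

    // Variadic overload: add_requirement(e1, e2, ...). Without the
    // enable_if, a call passing a vector<Expr> can bind here as well
    // (with Args... = {vector<Expr>}), making overload resolution
    // ambiguous or outright wrong; requiring every argument to be
    // convertible to Expr keeps this template out of that case.
    template <typename... Args,
              typename = std::enable_if_t<
                  (std::is_convertible_v<Args, Expr> && ...)>>
    void add_requirement(Args &&...args) {
        add_requirement(std::vector<Expr>{std::forward<Args>(args)...});
    }

private:
    std::vector<Expr> reqs_;
};

int main() {
    Pipeline p;
    p.add_requirement(Expr(1), Expr(2));            // variadic version
    p.add_requirement(std::vector<Expr>{Expr(3)});  // vector version
    return 0;
}
```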
steven-johnson added a commit that referenced this issue on Sep 23, 2022
* add_requirement() maintenance
* tidy
* remove underscores
ardier pushed a commit to ardier/Halide-mutation that referenced this issue on Mar 3, 2024
* add_requirement() maintenance
* tidy
* remove underscores
We don't currently do any bounds checking on input images, which causes segfaults. One subtle way this triggers is if you vectorize something that accesses the input image, but the input image's extent is not a multiple of the vector width.

See test/cpp/input_image_bounds_check/test.cpp for code that triggers this bug.

We should add asserts in the function preamble that check this (conservatively).
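A sketch of the failure mode in today's front-end syntax. This is illustrative only: the 2012 API differed, and modern Halide now emits bounds assertions and shifts the tail vector inwards, so this is not a current repro:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Width 13: not a multiple of the vector width below.
    Buffer<uint8_t> input(13);
    Var x;
    Func f;
    f(x) = input(x) + 1;
    f.vectorize(x, 8);
    // Realizing 13 outputs naively issues 8-wide loads at x = 0 and
    // x = 8; the second load touches input(13..15), past the end of
    // the allocation. Without a preamble assert (or a tail-aware
    // schedule), that read can segfault, as described above.
    Buffer<uint8_t> out = f.realize({13});
    return 0;
}
```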