Bounds checking on input images #1

Closed
abadams opened this issue Jul 31, 2012 · 1 comment
abadams commented Jul 31, 2012

We don't currently do any bounds checking on input images, which can cause segfaults. One subtle way this triggers is when you vectorize something that accesses the input image but the input image's extent is not a multiple of the vector width.

See test/cpp/input_image_bounds_check/test.cpp for code that triggers this bug.

We should add asserts in the function preamble that check this (conservatively).
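
For illustration, a pipeline of roughly this shape exercises the problem. This is only a sketch using today's front-end names (ImageParam/Buffer), not the 2012 API; the actual repro is the test referenced above.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(8), 1);

    Func f("f");
    Var x("x");
    f(x) = input(x) + input(x + 1);  // small stencil on the input
    f.vectorize(x, 8);               // vectorized loads of the input

    Buffer<uint8_t> in_buf(16);
    input.set(in_buf);

    // Computing f over [0, 16) needs input over [0, 17), one element more
    // than in_buf holds. With bounds-checking asserts in the preamble this
    // fails with a clear error; without them the vectorized loads simply
    // run off the end of the allocation and can segfault.
    Buffer<uint8_t> out = f.realize({16});

    return 0;
}
```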

ghost assigned abadams Jul 31, 2012
jrk added a commit that referenced this issue Aug 2, 2012
Modified codegen to use explicit entrypoint/args

abadams commented Aug 5, 2012

Added. Seems to work.

abadams closed this as completed Aug 5, 2012
abadams pushed a commit that referenced this issue May 18, 2016
Tools::Image<T> constructs from array
abadams added a commit that referenced this issue May 4, 2017
Two changes:

1) Use 32 free bits in the IRNode to store the IRNodeType, so that it
can be read with a plain load instead of having to call a virtual function.

2) Change things to a style where any function that's going to make a
copy of an Expr takes it by value, and then does a std::move internally
at its last use. This avoids a bunch of atomic ops and conditional
branches in caller code in the case where you're passing in an rvalue
(see the sketch below).

With these changes, lowering local Laplacian gets about 12% faster (1.5s
-> 1.33s). Most of the win is from change #1.

These are being done in advance of a planned change to simplify the
simplifier to be more concise and use less stack space, so hopefully the
fact that this represents a partial reversion of #1810 won't bite us.
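
A sketch of both changes, using std::shared_ptr<const IRNode> as a stand-in for Halide's ref-counted Expr handle (not the actual Halide declarations):

```cpp
#include <memory>
#include <utility>

enum class IRNodeType { Add, Mul /* ... */ };

struct IRNode {
    // Change 1: the node type lives in a plain field of the base class,
    // so checking it is a load rather than a virtual call.
    IRNodeType node_type;
    explicit IRNode(IRNodeType t) : node_type(t) {}
    virtual ~IRNode() = default;
};

using Expr = std::shared_ptr<const IRNode>;  // stand-in for Halide's Expr

struct Add : IRNode {
    Expr a, b;
    // Change 2: take the Exprs by value and std::move them at their last
    // use, so an rvalue argument costs the caller no extra ref-count traffic.
    Add(Expr a_, Expr b_)
        : IRNode(IRNodeType::Add), a(std::move(a_)), b(std::move(b_)) {}
};

int main() {
    Expr x = std::make_shared<IRNode>(IRNodeType::Mul);  // arbitrary leaf stand-in
    Expr y = std::make_shared<IRNode>(IRNodeType::Mul);
    Add n(x, std::move(y));  // x is copied once; y is moved with no ref-count bump

    // Reading the type is now a load, not a virtual call:
    bool is_add = (n.node_type == IRNodeType::Add);
    (void)is_add;
    return 0;
}
```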
dkurt referenced this issue in dkurt/Halide Aug 30, 2017
Halide as OpenCL kernels generator
steven-johnson pushed a commit that referenced this issue Jul 17, 2018
frengels pushed a commit to frengels/Halide that referenced this issue Apr 30, 2021
Add support for unsigned tile operations
abadams added a commit that referenced this issue Dec 7, 2021
This lets it save a few instructions on x86 and arm.

`cast(UInt(16), lerp(some_u8s))` produces the following, before and after
this PR.
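
For concreteness, a pipeline of roughly this shape exercises that lowering path (a sketch: the buffer names are made up and today's front-end API is assumed):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam a(UInt(8), 1), b(UInt(8), 1), w(UInt(8), 1);

    Func f("f");
    Var x("x");
    // A u8 lerp whose result is immediately widened to u16; this PR lets
    // the lerp lowering absorb that final cast.
    f(x) = cast(UInt(16), lerp(a(x), b(x), w(x)));
    f.vectorize(x, 16);

    return 0;
}
```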

Before:

x86:

	vmovdqu	(%r15,%r13), %xmm4
	vpmovzxbw	-2(%r15,%r13), %ymm5
	vpxor	%xmm0, %xmm4, %xmm6
	vpmovzxbw	%xmm6, %ymm6
	vpmovzxbw	-1(%r15,%r13), %ymm7
	vpmullw	%ymm6, %ymm5, %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm7, %ymm4
	vpaddw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm1, %ymm4, %ymm4
	vpmulhuw	%ymm2, %ymm4, %ymm4
	vpsrlw	$7, %ymm4, %ymm4
	vpand	%ymm3, %ymm4, %ymm4
	vmovdqu	%ymm4, (%rbx,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10
arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	urshr	v1.8h, v4.8h, #8
	urshr	v2.8h, v0.8h, #8
	raddhn	v1.8b, v1.8h, v4.8h
	raddhn	v0.8b, v2.8h, v0.8h
	ushll	v0.8h, v0.8b, #0
	ushll	v1.8h, v1.8b, #0
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

After:

x86:

	vpmovzxbw	-2(%r15,%r13), %ymm3
	vmovdqu	(%r15,%r13), %xmm4
	vpxor	%xmm0, %xmm4, %xmm5
	vpmovzxbw	%xmm5, %ymm5
	vpmullw	%ymm5, %ymm3, %ymm3
	vpmovzxbw	-1(%r15,%r13), %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm4, %ymm3, %ymm3
	vpaddw	%ymm1, %ymm3, %ymm3
	vpmulhuw	%ymm2, %ymm3, %ymm3
	vpsrlw	$7, %ymm3, %ymm3
	vmovdqu	%ymm3, (%rbp,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10

arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	ursra	v4.8h, v4.8h, #8
	ursra	v0.8h, v0.8h, #8
	urshr	v1.8h, v4.8h, #8
	urshr	v0.8h, v0.8h, #8
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

So on x86 we skip a pointless `and` instruction, and on ARM we get a
rounding add and shift right instead of a rounding narrowing add-shift-right
followed by a widen.
abadams added a commit that referenced this issue Dec 10, 2021
* Let lerp lowering incorporate a final cast

* Add test

* Fix bug in test

* Don't produce out-of-range lerp values
steven-johnson added a commit that referenced this issue Dec 16, 2021
This makes the assumption that ~all of the calls to `error()` in the runtime have descriptive text that is generally only useful for developers, and that leaving this text in release builds is of limited to no use. On that assumption:
- Add a new "runtime internal" error code that is a catch-all for these cases
- Add a new `_halide_runtime_error()` error function, which is a smart wrapper that discards all its arguments in non-debug runtimes (sketched below)
- Convert runtime/cuda.cpp as a proof of concept of the possible code savings. On my OSX box, `bin/host-cuda/runtime.a` is 174128 bytes at current top-of-tree and 163160 bytes with this PR in place. Extending this to the rest of the runtime would likely get us down to ~140k if the estimates in #6474 are correct.

Note #1: You could achieve (nearly) the same thing by just changing `error()` to be special-cased for DEBUG_RUNTIME, but this formulation is (IMHO) slightly cleaner, since it also allows us to return the error result directly, rather than requiring two statements. It also provides a good excuse to do a once-over of all existing usage, which is probably worthwhile.

- As mentioned above, it basically drops useful text on the floor for release builds, on the assumption that developers can use a debug-runtime build for more details; this may be a terrible assumption. Thoughts?

- This PR makes no attempt to address the really-quite-loose bounds on what can be returned; e.g. there are lots of places we just return a Cuda error where (technically) a halide_error_code_t is expected; this doesn't seem to ever have been a real problem in practice, but it makes my spidey-sense tingle.
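
A sketch of the wrapper idea (stand-alone, not the actual runtime source; the function name and error-code value here are placeholders):

```cpp
#include <cstdio>

// Placeholder for the new catch-all "runtime internal" error code; the real
// value would live in the halide_error_code_t enum.
constexpr int error_code_runtime_internal = -1;

#ifdef DEBUG_RUNTIME
// Debug runtimes keep the descriptive text.
template<typename... Args>
int report_runtime_error(const char *fmt, Args... args) {
    std::fprintf(stderr, fmt, args...);
    std::fprintf(stderr, "\n");
    return error_code_runtime_internal;
}
#else
// Release runtimes ignore all the arguments; call sites still return the
// error result in a single statement.
template<typename... Args>
int report_runtime_error(const char *, Args...) {
    return error_code_runtime_internal;
}
#endif

int main() {
    // A call site in the style described: return the result directly.
    int err = report_runtime_error("cuCtxCreate failed: %d", 999);
    return err == 0 ? 0 : 1;
}
```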
brent-carmer referenced this issue in InteonCo/Halide Feb 28, 2022
Rioffe dev: Python schedule generation
RootButcher referenced this issue in RootButcher/Halide-rustbinding-old Apr 21, 2022
steven-johnson added a commit that referenced this issue Sep 23, 2022
This PR started out as a quick fix to add Python bindings for the `add_requirements` methods on Pipeline and Generator (which were missing), but expanded a bit to fix other issues as well:
- The implementation of `Generator::add_requirement` was subtly wrong, in that it only worked if you called the method after everything else in your `generate()` method. Now we accumulate requirements and insert them at the end, so you can call the method anywhere.
- We had C++ methods that took both an explicit `vector<Expr>` and a variadic-template version, but the former required a mutable vector... and fixing it to not require that created ambiguity about which overload to use. Added an ugly enable_if to resolve this (sketched below).

(Side note #1: overloading methods to have both templated and non-templated versions with the same name is probably something to avoid in the future.)

(Side note #2: we should probably think more carefully about using variadic templates in our public API in the future; we currently use them pretty heavily, but they tend to be messy and hard to reason about IMHO.)
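
A sketch of that enable_if disambiguation, with stand-in types rather than the real Halide declarations:

```cpp
#include <type_traits>
#include <utility>
#include <vector>

struct Expr { int v = 0; };  // stand-in for Halide::Expr

struct Pipeline {
    // Non-templated overload: the extra error args arrive as a const
    // vector (no longer a mutable one).
    void add_requirement(const Expr &condition,
                         const std::vector<Expr> &error_args = {}) {
        (void)condition;
        (void)error_args;  // record; inserted at the end of generate()
    }

    // Variadic version. Without the enable_if, passing a vector<Expr>
    // could match here too and become ambiguous; disabling it whenever the
    // first extra argument is itself a vector<Expr> keeps the overloads distinct.
    template<typename First, typename... Rest,
             typename = std::enable_if_t<
                 !std::is_same_v<std::decay_t<First>, std::vector<Expr>>>>
    void add_requirement(const Expr &condition, First &&first, Rest &&...rest) {
        add_requirement(condition,
                        std::vector<Expr>{std::forward<First>(first),
                                          std::forward<Rest>(rest)...});
    }
};

int main() {
    Pipeline p;
    Expr cond, a, b;
    p.add_requirement(cond, std::vector<Expr>{a, b});  // vector overload
    p.add_requirement(cond, a, b);                     // variadic overload
    p.add_requirement(cond);                           // defaulted empty vector
    return 0;
}
```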
steven-johnson added a commit that referenced this issue Sep 23, 2022
* add_requirement() maintenance

* tidy

* remove underscores
ardier pushed a commit to ardier/Halide-mutation that referenced this issue Mar 3, 2024
* add_requirement() maintenance

* tidy

* remove underscores