
HexagonOffload needs to call inject_hvx_lock_unlock() before lower_parallel_tasks() #6457

Closed

Conversation

@steven-johnson
Contributor

Not 100% sure this is the proper fix, but it's definitely the issue we are seeing with correctness_parallel_nested_1; the qurt_hvx_lock stuff only got injected for parallel For loops, but with the changes in factor_parallel_codegen, those are all gone by the time we hit that code. The solution here is to explicitly call inject_hvx_lock_unlock() on blocks that we are lowering as parallel tasks in HexagonOffload, but I need to convince myself we aren't inserting duplicate locks or unlocks now (or just getting lucky).
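As a rough sketch of the shape of the fix (not the literal diff; the header and exact signature of inject_hvx_lock_unlock() are assumed here):

#include "CodeGen_Hexagon.h"  // assumed to declare inject_hvx_lock_unlock()

using namespace Halide;
using namespace Halide::Internal;

// Sketch only: where HexagonOffload lowers a block as a parallel task,
// inject the HVX lock/unlock pairs on that body explicitly, since the
// parallel For loops that used to trigger the injection are gone by now.
Stmt lower_hexagon_parallel_task(Stmt body, const Target &target) {
    return inject_hvx_lock_unlock(body, target);
}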

@dsharletg
Contributor

dsharletg commented Dec 1, 2021

I am not sure this is the right fix. I thought the behavior at master was:

  • We lock HVX (via QURT) on function entry
  • Somewhere in the thread pool, prior to entering a parallel loop, we check the HVX lock state and pass it to the worker threads so they can call qurt_hvx_lock (via halide_do_task?)

However, I can't find the implementation of this. The thread pool/halide_do_par_for implementation has been refactored significantly since I remember implementing this.

I think after @zvookin's PR the behavior can instead be simplified, because now we should be locking HVX every time we enter a parallel loop body.

I suspect this may be buggy, because it's trying to re-lock HVX on the same thread where it is already locked. The solution is probably to unlock HVX prior to calling halide_do_par_for, then relock it after.

However, I now suspect that on master, we simply aren't calling qurt_hvx_lock at all on worker threads! I'm wondering if maybe we should just stop calling qurt_hvx_lock entirely? @pranavb-ca @aankit-ca IIRC locking HVX became optional at some point?

@dsharletg
Contributor

Actually, my suggestion of unlocking + relocking might not be safe, because maybe we can capture vector variables in the parallel loop closure? Maybe these vectors would be lost if we unlocked + relocked HVX?

@steven-johnson
Contributor Author

In theory, the behavior with this fix should be identical to what we did before; we're just doing the transformation to insert locks earlier, and the lock/unlock calls get redistributed along with the rest of the code in the loop. That said, this was already a messy thing that was hard to understand, so if this allows us to simplify the situation, I'm all for it -- I'm just not sure what the needed semantics for the lock/unlock are.

@steven-johnson steven-johnson marked this pull request as draft December 1, 2021 02:01
@steven-johnson steven-johnson changed the base branch from factor_parallel_codegen to master December 1, 2021 02:01
@steven-johnson
Contributor Author

(converted to draft so I can change the base to master to enable proper testing; don't land it this way)

@dsharletg
Contributor

OK, here's the logic I was looking for:

// Primarily, we do two things when we encounter a parallel for loop.
// First, we check if the parallel for loop uses_hvx and accordingly
// acquire_hvx_context i.e. acquire and release HVX locks.
// Then we insert a conditional unlock before the for loop, let's call
// this the prolog, and a conditional lock after the for loop which
// we shall call the epilog. So the code for a parallel loop that uses
// hvx should look like so.
//
// if (uses_hvx_var) {
//     halide_qurt_hvx_unlock();
// }
// parallel_for {
//     halide_qurt_hvx_lock();
//     ...
//     ...
//     halide_qurt_hvx_unlock();
// }
// if (uses_hvx_var) {
//     halide_qurt_hvx_lock();
// }
//
// When we move up to the enclosing scope we substitute the value of uses_hvx
// into the IR, which should convert the conditionals to constants.

The thing I missed is that it was moved from runtime logic to codegen logic at some point.
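For what it's worth, the folding that comment describes can be mimicked in standalone C++: once uses_hvx is a known constant, the prolog/epilog branches disappear at compile time. The *_stub functions below are placeholders I made up for the runtime calls named above, not the actual Halide implementation.

#include <cstdio>

void halide_qurt_hvx_lock_stub()   { std::puts("qurt_hvx_lock"); }
void halide_qurt_hvx_unlock_stub() { std::puts("qurt_hvx_unlock"); }

template<bool uses_hvx>
void run_parallel_loop() {
    if constexpr (uses_hvx) halide_qurt_hvx_unlock_stub();      // prolog
    for (int i = 0; i < 4; i++) {                               // stand-in for parallel_for
        if constexpr (uses_hvx) halide_qurt_hvx_lock_stub();    // per-task lock
        // ... loop body that uses HVX ...
        if constexpr (uses_hvx) halide_qurt_hvx_unlock_stub();  // per-task unlock
    }
    if constexpr (uses_hvx) halide_qurt_hvx_lock_stub();        // epilog
}

int main() {
    run_parallel_loop<true>();   // lock traffic present, branches folded away
    run_parallel_loop<false>();  // all lock traffic folds away
}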

@steven-johnson
Contributor Author

> OK, here's the logic I was looking for:

Right, and that's (still) what we're doing here, so I think we should be alright. That said, it may be possible to simplify this down further; I will take a look at doing so tomorrow and try to offer an alternate fix.

@steven-johnson
Contributor Author

@dsharletg @pranavb-ca @aankit-ca Are the specific rules for when qurt_hvx_lock()/qurt_hvx_unlock() need to be called documented? From a quick search through the code, I don't see any explicit docs about what needs to be protected this way; based on the current implementation of InjectHVXLocks, it looks like the rule is that any instruction that uses vector registers in Hexagon code needs to have this lock acquired, but it would be nice to be sure I'm guessing correctly...
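(For concreteness, here is a much-simplified sketch of that guessed rule; this is not the real InjectHVXLocks, and the "vector implies HVX" assumption is mine.)

#include "IR.h"
#include "IRVisitor.h"

namespace {

using namespace Halide;
using namespace Halide::Internal;

// Sketch: flag any IR that loads or stores vector values, on the
// assumption that vectors lower to HVX registers on Hexagon and thus
// must execute with the qurt_hvx_lock held.
class UsesHVX : public IRVisitor {
    using IRVisitor::visit;
    void check(Type t) {
        if (t.is_vector()) {
            uses_hvx = true;
        }
    }
    void visit(const Load *op) override {
        check(op->type);
        IRVisitor::visit(op);
    }
    void visit(const Store *op) override {
        check(op->value.type());
        IRVisitor::visit(op);
    }

public:
    bool uses_hvx = false;
};

}  // namespace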

@pranavb-ca
Contributor

Hey @steven-johnson, give me some time to get back to you. I'll try this afternoon to see if we can get rid of this call altogether. (Sorry for the delayed response; I have been out of the office for the last two weeks. I am out until tomorrow but might be able to try this this afternoon nonetheless.)

@dsharletg
Contributor

dsharletg commented Dec 1, 2021

I think we should keep the calls for now and work on removing them separately.

I think what we should do for now is rely on Zalman's refactor to insert qurt_hvx_lock calls everywhere needed. The problem is that we lost the ability to unlock HVX prior to calling halide_do_par_for.

The existing logic (and this PR preserves this) tries to be clever and avoid unlocking/relocking when possible. I think maybe we should just unconditionally unlock HVX prior to any halide_do_par_for call and relock it afterward (if needed). I doubt the overhead/cost of this will be significant relative to loops that are parallelized.
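A minimal sketch of that shape, using the public halide_do_par_for entry point; the halide_qurt_hvx_lock/unlock signatures are assumed from the codegen comment quoted earlier:

#include <cstdint>

typedef int (*halide_task_t)(void *user_context, int task_number, uint8_t *closure);
extern "C" int halide_do_par_for(void *user_context, halide_task_t task,
                                 int min, int size, uint8_t *closure);
// Assumed signatures for the sketch; see src/runtime for the real ones.
extern "C" int halide_qurt_hvx_lock(void *user_context);
extern "C" int halide_qurt_hvx_unlock(void *user_context);

int hvx_loop_body(void *user_context, int i, uint8_t *closure) {
    halide_qurt_hvx_lock(user_context);    // each worker acquires its own lock
    // ... HVX work for iteration i ...
    halide_qurt_hvx_unlock(user_context);
    return 0;
}

int launch(void *user_context, int min, int extent, uint8_t *closure) {
    halide_qurt_hvx_unlock(user_context);  // unconditionally release before fan-out
    int err = halide_do_par_for(user_context, hvx_loop_body, min, extent, closure);
    halide_qurt_hvx_lock(user_context);    // reacquire on the calling thread (if needed)
    return err;
}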

@steven-johnson
Contributor Author

> I think what we should do for now is rely on Zalman's refactor to insert qurt_hvx_lock calls everywhere needed. The problem is that we lost the ability to unlock HVX prior to calling halide_do_par_for.

Not sure I understand -- AFAICT, Zalman's refactor doesn't attempt to do anything like this right now; are you suggesting we modify it to do so?

> The existing logic (and this PR preserves this) tries to be clever and avoid unlocking/relocking when possible

Yeah, but it seems to have been working for quite a while now. Is there a downside to re-using it that I'm overlooking? Lack of verifiability?

@steven-johnson steven-johnson changed the base branch from master to factor_parallel_codegen December 1, 2021 19:10
@steven-johnson steven-johnson marked this pull request as ready for review December 1, 2021 19:10
@dsharletg
Contributor

> Not sure I understand -- AFAICT, Zalman's refactor doesn't attempt to do anything like this right now; are you suggesting we modify it to do so?

Zalman's PR causes the bodies of parallel loops to become separate functions, which get qurt_hvx_lock calls inserted by CodeGen_Hexagon just like any other function.
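An illustrative sketch (all function names hypothetical) of the shape this produces, and of how it can pair with the per-function lock already injected in the enclosing pipeline:

#include <cstdint>

extern "C" int halide_qurt_hvx_lock(void *user_context);
extern "C" int halide_qurt_hvx_unlock(void *user_context);

// Hypothetical closure emitted for a parallel loop body: being "just
// another function", it gets lock/unlock injected at entry/exit.
int pipeline_par_for_body(void *ucon, int i, uint8_t *closure) {
    halide_qurt_hvx_lock(ucon);
    // ... loop body ...
    halide_qurt_hvx_unlock(ucon);
    return 0;
}

// Hypothetical enclosing pipeline, which also locks at entry -- so a
// thread that runs the closure can call qurt_hvx_lock while already
// holding the lock it took here.
int pipeline(void *ucon) {
    halide_qurt_hvx_lock(ucon);
    // halide_do_par_for(ucon, pipeline_par_for_body, ...);
    halide_qurt_hvx_unlock(ucon);
    return 0;
}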

> Yeah, but it seems to have been working for quite a while now. Is there a downside to re-using it that I'm overlooking? Lack of verifiability?

Yes, it's complicated, and hard to be sure it is working as intended.

@steven-johnson
Contributor Author

> which get qurt_hvx_lock calls inserted by CodeGen_Hexagon just like any other function.

Ah, ok, I see now a case where we do end up with a double-lock getting inserted by this. (Oddly, the test still passes.) I'll withdraw this and try your suggestions.

@steven-johnson
Contributor Author

> unconditionally unlock HVX prior to any halide_do_par_for call and relock it afterward (if needed)

Speaking of which -- it looks like we currently don't attempt to do any of the locking for HVX code that uses async (i.e. halide_do_parallel_tasks); I presume we can continue to just drop these on the floor...

steven-johnson added a commit that referenced this pull request Dec 2, 2021
@steven-johnson steven-johnson deleted the srj/parallel-hvx-lock-fix branch December 2, 2021 00:42
steven-johnson added a commit that referenced this pull request Dec 14, 2021
…R lowering pass. (#6195)

* First cut at factoring parallel task compilation, including closure
generating and calling, into a normal IR to IR lowering pass.

Includes adding struct handling intrinsics to LLVM and C++ backends.

Still a work in progress.

* Fix formatting that got munged by emacs somehow.

* Checkpoint progress.

* Small fixes.

* Checkpoint progress.

* Checkpoint progress.

* Checkpoint progress.

* Checkpoint progress. Debugging code will be removed.

* Try a fix for make_typed_struct in C++ codegen.

* Another attempt to fix C++ codegen.

* Another C codegen fix.

* Checkpoint progress.

* Use make_typed_struct rather than make_struct to construct
closure. Ensure all types are carried through exactly the same to both
make_struct_type and make_typed_struct.

* Checkpoint.

* Uniqueify closure names because LLVM was doing that to function names.

* Small formatting cleanups.

Fixes to call graph checker. Disable tests related to this while
Andrew and I figure out how to get it to work across closures.

* Get generated C++ to compile via a combination of fixing types and
bludgeoning types into submission via subterfuge.

* Typo fix to a typo fix.

* Restore inadvertently deleted code.

* Rename make_struct_type to declare_struct_type.

* Add new file to CMake.

* Add fixes for Hexagon offload and any passes that might add additional
LoweredFunctions in the future.

* Add comment with a bit of info for the future.

* Typo fix.

* Don't duplicate the closure call to test the error return.

Don't declare halide_buffer_t type if there are no buffers in closure.

* Use _ucon in C++ code to get rid of constness casting ugliness.

* Change resolve_function_name intrinsic to use a Call node to designate
the function. This makes generating the C++ declaration in the C++
backend trivial.

Few more changes to type handling and naming.

* Small C++ backend output formatting change.

Don't generate For loops with no variable.

Update internal test for C++ output.

* Add halide_semaphore_acquire_t as a well known type for use inside compiler.

* Add handling for halide_semaphore_t allocation.

Formatting fixes.

* Fix type for halide_semaphore_t.

* Reapply C++ backend formatting fix.

* Add support for calling legacy halide_do_par_for runtime routine in
cases where it is valid.

* Formatting fixes.

* Format and tidy fixes.

* Attempt to pass formatting check.

* Fix last set of test failures.

* Formatting whitespace fixes.

* Update comments.

* Attempt to fix pointer cast error with some versions of LLVM.

* Another attempt at fixing bool compatibility casting.

* Another iteration.

* Remove likely useless extern argument check logic.

* Add hacky fix for losing global variables.

* Comment typo fixes.

* Remove no-longer-used Closure code from Codegen_Internal

* Remove unused MayBlock visitor class

* clang-tidy

* Attempt to fix parallel offloads for HVX

* Update parallel_nested_1.cpp

* Augment Closure debugging

* Add some std::move usage

* Fix hvx lock/unlock semantics for PR #6457 (#6462)

Fix qurt_hvx_lock issues

* Sort IntrinsicOp and corresponding names

* Remove unused `is_const_pointer()` function

* Minor hygiene in LowerParallelTasks

- normalize local functions to snake_case
- put all local functions & classes in anon namespace
- move MinThreads visitor to file scope to reduce nestedness of code

* use Closure::include

* Switch to PureIntrinsics per review feedback.

* Minor cleanup of parallel refactor intrinsics (#6465)

* Minor cleanup of parallel refactor intrinsics

- Renamed `load_struct_member` to `load_typed_struct_member` to make it more clear that it is intended for use only with the results of `make_typed_struct`.
- Split `declare_struct_type` into two intrinsics, `define_typed_struct` and `forward_declare_typed_struct`, removing the need for the underdocumented `mode` argument and hopefully making usage clearer
- Added / clarified comments for the intrinsics modified above

* Update comments

* Fix comments

* Update CodeGen_C.cpp

* Remove 'foo.buffer' from Closure entirely

This is a direct adaptation of what #6481 does for the old LLVM-based code, and allows elimination of one use of `get_pointer_or_null()`. PR is meant to be merged into factor_parallel_codegen, not master.

* Update LowerParallelTasks.cpp

* Keep track of task_parent inside LowerParallelTasks; remove no-longer-needed get_pointer_or_symbol() intrinsic (#6486)

* Fix potential issue with additional LoweredFuncs (#6490)

I think this is the right fix for this case; that said, I can't find a single instance in any of our test cases that actually triggers this.

* factor parallel codegen with fewer intrinsics (#6487)

* Rework some of parallel closure lowering to avoid some intrinsics

This version relies more heavily on the existing make_struct, and puts
function names in the Variable namespace as globals.

Co-authored-by: Steven Johnson <srj@google.com>

Co-authored-by: dsharletg <dsharlet@google.com>
Co-authored-by: Steven Johnson <srj@google.com>
Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>