
More generic grouped grid reduction kernel #1740

Merged: 8 commits merged into devel on Jun 13, 2022

Conversation

@naoyam (Collaborator) commented on Jun 2, 2022

This PR generalizes the grouped grid reduction kernel with respect to the number of grouped reductions. Previously, only a two-way grouped kernel existed. The new kernel itself should work with an arbitrary number of inputs, but the underlying data structure, Tuple, still needs to be explicitly specialized for the number of values, which is currently limited to 8.

See FusionGroupAllreduce4, which groups 8 grid reductions into a single grouped grid reduction.

This PR is meant to allow more aggressive grouping of grid reductions, e.g., grouping across iterations.

There's still no support for Welford. Fusions with multiple Welford reductions would be unlikely, so horizontal grouping wouldn't be important, but there would be opportunities to group across iterations.
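For illustration, here is a minimal sketch (not the actual nvFuser runtime code) of why an explicit per-arity specialization caps the grouping factor; the member names `val0`/`val1` below are hypothetical:

```cpp
// Sketch only: a tuple type that needs one explicit specialization per
// number of values, which is why the grouping factor is capped (at 8 in
// the actual runtime). Member names are made up for illustration.
template <typename... DataTypes>
struct Tuple;  // primary template intentionally left undefined

template <typename T0>
struct Tuple<T0> {
  T0 val0;
};

template <typename T0, typename T1>
struct Tuple<T0, T1> {
  T0 val0;
  T1 val1;
};
// ... one specialization per supported arity, up to the maximum.

int main() {
  // Two grouped reductions with different element types share one aggregate.
  Tuple<float, double> init_vals{0.0f, 0.0};
  (void)init_vals;
  return 0;
}
```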

@naoyam requested a review from csarofeen on June 2, 2022 07:23
Comment on lines +892 to +900:

```diff
       RefTuple<DataTypes...> out,
       const ConstRefTuple<DataTypes...>& inp,
       VolatilePtrTuple<DataTypes...> global_work_buffer,
       const LocalTuple<DataTypes...>& init_val,
       int64_t* global_sync_buffer,
       void* shared_mem,
-      bool read_pred, // Prevent reading from out of bounds memory
-      bool write_pred) { // Prevent from writing out of bounds
+      const LocalTuple<BoolTypes...>& read_preds,
+      const LocalTuple<BoolTypes...>& write_preds,
+      Funcs... funcs) {
```
@naoyam (Collaborator, Author) commented on Jun 2, 2022:

This is the main change of this PR. Each tuple aggregates the corresponding parameter of each grouped grid operation. The number of operations can be as large as 8.
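As a rough sketch of the pattern (not the actual runtime code; `combineAll` and the lambdas below are hypothetical names): each per-op value occupies one slot of a tuple, and one functor per op combines the corresponding slots, so N independent reductions advance in a single call.

```cpp
#include <cstddef>
#include <cstdio>
#include <tuple>
#include <utility>

// Sketch: apply the i-th combine function to the i-th accumulator/input
// pair, so N independent reductions advance in a single call.
template <typename... Ts, typename... Funcs, std::size_t... Is>
void combineAll(std::tuple<Ts...>& acc,
                const std::tuple<Ts...>& in,
                std::index_sequence<Is...>,
                Funcs... funcs) {
  (funcs(std::get<Is>(acc), std::get<Is>(in)), ...);
}

int main() {
  // Two grouped reductions with different element types.
  std::tuple<float, double> acc{0.0f, 0.0};
  std::tuple<float, double> in{1.0f, 2.0};
  combineAll(
      acc, in, std::index_sequence_for<float, double>{},
      [](float& a, float b) { a += b; },
      [](double& a, double b) { a += b; });
  std::printf("%f %f\n", std::get<0>(acc), std::get<1>(acc));
  return 0;
}
```

Expanding the function pack and the index pack together keeps each reduction's data type and combine operation independent, which is what allows a grouped call to mix dtypes.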

@csarofeen (Owner) left a comment:

I didn't do a super in-depth review of all the functions, but overall LGTM. I would like to see an additional test to make sure the tuples support multiple dtypes across the reductions. I would also like to see some more comments in the helper functions, like the for-each functions in the runtime files. Simply reiterating the necessity of the different versions of the functions would be helpful for folks who may need to go through these files in the future. Thanks!

```cpp
auto tv0 = makeSymbolicTensor(1);
fusion.addInput(tv0);

auto tv1 = sum(tv0, {0});
```
@csarofeen (Owner) commented:

Could you add some mixed types in the reductions (like one on float vals, one on double vals, and one on int vals)? It looked like that should be supported, right?

@naoyam (Collaborator, Author) replied:

Yes. Added one more test.
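For reference, a hedged sketch of what such a mixed-dtype setup could look like, based only on the API visible in the snippet above; the two-argument makeSymbolicTensor overload, the DataType values, and the addOutput calls are assumptions rather than code from the actual test added in this PR.

```cpp
// Sketch of a mixed-dtype grouped reduction setup (assumed API, not copied
// from the actual test).
auto tv0 = makeSymbolicTensor(1, DataType::Float);
auto tv2 = makeSymbolicTensor(1, DataType::Double);
auto tv4 = makeSymbolicTensor(1, DataType::Int);
fusion.addInput(tv0);
fusion.addInput(tv2);
fusion.addInput(tv4);

// Three grid reductions on different dtypes that could be grouped into a
// single grouped grid reduction.
auto tv1 = sum(tv0, {0});
auto tv3 = sum(tv2, {0});
auto tv5 = sum(tv4, {0});
fusion.addOutput(tv1);
fusion.addOutput(tv3);
fusion.addOutput(tv5);
```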

@naoyam force-pushed the grid_reduction_runtime_kernel_ext branch from 0754259 to 4398e31 on June 13, 2022 19:20
@naoyam (Collaborator, Author) commented on Jun 13, 2022:

Thanks for the review. Added more comments.

@naoyam merged commit 2d6343f into devel on Jun 13, 2022
@naoyam deleted the grid_reduction_runtime_kernel_ext branch on June 13, 2022 19:39
shmsong pushed a commit to shmsong/pytorch that referenced this pull request on Jul 24, 2022:

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- TransformPropagator refactor: switched to Dijkstra instead of exhaustive enumeration of all possible paths to reduce compilation time on transform propagation;
- Indexing refactor: remove reference tensor creation in all tensor indexing logic (csarofeen#1690);
- (more) generic grouped grid reduction kernel;
- Minor parser/fuser patches:
  1. zero-dim tensor reduction support
  2. no-op binary removal within the fused graph
  3. expand supported in fusion

Squashed commits to WAR the GitHub API.
Commits actually in this PR from the devel branch:

```
a054b3e Refactor TransormPropagator to allow specifying a position and propagating to part of the DAG (csarofeen#1775)
d67e1cd Indexing refactor stage 1: remove reference tensor creation in all tensor indexing logic (csarofeen#1690)
1b65299 Issue 1770 (csarofeen#1774)
35b0427 Avoid compilation errors like below: (csarofeen#1773)
452c773 Ignore reductions of zero-dim tensors per PyTorch conventions (csarofeen#1771)
31d6c56 TransformPropagator refactor (csarofeen#1769)
570c5a8 Merge pull request csarofeen#1767 from csarofeen/upstream_merge_0621
9d6c3d8 merging upstream 61305cd
0ed815f New TransformPropagator algorithm (csarofeen#1763)
6c19520 no-op binary removal (csarofeen#1764)
ec7fa41 Proper propagation of IterType (csarofeen#1762)
b263562 Fix dimensionality check (csarofeen#1759)
2d6343f More generic grouped grid reduction kernel (csarofeen#1740)
64e2b56 [nvfuser] prevent spamming warning message (pytorch#77777) (csarofeen#1758)
0c43162 [nvFuser] Improving bitwise ops support (pytorch#77158) (csarofeen#1757)
b93a147 Parser expand (csarofeen#1754)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: pytorch#80355
Approved by: https://github.com/davidberard98