
Conversation

csarofeen
Owner

No description provided.

@csarofeen csarofeen changed the base branch from master to master_bump_20_12_9 January 27, 2021 18:50
@csarofeen csarofeen changed the base branch from master_bump_20_12_9 to master January 27, 2021 18:50
@jjsjann123 jjsjann123 changed the base branch from master to master_bump_21_3_1 March 2, 2021 16:49
@jjsjann123 jjsjann123 changed the base branch from master_bump_21_3_1 to master March 2, 2021 16:49
@jjsjann123 jjsjann123 changed the base branch from master to master_bump_21_3_1 March 17, 2021 23:26
@jjsjann123 jjsjann123 changed the base branch from master_bump_21_3_1 to master March 17, 2021 23:26
kevinstephano and others added 24 commits March 18, 2021 14:20
* Fixed a minor issue in the CudaFusionManager where the string version of the canonicalized graph wasn't actually being used to cache the graph. We were accidentally using the original graph.
* Changed the seed to get the BiasGeluBwd test to pass; it was barely over the threshold.
Parallelize all IterDomains when inferred by computeAt relationships. Do not substitute kir::IterDomain::extent_ with parallel dimensions.
Predicate inside blockBroadcast rather than enclosing it with a predicate if clause.
* Destroy left-over cuda events

* Remove unused variable
Eager mode RNG kernels needed some minor changes to interact safely with CUDA graphs. This PR extends those changes to the kernels generated by nvfuser.
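As context (not code from this PR), a minimal sketch of capturing an RNG-consuming op under PyTorch's CUDA graph API; it illustrates why RNG state must be capture-aware so each replay draws fresh random numbers. Assumes a CUDA build of PyTorch recent enough to provide `torch.cuda.graph`:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")

# Warm up the op on a side stream before capture (recommended practice).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = torch.nn.functional.dropout(x, p=0.5)
torch.cuda.current_stream().wait_stream(s)

# Capture a graph containing an RNG-consuming op (dropout).
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = torch.nn.functional.dropout(x, p=0.5)

# Each replay advances the RNG offset, so the dropout mask differs per replay.
g.replay()
```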
Rework reduction heuristics, add a large reduction benchmarking suite.
Tiny fix to allow fusion with a pure scalar tensor in pointwise fusion.
Note that similar changes would need to be applied to other schedulers as well.
Revert CudaFusionGroup where profiling information is not available. This applies when there is branching in a code path that is not executed during profiling runs.
* disable for CUDA MAJOR<11

* fix
…tions (#778)

* add utilities needed for multi node merging

* add combine reduction pass

* add input groups

* add vertical test

* bug fix

* add config; add horizontal test

* comment

* add drawing util

* fix dependency maintenance

* bugfix

* add test

* format

* clang-tidy

* comment

* fix test case print

* move dependency analysis pass out of the header

* Deprioritize fusing through outputs.

* trigger CI

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
* Use the new version of getAllValsBetween
* Do not create mappings of non-leaf domains in the CA Parallel Map
This allows us to select each DifferentiableGraphOp that has an optimized plan and update its forward graph with fusion, while allowing the ones without one to keep their stock graph.
Makes it slightly easier to debug/query fusions using graph_for without having to set PYTORCH_JIT_LOG_LEVEL.
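A small sketch of the kind of inspection this enables, assuming a scripted function running under the profiling executor with nvfuser enabled; `graph_for` returns the optimized graph for the given example inputs:

```python
import torch

@torch.jit.script
def fn(x, y):
    return torch.relu(x + y) * 2.0

x = torch.randn(8, 8, device="cuda")
y = torch.randn(8, 8, device="cuda")

# Run a few times so the profiling executor specializes and fuses the graph.
for _ in range(5):
    fn(x, y)

# Print the optimized graph for these inputs; with nvfuser enabled it
# should contain a fusion group node, without touching PYTORCH_JIT_LOG_LEVEL.
print(fn.graph_for(x, y))
```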
Fixed some CI failures on the 20.04 container; cherry-picking them back to dev_branch.
…ytorch#54374) (#796)

Summary:
Fixes pytorch#54040
`prim::RequiresGradCheck` guarantees that the requires_grad properties
of the input tensors match the profiled values; otherwise a fallback path
is triggered. This allows us to prune gradients in the backward
graph for inputs that don't need them. We transfer the requires_grad
properties from the inputs of the `prim::DifferentiableGraph` node onto the inputs of the
differentiable graph. Autodiff will inspect these properties and prune
off gradients that aren't required (see the illustration after this commit message).

Pull Request resolved: pytorch#54374

Reviewed By: H-Huang

Differential Revision: D27369251

Pulled By: Krovatkin

fbshipit-source-id: 2bce7a2d7f2ec091db9bf4c4b91d8b29edd5be11

Co-authored-by: Nikolay Korovaiko <korovaikon@gmail.com>
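Not part of the PR itself, but a minimal autograd illustration of the property being exploited: when an input has requires_grad=False its gradient path can be dropped entirely, so a backward graph specialized on profiled requires_grad flags can be pruned accordingly:

```python
import torch

x = torch.randn(4, requires_grad=False)  # no gradient needed for x
w = torch.randn(4, requires_grad=True)   # gradient needed for w

out = (x * w).sum()
out.backward()

print(w.grad)  # populated
print(x.grad)  # None: the gradient for x was never computed
```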
* always use segmented interface

* bugfix

* comment;rename

* more comments

* update naming

* comment
naoyam and others added 26 commits September 16, 2021 12:03
* Move reorder to 2-D parallelization scheme in point-wise scheduler
… case of reductions (#1121)


* Clean up ParallelTypeBitmap

* Track redundant threads/blocks with ThreadPredicateMap

Fixes #1110

* Predicate redundant threads/blocks in reductions to global buffers

* Buffer allocation fix for grid/welford reductions (#1126)

* Enable parallel type binding in precomputed integers (#1132)

* add parallel type binding to pre-computed integers


Co-authored-by: S. Song <41357537+shmsong@users.noreply.github.com>
* Fix missing "f" in binary math op

* repro with WAR
Make sure segmentation doesn't insert additional h2f->f2h casts within a kernel.
Cap maxrregcount at a constant 255 instead of querying device properties.
…and cast cleanup (#1114)

* Use caParallelMap to simplify launch binding

* Pre-allocate space and pre-compute order for multikernel runtime

* avoid perf scope overhead in evaluator calls

* clang-tidy

* format
* Change FLT_MIN and DBL_MIN to use numeric_limits::lowest() (see the sketch after this commit message)

* Fix clang issues.

* Added some comments to Mask+Softmax test.

* Fix clang trailing spaces.

Co-authored-by: root <root@ipp1-1320.nvidia.com>
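For context on the FLT_MIN/DBL_MIN change above: FLT_MIN and DBL_MIN are the smallest *positive* normalized values, not the most negative ones, so using them to initialize a running maximum silently breaks on all-negative inputs. A hedged Python analogue using sys.float_info, which mirrors the C float limits:

```python
import sys

values = [-3.0, -1.5, -2.25]

# Wrong: sys.float_info.min (like DBL_MIN) is the smallest positive double.
running_max = sys.float_info.min
for v in values:
    running_max = max(running_max, v)
print(running_max)  # 2.2250738585072014e-308, not -1.5

# Right: start from the most negative finite double,
# the analogue of std::numeric_limits<double>::lowest().
running_max = -sys.float_info.max
for v in values:
    running_max = max(running_max, v)
print(running_max)  # -1.5
```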
* Extend SimplifyingIrBuilder
* refactoring
* Take `rnd` as a reference instead of a value

rnd is modified inside the function, and that modification should not be discarded.

* Use globally unique index when initializing Philox
* Replace pow at codegen
* Expose some of the utility functions

They are useful to have for the C++ interface.
* Remove rand_like fusion from ternary ops tests.

* Clang fixes.
* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* fixing rebase error

* restarting rebase manually for test_gpu.cpp

* rebased manually for test_gpu.cpp

* rebased manually for test_gpu.cpp

* fixed fusion segmentation

* fixed fusion segmentation

* fixed fusion segmentation

* syntax mixup

* cleanup

* cleanup

* cleanup

* added assert

* added assert

* added assert

* added assert

* added assert

* added assert

* cleanup

* cleanup

* cleanup

* merged ops

* linting

* linting

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* clangtidy

* clangtidy

* clangtidy

* clangtidy

* clangtidy

* clangtidy

* fixing assertion

* fixing assertion

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* protect bfloat on cuda <11

* protect bfloat on cuda <11

* if running on ampere but cuda10, still disable bfloat

* lint

Co-authored-by: riship <riship@nvidia.com>
Validation of allocations needs to be done only for tensors, so
non-tensor allocations can simply be skipped.
* Use WARP_SIZE instead of 32
* Fix computation of thread predicate with broadcast

Previously, a broadcast input would reset the thread predicate of any other input.
Channels Last support in nvfuser

Background:
To support channels last in nvfuser with optimal performance, we want to allow dimension collapsing in generated code on channels-last tensors, which greatly simplifies indexing.
The current codegen API only allows dimension collapsing on neighboring axes. The unfortunate part is that memory format in PyTorch is marked implicitly by strides, while the semantic meaning of the axes remains unchanged. I.e., a 4d tensor with axes [N, C, H, W] has the same shape in both formats, while a contiguous tensor carries strides [CHW, HW, W, 1] and a channels-last tensor carries strides [HWC, 1, WC, C] (see the stride sketch after this summary).

Approach:
We identify input tensors in channels-last format and permute them to NHWC. This creates an inconsistency between codegen tensors and TorchScript tensors. Our parser handles and propagates the memory format accordingly, i.e., it consumes and produces channels-last inputs when it can, while transposing other inputs back to their original format and producing non-permuted outputs.
Fusion inputs/outputs in channels-last format are marked and permuted before/after fusion execution to ensure correctness at the interface between nvfuser and TorchScript.

Add a simple C++ test to ensure simplified indexing in the generated code.
Add Python tests to verify that NHWC fp16 inputs are handled properly; this was handled in a recent bfloat PR.
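A small sketch of the stride layouts described above, assuming a recent PyTorch build; both tensors report the same sizes, and only the strides differ:

```python
import torch

n, c, h, w = 2, 8, 4, 4
x = torch.randn(n, c, h, w)                     # contiguous NCHW
y = x.to(memory_format=torch.channels_last)     # NHWC in memory, NCHW semantics

print(x.shape, y.shape)  # identical: torch.Size([2, 8, 4, 4])
print(x.stride())        # (c*h*w, h*w, w, 1) -> (128, 16, 4, 1)
print(y.stride())        # (h*w*c, 1, w*c, c) -> (128, 1, 32, 8)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```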
… (#1170)

* Revert "Revert D30752939: [pytorch][PR] nvfuser update" (pytorch#65137)

Summary:
This reverts commit 03389dc.

Attempt again for PR: pytorch#63745
Fixes the windows build failure.

Pull Request resolved: pytorch#65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d

* review comments addressed

* clang-tidy non-private member variables

* clang-format

* quick fix on skipping logic
* Collect thread predicates when generating unswitch conditions

Multiple thread predicates are merged into a single
ThreadPredicate::predicate_info by simply taking a union of them. See
ThreadPredicateMap::mergeForUnswitch for more details.

Fixes #1129
@csarofeen csarofeen force-pushed the master branch 2 times, most recently from e8ecaa3 to aa80f05 Compare October 7, 2021 18:52
@csarofeen
Owner Author

Closed in favor of: #1208

@csarofeen csarofeen closed this Oct 23, 2021