
Conversation

csarofeen
Owner

No description provided.

@csarofeen csarofeen changed the base branch from master to master_bump_20_12_9 January 27, 2021 18:50
@csarofeen csarofeen changed the base branch from master_bump_20_12_9 to master January 27, 2021 18:50
@jjsjann123 jjsjann123 changed the base branch from master to master_bump_21_3_1 March 2, 2021 16:49
@jjsjann123 jjsjann123 changed the base branch from master_bump_21_3_1 to master March 2, 2021 16:49
@jjsjann123 jjsjann123 changed the base branch from master to master_bump_21_3_1 March 17, 2021 23:26
@jjsjann123 jjsjann123 changed the base branch from master_bump_21_3_1 to master March 17, 2021 23:26
kevinstephano and others added 24 commits March 18, 2021 14:20
* Fixed a minor issue in the CudaFusionManager where the string version of the canonicalized graph wasn't actually being used to cache the graph. We were accidentally using the original graph.
* Changed the seed to get the BiasGeluBwd test to pass; it was barely over the threshold.
Parallelize all IterDomains when inferred by computeAt relationships. Do not substitute kir::IterDomain::extent_ with parallel dimensions.
Predicate inside blockBroadcast rather than enclosing it with a predicate if clause.
* Destroy left-over cuda events

* Remove unused variable
Eager mode RNG kernels needed some minor changes to interact safely with CUDA graphs. This PR extends those changes to the kernels generated by nvfuser.
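As context (not code from this PR), a minimal sketch of capturing an RNG-consuming op under PyTorch's CUDA graph API; it illustrates why RNG state must be capture-aware so each replay draws fresh random numbers. Assumes a CUDA build of PyTorch recent enough to provide `torch.cuda.graph`:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")

# Warm up the op on a side stream before capture (recommended practice).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = torch.nn.functional.dropout(x, p=0.5)
torch.cuda.current_stream().wait_stream(s)

# Capture a graph containing an RNG-consuming op (dropout).
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = torch.nn.functional.dropout(x, p=0.5)

# Each replay advances the RNG offset, so the dropout mask differs per replay.
g.replay()
```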
Rework reduction heuristics, add a large reduction benchmarking suite.
Tiny fix to allow fusion with a pure scalar tensor in pointwise fusion.
Note that similar changes would need to be applied to other schedulers as well.
Revert CudaFusionGroup where profiling information is not available. This applies when there is branching in a code path that is not executed during profiling runs.
* disable for CUDA MAJOR<11

* fix
…tions (#778)

* add utilities needed for multi node merging

* add combine reduction pass

* add input groups

* add vertical test

* bug fix

* add config; add horizontal test

* comment

* add drawing util

* fix dependency maintenance

* bugfix

* add test

* format

* clang-tidy

* comment

* fix test case print

* move dependency analysis pass out of the header

* Deprioritize fusing through outputs.

* trigger CI

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
* Use the new version of getAllValsBetween
* Do not create mappings of non-leaf domains in the CA Parallel Map
This allows us to select each DifferentiableGraphOp that has an optimized plan and update its forward graph with fusion, while allowing the ones without one to keep their stock graph.
Makes it slightly easier to debug/query fusions using graph_for without having to set PYTORCH_JIT_LOG_LEVEL.
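A small sketch of the kind of inspection this enables, assuming a scripted function running under the profiling executor with nvfuser enabled; `graph_for` returns the optimized graph for the given example inputs:

```python
import torch

@torch.jit.script
def fn(x, y):
    return torch.relu(x + y) * 2.0

x = torch.randn(8, 8, device="cuda")
y = torch.randn(8, 8, device="cuda")

# Run a few times so the profiling executor specializes and fuses the graph.
for _ in range(5):
    fn(x, y)

# Print the optimized graph for these inputs; with nvfuser enabled it
# should contain a fusion group node, without touching PYTORCH_JIT_LOG_LEVEL.
print(fn.graph_for(x, y))
```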
Fixed some CI failures on the 20.04 container; cherry-picking them back to dev_branch.
…ytorch#54374) (#796)

Summary:
Fixes pytorch#54040
`prim::RequiresGradCheck` guarantees that the requires_grad properties
of the input tensors match the profiled values; otherwise a fallback path
is triggered. This allows us to prune gradients in the backward
graph for inputs that don't need them. We transfer the requires_grad
properties from the inputs of the `prim::DifferentiableGraph` node onto the inputs of the
differentiable graph. Autodiff will inspect these properties and prune
off gradients that aren't required (see the illustration after this commit message).

Pull Request resolved: pytorch#54374

Reviewed By: H-Huang

Differential Revision: D27369251

Pulled By: Krovatkin

fbshipit-source-id: 2bce7a2d7f2ec091db9bf4c4b91d8b29edd5be11

Co-authored-by: Nikolay Korovaiko <korovaikon@gmail.com>
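Not part of the PR itself, but a minimal autograd illustration of the property being exploited: when an input has requires_grad=False its gradient path can be dropped entirely, so a backward graph specialized on profiled requires_grad flags can be pruned accordingly:

```python
import torch

x = torch.randn(4, requires_grad=False)  # no gradient needed for x
w = torch.randn(4, requires_grad=True)   # gradient needed for w

out = (x * w).sum()
out.backward()

print(w.grad)  # populated
print(x.grad)  # None: the gradient for x was never computed
```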
* always use segmented interface

* bugfix

* comment;rename

* more comments

* update naming

* comment
naoyam and others added 26 commits September 16, 2021 12:03
* Move reorder to 2-D parallelization scheme in point-wise scheduler
… case of reductions (#1121)


* Clean up ParallelTypeBitmap

* Track redundant threads/blocks with ThreadPredicateMap

Fixes #1110

* Predicate redundant threads/blocks in reductions to global buffers

* Buffer allocation fix for grid/welford reductions (#1126)

* Enable parallel type binding in precomputed integers (#1132)

* add parallel type binding to pre-computed integers


Co-authored-by: S. Song <41357537+shmsong@users.noreply.github.com>
* Fix missing "f" in binary math op

* repro with WAR
Make sure segmentation doesn't insert additional h2f->f2h casts within a kernel.
Cap maxrregcount at a constant 255 instead of querying device properties.
…and cast cleanup (#1114)

* Use caParallelMap to simplify launch binding

* Pre-allocate space and pre-compute order for multikernel runtime

* avoid perf scope overhead in evaluator calls

* clang-tidy

* format
* Change FLT_MIN and DBL_MIN to use numeric_limits::lowest() (see the sketch after this commit message)

* Fix clang issues.

* Added some comments to Mask+Softmax test.

* Fix clang trailing spaces.

Co-authored-by: root <root@ipp1-1320.nvidia.com>
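For context on the FLT_MIN/DBL_MIN change above: FLT_MIN and DBL_MIN are the smallest *positive* normalized values, not the most negative ones, so using them to initialize a running maximum silently breaks on all-negative inputs. A hedged Python analogue using sys.float_info, which mirrors the C float limits:

```python
import sys

values = [-3.0, -1.5, -2.25]

# Wrong: sys.float_info.min (like DBL_MIN) is the smallest positive double.
running_max = sys.float_info.min
for v in values:
    running_max = max(running_max, v)
print(running_max)  # 2.2250738585072014e-308, not -1.5

# Right: start from the most negative finite double,
# the analogue of std::numeric_limits<double>::lowest().
running_max = -sys.float_info.max
for v in values:
    running_max = max(running_max, v)
print(running_max)  # -1.5
```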
* Extend SimplifyingIrBuilder
* refactoring
* Take `rnd` as a reference instead of a value

rnd is modified inside the function, and that modification should not be discarded.

* Use globally unique index when initializing Philox
* Replace pow at codegen
* Expose some of the utility functions

They are useful to have for the C++ interface.
* Remove rand_like fusion from ternary ops tests.

* Clang fixes.
* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* rebased my changes onto 20_12_3_devel

* fixing rebase error

* restarting rebase manually for test_gpu.cpp

* rebased manually for test_gpu.cpp

* rebased manually for test_gpu.cpp

* fixed fusion segmentation

* fixed fusion segmentation

* fixed fusion segmentation

* syntax mixup

* cleanup

* cleanup

* cleanup

* added assert

* added assert

* added assert

* added assert

* added assert

* added assert

* cleanup

* cleanup

* cleanup

* merged ops

* linting

* linting

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* trying to fix

* clangtidy

* clangtidy

* clangtidy

* clangtidy

* clangtidy

* clangtidy

* fixing assertion

* fixing assertion

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* skipping bfloat tests if not ampere

* protect bfloat on cuda <11

* protect bfloat on cuda <11

* if running on ampere but cuda10, still disable bfloat

* lint

Co-authored-by: riship <riship@nvidia.com>
Validation of allocations needs to be done only for tensors, so
non-tensor allocations can simply be skipped.
* Use WARP_SIZE instead of 32
* Fix computation of thread predicate with broadcast

Previously, a broadcast input would reset the thread predicate of any other input.
Channels Last support in nvfuser

Background:
To support channels last in nvfuser with optimal performance, we want to allow dimension collapsing in generated code on channels-last tensors, which greatly simplifies indexing.
The current codegen API only allows dimension collapsing on neighboring axes. The unfortunate part is that memory format in PyTorch is marked implicitly by strides, while the semantic meaning of the axes remains unchanged. I.e., a 4d tensor with axes [N, C, H, W] has the same shape in both formats, while a contiguous tensor carries strides [CHW, HW, W, 1] and a channels-last tensor carries strides [HWC, 1, WC, C] (see the stride sketch after this summary).

Approach:
We identify input tensors in channels-last format and permute them to NHWC. This creates an inconsistency between codegen tensors and TorchScript tensors. Our parser handles and propagates the memory format accordingly, i.e., it consumes and produces channels-last inputs when it can, while transposing other inputs back to their original format and producing non-permuted outputs.
Fusion inputs/outputs in channels-last format are marked and permuted before/after fusion execution to ensure correctness at the interface between nvfuser and TorchScript.

Add a simple C++ test to ensure simplified indexing in the generated code.
Add Python tests to verify that NHWC fp16 inputs are handled properly; this was handled in a recent bfloat PR.
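A small sketch of the stride layouts described above, assuming a recent PyTorch build; both tensors report the same sizes, and only the strides differ:

```python
import torch

n, c, h, w = 2, 8, 4, 4
x = torch.randn(n, c, h, w)                     # contiguous NCHW
y = x.to(memory_format=torch.channels_last)     # NHWC in memory, NCHW semantics

print(x.shape, y.shape)  # identical: torch.Size([2, 8, 4, 4])
print(x.stride())        # (c*h*w, h*w, w, 1) -> (128, 16, 4, 1)
print(y.stride())        # (h*w*c, 1, w*c, c) -> (128, 1, 32, 8)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```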
… (#1170)

* Revert "Revert D30752939: [pytorch][PR] nvfuser update" (pytorch#65137)

Summary:
This reverts commit 03389dc.

Attempt again for PR: pytorch#63745
Fixes the windows build failure.

Pull Request resolved: pytorch#65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d

* review comments addressed

* clang-tidy non-private member variables

* clang-format

* quick fix on skipping logic
* Collect thread predicates when generating unswitch conditions

Multiple thread predicates are merged into a single
ThreadPredicate::predicate_info by simply taking a union of them. See
ThreadPredicateMap::mergeForUnswitch for more details.

Fixes #1129
@csarofeen csarofeen force-pushed the master branch 2 times, most recently from e8ecaa3 to aa80f05 Compare October 7, 2021 18:52
@csarofeen
Owner Author

Closed in favor of: #1208

@csarofeen csarofeen closed this Oct 23, 2021