forked from pytorch/pytorch
-
Notifications
You must be signed in to change notification settings - Fork 7
Tracking 20_12_3_devel #570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
* Fixed a minor issue in the CudaFusionManager where the string version of the canonicalized graph wasn't actually be used to cache the graph. We were accidentally using the original graph. * Changed seed to get BiasGeluBwd test to pass. It was barely over the threshold.
Parallelize all IterDomains when inferred by computeAt relationships. Do not substiutte kir::IterDomain::extent_ with parallel dimensions.
Predicate inside blockBroadcast rather than enclosing it with a predicate if clause.
* Destroy left-over cuda events * Remove unused variable
Eager mode RNG kernels needed some minor changes to interact safely with cuda graphs. This PR extends those changes to the kernels generated by nvfuser.
Rework reduction heuristics, add a large reduction benchmarking suite.
Tiny fix to allow fusion with pure scalar tensor in PW fusion Note that similar changes would need to be applied to other schedulers as well
Revert CudaFusionGroup where profiling information are not available. Application here is when we have branching in code path that is not executed during profile runs.
* disable for CUDA MAJOR<11 * fix
…tions (#778) * add utilities needed for multi node merging * add combine reduction pass * add input groups * add vertical test * bug fix * add config; add horizontal test * comment * add drawing util * fix dependency maintenance * bugfix * add test * format * clang-tidy * comment * fix test case print * move dependency analysis pass out of the header * Deprioritize fusing through outputs. * trigger CI Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
* Use the new version of getAllValsBetween
* Do not create mappings of non-leaf domains in the CA Parallel Map
This allows us to select each DifferentiableGraphOp with optimized plan to update its forward graph with fusion while allow others without that to keep their stock graph. Makes it slightly easier to debug/query fusion using graph_for without going through setting PYTORCH_JIT_LOG_LEVEL
Fixed some CI failure on 20.04 container. cherry-pick them back to dev_branch
…ytorch#54374) (#796) Summary: Fixes pytorch#54040 `prim::RequiresGradCheck` guarantees that requires_grad properties of input tensors will match the profiled, otherwise a fallback path will be triggered. This allow us to prune off gradients in backward graph for inputs that don't need gradients. We transfer requires_grad properties from inputs to the `prim::DifferentiableGraph` onto inputs to the differentiable graph. Autodiff will inspect these properties and prune off gradients that aren't required Pull Request resolved: pytorch#54374 Reviewed By: H-Huang Differential Revision: D27369251 Pulled By: Krovatkin fbshipit-source-id: 2bce7a2d7f2ec091db9bf4c4b91d8b29edd5be11 Co-authored-by: Nikolay Korovaiko <korovaikon@gmail.com>
* always use segmented interface * bugfix * comment;rename * more comments * update naming * comment
…#1117) with non-exact dimensions
… case of reductions (#1121) * Clean up ParallelTypeBitmap * Track redundant threads/blocks with ThreadPredicateMap Fixes #1110 * Predicate redundant threads/blocks in reductions to global buffers * Buffer allocation fix for grid/welford reductions (#1126) * Enable parallel type binding in precomputed integers (#1132) * add parallel type binding to pre-computed integers Co-authored-by: S. Song <41357537+shmsong@users.noreply.github.com>
* Fix missing "f" in binary math op * repro with WAR
Make sure segmentation doesn't insert additional h2f->f2h casts within a kernel.
cap maxrregcount at constant 255 instead of query device properties
…and cast cleanup (#1114) * Use caParallelMap to simplify launch binding * Pre-allocate space and pre-compute order for multikernel runtime * avoid perf scope overhead in evaluator calls * clang-tidy * format
* Change FLT_MIN and DBL_MIN to use numeric_limits::lowest() * Fix clang issues. * Added some comments to Mask+Softmax test. * Fix clang trailing spaces. Co-authored-by: root <root@ipp1-1320.nvidia.com>
* Extend SimplifyingIrBuilder * refactoring
* Take `rnd` as a reference instead of a value rnd is modified inside the function, which should not be discarded. * Use globally unique index when initializing Philox
* Replace pow at codegen
* Expose some of the utility functions They are useful to have for the C++ interface.
* Remove rand_like fusion from ternary ops tests. * Clang fixes.
* rebased my changes onto 20_12_3_devel * rebased my changes onto 20_12_3_devel * rebased my changes onto 20_12_3_devel * rebased my changes onto 20_12_3_devel * rebased my changes onto 20_12_3_devel * rebased my changes onto 20_12_3_devel * fixing rebase error * restaring rebase manually for test_gpu.cpp * rebased manually for test_gpu.cpp * rebased manually for test_gpu.cpp * fixed fusion segmentation * fixed fusion segmentation * fixed fusion segmentation * syntax mixup * cleanup * cleanup * cleanup * added assert * added assert * added assert * added assert * added assert * added assert * cleanup * cleanup * cleanup * merged ops * linting * linting * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * trying to fix * clangtidy * clangtidy * clangtidy * clangtidy * clangtidy * clangtidy * fixing assertion * fixing assertion * skipping bfloat tests if not ampere * skipping bfloat tests if not ampere * skipping bfloat tests if not ampere * skipping bfloat tests if not ampere * skipping bfloat tests if not ampere * protect bfloat on cuda <11 * protect bfloat on cuda <11 * if running on ampere but cuda10, still disable bfloat * lint Co-authored-by: riship <riship@nvidia.com>
Validation of allocations need to be done only for tensors, so non-tensor allocations can be just skipped.
* Use WARP_SIZE instead of 32
* Fix computation of thread predicate with broadcast Previously, a broadcasted input resets a thread predicate of any other input.
Channels Last support in nvfuser Background: To support channels last in nvfuser with optimal performance, we want to allow dimension collapsing in generated code on channels-last tensors, which greatly simplifies indexing. Current API in codegen only allows dimensional collapsing on neighboring axes. The unfortunate thing is that memory format design in PyTorch is implicitly marked by strides, while the semantics meaning of axes remain unchanged. i.e. A 4d tensor with axes [N, C, H, W] would have the same shape in both format, while contiguous tensor carries strides [CHW, HW, W, 1] and channels-last tensor [HWC, 1, WC, C]. Approach: We identify input tensor in channels-last format and permute them to NHWC. This creates an inconsistency between codegen tensor and TorchScript tensor. Our parser handles and propagates memory format accordingly. I.e., consumes and produces channels-last inputs when it can, while transposes inputs to original format and output non-permuted outputs. Fusion inputs/outputs in channels-last format is marked and permuted before/after fusion execution to ensure correctness on the interfacing between nvfuser and TorchScript. add simple cpp test to ensure simplified indexing in generated code. add python tests to verify nhwc fp16 inputs is handled properly. It has been handled in recent bfloat PR
… (#1170) * Revert "Revert D30752939: [pytorch][PR] nvfuser update" (pytorch#65137) Summary: This reverts commit 03389dc. Attempt again for PR: pytorch#63745 Fixes the windows build failure. Pull Request resolved: pytorch#65137 Reviewed By: seemethere, dzhulgakov, heitorschueroff Differential Revision: D30994556 Pulled By: malfet fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d * review comments addressed * clang-tidy non-private member variables * clang-format * quick fix on skipping logic
* Collect thread predicates when generating unswitch conditions Multiple thread predicates are merged into a single ThreadPredicate::predicate_info by simply taking a union of them. See ThreadPredicateMap::mergeForUnswitch for more details. Fixes #1129
e8ecaa3
to
aa80f05
Compare
Closed in favor of: #1208 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.