Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. Indexing refactor -> remove reference tensor in predicate indexing logic
  2. MMA rfactor support for cross-warp and cross-CTA split on the K dimension
  3. Grouping grid allreduces across iterations
  4. Swizzle op formulation for non-affine swizzles
  5. Use scheduler_utils to cache inputs and outputs in schedulePointwise
- scheduler refactor:
  1. New compute at interface
- transformation propagation refactor on MaxInfoSpanningTree:
  1. Added a sibling path, required to generate consistent replay in some cases where `MaxInfoSpanningTree` is used with a selector
  2. Optimization to let the transform propagator skip replay when possible
  3. SpanningTreePrinter for debugging
- parser update:
  1. Fixes `div`
  2. Added `_to_copy`
  3. Broadcast in dim with expand, to support expanding to a concrete size
  4. Dropout prob extremal patch
- executor patch on caching strides for output allocation

Squashed commits to WAR (work around) the GitHub API.
Commits that are actually in this PR from the devel branch:

```
3b87896 Fix allocation of work buffers and `fused_reduction::ParallelReduce` with unswitch (csarofeen#1818)
4cae122 schedulePointwise cleanup: - computeAt + InlinePropagator (csarofeen#1815)
3df9742 Use scheduler_utils to cache inputs and outputs in schedulePointwise (csarofeen#1811)
03180aa improve broadcast resolution (csarofeen#1792)
bee6c69 bug fix (csarofeen#1819)
4413c8f Support PYTORCH_NVFUSER_DUMP=transform_propagator (csarofeen#1812)
de6b7ca Fix negative position in InlinePropagator (csarofeen#1813)
10a996c Remove redundant check in schedulePointwise (csarofeen#1810)
acd5ed4 Swizzle op formulation for non-affine swizzles (csarofeen#1441)
3ed8330 Kernel args patch to show zero_init buffer (csarofeen#1809)
037a75a Dropout prob extremal patch (csarofeen#1804)
282c429 spam nvrtc options (csarofeen#1783)
3ba6a5f Broadcast in dim with expand (csarofeen#1794)
fd4be12 remove dead indexing code (csarofeen#1806)
fa4e6a4 Check siblings in getMaxPosAll (csarofeen#1805)
025c840 Grouping grid allreduces across iterations (csarofeen#1755)
37c579e Temporarily disable test requring large shared memory. (csarofeen#1802)
5f375d0 More cleanup on InlinePropagator (csarofeen#1800)
8d384da Indexing refactor stage 2 : Remove reference tensor in predicate indexing logic (csarofeen#1784)
f008140 MMA Rfactor support for cross-warp and cross-CTA split on K dimension (csarofeen#1554)
76b3cca Add parsing support for `_to_copy` to handle AMP casts. (csarofeen#1756)
ef04f6c Coding style cleanups (csarofeen#1798)
38c7f3c InlinePropagator please don't replay (csarofeen#1797)
3f2c263 validateDomain in TransformPropagator (csarofeen#1796)
c077085 Use TransformPropagatorWithCheck in many tests (csarofeen#1795)
d0d0908 Some further cleanup for the new computeAt interface (csarofeen#1793)
45f5203 Fix TransformReplay::getMatchedLeafPosWithoutReplay* (csarofeen#1791)
28cbaf9 New compute at interface (csarofeen#1743)
635ebfc Add SpanningTreePrinter (csarofeen#1786)
59f3c32 Output allocate patch (csarofeen#1790)
fe93bf5 Transform propagator skip replay when possible (csarofeen#1782)
ebf23a5 Fix isIntegralType error msg (csarofeen#1789)
0c82ecf Disable register reuse across serial broadcast ops (csarofeen#1787)
33a824d Adding sibling path for MaxInfoSpanningTree (csarofeen#1776)
86f46aa Fix div(Val, TensorView) (csarofeen#1778)
d3de227 Fix FusionMaxRootDomainInfoSpanningTreePrintTwice_CUDA (csarofeen#1781)
ecc7a87 Extend mma dimension and layout checking to support strided batched matmul and tensor contractions (csarofeen#1761)
```

[ghstack-poisoned]
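
For reference, a minimal sketch of the AMP-cast pattern the `_to_copy` parser addition targets. This is illustrative code, not part of this PR: it assumes dtype casts such as `x.float()` reach the fuser as `aten::_to_copy` nodes, that nvFuser is selected via `torch.jit.fuser("fuser2")`, and that a CUDA build is available; function and tensor names are made up.

```python
import torch

# Hypothetical repro sketch (not from this PR): a scripted function whose
# AMP-style half -> float casts are assumed to surface as aten::_to_copy,
# which the parser update should let nvFuser take into the fused region.
@torch.jit.script
def amp_cast_gelu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # half -> float casts followed by pointwise math, a common AMP pattern
    return torch.nn.functional.gelu(x.float() + y.float())

def run() -> None:
    x = torch.randn(8, 128, device="cuda", dtype=torch.half)
    y = torch.randn(8, 128, device="cuda", dtype=torch.half)
    with torch.jit.fuser("fuser2"):  # "fuser2" selects nvFuser
        for _ in range(3):  # a few warm-up calls let profiling/fusion kick in
            out = amp_cast_gelu(x, y)
    print(out.dtype, out.shape)

if __name__ == "__main__":
    run()
```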
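
Likewise, a hedged sketch of the broadcast-in-dim-with-expand pattern (a size-1 dimension expanded to a concrete size) that the parser update is meant to cover. Shapes and names are hypothetical; a CUDA build is assumed.

```python
import torch

# Hypothetical sketch (not from this PR): a [1, C] bias expanded to the
# concrete [N, C] shape of x before a pointwise op, the kind of
# broadcast-in-dim + expand pattern the parser update addresses.
@torch.jit.script
def bias_scale(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # expand the leading size-1 dim of bias to x's concrete batch size
    expanded = bias.expand(x.size(0), bias.size(1))
    return torch.relu(x * expanded)

x = torch.randn(16, 64, device="cuda")
bias = torch.randn(1, 64, device="cuda")
with torch.jit.fuser("fuser2"):  # nvFuser
    for _ in range(3):
        out = bias_scale(x, bias)
print(out.shape)
```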