Dropout prob extremal patch #1804

jjsjann123 · 2022-07-05T23:09:42Z

This doesn't seem to impact perf in the example from the issue on a Volta, but it does look like it increases register usage.
An alternative is to slightly change the mask-generation logic for dropout, changing lt to le. But that means we'd diverge even further from the ATen RNG.
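
To make the lt versus le trade-off concrete, here is a minimal host-side sketch, not nvfuser code. It assumes the random samples lie in (0.0, 1.0] and that the threshold is the keep probability (1 - p); the helper name count_kept is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only (not part of nvfuser).
// Counts how many samples survive the dropout mask for a given keep
// probability, using either `r < keep_prob` (lt) or `r <= keep_prob` (le).
// Assumption: every sample lies in the interval (0.0, 1.0].
std::size_t count_kept(const std::vector<double>& samples,
                       double keep_prob,
                       bool use_le) {
  assert(keep_prob >= 0.0 && keep_prob <= 1.0);
  std::size_t kept = 0;
  for (double r : samples) {
    const bool keep = use_le ? (r <= keep_prob) : (r < keep_prob);
    if (keep) {
      ++kept;
    }
  }
  return kept;
}
```

With samples {0.25, 0.5, 1.0} and a keep probability of 1.0, lt keeps only two elements (the sample equal to 1.0 is dropped even though nothing should be), while le keeps all three. With a keep probability of 0.0, both comparisons keep nothing.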

jjsjann123 · 2022-07-06T16:24:47Z

We should add @IvanYashchuk to our project so we can request review from him. cc'ing @csarofeen

IvanYashchuk · 2022-07-07T13:27:22Z

torch/csrc/jit/codegen/cuda/ops/composite.cpp

@@ -29,6 +29,13 @@ ForwardDropoutResult dropout(TensorView* x, Val* prob, Val* scale) {

  auto rand_vals = randlike(x);
  auto mask = lt(rand_vals, prob);
+  // p == 0.0, set mask as False so everything is dropped.


Is it needed? lt(rand_vals, 0.0) should always be all False because there shouldn't be any negative values generated by randlike.

Good catch~
No, it's not. I added it earlier for debugging, when I forgot that we have p flipped 😵‍💫

IvanYashchuk · 2022-07-07T13:34:02Z

torch/csrc/jit/codegen/cuda/ops/composite.cpp

@@ -29,6 +29,13 @@ ForwardDropoutResult dropout(TensorView* x, Val* prob, Val* scale) {

  auto rand_vals = randlike(x);


Alternatively, we could replace ones with zeros in rand_vals to change the interval from (0.0, 1.0] to [0.0, 1.0).

Suggested change

-  auto rand_vals = randlike(x);
+  auto rand_vals = randlike(x);
+  auto rand_vals_new_interval = where(eq(rand_vals, IrBuilder::create<Double>(1.0)), IrBuilder::create<Double>(0.0), rand_vals);

Not sure whether using where or bitwise functions is better.

where looks cleaner, I also think it should handle broadcast properly as well~ Let me try that~
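
A scalar sketch of what the suggested where(eq(...)) expression computes, written as plain C++ rather than against the nvfuser IR. The function name remap_one_to_zero is hypothetical, and the input is assumed to lie in (0.0, 1.0].

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of
//   where(eq(rand_vals, 1.0), 0.0, rand_vals)
// Every element equal to 1.0 becomes 0.0; all other elements pass through.
// For inputs in (0.0, 1.0], the outputs therefore lie in [0.0, 1.0).
std::vector<double> remap_one_to_zero(std::vector<double> rand_vals) {
  for (double& r : rand_vals) {
    assert(r > 0.0 && r <= 1.0);  // assumed input interval
    if (r == 1.0) {
      r = 0.0;
    }
  }
  return rand_vals;
}
```

After this remapping, a comparison such as r < 1.0 holds for every element, so a keep probability of 1.0 no longer drops the samples that were exactly 1.0.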

Using where inside randlike would also fix #1807.

That does sound right as well. Are we pushing towards that fix?

I have no idea how the RNG works. To my naive mind, squashing 1 -> 0 only affects a corner of the distribution, so I'm fairly comfortable that nobody would notice the difference.

I would push towards that fix.

This kind of difference is difficult to notice; the dropout problem itself had been there for some time already.
Swapping 1.0 with 0.0 is fine because, mathematically, every number has an equal probability of being generated, so we won't break any mathematical properties.

CuPy does the same:
https://github.com/cupy/cupy/blob/18fab32f3d347fc13f86821cc72964515f6e5c4f/cupy/random/_generator.py#L619

_mod1_kernel = _core.ElementwiseKernel( '', 'T x', 'x = (x == (T)1) ? 0 : x', 'cupy_random_x_mod_1')

Updated to that. I'm fine with the change. A trivial question: doesn't clamping 1 to 0 mean that we'll have twice as many 0s as other numbers? But that's just a minor thing, so I'm not really trying to dive into that rabbit hole.

Does the update look better?

There wouldn't be twice as many 0s, because there should be none generated by the Philox RNG, implemented here:

pytorch/torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu, line 16 in 282c429:

__device__ unsigned long operator()() {

At least, it's usually the case that 0 is excluded.
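
The argument can be illustrated with a sketch of one common way to turn a 32-bit random integer into a float. This is an assumption for illustration only; the conversion actually used in random_numbers.cu is not shown in this thread, and the helper name uint_to_unit_float is hypothetical. The half-step offset means the smallest input maps to a strictly positive value, while the largest input rounds to 1.0f, which gives the interval (0.0, 1.0].

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical conversion (an assumption, not the nvfuser implementation):
// scale a 32-bit integer by 2^-32 and add half a step (2^-33).
//   x = 0         -> 2^-33, strictly greater than 0
//   x = 2^32 - 1  -> rounds to 1.0f in single precision
float uint_to_unit_float(std::uint32_t x) {
  const float scale = std::ldexp(1.0f, -32);      // 2^-32
  const float half_step = std::ldexp(1.0f, -33);  // 2^-33
  const float result = static_cast<float>(x) * scale + half_step;
  assert(result > 0.0f && result <= 1.0f);
  return result;
}
```

Under this assumption 0.0 is never produced, so remapping 1.0 to 0.0 fills an otherwise empty slot rather than adding to values that were already 0.0.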

jjsjann123 · 2022-07-07T17:07:57Z

Looks like there are some failing tests after moving the where into randlike. I'll let the tests finish first and patch them afterwards.

jjsjann123 · 2022-07-07T18:22:01Z

Cleaned up some string formatting for floats~ Waiting on tests to finish.
Meanwhile, the PR is good for review~

IvanYashchuk

Changes to randlike look good to me. Let's hope no test gets broken.

jjsjann123 · 2022-07-07T19:47:24Z

Tests passed locally, patched some floating point prints. I'll merge once lintrunner finishes.

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. Indexing refactor -> Remove reference tensor in predicate indexing logic
  2. MMA Rfactor support for cross-warp and cross-CTA split on K dimension
  3. Grouping grid allreduces across iterations
  4. Swizzle op formulation for non-affine swizzles
  5. Use scheduler_utils to cache inputs and outputs in schedulePointwise
- scheduler refactor
  1. New compute at interface
- transformation propagation refactor on MaxInfoSpanningTree
  1. Added sibling path that is required to generate consistent replay for some cases where `MaxInfoSpanningTree` is used with a selector.
  2. Optimization to skip Transform propagator
  3. SpanningTreePrinter for debugging
- parser update
  1. Fixes `div`
  2. Added `_to_copy`
  3. Broadcast in dim with expand to support expanding to concrete size
  4. Dropout prob extremal patch
- executor patch on caching strides for output allocation

Squashed commits to WAR github API

Commits that's actually in this PR from the devel branch:

```
3b87896 Fix allocation of work buffers and `fused_reduction::ParallelReduce` with unswitch (csarofeen#1818)
4cae122 schedulePointwise cleanup: - computeAt + InlinePropagator (csarofeen#1815)
3df9742 Use scheduler_utils to cache inputs and outputs in schedulePointwise (csarofeen#1811)
03180aa improve broadcast resolution (csarofeen#1792)
bee6c69 bug fix (csarofeen#1819)
4413c8f Support PYTORCH_NVFUSER_DUMP=transform_propagator (csarofeen#1812)
de6b7ca Fix negative position in InlinePropagator (csarofeen#1813)
10a996c Remove redundant check in schedulePointwise (csarofeen#1810)
acd5ed4 Swizzle op formulation for non-affine swizzles (csarofeen#1441)
3ed8330 Kernel args patch to show zero_init buffer (csarofeen#1809)
037a75a Dropout prob extremal patch (csarofeen#1804)
282c429 spam nvrtc options (csarofeen#1783)
3ba6a5f Broadcast in dim with expand (csarofeen#1794)
fd4be12 remove dead indexing code (csarofeen#1806)
fa4e6a4 Check siblings in getMaxPosAll (csarofeen#1805)
025c840 Grouping grid allreduces across iterations (csarofeen#1755)
37c579e Temporarily disable test requring large shared memory. (csarofeen#1802)
5f375d0 More cleanup on InlinePropagator (csarofeen#1800)
8d384da Indexing refactor stage 2 : Remove reference tensor in predicate indexing logic (csarofeen#1784)
f008140 MMA Rfactor support for cross-warp and cross-CTA split on K dimension (csarofeen#1554)
76b3cca Add parsing support for `_to_copy` to handle AMP casts. (csarofeen#1756)
ef04f6c Coding style cleanups (csarofeen#1798)
38c7f3c InlinePropagator please don't replay (csarofeen#1797)
3f2c263 validateDomain in TransformPropagator (csarofeen#1796)
c077085 Use TransformPropagatorWithCheck in many tests (csarofeen#1795)
d0d0908 Some further cleanup for the new computeAt interface (csarofeen#1793)
45f5203 Fix TransformReplay::getMatchedLeafPosWithoutReplay* (csarofeen#1791)
28cbaf9 New compute at interface (csarofeen#1743)
635ebfc Add SpanningTreePrinter (csarofeen#1786)
59f3c32 Output allocate patch (csarofeen#1790)
fe93bf5 Transform propagator skip replay when possible (csarofeen#1782)
ebf23a5 Fix isIntegralType error msg (csarofeen#1789)
0c82ecf Disable register reuse across serial broadcast ops (csarofeen#1787)
33a824d Adding sibling path for MaxInfoSpanningTree (csarofeen#1776)
86f46aa Fix div(Val, TensorView) (csarofeen#1778)
d3de227 Fix FusionMaxRootDomainInfoSpanningTreePrintTwice_CUDA (csarofeen#1781)
ecc7a87 Extend mma dimension and layout checking to support strided batched matmul and tensor contractions (csarofeen#1761)
```

[ghstack-poisoned]


Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. Indexing refactor -> Remove reference tensor in predicate indexing logic
  2. MMA Rfactor support for cross-warp and cross-CTA split on K dimension
  3. Grouping grid allreduces across iterations
  4. Swizzle op formulation for non-affine swizzles
  5. Use scheduler_utils to cache inputs and outputs in schedulePointwise
- scheduler refactor
  1. New compute at interface
- transformation propagation refactor on MaxInfoSpanningTree
  1. Added sibling path that is required to generate consistent replay for some cases where `MaxInfoSpanningTree` is used with a selector.
  2. Optimization to skip Transform propagator
  3. SpanningTreePrinter for debugging
- parser update
  1. Fixes `div`
  2. Added `_to_copy`
  3. Broadcast in dim with expand to support expanding to concrete size
  4. Dropout prob extremal patch
- executor patch on caching strides for output allocation

Squashed commits to WAR github API

Commits that's actually in this PR from the devel branch:

```
3b87896 Fix allocation of work buffers and `fused_reduction::ParallelReduce` with unswitch (#1818)
4cae122 schedulePointwise cleanup: - computeAt + InlinePropagator (#1815)
3df9742 Use scheduler_utils to cache inputs and outputs in schedulePointwise (#1811)
03180aa improve broadcast resolution (#1792)
bee6c69 bug fix (#1819)
4413c8f Support PYTORCH_NVFUSER_DUMP=transform_propagator (#1812)
de6b7ca Fix negative position in InlinePropagator (#1813)
10a996c Remove redundant check in schedulePointwise (#1810)
acd5ed4 Swizzle op formulation for non-affine swizzles (#1441)
3ed8330 Kernel args patch to show zero_init buffer (#1809)
037a75a Dropout prob extremal patch (#1804)
282c429 spam nvrtc options (#1783)
3ba6a5f Broadcast in dim with expand (#1794)
fd4be12 remove dead indexing code (#1806)
fa4e6a4 Check siblings in getMaxPosAll (#1805)
025c840 Grouping grid allreduces across iterations (#1755)
37c579e Temporarily disable test requring large shared memory. (#1802)
5f375d0 More cleanup on InlinePropagator (#1800)
8d384da Indexing refactor stage 2 : Remove reference tensor in predicate indexing logic (#1784)
f008140 MMA Rfactor support for cross-warp and cross-CTA split on K dimension (#1554)
76b3cca Add parsing support for `_to_copy` to handle AMP casts. (#1756)
ef04f6c Coding style cleanups (#1798)
38c7f3c InlinePropagator please don't replay (#1797)
3f2c263 validateDomain in TransformPropagator (#1796)
c077085 Use TransformPropagatorWithCheck in many tests (#1795)
d0d0908 Some further cleanup for the new computeAt interface (#1793)
45f5203 Fix TransformReplay::getMatchedLeafPosWithoutReplay* (#1791)
28cbaf9 New compute at interface (#1743)
635ebfc Add SpanningTreePrinter (#1786)
59f3c32 Output allocate patch (#1790)
fe93bf5 Transform propagator skip replay when possible (#1782)
ebf23a5 Fix isIntegralType error msg (#1789)
0c82ecf Disable register reuse across serial broadcast ops (#1787)
33a824d Adding sibling path for MaxInfoSpanningTree (#1776)
86f46aa Fix div(Val, TensorView) (#1778)
d3de227 Fix FusionMaxRootDomainInfoSpanningTreePrintTwice_CUDA (#1781)
ecc7a87 Extend mma dimension and layout checking to support strided batched matmul and tensor contractions (#1761)
```

RUN_TORCHBENCH: nvfuser
Differential Revision: [D38043938](https://our.internmc.facebook.com/intern/diff/D38043938)
Pull Request resolved: pytorch#81861
Approved by: https://github.com/davidberard98

jjsjann123 added 5 commits July 5, 2022 14:49

updating dropout logic

09635ac

fixing the flipped prob

184ad12

fixing python tests for 0/1 dropout

9d08b66

lintrunner

b80098f

Merge remote-tracking branch 'origin/devel' into HEAD

832a3d4

jjsjann123 requested a review from csarofeen July 6, 2022 16:24

csarofeen requested a review from IvanYashchuk July 7, 2022 13:00

IvanYashchuk reviewed Jul 7, 2022

jjsjann123 added 4 commits July 7, 2022 09:43

switched clamp on dropout to randlike

a3f9d8d

fixing randlike dtype

6de9c36

patch

4e28641

clangformat

a542b0a

jjsjann123 requested a review from IvanYashchuk July 7, 2022 16:54

jjsjann123 added 4 commits July 7, 2022 10:51

fixing float format in codegen

3e1747b

hmmm this is probably a bad idea

12c3fc4

patching digits

81ba9ff

code cleaning

0a6142d

IvanYashchuk approved these changes Jul 7, 2022

jjsjann123 added 2 commits July 7, 2022 12:45

clangformat

da6f49b

remove unused variables

c6d70f7

jjsjann123 mentioned this pull request Jul 7, 2022

The interval for torch::jit::fuser::cuda::randlike is not the same as for torch.rand_like #1807

Closed

jjsjann123 merged commit 037a75a into devel Jul 7, 2022

jjsjann123 deleted the dropout_prob_extremal_patch branch July 7, 2022 20:08
