[inductor] layout optimization for conv #99773
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99773
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 94c2012. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 4d7cc9cae903c92e14de207c7c5b50d6a10088f4 Pull Request resolved: #99773
This looks fine, but I'm interested in the full results for the timm suite.
The CI run is ready. We see quite a few torchbench/timm models get a 10-20% speedup. There are also a lot of things that need digging into.
My goal here is to figure these things out so that we get the speedup from using the channels-last layout for convolution without slowing down or failing any previously passing model. I'll debug the memory compression ratio after that.
Yup, makes sense; the tricky part will be coming up with a heuristic to harvest all those wins without taking on the slowdowns.
Convolution kernels with channels-last inputs run much faster than kernels with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolution. Some care needs to be taken to avoid converting tensor layouts between contiguous and channels last back and forth; those extra copies hurt performance quite a bit.

# Example command
- with the optimization disabled: TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --training
- with the optimization enabled: TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --training

# Result
I'll do some local runs first and then create a report from CI.
- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x
- hf_Bert: 1.489x -> 1.487x (neutral as expected, since the model has no convolution)

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
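For background, the kernel-level gap this PR exploits can also be observed directly in eager mode. Below is a minimal timing sketch (not part of the PR) comparing a contiguous vs. channels-last convolution; shapes and iteration counts are arbitrary and only for illustration.

```python
import torch

# Minimal sketch: compare conv2d time with contiguous vs. channels-last layout.
# Shapes are arbitrary; the gap depends on the GPU, dtype, and cudnn version.
device = "cuda"
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).to(device, torch.half)
x = torch.randn(32, 64, 56, 56, device=device, dtype=torch.half)

def bench(memory_format):
    m = conv.to(memory_format=memory_format)
    inp = x.to(memory_format=memory_format)
    with torch.no_grad():
        for _ in range(10):  # warmup
            m(inp)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(100):
            m(inp)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / 100  # ms per conv call

print("contiguous   :", bench(torch.contiguous_format))
print("channels_last:", bench(torch.channels_last))
```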
ghstack-source-id: 7c0ea5bbfe09d4fae81839c4b30ddfba599b391a Pull Request resolved: #99773
ghstack-source-id: e928b06a63441578b99e5f1f7df4b230a8be0666 Pull Request resolved: #99773
Unlike regular convolution, grouped convolution (groups argument > 1) does not prefer channels last. Using channels last for grouped convolution can result in a 1.65x slowdown (https://github.com/pytorch/pytorch/pull/99971/files#diff-7e3515959c8570fe48dcbeb882b15bc9079729b3da365c50f622ec9d4349adc9R25).

Here are some tests for timm_regnet. Previously I used channels last for all convolutions; that even slowed down inference. Later on, using channels last for regular convolution and contiguous for grouped convolution, we improve inference from 1.453x to 1.710x (numbers measured on my dev environment). But with this strategy training is still slowed down (from 1.168x to 0.847x).

For now, I just disable the convolution layout optimization if a model uses grouped convolution; a sketch of such a check follows. |
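A minimal sketch of what that check might look like. The helper name and the way the decision is consumed are hypothetical; only the rule "skip the layout optimization when any convolution is grouped" is taken from the comment above.

```python
import torch

def should_use_channels_last(gm: torch.fx.GraphModule) -> bool:
    """Hypothetical helper: enable the conv layout optimization only if the
    graph contains convolutions and none of them are grouped (groups > 1)."""
    saw_conv = False
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is torch.ops.aten.convolution.default:
            saw_conv = True
            # aten.convolution args: (input, weight, bias, stride, padding,
            # dilation, transposed, output_padding, groups)
            groups = node.args[8]
            if isinstance(groups, int) and groups > 1:
                return False  # grouped conv prefers contiguous; skip the optimization
    return saw_conv
```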
New round of perf results: previously failing models like timm_resnest/timm_nfnet now pass. Also, timm_regnet previously slowed down from 1.23x to 0.88x; now it's neutral, since the model contains grouped convolution and we skip layout optimization for such models for now. Still need to debug why some other models slow down.
ghstack-source-id: 458c2db61563640867511917e50e36867663d738 Pull Request resolved: #99773
ghstack-source-id: 359c5157f611d6e031a2fcc6aa47382450d74e33 Pull Request resolved: #99773
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: Raised by workflow job.
Convolution kernels with channels-last inputs run much faster than kernels with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolution. Some care needs to be taken to avoid converting tensor layouts between contiguous and channels last back and forth; those extra copies hurt performance quite a bit.

Latest perf numbers [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shapes since there is a perf loss in that combination. Here is a GH issue to follow up: #102670

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel anijain2305 soumith desertfire

[ghstack-poisoned]
ghstack-source-id: ad58d381595366fe374618ca5f6f530cd18ea917 Pull Request resolved: #99773
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Following up on this, do we have a plan for dynamic shapes? It seems like layout optimization should help even with dynamic batch size. Is the problem tuning the heuristics?
It would be nice if we stopped making changes where we ignore dyn shapes. I don't know what the best way to educate folks is, other than asking @jansel to reject PRs that carve out dynamic shapes / don't take them into account. That being said, @shunting314 has done a lot of due diligence and has a very nice followup task linked in the code (#102670). I think, at the very least, we should start by replacing the
@shunting314 let's follow up. I am happy to help with the change I described above, and also with actual proper layout optimization for dyn afterwards, not just static-in-dyn.
Thanks @ezyang and @voznesenskym for following up on dynamic shape support. In #102670, we found that the main reason we see a slowdown when enabling both dynamic shapes and layout optimization is the disabling of split reduction. Currently, when dynamic shapes are enabled, we may disable split reduction in the code (if any dimension has a dynamic shape). It turns out that layout optimization is penalized more by disabling split reduction (we have experimental results supporting this in the tracking issue). So if we want layout optimization to bring gains for dynamic shapes as well, we should try to enable split reduction. One idea we have is to generate multiple kernels for a reduction, each corresponding to a different range of values for the dynamic dimension; then at runtime we pick the proper one based on the runtime value of the dynamic dimension. We are building this multi-kernel support anyway for something else, so it looks promising to build it in general and apply it here as well.
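A minimal sketch of the multi-kernel idea described above. The per-range kernels and helper names are hypothetical; the real Inductor implementation would generate and select among compiled Triton kernels, but the runtime dispatch has the same shape.

```python
from typing import Callable, Dict, Tuple

Kernel = Callable[..., object]

def make_multi_kernel(kernels: Dict[Tuple[int, int], Kernel]) -> Kernel:
    """Hypothetical wrapper: pick a kernel based on the runtime value of the
    dynamic dimension (here assumed to be the first argument's size(0))."""
    def dispatch(x, *args, **kwargs):
        n = x.size(0)  # runtime value of the dynamic dimension
        for (lo, hi), kernel in kernels.items():
            if lo <= n < hi:
                return kernel(x, *args, **kwargs)
        raise RuntimeError(f"no kernel covers dynamic size {n}")
    return dispatch

# Usage sketch (kernel functions are placeholders):
# reduce = make_multi_kernel({
#     (0, 4096): plain_reduction,        # small sizes: no split needed
#     (4096, 1 << 62): split_reduction,  # large sizes: split reduction
# })
```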
OK. Generating multiple kernels is the PoR. We just have to do it. Who is working on multi-kernel support?
I'll work on that after I finish a couple of other followups on this PR.
# saved activations can have different stride to eager if
# the compiler does layout optimization. We should restride the
# tensor passed in for compiling the backward graph using the
# saved tensor's stride.
Is there a reason why layout optimization cannot be done prior to partitioning? If it could be done at that point in time, then you would have all the strides correct and you wouldn't have to write this code.
So layout optimization is currently applied at the Inductor IR level (i.e., by applying certain strides in ir.Layout), so it has to be done after partitioning. If we added some FX nodes to represent the layout change, then maybe we could do it earlier, before partitioning. But that's a completely different way to make it work, and I also think it would introduce extra complexity on the AOTAutograd side.
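For context, a minimal sketch of the kind of restriding the code comment above describes. The helper is hypothetical; it only illustrates matching a tensor to the saved activation's strides before compiling the backward graph.

```python
import torch

def restride_like(arg: torch.Tensor, saved: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: give `arg` the same size/stride as the saved
    activation, so the backward graph is compiled against the layout the
    forward actually produced (e.g. channels last after layout optimization)."""
    if arg.size() == saved.size() and arg.stride() != saved.stride():
        # empty_strided + copy_ preserves values while changing the layout
        out = torch.empty_strided(saved.size(), saved.stride(),
                                  dtype=arg.dtype, device=arg.device)
        out.copy_(arg)
        return out
    return arg
```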
real_arg = all_args[i]
if not isinstance(ph_arg, torch.Tensor):
    continue
if ph_arg.stride() != real_arg.stride():
This forces specialization because ph_arg is symbolic but real_arg is not, and an equality test between a non-symbolic and a symbolic value does specialization. Then you run into the other problem, which is that we silently discard backward guards (hopefully you got a warning about this, which helped you diagnose the problem).
There is no way to do this stride test. What you should do instead is a permutation test. It should be possible to extract the layout permutation without causing extra guards and then do an equality test.
But I am also skeptical that you should be doing this logic here anyway.
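A sketch of the permutation test suggested above, shown on concrete tensors; the helper name is made up, and for a symbolic ph_arg the order would presumably need to be computed from stride hints rather than symbolic strides so that no extra guards are introduced.

```python
import torch

def stride_order(t: torch.Tensor):
    """Return dims ordered from largest stride to smallest, i.e. the memory
    layout permutation: NCHW contiguous -> (0, 1, 2, 3),
    channels last -> (0, 2, 3, 1)."""
    return tuple(sorted(range(t.dim()), key=lambda d: t.stride(d), reverse=True))

# Instead of `ph_arg.stride() != real_arg.stride()` (which specializes the
# symbolic strides), compare only the layout permutations:
# if stride_order(ph_arg) != stride_order(real_arg):
#     ...  # layouts differ; restride / recompile as needed
```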
Thanks @ezyang for the explanation! One question I have is: can I get the stride hint for ph_arg directly and use that to compare with real_arg.stride()?
Hopefully this also makes sure we don't drop dynamic dimensions in ph_arg.
Stack from ghstack (oldest at bottom):
Convolution kernels with channels-last inputs run much faster than kernels with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolution. Some care needs to be taken to avoid converting tensor layouts between contiguous and channels last back and forth; those extra copies hurt performance quite a bit.
Latest perf numbers here
Right now we disable layout optimization for dynamic shapes since there is a perf loss in that combination. Here is a GH issue to follow up: #102670
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @anijain2305 @soumith @desertfire