[inductor] layout optimization for conv #99773
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99773
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 94c2012. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 4d7cc9cae903c92e14de207c7c5b50d6a10088f4 Pull Request resolved: #99773
This looks fine, but I'm interested in the full results for the timm suite.
The CI run is ready. We see quite a few torchbench/timm models get a 10-20% speedup. There are also a lot of things that need digging into.
My goal here is to figure these things out so that we get the speedup from using the channels-last layout for convolution without slowing down or failing any previously passing model. I'll debug the memory compression ratio after that.
Yup, makes sense; the tricky part will be coming up with a heuristic to harvest all those wins without taking on the slowdowns.
Convolution kernels with channels-last inputs run much faster than kernels with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolution. Some care needs to be taken to avoid converting tensor layouts between contiguous and channels last back and forth; those extra copies hurt performance quite a bit.

# Example command
- with the optimization disabled: TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --training
- with the optimization enabled: TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --training

# Result
I'll do some local runs first and then create a report from CI.
- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x
- hf_Bert: 1.489x -> 1.487x (neutral as expected, since the model has no convolution)

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
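For background, the kernel-level gap this PR exploits can also be observed directly in eager mode. Below is a minimal timing sketch (not part of the PR) comparing a contiguous vs. channels-last convolution; shapes and iteration counts are arbitrary and only for illustration.

```python
import torch

# Minimal sketch: compare conv2d time with contiguous vs. channels-last layout.
# Shapes are arbitrary; the gap depends on the GPU, dtype, and cudnn version.
device = "cuda"
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).to(device, torch.half)
x = torch.randn(32, 64, 56, 56, device=device, dtype=torch.half)

def bench(memory_format):
    m = conv.to(memory_format=memory_format)
    inp = x.to(memory_format=memory_format)
    with torch.no_grad():
        for _ in range(10):  # warmup
            m(inp)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(100):
            m(inp)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / 100  # ms per conv call

print("contiguous   :", bench(torch.contiguous_format))
print("channels_last:", bench(torch.channels_last))
```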
ghstack-source-id: 7c0ea5bbfe09d4fae81839c4b30ddfba599b391a Pull Request resolved: #99773
ghstack-source-id: e928b06a63441578b99e5f1f7df4b230a8be0666 Pull Request resolved: #99773
Unlike regular convolution, grouped convolution (groups argument > 1) does not prefer channels last. Using channels last for grouped convolution can result in a 1.65x slowdown (https://github.com/pytorch/pytorch/pull/99971/files#diff-7e3515959c8570fe48dcbeb882b15bc9079729b3da365c50f622ec9d4349adc9R25).

Here are some tests for timm_regnet. Previously I used channels last for all convolutions; that even slowed down inference. Later on, using channels last for regular convolution and contiguous for grouped convolution, we improve inference from 1.453x to 1.710x (numbers measured on my dev environment). But with this strategy training is still slowed down (from 1.168x to 0.847x).

For now, I just disable the convolution layout optimization if a model uses grouped convolution; a sketch of such a check follows. |
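A minimal sketch of what that check might look like. The helper name and the way the decision is consumed are hypothetical; only the rule "skip the layout optimization when any convolution is grouped" is taken from the comment above.

```python
import torch

def should_use_channels_last(gm: torch.fx.GraphModule) -> bool:
    """Hypothetical helper: enable the conv layout optimization only if the
    graph contains convolutions and none of them are grouped (groups > 1)."""
    saw_conv = False
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is torch.ops.aten.convolution.default:
            saw_conv = True
            # aten.convolution args: (input, weight, bias, stride, padding,
            # dilation, transposed, output_padding, groups)
            groups = node.args[8]
            if isinstance(groups, int) and groups > 1:
                return False  # grouped conv prefers contiguous; skip the optimization
    return saw_conv
```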
New round of perf results: previously failing models like timm_resnest/timm_nfnet now pass. Also, timm_regnet previously slowed down from 1.23x to 0.88x; now it's neutral, since the model contains grouped convolution and we skip layout optimization for such models for now. Still need to debug why some other models slow down.
ghstack-source-id: 458c2db61563640867511917e50e36867663d738 Pull Request resolved: #99773
ghstack-source-id: 359c5157f611d6e031a2fcc6aa47382450d74e33 Pull Request resolved: #99773
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: Raised by workflow job.
Convolution kernels with channels-last inputs run much faster than kernels with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolution. Some care needs to be taken to avoid converting tensor layouts between contiguous and channels last back and forth; those extra copies hurt performance quite a bit.

Latest perf numbers [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shapes since there is a perf loss in that combination. Here is a GH issue to follow up: #102670

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel anijain2305 soumith desertfire

[ghstack-poisoned]
ghstack-source-id: ad58d381595366fe374618ca5f6f530cd18ea917 Pull Request resolved: #99773
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Following up on this, do we have a plan for dynamic shapes? It seems like layout optimization should help even with dynamic batch size. Is the problem tuning the heuristics?
It would be nice if we stopped making changes where we ignore dyn shapes. I don't know what the best way to educate folks is, other than asking @jansel to reject PRs that carve out dynamic shapes / don't take them into account. That being said, @shunting314 has done a lot of due diligence and has a very nice followup task linked in the code (#102670). I think, at the very least, we should start by replacing the
@shunting314 let's follow up. I am happy to help with the change I described above, and also with actual proper layout optimization for dyn afterwards, not just static-in-dyn.
Thanks @ezyang and @voznesenskym for following up on dynamic shape support. In #102670, we found that the main reason we see a slowdown when enabling both dynamic shapes and layout optimization is the disabling of split reduction. Currently, when dynamic shapes are enabled, we may disable split reduction in the code (if any dimension has a dynamic shape). It turns out that layout optimization is penalized more by disabling split reduction (we have experimental results supporting this in the tracking issue). So if we want layout optimization to bring gains for dynamic shapes as well, we should try to enable split reduction. One idea we have is to generate multiple kernels for a reduction, each corresponding to a different range of values for the dynamic dimension; then at runtime we pick the proper one based on the runtime value of the dynamic dimension. We are building this multi-kernel support anyway for something else, so it looks promising to build it in general and apply it here as well.
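A minimal sketch of the multi-kernel idea described above. The per-range kernels and helper names are hypothetical; the real Inductor implementation would generate and select among compiled Triton kernels, but the runtime dispatch has the same shape.

```python
from typing import Callable, Dict, Tuple

Kernel = Callable[..., object]

def make_multi_kernel(kernels: Dict[Tuple[int, int], Kernel]) -> Kernel:
    """Hypothetical wrapper: pick a kernel based on the runtime value of the
    dynamic dimension (here assumed to be the first argument's size(0))."""
    def dispatch(x, *args, **kwargs):
        n = x.size(0)  # runtime value of the dynamic dimension
        for (lo, hi), kernel in kernels.items():
            if lo <= n < hi:
                return kernel(x, *args, **kwargs)
        raise RuntimeError(f"no kernel covers dynamic size {n}")
    return dispatch

# Usage sketch (kernel functions are placeholders):
# reduce = make_multi_kernel({
#     (0, 4096): plain_reduction,        # small sizes: no split needed
#     (4096, 1 << 62): split_reduction,  # large sizes: split reduction
# })
```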
OK. Generating multiple kernels is the PoR. We just have to do it. Who is working on multi-kernel support?
I'll work on that after I finish a couple of other followups on this PR.
# saved activations can have different stride to eager if
# the compiler does layout optimization. We should restride the
# tensor passed in for compiling the backward graph using the
# saved tensor's stride.
Is there a reason why layout optimization cannot be done prior to partitioning? If it could be done at that point in time, then you would have all the strides correct and you wouldn't have to write this code.
So layout optimization is currently applied at the Inductor IR level (i.e., by applying certain strides in ir.Layout), so it has to be done after partitioning. If we added some FX nodes to represent the layout change, then maybe we could do it earlier, before partitioning. But that's a completely different way to make it work, and I also think it would introduce extra complexity on the AOTAutograd side.
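For context, a minimal sketch of the kind of restriding the code comment above describes. The helper is hypothetical; it only illustrates matching a tensor to the saved activation's strides before compiling the backward graph.

```python
import torch

def restride_like(arg: torch.Tensor, saved: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: give `arg` the same size/stride as the saved
    activation, so the backward graph is compiled against the layout the
    forward actually produced (e.g. channels last after layout optimization)."""
    if arg.size() == saved.size() and arg.stride() != saved.stride():
        # empty_strided + copy_ preserves values while changing the layout
        out = torch.empty_strided(saved.size(), saved.stride(),
                                  dtype=arg.dtype, device=arg.device)
        out.copy_(arg)
        return out
    return arg
```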
real_arg = all_args[i]
if not isinstance(ph_arg, torch.Tensor):
    continue
if ph_arg.stride() != real_arg.stride():
This forces specialization because ph_arg is symbolic but real_arg is not, and an equality test between a non-symbolic and a symbolic value does specialization. Then you run into the other problem, which is that we silently discard backward guards (hopefully you got a warning about this, which helped you diagnose the problem).
There is no way to do this stride test. What you should do instead is a permutation test. It should be possible to extract the layout permutation without causing extra guards and then do an equality test.
But I am also skeptical that you should be doing this logic here anyway.
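A sketch of the permutation test suggested above, shown on concrete tensors; the helper name is made up, and for a symbolic ph_arg the order would presumably need to be computed from stride hints rather than symbolic strides so that no extra guards are introduced.

```python
import torch

def stride_order(t: torch.Tensor):
    """Return dims ordered from largest stride to smallest, i.e. the memory
    layout permutation: NCHW contiguous -> (0, 1, 2, 3),
    channels last -> (0, 2, 3, 1)."""
    return tuple(sorted(range(t.dim()), key=lambda d: t.stride(d), reverse=True))

# Instead of `ph_arg.stride() != real_arg.stride()` (which specializes the
# symbolic strides), compare only the layout permutations:
# if stride_order(ph_arg) != stride_order(real_arg):
#     ...  # layouts differ; restride / recompile as needed
```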
Thanks @ezyang for the explanation! One question I have is: can I get the stride hint for ph_arg directly and use that to compare with real_arg.stride()?
Hopefully this also makes sure we don't drop dynamic dimensions in ph_arg.
Stack from ghstack (oldest at bottom):
Convolution kernels with channels-last inputs run much faster than kernels with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolution. Some care needs to be taken to avoid converting tensor layouts between contiguous and channels last back and forth; those extra copies hurt performance quite a bit.
Latest perf numbers here
Right now we disable layout optimization for dynamic shapes since there is a perf loss in that combination. Here is a GH issue to follow up: #102670
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @anijain2305 @soumith @desertfire