[Relay] A set of utilities that allows a model to be run efficiently on tensorcores. #6748
Conversation
@Laurawly @masahi @csullivan @jroesch Can you guys take a look at this PR?
LGTM
Thanks for the layer count and recast mutator passes @jwfromm! They are quite useful additions to have.
```python
def __init__(self, skip_layers=None):
    self.skip_counter = 0
    self.skip_layers = skip_layers if skip_layers is not None else []
```
When have you found it useful to skip a specific layer of a given operator type / how do you envision it being used? Mainly for debugging and performance tests?
In this case, the first layer of most networks does not have a sufficient number of channels for our tensorcore schedules to be applied. Although this in theory wouldn't be a problem, there aren't HWNC schedules for GPU. So if you blindly apply ConvertLayout to all layers, you end up with a first layer that can't be executed. Skipping it during conversion is an elegant way to avoid this issue. I imagine a similar pathology could apply to other situations.
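For example (a minimal sketch, assuming `mod` is the Relay module being converted and the first conv2d is layer 0; the desired layouts are illustrative):

```python
import tvm
from tvm import relay

# Leave layer 0 (the first conv2d) in its original layout; convert the rest to HWNC.
with relay.transform.LayoutConfig(skip_layers=[0]):
    seq = tvm.transform.Sequential(
        [relay.transform.InferType(),
         relay.transform.ConvertLayout({"nn.conv2d": ["HWNC", "HWOI"]})]
    )
    with tvm.transform.PassContext(opt_level=3):
        mod = seq(mod)
```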
This collection of new utility functions enables a starting floating point model to be converted to a datatype and format that can be run using the efficient HWNC tensorcore schedules introduced in #6121. Although these schedules are the fastest available in TVM, they have a few very specific requirements that make them difficult to apply generally to models. Specifically, compatible operators must have inputs set to `int4` or `int8`, all compatible layers must be in the `HWNC` layout, and incompatible layers should be left in their original layout and datatype. There are currently no tools to make such changes to an existing model. To address this, I've written the following utilities:

- `count_layers`: A pass that determines the number of layers of the specified operator in a graph. Although generally useful, for tensorcores we use this to enable the `skip_layers` feature.
- `recast`: A pass that changes the input and output datatype of all specified operators in a graph, with the option to skip a set of layers. Although this pass is only useful for benchmarking, as it does not apply any intelligent quantization, this type of utility is a common topic on the Discuss forums and can serve as a good example for users interested in similar functionality.
- `LayoutConfig`: An optional scope that can be applied around the `ConvertLayout` pass. In this PR I use it to enable skipping the conversion of specified conv2d layers, but it could be extended for other customization down the line.
- HWNC support for `ConvertLayout`.

The combination of these utilities allows us to target HWNC tensorcores using a workflow such as this:
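A rough sketch of that workflow (the toy network, dtypes, skip indices, and desired layouts are just for illustration; a real model would be imported from a frontend and then autotuned):

```python
import tvm
from tvm import relay

# A toy two-conv2d network standing in for a real model.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
w1 = relay.var("w1", shape=(32, 3, 3, 3), dtype="float32")
w2 = relay.var("w2", shape=(32, 32, 3, 3), dtype="float32")
conv1 = relay.nn.conv2d(data, w1, channels=32, kernel_size=(3, 3), padding=(1, 1))
conv2 = relay.nn.conv2d(conv1, w2, channels=32, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function(relay.analysis.free_vars(conv2), conv2))

# count_layers reports how many conv2d layers the skip indices below refer to.
num_convs = relay.analysis.count_layers(mod["main"], ["nn.conv2d"])

# Recast conv2d inputs to int8 (accumulating in int32), skipping the first layer.
mod["main"] = relay.transform.recast(
    mod["main"], "int8", "int32", ops=["nn.conv2d"], skip_layers=[0]
)

# Convert the remaining conv2d layers to the HWNC layout, again skipping layer 0.
with relay.transform.LayoutConfig(skip_layers=[0]):
    seq = tvm.transform.Sequential(
        [relay.transform.InferType(),
         relay.transform.ConvertLayout({"nn.conv2d": ["HWNC", "HWOI"]})]
    )
    with tvm.transform.PassContext(opt_level=3):
        mod = seq(mod)
```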
When autotuned, the resulting `mod` will qualify for using the HWNC tensorcore strategy.