[WIP] 8-bit quantization for inference #771
Conversation
Works with this quantization program (TODO integrate):

```python
import mxnet as mx

model = mx.nd.load("/home/ubuntu/idid-enus/model.amt.sf-concat/params.best")
# Collect the dense layers to quantize (strip the ".weight" suffix from parameter names).
dense = [k[0:-7] for k in model.keys()
         if k.endswith('.weight') and not k.startswith("embedding_source.")]
dense.remove("encoder.pos_embedding")
dense.remove("decoder.pos_embedding")
for param in dense:
    name = param + ".weight"
    b = model[name]
    b_max = mx.nd.contrib.intgemm_maxabsolute(b)
    # The disk format just quantizes.
    b_prepared = mx.nd.contrib.intgemm_prepare_data(b, b_max)
    model[name] = b_prepared
    model[param + ".scaling"] = b_max / 127.0
mx.nd.save("/home/ubuntu/idid-enus/model.amt.sf-concat.quant/params.best", model)
```
But it doesn't check that all the expected parameters are present in the provided model.
Updated:
You'll also need to …
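A minimal sketch of the missing check, assuming the converter above; `expected_dense` (the layer names the model config says should exist) is a hypothetical input, not something the script currently has:

```python
# Hypothetical completeness check for the converter above. `expected_dense`
# would have to come from the model config; the names here are illustrative only.
expected_dense = {"encoder.layers.0.ff.ff1", "decoder.layers.0.ff.ff2"}
found = {k[:-len(".weight")] for k in model.keys() if k.endswith(".weight")}
missing = expected_dense - found
if missing:
    raise KeyError("Parameters missing from the provided model: %s" % sorted(missing))
```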
Really looking forward to the corresponding mxnet change to get this merged!
Left a few comments, mostly minor style comments.
I think it would be nice to test int8 quantization in the system tests. This would entail quantizing the model in the test suite and adding another decoding pass, which would allow you to assert on output similarity and/or BLEU. It would also clarify the workflow with int8 quantization.
class QuantizableDense(mx.gluon.HybridBlock):
Couldn't you inherit from `mx.gluon.nn.basic_layers.Dense` directly and only overwrite `cast()` and `hybrid_forward`?
I agree. I tried to do this; it will need a consultation with a Gluon expert.
what was the issue?
I guess we need to carefully set the prefix for the inheriting class to make sure the parameter names match.
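A rough sketch of what the suggested inheritance could look like (this is not the code in the PR): subclass `mx.gluon.nn.Dense` and override only `cast()` and `hybrid_forward()`; the prefix/parameter-name handling discussed above and the registration of the int8 scaling parameter are only indicated by comments, and the intgemm operator signature is assumed.

```python
import mxnet as mx

class QuantizableDense(mx.gluon.nn.Dense):
    def __init__(self, *args, **kwargs):
        super(QuantizableDense, self).__init__(*args, **kwargs)
        self._dtype = 'float32'  # track the active dtype ourselves

    def cast(self, dtype):
        if dtype == 'int8':
            # A real implementation would register a 'scaling' parameter and
            # convert the weight to intgemm's layout instead of casting it;
            # the base-class cast() would mangle int8 weights.
            self._dtype = dtype
        else:
            self._dtype = dtype
            super(QuantizableDense, self).cast(dtype)

    def hybrid_forward(self, F, x, weight, bias=None):
        if self._dtype == 'int8':
            # Assumed operator from the intgemm MXNet branch; the exact
            # signature (scaling factor, flatten flag, ...) is elided here.
            return F.contrib.intgemm_fully_connected(x, weight, bias,
                                                     num_hidden=self._units)
        return super(QuantizableDense, self).hybrid_forward(F, x, weight, bias)
```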
model.cast(model_config.dtype)

if quantizing:
    logger.info("Model dtype: quantizing from float32 to int8")
We could potentially quantize from FP16, right? Or is everything on disk FP32?
There isn't a kernel to quantize from FP16 to INT8. CPUs aren't so great at FP16 anyway; they only have instructions to convert to/from FP32 then do all the math in FP32.
So this means that being able to quantize to int8 for inference requires having trained an FP32 model?
Do you have stable training in FP16? I guess I could add a code path to convert FP16 -> FP32 -> int8, which, sadly, is how the CPU would do it anyway.
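For reference, a minimal sketch of that conversion path using the same intgemm operators as the converter at the top of this thread; the weight tensor here is a stand-in, the only extra step versus the FP32 path is the upcast:

```python
import mxnet as mx

# Hypothetical FP16 -> FP32 -> int8 path: upcast the half-precision weight to
# float32 first, then quantize exactly as in the float32 converter above.
w_fp16 = mx.nd.random.normal(shape=(512, 512)).astype('float16')  # stand-in weight
w_fp32 = w_fp16.astype('float32')                                 # CPU-friendly upcast
w_max = mx.nd.contrib.intgemm_maxabsolute(w_fp32)
w_int8 = mx.nd.contrib.intgemm_prepare_data(w_fp32, w_max)
scaling = w_max / 127.0
```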
Co-Authored-By: Felix Hieber <fhieber@users.noreply.github.com>
…to heafield-quantize
Now supports three disk formats: float32 without scaling factors (the existing format), float32 annotated with scaling factors, and int8 with scaling factors.

Adding scaling factors (transition 1 -> 2):

```python
import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='float32', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
# Warning: do not use the loaded model for inference. Load from disk.
```

Adding scaling factors and quantizing (transition 1 -> 3):

```python
import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='int8', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
# Warning: do not use the loaded model for inference. Load from disk.
```

In both cases you'll need the …
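For completeness, a hypothetical sketch of the consuming side, assuming the saved config's dtype is int8 and the same `load_model` call shape as in the snippets above (this is not taken verbatim from the PR):

```python
import sockeye.model

# Load the annotated/quantized model back from disk for inference.
# load_model returns a tuple whose first element is the SockeyeModel,
# as implied by the model[0] usage above.
loaded = sockeye.model.load_model("model.annotated", dtype='int8')
quantized_model = loaded[0]
```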
Approved for merge into an intermediate branch for final cleanup.
This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm . A performance comparison with DNNL (aka MKL-DNN) is at kpu/intgemm#59. The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this issue is to add similar functionality to Sockeye: awslabs/sockeye#771 . Quantized Sockeye performance is 2.95x as fast.

One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything. intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction on every multiply by exploiting the fact that most neural network parameters are near 0. Because x86 only offers an unsigned * signed instruction and most people want signed * signed, there are two strategies one can take:

1. Add 128 to the data so it is unsigned. But that biases the output. DNNL calculates this bias on the fly by summing weights, then subtracts it out during GEMM. intgemm calculates this bias in advance, which can then be subtracted from the bias term with no overhead at runtime. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it's less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2. Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.
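To make strategy 1 concrete, here is a small NumPy sketch (illustrative only, not intgemm code) of the bias correction: shifting signed activations by +128 adds 128 * sum(weights per output column) to each product, and that term can be precomputed from the weights and folded into the bias.

```python
import numpy as np

# Illustrative check of strategy 1's bias correction.
rng = np.random.default_rng(0)
a = rng.integers(-128, 127, size=(4, 8)).astype(np.int32)   # signed activations
b = rng.integers(-128, 127, size=(8, 3)).astype(np.int32)   # signed weights

reference = a @ b                        # what we actually want: signed * signed

a_shifted = a + 128                      # unsigned data for the unsigned*signed kernel
correction = 128 * b.sum(axis=0)         # precomputable from the weights alone
recovered = a_shifted @ b - correction   # fold the correction into the bias term

assert np.array_equal(reference, recovered)
```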
Add support for 8-bit quantized matrix multiplication in inference.
This code depends on the intgemm branch in my fork of MXNet: https://github.com/kpuatamazon/incubator-mxnet/tree/intgemm . This will turn into a pull request against MXNet.
Quantized inference on one thread runs 2.95x as fast as the baseline on one thread, and 1.28x as fast as the baseline on four threads. Results are from an AWS c5.12xlarge.
BLEU: 42.6 quantized, 42.5 baseline float32. No significant change.
Note that the on-disk format of the int8 file is dependent on the CPU architecture. A fix for this is pending a change to intgemm to separate the quantization and rearrangement steps.
The model is converted to 8-bit offline using a program; the fp32 -> int8 conversion script is the one shown earlier in this conversation. You should also change the `config` file's `dtype` to `int8`. I'm soliciting suggestions on how to do this cleanly, probably another command-line program.
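As a stopgap until there is a proper command-line tool, a hypothetical one-off edit of the config's dtype; this assumes the config is the plain-text file in the model directory and contains a line like `dtype: float32`:

```python
from pathlib import Path

# Hypothetical stopgap: flip the dtype entry in the model's config file.
# Note this naive replace changes every matching occurrence in the file.
config_path = Path("model.annotated/config")
config_path.write_text(config_path.read_text().replace("dtype: float32", "dtype: int8"))
```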
Pull Request Checklist

- Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
- Unit tests pass (`pytest`)
- System tests pass (`pytest test/system`)
- Passed code style checking (`./style-check.sh`)
- Updated major/minor version in `sockeye/__init__.py`. Major version bump if this is a backwards incompatible change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.