@linkerzhang linkerzhang commented Dec 13, 2018

This PR adds int16/uint16 to quantization, since CNTK's quantized speech model uses 16 bits in the first FC layer to preserve accuracy. The PR also clarifies the overflow policy: 1) products MUST NOT overflow; 2) accumulation MAY overflow in 32 bits if the inputs are 8 bits, or in 64 bits if the inputs are 16 bits.

@KeDengMS @liwchang please let me know your thoughts, especially on the overflow clarification. Should we simplify it to "Accumulation may overflow in 32 bits in any case"? Thanks!
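The policy above can be sketched in a small example (this is an illustrative NumPy sketch, not ONNX Runtime code; the function name is hypothetical). For 8-bit inputs, each elementwise product fits in 16 bits (|a*b| ≤ 128*128 = 16384), so widening the products to int32 can never overflow; only the running int32 sum over a long reduction axis may wrap, which the policy permits.

```python
import numpy as np

def quantized_dot_int8(a, b):
    """Dot product of two int8 vectors with 32-bit accumulation.

    Each int8*int8 product fits in int16, so computing the products
    in int32 cannot overflow. The int32 accumulation itself may wrap
    for very long vectors, which the stated policy allows.
    """
    a32 = a.astype(np.int32)  # widen inputs before multiplying
    b32 = b.astype(np.int32)
    return int((a32 * b32).sum(dtype=np.int32))  # accumulate in 32 bits

a = np.array([127, -128, 100], dtype=np.int8)
b = np.array([127, 127, -50], dtype=np.int8)
print(quantized_dot_int8(a, b))  # 16129 - 16256 - 5000 = -5127
```

By the same reasoning, 16-bit inputs produce products that fit in 32 bits, so a 64-bit accumulator is the analogous safe choice there.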

@linkerzhang linkerzhang requested a review from a team as a code owner December 13, 2018 00:43
@ke1337 ke1337 left a comment

:shipit:

@linkerzhang linkerzhang merged commit 1d32aa9 into master Dec 13, 2018
@raymondxyang raymondxyang deleted the kezhan/add_16_bit_for_quantization branch December 14, 2018 19:00
quic-ankus pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025