[QNN][TFLite] TFLite rounding mode support #4828

Closed · wants to merge 4 commits

Conversation

@fwd4 (Contributor) commented Feb 6, 2020


Add TFLite rounding support with corresponding test cases. The TFLite rounding mode golden results are generated with a testbench using the MultiplyByQuantizedMultiplier function here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L148

@FrozenGene @anijain2305

Might help fix the problem described here:
https://discuss.tvm.ai/t/supporting-bit-exact-tflite-qnn-inference/5528
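
For reference, the rounding that MultiplyByQuantizedMultiplier performs can be emulated in a few lines of Python. This is an illustrative sketch of the gemmlowp-style composition of SaturatingRoundingDoublingHighMul and RoundingDivideByPOT that the linked TFLite helper uses, not code from this PR; the function names simply mirror the TFLite reference, and the int32 wrap-around of the C code is ignored (valid quantized multipliers keep the values in range).

INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1

def saturating_rounding_doubling_high_mul(a, b):
    # The single overflow case (INT32_MIN * INT32_MIN) saturates, as in the reference.
    if a == b == INT32_MIN:
        return INT32_MAX
    ab = a * b                                   # exact 64-bit product (Python ints are unbounded)
    nudge = (1 << 30) if ab >= 0 else 1 - (1 << 30)
    x = ab + nudge
    q = abs(x) >> 31                             # C-style division by 2**31 truncates toward zero
    return -q if x < 0 else q

def rounding_divide_by_pot(x, exponent):
    # Rounding arithmetic right shift: round to nearest, ties away from zero.
    mask = (1 << exponent) - 1
    remainder = x & mask
    threshold = (mask >> 1) + (1 if x < 0 else 0)
    return (x >> exponent) + (1 if remainder > threshold else 0)

def multiply_by_quantized_multiplier(x, quantized_multiplier, shift):
    # shift > 0 means a left shift before the high-mul, shift < 0 a rounding right shift after it.
    left_shift = shift if shift > 0 else 0
    right_shift = -shift if shift < 0 else 0
    return rounding_divide_by_pot(
        saturating_rounding_doubling_high_mul(x << left_shift, quantized_multiplier),
        right_shift)

A testbench along these lines is presumably how the golden outputs mentioned above were produced.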

@FrozenGene (Member) commented

Thanks @LiangHao151941! I haven't reviewed the code yet, but I have some high-level comments:

  1. If we have TFLITE rounding support, we should make the TFLite frontend use TFLITE rounding, and we should get bit-exact results compared with TFLite.

  2. You should also modify test_forward.py for TFLite (the test_qnn* test cases); ideally we shouldn't need atol=1 any more.

  3. You could add a q_conv2d unit test case so we can check that we get the same result as TFLite; we currently lack this unit test.

@FrozenGene changed the title from "tflite rounding mode support" to "[QNN][TFLite] TFLite rounding mode support" on Feb 6, 2020
@fwd4 (Contributor, Author) commented Feb 6, 2020

Makes sense! I'll follow up on those. @FrozenGene

@anijain2305 (Contributor) commented Feb 6, 2020

Thanks for this PR. It will help reduce/remove those 1-off deviations.

One request: while setting the rounding mode to TFLite in the TFLite parser, it might be better to set it by adding an extra argument to the from_tflite API. The main reason is that the lowering here might involve many more Relay ops, which might degrade performance. So, to enable TVM users to choose different accuracy-performance tradeoff points, we can add a new argument to the TFLite parser API and pass it along to the QNN ops. The default value of this argument can be TFLite rounding. @FrozenGene Are you OK with this?

@u99127 @mbarrett97 will also be interested in this.

Also, I think we can have small PRs merged. And incrementally improve test coverage.
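
As a rough illustration of that proposal: the parser-level choice would simply be forwarded to the rounding attribute that qnn.requantize already exposes. The converter class and its rounding argument below are hypothetical (from_tflite has no such parameter today); only relay.qnn.op.requantize and its keyword arguments are existing API.

from tvm import relay

class TFLiteConverterSketch:
    def __init__(self, rounding="TFLITE"):       # hypothetical parser-level argument
        self.rounding = rounding

    def emit_requantize(self, data, in_scale, in_zp, out_scale, out_zp, out_dtype):
        # qnn.requantize already carries a `rounding` attribute; the proposal is to
        # let the parser choose its value instead of relying on the default.
        return relay.qnn.op.requantize(
            data,
            input_scale=in_scale, input_zero_point=in_zp,
            output_scale=out_scale, output_zero_point=out_zp,
            rounding=self.rounding, out_dtype=out_dtype)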

@anijain2305 (Contributor) left a review comment:

LGTM. I like the scope of this PR. IMO, we can get this in. And send other PRs for TFLite parser.

@FrozenGene (Member) commented Feb 7, 2020

> One request: while setting the rounding mode to TFLite in the TFLite parser, it might be better to set it by adding an extra argument to the from_tflite API. The main reason is that the lowering here might involve many more Relay ops, which might degrade performance. So, to enable TVM users to choose different accuracy-performance tradeoff points, we can add a new argument to the TFLite parser API and pass it along to the QNN ops. The default value of this argument can be TFLite rounding. @FrozenGene Are you OK with this?

Yes, this is my thought too. I am ok with it.

> LGTM. I like the scope of this PR. IMO, we can get this in. And send other PRs for TFLite parser.

IMO, I would prefer adding the TFLite parser change in this PR too, because then it is organized into one complete solution for TFLite rounding. When others review the history, they won't need to piece together isolated PRs to get the complete picture of TFLite rounding support, and the time span across those PRs may be long. It is true that small PRs can be merged quickly, so we get functionality sooner and can locate problems faster. However, from this PR's perspective, the TFLite parser code will not be complex and can be contained here; one single PR is acceptable in my opinion. So I would suggest @LiangHao151941 add the TFLite parser change in this PR too. If we think it is important, we could even port this back to release 0.6, and one PR will make that easy too.

@FrozenGene (Member) left a review comment:

As you have done the parser part, you could also change test_forward.py for TFLite. Ideally we shouldn't need atol=1 any more.
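
Concretely, the tightened check in the test_qnn* cases would go from a 1-off tolerance to an exact comparison, something like the following (tvm_output and tflite_output are placeholder names, not the test file's exact variables):

import numpy as np

# before: tolerate the 1-off deviations caused by the rounding difference
np.testing.assert_allclose(tvm_output, tflite_output, rtol=0, atol=1)

# after: require bit-exact results once TFLITE rounding is used end to end
np.testing.assert_array_equal(tvm_output, tflite_output)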

@fwd4 (Contributor, Author) commented Feb 7, 2020

Some updates from further experiments @FrozenGene @anijain2305:

>   1. If we have TFLITE rounding support, we should make the TFLite frontend use TFLITE rounding, and we should get bit-exact results compared with TFLite.

This is added in the TFLite Relay frontend.

>   2. You should also modify test_forward.py for TFLite (the test_qnn* test cases); ideally we shouldn't need atol=1 any more.

For the current label comparison, the new rounding mode works fine. Unfortunately, the new rounding mode does not bring bit-exact execution. In fact, for the 3 qnn models in test_forward.py, with the L1 norm of prediction error as the metric, mobilenet_v1 and mobilenet_v2 do drop with "TFLITE" rounding, but inception_v1 increases the metric (i.e., more error).
Also note that performance will degrade, though this has not been measured quantitatively yet.

>   3. You could add a q_conv2d unit test case so we can check that we get the same result as TFLite; we currently lack this unit test.

This would be essential for further investigating the root causes of the bit mismatch. Will follow up.

@anijain2305 (Contributor) commented

I thought a little more about the bit-exact problem. One source of discrepancy for certain is the QNN add and QNN concatenate ops. These call Requantize internally, and they will have the default rounding in C++ (UPWARD). For testing, @LiangHao151941, you can set the C++ default to TFLite to see if it helps. Meanwhile, we can also think about how to make that C++ rounding visible at the Python level.

@anijain2305 (Contributor) commented

> If we think it is important, we could even port this back to release 0.6, and one PR will make that easy too.

That is OK. Let's try to see if we can get it in as a single PR.

@FrozenGene (Member) commented

> I thought a little more about the bit-exact problem. One source of discrepancy for certain is the QNN add and QNN concatenate ops. These call Requantize internally, and they will have the default rounding in C++ (UPWARD). For testing, @LiangHao151941, you can set the C++ default to TFLite to see if it helps. Meanwhile, we can also think about how to make that C++ rounding visible at the Python level.

Maybe we could use @register_func to register one function named qnn.requantize.rounding in the TFLite parser; then we could get the value in RequantizeQnnCanonicalize on the C++ side. However, this has one problem: if a user specifies the rounding value explicitly, it won't work, because we don't know whether the rounding value was set by default or by the user. So if we support this, we are constrained to a single rounding set by the TFLite parser.
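
A sketch of that registration idea; the key string "qnn.requantize.rounding" is just the name suggested here, not an existing TVM convention, and the C++-side lookup is only described in the comment:

import tvm

@tvm.register_func("qnn.requantize.rounding")
def _tflite_parser_rounding():
    # RequantizeQnnCanonicalize could look this function up through the global
    # registry and fall back to its built-in default when it is not registered.
    return "TFLITE"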

Another way is to provide a rounding argument for qnn.add / qnn.mul / qnn.concatenate, because they use requantize internally too, so they need rounding.
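
As a simplified sketch of why the rounding argument matters for these ops (not TVM's exact canonicalization: the saturation range, dtypes, and the rounding parameter on a binary op are assumptions for illustration), the binary ops requantize both inputs before adding, so the rounding choice enters there:

from tvm import relay

def qnn_add_with_rounding(lhs, rhs, lhs_scale, lhs_zp, rhs_scale, rhs_zp,
                          out_scale, out_zp, rounding="TFLITE"):
    # Requantize both sides into the output quantization params, add in int32,
    # subtract the zero point that was added twice, then clamp to uint8 range.
    def requant(x, scale, zp):
        return relay.qnn.op.requantize(
            x, input_scale=scale, input_zero_point=zp,
            output_scale=out_scale, output_zero_point=out_zp,
            rounding=rounding, out_dtype="int32")
    out = relay.add(requant(lhs, lhs_scale, lhs_zp), requant(rhs, rhs_scale, rhs_zp))
    out = relay.subtract(out, relay.cast(out_zp, "int32"))
    return relay.cast(relay.clip(out, a_min=0, a_max=255), "uint8")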

@anijain2305 (Contributor) commented

I like this way

> Another way is to provide a rounding argument for qnn.add / qnn.mul / qnn.concatenate, because they use requantize internally too, so they need rounding.

@FrozenGene (Member) commented

> I like this way

> Another way is to provide a rounding argument for qnn.add / qnn.mul / qnn.concatenate, because they use requantize internally too, so they need rounding.

cc @LiangHao151941 Could you try this way to do bit-exact testing? For convenience, you could simply change the C++ default rounding value to "TFLITE" to test.

@fwd4 (Contributor, Author) commented Feb 7, 2020

Just modified the add/mul/concat requantize rounding mode and tested; no luck. Will change the default rounding behavior for a later test.

Update: I forced the FixedPointMultiply(PerChannel) rounding mode to be TFLITE, but am still unable to get bit-exact results.

One more thing: setting the TFLite default rounding mode to TFLITE seems to break the GPU test of mobilenet_v2. Do you have any ideas/suggestions?

@FrozenGene @anijain2305

@anijain2305 (Contributor) commented

> Update: I forced the FixedPointMultiply(PerChannel) rounding mode to be TFLITE, but am still unable to get bit-exact results.

:( This will be a challenging task. I think, instead of trying to fix the full e2e test, we should now focus on unit tests. I can help with improving unit test coverage.

> One more thing: setting the TFLite default rounding mode to TFLITE seems to break the GPU test of mobilenet_v2. Do you have any ideas/suggestions?

Do we run TFLite tests on GPU? A little confused.

@u99127 (Contributor) left a review comment:

I don't yet have an opinion on these changes, but there are coding-style-related changes that I've noticed which could be explained / improved.

In general I would prefer someone read through the comments to make sure they still hold once the changes land, as I don't see them fully updated in the source base.

@FrozenGene (Member) commented Feb 8, 2020

> Just modified the add/mul/concat requantize rounding mode and tested; no luck. Will change the default rounding behavior for a later test.
>
> Update: I forced the FixedPointMultiply(PerChannel) rounding mode to be TFLITE, but am still unable to get bit-exact results.
>
> One more thing: setting the TFLite default rounding mode to TFLITE seems to break the GPU test of mobilenet_v2. Do you have any ideas/suggestions?
>
> @FrozenGene @anijain2305

Let us break the model down layer by layer and compare with TFLite. Let me describe my development approach; maybe it can help you. For example, for the quantized mobilenet_v2 model, you can get both the quantized TensorFlow and TFLite models. Then you can call tflite_convert (feeding it the quantized TensorFlow model) and set the output layer (for example, just the first convolution layer); you then get a quantized TFLite model containing only the first convolution layer of mobilenet_v2. After you verify it is correct, you can go on until you finish the whole model end to end. Command example:
tflite_convert --input_format=TENSORFLOW_GRAPHDEF --graph_def_file="xx.pb" --output_file=xx.tflite --output_format=TFLITE --input_arrays=input --input_shapes=1,224,224,3 --std_dev_values=127 --mean_values=127 --inference_type=QUANTIZED_UINT8 --inference_input_type=QUANTIZED_UINT8 --default_ranges_min=0 --default_ranges_max=6 --output_arrays=xx

When you verify, I think you could run on CPU locally first to find the issue, then consider the GPU CI issue.
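
A small harness for the per-layer comparison can use the TFLite Python interpreter directly; the file name and the TVM-side helper below are placeholders:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="first_conv_only.tflite")   # truncated model from tflite_convert
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

data = np.random.randint(0, 256, size=tuple(inp["shape"]), dtype=np.uint8)
interpreter.set_tensor(inp["index"], data)
interpreter.invoke()
tflite_out = interpreter.get_tensor(out["index"])

# tvm_out = run_with_tvm(data)                        # placeholder for the TVM side
# np.testing.assert_array_equal(tvm_out, tflite_out)  # bit-exact check, layer by layer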

@fwd4 (Contributor, Author) commented Feb 10, 2020

> :( This will be a challenging task. I think, instead of trying to fix the full e2e test, we should now focus on unit tests. I can help with improving unit test coverage.

That would be appreciated, thanks!

> Do we run TFLite tests on GPU? A little confused.

Yes, according to the Jenkins report.

@fwd4 (Contributor, Author) commented Feb 10, 2020

> Let us break the model down layer by layer and compare with TFLite. Let me describe my development approach; maybe it can help you. For example, for the quantized mobilenet_v2 model, you can get both the quantized TensorFlow and TFLite models. Then you can call tflite_convert (feeding it the quantized TensorFlow model) and set the output layer (for example, just the first convolution layer); you then get a quantized TFLite model containing only the first convolution layer of mobilenet_v2. After you verify it is correct, you can go on until you finish the whole model end to end. Command example:
> tflite_convert --input_format=TENSORFLOW_GRAPHDEF --graph_def_file="xx.pb" --output_file=xx.tflite --output_format=TFLITE --input_arrays=input --input_shapes=1,224,224,3 --std_dev_values=127 --mean_values=127 --inference_type=QUANTIZED_UINT8 --inference_input_type=QUANTIZED_UINT8 --default_ranges_min=0 --default_ranges_max=6 --output_arrays=xx

Thanks for the detailed instructions here. Will follow up.

> When you verify, I think you could run on CPU locally first to find the issue, then consider the GPU CI issue.

No problem.

@FrozenGene (Member) commented

ping @LiangHao151941

@tqchen (Member) commented Feb 26, 2020

ping @LiangHao151941, please follow up

@tqchen added the "status: need update" label (needs update based on feedback) on Feb 26, 2020
@fwd4 (Contributor, Author) commented Feb 29, 2020

There is quite a bit of tedious work to do, but thanks for the reminder; will follow up shortly @FrozenGene @tqchen

@fwd4 force-pushed the features/tflite-rounding branch from 67c2be9 to 3fef741 on March 5, 2020
@fwd4 (Contributor, Author) commented Mar 5, 2020

The problem is located and solved. The discrepancy comes from the average pooling calculation when the input is an integral type. TFLite does UPWARD rounding after the division in avg_pool, but the current topi implementation of avg_pool does a FLOOR after the division.

The fix is added and tested locally on CPU. Now the qnn tests in the TFLite test_forward have rtol and atol set to 0, which should meet the bit-exact execution requirement.

@FrozenGene @anijain2305 @tqchen @u99127 Please have another look, thanks!
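
The difference boils down to how the integer division at the end of the average pool is rounded; a tiny illustration with made-up values:

import numpy as np

window = np.array([[3, 4], [4, 4]], dtype=np.uint8)   # sum = 15, true average = 3.75
s, n = int(window.sum()), window.size

floor_avg = s // n               # previous topi behaviour: 3
round_avg = (s + n // 2) // n    # TFLite-style rounding for non-negative sums: 4

print(floor_avg, round_avg)      # 3 4 -> exactly the kind of 1-off mismatch the tests saw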

@FrozenGene (Member) commented

> The problem is located and solved. The discrepancy comes from the average pooling calculation when the input is an integral type. TFLite does UPWARD rounding after the division in avg_pool, but the current topi implementation of avg_pool does a FLOOR after the division.
>
> The fix is added and tested locally on CPU. Now the qnn tests in the TFLite test_forward have rtol and atol set to 0, which should meet the bit-exact execution requirement.
>
> @FrozenGene @anijain2305 @tqchen @u99127 Please have another look, thanks!

Thanks @LiangHao151941 !

Maybe we have to handle it a bit more. Please see: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/integer_ops/pooling.h#L300

The default behaviour of TFLite is the optimized path, so TFLite will run the kernels in the optimized directory. For average pool it has two cases: the accumulator type will be uint16 if params.filter_height * params.filter_width <= 16 * 16; otherwise the accumulator type will be int32, as in our implementation. If we want to keep bit-exact results, we should handle this here too. You reminded me of this situation; I hit this bug before when it wasn't handled.
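
In other words, a bit-exact implementation also has to pick the accumulator type the same way the optimized kernel does; a sketch of the quoted condition:

def avg_pool_acc_dtype(filter_height, filter_width):
    # Small pooling windows accumulate in 16 bits in the optimized TFLite kernel,
    # larger ones fall back to a 32-bit accumulator.
    return "uint16" if filter_height * filter_width <= 16 * 16 else "int32"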

> Another way is to provide a rounding argument for qnn.add / qnn.mul / qnn.concatenate, because they use requantize internally too, so they need rounding.

We should handle these ops too.

@FrozenGene (Member) commented

> Several updates:
>
>   1. The tflite conversion segfault is resolved by using the tflite whl provided by @FrozenGene in the TVM tutorial and changing the TF version to 1.13.
>   2. The qnn_mobilenet_v2 segfault is resolved by setting opt_level explicitly to 2, specifically disabling the FoldScaleAxis pass (this one is causing the problem).
>   3. Bit-exact execution remains.
>
> But it is strange that while the mobilenet_v2 qnn test passes locally, it still fails the CI test. Below is a local screenshot.
> [local test screenshot omitted]
>
> I'm listing my local env here to see if anything needs to be aligned:
> llvm 9.0.1 (this I can downgrade, but maybe there is a specific version used in the CI that I can try?)
> gcc 7.4.0
> Ubuntu 18.04
> cmake 3.15.2
>
> Please have another look, thanks! @FrozenGene @anijain2305 @tqchen

Thanks for the hard work.

For FoldScaleAxis, it would be better if we could dig into what the issue is (I describe some background on this algorithm here: https://github.com/apache/incubator-tvm/pull/1664/files#diff-88bcab09a087491487d7655fa76ba116R527).

I checked the CI's docker instance just now. Our CI's LLVM version is LLVM 9 too, so your LLVM is OK IMO. However, I think you could run make cython3 in your TVM root directory (not the build directory) if you haven't already. Sometimes cython exposes problems (the CI uses it).

@fwd4 (Contributor, Author) commented Mar 21, 2020

Update:

  1. Resolved an MSVC building failure in CMake.
  2. Rebasing to recent master resolves the opt_level problem, so I changed the test opt_level back to 3, and it passes my local test.
  3. Found the llvm version in CI to be 8.x, so I downgraded my llvm version to 8.0.1; still getting a segfault for qnn mobilenet_v2 in the CI test.

Trying to pull the CI docker to check further @FrozenGene

@anijain2305 (Contributor) commented

@LiangHao151941 Was wondering if you ever got a chance to look at the performance of TFLite rounding? I am looking at the performance of UPWARD rounding and surprisingly Requantize was quite expensive in e2e models like mobilenet. I suspect TFLite rounding will be even slower.

So, wondering if you got a chance to check performance. I worry that so much effort in this PR might give us perfect functionality but not-so-good performance, and therefore, in the future, we might decide to trade accuracy for performance, taking us back to square one.

@FrozenGene (Member) commented

> @LiangHao151941 Was wondering if you ever got a chance to look at the performance of TFLite rounding? I am looking at the performance of UPWARD rounding and surprisingly Requantize was quite expensive in e2e models like mobilenet. I suspect TFLite rounding will be even slower.
>
> So, wondering if you got a chance to check performance. I worry that so much effort in this PR might give us perfect functionality but not-so-good performance, and therefore, in the future, we might decide to trade accuracy for performance, taking us back to square one.

IMO, we should keep functionality first and then consider performance. Currently, our approach is to lower to many ops; I suspect this leads to worse performance, because we cannot keep all computation in registers. I don't think TFLite rounding will be the performance bottleneck. Let us put the effort into functionality first; next, we should revisit our code and find the performance bottleneck with a profiling tool (like DS-5 Streamline).

@anijain2305 (Contributor) commented Apr 8, 2020

Actually, Requantize is the bottleneck in my observation on Rasp Pi 4. For TFLite quantized mobilenetv1, currently with master + auto-tuning, I get 56 ms (TFLite is 35 ms). Upon deeper analysis, I found requantize to be the bottleneck (even with UPWARD rounding which is the cheapest). The reason is that calculations happen in int64 and either ARM does not have good instructions or LLVM does not pick up the right instructions.

As an experiment, I forced the datatype of Requantize to be int32. This will lead to bad accuracy but it will give us an idea what is the minimum latency of Requantize op. The runtime reduced from earlier 56 ms to 36 ms (almost as good as TFLite). So, requantize is the bottleneck.

Additionally, we should note that TFLite rounding is not a standard rounding. They have worked backwards from the ARM instructions they want to use. Because of that they have two roundings instead of one: one in SaturatingRoundingDoublingHighMul (non-standard rounding) and the other in RoundingDivideByPOT (standard). Given there are two roundings, they have also made an accuracy-performance tradeoff. So my suggestion is that we should think carefully about whether it makes sense to completely follow TFLite rounding as the golden reference. For example, consider the following reference function:

// This function implements the same computation as the ARMv7 NEON VQRDMULH
// instruction.

template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

This function alone needs around 10 Relay ops (if not more), but in TFLite this is just 1 ARM instruction - VQRDMULH. I doubt that LLVM will be able to use that instruction automatically.
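
A rough sketch of where the op count comes from when the reference is written with ordinary Relay tensor ops (not this PR's actual lowering; the saturation-on-overflow branch is omitted, and right_shift floors where the C++ reference truncates toward zero, so a faithful lowering needs even more ops):

from tvm import relay

def srdhm_sketch(a, b):
    a64 = relay.cast(a, "int64")
    b64 = relay.cast(b, "int64")
    ab = relay.multiply(a64, b64)
    # nudge = 2^30 when ab >= 0, 1 - 2^30 otherwise, written without a select op
    neg = relay.cast(relay.less(ab, relay.const(0, "int64")), "int64")
    nudge = relay.subtract(relay.const(1 << 30, "int64"),
                           relay.multiply(neg, relay.const((1 << 31) - 1, "int64")))
    high = relay.right_shift(relay.add(ab, nudge), relay.const(31, "int64"))
    return relay.cast(high, "int32")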

In summary, I understand the benefits of an exact match between TFLite and TVM. But in contrast to FP32 networks, this is one of the cases where frameworks have made their choice to trade off performance and accuracy for a specific ARM device. Note that with non-TFLite rounding, the application accuracy is still similar, so it might be an acceptable tradeoff. Therefore, I would encourage thinking about this more deeply. I worry that the extensive work going into this PR might need major reconsideration if we don't see good performance improvements.

If we relax the exact tensor match criteria, there might be other ways to get the performance, like using tensorize.

@u99127 @masahi might also be interested in this.

@u99127 (Contributor) commented Apr 8, 2020

> [Quoting @anijain2305's comment above: the Requantize bottleneck on the Raspberry Pi 4, the int32 experiment, TFLite's two-step rounding, and the SaturatingRoundingDoublingHighMul reference that expands to ~10 Relay ops versus a single VQRDMULH instruction.]

The vqrdmulh instruction is pretty standard on all Arm CPUs that have Advanced SIMD in 32-bit. C compilers will struggle to generate that instruction automagically; it's just too big an idiom to recognize, and I don't think LLVM has the infrastructure to detect it. Is there anything preventing us from having a specific TVM intrinsic that we can lower to this vqrdmulh instruction? Of course, if this fails we could instead lower to the C implementation?

The equivalent AArch64 instruction is sqrdmulh IIRC and is available by default on all devices. There should be an equivalent LLVM intrinsic for it.

ramana

@anijain2305 (Contributor) commented

Moving to discuss as we are talking a lot :)

https://discuss.tvm.ai/t/tflite-rounding/6287

@FrozenGene (Member) commented

> [Quoting @anijain2305's comment above in full: the Requantize bottleneck, the int32 experiment, TFLite's non-standard two-step rounding, the SaturatingRoundingDoublingHighMul reference, and the suggestion to reconsider exactly following TFLite rounding as the golden reference.]

Please note this thread: if we don't have bit-exact results, we will have many problems, not just a tradeoff. The Mobilenet accuracy result alone is not enough.

You could also check our blog on how to boost quantized performance. In it, we get bit-exact results using the same rounding as TFLite, and we still beat TFLite's performance by a lot. However, the design differs a bit: we implemented q_conv2d / q_requantize to control it. For q_conv2d, we tensorize it and do the requantize inside it. For depthwise convolution, we don't even tensorize and just use q_conv2d + q_requantize, and we can still beat TFLite and QNNPACK by a lot. One problem with our current design is that we lower too many ops and cannot make sure all computation stays in registers, and NCHW has bad cache behavior, as the previous blog said.

We should make sure of functionality first, then consider performance. I think it is not only the ARM instruction you mention but also many other things we need to do.

@anijain2305 (Contributor) commented

Ok. I forgot the accuracy issue of Object detection models.

I will also go through the blog to see if we can do something in the short term. If possible, we should make a plan for all the things we need to do to get the performance speedup. I will share the results in a discussion post in a few days. Maybe you can share your experience there.

@FrozenGene (Member) commented

> Ok. I forgot the accuracy issue of Object detection models.
>
> I will also go through the blog to see if we can do something in the short term. If possible, we should make a plan for all the things we need to do to get the performance speedup. I will share the results in a discussion post in a few days. Maybe you can share your experience there.

Yeah. Because of the different design, my original plan was that we first achieve the same results as TFLite, then start to consider how to achieve better performance, leveraging our previous implementation of q_conv2d. :-)

@fwd4 force-pushed the features/tflite-rounding branch from 83434fc to 960cc5c on May 17, 2020
@fwd4 force-pushed the features/tflite-rounding branch from 8ef0656 to a3b5c21 on May 20, 2020
@anijain2305 (Contributor) commented

Hi, I am wondering if we can take the pool operator changes from this PR and send a separate PR. I know that the earlier decision on this was to put everything in one PR for easy backporting. But, as this PR might take more time, and master TVM today does not implement any rounding for pooling ops for integer datatypes, I think it is valuable to separate that part out. What do you think? @FrozenGene @LiangHao151941

@tqchen closed this on Aug 21, 2020
@tqchen (Member) commented Aug 21, 2020

Unfortunately we need to close this PR due to inactivity.

Thanks @Fwd-IV for the original contribution.

@Fwd-IV @FrozenGene @anijain2305 it would be great if you could coordinate to bring some of this PR up to mainline and send a new PR. We can use the https://docs.github.com/en/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors mechanism to acknowledge the original author.
