
Checking the correct dtypes for choosing the Intel int8 instructions. #3516

Merged (9 commits) on Jul 23, 2019

Conversation

@anijain2305 (Contributor)

Intel VNNI and Skylake HW-supported int8 instructions require one tensor to be
unsigned int8 and the other to be signed int8. This PR adds more checks in the x86
topi to satisfy that requirement.


@anijain2305 (Contributor Author) commented Jul 9, 2019

@tqchen (Member) commented Jul 9, 2019

Please also try to tag reviewers with whom you do not interact in person :) as per https://docs.tvm.ai/contribute/committer_guide.html?highlight=physically#broad-collaboration

@anijain2305 (Contributor Author)

Good point @tqchen :) Will keep that in mind next time.

@anijain2305 (Contributor Author) commented Jul 9, 2019

@llyfacebook and @jianyuh, you might be interested in this. Please also tag other reviewers if you think they would be interested.

@FrozenGene (Member)

Just curious: why does VNNI require this? In TFLite, for example, both data and kernel are u8. If VNNI requires this, TFLite's quantized models cannot use the HW-supported instructions.

@anijain2305 (Contributor Author)

Just curious: why does VNNI require this? In TFLite, for example, both data and kernel are u8. If VNNI requires this, TFLite's quantized models cannot use the HW-supported instructions.

Good question. I am not entirely sure yet which Relay operations to perform, but the key idea is to shift the zero point by subtracting/adding 128 to the quantized tensor values. There are a lot of details here - https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training?_ga=2.113337095.1887331677.1562772854-237924077.1562647522

I will be using the above link as a reference for shifting the zero points.
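For illustration, here is a minimal numpy sketch of that zero-point shift (not the actual Relay code; the names and values are made up). Because both the quantized values and the zero point are shifted by the same 128, the dequantized real values are unchanged:

    import numpy as np

    # Example quantization parameters for a uint8 tensor (illustrative only).
    scale, zero_point = 0.05, 137
    q_u8 = np.array([0, 128, 255], dtype=np.uint8)

    # Reinterpret the uint8 tensor as int8 by subtracting 128, and shift the
    # zero point by the same amount so the represented real values stay equal.
    q_s8 = (q_u8.astype(np.int32) - 128).astype(np.int8)
    zero_point_s8 = zero_point - 128

    # real value = scale * (q - zero_point); identical before and after the shift
    before = scale * (q_u8.astype(np.int32) - zero_point)
    after = scale * (q_s8.astype(np.int32) - zero_point_s8)
    assert np.allclose(before, after)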

@tqchen (Member) commented Jul 10, 2019

Different hw backends have their own native instructions, and we might need a specific quantized model for each hw backend to maximize perf. That is why the native relay quantization flow is super important and we should push for some of that.

@anijain2305 (Contributor Author)

@FrozenGene @tqchen @jianyuh @llyfacebook Ping for review

@FrozenGene (Member) commented Jul 12, 2019

One concern: do we need this restriction? For example, if we pass u8 data / kernel to x86, what will happen? Currently we will hit an assert error. My preferred way would be to do it in _schedule_conv_nhwc_pack_int8:

    if check_skylake(target):
        int32_lanes = 16
    else:
        return s

We would check the data type here (for example in the function check_skylake). Then u8 data / kernel could still work, for example when the x86 CPU doesn't have the VNNI instruction.
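A rough sketch of the kind of check being suggested (function and target names below are illustrative, not the actual TVM API): decide once whether the HW int8 tensorize path is usable, and otherwise fall back to the default schedule.

    # Hypothetical helper: is the Skylake int8 tensorize path usable?
    def _int8_hw_supported(data_dtype, kernel_dtype, mcpu):
        # The HW int8 instructions expect uint8 data and int8 kernel.
        dtype_ok = data_dtype == "uint8" and kernel_dtype == "int8"
        target_ok = mcpu == "skylake-avx512"
        return dtype_ok and target_ok

    def _schedule_conv_nhwc_pack_int8(s, data, kernel, mcpu):
        if _int8_hw_supported(data.dtype, kernel.dtype, mcpu):
            int32_lanes = 16   # tensorize with the HW int8 intrinsic
            # ... tensorized schedule using int32_lanes ...
        else:
            return s           # fall back to the default schedule
        return s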

@anijain2305 (Contributor Author)

@FrozenGene Looked more into this. There are two reasons why I am hesitant to do it:

  • With your suggestion, we would directly return the default schedule, i.e. the tvm compute function. If we put the check outside as done in this PR, we can at least use the schedule_conv_NCHWc schedule, which gives better performance than the default schedule.
  • VNNI requires the kernel to be 7D (as opposed to 6D in the NCHWc case). This has to be done in alter_op_layout for conv, so we need checks at multiple places.

Please let me know what you think.

Regarding the transformation of the kernel from 4D to 7D (see the sketch below), we are already working on it. We plan to update this PR with that transformation as well this week.
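For reference, a small numpy sketch of the 4D to 7D kernel layout transformation being described (block sizes are illustrative; the actual transform lives in alter_op_layout and the compute declaration):

    import numpy as np

    # OIHW kernel split into the blocked layout OIHWioe, where the innermost
    # axis holds n_elems = 4 int8 values that one HW int8 instruction reduces
    # into a single int32 lane.
    out_channel, in_channel, kh, kw = 64, 64, 3, 3
    oc_bn, ic_bn, n_elems = 16, 16, 4

    kernel_oihw = np.zeros((out_channel, in_channel, kh, kw), dtype="int8")
    kernel_7d = (kernel_oihw
                 .reshape(out_channel // oc_bn, oc_bn,
                          in_channel // ic_bn, ic_bn // n_elems, n_elems, kh, kw)
                 .transpose(0, 2, 5, 6, 3, 1, 4))
    print(kernel_7d.shape)  # (4, 4, 3, 3, 4, 16, 4): O, I, H, W, i//4, o, 4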

@FrozenGene (Member) commented Jul 17, 2019

@FrozenGene Looked more into this. There are two reasons why I am hesitant to do it:

  • With your suggestion, we would directly return the default schedule, i.e. the tvm compute function. If we put the check outside as done in this PR, we can at least use the schedule_conv_NCHWc schedule, which gives better performance than the default schedule.

My previous context for the assert error is here: https://github.com/dmlc/tvm/blob/4f10a90e3e2ec18071612bed657108b219bc6796/topi/python/topi/x86/conv2d.py#L278-L288. Currently your PR restricts the kernel dtype to int8 for the NHWC int8 computation.

On restricting NCHWc, I looked more and I think it is ok; we could use the FP32 schedule for it, which should give better performance than the current schedule. Personally, I suggest we write an int8 schedule for non-Skylake hardware in the future, i.e.:

    if check_skylake(target):
        int32_lanes = 16
    else:
        # new schedule for non-skylake int8 computation
        #return s

I find that we only write the int8 schedule for Skylake hardware, which I think doesn't really make sense.

@anijain2305 (Contributor Author)

@FrozenGene I agree that int8 just for Skylake sounds weird, but it arises from the HW support. From Skylake onwards, Intel is adding int8 HW support. We need to run tensorize to utilize that HW support, as LLVM does not pick it up. If we don't use tensorize, the performance gets bad, so we have to fall back to the NCHWc FP32 schedule.

But I agree with the general direction you are proposing. I think in the future, as LLVM gets more mature for Intel int8 instructions, we can refine this schedule.
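As a rough illustration of what the tensorized intrinsic computes (this is only a numpy model of one output lane, not the TVM intrinsic definition): each int32 accumulator lane receives a dot product of 4 uint8 data values with 4 int8 kernel values.

    import numpy as np

    # One output lane of the u8 x s8 -> s32 dot-product instruction, modeled
    # in numpy: multiply 4 uint8 data values by 4 int8 kernel values and
    # accumulate the sum into an int32.
    data_u8 = np.array([10, 200, 30, 255], dtype=np.uint8)
    kernel_s8 = np.array([-1, 2, -3, 4], dtype=np.int8)
    acc = np.int32(0)
    acc += np.dot(data_u8.astype(np.int32), kernel_s8.astype(np.int32))
    print(acc)  # -10 + 400 - 90 + 1020 = 1320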

@FrozenGene (Member) commented Jul 17, 2019

@FrozenGene I agree that int8 just for Skylake sounds weird, but it arises from the HW support. From Skylake onwards, Intel is adding int8 HW support. We need to run tensorize to utilize that HW support, as LLVM does not pick it up. If we don't use tensorize, the performance gets bad, so we have to fall back to the NCHWc FP32 schedule.

But I agree with the general direction you are proposing. I think in the future, as LLVM gets more mature for Intel int8 instructions, we can refine this schedule.

I think it is not only about CPUs with HW-supported int8 instructions but also those without, for example Haswell. I think it is ok to use tensorize for Skylake. My point is that for CPU architectures without HW-supported int8 instructions, we should also have an int8 schedule, not just return s or use the FP32 schedule.

- Ensures that we fall back to NCHWc when int8 support is absent.
- Added a test to check both cases: when HW support is present and when it is absent.
- Verified the performance of int8 conv.
    # Reshape the kernel into 7D OHWoIie: the innermost axis holds n_elems
    # int8 values that the int8 instruction reduces into one int32 lane.
    kernel_OHWoIie = F.reshape(kernel_OHWoIi, (out_channel//oc_bn, kh, kw, oc_bn,
                                               in_channel//ic_bn, ic_bn//n_elems, n_elems))
    # Reorder to OIHWioe, the layout expected by the int8 NCHWc compute.
    kernel_OIHWioe = F.transpose(kernel_OHWoIie, axes=(0, 4, 1, 2, 5, 3, 6))
    copy_inputs = [data_func, kernel_OIHWioe]
Member (inline review comment on the diff above):

also need to do dispatch_ctx.update?

Contributor Author (reply):

Ah, thanks. Added that now.

@anijain2305 (Contributor Author)

@FrozenGene Updated the patch, adapting it somewhat to what you said.

Int8 is a little hacky to work with. New Intel machines that have int8 support perform a reduction within a vector, requiring a new compute, a new alter conv2d layout, and a new schedule. So we need to check int8 support at multiple places in the codebase. If int8 is not supported, there is not much we can do, and falling back to the FP32 schedule gives decent enough performance.

As of now, I have tried my best to guard things properly so that we never take the int8 schedule when int8 HW support is absent.
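A hypothetical sketch of that guard pattern (names are illustrative, not the actual TVM code): a single decision helper reused at the places that must agree, e.g. alter_op_layout and the schedule, so the int8 path with its 7D kernel is only ever chosen when both the dtypes and the target support it.

    # Hypothetical shared helper, consulted at alter_op_layout and schedule time.
    def _use_int8_nchwc(data_dtype, kernel_dtype, mcpu):
        return (data_dtype == "uint8"
                and kernel_dtype == "int8"
                and mcpu == "skylake-avx512")

    def _alter_conv2d_layout_sketch(data_dtype, kernel_dtype, mcpu):
        if _use_int8_nchwc(data_dtype, kernel_dtype, mcpu):
            # 7D kernel (OIHWioe) + tensorized int8 schedule
            return "conv2d_NCHWc_int8"
        # 6D kernel (OIHWio) + regular NCHWc FP32 schedule
        return "conv2d_NCHWc"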

@yzhliu (Member) commented Jul 21, 2019

@anijain2305 Please address the pylint error.
@FrozenGene Could you review again?

@FrozenGene (Member) left a comment:

LGTM.

@anijain2305 (Contributor Author)

For the Int8 NHWC pack, I am returning an unoptimized schedule because of this bug: #3598
