diff --git a/advanced_source/static_quantization_tutorial.rst b/advanced_source/static_quantization_tutorial.rst
index 671cd4f95..c202e4c64 100644
--- a/advanced_source/static_quantization_tutorial.rst
+++ b/advanced_source/static_quantization_tutorial.rst
@@ -1,633 +1,631 @@
-(beta) Static Quantization with Eager Mode in PyTorch
-=========================================================
-**Author**: `Raghuraman Krishnamoorthi `_
-**Edited by**: `Seth Weidman `_, `Jerry Zhang `_
-
-This tutorial shows how to do post-training static quantization, as well as illustrating
-two more advanced techniques - per-channel quantization and quantization-aware training -
-to further improve the model's accuracy. Note that quantization is currently only supported
-for CPUs, so we will not be utilizing GPUs / CUDA in this tutorial.
-By the end of this tutorial, you will see how quantization in PyTorch can result in
-significant decreases in model size while increasing speed. Furthermore, you'll see how
-to easily apply some advanced quantization techniques shown
-`here `_ so that your quantized models take much less
-of an accuracy hit than they would otherwise.
-Warning: we use a lot of boilerplate code from other PyTorch repos to, for example,
-define the ``MobileNetV2`` model archtecture, define data loaders, and so on. We of course
-encourage you to read it; but if you want to get to the quantization features, feel free
-to skip to the "4. Post-training static quantization" section.
-We'll start by doing the necessary imports:
+(beta) Static Quantization with Eager Mode in PyTorch
+=========================================================
+**Author**: `Raghuraman Krishnamoorthi `_
+**Edited by**: `Seth Weidman `_, `Jerry Zhang `_
+**Translated by**: `김현길 `_, `Choi Yoonjeong `_
+
+This tutorial shows how to do post-training static quantization, and also examines two more
+advanced techniques for further improving the model's accuracy: per-channel quantization and
+quantization-aware training. Note that quantization is currently only supported for CPUs,
+so we will not be using GPUs / CUDA in this tutorial.
+By the end of this tutorial, you will see how quantization in PyTorch can substantially
+reduce model size while increasing speed. You will also see how easy it is to apply some of
+the advanced quantization techniques shown
+`here `_, so that your quantized models take much less
+of an accuracy hit than they otherwise would.
+
+Warning: we use a lot of boilerplate code from other PyTorch repositories to, for example,
+define the ``MobileNetV2`` model architecture and define the data loaders. We of course
+encourage you to read it; but if you want to get straight to the quantization features,
+feel free to skip ahead to the "4. Post-training static quantization" section.
+
+Let's start with the necessary imports:

.. code:: python

-    import numpy as np
-    import torch
-    import torch.nn as nn
-    import torchvision
-    from torch.utils.data import DataLoader
-    from torchvision import datasets
-    import torchvision.transforms as transforms
-    import os
-    import time
-    import sys
-    import torch.quantization
-
-    # # Setup warnings
-    import warnings
-    warnings.filterwarnings(
-        action='ignore',
-        category=DeprecationWarning,
-        module=r'.*'
-    )
-    warnings.filterwarnings(
-        action='default',
-        module=r'torch.quantization'
-    )
-
-    # Specify random seed for repeatable results
-    torch.manual_seed(191009)
-
-1. Model architecture
----------------------
-
-We first define the MobileNetV2 model architecture, with several notable modifications
-to enable quantization:
-
-- Replacing addition with ``nn.quantized.FloatFunctional``
-- Insert ``QuantStub`` and ``DeQuantStub`` at the beginning and end of the network.
-- Replace ReLU6 with ReLU
-
-Note: this code is taken from
-`here `_.
+    import numpy as np
+    import torch
+    import torch.nn as nn
+    import torchvision
+    from torch.utils.data import DataLoader
+    from torchvision import datasets
+    import torchvision.transforms as transforms
+    import os
+    import time
+    import sys
+    import torch.quantization
+
+    # # Set up warnings
+    import warnings
+    warnings.filterwarnings(
+        action='ignore',
+        category=DeprecationWarning,
+        module=r'.*'
+    )
+    warnings.filterwarnings(
+        action='default',
+        module=r'torch.quantization'
+    )
+
+    # Specify random seed for repeatable results
+    torch.manual_seed(191009)
+
+1. Model architecture
+---------------------
+
+We first define the MobileNetV2 model architecture, with several notable modifications
+made to enable quantization:
+
+- Replacing addition with ``nn.quantized.FloatFunctional``
+- Inserting ``QuantStub`` and ``DeQuantStub`` at the beginning and end of the network
+- Replacing ReLU6 with ReLU
+
+Note: this code is taken from
+`here `_.
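+
+To see these three changes in isolation before reading the full model definition, here is a
+minimal sketch of a quantizable residual block (the ``ToyResidual`` name and its shapes are
+made up for illustration and are not part of the original tutorial):
+
+.. code:: python
+
+    import torch
+    import torch.nn as nn
+    from torch.quantization import QuantStub, DeQuantStub
+
+    class ToyResidual(nn.Module):
+        def __init__(self):
+            super(ToyResidual, self).__init__()
+            self.quant = QuantStub()      # marks where float tensors become quantized tensors
+            self.dequant = DeQuantStub()  # marks where quantized tensors become float again
+            self.conv = nn.Conv2d(8, 8, 1)
+            # Plain ``+`` has no scale/zero-point of its own, so the skip connection
+            # goes through FloatFunctional, which can carry quantization state.
+            self.skip_add = nn.quantized.FloatFunctional()
+
+        def forward(self, x):
+            x = self.quant(x)
+            x = self.skip_add.add(x, self.conv(x))  # instead of ``x + self.conv(x)``
+            return self.dequant(x)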

.. code:: python

-    from torch.quantization import QuantStub, DeQuantStub
-
-    def _make_divisible(v, divisor, min_value=None):
-        """
-        This function is taken from the original tf repo.
-        It ensures that all layers have a channel number that is divisible by 8
-        It can be seen here:
-        https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
-        :param v:
-        :param divisor:
-        :param min_value:
-        :return:
-        """
-        if min_value is None:
-            min_value = divisor
-        new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
-        # Make sure that round down does not go down by more than 10%.
-        if new_v < 0.9 * v:
-            new_v += divisor
-        return new_v
-
-
-    class ConvBNReLU(nn.Sequential):
-        def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
-            padding = (kernel_size - 1) // 2
-            super(ConvBNReLU, self).__init__(
-                nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
-                nn.BatchNorm2d(out_planes, momentum=0.1),
-                # Replace with ReLU
-                nn.ReLU(inplace=False)
-            )
-
-
-    class InvertedResidual(nn.Module):
-        def __init__(self, inp, oup, stride, expand_ratio):
-            super(InvertedResidual, self).__init__()
-            self.stride = stride
-            assert stride in [1, 2]
-
-            hidden_dim = int(round(inp * expand_ratio))
-            self.use_res_connect = self.stride == 1 and inp == oup
-
-            layers = []
-            if expand_ratio != 1:
-                # pw
-                layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
-            layers.extend([
-                # dw
-                ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
-                # pw-linear
-                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
-                nn.BatchNorm2d(oup, momentum=0.1),
-            ])
-            self.conv = nn.Sequential(*layers)
-            # Replace torch.add with floatfunctional
-            self.skip_add = nn.quantized.FloatFunctional()
-
-        def forward(self, x):
-            if self.use_res_connect:
-                return self.skip_add.add(x, self.conv(x))
-            else:
-                return self.conv(x)
-
-
-    class MobileNetV2(nn.Module):
-        def __init__(self, num_classes=1000, width_mult=1.0, inverted_residual_setting=None, round_nearest=8):
-            """
-            MobileNet V2 main class
-            Args:
-                num_classes (int): Number of classes
-                width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount
-                inverted_residual_setting: Network structure
-                round_nearest (int): Round the number of channels in each layer to be a multiple of this number
-                Set to 1 to turn off rounding
-            """
-            super(MobileNetV2, self).__init__()
-            block = InvertedResidual
-            input_channel = 32
-            last_channel = 1280
-
-            if inverted_residual_setting is None:
-                inverted_residual_setting = [
-                    # t, c, n, s
-                    [1, 16, 1, 1],
-                    [6, 24, 2, 2],
-                    [6, 32, 3, 2],
-                    [6, 64, 4, 2],
-                    [6, 96, 3, 1],
-                    [6, 160, 3, 2],
-                    [6, 320, 1, 1],
-                ]
-
-            # only check the first element, assuming user knows t,c,n,s are required
-            if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
-                raise ValueError("inverted_residual_setting should be non-empty "
-                                 "or a 4-element list, got {}".format(inverted_residual_setting))
-
-            # building first layer
-            input_channel = _make_divisible(input_channel * width_mult, round_nearest)
-            self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
-            features = [ConvBNReLU(3, input_channel, stride=2)]
-            # building inverted residual blocks
-            for t, c, n, s in inverted_residual_setting:
-                output_channel = _make_divisible(c * width_mult, round_nearest)
-                for i in range(n):
-                    stride = s if i == 0 else 1
-                    features.append(block(input_channel, output_channel, stride, expand_ratio=t))
-                    input_channel = output_channel
-            # building last several layers
-            features.append(ConvBNReLU(input_channel, self.last_channel, kernel_size=1))
-            # make it nn.Sequential
-            self.features = nn.Sequential(*features)
-            self.quant = QuantStub()
-            self.dequant = DeQuantStub()
-            # building classifier
-            self.classifier = nn.Sequential(
-                nn.Dropout(0.2),
-                nn.Linear(self.last_channel, num_classes),
-            )
-
-            # weight initialization
-            for m in self.modules():
-                if isinstance(m, nn.Conv2d):
-                    nn.init.kaiming_normal_(m.weight, mode='fan_out')
-                    if m.bias is not None:
-                        nn.init.zeros_(m.bias)
-                elif isinstance(m, nn.BatchNorm2d):
-                    nn.init.ones_(m.weight)
-                    nn.init.zeros_(m.bias)
-                elif isinstance(m, nn.Linear):
-                    nn.init.normal_(m.weight, 0, 0.01)
-                    nn.init.zeros_(m.bias)
-
-        def forward(self, x):
-
-            x = self.quant(x)
-
-            x = self.features(x)
-            x = x.mean([2, 3])
-            x = self.classifier(x)
-            x = self.dequant(x)
-            return x
-
-        # Fuse Conv+BN and Conv+BN+Relu modules prior to quantization
-        # This operation does not change the numerics
-        def fuse_model(self):
-            for m in self.modules():
-                if type(m) == ConvBNReLU:
-                    torch.quantization.fuse_modules(m, ['0', '1', '2'], inplace=True)
-                if type(m) == InvertedResidual:
-                    for idx in range(len(m.conv)):
-                        if type(m.conv[idx]) == nn.Conv2d:
-                            torch.quantization.fuse_modules(m.conv, [str(idx), str(idx + 1)], inplace=True)
-
-2. Helper functions
--------------------
-
-We next define several helper functions to help with model evaluation. These mostly come from
-`here `_.
+    from torch.quantization import QuantStub, DeQuantStub
+
+    def _make_divisible(v, divisor, min_value=None):
+        """
+        This function is taken from the original TensorFlow repository.
+        It ensures that all layers have a channel number that is divisible by 8.
+        It can be seen here:
+        https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
+        :param v:
+        :param divisor:
+        :param min_value:
+        :return:
+        """
+        if min_value is None:
+            min_value = divisor
+        new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
+        # Make sure that rounding down does not go down by more than 10%.
+        if new_v < 0.9 * v:
+            new_v += divisor
+        return new_v
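+
+    # For example (an illustrative note, not in the original code): _make_divisible(37, 8)
+    # returns 40 -- 37 is rounded to the nearest multiple of 8, and the result is bumped up
+    # by one divisor whenever plain rounding would fall more than 10% below the input.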
+
+
+    class ConvBNReLU(nn.Sequential):
+        def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
+            padding = (kernel_size - 1) // 2
+            super(ConvBNReLU, self).__init__(
+                nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
+                nn.BatchNorm2d(out_planes, momentum=0.1),
+                # Replace with ReLU
+                nn.ReLU(inplace=False)
+            )
+
+
+    class InvertedResidual(nn.Module):
+        def __init__(self, inp, oup, stride, expand_ratio):
+            super(InvertedResidual, self).__init__()
+            self.stride = stride
+            assert stride in [1, 2]
+
+            hidden_dim = int(round(inp * expand_ratio))
+            self.use_res_connect = self.stride == 1 and inp == oup
+
+            layers = []
+            if expand_ratio != 1:
+                # pw
+                layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
+            layers.extend([
+                # dw
+                ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
+                # pw-linear
+                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
+                nn.BatchNorm2d(oup, momentum=0.1),
+            ])
+            self.conv = nn.Sequential(*layers)
+            # Replace torch.add with FloatFunctional
+            self.skip_add = nn.quantized.FloatFunctional()
+
+        def forward(self, x):
+            if self.use_res_connect:
+                return self.skip_add.add(x, self.conv(x))
+            else:
+                return self.conv(x)
+
+
+    class MobileNetV2(nn.Module):
+        def __init__(self, num_classes=1000, width_mult=1.0, inverted_residual_setting=None, round_nearest=8):
+            """
+            MobileNet V2 main class
+            Args:
+                num_classes (int): Number of classes
+                width_mult (float): Width multiplier - adjusts the number of channels in each layer by this amount
+                inverted_residual_setting: Network structure
+                round_nearest (int): Round the number of channels in each layer to be a multiple of this number
+                Set to 1 to turn off rounding
+            """
+            super(MobileNetV2, self).__init__()
+            block = InvertedResidual
+            input_channel = 32
+            last_channel = 1280
+
+            if inverted_residual_setting is None:
+                inverted_residual_setting = [
+                    # t, c, n, s
+                    [1, 16, 1, 1],
+                    [6, 24, 2, 2],
+                    [6, 32, 3, 2],
+                    [6, 64, 4, 2],
+                    [6, 96, 3, 1],
+                    [6, 160, 3, 2],
+                    [6, 320, 1, 1],
+                ]
+
+            # only check the first element, assuming user knows t,c,n,s are required
+            if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
+                raise ValueError("inverted_residual_setting should be non-empty "
+                                 "or a 4-element list, got {}".format(inverted_residual_setting))
+
+            # building first layer
+            input_channel = _make_divisible(input_channel * width_mult, round_nearest)
+            self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
+            features = [ConvBNReLU(3, input_channel, stride=2)]
+            # building inverted residual blocks
+            for t, c, n, s in inverted_residual_setting:
+                output_channel = _make_divisible(c * width_mult, round_nearest)
+                for i in range(n):
+                    stride = s if i == 0 else 1
+                    features.append(block(input_channel, output_channel, stride, expand_ratio=t))
+                    input_channel = output_channel
+            # building last several layers
+            features.append(ConvBNReLU(input_channel, self.last_channel, kernel_size=1))
+            # make it nn.Sequential
+            self.features = nn.Sequential(*features)
+            self.quant = QuantStub()
+            self.dequant = DeQuantStub()
+            # building classifier
+            self.classifier = nn.Sequential(
+                nn.Dropout(0.2),
+                nn.Linear(self.last_channel, num_classes),
+            )
+
+            # weight initialization
+            for m in self.modules():
+                if isinstance(m, nn.Conv2d):
+                    nn.init.kaiming_normal_(m.weight, mode='fan_out')
+                    if m.bias is not None:
+                        nn.init.zeros_(m.bias)
+                elif isinstance(m, nn.BatchNorm2d):
+                    nn.init.ones_(m.weight)
+                    nn.init.zeros_(m.bias)
+                elif isinstance(m, nn.Linear):
+                    nn.init.normal_(m.weight, 0, 0.01)
+                    nn.init.zeros_(m.bias)
+
+        def forward(self, x):
+
+            x = self.quant(x)
+
+            x = self.features(x)
+            x = x.mean([2, 3])
+            x = self.classifier(x)
+            x = self.dequant(x)
+            return x
+
+        # Fuse Conv+BN and Conv+BN+ReLU modules prior to quantization
+        # This operation does not change the numerics
+        def fuse_model(self):
+            for m in self.modules():
+                if type(m) == ConvBNReLU:
+                    torch.quantization.fuse_modules(m, ['0', '1', '2'], inplace=True)
+                if type(m) == InvertedResidual:
+                    for idx in range(len(m.conv)):
+                        if type(m.conv[idx]) == nn.Conv2d:
+                            torch.quantization.fuse_modules(m.conv, [str(idx), str(idx + 1)], inplace=True)
+
+2. Helper functions
+-------------------
+
+We next define several helper functions for model evaluation. These mostly come from
+`here `_.

.. code:: python

-    class AverageMeter(object):
-        """Computes and stores the average and current value"""
-        def __init__(self, name, fmt=':f'):
-            self.name = name
-            self.fmt = fmt
-            self.reset()
-
-        def reset(self):
-            self.val = 0
-            self.avg = 0
-            self.sum = 0
-            self.count = 0
-
-        def update(self, val, n=1):
-            self.val = val
-            self.sum += val * n
-            self.count += n
-            self.avg = self.sum / self.count
-
-        def __str__(self):
-            fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
-            return fmtstr.format(**self.__dict__)
-
-
-    def accuracy(output, target, topk=(1,)):
-        """Computes the accuracy over the k top predictions for the specified values of k"""
-        with torch.no_grad():
-            maxk = max(topk)
-            batch_size = target.size(0)
-
-            _, pred = output.topk(maxk, 1, True, True)
-            pred = pred.t()
-            correct = pred.eq(target.view(1, -1).expand_as(pred))
-
-            res = []
-            for k in topk:
-                correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
-                res.append(correct_k.mul_(100.0 / batch_size))
-            return res
-
-
-    def evaluate(model, criterion, data_loader, neval_batches):
-        model.eval()
-        top1 = AverageMeter('Acc@1', ':6.2f')
-        top5 = AverageMeter('Acc@5', ':6.2f')
-        cnt = 0
-        with torch.no_grad():
-            for image, target in data_loader:
-                output = model(image)
-                loss = criterion(output, target)
-                cnt += 1
-                acc1, acc5 = accuracy(output, target, topk=(1, 5))
-                print('.', end = '')
-                top1.update(acc1[0], image.size(0))
-                top5.update(acc5[0], image.size(0))
-                if cnt >= neval_batches:
-                    return top1, top5
-
-        return top1, top5
-
-    def load_model(model_file):
-        model = MobileNetV2()
-        state_dict = torch.load(model_file)
-        model.load_state_dict(state_dict)
-        model.to('cpu')
-        return model
-
-    def print_size_of_model(model):
-        torch.save(model.state_dict(), "temp.p")
-        print('Size (MB):', os.path.getsize("temp.p")/1e6)
-        os.remove('temp.p')
-
-3. Define dataset and data loaders
-----------------------------------
-
-As our last major setup step, we define our dataloaders for our training and testing set.
-
-ImageNet Data
-^^^^^^^^^^^^^
-
-To run the code in this tutorial using the entire ImageNet dataset, first download imagenet by following the instructions at here `ImageNet Data `_. Unzip the downloaded file into the 'data_path' folder.
-
-With the data downloaded, we show functions below that define dataloaders we'll use to read
-in this data. These functions mostly come from
-`here `_.
+    class AverageMeter(object):
+        """Computes and stores the average and current value"""
+        def __init__(self, name, fmt=':f'):
+            self.name = name
+            self.fmt = fmt
+            self.reset()
+
+        def reset(self):
+            self.val = 0
+            self.avg = 0
+            self.sum = 0
+            self.count = 0
+
+        def update(self, val, n=1):
+            self.val = val
+            self.sum += val * n
+            self.count += n
+            self.avg = self.sum / self.count
+
+        def __str__(self):
+            fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
+            return fmtstr.format(**self.__dict__)
+
+
+    def accuracy(output, target, topk=(1,)):
+        """Computes the accuracy over the k top predictions for the specified values of k"""
+        with torch.no_grad():
+            maxk = max(topk)
+            batch_size = target.size(0)
+
+            _, pred = output.topk(maxk, 1, True, True)
+            pred = pred.t()
+            correct = pred.eq(target.view(1, -1).expand_as(pred))
+
+            res = []
+            for k in topk:
+                correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
+                res.append(correct_k.mul_(100.0 / batch_size))
+            return res
+
+
+    def evaluate(model, criterion, data_loader, neval_batches):
+        model.eval()
+        top1 = AverageMeter('Acc@1', ':6.2f')
+        top5 = AverageMeter('Acc@5', ':6.2f')
+        cnt = 0
+        with torch.no_grad():
+            for image, target in data_loader:
+                output = model(image)
+                loss = criterion(output, target)
+                cnt += 1
+                acc1, acc5 = accuracy(output, target, topk=(1, 5))
+                print('.', end = '')
+                top1.update(acc1[0], image.size(0))
+                top5.update(acc5[0], image.size(0))
+                if cnt >= neval_batches:
+                    return top1, top5
+
+        return top1, top5
+
+    def load_model(model_file):
+        model = MobileNetV2()
+        state_dict = torch.load(model_file)
+        model.load_state_dict(state_dict)
+        model.to('cpu')
+        return model
+
+    def print_size_of_model(model):
+        torch.save(model.state_dict(), "temp.p")
+        print('Size (MB):', os.path.getsize("temp.p")/1e6)
+        os.remove('temp.p')
+
+3. Define dataset and data loaders
+----------------------------------
+
+As our last major setup step, we define the data loaders for our training and testing sets.
+
+ImageNet Data
+^^^^^^^^^^^^^
+
+To run the code in this tutorial using the entire ImageNet dataset, first download ImageNet by following the instructions in `ImageNet Data `_. Unzip the downloaded file into the 'data_path' folder.
+
+With the data downloaded, we show below the functions that define the data loaders we'll use to read
+in this data. These functions mostly come from
+`here `_.

.. code:: python

-    def prepare_data_loaders(data_path):
+    def prepare_data_loaders(data_path):

-        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
+        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                          std=[0.229, 0.224, 0.225])
         dataset = torchvision.datasets.ImageNet(
             data_path, split="train",
-            transforms.Compose([
-                transforms.RandomResizedCrop(224),
-                transforms.RandomHorizontalFlip(),
-                transforms.ToTensor(),
-                normalize,
-            ]))
+            transform=transforms.Compose([
+                transforms.RandomResizedCrop(224),
+                transforms.RandomHorizontalFlip(),
+                transforms.ToTensor(),
+                normalize,
+            ]))
         dataset_test = torchvision.datasets.ImageNet(
-            data_path, split="val",
-            transforms.Compose([
-                transforms.Resize(256),
-                transforms.CenterCrop(224),
-                transforms.ToTensor(),
-                normalize,
-            ]))
+            data_path, split="val",
+            transform=transforms.Compose([
+                transforms.Resize(256),
+                transforms.CenterCrop(224),
+                transforms.ToTensor(),
+                normalize,
+            ]))

-        train_sampler = torch.utils.data.RandomSampler(dataset)
-        test_sampler = torch.utils.data.SequentialSampler(dataset_test)
+        train_sampler = torch.utils.data.RandomSampler(dataset)
+        test_sampler = torch.utils.data.SequentialSampler(dataset_test)

-        data_loader = torch.utils.data.DataLoader(
-            dataset, batch_size=train_batch_size,
-            sampler=train_sampler)
+        data_loader = torch.utils.data.DataLoader(
+            dataset, batch_size=train_batch_size,
+            sampler=train_sampler)

-        data_loader_test = torch.utils.data.DataLoader(
-            dataset_test, batch_size=eval_batch_size,
-            sampler=test_sampler)
+        data_loader_test = torch.utils.data.DataLoader(
+            dataset_test, batch_size=eval_batch_size,
+            sampler=test_sampler)

-        return data_loader, data_loader_test
+        return data_loader, data_loader_test

-Next, we'll load in the pre-trained MobileNetV2 model. We provide the URL to download the data from in ``torchvision``
-`here `_.
+Next, we'll load in the pretrained MobileNetV2 model. The URL to download the model weights from
+``torchvision`` is provided `here `_.

.. code:: python

    data_path = '~/.data/imagenet'
-    saved_model_dir = 'data/'
-    float_model_file = 'mobilenet_pretrained_float.pth'
-    scripted_float_model_file = 'mobilenet_quantization_scripted.pth'
-    scripted_quantized_model_file = 'mobilenet_quantization_scripted_quantized.pth'
+    saved_model_dir = 'data/'
+    float_model_file = 'mobilenet_pretrained_float.pth'
+    scripted_float_model_file = 'mobilenet_quantization_scripted.pth'
+    scripted_quantized_model_file = 'mobilenet_quantization_scripted_quantized.pth'

-    train_batch_size = 30
-    eval_batch_size = 50
+    train_batch_size = 30
+    eval_batch_size = 50

-    data_loader, data_loader_test = prepare_data_loaders(data_path)
-    criterion = nn.CrossEntropyLoss()
-    float_model = load_model(saved_model_dir + float_model_file).to('cpu')
-
-    # Next, we'll "fuse modules"; this can both make the model faster by saving on memory access
-    # while also improving numerical accuracy. While this can be used with any model, this is
-    # especially common with quantized models.
+    data_loader, data_loader_test = prepare_data_loaders(data_path)
+    criterion = nn.CrossEntropyLoss()
+    float_model = load_model(saved_model_dir + float_model_file).to('cpu')

-    print('\n Inverted Residual Block: Before fusion \n\n', float_model.features[1].conv)
-    float_model.eval()
+    # Next, we "fuse modules"; this can both make the model faster by saving on memory access
+    # while also improving numerical accuracy. While this can be used with any model, it is
+    # especially common with quantized models.
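+    #
+    # For instance (an illustrative sketch, not part of this tutorial's pipeline),
+    # a standalone Conv-BN-ReLU triple could be fused like this:
+    #
+    #     m = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).eval()
+    #     torch.quantization.fuse_modules(m, [['0', '1', '2']], inplace=True)
+    #
+    # after which m[0] is a fused ConvReLU2d with the batch norm folded into the
+    # convolution weights, and m[1], m[2] are replaced by nn.Identity.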

-    # Fuses modules
-    float_model.fuse_model()
+    print('\n Inverted Residual Block: Before fusion \n\n', float_model.features[1].conv)
+    float_model.eval()

-    # Note fusion of Conv+BN+Relu and Conv+Relu
-    print('\n Inverted Residual Block: After fusion\n\n',float_model.features[1].conv)
+    # Fuse modules
+    float_model.fuse_model()

-
-Finally to get a "baseline" accuracy, let's see the accuracy of our un-quantized model
-with fused modules
+    # Note the fusion of Conv+BN+ReLU and Conv+ReLU
+    print('\n Inverted Residual Block: After fusion\n\n',float_model.features[1].conv)
+
+
+Finally, to get a "baseline" accuracy, let's look at the accuracy of our un-quantized model
+with fused modules.

.. code:: python

    num_eval_batches = 1000

-    print("Size of baseline model")
-    print_size_of_model(float_model)
+    print("Size of baseline model")
+    print_size_of_model(float_model)

-    top1, top5 = evaluate(float_model, criterion, data_loader_test, neval_batches=num_eval_batches)
-    print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
+    top1, top5 = evaluate(float_model, criterion, data_loader_test, neval_batches=num_eval_batches)
+    print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
    torch.jit.save(torch.jit.script(float_model), saved_model_dir + scripted_float_model_file)

-
-On the entire model, we get an accuracy of 71.9% on the eval dataset of 50,000 images.
-This will be our baseline to compare to. Next, let's try different quantization methods
+On the entire model, we get an accuracy of 71.9% on the eval dataset of 50,000 images.
+
+This will be our baseline to compare against. Next, let's try different quantization methods.

-4. Post-training static quantization
-------------------------------------
+4. Post-training static quantization
+------------------------------------

-Post-training static quantization involves not just converting the weights from float to int,
-as in dynamic quantization, but also performing the additional step of first feeding batches
-of data through the network and computing the resulting distributions of the different activations
-(specifically, this is done by inserting `observer` modules at different points that record this
-data). These distributions are then used to determine how the specifically the different activations
-should be quantized at inference time (a simple technique would be to simply divide the entire range
-of activations into 256 levels, but we support more sophisticated methods as well). Importantly,
-this additional step allows us to pass quantized values between operations instead of converting these
-values to floats - and then back to ints - between every operation, resulting in a significant speed-up.
+Post-training static quantization involves not just converting the weights from float to int,
+as in dynamic quantization, but also performing the additional step of first feeding batches
+of data through the network and computing the resulting distributions of the different activations
+(specifically, this is done by inserting `observer` modules at different points to record this
+data). These distributions are then used to determine how the different activations should be
+quantized at inference time (a simple technique would be to divide the entire range of activations
+into 256 levels, but we support more sophisticated methods as well). Importantly, this additional
+step allows us to pass quantized values between operations instead of converting these values to
+floats - and then back to ints - between every operation, resulting in a significant speed-up.
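+
+As a rough illustration of the "256 levels" idea, here is a hand-rolled affine quantization
+of one tensor (a sketch for intuition only; the `observer` modules inserted by
+``torch.quantization.prepare`` compute and record these statistics automatically):
+
+.. code:: python
+
+    import torch
+
+    x = torch.randn(1000) * 3 + 1                      # stand-in for an observed activation
+    qmin, qmax = 0, 255                                # 256 quint8 levels
+    scale = float(x.max() - x.min()) / (qmax - qmin)   # min/max-observer-style range estimate
+    zero_point = qmin - round(float(x.min()) / scale)
+
+    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
+    x_hat = (x_int - zero_point) * scale               # dequantized approximation of x
+    print('mean absolute quantization error:', (x - x_hat).abs().mean().item())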

.. code:: python

    num_calibration_batches = 32

-    myModel = load_model(saved_model_dir + float_model_file).to('cpu')
-    myModel.eval()
+    myModel = load_model(saved_model_dir + float_model_file).to('cpu')
+    myModel.eval()

-    # Fuse Conv, bn and relu
-    myModel.fuse_model()
+    # Fuse Conv, BN and ReLU
+    myModel.fuse_model()

-    # Specify quantization configuration
-    # Start with simple min/max range estimation and per-tensor quantization of weights
-    myModel.qconfig = torch.quantization.default_qconfig
-    print(myModel.qconfig)
-    torch.quantization.prepare(myModel, inplace=True)
+    # Specify quantization configuration
+    # Start with simple min/max range estimation and per-tensor quantization of weights
+    myModel.qconfig = torch.quantization.default_qconfig
+    print(myModel.qconfig)
+    torch.quantization.prepare(myModel, inplace=True)

-    # Calibrate first
-    print('Post Training Quantization Prepare: Inserting Observers')
-    print('\n Inverted Residual Block:After observer insertion \n\n', myModel.features[1].conv)
+    # Calibrate first
+    print('Post Training Quantization Prepare: Inserting Observers')
+    print('\n Inverted Residual Block: After observer insertion \n\n', myModel.features[1].conv)

-    # Calibrate with the training set
-    evaluate(myModel, criterion, data_loader, neval_batches=num_calibration_batches)
-    print('Post Training Quantization: Calibration done')
+    # Calibrate with the training set
+    evaluate(myModel, criterion, data_loader, neval_batches=num_calibration_batches)
+    print('Post Training Quantization: Calibration done')

-    # Convert to quantized model
-    torch.quantization.convert(myModel, inplace=True)
-    print('Post Training Quantization: Convert done')
-    print('\n Inverted Residual Block: After fusion and quantization, note fused modules: \n\n',myModel.features[1].conv)
+    # Convert to quantized model
+    torch.quantization.convert(myModel, inplace=True)
+    print('Post Training Quantization: Convert done')
+    print('\n Inverted Residual Block: After fusion and quantization, note fused modules: \n\n',myModel.features[1].conv)

-    print("Size of model after quantization")
-    print_size_of_model(myModel)
+    print("Size of model after quantization")
+    print_size_of_model(myModel)

-    top1, top5 = evaluate(myModel, criterion, data_loader_test, neval_batches=num_eval_batches)
+    top1, top5 = evaluate(myModel, criterion, data_loader_test, neval_batches=num_eval_batches)
    print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))

-For this quantized model, we see an accuracy of 56.7% on the eval dataset. This is because we used a simple min/max observer to determine quantization parameters. Nevertheless, we did reduce the size of our model down to just under 3.6 MB, almost a 4x decrease.
-In addition, we can significantly improve on the accuracy simply by using a different
-quantization configuration. We repeat the same exercise with the recommended configuration for
-quantizing for x86 architectures. This configuration does the following:
+For this quantized model, we see an accuracy of 56.7% on the eval dataset. This is because we used a simple min/max observer to determine the quantization parameters. Nevertheless, we did reduce the size of our model down to just under 3.6 MB, almost a 4x decrease.

-- Quantizes weights on a per-channel basis
-- Uses a histogram observer that collects a histogram of activations and then picks
-  quantization parameters in an optimal manner.
+In addition, we can significantly improve the accuracy simply by using a different
+quantization configuration. We repeat the same exercise with the configuration recommended
+for quantizing on x86 architectures. This configuration does the following:
+
+- Quantizes weights on a per-channel basis
+- Uses a histogram observer that collects a histogram of activations and then picks
+  quantization parameters in an optimal manner
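+
+To build intuition for why per-channel weight quantization helps, here is a small comparison
+sketch (illustrative only: the tensor is made up, and the naive min/max scales below are not
+what the histogram observer would pick):
+
+.. code:: python
+
+    import torch
+
+    # Four output channels with very different magnitudes, as often happens in conv weights.
+    w = torch.randn(4, 8) * torch.tensor([0.1, 1.0, 5.0, 20.0]).unsqueeze(1)
+
+    # Per-tensor: one scale for everything; small-magnitude channels lose precision.
+    scale_t = float(w.abs().max()) / 127
+    qw_t = torch.quantize_per_tensor(w, scale_t, 0, torch.qint8)
+
+    # Per-channel: one scale per output channel (axis 0).
+    scales_c = (w.abs().max(dim=1)[0] / 127).double()
+    zero_points = torch.zeros(4, dtype=torch.int64)
+    qw_c = torch.quantize_per_channel(w, scales_c, zero_points, 0, torch.qint8)
+
+    print('per-tensor error: ', (w - qw_t.dequantize()).abs().mean().item())
+    print('per-channel error:', (w - qw_c.dequantize()).abs().mean().item())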

.. code:: python

-    per_channel_quantized_model = load_model(saved_model_dir + float_model_file)
-    per_channel_quantized_model.eval()
-    per_channel_quantized_model.fuse_model()
-    per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
-    print(per_channel_quantized_model.qconfig)
-
-    torch.quantization.prepare(per_channel_quantized_model, inplace=True)
-    evaluate(per_channel_quantized_model,criterion, data_loader, num_calibration_batches)
-    torch.quantization.convert(per_channel_quantized_model, inplace=True)
-    top1, top5 = evaluate(per_channel_quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)
-    print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
+    per_channel_quantized_model = load_model(saved_model_dir + float_model_file)
+    per_channel_quantized_model.eval()
+    per_channel_quantized_model.fuse_model()
+    per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
+    print(per_channel_quantized_model.qconfig)
+
+    torch.quantization.prepare(per_channel_quantized_model, inplace=True)
+    evaluate(per_channel_quantized_model, criterion, data_loader, num_calibration_batches)
+    torch.quantization.convert(per_channel_quantized_model, inplace=True)
+    top1, top5 = evaluate(per_channel_quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)
+    print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
    torch.jit.save(torch.jit.script(per_channel_quantized_model), saved_model_dir + scripted_quantized_model_file)

-Changing just this quantization configuration method resulted in an increase
-of the accuracy to over 67.3%! Still, this is 4% worse than the baseline of 71.9% achieved above.
-So lets try quantization aware training.
+Changing just this quantization configuration resulted in an increase
+of the accuracy to over 67.3%! Still, this is 4% worse than the baseline of 71.9% achieved above.
+So let's try quantization-aware training.

-5. Quantization-aware training
-------------------------------
+5. Quantization-aware training
+------------------------------

-Quantization-aware training (QAT) is the quantization method that typically results in the highest accuracy.
-With QAT, all weights and activations are “fake quantized” during both the forward and backward passes of
-training: that is, float values are rounded to mimic int8 values, but all computations are still done with
-floating point numbers. Thus, all the weight adjustments during training are made while “aware” of the fact
-that the model will ultimately be quantized; after quantizing, therefore, this method will usually yield
-higher accuracy than either dynamic quantization or post-training static quantization.
+Quantization-aware training (QAT) is the quantization method that typically yields the highest accuracy.
+With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of
+training: that is, float values are rounded to mimic int8 values, but all computations are still done with
+floating point numbers. Thus, all the weight adjustments during training are made while "aware" of the fact
+that the model will ultimately be quantized; after quantizing, therefore, this method usually yields
+higher accuracy than either dynamic quantization or post-training static quantization.
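+
+Here is a sketch of what "fake quantization" does to a single tensor (illustrative only; in QAT
+this is handled by the fake-quantization modules that ``prepare_qat`` inserts, with a
+straight-through estimator passing gradients through the rounding step):
+
+.. code:: python
+
+    import torch
+
+    def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
+        # Round onto the int8 grid, then come straight back to float: the result
+        # only takes values that an int8 model could represent, but stays float.
+        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
+        return (q - zero_point) * scale
+
+    x = torch.randn(5)
+    print(x)
+    print(fake_quantize(x, scale=0.1, zero_point=0))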

-The overall workflow for actually performing QAT is very similar to before:
+The overall workflow for actually performing QAT is very similar to before:
-
-- We can use the same model as before: there is no additional preparation needed for quantization-aware
-  training.
-- We need to use a ``qconfig`` specifying what kind of fake-quantization is to be inserted after weights
-  and activations, instead of specifying observers
+
+- We can use the same model as before: there is no additional preparation needed for
+  quantization-aware training.
+- Instead of specifying observers, we need to use a ``qconfig`` specifying what kind of
+  fake-quantization is to be inserted after the weights and activations.

-We first define a training function:
+We first define a training function:

.. code:: python

-    def train_one_epoch(model, criterion, optimizer, data_loader, device, ntrain_batches):
-        model.train()
-        top1 = AverageMeter('Acc@1', ':6.2f')
-        top5 = AverageMeter('Acc@5', ':6.2f')
-        avgloss = AverageMeter('Loss', '1.5f')
-
-        cnt = 0
-        for image, target in data_loader:
-            start_time = time.time()
-            print('.', end = '')
-            cnt += 1
-            image, target = image.to(device), target.to(device)
-            output = model(image)
-            loss = criterion(output, target)
-            optimizer.zero_grad()
-            loss.backward()
-            optimizer.step()
-            acc1, acc5 = accuracy(output, target, topk=(1, 5))
-            top1.update(acc1[0], image.size(0))
-            top5.update(acc5[0], image.size(0))
-            avgloss.update(loss, image.size(0))
-            if cnt >= ntrain_batches:
-                print('Loss', avgloss.avg)
-
-                print('Training: * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
-                      .format(top1=top1, top5=top5))
-                return
-
-        print('Full imagenet train set: * Acc@1 {top1.global_avg:.3f} Acc@5 {top5.global_avg:.3f}'
-              .format(top1=top1, top5=top5))
-        return
-
-
-We fuse modules as before
+    def train_one_epoch(model, criterion, optimizer, data_loader, device, ntrain_batches):
+        model.train()
+        top1 = AverageMeter('Acc@1', ':6.2f')
+        top5 = AverageMeter('Acc@5', ':6.2f')
+        avgloss = AverageMeter('Loss', '1.5f')
+
+        cnt = 0
+        for image, target in data_loader:
+            start_time = time.time()
+            print('.', end = '')
+            cnt += 1
+            image, target = image.to(device), target.to(device)
+            output = model(image)
+            loss = criterion(output, target)
+            optimizer.zero_grad()
+            loss.backward()
+            optimizer.step()
+            acc1, acc5 = accuracy(output, target, topk=(1, 5))
+            top1.update(acc1[0], image.size(0))
+            top5.update(acc5[0], image.size(0))
+            avgloss.update(loss, image.size(0))
+            if cnt >= ntrain_batches:
+                print('Loss', avgloss.avg)
+
+                print('Training: * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
+                      .format(top1=top1, top5=top5))
+                return
+
+        print('Full imagenet train set: * Acc@1 {top1.global_avg:.3f} Acc@5 {top5.global_avg:.3f}'
+              .format(top1=top1, top5=top5))
+        return
+
+
+We fuse the modules as before.

.. code:: python

-    qat_model = load_model(saved_model_dir + float_model_file)
-    qat_model.fuse_model()
+    qat_model = load_model(saved_model_dir + float_model_file)
+    qat_model.fuse_model()
+
+    optimizer = torch.optim.SGD(qat_model.parameters(), lr = 0.0001)
+    qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

-    optimizer = torch.optim.SGD(qat_model.parameters(), lr = 0.0001)
-    qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
-
-Finally, ``prepare_qat`` performs the "fake quantization", preparing the model for quantization-aware training
+Finally, ``prepare_qat`` performs the "fake quantization", preparing the model for quantization-aware training.

.. code:: python

-    torch.quantization.prepare_qat(qat_model, inplace=True)
+    torch.quantization.prepare_qat(qat_model, inplace=True)
    print('Inverted Residual Block: After preparation for QAT, note fake-quantization modules \n',qat_model.features[1].conv)

-
-Training a quantized model with high accuracy requires accurate modeling of numerics at
-inference. For quantization aware training, therefore, we modify the training loop by:
-
-- Switch batch norm to use running mean and variance towards the end of training to better
-  match inference numerics.
-- We also freeze the quantizer parameters (scale and zero-point) and fine tune the weights.
+Training a quantized model with high accuracy requires accurate modeling of the numerics at
+inference time. For quantization-aware training, therefore, we modify the training loop by:
+
+- Switching batch norm to use running mean and variance towards the end of training, to better
+  match inference numerics.
+- Freezing the quantizer parameters (scale and zero-point) and fine-tuning the weights.

.. code:: python

-    num_train_batches = 20
-
-    # QAT takes time and one needs to train over a few epochs.
-    # Train and check accuracy after each epoch
-    for nepoch in range(8):
-        train_one_epoch(qat_model, criterion, optimizer, data_loader, torch.device('cpu'), num_train_batches)
-        if nepoch > 3:
-            # Freeze quantizer parameters
-            qat_model.apply(torch.quantization.disable_observer)
-        if nepoch > 2:
-            # Freeze batch norm mean and variance estimates
-            qat_model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
-
-        # Check the accuracy after each epoch
-        quantized_model = torch.quantization.convert(qat_model.eval(), inplace=False)
-        quantized_model.eval()
-        top1, top5 = evaluate(quantized_model,criterion, data_loader_test, neval_batches=num_eval_batches)
-        print('Epoch %d :Evaluation accuracy on %d images, %2.2f'%(nepoch, num_eval_batches * eval_batch_size, top1.avg))
-
-Quantization-aware training yields an accuracy of over 71.5% on the entire imagenet dataset, which is close to the floating point accuracy of 71.9%.
-
-More on quantization-aware training:
-
-- QAT is a super-set of post training quant techniques that allows for more debugging.
-  For example, we can analyze if the accuracy of the model is limited by weight or activation
-  quantization.
-- We can also simulate the accuracy of a quantized model in floating point since
-  we are using fake-quantization to model the numerics of actual quantized arithmetic.
-- We can mimic post training quantization easily too.
-
-Speedup from quantization
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Finally, let's confirm something we alluded to above: do our quantized models actually perform inference
-faster? Let's test:
+    num_train_batches = 20
+
+    # QAT takes time, and one needs to train over a few epochs.
+    # Train and check accuracy after each epoch
+    for nepoch in range(8):
+        train_one_epoch(qat_model, criterion, optimizer, data_loader, torch.device('cpu'), num_train_batches)
+        if nepoch > 3:
+            # Freeze quantizer parameters
+            qat_model.apply(torch.quantization.disable_observer)
+        if nepoch > 2:
+            # Freeze batch norm mean and variance estimates
+            qat_model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
+
+        # Check the accuracy after each epoch
+        quantized_model = torch.quantization.convert(qat_model.eval(), inplace=False)
+        quantized_model.eval()
+        top1, top5 = evaluate(quantized_model,criterion, data_loader_test, neval_batches=num_eval_batches)
+        print('Epoch %d :Evaluation accuracy on %d images, %2.2f'%(nepoch, num_eval_batches * eval_batch_size, top1.avg))
+
+Quantization-aware training yields an accuracy of over 71.5% on the entire ImageNet dataset, which is close to the floating point accuracy of 71.9%.
+
+More on quantization-aware training:
+
+- QAT is a super-set of post-training quantization techniques that allows for more debugging.
+  For example, we can analyze whether the accuracy of the model is limited by weight or activation
+  quantization.
+- We can also simulate the accuracy of a quantized model in floating point, since
+  we are using fake-quantization to model the numerics of actual quantized arithmetic.
+- We can mimic post-training quantization easily too.
+
+Speedup from quantization
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Finally, let's confirm something we alluded to above: do our quantized models actually perform
+inference faster? Let's test:

.. code:: python

-    def run_benchmark(model_file, img_loader):
-        elapsed = 0
-        model = torch.jit.load(model_file)
-        model.eval()
-        num_batches = 5
-        # Run the scripted model on a few batches of images
-        for i, (images, target) in enumerate(img_loader):
-            if i < num_batches:
-                start = time.time()
-                output = model(images)
-                end = time.time()
-                elapsed = elapsed + (end-start)
-            else:
-                break
-        num_images = images.size()[0] * num_batches
-
-        print('Elapsed time: %3.0f ms' % (elapsed/num_images*1000))
-        return elapsed
-
-    run_benchmark(saved_model_dir + scripted_float_model_file, data_loader_test)
-
-    run_benchmark(saved_model_dir + scripted_quantized_model_file, data_loader_test)
-
-Running this locally on a MacBook pro yielded 61 ms for the regular model, and
-just 20 ms for the quantized model, illustrating the typical 2-4x speedup
-we see for quantized models compared to floating point ones.
-
-Conclusion
-----------
-
-In this tutorial, we showed two quantization methods - post-training static quantization,
-and quantization-aware training - describing what they do "under the hood" and how to use
-them in PyTorch.
-
-Thanks for reading! As always, we welcome any feedback, so please create an issue
-`here `_ if you have any.
+    def run_benchmark(model_file, img_loader):
+        elapsed = 0
+        model = torch.jit.load(model_file)
+        model.eval()
+        num_batches = 5
+        # Run the scripted model on a few batches of images
+        for i, (images, target) in enumerate(img_loader):
+            if i < num_batches:
+                start = time.time()
+                output = model(images)
+                end = time.time()
+                elapsed = elapsed + (end-start)
+            else:
+                break
+        num_images = images.size()[0] * num_batches
+
+        print('Elapsed time: %3.0f ms' % (elapsed/num_images*1000))
+        return elapsed
+
+    run_benchmark(saved_model_dir + scripted_float_model_file, data_loader_test)
+
+    run_benchmark(saved_model_dir + scripted_quantized_model_file, data_loader_test)
+
+Running this locally on a MacBook Pro yielded 61 ms for the regular model and
+just 20 ms for the quantized model, illustrating the typical 2-4x speedup
+we see for quantized models compared to floating point ones.
+
+Conclusion
+----------
+
+In this tutorial, we showed two quantization methods - post-training static quantization
+and quantization-aware training - describing what they do "under the hood" and how to
+use them in PyTorch.
+
+Thanks for reading! As always, we welcome any feedback, so please create an issue
+`here `_ if you have any.