Dynamic Quantization of a PyTorch Model Using PyTorch Built-in Functions - Part 1
All the experiments are run on CPU.
This is going to be a three-part tutorial in which we quantize a PyTorch model in three different ways. Here’s the source code of the first part.
- PyTorch built-in functions
- Intel® Neural Compressor Library
- ONNX Library
In this first part, we will discuss how to convert a PyTorch Lightning model (or a plain PyTorch model) to a quantized version using PyTorch built-in functions, and compare their performance.
Quantization is a technique for optimizing deep learning models: it reduces their memory requirements while largely preserving their accuracy, making them easier to deploy in resource-limited environments. In this tutorial, we will use PyTorch’s built-in quantization module to quantize a model and compare its performance with the original model.
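Before diving in, here is a minimal sketch of the core idea (illustrative only, not part of the tutorial’s pipeline): quantization maps float values to 8-bit integers through a scale (and zero point), and dequantization recovers an approximation of the original values.

import torch

# A toy float tensor to quantize.
x = torch.randn(4)

# Choose a per-tensor scale so the float range maps onto int8's 256 levels.
scale = float((x.max() - x.min()) / 255)
xq = torch.quantize_per_tensor(x, scale=scale, zero_point=0, dtype=torch.qint8)

print(xq)                # int8 representation, stored with its scale/zero point
print(xq.dequantize())   # approximate reconstruction of the original floats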
We will perform the following steps in this tutorial:
- Load the PyTorch Lightning model
- Quantize the model
- Compare the sizes of the original and quantized models
- Evaluate the performance of the original and quantized models
- Summary
Let’s get started.
Step 1: Load the pre-trained PyTorch model
We will use a tone classification model, a multiclass classification model that predicts one of the five tones in Mandarin Chinese.
Let’s import the LightningToneClassifier model from the tone_pl_model.py file. You can use your own PyTorch or PyTorch Lightning model.
import torch
from tone_pl_model import LightningToneClassifier
checkpoint = "/home/vincent0730/ML_chinese_tone_classification/87/a8/epoch=0-val_loss=1.250-val_acc=0.521.ckpt"
pl_model = LightningToneClassifier.load_from_checkpoint(checkpoint)
model = pl_model.model
model.eval()
In the above code, we load the pre-trained model from a checkpoint file and set it to evaluation mode.
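Since dynamic quantization (applied in the next step) targets linear layers, an optional quick check is to count how many torch.nn.Linear modules the model contains:

# Count the nn.Linear modules that dynamic quantization will later replace.
num_linear = sum(1 for m in model.modules() if isinstance(m, torch.nn.Linear))
print(f"Number of linear layers: {num_linear}")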
Step 2: Quantize the model
We will use the torch.quantization.quantize_dynamic function to quantize the model. This function replaces the specified layer types (here, all torch.nn.Linear layers) with dynamically quantized versions: their weights are converted to int8 ahead of time, while activations are quantized on the fly during inference.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
In the above code, we quantize the model using dynamic quantization, specifying that only linear layers should be quantized and that the quantized weights should use the torch.qint8 datatype.
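To verify that the swap took effect, you can inspect the quantized model; the replaced layers show up as dynamically quantized Linear modules in the module tree. This is an optional sanity check, not required for the rest of the tutorial (the exact module path of the quantized classes varies across PyTorch versions).

# Print each Linear-like module with its full class path; in the quantized
# model these should live under a quantized dynamic namespace rather than
# under torch.nn.modules.
for name, module in quantized_model.named_modules():
    if "Linear" in type(module).__name__:
        print(name, "->", f"{type(module).__module__}.{type(module).__name__}")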
Step 3: Compare the sizes of the original and quantized models
We use a helper function, print_size_of_model, to compare the sizes of the original and quantized models.
import os

def print_size_of_model(model):
    # Save the state dict to a temporary file, report its size on disk, then clean up.
    torch.save(model.state_dict(), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p") / (1024 * 1024))
    os.remove("temp.p")

print_size_of_model(model)
print_size_of_model(quantized_model)
In the above code, the print_size_of_model function saves the model’s state dict to a temporary file, prints the file size, and deletes the file. In our run, the original model comes out at 325.53 MB and the quantized model at 82.58 MB (see the summary table below).
Step 4: Evaluate the performance of the original and quantized models
To evaluate the performance of both the original and quantized models, we’ll use a validation dataset consisting of one thousand data points. We’ll also use the Accuracy metric from torchmetrics to compute the accuracy of the models.
import time
from torchmetrics import Accuracy
from tqdm import tqdm
from tone_datamodule import ToneDataModule

acc = Accuracy(task="multiclass", num_classes=5)
eval_metadata = "/home/vincent0730/ML_chinese_tone_classification/test1000.csv"
max_length = 75
batch_size = 128
datamodule = ToneDataModule(
    train_metadata="",
    eval_metadata=eval_metadata,
    model_name_or_path="MIT/ast-finetuned-audioset-10-10-0.4593",
    max_length=max_length,
    batch_size=batch_size,
)
valid_dataset = datamodule.get_dataset(datamodule.eval_metadata, streaming=False)
datamodule.val_dataset = valid_dataset
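As a quick sanity check (assuming the dataset object supports len()), you can confirm that the validation split contains the one thousand examples mentioned above:

# The evaluation CSV (test1000.csv) should yield 1,000 examples.
print("Validation examples:", len(valid_dataset))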
Let’s define a helper function that measures each model’s total inference time and accuracy.
def model_runtime(model):
    preds = []
    labels = []
    start = time.time()
    for data in tqdm(datamodule.val_dataloader()):
        # Forward pass; the predicted tone is the argmax over the logits.
        pred = torch.argmax(model(data["input_values"]).logits, dim=-1).tolist()
        preds.extend(pred)
        labels.extend(data["labels"].tolist())
    print(f"Total time: {time.time() - start}s")
    print("Accuracy:", acc(torch.tensor(preds), torch.tensor(labels)))

model_runtime(model)
model_runtime(quantized_model)
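Note that CPU latency depends on how many intra-op threads PyTorch uses, so timings from different machines (or thread settings) are not directly comparable. If you rerun the benchmark, pinning the thread count makes the numbers more reproducible (the value 4 below is an arbitrary example):

# Optional: fix the intra-op thread count for more reproducible CPU timings.
print("Intra-op threads:", torch.get_num_threads())
torch.set_num_threads(4)  # arbitrary example; tune for your machine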
Summary
The table below summarizes the results of the evaluation:
Model Type | Size (MB) | Accuracy | Latency (s)
---|---|---|---
Original Model | 325.53 | 0.521 | 114.7
Quantized Model | 82.58 | 0.516 | 92.2
As we can see, the quantized model is about 75% smaller than the original model and roughly 20% faster, at the cost of a small accuracy drop of about 0.5 percentage points.
Although the accuracy drop is minor here, keep in mind that the extent of the reduction can vary depending on the model architecture and the dataset used.