March 5, 2025

ikayaniaamirshahzad@gmail.com

Optimizing AI Models with Quanto


Transformer-based diffusion models are improving rapidly and have transformed text-to-image generation. Transformers improve a model’s scalability and performance, but they also increase its complexity.

With great power comes great responsibility

In this case, with great model complexity comes great power and memory consumption.

For instance, running inference with a model like Stable Diffusion 3 requires a large amount of GPU memory because several components are involved: text encoders, a diffusion backbone, and an image decoder. This high memory requirement is a setback for anyone using a consumer-grade GPU, hampering both accessibility and experimentation.

Enter model quantization. Imagine being able to scale down a resource-hungry model to a more manageable size without sacrificing much of its effectiveness. Quantization, much like compressing a high-resolution image into a more compact format, transforms the model’s parameters into lower-precision representations. This reduces memory usage and can also speed up computation, making complex models more accessible and easier to work with.
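To make the idea of a lower-precision representation concrete, here is a minimal sketch in plain Python of symmetric, per-tensor int8 quantization. It is an illustration of the general technique, not Quanto’s implementation; the function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 codes that share one scale (symmetric, per-tensor)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.03, 0.88, -0.56]  # hypothetical float weights
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is now stored as a one-byte integer plus a single shared scale, and the round-trip error per value is at most half a quantization step.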

In this post, we explore how Quanto’s quantization tools can significantly enhance the memory efficiency of Transformer-based diffusion pipelines.

Prerequisites

Introducing Quanto: A Versatile PyTorch Quantization Backend

Quanto is part of Hugging Face Optimum, a set of tools for hardware optimization. Its key features include:

  • Eager Mode Compatibility: Works seamlessly with non-traceable models.
  • Device Flexibility: Quantized models can be deployed on any device, including CUDA and MPS.
  • Automatic Integration: Inserts quantization/dequantization stubs, functional operations, and quantized modules automatically.
  • Streamlined Workflow: Provides an effortless transition from a float model to both dynamic and static quantized models.
  • Serialization Support: Compatible with PyTorch weight_only and 🤗 Safetensors formats.
  • Accelerated Matrix Multiplications: Supports various quantization formats (int8-int8, fp16-int4, bf16-int8, bf16-int4) on CUDA devices.
  • Wide Range of Support: Handles int2, int4, int8, and float8 weights and activations.

While many tools focus on making large AI models smaller, Quanto is designed to be simple and useful for all kinds of models.
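As a rough guide to what the supported weight types mean for memory, the sketch below estimates the storage needed for the weights alone at each bit width. The parameter count is a hypothetical example, and the estimate ignores quantization scales, activations, and any layers left unquantized.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"float16": 16, "float8": 8, "int8": 8, "int4": 4, "int2": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8 / 2**30


num_params = 2_000_000_000  # hypothetical 2B-parameter model
for dtype in BITS_PER_WEIGHT:
    print(f"{dtype:>8}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Halving the bit width halves the weight storage, which is why moving from 16-bit floats to 8-bit or 4-bit weights has such a visible effect on GPU memory.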

To install Quanto with pip, run:

!pip install optimum-quanto

Quantize a Model

The code below converts a standard model into a quantized model:

from optimum.quanto import quantize, qint8
# `model` is an already-instantiated torch.nn.Module; it is modified in place.
quantize(model, weights=qint8, activations=qint8)

Calibrate

Quanto’s calibration mode ensures that the quantization parameters are adjusted to the actual data distributions in the model, enhancing the accuracy and efficiency of the quantized model.

from optimum.quanto import Calibration

# Run representative inputs (`samples`) through the model while calibration is active.
with Calibration(momentum=0.9):
    model(samples)
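One common way to use a momentum value during calibration is to keep an exponential moving average of the activation scale seen on each batch. The sketch below illustrates that idea in plain Python. The class name is hypothetical, and whether Quanto updates its scales exactly this way is an assumption.

```python
class RunningScale:
    """Tracks an activation scale as an exponential moving average (illustrative)."""

    def __init__(self, momentum=0.9, qmax=127):
        self.momentum = momentum
        self.qmax = qmax
        self.scale = None

    def observe(self, activations):
        # Scale that would map this batch's largest magnitude onto qmax.
        batch_scale = max(abs(a) for a in activations) / self.qmax
        if self.scale is None:
            self.scale = batch_scale
        else:
            self.scale = (
                self.momentum * self.scale + (1 - self.momentum) * batch_scale
            )
        return self.scale


tracker = RunningScale(momentum=0.9)
for batch in ([0.3, -1.27, 0.9], [2.54, -0.1], [0.7, 0.2]):  # hypothetical batches
    print(tracker.observe(batch))
```

A momentum close to 1 makes the scale change slowly, so a single batch with unusual values has little influence on the final quantization parameters.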

Quantization-Aware Training

If quantization degrades the model’s performance, the model can be fine-tuned for a few epochs to recover it.

import torch

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    # Convert the quantized output back to float before computing the loss.
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()
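The reason fine-tuning can help is that the forward pass sees quantized weights while the optimizer keeps updating the underlying float weights. Below is a toy sketch of that mechanism with a single weight, using a straight-through gradient. The grid spacing, learning rate, and data are hypothetical, and it is an assumption that this mirrors what happens inside Quanto.

```python
STEP = 0.25  # hypothetical spacing of the quantization grid


def fake_quantize(w):
    """Snap a float weight to the nearest grid point."""
    return round(w / STEP) * STEP


def train(x, y, steps=20, lr=0.1):
    w = 0.0  # float weight that the optimizer updates
    for _ in range(steps):
        wq = fake_quantize(w)        # the forward pass uses the quantized weight
        grad = 2 * (wq * x - y) * x  # gradient of (wq * x - y) ** 2 with respect to wq
        w -= lr * grad               # straight-through: apply that gradient to w
    return w


w = train(x=1.0, y=0.5)
print(w, fake_quantize(w))
```

The float weight drifts until its quantized value lands on the grid point that minimizes the loss, at which point the gradient vanishes and training settles.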

Freeze integer weights

Freezing the model converts its float weights into quantized weights.

from optimum.quanto import freeze
freeze(model)

The H100 is a high-performance data-center GPU designed for demanding AI workloads, including training and inference for large models such as transformers and diffusion models. Here’s why it was chosen for this benchmark:

  • Top-tier Performance: The H100 offers exceptional speed and power, making it ideal for handling complex operations required by large models like text-to-image and text-to-video generation pipelines.
  • Support for FP16: This GPU efficiently handles computations in FP16 (half-precision floating point), which reduces memory usage and speeds up calculations without significantly sacrificing accuracy.
  • Advanced Hardware Features: The H100 supports optimized operations for mixed-precision training and inference, making it an excellent choice for quantization techniques that aim to reduce model size while maintaining performance.

In the benchmarking study, the main focus is on applying Quanto, a new quantization tool, to diffusion models. While quantization is well-known among practitioners of Large Language Models (LLMs), it’s less commonly used with diffusion models. Quanto is used to explore whether it can provide memory savings in these models with little or no loss in quality.

Here’s what the study involves:

Environment Setup


The code below quantizes the text encoder:

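from optimum.quanto import freeze, qfloat8, quantize
# `pipeline` is an already-loaded diffusion pipeline.
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)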

The text encoder, being a transformer model as well, can also be quantized. By quantizing both the text encoder and the diffusion backbone, significantly greater memory savings are achieved.

LLM Pipelines

Integrations with the Transformers


  • The diffusion transformer is quantized in every configuration, so the differences in memory savings between configurations come primarily from quantizing the text encoder.
  • Using bfloat16 can be faster on powerful GPUs such as the H100 or the RTX 4090.
  • qint8 is generally faster for inference due to efficient integer operations and hardware optimization.
  • Fusing the QKV projections “thickens” the int8 kernels: the three separate projections become one larger matrix multiplication, which reduces the number of operations launched and makes better use of the hardware.
  • Using qint4 with bfloat16 on an H100 GPU improves memory usage, because qint4 uses only 4 bits per value and so reduces the memory needed to store the weights. However, this comes at the cost of increased inference latency, because the H100 does not yet support computation with 4-bit integers (int4). Although the weights are stored in a compressed 4-bit format, the actual computations are still performed in bfloat16 (a 16-bit floating-point format), so the weights must be converted before each operation, leading to slower processing.
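The storage-versus-compute trade-off described for qint4 can be sketched in plain Python: two 4-bit codes are packed into each byte for storage, but they have to be unpacked and converted back to floats before any arithmetic happens. This is an illustration only. The packing layout and function names are hypothetical and do not describe how Quanto or its CUDA kernels store int4 data.

```python
def pack_int4(codes):
    """Pack signed 4-bit codes (-8..7) two per byte."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count without mutating the input
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover the first `count` signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dot_with_int4_weights(packed, scale, x):
    """Dequantize to float first, then compute in floating point."""
    weights = [c * scale for c in unpack_int4(packed, len(x))]
    return sum(w * xi for w, xi in zip(weights, x))


codes = [-8, 7, 0, -1, 3]  # hypothetical quantized weights
packed = pack_int4(codes)
print(len(packed), unpack_int4(packed, len(codes)))
print(dot_with_int4_weights(packed, 0.5, [1.0] * len(codes)))
```

Five weights fit in three bytes, but every use of them pays for the unpack-and-convert step, which is the extra work behind the higher latency noted above.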

Quanto offers a powerful quantization backend for PyTorch that converts model weights to lower-precision formats. By supporting weight types such as qint8 and qint4, Quanto reduces memory consumption and, in some configurations, speeds up inference. Quanto also works across different devices (CPU, GPU, MPS) and is compatible with various setups. However, on MPS devices, using float8 will cause an error.
Overall, Quanto enables more efficient deployment of deep learning models, balancing memory savings with performance trade-offs.


