
Mixed Low-bit Quantization for Model Compression with Layer Importance and Gradient Estimations


Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Abstract

Deep neural networks (DNNs) have become ubiquitous in recent years. However, their substantial memory consumption and high computational cost make them challenging to deploy on resource-constrained devices. Model compression methods offer a remedy here. Among these techniques, neural network quantization achieves high compression rates by representing weights and activations at low bitwidths while maintaining the accuracy of the high-precision original network. However, mixed-precision (per-layer bit-width) quantization requires careful tuning to maintain accuracy while achieving further compression at finer granularity than fixed-precision quantization. In this thesis, we propose an accuracy-aware criterion to quantify each layer's importance rank. Our method applies imprinting per layer, which acts as an efficient proxy module for accuracy estimation. We rank the layers based on the accuracy gain over previous modules and iteratively quantize those that contribute less to accuracy. Previous mixed-precision methods either rely on expensive search techniques such as reinforcement learning (RL) or on end-to-end optimization that offers little insight into the resulting quantization configuration. Our method is a one-shot, efficient, accuracy-aware information estimate and thus lends better interpretability to the selected bit-width configuration. We also point out a problem with the Straight-Through Estimator (STE), which is commonly used for gradient estimation in quantization, and discuss some ways to address it.
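For readers unfamiliar with the quantization background the abstract assumes, the two core ideas — snapping a weight onto a low-bit grid in the forward pass, and the Straight-Through Estimator passing the gradient through the non-differentiable rounding step — can be sketched as follows. This is a generic illustration, not the thesis's method; the function names `quantize` and `ste_grad` and the [-1, 1] clipping range are assumptions for the example.

```python
def quantize(w, bits):
    """Uniformly quantize a weight to a signed low-bit grid in [-1, 1].

    With `bits` bits, the grid has 2**(bits-1) - 1 positive levels
    (e.g. 3 bits -> grid points at multiples of 1/3).
    """
    levels = 2 ** (bits - 1) - 1
    w = max(-1.0, min(1.0, w))        # clip to the representable range
    return round(w * levels) / levels  # snap to the nearest grid point

def ste_grad(upstream_grad, w):
    """Straight-Through Estimator for the backward pass.

    round() has zero derivative almost everywhere, so true backprop
    through quantize() would stall training. STE instead passes the
    upstream gradient through as if quantization were the identity,
    gated to zero outside the clipping range.
    """
    return upstream_grad if -1.0 <= w <= 1.0 else 0.0
```

Because the STE gradient is only an approximation of the true (zero or undefined) gradient of rounding, it introduces a mismatch between forward and backward passes — the issue the abstract raises.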

Item Type

http://purl.org/coar/resource_type/c_46ec


Other License Text / Link

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

en
