Enhancing CNN Inference Time and Reducing Latency on Edge and Resource-Constrained Systems through Quantization

Abstract

Systems that use Deep Learning (DL) models rely heavily on cloud computing for inference in domains such as traffic monitoring, healthcare, and IoT. However, applications like autonomous vehicles, surveillance systems, and spacecraft are transitioning towards edge computing because of bandwidth limitations, transmission delays, and network connectivity issues. Edge computing mitigates these challenges by processing data and models locally on the device, reducing latency. Deploying Deep Neural Networks (DNNs) on edge devices, however, is constrained by limited memory and computing power. DNNs typically use 32-bit floating-point precision for accuracy, which inflates model size. Quantization addresses this by converting high-precision floating-point (FP) values to lower-precision or integer (INT) values, improving throughput and latency. This paper presents a comparative study of the accuracy and performance of 64-bit, 32-bit, and 16-bit floating-point precision, along with 8-bit integer precision, using Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), on multiple networks including CustomNets, with inference performed on a GPU as well as a Xilinx Deep Processing Unit (DPU). The models were evaluated on a sample of the EuroSAT remote sensing dataset. Quantizing models to FP16 and INT8 yielded 2-3x and 4x faster inference, respectively, with a negligible accuracy drop of 1-4%. FP64 exhibited a 2-3x decrease in speed but a slight accuracy improvement (2%). On the DPU, models showed minimal accuracy degradation of about 1%. Overall, model size decreased by a constant 2x and 4x from FP32 to FP16 and INT8, respectively, while increasing by 2x for FP64. This reduction in size, with negligible loss in accuracy, enables onboard storage along with faster and accurate inference on resource-constrained systems. © 2024 IEEE.
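The abstract describes FP16 and INT8 quantization only at a high level. The sketch below illustrates the general post-training quantization (PTQ) workflow in PyTorch eager mode; the SmallCNN architecture, input size, calibration loop, and random calibration data are illustrative assumptions and do not reproduce the paper's actual models, GPU/DPU toolchain, or evaluation setup.

# Minimal PTQ sketch in PyTorch (eager mode). SmallCNN and the calibration
# data are hypothetical stand-ins, not the models used in the paper.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 at input
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)
        self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 at output

    def forward(self, x):
        x = self.quant(x)
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return self.dequant(x)

# FP16: a simple half-precision cast (speedups are typically realized on GPU).
model_fp16 = SmallCNN().eval().half()

# INT8 PTQ: attach a quantization config, observe activation ranges on a small
# calibration set, then convert weights and activations to INT8 kernels.
model_int8 = SmallCNN().eval()
model_int8.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model_int8, inplace=True)
with torch.no_grad():
    for _ in range(32):                           # hypothetical calibration batches
        model_int8(torch.randn(8, 3, 64, 64))     # placeholder for 64x64 RGB tiles
torch.quantization.convert(model_int8, inplace=True)

In this kind of workflow, the roughly 2x (FP16) and 4x (INT8) size reductions reported in the abstract follow directly from the narrower weight storage, while the accuracy impact depends on the calibration data and quantization scheme.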

Year of Conference
2024
Conference Name
2nd IEEE International Conference on Networks, Multimedia and Information Technology, NMITCON 2024
Publisher
Institute of Electrical and Electronics Engineers Inc.
ISBN Number
979-835037289-2
DOI
10.1109/NMITCON62075.2024.10699069
Conference Proceedings
