Deploying large language models (LLMs) in resource-constrained environments is challenging due to their significant computational and memory demands. To address this challenge, various quantization techniques have been proposed to reduce a model's resource requirements while preserving its accuracy. This talk provides a comprehensive review of post-training quantization (PTQ) methods, highlighting their trade-offs and applications to LLMs. We explain techniques such as GPTQ, activation-aware weight quantization (AWQ), and SmoothQuant, and evaluate their performance on popular LLMs, including the Open Pre-trained Transformer (OPT) series and Meta's Llama-2. Our results demonstrate that these techniques can significantly reduce model size and computational requirements while maintaining accuracy, making the models suitable for deployment in edge environments.
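To make the core idea concrete, below is a minimal sketch of simple round-to-nearest (RTN) weight quantization, the baseline that methods such as GPTQ, AWQ, and SmoothQuant improve upon. It is an illustrative example only, not the implementation used in the evaluation; all function names, shapes, and the 4-bit setting are assumptions for demonstration.

```python
# Illustrative sketch: symmetric round-to-nearest weight quantization.
# This is the naive baseline that GPTQ/AWQ/SmoothQuant refine; names and
# shapes here are hypothetical and chosen only for demonstration.
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 4):
    """Quantize a weight matrix w of shape (out_features, in_features)
    symmetrically, with one scale per output channel."""
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for signed 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map integer codes back to approximate floating-point weights."""
    return q.astype(np.float32) * scale

# Toy usage: quantize a random "layer" and inspect the reconstruction error,
# which is the quantity more sophisticated PTQ methods try to minimize.
w = np.random.randn(8, 16).astype(np.float32)
q, s = quantize_per_channel(w, n_bits=4)
w_hat = dequantize(q, s)
print("mean absolute reconstruction error:", np.abs(w - w_hat).mean())
```

Methods like GPTQ reduce this reconstruction error layer by layer using second-order information, while AWQ and SmoothQuant adjust per-channel scales based on activation statistics before quantizing.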