Large language models are powerful but often impractical for embedded and on-prem systems due to latency, cost, privacy, and memory constraints. Small language models (SLMs), typically with single-digit billions of parameters or fewer, offer a deployable alternative but require different expectations and engineering choices. In this talk, Dwith Chenna will introduce the SLM landscape and its applications, with performance and accuracy comparisons against LLMs. Dwith will then examine the quantization techniques that matter for SLM deployment: gradient-based post-training quantization (GPTQ), SmoothQuant, and activation-aware weight quantization (AWQ). He will explain how each works and how to compare them using metrics such as perplexity, task accuracy (e.g., MMLU, ARC, HellaSwag), and runtime performance (e.g., tokens/sec and latency). Attendees will leave with a practical checklist for selecting, quantizing, and evaluating SLMs for real edge systems.
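Since the talk compares quantized and full-precision models by perplexity, a minimal sketch of how that metric is computed may be useful: perplexity is the exponential of the mean negative log-likelihood per token, so a lower value means the model assigns higher probability to the evaluation text. The log-probability values below are hypothetical and not taken from the talk.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    n = len(token_logprobs)
    nll = -sum(token_logprobs) / n  # average negative log-probability
    return math.exp(nll)

# Hypothetical per-token log-probabilities for the same text under a
# full-precision model and its quantized counterpart; a small increase
# in perplexity after quantization indicates limited quality loss.
fp16_logprobs = [-1.2, -0.8, -2.1, -0.5, -1.0]
int4_logprobs = [-1.3, -0.9, -2.4, -0.6, -1.1]

print(f"fp16 perplexity: {perplexity(fp16_logprobs):.3f}")
print(f"int4 perplexity: {perplexity(int4_logprobs):.3f}")
```

In practice these log-probabilities come from running the model over a held-out corpus with a sliding window; the same comparison is then repeated per quantization method (GPTQ, SmoothQuant, AWQ) to quantify accuracy degradation.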

