Self-compression is a quantization-aware training technique that reduces neural network size and optimizes performance for edge inference. By learning optimal bit depths for weights and activations during training, self-compression achieves significant reductions in memory footprint and bandwidth consumption while maintaining accuracy. The method combines high sparsity with low-bit representations, enabling efficient deployment on CPUs, GPUs, DSPs and NPUs without specialized hardware. Unlike traditional compression approaches, self-compression both removes redundant weights and minimizes the bits required for the remaining parameters. Experiments demonstrate accuracy matching floating-point baselines across applications, including perception CNNs (retaining as few as 3% of the original bits and 18% of the weights) and transformer-based language models (outperforming ternary compression). In this presentation, we explain how self-compression works, its practical implementation and its real-world benefits for embedded systems, offering a simple yet powerful solution for reducing inference costs (execution time, power consumption, bandwidth and memory usage).
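The core mechanism, learning bit depths during training, can be sketched as follows. This is an illustrative sketch, not the presenters' exact implementation: it assumes a standard power-of-two uniform quantizer with a learnable bit depth and exponent, a straight-through estimator for the rounding gradient (noted in comments, not implemented here), and a hypothetical `size_penalty` term that adds the network's size in bits to the training loss.

```python
def quantize(x, bits, exponent):
    # Uniform quantizer with learnable bit depth b and exponent e:
    #   q(x) = 2^e * clamp(round(x / 2^e), -2^(b-1), 2^(b-1) - 1)
    # During training, round() would use a straight-through gradient
    # so that b and e remain learnable alongside the weights.
    step = 2.0 ** exponent
    lo, hi = -(2.0 ** (bits - 1)), 2.0 ** (bits - 1) - 1
    return step * min(max(round(x / step), lo), hi)

def size_penalty(bit_params, counts, gamma=1e-3):
    # Hypothetical regularizer: add the network size in bits to the
    # loss, so training trades accuracy against compression. Tensors
    # whose learned bit depth reaches zero can be removed outright,
    # which is where the sparsity comes from.
    return gamma * sum(b * n for b, n in zip(bit_params, counts))
```

Because the bit-depth parameters are optimized jointly with the task loss, the network itself decides where precision is needed, rather than a fixed bit width being imposed uniformly after training.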

