Deployment of neural networks at the edge is often constrained by the rigidity and integration cost of domain-specific accelerators. In this talk, we address the “efficiency wall”: when an accelerator optimized for one architecture (e.g., CNNs) runs another (e.g., transformers), TOPS don’t translate into throughput.

Leveraging the extensible RISC-V ISA, we present a holistic hardware-software co-design that enhances a standard RISC-V CPU with vector/matrix extensions optimized for the low-precision tensor and elementwise operations found in CNNs and vision transformers—avoiding a separate accelerator and its integration overhead. Custom CPU instructions provide fine-grained datapath control, cutting data movement and power consumption. We’ll share results showing 4× better energy efficiency in physical AI applications.

Complementing the hardware, an ISA-aware software ecosystem streamlines moving models from PyTorch/TensorFlow to optimized implementations without manual kernel tuning, decoupling model definition from hardware specifics and enabling on-device AI in use cases from IoT to automotive.
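To give a flavor of the low-precision tensor operations such extensions target, here is a minimal NumPy sketch of a symmetric int8 matrix multiply with int32 accumulation—the pattern a matrix extension would execute in hardware. This is illustrative only; the function names and quantization scheme are assumptions, not the actual ISA or toolchain presented in the talk:

```python
import numpy as np

def quantize(x, scale):
    # Symmetric per-tensor quantization: float32 -> int8.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_matmul(a_q, b_q, a_scale, b_scale):
    # Accumulate in int32 to avoid overflow, then rescale to float32.
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc.astype(np.float32) * (a_scale * b_scale)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
a_s = np.abs(a).max() / 127
b_s = np.abs(b).max() / 127
out = int8_matmul(quantize(a, a_s), quantize(b, b_s), a_s, b_s)
# out approximates a @ b within quantization error
```

Running this pipeline on int8 operands rather than float32 is what lets narrow datapaths cut data movement: each operand is a quarter the width, and the wide accumulation happens only in registers.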

