Multimodal LLMs promise to bring exciting new abilities to devices! As foundation models become more capable, their compute requirements grow as well. LLMs now routinely reach tens of billions of parameters, growing faster than the capabilities of embedded processors. In this talk, we introduce the concept of a “neural cascade,” a scheme that divides computation across devices. We’ll present a recipe for constructing a neural cascade from a pre-existing LLM, and we’ll show how this system harmonizes edge and cloud devices to enable new experiences.
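As a rough sketch of the general cascade pattern (not the specific recipe presented in the talk), a small on-device model can serve requests it is confident about and escalate the rest to a larger cloud model. The function names, the confidence scoring, and the threshold below are illustrative assumptions.

```python
# Hypothetical stand-ins for the two tiers of a neural cascade.
# In a real system these would wrap an on-device model and a
# cloud-hosted model; here they are stubs for illustration.

def edge_generate(prompt: str) -> tuple[str, float]:
    """Small on-device model: returns a draft answer and a confidence score."""
    draft = f"[edge draft for: {prompt}]"
    confidence = 0.42  # placeholder; a real model would score its own output
    return draft, confidence

def cloud_generate(prompt: str) -> str:
    """Large cloud-hosted model: slower and costlier, but more capable."""
    return f"[cloud answer for: {prompt}]"

def cascade(prompt: str, threshold: float = 0.8) -> str:
    """Serve from the edge when the small model is confident enough;
    otherwise escalate the request to the cloud tier."""
    draft, confidence = edge_generate(prompt)
    if confidence >= threshold:
        return draft
    return cloud_generate(prompt)

if __name__ == "__main__":
    print(cascade("Describe this photo."))
```

The appeal of this split is that easy requests never leave the device, saving latency and cloud cost, while hard requests still get the quality of the larger model.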