Date: Monday, May 11
Start Time: 2:40 pm
End Time: 3:10 pm
What makes an image effective for marketing is subtle, and we’ve found that general-purpose vision-language models often miss the cues practitioners care about. In this talk, we present MarketingGenie, a domain-specialized VLM trained on ~20K marketing images annotated by experts for composition, lighting, emotional appeal and storytelling. MarketingGenie is ~100x smaller than GPT-4o yet scores significantly higher on marketing-specific evaluations. We’ll share the techniques that made it work: how we defined “marketing quality” and converted expert labels into consistent QA pairs, why fine-tuning an open model (LLaVA-8B) beat using a large API model on cost and controllability, and how a multi-encoder design (CLIP plus aesthetic and human-preference encoders) with learnable adapters improved critique quality. We’ll also cover our data-mixture strategy for avoiding catastrophic forgetting, a calibrated scoring head that grounds numeric ratings in reference images, and how we measure scoring accuracy against human judgments.

