Training strong models starts with building strong datasets. In this talk, we’ll present a practical, end-to-end view of how data strategy drives model quality, development speed, and iteration at scale. We’ll cover the core decisions in data collection: what to capture, how much is enough, and which operating-condition variables (lighting, viewpoints, environments, edge cases) most influence the required volume. We’ll then share curation principles for high-leverage training sets: matching real deployment conditions, maintaining balance across classes and scenarios, and pruning data that is irrelevant or misleading. Next, we’ll discuss task-specific labeling, including how we handle ambiguity, reduce inconsistency, and implement quality checks. Finally, we’ll focus on evaluation data and iteration: building representative holdout sets, segmenting performance to expose failure modes, and running a data-refinement loop in which deployment feedback guides targeted new collection and dataset updates. Attendees will leave with a repeatable framework for making data decisions that scale.

