Date: Wednesday, May 18 (Main Conference Day 2)
Start Time: 2:05 pm
End Time: 2:35 pm
Surprisingly, in 2022 reproducibility is still a major pain point in most data science workflows. Version control is critical to reproducibility. Unfortunately, machine learning notoriously lacks standards for version control, so developers typically resort to crafting ad hoc workflows. Developers also frequently reinvent the wheel because they are unaware of existing solutions. In this talk, we introduce DVC, short for “Data Version Control,” an open-source tool that we have found can significantly alleviate the pain of reproducibility in data science workflows. We will cover the motivation for such a tool, dig into its main features, and hopefully convince you that your life will be much better if you integrate it into your next project. Everything will be illustrated through a real-world example of an end-to-end ML pipeline.