EasyData Talks and Tutorials

Amy Wooding edited this page Jan 27, 2023 · 6 revisions

Talks

Summary

How do you make your data science processes reproducible? There are a lot of really hard problems hidden in that question. This talk describes a multi-year journey toward reproducibility at the Tutte Institute and exposes the tradeoffs that we have made to get close to that goal.

Description

It sounds so easy on paper: "Let's make our data science processes reproducible!" And while parts of the journey feel pretty good, we soon find ourselves confronting the hard problems that we initially pushed aside. Looking closely, these problems seem harder than the problem we started with, and so on, ad infinitum.

Getting to reproducibility feels like Zeno's paradox: to get to our finish line (reproducibility!), we must first go halfway (through the hard parts); before we can go halfway, we have to go a quarter of the way (through the REALLY hard parts), and so on, until we find ourselves confronting an infinitude of really, really hard problems.

This talk describes our journey towards reproducibility: the tools, techniques, workflows, and brutal hacks that have gotten us ever closer to the holy grail of reproducible data science. Along the way, we dig into some of the hard problems we have faced, and the even harder sub-problems that underlie them, and the compromises we have made to draw a line that is "close enough" to our finish line: reproducible data science for heterogeneous workgroups.

Summary

Conda environments can be fantastic for managing your data science dependencies. They can also be fragile, conflict-riddled, disk-filling monsters. Wouldn't it be great if we could easily maintain, delete, and reproduce these environments on a project-by-project basis? We can, and all it takes is a little Makefile magic.

Description

At our shop, we had a problem: our conda environments were a mess. Most of us kept one or two monolithic environments per Python version around (conda activate data_science_37, anyone?), but these environments quickly became fragile and unmaintainable. Upgrading packages was nearly impossible because of version conflicts with other installed packages. Switching machines was a nightmare, as we were never really sure which packages were required for a particular application. We couldn't easily fix environments, and we couldn't delete them. We didn't know how to recreate them, and so we had no easy way to share them. We were stuck.

In desperation, we started scripting our conda environment creation. Since we were already using make for our data pipelines, we started stashing the creation code there, forcing ourselves to create a unique conda environment for each git repo and checking it in with the rest of the codebase.
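The idea of stashing environment creation in make can be sketched as a pair of Makefile targets. This is a minimal illustration, not the actual targets from the talk's repo; the variable name PROJECT_NAME and the use of an environment.yml in the repo root are assumptions:

```make
# Assumed: one environment per git repo, named after the project,
# defined by an environment.yml checked in at the repo root.
PROJECT_NAME := my_project

.PHONY: create_environment delete_environment

create_environment:  ## Create the project's conda environment from environment.yml
	conda env create --name $(PROJECT_NAME) --file environment.yml

delete_environment:  ## Remove the project's conda environment entirely
	conda env remove --name $(PROJECT_NAME)
```

With targets like these, an environment becomes disposable: if it breaks, `make delete_environment && make create_environment` rebuilds it from the checked-in spec.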

Over time, we tweaked these Makefile targets to work around some long-standing limitations of our conda setups. We added lockfiles and self-documenting targets. We found reliable ways to mix pip and conda (in the odd cases where it was needed), and started making heavy use of editable Python modules in our workflow. It worked out better than we ever imagined. Our work became reproducible, portable, and better documented.
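Two of those tweaks can be sketched in a few lines of Makefile. This is a hedged illustration of the general patterns, not the repo's actual implementation; the lockfile name and the `##` comment convention are assumptions:

```make
# Lockfile pattern: after resolving environment.yml, export the exact
# package versions so the environment can be recreated byte-for-byte later.
environment.lock.yml: environment.yml
	conda env update --name $(PROJECT_NAME) --file environment.yml --prune
	conda env export --name $(PROJECT_NAME) > environment.lock.yml

# Self-documenting targets: any target annotated with a trailing "## comment"
# shows up in `make help`, so the Makefile documents itself.
help:  ## Show this help message
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) \
		| awk 'BEGIN {FS = ":.*?## "}; {printf "%-24s %s\n", $$1, $$2}'
```

The editable-module trick mentioned above is typically a `pip install -e .` of the repo's own package, which lets code changes take effect without reinstalling.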

In this talk, I walk you through the challenges of creating a reproducible, maintainable data science environment using little more than conda, environment.yml, Makefiles, and git, in hopes that you too will be able to make your conda environments more manageable.

Repo

https://github.com/hackalog/make_better_defaults

Tutorials

Summary

Tired of wasting your time and energy re-doing work that you’ve done before? Want to reduce the hidden costs that come with collaboration? In this hands-on tutorial, we’ll uncover the overlooked parts of making your data science workflow reproducible. You’ll learn about gotchas, reproducibility bugs, and better defaults along the way.

Repo

https://github.com/acwooding/easydata-tutorial

Summary

How fragile is your data science pipeline? Can you recover from a laptop crash? Can you reproduce last year's analysis? (Can your co-workers?) This tutorial will take you through the process of making your data science work reproducible: from using the right tools, to creating reproducible workflows, to patterns for testing, automating, and sharing your results.

Repo

Warning: based on a very old version of EasyData, so the implementation is significantly out of date.

https://github.com/hackalog/bus_number