
Reproducible Machine Learning with Jupyter and Quilt

In this guest blog post, Aneesh Karve, Co-founder and CTO of Quilt, demonstrates how Quilt works in conjunction with Domino’s Reproducibility Engine to make Jupyter notebooks portable and reproducible for machine learning.

Code dependencies are simple to express. Data dependencies, on the other hand, are messier: custom scripts acquire files from the network, parse files in a variety of formats, populate data structures, and wrangle data.
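To make the contrast concrete, here is a sketch of the kind of one-off acquisition script that data dependencies often require; the URL and column names are placeholders:

```python
# A typical hand-rolled data dependency: fetch, parse, and wrangle by hand.
# The URL and column names below are placeholders.
import io
import urllib.request

import pandas as pd

URL = "https://example.com/data/training.csv"

raw = urllib.request.urlopen(URL).read()                  # acquire the file from the network
df = pd.read_csv(io.BytesIO(raw), parse_dates=["Date"])   # parse it into a DataFrame
df = df.dropna().set_index("Date")                        # wrangle it into shape
```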

We can think of reproducible machine learning as an equation in three variables:

code + data + model = reproducible machine learning

The open source community has produced strong support for reproducing the first variable, code.

We can import the data package into the notebook as follows. If we evaluate pb.titanic in Jupyter, we'll see that it's a GroupNode that contains DataNodes. We can access the data in pb.titanic by calling those DataNodes; note the parentheses in the code sample below.
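Here is a minimal sketch of that workflow; the package name, import alias, and node names are illustrative, since the original notebook's exact package is not reproduced here:

```python
import quilt

# Install the packaged Titanic data (the package name is illustrative)
quilt.install("examples/titanic_demo")

# Installed packages import like ordinary Python modules
from quilt.data.examples import titanic_demo as pb

pb.titanic          # a GroupNode that contains DataNodes
pb.titanic.train    # a DataNode

# Calling a DataNode (note the parentheses) materializes it as a pandas DataFrame
df = pb.titanic.train()
df.head()
```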

Let’s convert our training data into numpy arrays that are usable in scikit-learn. Next, let’s train a random forest classifier on our data, followed by a five-fold cross-validation to measure our accuracy. The model scores 81% mean accuracy.
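A sketch of those steps, assuming df is the DataFrame loaded from the package above; the feature and label column names are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Convert the training DataFrame into numpy arrays (column names are illustrative)
features = ["Pclass", "SibSp", "Parch", "Fare"]
X = df[features].values
y = df["Survived"].values

# Train a random forest, then measure accuracy with five-fold cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())   # the original post reports roughly 0.81
```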

You can load the model as follows. To verify that it's the same model we trained above, repeat the cross-validation. Oftentimes a single Jupyter notebook depends on multiple data packages.
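A sketch of loading the packaged model and repeating the check; it assumes the trained classifier was serialized (for example with joblib) and stored as a file in a Quilt package, and the package and node names are illustrative:

```python
import joblib
from sklearn.model_selection import cross_val_score

# The packaged model; the package and node names are illustrative
from quilt.data.examples import titanic_model

# Calling a raw-file DataNode returns the path of the underlying file
clf_loaded = joblib.load(titanic_model.model())

# Repeating the five-fold cross-validation should reproduce the same mean accuracy
scores = cross_val_score(clf_loaded, X, y, cv=5)
print(scores.mean())
```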

Example: super-resolution imaging with PyTorch and Quilt

In this article, we'll train a PyTorch model to perform super-resolution imaging, a technique for gracefully upscaling images.

Machine learning projects typically begin by acquiring data, cleaning the data, and converting the data into model-native formats.

Such manual data pipelines are tedious to create and difficult to reproduce over time, across collaborators, and across machines.

In this article, we'll create reusable units of data that deploy like PyPI packages. If you've ever tried to store data on GitHub, you may have discovered that large files are not welcome.

Take a look at the files on disk. Optionally, add a README.md file so that your data package is self-documenting. To convert these files into a versioned data package, we'll need to install Quilt. (Windows users: first install the Visual C++ Redistributable for Visual Studio 2015.)
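As a sketch, the package can then be built and pushed with Quilt's Python API (install it first with pip install quilt); the user and package names below are placeholders, the same steps exist as quilt CLI commands, and keyword names may differ slightly across quilt versions:

```python
import quilt

# First, `quilt generate BSDS300` (CLI) writes a build.yml that mirrors the
# directory's contents. Then build a versioned package from that build file
# and push it to the registry; the user/package name below is a placeholder.
quilt.build("examples/BSDS300", "BSDS300/build.yml")
quilt.push("examples/BSDS300", is_public=True)
```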

Quilt prepends n to file names that begin with a digit so that every package node is a valid Python identifier, accessible with Python's dot operator or with brackets.
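For example (the package path and node names below are illustrative):

```python
from quilt.data.examples import BSDS300

# A file named "3096.jpg" becomes the node n3096, a valid Python identifier
node = BSDS300.images.train.n3096     # dot access
same = BSDS300.images.train["n3096"]  # equivalent bracket access

# Calling a raw-file node returns the path of the locally cached file
print(node())
```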

In order for a model to learn to upscale images, it requires a training corpus of high-resolution images (in our case, the BSDS300 training set).

Quilt provides a higher-order function, asa.pytorch.dataset(), that converts packaged data into a torch.utils.data.Dataset object. For a full code sample, see this fork of pytorch-examples.
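A sketch of the call pattern, with illustrative package and node names; the exact keyword arguments of asa.pytorch.dataset() may differ across quilt versions, so treat this as the shape of the API rather than a verbatim signature:

```python
from PIL import Image
from torch.utils.data import DataLoader
from torchvision.transforms.functional import to_tensor

from quilt.asa.pytorch import dataset
from quilt.data.examples import BSDS300   # illustrative package path

def node_parser(node):
    # node() returns the path of the packaged image file; load the luma
    # channel and convert it to a tensor
    img = Image.open(node()).convert("YCbCr").split()[0]
    return to_tensor(img)

# Passing asa=... hands the node's files to PyTorch as a torch.utils.data.Dataset
train_set = BSDS300.images.train(asa=dataset(node_parser))
loader = DataLoader(train_set, batch_size=1, shuffle=True)
```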

The repository quiltdata/pytorch-examples contains an entrypoint script, train_super_resolution.sh, which calls main.py to install dependencies, train the model, and persist model checkpoints to disk. You can clone this Paperspace Job to train the model in your own account.

In order for inference to work, be sure that your model checkpoints are saved in /storage/models/super_resolution (as shown in the training scripts above), or that you update the code to use a different directory.

Code + data + model = reproducibility

By adding versioned data and versioned models to our workflow, we make it easier for developers to get consistent results over time, across machines, and across collaborators.

If you wish for your data packages to live in a specific directory, for example on a shared drive, here's how to do it correctly: create a quilt_packages directory there.

Use an environment variable to tell Quilt where to search for packages. Quilt de-duplicates files, serializes them to high-speed formats, and stores them under a unique identifier (the file's SHA-256 hash) in quilt_packages.
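A sketch of that configuration; QUILT_PRIMARY_PACKAGE_DIR is the variable name I believe Quilt reads for this, but verify it against your installed version's documentation, and the path below is a placeholder:

```python
import os

# Point Quilt at the shared quilt_packages directory
# (set this before importing quilt or any quilt.data packages)
os.environ["QUILT_PRIMARY_PACKAGE_DIR"] = "/mnt/shared/quilt_packages"

import quilt   # quilt now stores and searches for packages on the shared drive
```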

Data Packages for Fast, Reproducible Python Analysis

The tragedy of data science is that 79% of an analyst’s time goes to data preparation.

You could locate the source data, download it, parse it, index the date column, and so on (as Jake Vanderplas demonstrates), or you could install the data as a package in less than a minute. We can then load the data directly into Python. In contrast to files, data packages require very little data preparation.
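A sketch of that install-and-import step; the package name below is a placeholder rather than the exact package used in the original post:

```python
import quilt

# Install the data package from the registry (package name is a placeholder)
quilt.install("examples/fremont_bike")

# Installed packages import like Python modules
from quilt.data.examples import fremont_bike

df = fremont_bike.counts()   # calling the DataNode returns a ready-to-use DataFrame
df.head()
```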

A typical file-based workflow looks like this: download files, discover file formats, write scripts to parse, clean, and load the data, run the scripts, and finally begin analysis.

quilt install is similar in spirit to git clone or npm install, but it scales to big data, keeps your source code history clean, and handles serialization.

To simplify dependency injection, Quilt rolls data packages into a Python module so that you can import data like you import code. Importing large data packages is fast, since disk I/O is deferred until the data are referenced in code.

quilt generate creates a build file that mirrors the contents of any directory. Let's open the file that we just generated, src/build.yml. The contents key dictates the structure of a package.

Oh, and let’s index on the “Date” column. counts, or any name that we write in its place, is the name that package users will type to access the data extracted from the CSV file; a sketch of the edited build file follows.
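The file name below is a placeholder, and the kwargs are passed through to pandas.read_csv; the key names follow the Quilt 2 build-file format as I recall it, so verify them against the docs:

```yaml
contents:
  counts:                  # the name package users will type, e.g. pkg.counts()
    file: fremont-bridge.csv
    transform: csv         # parse with pandas.read_csv
    kwargs:
      index_col: Date      # index on the "Date" column
      parse_dates: [Date]
```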

quilt log tracks changes over time, and quilt install -x allows us to install historical snapshots. The upshot for reproducibility is that we no longer run models on “some data,” but on specific hash versions of specific packages.
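A sketch of pinning an analysis to a package hash; the package name and hash below are placeholders, and the hash keyword is assumed to mirror the CLI's -x flag, so verify the exact Python signature against your quilt version:

```python
import quilt

# Show the package's history of hashes
quilt.log("examples/fremont_bike")

# Install a specific historical snapshot by its hash (placeholder shown)
quilt.install("examples/fremont_bike", hash="6cbc6b9")
```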

Data Science with Juliet Hougland and Michelle Casbon: GCPPodcast 130

Original post: Juliet Hougland and Michelle Casbon are ..