AI News: The Real Value of Containers for Data Science

The Real Value of Containers for Data Science

This is the real value of containers in data science: the ability to capture an experiment’s state (data, code, results, package versions, parameters, etc.) at a point in time, making it possible to reproduce an experiment at any stage in the research process.
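As a concrete sketch of what that looks like with Docker (the image name, tag, and training script here are hypothetical, not taken from the article):

    # Build an image that bakes in the code and pinned package versions for this
    # experiment, tagged with a date so the state is frozen at that point in time.
    docker build -t churn-experiment:2018-03-01 .

    # Later, reproduce the run from exactly that environment.
    docker run churn-experiment:2018-03-01 python train.py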

Reproducibility is critical for quantitative research in regulated environments — for example, documenting the provenance of a lending model to prove it avoids racial bias.

Even in non-regulated contexts, however, reproducibility is critical to the success of a data science program, because it helps capture assumptions and biases that may be surfaced later on.

This accumulation may be driven by a group of data scientists collaborating over a period of time, or a lone data scientist building on past experience.

How Docker Can Help You Become A More Effective Data Scientist

By Hamel Husain. For the past five years, I have heard lots of buzz about Docker containers.

I wanted to figure out how this technology could make me more effective, but the tutorials I found online were either too detailed, elucidating features I would never use as a data scientist, or too shallow, not giving me enough information to become effective with Docker quickly.

There is a wide variety of Docker images hosted on DockerHub, including images that provide more than a bare operating system. For example, if you want a container with Anaconda already installed, you can build on top of the official Anaconda Docker image.
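For instance, a minimal Dockerfile built on top of the official Anaconda image might look like the following sketch (the extra library and working directory are placeholders, not from the article):

    # Start from the official Anaconda 3 image on DockerHub.
    FROM continuumio/anaconda3

    # Add any extra libraries the project needs (placeholder example).
    RUN pip install lightgbm

    # Set the default working directory inside the container.
    WORKDIR /ds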

If you navigate to the Ubuntu DockerHub repo, you will notice that different versions of Ubuntu correspond to different tags. For example, ubuntu:16.04, ubuntu:xenial-20171201, ubuntu:xenial, and ubuntu:latest all refer to Ubuntu version 16.04 and are aliases for the same image.

Why? If you look closely at the repo's tag listing, you will see that the :latest tag is associated with 16.04. One last note about Docker images: exercise sensible judgment when pulling random Docker images from DockerHub.
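In practice, that means each of the following pulls fetched the same underlying image at the time the article was written (tag aliases change over time, so treat this as a snapshot):

    # Each of these pulls fetches the same underlying image, because the tags are aliases.
    docker pull ubuntu:16.04
    docker pull ubuntu:xenial
    docker pull ubuntu:latest

    # The shared IMAGE ID in the listing confirms they point at one image.
    docker images ubuntu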

In this case, I’m installing some utilities that I like, such as curl, htop, and byobu, and then installing Anaconda, followed by other libraries that do not come with the base Anaconda install (see the full Dockerfile in the original article for all of the RUN statements).

The commands after the RUN keyword have nothing to do with Docker; they are the normal Linux commands you would run if you were installing these packages yourself, so do not worry if you are not familiar with some of these packages or Linux commands.
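A condensed sketch of what those RUN statements might look like (the package list, Anaconda version, and install path are illustrative, not the article's exact Dockerfile):

    # Install a few command-line utilities with the system package manager.
    RUN apt-get update && apt-get install -y curl htop byobu

    # Download and install Anaconda, then put it on the PATH (illustrative version).
    RUN curl -O https://repo.anaconda.com/archive/Anaconda3-5.0.1-Linux-x86_64.sh && \
        bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p /opt/anaconda3
    ENV PATH=/opt/anaconda3/bin:$PATH

    # Add libraries that are not part of the base Anaconda install.
    RUN pip install xgboost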

As the Docker user guide explains, these volumes are meant to persist data outside the filesystem of a container, which is often useful if you are working with large amounts of data that you do not want to bloat the Docker image with.
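For example, a volume can be declared in the Dockerfile and a host directory mapped onto it at run time; the paths and image name below are made up for illustration:

    # In the Dockerfile: declare /ds/data as a mount point that lives outside the image.
    VOLUME /ds/data

    # At run time: map a large dataset directory on the host onto that mount point,
    # so the data does not have to be baked into the image itself.
    docker run -v /home/me/datasets:/ds/data my-ds-image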

For example, the last statement in the Dockerfile is a COPY statement, which assumes the working directory is /ds. This instruction copies files from the host computer into the Docker image when the image is built.

Notice how the path on the host is not fully specified here; the host path is relative to the context directory that you specify when the image is built (which is discussed later in the article).
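A hedged illustration of such a statement (the filenames and image name are placeholders, not the article's actual Dockerfile):

    # In the Dockerfile: copy a script from the build context into /ds in the image.
    COPY run_jupyter.sh /ds/run_jupyter.sh

    # On the host: the final "." names the context directory, so run_jupyter.sh
    # is resolved relative to the current directory at build time.
    docker build -t my-ds-image .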

However, if you do not have a specific application you want to run but you want your container to keep running without exiting, you can simply run the bash shell instead. This works because the bash shell does not terminate until you exit it, so the container stays up and running.
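One way to do that looks like the sketch below (the image name is a placeholder; the article's exact command is not included in this excerpt):

    # Start the container with an interactive bash shell as its main process;
    # the container keeps running until you exit the shell.
    docker run -it my-ds-image /bin/bash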

Why data scientists should love Linux containers

Linux containers make it easy for teams to deploy, manage, and scale distributed applications and for operators to exploit compute capacity in the cloud.

You’ll learn how containers fulfill the promise of reproducible research, ease moving techniques from prototype to production, enable painless publishing and collaboration workflows, and empower you to safely develop techniques against sensitive data in a production environment from the comfort of your laptop.

There are myriad tutorial resources explaining how to build and run container images, but these largely assume an audience whose primary responsibilities include packaging, releasing, and managing applications.


Use containers to enhance research reproducibility

A brute-force way to get the same software environment on different computers is to use a virtual machine (VM) to deliver the entire system.

But VMs incur a significant performance penalty since a new operating system needs to run inside the existing system.

(Their underlying implementations are vastly different, but that is out of scope here.) Docker is the most widely used container technology in the software world.

Also, domain scientists might find Docker’s workflow quite unintuitive because it is mainly designed for web apps, not for numerical computing.

Enhancing reproducibility in scientific computing: Metrics and registry for Singularity containers

The pairwise calculation needed to compare many containers is not computationally fast (2-5 seconds per container across all levels, depending on size).

For a user making a small number of comparisons on the fly or locally, our software is reasonable to use. For a user making many comparisons, we recommend the standard approach to scaling: running jobs on a cluster or in parallel, or using the singularity-python functions to cache hashes.

We considered another heuristic that would not require reading the bytes of each file, such as file size; while we think such comparisons would likely be accurate, they have the potential to produce false positives (files assessed as equal based on size that are not truly equal in content).

Further, we would want the user to be able to look for an entire set of files pertaining to a particular software installation, and we would want to be able to map a standard location in one operating system to a (possibly different) standard location in another.

For example, a series of files that are not relevant for the function of the image, perhaps leftover output files, would be included in the total (in the set represented in one image but not another) and make the images look very different even if the core software (the function of the image) was equivalent.
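To make the idea concrete, here is a rough shell sketch of that kind of content-based comparison; it is not the singularity-python implementation, and the rootfs_a and rootfs_b directories stand in for two unpacked container filesystems:

    # Hash every file in each unpacked container root (directory names are placeholders).
    find rootfs_a -type f -exec sha256sum {} + | awk '{print $1}' | sort -u > hashes_a.txt
    find rootfs_b -type f -exec sha256sum {} + | awk '{print $1}' | sort -u > hashes_b.txt

    # Score similarity as shared hashes over total unique hashes (a Jaccard-style ratio);
    # unlike a size-only heuristic, files must match byte for byte to count as equal.
    shared=$(comm -12 hashes_a.txt hashes_b.txt | wc -l)
    total=$(sort -u hashes_a.txt hashes_b.txt | wc -l)
    echo "scale=3; $shared / $total" | bc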


Singularity: Containers for Science, Reproducibility, and HPC

In this video from the 2017 HPC Advisory Council Stanford Conference, Greg Kurtzer from LBNL presents: Singularity: Containers for Science, Reproducibility, ...

A Docker Container Toolbox for the Data Scientist by Douglas Liming, SAS Institute Inc.

A major financial institute supports its analytic workload via Docker ...

Using Docker Containers to Improve Reproducibility in PL/SE Research

The ability to replicate and reproduce scientific results has become an increasingly important topic for many academic disciplines. In computer science and, ...

Creating Reproducible Data Science Workflows using Docker Containers

Aly Sivji. Jupyter notebooks make it easy to create reproducible workflows that can be distributed across groups and organizations. This is a simple ...

Singularity HPC container for Supercomputing

Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and ...

DataLearn - Docker for reproducible research - DataKind SG

Speaker: We're holding a DataLearn on how to use Docker within a data project! What is this DataLearn? This DataLearn is to prepare volunteers for our ...

Manage Reproducibility of Computational Workflows with Docker Containers and Nextflow

In this video from the 2016 HPC Advisory Council Switzerland Conference, Paolo Di Tommaso from the Center for Genomic Regulation presents: Manage ...

Data Provenance and Reproducibility with Pachyderm

Versioning isn't just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility.

Singularity and SUSE: Singularity for HPC

Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and ...