- On Monday, June 4, 2018
(This is part 2 of a two-part series of blog posts about doing data science and engineering in a containerized world; see part 1 here.) Let's admit it: data scientists are developing some pretty sweet (and potentially valuable) models, optimizations, visualizations, etc.
happening in industry is happening in isolation on data scientists' laptops, and, when data science applications are actually deployed, they are often deployed as hacky Python/R scripts uploaded to AWS and run as a cron job.
-Robert Chang, data scientist at Twitter "Data engineers are often frustrated that data scientists produce inefficient and poorly written code, have little consideration for the maintenance cost of productionizing ideas, demand unrealistic features that skew implementation effort for little gain… The list goes on, but you get the point".
In the following, I will present a very simple example of a data science application that: For this example, we are going to build a k-NN classification model (with scikit-learn) using the famous Iris dataset: This predict function will return a species of Iris based on input features inputFeatures (sepal length, sepal width, petal length, and petal width).
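The code for this example did not survive in the text above, so the following is only a minimal sketch of what such a predict function might look like. It assumes scikit-learn's KNeighborsClassifier with five neighbors (the library default; the original may have used different settings). The name inputFeatures comes from the text; everything else is illustrative.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier


def predict(inputFeatures):
    """Return the Iris species name for one sample.

    inputFeatures: [sepal length, sepal width, petal length, petal width] in cm.
    """
    # Load the static Iris dataset bundled with scikit-learn.
    iris = datasets.load_iris()

    # Fit a k-NN classifier; n_neighbors=5 is an assumption (the library default).
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(iris.data, iris.target)

    # The classifier expects a 2D array: one row per sample.
    label = knn.predict([inputFeatures])[0]
    return str(iris.target_names[label])


if __name__ == "__main__":
    print(predict([5.1, 3.5, 1.4, 0.2]))
```

Refitting on every call keeps the sketch short; a deployed service would more likely fit the model once at startup and reuse it.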
The dataset on which the model is trained is static in this case (i.e., loaded from the scikit-learn datasets); however, it is easy to imagine how you could dynamically load a dataset or aggregated values here via messaging, API(s), or database interactions.
However, assuming you want to play around with this locally first, you can run the Docker image via: This will run the docker image as a container named myiris, as a daemon (-d), and using the same network interface as the localhost (--net host).
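The command itself is missing from the text above. As a rough sketch, the flags it describes could be assembled as follows. The image name my-iris-image is a placeholder assumption, since the original image name is not given; myiris, -d, and --net host come from the description.

```python
import shlex


def build_run_command(image, name="myiris"):
    """Assemble a `docker run` argument list for a detached container."""
    return [
        "docker", "run",
        "--name", name,    # container name
        "-d",              # run as a daemon (detached)
        "--net", "host",   # share the host's network interface
        image,
    ]


if __name__ == "__main__":
    # "my-iris-image" is a hypothetical image name used only for illustration.
    cmd = build_run_command("my-iris-image")
    # Print the shell form; subprocess.run(cmd, check=True) would execute it.
    print(shlex.join(cmd))
```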
How Docker Can Help You Become A More Effective Data Scientist
By Hamel Husain
For the past 5 years, I have heard lots of buzz about Docker containers.
I wanted to figure out how this technology could make me more effective, but I found the tutorials online either too detailed (elucidating features I would never use as a data scientist) or too shallow (not giving me enough information to understand how to be effective with Docker quickly).
There is a wide variety of Docker images hosted on DockerHub, including images that provide more than an operating system. For example, if you want a container with Anaconda already installed, you can build a container on top of the official Anaconda Docker image.
If you navigate to the Ubuntu DockerHub repo, you will notice that different versions of Ubuntu correspond with different tags: For example ubuntu:16.04, ubuntu:xenial-20171201, ubuntu:xenial, and ubuntu:latest all refer to Ubuntu version 16.04 and are all aliases for the same image.
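To make the tag idea concrete, here is a small sketch that splits an image reference into repository and tag, falling back to latest when no tag is given (which mirrors how Docker treats an untagged reference). The helper name split_image_ref is made up for this illustration, and it deliberately ignores digest references of the form name@sha256:...

```python
def split_image_ref(ref):
    """Split 'repo[:tag]' into (repo, tag); the tag defaults to 'latest'."""
    # Only a colon after the last '/' separates a tag; an earlier colon
    # belongs to a registry host and port, e.g. localhost:5000/ubuntu.
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > last_slash:
        return ref[:colon], ref[colon + 1:]
    return ref, "latest"


if __name__ == "__main__":
    for ref in ["ubuntu:16.04", "ubuntu:xenial", "ubuntu"]:
        print(ref, "->", split_image_ref(ref))
```

Note that the sketch only parses names. Whether two tags are aliases for the same image is decided by the registry, not by the tag strings.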
Why? If you look closely at the above screenshot, you will see that the :latest tag is associated with 16.04. One last note about Docker images: exercise sensible judgment when pulling random Docker images from DockerHub.
In this case, I’m installing some utilities that I like, such as curl, htop, and byobu, and then installing Anaconda, followed by other libraries that do not come in the base Anaconda install (scroll up to the full Dockerfile to see all of the RUN statements).
The commands after the RUN statement have nothing to do with Docker but are normal linux commands that you would run if you were installing these packages yourself, so do not worry if you aren’t familiar with some of these packages or linux commands.
From the docker user guide: Furthermore, these volumes are meant to persist data outside the filesystem of a container, which is often useful if you are working with large amounts of data that you do not want to bloat the docker image with.
For example, the last statement in the Dockerfile is: Which assumes the working directory is /ds. This command allows you to copy files from the host computer into the Docker image when the image is built.
Notice how the path on the host is not fully specified here, as the host path is relative to the context directory that you specify when the image is built (which is discussed later).
However, if you do not have any specific application you want to run but you want your container to run without exiting, you can simply run the bash shell instead with the following command: This works because the bash shell does not terminate until you exit out of it, thus the container stays up and running.
A Step Towards Reproducible Data Science: Docker for Data Science Workflows
You have a clean laptop and you need to install TensorFlow on your system, but you are lazy (yes we all are sometimes).
Once all the intermediate layers are downloaded, run: docker images to check whether our docker pull was successful.
To run the image, run the command: docker run -it -p 8888:8888 tensorflow/tensorflow. Now the above docker run command packs in a few more command line arguments.
Playing with Caffe and Docker to build Deep Learning models.
But this statement isn’t entirely correct, as I found out later. It also helps data scientists while building ML/DL models in the following ways.
My goal for this post is to give you hands-on experience of building deep learning models using Docker.
This is something I couldn’t find on the Internet; with the help of my colleague Prathamesh Sarang, I gained enough knowledge of Docker and thought of sharing my experience.
RCNN, Fast RCNN and Faster RCNN (object detection algorithms), which I needed to implement and experiment with for my project, were developed using Caffe.
It took me just 6 hours (minus the time I spent on installation setup) to learn Caffe and run my first deep learning model.
I will also share a few other links that helped me in the process: There are 4 steps in building deep learning models. It’s platform dependent, and you can check the official Docker documentation for more details.
To run the above file, you need to go to the particular folder and execute the following command:
FROM bvlc/caffe:cpu: This is the Docker image you are pulling from Docker Hub.
Here I kept all my data in the data folder, and all the code (Jupyter notebooks) is saved inside my notebooks folder.
Using Docker Containers For Data Science Environments
For a data scientist, running a container that is already equipped with the libraries and tools needed for a particular analysis eliminates the need to spend hours debugging packages across different environments or configuring custom environments.
Rather than building a new environment for every analysis, your IT team can put the tools and packages required for certain types of analyses (e.g., scikit-learn, TensorFlow, Jupyter, etc.) into a container, create an image of that container, and have every user boot up an isolated, standardized environment from that image.
We’ve created a number of pre-baked images for deep learning, natural language processing, and other data science techniques for this purpose that can be used in RStudio and Jupyter sessions on our platform.
The process of getting an environment up and running varies from company to company, but in some cases, a data scientist must submit a formal request to IT and wait for days or weeks, depending on the backlog.
We provide plenty of standard environment templates to choose from. Ultimately, containers solve a lot of common problems associated with doing data science work at the enterprise level.
They take the pressure off of IT to produce custom environments for every analysis, standardize how data scientists work, and ensure that old code doesn’t stop running because of environment changes.
Demystifying Docker for Data Scientists – A Docker Tutorial for Your Deep Learning Projects
If you too are wondering what the fuss is all about, or how to leverage Docker in your data science work (especially for deep learning projects), you’re in the right place.
As a data scientist, I find Docker containers to be especially helpful as my development environment for deep learning projects for the reasons outlined below.
Getting the right Anaconda distribution and the correct version of Python, setting up the paths, installing the correct versions of different packages, and ensuring that the installation does not interfere with other Python-based installations on your system is not a trivial exercise.
Even if you manage to get the framework installed and running on your machine, every time there’s a new release, something could inadvertently break.
Making Docker your development environment shields your project from these version changes until you are ready to upgrade your code to make it compatible with the newer version.
When sharing a project via a container, you are sharing not only your code but your development environment as well, ensuring that your script can be reliably executed and your work faithfully reproduced.
They are not VMs but you can think of them as fully functional and isolated operating systems with everything you need to run your scripts already installed and yet very lightweight.
The tutorial below starts by downloading the right image, starting a container with that image and interacting with the container to perform various tasks.
To run the GPU versions of these Docker containers (only available on Linux), we will need to use nvidia-docker rather than docker to launch the containers (basically replace all occurrences of docker with nvidia-docker in all the commands).
To create a new container, we must specify an image name from which to derive the container and an optional command to run (/bin/bash here to access the bash shell).
The docker run command first creates a writeable container layer over the specified image, and then starts it using the specified command.
We get a new container every time the docker run command is executed, allowing us to have multiple instances of the same image.
By executing the docker ps -a command, we can see the list of all containers on our machine (both running and stopped containers).
To start the deep learning project, I will jump inside the container in a bash shell and use it as my development environment.
The above command starts the container in interactive mode and puts us in a bash shell, as though we were working directly in our terminal.
To restart the stopped container and jump inside the container shell, we can use the docker start command with the -a option.
The docker exec command is used to run a command in a running container; the command above was /bin/bash, which gives us a bash shell, and the -it flags put us inside the container.
Next I will copy the training and test data along with my Python script from my local machine to the working folder in my container mycntkdemo using the docker cp command.
Once we have the output from running our script, we could transfer it back to our local machine using the docker cp command again.
Alternatively, we could map the folder C:\dockertut on our machine (host machine) to the directory mylightgbmex in the Docker container when starting the container by using the -v flag with the docker run command.
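The two ways of moving files described above (docker cp versus a -v mapping) can be sketched side by side. The exact commands are not shown in the text, so the file name train.csv and the container path /mylightgbmex are assumptions made for illustration; the container name mycntkdemo, the host folder C:\dockertut, and the image name are taken from the article.

```python
def cp_to_container(container, host_path, container_path):
    """`docker cp` from the host into a container."""
    return ["docker", "cp", host_path, f"{container}:{container_path}"]


def cp_from_container(container, container_path, host_path):
    """`docker cp` from a container back to the host."""
    return ["docker", "cp", f"{container}:{container_path}", host_path]


def run_with_volume(image, host_dir, container_dir, command="/bin/bash"):
    """`docker run` with host_dir mapped to container_dir via -v."""
    # The short -v syntax is HOST_PATH:CONTAINER_PATH.
    return ["docker", "run", "-it", "-v", f"{host_dir}:{container_dir}", image, command]


if __name__ == "__main__":
    # File name and container path below are hypothetical.
    commands = [
        cp_to_container("mycntkdemo", r"C:\dockertut\train.csv", "/mylightgbmex/"),
        cp_from_container("mycntkdemo", "/mylightgbmex/output", r"C:\dockertut"),
        run_with_volume("microsoft/cntk:2.2-cpu-python3.5", r"C:\dockertut", "/mylightgbmex"),
    ]
    for cmd in commands:
        print(" ".join(cmd))
```

With docker cp the files are copied once in each direction, whereas with the -v mapping the container reads and writes the host folder directly, so no copy-back step is needed.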
This new image will contain everything that the CNTK image came with plus lightgbm, all the files we transferred from our machine and the output from executing our script.
In my Dockerfile I will also add instructions to transfer some files from my local machine to a specific location (as in the exercise above, to a directory called mylightgbmex).
Once I have everything in place, I will execute the docker build command, at the end of which I should see the image mycntkwlgbmimage listed in the output of the docker images command.
To access these applications, we need to expose the container's internal port and bind the exposed port to a specified port on the host.
Starting a container with the -p flag explicitly maps a port on the host (our localhost) to a port in the container, so that we can access the application running on that port in the container (port 8888 is the default for the Jupyter notebook application).
docker run -it -p 8888:8888 --name mycntkdemo2 microsoft/cntk:2.2-cpu-python3.5 bash -c "source /cntk/activate-cntk &&
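The 8888:8888 value passed to -p in the command above reads as host port first, container port second. The sketch below parses only that simple two-part form; the helper name parse_publish is hypothetical, and the real flag also accepts richer forms (for example with a host IP or a protocol suffix) that this sketch rejects.

```python
def parse_publish(spec):
    """Parse a 'HOST_PORT:CONTAINER_PORT' value as passed to `docker run -p`."""
    host, sep, container = spec.partition(":")
    if not sep or not host.isdigit() or not container.isdigit():
        raise ValueError(f"expected HOST_PORT:CONTAINER_PORT, got {spec!r}")
    return int(host), int(container)


if __name__ == "__main__":
    host_port, container_port = parse_publish("8888:8888")
    # The notebook inside the container listens on container_port;
    # from the host it is reached through host_port.
    print(f"http://localhost:{host_port} -> container port {container_port}")
```

If port 8888 is already taken on the host, a mapping such as 9000:8888 would keep Jupyter on its default port inside the container while exposing it on port 9000 locally.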
You are spared the overhead of installing and setting up the environment for the various frameworks and can start working on your deep learning projects right away.
- On Tuesday, January 28, 2020
Data Science Workflows using Docker Containers
Containerization technologies such as Docker enable software to run across various computing environments. Data Science requires auditable workflows where ...
Docker for Data Scientists, Strata 2016, Michelangelo D'Agostino
Data scientists inhabit such an ever-changing landscape of languages, packages, and frameworks that it can be easy to succumb to tool fatigue. If this sounds ...
Docker Tutorial - What is Docker & Docker Containers, Images, etc?
Docker tutorial for beginners - part 1: Free Digital Ocean Credit! Docker is amazing, and it doesn't have to be difficult to ..
Docker Container Tutorial - How to build a Docker Container & Image
This tutorial covers how to build a docker container. It covers everything you need to know from setting up boot2docker on your machine to building and ...
Deploying Machine Learning apps with Docker containers - MUPy 2017
Talk at Manipal Institute of Technology on deploying machine learning apps using docker containers. This talk covers data science implementations in python, ...
Andy Terrel | Dev Ops meets Data Science Taking models from prototype to production with Docker
PyData DC 2016 We present the evolution of a model to a production API that can scale to large e-commerce needs. On the journey we discuss metrics of ...
Easy Image Classification with Tensorflow
In this coding tutorial, learn how to use Google's Tensorflow machine learning framework to develop a simple image classifier with object recognition and neural ...
Creating Reproducible Data Science Workflows using Docker Containers
Aly Sivji Jupyter notebooks make it easy to create reproducible workflows that can be distributed across groups and ..
Containerized Video Image Classification with MapR
In this video, you will learn how to classify high resolution images using containerized applications. The video showcases bringing in an image, it can be from ...
Train an Image Classifier with TensorFlow for Poets - Machine Learning Recipes #6
Monet or Picasso? In this episode, we'll train our own image classifier, using TensorFlow for Poets. Along the way, I'll introduce Deep Learning, and add context ...