Here we cover some Docker basics, focusing on the features commonly encountered when working on ML projects.
Introduction
Container technologies, such as Docker, significantly simplify dependency management and portability of your software. In this series of articles, we explore Docker usage in Machine Learning (ML) scenarios. In this article – the first one of the series – we’ll go over some Docker basics as they apply to ML applications.
This series assumes that you are familiar with ML, containerization in general, and Docker in particular.
Installing Docker
Docker can run on multiple platforms, including Docker Desktop for Windows and macOS on Intel/AMD processors, as well as Docker Server for the various Linux distributions on Intel/AMD and ARM processors. You can find a comprehensive installation guide for your platform at the Docker website.
Why Docker?
Docker is a lightweight technology for packaging and executing software components in an isolated environment. Such a package is called "container image" (or just "image"), and the environment where the image code is executed is called "container."
The Docker technology ensures complete isolation of both Python and system dependencies of each container, which is more than virtual environments offer. At the same time, this technology allows maximum portability across runtime environments. In many cases, the same container can run on a local workstation or on a server, either on-premises or in the cloud.
To better understand some design choices we’ll make in the following articles, it is worth spending some time on the Docker basics.
Layers
A Docker image consists of a number of read-only layers. Each layer contains only the differences from the previous layer. When an image is rebuilt, only the layers that have changed – and all the layers that follow them – are refreshed. This is why it is crucial to order the Docker layers from the most static to the most "dynamic" (most likely to change); doing so can greatly reduce the time required to build an image.
The number of layers has an impact on the image size (and build time) as well, so it is recommended to execute multiple Linux commands using a single RUN statement (a single RUN produces a single layer).
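A minimal illustration of this pattern – the package name is just an example – chains the commands and cleans up the apt cache within one layer:
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-numpy && \
    rm -rf /var/lib/apt/lists/*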
Dockerfile
Dockerfile defines an image. Let’s consider a very simple example:
FROM python:3.7-slim
RUN apt-get update && apt-get install -y python3-numpy
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY app /app
WORKDIR app
ENTRYPOINT ["python", "app.py"]
CMD ["--input", "1234"]
The meaning of the above statements is as follows:
FROM – defines the base image in the <image_name>:<image_tag> format. If such an image is not available locally, Docker will try to pull it from a known registry (by default, https://hub.docker.com).
RUN – executes a Linux command.
COPY – copies local files from the host context to the image.
WORKDIR – switches the current folder (creating it if needed).
ENTRYPOINT and CMD – define the command that executes each time the container starts. When used together, ENTRYPOINT holds the constant part of the command, and CMD provides its default parameters (see the example below).
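To make the ENTRYPOINT/CMD split concrete, here is a sketch of how a container built from the Dockerfile above would behave (the image name my_image:v1 is just a placeholder):
$ docker run my_image:v1                 # runs: python app.py --input 1234
$ docker run my_image:v1 --input 42      # arguments replace CMD: python app.py --input 42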
The following command builds an image:
$ docker build -t <image_name>:<tag_name> .
Note that the "." (dot) at the end indicates the current folder as the Docker build context – in a nutshell, the root folder containing the Dockerfile and the files to be copied into the image.
The "Latest" is Not the Best
By default, Docker uses the "latest" tag for the base image if no tag is provided explicitly. Just as you pin package versions when defining Python dependencies for pip or Conda, you should always pin the base image to a specific tag instead of relying on the "latest" default. It can save you a lot of pain, especially in production. While this doesn't guarantee that the image will be 100% the same each time it is built, it significantly reduces the risk of introducing harmful changes.
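As a quick illustration (the image and tag names are only examples):
FROM python             # implicitly resolves to python:latest and may change between builds
FROM python:3.7-slim    # pinned tag – a far more predictable base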
Process Isolation on Different Platforms
Docker works slightly differently depending on the host operating system. Internally, Docker relies on four core Linux features: union file system, Linux processes, namespaces, and cgroups.
The union file system (UnionFS) is the technology used to handle image layers, while the remaining three features – processes, namespaces, and cgroups – ensure proper container isolation. With Docker Server on Linux, all containers share a single kernel with the host. With Docker Desktop on Windows and macOS, a Linux VM is installed on the host machine, and all the running containers share its kernel.
With the above in mind, it is easy to understand why running a container using its default root user is a very bad idea on a Linux server. The issue is slightly less dangerous if only a single container runs on a single dedicated machine (which is often the case for cloud deployments).
In any case, you should never run container code as the root user, even with Docker Desktop.
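A minimal sketch of one common approach – the user name appuser is just an example – is to create a dedicated user in the Dockerfile and switch to it before the ENTRYPOINT:
RUN useradd --create-home appuser
USER appuser
From that point on, the processes started in the container run with that user's limited privileges rather than as root.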
Running a Container with Data
The secret of lightning-fast container startup is that no data is copied when a container starts. Only a single (and initially very thin) read-write container layer is added on top of the stack of read-only image layers. All changes made to files during container execution are stored in this new layer as a "delta" by the union file system.
When a container is removed, this read-write layer is deleted along with it. That is why you should always treat container data as temporary. If you care about the data processed by the container, you need a volume. Depending on your needs and the host environment, it may be persisted by the Docker instance or mapped to a local or cloud folder, as in the sketch below.
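A minimal sketch of both options (the volume, folder, and image names are only placeholders):
$ docker volume create my_data
$ docker run -v my_data:/data my_image:v1             # named volume managed by Docker
$ docker run -v $(pwd)/dataset:/data my_image:v1      # bind mount of a local host folder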
Architecture Matters
Docker is not magic – it depends on the hardware used to run it. This means that you will still, most likely, need slightly different images for Intel/AMD and ARM processors.
Summary
In this article, we have covered some Docker basics, focusing on the features commonly encountered when working on ML projects. In the following article, we’ll put this knowledge to use. We will create a simple container image to use for experimentation and training on an Intel/AMD CPU.
Jarek has two decades of professional experience in software architecture and development, machine learning, business and system analysis, logistics, and business process optimization.
He is passionate about creating software solutions with complex logic, especially with the application of AI.