Incremental Docker builds for monolithic codebases
As a developer working on a monolithic codebase, how can you use Docker to build and deploy the projects contained in it? If you take the naive approach, you quickly run into problems with bloated images and frequent rebuilds of the entire source tree.
In this post, I show you how to build images from monorepos incrementally, reusing previous builds beyond the Docker build cache. The solution I describe avoids code duplication, reduces image size, and speeds up builds dramatically. If you’re a developer who needs to run frequent integration tests on your work in progress, then this technique is for you.
Sample code is available in a GitHub repository. Each section in the post corresponds to a commit in the GitHub repository, linked to at the top of the section.
Contents
- Introduction
- Example codebase
- Writing Dockerfiles
- Avoiding code duplication
- Using Docker Compose for a builder image
- Reducing image size
- Building incrementally
- Summary
Introduction
Companies and open-source organizations often use a single source code repository for their projects. These large monolithic codebases are also known as monorepos. As an extreme example, Google’s monorepo spanned 2 billion lines of code in early 2015, amounting to 86 terabytes of data. [1]
How do you test changes that potentially affect a multitude of services built from it? While unit tests are the primary means of checking code changes during development, it is important to also run integration tests early on. They help make sure that services start up and interact with each other as expected.
Docker is a convenient way of automating the build and deployment process. What’s more, it is not restricted to continuous integration and production. Docker can also be used on a developer machine, allowing you to test the running environment even before you push your changes to a branch for CI and review.
Using Docker to build and deploy artifacts from a monolithic codebase presents several challenges. This is especially true if you’re a developer who needs to frequently rebuild the codebase.
Example codebase
I will use a small example codebase to explain how to use Docker for a monolithic repository. It consists of a static library foo and two executables, bar and baz. The build system produces a Debian package for each executable.
.
├── CMakeLists.txt
├── bar
│   ├── CMakeLists.txt
│   └── bar.c
├── baz
│   ├── CMakeLists.txt
│   └── baz.c
└── foo
    ├── CMakeLists.txt
    ├── foo.c
    └── foo.h
3 directories, 8 files
Don’t worry if you’re not familiar with the C programming language or the CMake build system. This article does not assume any knowledge about these. The technique shown here applies to any programming language and build system, and it should be straightforward to translate it to your weapon of choice.
Writing Dockerfiles
Let’s start by writing a Dockerfile for each of the two executables.
The Dockerfiles are almost identical: They install the build requirements, copy the source tree, and build the binaries and packages. At the end, each Dockerfile installs the package containing its executable, and sets it as the command to be executed when running the image.
FROM debian:stretch-slim
RUN apt-get update && apt-get install -y \
    cmake \
    dpkg-dev \
    gcc \
    make \
    && rm -rf /var/lib/apt/lists/*
COPY . /src
WORKDIR /build
RUN cmake /src
RUN make package
RUN dpkg --install foobar-0.1.1-Linux-bar.deb
CMD ["bar"]
Instructions to build and deploy many services become complex, very fast. While Dockerfiles already help greatly with this, you can use Docker Compose to encapsulate the entire build and deployment process in a single declarative file.
Create the following docker-compose.yml file:
version: "3.7"
services:
  bar:
    build:
      context: .
      dockerfile: bar/Dockerfile
  baz:
    build:
      context: .
      dockerfile: baz/Dockerfile
This allows you to build and deploy the images with a single invocation of docker-compose up.
Test everything using the following commands:
docker-compose up --build --detach
docker-compose logs --tail=10
docker-compose down
It works, but this solution has several shortcomings:
- Each Dockerfile contains identical commands to build the source tree.
- Each Docker image contains the entire source and build trees.
- The entire codebase is built every time a line of code is changed.
Avoiding code duplication
Code duplication is easily avoided by moving the build instructions to a separate Dockerfile:
FROM debian:stretch-slim
RUN apt-get update && apt-get install -y \
    cmake \
    dpkg-dev \
    gcc \
    make \
    && rm -rf /var/lib/apt/lists/*
COPY . /src
WORKDIR /build
RUN cmake /src
RUN make package
The above Dockerfile is used to build a common base image named builder, which can be referenced by the other Dockerfiles like this:
FROM builder
RUN dpkg --install foobar-0.1.1-Linux-bar.deb
CMD ["bar"]
The builder image is not yet registered in the Docker Compose file. Before the services can be created and started, the builder image needs to be built explicitly using a command such as the following:
docker build --tag=builder .
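As long as the builder image is built by hand, the two steps have to run in the right order. The following is a minimal sketch of a wrapper that chains them. The function names run and build_all and the DRY_RUN switch are hypothetical helpers introduced for illustration; they are not part of the sample repository.

```shell
#!/bin/sh
# Hypothetical helper: set DRY_RUN=1 to print commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build the shared builder image first, then the service images derived from it.
# The second step only runs if the first one succeeds.
build_all() {
    run docker build --tag=builder . &&
    run docker-compose build
}
```

Calling build_all from the root of the source tree builds everything in order. With DRY_RUN=1 set, it only prints the two commands, which is handy for checking the sequence without touching Docker.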
Using Docker Compose for a builder image
Docker Compose only builds a single image for every service, but these images are derived from a common base image. How do you ensure the base image is rebuilt any time the services are?
Getting Docker Compose to build the base image is possible, but it involves adding a service for it at the top of the Docker Compose file:
version: "3.7"
services:
  builder:
    image: builder
    build: .
  bar:
    build: bar
  baz:
    build: baz
The image name is specified explicitly because, by default, image names are constructed from the basename of the directory and the service name. So the image would end up being named docker-incremental-build-example_builder, rather than builder.
Of course, you do not actually want Docker Compose to run a service with the builder image. Ensure that the builder service exits immediately by changing its command to true:
diff --git a/Dockerfile b/Dockerfile
index 2202b3f..70a4be8 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -9,3 +9,4 @@ COPY . /src
 WORKDIR /build
 RUN cmake /src
 RUN make package
+CMD ["true"]
Reducing image size
One problem you encounter when writing Dockerfiles for a monolithic codebase is the size of the Docker images resulting from it. Each image contains the entire source and build trees, including a heap of intermediate and unrelated build artifacts, as well as the complete build toolchain. You can verify this for the example codebase:
$ docker image ls --format="table {{.Repository}}\t{{.Size}}"
REPOSITORY SIZE
docker-incremental-build-example_baz 317MB
docker-incremental-build-example_bar 317MB
builder 317MB
Multi-stage builds
Multi-stage builds are commonly used to keep build dependencies out of the final Docker image: The first stage imports the source tree, installs the build toolchain, and produces the build artifact. The second stage extracts the build artifact and copies it into a minimal base image. This is achieved using the COPY --from instruction, which allows copying files from another image or build stage.
A typical multi-stage build looks like this:
FROM debian
RUN apt-get update && apt-get install -y build-essential \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY . .
RUN make foo
FROM alpine
COPY --from=0 /build/foo /usr/bin/
CMD ["foo"]
This Dockerfile has two stages, each introduced by a FROM instruction. The first stage uses a Debian image to build an executable named foo. The second stage uses an Alpine image, copies foo from the first stage and sets it as the command to be executed when the image is run. Docker build stages are numbered from zero, so COPY --from=0 copies from the first stage. Alpine Linux is a security-oriented, lightweight Linux distribution and a popular choice for Docker images.
Using the builder image as a “stage”
With a monolithic codebase, the build instructions for the first stage are identical for all images. How do you use multi-stage builds when the initial stage is shared between the images?
This is actually rather simple. The COPY --from instruction can also be used with the name of an external image, rather than a build stage. You already have an image that builds the codebase: the builder image.
Let’s rewrite the Dockerfiles for bar and baz using the COPY --from instruction. Instead of deriving the final images from the builder image, derive them from a minimal base image. Then extract the Debian package from the builder image, and install it into the image:
FROM debian:stretch-slim
COPY --from=builder /build/foobar-0.1.1-Linux-bar.deb /tmp/
RUN dpkg --install /tmp/foobar-0.1.1-Linux-bar.deb
CMD ["bar"]
Images now contain only the minimum required to run the service, leaving source and build trees as well as build dependencies behind. This reduces image sizes significantly, even for our tiny example codebase:
$ docker image ls --format="table {{.Repository}}\t{{.Size}}"
REPOSITORY SIZE
docker-incremental-build-example_baz 55.4MB
docker-incremental-build-example_bar 55.4MB
builder 317MB
This technique is related to multi-stage builds, but—due to the monolithic nature of the codebase—the first stage is shared between images and contained in a separate Dockerfile.
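To put a number on the saving, the two sizes from the listings above can be compared directly. This is a small arithmetic sketch, not a measurement tool: the values are copied by hand from the docker image ls output shown in this section, and the variable names are chosen here for illustration.

```shell
# Sizes in MB, copied from the `docker image ls` listings in this section.
before=317    # service image derived from the builder image
after=55.4    # service image derived from debian:stretch-slim

# Relative reduction in percent: (before - after) / before * 100
reduction=$(awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1f", (b - a) / b * 100 }')
echo "image size reduced by ${reduction}%"
```

For the example codebase this works out to a reduction of roughly 82.5 percent per service image. The builder image itself stays at its full size, since it still carries the toolchain and the build tree.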
Building incrementally
Every time a line of code is changed, Docker rebuilds the entire codebase from scratch. Typically, this takes anywhere from minutes to hours, depending on codebase and build infrastructure. Let’s see why this happens.
The Dockerfile for the builder image imports the source tree into the image using the COPY instruction. When encountering a COPY instruction, Docker examines the contents of the copied files and calculates a checksum for each file. If the checksum of each of the files matches its checksum in a previous build, the image is retrieved from the Docker build cache.
If, on the other hand, one of the files has changed, the build cache is invalidated and Docker runs the remaining instructions in the Dockerfile without using the cache. As a consequence, the build tools (cmake and make) cannot reuse any artifacts from a previous run. The layers containing them appear after the COPY instruction and are thus no longer available.
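The effect of a single changed file can be imitated outside Docker. The sketch below is only an analogy for the cache check, not Docker's actual implementation: tree_checksum is a hypothetical helper that folds the per-file checksums of a directory into one value using the standard cksum utility, and the file contents are made up for the demonstration.

```shell
# Hypothetical helper: one checksum over the contents of every file in a directory.
# Per-file checksums are sorted so the result does not depend on traversal order.
tree_checksum() {
    (cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort | cksum)
}

# Stand-in for the source tree that COPY would import.
src=$(mktemp -d)
echo 'int main(void) { return 0; }' > "$src/bar.c"

before=$(tree_checksum "$src")
unchanged=$(tree_checksum "$src")

# Change a single character in a single file.
echo 'int main(void) { return 1; }' > "$src/bar.c"
after=$(tree_checksum "$src")

[ "$before" = "$unchanged" ] && echo "no change: cache hit"
[ "$before" != "$after" ] && echo "file changed: cache miss"

rm -rf "$src"
```

A one-character edit is enough to change the combined checksum. In a Docker build, the same edit invalidates the COPY layer and every layer after it, including the ones produced by cmake and make.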
How to avoid building the entire source tree each time
How can you get Docker to build incrementally, reusing the intermediate build artifacts from its last invocation? Luckily, there is an instruction that allows you to copy the build artifacts into the image explicitly: the COPY --from instruction, which you just used in the section Reducing image size. At the time the Dockerfile is being executed, the builder tag still references the last successful build of this image, so you can use it to copy the build tree.
Let’s add this instruction right after we copy the source tree:
diff --git a/Dockerfile b/Dockerfile
index 70a4be8..82469c0 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -6,6 +6,7 @@ RUN apt-get update && apt-get install -y \
     make \
     && rm -rf /var/lib/apt/lists/*
 COPY . /src
+COPY --from=builder /build /build
 WORKDIR /build
 RUN cmake /src
 RUN make package
Now the build tools can access the results of previous runs, so only targets with changed dependencies need to be rebuilt. For a sizeable codebase, this can speed up Docker builds dramatically.
Bootstrapping incremental builds
The COPY --from=builder instruction references the very image that is currently being built: the builder image. The consequence of this self-reference is that the initial build is now broken: there is no image to copy the build tree from.
The following Dockerfile.init creates an initial builder image from which the “real” builder image can copy the build directory. The build directory contains only an empty placeholder file.
FROM scratch
COPY .keep /build/
For convenience, create this docker-compose.init.yml to trigger the initial build:
version: "3.7"
services:
  builder:
    image: builder
    build:
      context: .
      dockerfile: Dockerfile.init
Docker builds can now be bootstrapped using the following command:
docker-compose --file=docker-compose.init.yml build
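Bootstrapping is only needed while no builder image exists, so the check can be scripted. The following is a minimal sketch under that assumption. It relies on docker image inspect returning a non-zero exit status for an unknown image; the function name ensure_builder and the DOCKER and COMPOSE override variables are hypothetical and introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical overrides, so the commands can be swapped out (e.g. for testing).
DOCKER="${DOCKER:-docker}"
COMPOSE="${COMPOSE:-docker-compose}"

# Run the bootstrap build only if no image tagged "builder" exists yet.
ensure_builder() {
    if "$DOCKER" image inspect builder >/dev/null 2>&1; then
        echo "builder image found, skipping bootstrap"
    else
        echo "builder image missing, bootstrapping"
        "$COMPOSE" --file=docker-compose.init.yml build
    fi
}
```

Calling ensure_builder before the regular build keeps a fresh checkout or a pruned Docker host from failing on the self-referencing COPY --from=builder instruction, while leaving an existing build tree untouched.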
Summary
The technique outlined here allows you to build and deploy artifacts from large and monolithic codebases.
- A single command builds and deploys Docker images.
- Docker builds are fast due to the use of incremental builds.
- Image sizes are small due to the use of multi-stage builds.
At the heart of this technique is the COPY --from instruction. It is used for two purposes:
- Copy the build tree from one builder image to the next.
- Copy the build artifacts into the final images.
# Copy the build tree from one builder image to the next.
COPY --from=builder /build /build
# Copy the build artifact from builder image to final image.
COPY --from=builder /build/foobar-0.1.1-Linux-bar.deb /tmp/
Builds are bootstrapped using the following command:
docker-compose --file=docker-compose.init.yml build
From then on, images can be built and deployed using the command:
docker-compose up --build
I hope you found this useful, and I’m happy to answer questions or to learn about how you’ve implemented this method in your own work.
[1] Rachel Potvin and Josh Levenberg, “Why Google Stores Billions of Lines of Code in a Single Repository,” Communications of the ACM, Vol. 59, No. 7, July 2016, pp. 78-87.