I'm giving Docker a shot again, after a couple of years, and this time I'm "dockerizing" an old Python app that doesn't even have configuration management or orchestration. Docker is easy to install now. No more bullshit about kernel features and filesystem support. They even claim it works seamlessly on OS X and Windows [*].
The good parts
Docker has several things going for it: a good name, a huge community, and a huge number of images you can start from. Sure, most images are not great, and even the official images can have issues, but you can always take the Dockerfile and customize things.
Right now it's hard to find an alternative. CoreOS is trying to make one called "rkt". With a Flickr-esque name at that! How do you even read it? ercati? erkit? rocket? rockit? rackit? reckit? You know what, wreckit. If you think I'm lambasting the name unfairly then consider that most of the world doesn't speak English very well, and ambiguous spelling is a real problem.
Even if rkt didn't have such a terrible name, it still has a long way to go before it reaches Docker's convenience.
What irks me
Note
If it matters, I'm using Docker version 1.11.0. They call it the "Engine" now.
Docker has some strange oversights and quirks. There's a very strong focus on not breaking any interfaces, so it's a bit depressing to look at the bug tracker if you're the impatient type.
These things annoy me:
Documentation has no search. Seriously? I understand that there's Google but come on, I don't want to look at docs for old Docker versions and other junk Google gives me.
The command line seems clumsy:
Inflexible parsing, e.g.: docker build . --help isn't valid. That makes no sense: there's only a single non-option argument, so why can't I have options after that single possible argument?
Most options have a short form (e.g.: -i) but --help doesn't. Nooo, not that. No one needs a short form for that, God forbid!
Bad help for most options. Is this supposed to help?
-v, --volume=[] Bind mount a volume
Cause that sure as hell doesn't tell me anything about what I can pass in there. Now I have to go into the docs with no search. After rummaging through 10 useless pages of documentation I eventually get to this:
-v, --volume=[host-src:]container-dest[:<options>] Bind mount a volume. The comma-delimited `options` are [rw|ro], [z|Z], [[r]shared|[r]slave|[r]private], and [nocopy]. The 'host-src' is an absolute path or a name value.
What. Why can't that be in the command line?
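For the record, here's what that syntax means in practice (my own examples, not from the help text - the image name is made up):

# bind mount a host directory, read-write by default
docker run -v /srv/data:/data myimage
# the same mount, but read-only
docker run -v /srv/data:/data:ro myimage
# a "name value" instead of a host path refers to a named volume
docker run -v appdata:/data myimage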
If you get a run or build error in docker-compose you're usually left with half-running containers, and you have to clean up the mess manually. If you don't pay attention you're left wondering why docker-compose build doesn't do anything. I made some changes, ran docker-compose build, so why doesn't docker-compose up run the new images? Cause some stuff is already running, that's why!
If you stop a docker-compose up, all services stop. If you stop a docker-compose up myservice, it leaves all the dependencies running around. Argh! Why the inconsistency?
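What usually gets me unstuck is tearing everything down and starting over. A blunt sketch (docker-compose 1.6+ has the down command):

# stop and remove the containers and networks compose created
docker-compose down
# now the rebuild and restart actually take effect
docker-compose build
docker-compose up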
There is no garbage collection. It leaves around huge piles of useless containers and images. Sure, there's docker rmi $(docker images -q) and docker rm $(docker ps -a -q) but that's like cleaning the dust out of your computer with a water hose. It sure does the job but your computer probably doesn't work after that.
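A less scorched-earth sketch, using the filters the CLI does have - this only collects exited containers and dangling images, so it's reasonably safe:

# remove only containers that have already exited
docker rm $(docker ps -aq --filter status=exited)
# remove only untagged, dangling images
docker rmi $(docker images -q --filter dangling=true)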
Dockerfiles are quirky and non-orthogonal. Similarly to the command-line interface, the Dockerfile syntax is a culmination of ill-advised choices.
Support for multiline values is an afterthought. Dockerfiles are littered with "\" and "&&" all over the place. Lots of noise, lots of mistakes.
There are two ways to do the same thing. The parser takes JSON for most commands, but if parsing fails it takes whatever gunk is in there as a string.
For example, CMD and ENTRYPOINT take this to the extreme: they have two exec modes. Take a look at this crazy table:
                            | No ENTRYPOINT              | ENTRYPOINT exec_entry p1_entry                             | ENTRYPOINT ["exec_entry", "p1_entry"]
No CMD                      | error, not allowed         | /bin/sh -c exec_entry p1_entry                             | exec_entry p1_entry
CMD ["exec_cmd", "p1_cmd"]  | exec_cmd p1_cmd            | /bin/sh -c exec_entry p1_entry exec_cmd p1_cmd             | exec_entry p1_entry exec_cmd p1_cmd
CMD ["p1_cmd", "p2_cmd"]    | p1_cmd p2_cmd              | /bin/sh -c exec_entry p1_entry p1_cmd p2_cmd               | exec_entry p1_entry p1_cmd p2_cmd
CMD exec_cmd p1_cmd         | /bin/sh -c exec_cmd p1_cmd | /bin/sh -c exec_entry p1_entry /bin/sh -c exec_cmd p1_cmd  | exec_entry p1_entry /bin/sh -c exec_cmd p1_cmd
There is clear overlap there. Maybe there are historical reasons for it but that's why we have version numbers. Why do we need to keep all that cruft?
No volumes/mounts during docker build. I have found ways to deal with this (read on) but still, it's an irk.
If you look in the bug tracker you'll notice that lots of people have come up with ideas to improve Dockerfiles, but these issues just don't seem important enough.
There's something rotten about how Docker runs containers as root by default. Docker seems to mitigate this by disabling some capabilities, like ptrace, by default. But this isn't purely a security concern; it affects usability as well.
For example, once you install Docker on your Ubuntu machine you're faced with a choice: use sudo all over the place or give your user full access to the daemon:
sudo usermod -aG docker $USER # then relogin, for the change to take effect - obvious right?
Still not sure if that's better than a constant reminder that you're doing lots of stuff as root.
I know these might look like superficial complaints but they rile me up regardless. At least it's not this bad. </rant>
The ecosystem
There are a lot of images, and you'll probably find anything you could ever want. There's even a set of specially branded images that pop up almost everywhere: the "official" images. They are pretty good: up to date, they verify signatures for whatever they download, consistent presentation, etc.
Though I have a problem with the Python images. For some reason they decided to just compile Python themselves. Sure, it's the latest point release, but they don't include any of the patches the Debian maintainers made. I don't like those patches either (they customize package install paths and the import system) but it's worse without them:
Suddenly gdb stops working. Just try a docker run --rm -it python:2.7 sh:
# apt-get update
# apt-get install gdb
# gdb python
Traceback (most recent call last):
  File "/usr/lib/python2.7/site.py", line 563, in <module>
    main()
  File "/usr/lib/python2.7/site.py", line 545, in main
    known_paths = addusersitepackages(known_paths)
  File "/usr/lib/python2.7/site.py", line 272, in addusersitepackages
    user_site = getusersitepackages()
  File "/usr/lib/python2.7/site.py", line 247, in getusersitepackages
    user_base = getuserbase() # this will also set USER_BASE
  File "/usr/lib/python2.7/site.py", line 237, in getuserbase
    USER_BASE = get_config_var('userbase')
  File "/usr/lib/python2.7/sysconfig.py", line 582, in get_config_var
    return get_config_vars().get(name)
  File "/usr/lib/python2.7/sysconfig.py", line 528, in get_config_vars
    _init_posix(_CONFIG_VARS)
  File "/usr/lib/python2.7/sysconfig.py", line 412, in _init_posix
    from _sysconfigdata import build_time_vars
  File "/usr/lib/python2.7/_sysconfigdata.py", line 6, in <module>
    from _sysconfigdata_nd import *
ImportError: No module named _sysconfigdata_nd
Looks like there's some import path trampling going on. I don't want broken debug tools when my app is broken.
You can't use any Python package from the APT repos. Sure, most of them are old and easy to install with pip, but there are exceptions [2].
Strange C.UTF-8 locale. I can understand they don't want to put a specific language in there, but if you run any locale-dependent applications you'll run into issues.
What I ended up using was the ubuntu:xenial image (Xenial being the new LTS). It ships the latest point release of 2.7, so why compile it again? I took the good parts from python:2.7-slim and got this Dockerfile:
FROM ubuntu:xenial
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        ca-certificates curl \
        strace gdb lsof locate net-tools htop \
        python2.7-dbg python2.7 libpython2.7 \
    && rm -rf /var/lib/apt/lists/*
ENV PYTHON_PIP_VERSION 8.1.1
RUN set -ex \
    && curl -fSL 'https://bootstrap.pypa.io/get-pip.py' | python2.7 - \
        --no-cache-dir --upgrade pip==$PYTHON_PIP_VERSION
COPY .wheels /wheels
Everything works great, and there's some magic in those debug packages that makes gdb give me some really nice commands like py-bt [3]. Note that I snuck in some other tools to help with debugging.
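Roughly what that buys you in practice (a hypothetical session - 1234 stands in for your Python process's pid):

# attach to the running Python process
gdb -p 1234
# then, at the gdb prompt, get a Python-level backtrace instead of C frames
(gdb) py-bt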
All I'm saying here is that even official images need scrutiny. Check their Dockerfile and decide if it really fits your needs.
The challenges
Docker has some interesting challenges, or rather, limitations. The build system has something called "layers" and there's a hard limit on how many layers you can have. Each command in your Dockerfile makes a new layer. If you look at the "official" best practices guide you'll see that most of the stuff there revolves around this limitation. You inevitably end up with some damn ugly RUN commands.
There's a good thing about these layers: they are cached. However, the context is not. Rarely does a single layer need all the context, or the same part of it. Layers should be able to have individual contexts, but alas, docker build wasn't designed with that in mind.
Another limitation by design is that docker build doesn't allow any mounts or volumes during build. The only way to get stuff into the container that eventually becomes the image is by network or by the "context".
What's this context?
When you run docker build foobar, Docker will make an archive of foobar/* (minus what you have in .dockerignore) and build an image according to what you have in foobar/Dockerfile. You can specify the context path and the Dockerfile path individually but, oddly enough, the Dockerfile must be inside the context. You can't get creative here.
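Concretely (Dockerfile.web is a made-up name, just to show the flag):

# context is "foobar", Dockerfile defaults to foobar/Dockerfile
docker build foobar
# you can point at a different Dockerfile with -f ...
# ... but its path still has to be inside the context
docker build -f foobar/Dockerfile.web foobar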
Optimizing the build process
You can parametrize this build process, but the lack of mounts or volumes exposes you to some pretty annoying slowness if, for example, you have to build external packages. This problem is still pervasive in Python: most of the stuff on PyPI is just source packages. Even though you can now publish Linux binaries on PyPI, it will still be years until most packages publish those manylinux1 wheels [1]. And even if we had wheels for everything, there's still the question of network slowness. Setting up caching proxies is inconvenient.
Most Dockerfiles I've seen have something like this:
RUN mkdir -p /app
COPY requirements.txt /
# slow as hell ...
RUN pip install -r /requirements.txt
COPY app /app
Now for simple projects this is fine, because you only have a handful of dependencies. But for larger projects, hundreds of dependencies are the order of the day. Changing them or upgrading versions (and you should always pin versions [4]) will introduce serious delays in build times. Because the container running the build process is pretty insulated (no volumes or mounts, remember?), pip can't really cache anything.
Staging the build process
A way to solve this is having a "builder image" that you run to build wheels for all your dependencies. When you run an image you can use volumes and mounts.
Before jumping in, let's look briefly at the file layout. I like to have a docker directory with another level for each kind of image. This is quite similar to the layout the builder for the official images has. And no weird filenames, just a Dockerfile everywhere:
docker
├── base
│   └── Dockerfile
├── builder
│   └── Dockerfile
├── deps
│   └── Dockerfile
├── web
│   └── Dockerfile
├── worker
│   └── Dockerfile
└── ...
In this scenario we'd deploy two images: web and worker. The inheritance chain would look like this:
- buildpack-deps:xenial → builder
- ubuntu:xenial → deps → base → web
- ubuntu:xenial → deps → base → worker
In which:
- builder has development libraries, compilers and other stuff we don't want in production.
- deps only has python and the dependencies installed.
- base has the source code installed.
- web and worker have specific customizations (like installing Nginx or different settings).
And in .dockerignore we'd have:
# just ignore that directory, we don't need that stuff when we have "." as the context
docker
This layout might seem contrived, but there are reasons for it:
- Both the worker and web need the same source code.
- The deps and base are not in the same image because their contexts are distinct: one needs a bunch of wheels and the other one only needs the sources. This setup allows us to skip building the deps image if the requirement files did not change.
- The web and worker images do not need the source code in their contexts. This allows faster build times. For development purposes we can just mount the source code. More about that later.
In builder/Dockerfile there would be:
# we start from an image with build deps preinstalled
FROM buildpack-deps:xenial
# seems acceptable for building, see the notes above
# about C.UTF-8 - it's not really good for running apps
ENV LANG C.UTF-8
# we'd add all the "-dev" packages we need here
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        python2.7 python2.7-dbg python2.7-dev libpython2.7 \
        strace gdb lsof locate \
    && rm -rf /var/lib/apt/lists/*
ENV PYTHON_PIP_VERSION 8.1.1
RUN set -ex \
    && curl -fSL 'https://bootstrap.pypa.io/get-pip.py' | python2.7 - \
        --no-cache-dir --upgrade pip==$PYTHON_PIP_VERSION
ARG USER
ARG UID
ARG GID
# we set some default options for pip here
# so we don't have to specify them all the time
# this will make pip additionally look for available wheels here
ENV PIP_FIND_LINKS=/home/$USER/wheelcache
# and this is the default output dir when we run `pip wheel`
ENV PIP_WHEEL_DIR=/home/$USER/wheelcache
ENV PIP_TIMEOUT=60
# one network request less, we don't care about latest version
ENV PIP_DISABLE_PIP_VERSION_CHECK=true
RUN echo "Creating user: $USER ($UID:$GID)" \
&& groupadd --system --gid=$GID $USER \
&& useradd --system --create-home --gid=$GID --uid=$UID $USER \
&& mkdir /home/$USER/wheelcache
WORKDIR /home/$USER
The interesting part here is the USER, UID and GID build arguments. Unless you do something special, the processes inside the container run as root. This is fine, right? That's the whole point of using a container: processes in the container actually have all sorts of limitations, so it shouldn't matter what user runs inside. However, if you mount something from the host inside the container, then the owner of any new file created inside that mount is going to be the same user the container runs with. The result is that you get a bunch of root-owned files on the host. Not nice.
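A quick demonstration of the problem (the paths are made up; ubuntu:xenial just to match the rest of this article):

mkdir demo
# create a file in a bind mount from inside the container ...
docker run --rm --volume="$PWD/demo":/demo ubuntu:xenial touch /demo/hello
# ... and on the host it comes out owned by root:root
ls -l demo/hello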
Because I don't do development with a root account, and because user namespaces are surprisingly inconvenient to use [6], I have resorted to recreating my user inside the container. It needs to have the exact same uid and gid, otherwise I end up with files owned by an account that doesn't exist.
Similarly to what was shown before, deps/Dockerfile would have:
FROM ubuntu:xenial
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        ca-certificates curl \
        strace gdb lsof locate net-tools htop \
        python2.7-dbg python2.7 libpython2.7 \
    && rm -rf /var/lib/apt/lists/*
ENV PYTHON_PIP_VERSION 8.1.1
RUN set -ex \
    && curl -fSL 'https://bootstrap.pypa.io/get-pip.py' | python2.7 - \
        --no-cache-dir --upgrade pip==$PYTHON_PIP_VERSION
COPY .wheels /wheels
RUN pip install --force-reinstall --ignore-installed --upgrade \
        --no-index --use-wheel --no-deps /wheels/* \
    && rm -rf /wheels
And base/Dockerfile:
FROM app-deps
# copy the application files and add them on the import path
RUN mkdir /app
WORKDIR /app
COPY setup.py /app/
COPY src /app/src
# there are other ways (like .pth file) but this one allows
# for easy setup of various bin scripts app might need
RUN python2.7 setup.py develop
# create an user for the application and install basic tools
# to change user (pysu) and wait for services (holdup)
ARG USER=app
ENV USER=$USER
RUN echo "Creating user: $USER" \
&& groupadd --system $USER \
&& useradd --system --create-home --gid=$USER --base-dir=/var $USER \
&& pip install pysu==0.1.0 holdup==1.0.0 \
&& pysu $USER id
# this last one just tests that pysu works
For web/Dockerfile we can have something like:
FROM app-base
RUN apt-get update \
    && apt-get install -yq --no-install-recommends nginx-core supervisor \
    && rm -rf /var/lib/apt/lists/*
COPY site.conf /etc/nginx/sites-enabled/default
COPY supervisor.conf /etc/supervisor/conf.d/
COPY entrypoint.sh /
RUN echo "daemon off;" >> /etc/nginx/nginx.conf
EXPOSE 80
CMD ["/entrypoint.sh"]
To build the images we can run this:
#!/bin/sh
set -eux
docker build --tag=app-builder \
    --build-arg USER=$USER \
    --build-arg UID=$(id --user $USER) \
    --build-arg GID=$(id --group $USER) \
    docker/builder
# we run this image two times, once to prime a wheel cache
mkdir -p .dockercache
docker run --rm \
    --user=$USER \
    --volume="$PWD/requirements.txt":/requirements.txt:ro \
    --volume="$PWD/.dockercache":/home/$USER \
    app-builder \
    pip wheel --requirement=/requirements.txt
# and the second time to create the final wheel set
rm -rf "$PWD/docker/deps/.wheels"
mkdir "$PWD/docker/deps/.wheels"
docker run --rm \
    --user=$USER \
    --volume="$PWD/requirements.txt":/requirements.txt:ro \
    --volume="$PWD/.dockercache":/home/$USER \
    --volume="$PWD/docker/deps/.wheels":/home/$USER/wheels \
    app-builder \
    pip wheel --wheel-dir=wheels \
        --requirement=/requirements.txt
# and now there are going to be tons of wheels in "docker/deps/.wheels/"
docker build --tag=app-deps docker/deps
docker build --tag=app-base --file=docker/base/Dockerfile .
# this is why we simply ignore "docker/" in .dockerignore -- it
# would inflate the context a lot and we don't need the wheels in this step
docker build --tag=app-web docker/web
If you look closely enough you'll notice a small flaw here: rebuilding the base image will invalidate everything that depends on it. Because the app-base image has the code in its context, it will get invalidated often enough to be annoying. However, there's a solution for that, and let's not forget the whole point of this: we can produce a large fleet of images cheaply (the context only includes docker/<kind>). This is quite good if you have a huge codebase and a large number of image types (in addition to app-web and app-worker you could have various other services).
Doing development
For development we don't really need to rebuild the app-base image (which includes the sources) - we can just mount our project checkout in the right place. This could work for development:
# no need to rebuild the base image (that has
# the /app/src) if we just mount it anyway
docker run --volume=$PWD/src:/app/src:ro app-web
Enter docker-compose
A nicer way to do it, especially if you depend on services like a database, is with docker-compose - all you need is a docker-compose.yml file like this:
version: '2'
services:
  web:
    build: 'docker/web'
    image: 'app-web'
    ports:
      - '8080:80'
    volumes:
      - '${PWD}/src:/app/src:ro'
    links:
      - 'pg'
    environment:
      DATABASE_HOST: 'pg'
  pg:
    image: 'postgres:9.5'
    environment:
      POSTGRES_DB: 'app'
      POSTGRES_USER: 'app'
      POSTGRES_PASSWORD: 'app'
With this configuration we solve the problem with the app-base that was outlined earlier: note the '${PWD}/src:/app/src:ro' mount.
Living inside the container
Working inside a container is not easy. Take debugging for example: normally you'd expect strace and gdb to just work, but they require privileges that are disabled by default. Thus this pattern appears:
docker run --privileged --rm -it myimage bash
# or, if you want to debug a running container
docker exec --privileged -it mycontainer bash
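If --privileged feels too broad, a narrower variant usually suffices for debuggers (a sketch - these flags exist as of Docker 1.10+, but double-check on your version):

# grant just ptrace and relax the seccomp profile
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --rm -it myimage bash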
While strace can work from the outside, gdb is not feasible to use from the host due to mismatched library paths. It's unlikely that you have the exact same OS on the host, so you will lack the sorely needed debug symbols.
Avoiding bash
Turns out this is a thing; for example, Alpine doesn't have bash by default. You can still install it, but if you want the smallest possible images then you should avoid it.
My rule of thumb here is: if at any point you'll want an interactive session in a container, just install it. Space can't be more important than tooling.
Container dependencies
A frequent problem with dependencies is slow startup. Take for instance the pg service shown in the docker-compose.yml example: it can take a few seconds to start the first time due to setup. The web service will then fail to start if it needs to connect to the database at startup (e.g.: migrations).
People definitely want a solution for this. Currently there are three ways to handle the issue:
- Use one of Docker's restart policies. This might be a bit too coarse.
- Use your orchestration tool's healthcheck features to handle this. This might be inconvenient; for example, docker-compose doesn't have this feature. Kubernetes has it, but it's a bit more involved.
- Make your container wait a bit for services by doing basic port or health checks. There are some recommendations around, however, none of them support unix domain sockets, so I made my own tool: holdup.
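For example (a sketch - manage.py is just a stand-in for your app's entry point; check holdup --help for the exact options):

# wait until postgres accepts connections, then run the real command
holdup tcp://pg:5432 -- python manage.py migrate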
Process handling
An easy mistake is not using exec in the shell scripts you'll inevitably need. If you forget to do this, your container will fail to stop correctly and Docker will be forced to send a SIGKILL. This can lead to data loss. If your PID 1 is something like /bin/bash or /bin/sh (in other words, your app doesn't run as PID 1) then you have this problem.
Suppose you have a CMD ["script.sh"] or ENTRYPOINT ["script.sh"]. Then:
#!/bin/sh
# general good practice (stop on error, missing variables, verbose mode):
set -eux
# avoid this:
pure-ftpd
# do this instead:
exec pure-ftpd
To make it worse, the su binary doesn't support an "exec" mode. Thus tools like gosu [5] were created.
Another problem is sloppy applications that don't clean up after child processes. Containers don't have an "init" process by default, so there's no "catch-all" for zombie processes. If you have that sort of application, your container will inevitably exhaust all available PIDs. You can work around this with minimal init systems like pidunu, but a better solution is to fix your application's code - it only needs an os.waitpid(-1, WNOHANG) [7] in the right place.
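A quick way to check whether this is happening to you (a sketch, assuming the image has ps installed):

# zombies show up with a "Z" in the STAT column
docker exec mycontainer ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'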
Lack of standard services
This one comes as a surprise, especially if you grew up on well-established services like syslog or cron.
For example, pure-ftpd has an unhealthy marriage to syslog. Your options:
Get the source package, patch/configure it to log to stdout, and recompile. Unfortunately this results in a bloated image, unless you inconvenience yourself with layer-merging tools or a builder image.
There may be a distro with a package that has all the right compile options, but I haven't found it.
Mount a syslog socket in the container (example: /var/log). But this goes against the grain of Docker managing the logs. Good luck finding out which container emitted the messages.
Run pure-ftpd and syslogd with supervisord or similar. You'd need a Dockerfile like:
FROM ubuntu:xenial
RUN apt-get update \
    && apt-get install -y --no-install-recommends supervisor inetutils-syslogd \
    && rm -rf /var/lib/apt/lists/* \
    && echo '*.* /dev/stdout' > /etc/syslog.conf
COPY supervisor.conf /etc/supervisor/conf.d/
And a supervisor.conf like:
[program:syslog]
command=syslogd -n --no-klog
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
redirect_stderr=true
priority=1

[program:mystuff]
command=mystuff
But this creates some friction with the way things should be run in Docker: should mystuff fail to run, Docker won't know.
Run something like:
# just forward everything to stdout
socat -u unix-recv:/dev/log - &
exec pure-ftpd
And hope socat never dies ...
Cron is another kind of "I don't need this" issue. In addition to cron requiring a /dev/log (syslog), your cron jobs are run with a clean environment. All those environment variables are gone unless you have a contraption like this:
- Dump all the env vars somewhere: gosu myuser env > /saved-environ
- And then in the cron job, make sure to load that: env - `cat /saved-environ` python ... >/dev/null 2>&1
Alternatively you could (depending on what base image you use) dump all the env vars in this location: gosu myuser env > /etc/environment
Output redirection is another issue with cron: you can't simply log to stdout, so your only resort is, guess what, syslog: 2>&1 | logger -t myjob.
Is that all, you say? Of course not. Cron also wants a mailer. You probably don't want that, so you add a MAILTO="" in your cron config.
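Putting those workarounds together, an /etc/cron.d entry might look like this (a sketch - myjob is a made-up name, and /saved-environ comes from the dump step above):

MAILTO=""
* * * * * myuser env - `cat /saved-environ` python -m myjob 2>&1 | logger -t myjob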
Managing configuration
Turns out Docker doesn't have anything built in for this. You either have to use external services to store configuration and orchestrate restarting the containers, or rely on volumes. Using volumes has some limitations, but it seems way more approachable.
The gist of it is having an image just for configuration - a docker/conf/Dockerfile like this:
FROM alpine:3.3
RUN mkdir -p /etc/app/conf /etc/app/conf-update
# this will be the default data, if volume wasn't created
COPY settings.conf /etc/app/conf/
VOLUME /etc/app/conf
# this second copy is used for updating
COPY settings.conf /etc/app/conf-update
COPY update-conf.sh /
# similarly to a cp command
ENTRYPOINT ["/update-conf.sh", "/etc/app/conf-update", "/etc/app/conf"]
This update-conf.sh script is what sets this apart from Jeff Nickoloff's example - it logs what gets changed:
#!/bin/sh
set -eu
echo "Updating configuration in $2 with files from $1"
for to_update in $1/*; do
    name="${to_update##*/}"
    echo "Updating: $name"
    echo "* diff from: $to_update to: $2/$name"
    # diff exits non-zero when the files differ (or the target doesn't
    # exist yet), which would trip `set -e` - don't let it kill the script
    diff -U 5 "$to_update" "$2/$name" || true
    echo "* copying: $to_update to: $2/$name"
    cp "$to_update" "$2/$name"
done
echo "Success. Configuration updated!"
Then to use this you just link the volumes (assuming docker-compose):
conf:
  image: 'app-conf'
  build: 'docker/conf'
web:
  ...
  volumes_from:
    - 'conf:ro'
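To push out a configuration change you'd then rebuild the conf image and recreate its container, so the entrypoint re-syncs the volume. A sketch - double-check that your compose version preserves the anonymous volume when recreating:

docker-compose build conf
docker-compose up conf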
This is all I got so far, more in a future article!
[*] Haven't tried it on Windows, but it does use VirtualBox, so I expect a painfully slow experience.
[1] Wheels are binaries for Python packages.
[2] Try building GUI library bindings or pysvn. Enter a world of pain.
[3] py-bt prints the call stack of the Python code, while bt prints the C call stack, which doesn't show Python line numbers or function names.
[4] Pinning the versions is a best practice because:
[5] If you already use Python and don't want an extra 1.7Mb binary, use pysu instead.
[6] You need to configure your Docker daemon and subuid/subgid (another system-wide thing). User namespaces are a security feature for deployment, not development. Imagine having to configure that on every developer's machine, good grief!
[7] See: https://docs.python.org/3/library/os.html#os.waitpid