Desktop Docker (3/3): GPU-enabled Linux Graphical Containers

This is my third post in a series I’m doing on using graphical applications in Docker containers. If you missed parts 1 and/or 2, here’s the flow of the 3-part series:

  1. Desktop Docker (1/3): Linux Graphical Containers
  2. Desktop Docker (2/3): Secure Linux Graphical Containers
  3. Desktop Docker (3/3): GPU-enabled Linux Graphical Containers

During my exploration of using Docker containers to isolate graphical desktop applications, there have been a number of times when having GPU capabilities inside the container would be desirable. In my case, NVIDIA is the GPU choice, for a couple of reasons: 1) it came as part of my laptop, and 2) one of the focuses for CB Technologies is High Performance Computing (HPC), and having containers that can run CUDA code is valuable.

The requirements for using GPU resources inside a Docker container will vary depending on your choice of GPU, the driver you choose to use and your use case. If you are using a non-NVIDIA GPU, or an NVIDIA GPU with the open source Nouveau driver, things are pretty easy. You’ll need the Mesa packages installed in your image and you’ll need to use the --device /dev/dri Docker directive to pass the GPU into the container (more info). If using x11docker, you’ll use the --gpu parameter. This will get you GPU-accelerated graphics (visualization) in your container. If you’re using the NVIDIA proprietary driver and you just want accelerated graphics from your container, the easiest way is to pass the graphics device into the container and then run the EXACT SAME VERSION of the NVIDIA driver inside the container. x11docker includes an automated way of installing the driver in the container, which I mentioned at the end of the previous blog.
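To make the Mesa/Nouveau case concrete, here’s a minimal sketch; “gui-image” is a hypothetical image that has the Mesa packages and an X client such as glxgears installed:

# Pass the DRI device nodes straight through, sharing the host X socket:
docker run --rm -it \
  --device /dev/dri \
  -e DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  gui-image glxgears

# The x11docker equivalent:
x11docker --gpu gui-image glxgears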

If you want a more robust method of passing an NVIDIA GPU using the proprietary driver into containers, one that doesn’t require getting the exact same version driver into the container and that lets you determine which GPU capabilities the container can use (graphics, compute, etc.), nvidia-docker is for you. It’s a more complex install, but if you’ll be doing more advanced NVIDIA GPU work it’s worth the effort. A good overview of the history from NVIDIA’s perspective can be found here. This technology is now at version 2, and the concept is that NVIDIA has created a separate runtime for Docker that allows GPU capabilities to be passed through to a container WITHOUT having to install the driver in the container (you do still need to install a few things, though).
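To give a feel for what that looks like once installed (the install itself is covered below), the NVIDIA runtime reads environment variables such as NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES to control which GPUs and which driver features (compute, graphics, utility, etc.) a container sees. A minimal sketch:

docker run --runtime=nvidia --rm \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvidia/cuda nvidia-smi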

Part of the reason for writing this blog is to help people through some of the issues I ran into when installing the nvidia-docker technology. As I mentioned in my previous posts in this series, I use OpenSUSE Leap 15 as my main OS. Unfortunately, nvidia-docker is not officially supported on Leap 15, so I needed to work outside the supported path to get the technology running. In order for this to work, you need the following:

  • An NVIDIA card
  • A working install of Docker at a supported version
  • An NVIDIA CUDA supported platform — Have a look at NVIDIA’s CUDA Linux Installation Guide to validate.
  • An nvidia-docker supported platform — Have a look here to determine the supported distributions.

The only unsupported part in my case is with nvidia-docker. Fortunately, one of the nvidia-docker devs stepped in and provided an unsupported workaround for Leap 15. If your platform doesn’t support CUDA I believe you’re out of luck.

If you have a platform that is fully supported, by all means use the binary packages available and shortcut this whole thing. In my case, I’m running a Quadro M620 and already had the driver installed and working. But when I went to install CUDA by binary package, a GeForce driver was installed that clobbered my working driver. I therefore chose to install CUDA manually using the runfile. The information below discusses how I was able to get nvidia-docker working, including some of the abnormal steps that were required. Hopefully your experience will be cleaner, but if not, I hope this will give you some options and guidance to get things working properly. In the content below I’ve noted the items that are OpenSUSE specific.

  1. Install NVIDIA Driver
    The first step is to get your NVIDIA driver installed properly. There are a couple of ways to install the proprietary NVIDIA driver. If you want to use your package manager, refer to the documentation specific to your OS. In my case, I chose to install manually using NVIDIA’s runfile (the hard way) so I could control updates myself. I did this by booting my system without graphics, adding “nomodeset 3” to my kernel parameters by editing the Grub menu entry during boot. The end of my kernel parameters looked something like this: “splash=silent quiet nomodeset 3 showopts”. Once at the CLI, all that’s required is to chmod +x NVIDIA-Linux-x86_64-410.93.run and then sudo ./NVIDIA-Linux-x86_64-410.93.run, answering the questions posed along the way.
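    Condensed, the runfile install looks like this (the version shown is the one I used; substitute your own):

      # Boot to a text console by adding "nomodeset 3" to the kernel
      # parameters in the Grub menu entry, then:
      chmod +x NVIDIA-Linux-x86_64-410.93.run
      sudo ./NVIDIA-Linux-x86_64-410.93.run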
  2. Install a Supported Version of Docker
    I’m not going to go over how to install Docker here, as it’s very well documented all over the net for nearly every platform. What I will say is that the default Leap 15 docker package is rather old (17.09.1_ce). Prior to this activity, I ran into a situation with docker-compose that required me to upgrade to a later version. Since I had to upgrade, I thought I’d upgrade to the latest version (18.09.0_ce), and did so using an “Experimental” package found on software.opensuse.org. I don’t view this as truly experimental, but rather just not supported by OpenSUSE, similar to installing from a PPA on Ubuntu. I had no issues with the install, and everything “Docker” worked flawlessly.
    To install Docker on other OSes, please refer to the documentation specific to your OS.
  3. Configure NVIDIA’s Persistence Daemon
    During this process I came across a good reference blog by Jonathan Petitcolas. In it he talks about enabling the NVIDIA Persistence Daemon. I won’t repeat the process here, so click on the link and follow the steps in that section. Note that I didn’t need to deal with UDEV rules.
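    On a systemd-based system, the end result boils down to enabling the daemon’s service, something like the following (a sketch; the unit file setup itself is covered in the linked post):

      sudo systemctl enable --now nvidia-persistenced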
  4. Install CUDA Toolkit
    As mentioned previously, I ran into an issue installing from binary packages, so I opted to install manually from NVIDIA’s runfile. You can download any of the CUDA Toolkit installers from the CUDA Toolkit 10.0 Download site. Just click on your OS, your architecture, your distro and the type of install you want to do; runfile in my case. After downloading, a chmod +x cuda_10.0.130_410.48_linux.run and then sudo ./cuda_10.0.130_410.48_linux.run and I was on my way. The installer will ask questions about what to install. I only installed the toolkit and samples; don’t skip the samples, as they’re required to validate that CUDA works properly. After the install, run nvidia-smi and you should see something like the following:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro M620         On   | 00000000:01:00.0  On |                  N/A |
    | N/A   48C    P0    N/A /  N/A |   1468MiB /  1968MiB |     12%      Default |
    +-------------------------------+----------------------+----------------------+
    

    Note that positive results from nvidia-smi don’t guarantee that CUDA is working properly. To verify that, we need to look back at Jonathan’s blog, just above this link and below the UDEV rules section. Basically, cd into the ~/NVIDIA_CUDA-10.0_Samples directory and run make. Take a break while it builds, and when you come back run ./bin/x86_64/linux/release/deviceQuery | tail -n 1. If the output says Result = PASS, you’re good to go with CUDA!
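    Condensed, the validation sequence is:

      cd ~/NVIDIA_CUDA-10.0_Samples
      make
      ./bin/x86_64/linux/release/deviceQuery | tail -n 1
      # Expected output: Result = PASS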

  5. Install nvidia-docker
    As mentioned earlier, if binary packages are available for your OS — and they work — by all means, use them.
    For my first attempt, I thought I’d take the CentOS repo, add it to my system (since OpenSUSE is RPM-based) and see if it worked. If I remember correctly, it installed, but when I tried to run nvidia-docker I got errors having to do with AppArmor. That sent me in search of the unsupported workaround for Leap 15. So here’s how I progressed, in a newly created directory:

    1. git clone https://github.com/dev-zero/nvidia-container-runtime.git -b opensuse-support nvidia-container-runtime-opensuse
    2. git clone https://github.com/dev-zero/nvidia-docker.git -b opensuse-support nvidia-docker-opensuse
    3. As I mentioned earlier, the default Docker version in OpenSUSE Leap 15 is dated, and I installed the current version. So before we can continue with the process, we need to tweak the Makefiles to reflect the proper Docker version. This process could be beneficial for other OSes if you’re going to compile the nvidia-docker components manually and need to support a Docker version that isn’t officially supported by NVIDIA’s code. In nvidia-container-runtime-opensuse/runtime/Makefile I changed one line and added a section for version 18.09.0-%-runc (validating the runc fingerprint via “docker info”):
      34c34,37
      < opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 17.09.1)
      ---
      > opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 18.09.0)
      >
      > 18.09.0-%-runc:
      >       echo "69663f0bd4b60df09991c08812a60108003fa340"
      

      Then in nvidia-docker-opensuse/Makefile I changed the following line:

      41c41
      < opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 17.09.1_ce)
      ---
      > opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 18.09.0_ce)
      

      From that point, I continued the normal process.

    4. leap_version="15.0"
    5. make -C nvidia-container-runtime-opensuse opensuse_leap${leap_version}
    6. make -C nvidia-docker-opensuse opensuse_leap${leap_version}
    7. To get nvidia-container-runtime (part of nvidia-docker) on openSUSE Leap 15.0, add the libnvidia-container repo from centos7 for now, since the CUDA Toolkit repo does not yet contain packages for it:
      sudo zypper ar -c 'https://nvidia.github.io/libnvidia-container/centos7/$basearch' nvidia-container-runtime
    8. sudo zypper install nvidia-{container-runtime,docker}-opensuse/dist/opensuse_leap${leap_version}/*.rpm
      Note: ignore warnings about unsigned packages
    9. Here we need to make a departure from the normal process. The nvidia-docker2 package changes the way the docker service starts. It adds a new runtime to
      /etc/docker/daemon.json and also overrides the systemd service file for docker.service at /usr/lib/systemd/system/docker.service.d/nvidia-docker.conf. Unfortunately, the parameters used by default didn’t work, at least in my case. With the defaults, when I started the docker service I got the following errors:

      eSubConnStateChange: 0xc42095e9f0, CONNECTING" module=grpc
      Jan 28 12:39:50 zeus dockerd[19765]: time="2019-01-28T12:39:50.713328360-08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\"" module=libcontainerd namespace=moby
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713333172-08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0  }. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713478303-08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0  }. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713536953-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42095e9f0, TRANSIENT_FAILURE" module=grpc
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713687894-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420176690, TRANSIENT_FAILURE" module=grpc
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713682952-08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\"" module=libcontainerd namespace=moby
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713792660-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420176690, CONNECTING" module=grpc
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713891656-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42095e9f0, CONNECTING" module=grpc
      Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713965683-08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\"" module=libcontainerd namespace=moby
      

      The resolution turned out to be very simple: I just had to remove the containerd parameter, and everything worked. I’m still not sure what the implications are, but I’ve had no problems running with the change. This fix might also be useful on other OSes in a similar situation.

      4c4
      < ExecStart=/usr/bin/dockerd --containerd /run/containerd/containerd.sock $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
      ---
      > ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
      
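      For context, the runtime registration that nvidia-docker2 adds to /etc/docker/daemon.json looks like this (reproduced here for reference from the standard nvidia-docker2 packaging; this part needed no changes, only the systemd override did):

      {
          "runtimes": {
              "nvidia": {
                  "path": "nvidia-container-runtime",
                  "runtimeArgs": []
              }
          }
      }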
    10. Next, fully reload the systemd configuration and restart the docker daemon.
      This is required since we changed the flags passed to the docker daemon.
      sudo systemctl daemon-reload
      sudo systemctl restart docker
    11. Test the nvidia runtime:
      docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
      The results should be similar to the output of nvidia-smi above, and assuming it works, you’ll have NVIDIA GPU-enabled Docker containers!
    12. Run something real:
      1. mkdir nbody && cd nbody
      2. Download the nbody Dockerfile for CUDA OpenGL into the nbody directory
      3. docker build -t nbody .
      4. xhost +si:localuser:root
      5. docker run --runtime=nvidia -ti --rm -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix nbody
        This should present the sample running in an NVIDIA GPU-enabled Docker container. (A sketch of what such a Dockerfile might contain follows this list.)
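      For reference, here’s a sketch of what such a Dockerfile might contain. This is NOT the exact linked file; the base image tag, package names and paths below are assumptions:

        FROM nvidia/cudagl:10.0-devel-ubuntu18.04
        # Pull in the CUDA samples plus the OpenGL build dependencies
        # (package names assumed; adjust for your base image)
        RUN apt-get update && apt-get install -y --no-install-recommends \
                cuda-samples-10-0 build-essential freeglut3-dev \
                libglu1-mesa-dev libxi-dev libxmu-dev && \
            rm -rf /var/lib/apt/lists/*
        # Build just the nbody sample and run it by default
        WORKDIR /usr/local/cuda/samples/5_Simulations/nbody
        RUN make
        CMD ["./nbody"]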

Now, revisiting the topic from part 2 about making a more secure container: at this time, your options for running a separate X server with nvidia-docker are very limited. Using x11docker, the only display options that yield accelerated graphics with nvidia-docker are --hostdisplay and --xorg. In the reading I’ve done on the x11docker site, there is very little love for NVIDIA due to their proprietary, closed source ways. That said, I’d like to formally give a shout out and sincere THANK YOU to @mviereck, the author of x11docker. We’ve communicated quite a lot on this topic and he’s made some tweaks to x11docker to work better with --runtime=nvidia. He’s also helped me solidify this blog series. So once again, thank you!

When using x11docker with nvidia-docker, you CAN use --runtime=nvidia. In x11docker-gui, select --hostdisplay or --xorg for the X server, select --gpu, then go to “Advanced Options”, click the “Additional special options” button and check the --runtime=nvidia option. This yields the following commandline:
x11docker --hostdisplay --gpu -- --runtime=nvidia -- nbody
Normally, when x11docker sees --hostdisplay and --gpu and you are running the proprietary NVIDIA driver, it will check whether the exact same version of the driver already exists in the container image, and if it doesn’t, it will install it (provided you’ve placed the driver in a specific x11docker directory). When x11docker sees that you’re using --runtime=nvidia, it will bypass the NVIDIA proprietary driver install. Currently, the only way to run an nvidia-docker container more securely is to use the --xorg display option, but be aware that this will put your container GUI on a separate TTY. Beyond that, x11docker offers a number of other security measures, as well as letting you easily add functionality to your GPU-enabled container.
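For the more isolated --xorg option just mentioned, the command line differs only in the display option:
x11docker --xorg --gpu -- --runtime=nvidia -- nbody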

So now you’ll be able to develop CUDA in a container sandbox, which is pretty cool.

Let the adventure begin! If you’ve found this blog series helpful or have questions, follow the discourse link below and let me know!
