Building a Singularity Container for Machine Learning, Data Science, & Chemistry

Learning Objectives

  1. Build a Linux based Singularity container.
     - First build a writable sandbox with essential elements.
     - Inspect the container.
     - Install additional software.
     - Convert the sandbox to a read-only SquashFS container image.
  2. Install software & packages from multiple sources.
     - Using apt-get package management system.
     - Compiling from source code.
     - Using Python pip.
     - Using install.packages() function in R.
  3. Software highlight.
     - Jupyter notebook.
     - Tensorflow GPU version.
     - OpenMPI.
     - Popular datascience packages in Python and R.
     - Chemistry/chemoinformatics software: RDkit, OpenBabel, Pybel, & Mordred.
  4. Test the container.
     - Test the GPU version of Tensorflow.

Core Container Build

First, we will build a writable Singularity sandbox with the essential software, languages, and development libraries. To build the sandbox, copy the recipe below into a text file named container.def and then execute:

sudo singularity build --sandbox container/ container.def

Recipe/Definition File

BootStrap: docker
From: ubuntu:bionic

%labels
    APPLICATION_NAME Data Science and Chemistry
    AUTHOR_NAME Rohit Farmer
    YEAR 2021

%help
    Container for data science and chemistry with packages from Python 3 & R 3.6.
    It also includes CUDA and MPI for Tensorflow GPU and parallel processing respectively.

%environment
    # Set system locale

%post
    # Change to tmp directory to download temporary files.
    cd /tmp

    # Install essential software, languages and libraries. 
    apt-get -qq -y update

    export DEBIAN_FRONTEND=noninteractive
    apt-get -qq install -y --no-install-recommends tzdata apt-utils 

    ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime 
    dpkg-reconfigure --frontend noninteractive  tzdata

    apt-get -qq -y update 
    apt-get -qq install -y --no-install-recommends \
        autoconf \
        automake \
        build-essential \
        bzip2 \
        ca-certificates \
        cmake \
        gcc \
        g++ \
        gfortran \
        git \
        gnupg2 \
        libtool \
        libjpeg-dev \
        libpng-dev \
        libtiff-dev \
        libatlas-base-dev \
        libxml2-dev \
        zlib1g-dev \
        libcairo2-dev \
        libeigen3-dev \
        libcupti-dev \
        libpcre3-dev \
        libssl-dev \
        libcurl4-openssl-dev \
        libboost-all-dev \
        libboost-dev \
        libboost-system-dev \
        libboost-thread-dev \
        libboost-serialization-dev \
        libboost-regex-dev \
        libgtk2.0-dev \
        libreadline-dev \
        libbz2-dev \
        liblzma-dev \
        libpcre++-dev \
        libpango1.0-dev \
        libmariadb-client-lgpl-dev \
        libopenblas-dev \
        liblapack-dev \
        libxt-dev \
        neovim \
        openjdk-8-jdk \
        python \
        python-pip \
        python-dev \
        python3-dev \
        python3-pip \
        python3-wheel \
        swig \
        texlive \
        texlive-fonts-extra \
        texinfo \
        vim \
        wget \
        xvfb \
        xauth \
        xfonts-base

    export LANG=C.UTF-8 LC_ALL=C.UTF-8

# Add NVIDIA package repositories.
    apt-key adv --fetch-keys
    dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
    apt-get update
    apt-get -qq install -y --no-install-recommends ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
    apt-get update

# Install NVIDIA driver (optional)
    # apt-get install --no-install-recommends nvidia-driver-430

# Install development and runtime libraries.
    apt-get install -y --no-install-recommends \
        cuda-10-1 \
        libcudnn7=  \

# Install TensorRT. Requires that libcudnn7 is installed above.
    apt-get install -y --no-install-recommends libnvinfer6=6.0.1-1+cuda10.1 \
        libnvinfer-dev=6.0.1-1+cuda10.1

# Update python pip.
    python3 -m pip --no-cache-dir install --upgrade pip
    python3 -m pip --no-cache-dir install setuptools --upgrade
    python -m pip --no-cache-dir install setuptools --upgrade

# Install R 3.6.
    echo "deb bionic-cran35/" >> /etc/apt/sources.list
    apt-key adv --keyserver --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
    apt-get update
    apt-get install -y --no-install-recommends r-base
    apt-get install -y --no-install-recommends r-base-dev

# Install Jupyter notebook with Python and R support.
    python3 -m pip --no-cache-dir install jupyter
    R --quiet --slave -e 'install.packages(c("IRkernel"), repos="")'

# Install MPI (match the version with the cluster).
    mkdir -p /tmp/mpi
    cd /tmp/mpi
    wget -O openmpi-2.1.0.tar.bz2
    tar -xjf openmpi-2.1.0.tar.bz2
    cd openmpi-2.1.0
    ./configure --prefix=/usr/local --with-cuda
    make -j $(nproc)
    make install

# Cleanup
    apt-get -qq clean
    rm -rf /var/lib/apt/lists/*
    rm -rf /tmp/mpi
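
The MPI step in the recipe notes that the Open MPI version should match the one on the cluster. The following is a minimal Python sketch of how that could be checked from the host, assuming `mpirun --version` prints a first line of the form `mpirun (Open MPI) 2.1.0`. The helper names (`parse_openmpi_version`, `same_release_series`, `mpirun_version`) are hypothetical, and treating "same major and minor number" as compatible is an assumption, not a rule from the original recipe; check your cluster's documentation for the actual requirement.

```python
import re
import shutil
import subprocess

# Matches the version in the first line of `mpirun --version`,
# e.g. "mpirun (Open MPI) 2.1.0".
_VERSION_RE = re.compile(r"\(Open MPI\)\s+(\d+(?:\.\d+)+)")


def parse_openmpi_version(text):
    """Return the version as a tuple of ints, or None if no version is found."""
    match = _VERSION_RE.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def same_release_series(a, b):
    """True when both versions are known and share major and minor numbers.

    Assumption: matching major.minor is close enough; your site may be stricter.
    """
    return a is not None and b is not None and a[:2] == b[:2]


def mpirun_version(prefix=()):
    """Run `mpirun --version`, optionally behind a command prefix."""
    cmd = list(prefix) + ["mpirun", "--version"]
    done = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_openmpi_version(done.stdout)


if __name__ == "__main__":
    if shutil.which("mpirun") and shutil.which("singularity"):
        try:
            host = mpirun_version()
            inside = mpirun_version(["singularity", "exec", "container/"])
            print("host:", host, "container:", inside)
            print("same release series:", same_release_series(host, inside))
        except OSError as err:
            print("could not run mpirun:", err)
    else:
        print("mpirun or singularity not found on this host; nothing to compare.")
```

Run it on a cluster node after the sandbox is built; it compares the host's mpirun with the one inside container/.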

Inspect Container

To get a list of the labels defined for the container:

singularity inspect --labels container/

To print the container's help section:

singularity inspect --helpfile container/

To show the container's environment:

singularity inspect --environment container/

To retrieve the definition file used to build the container:

singularity inspect --deffile container/
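
If you want all of this metadata in one pass, the inspect calls can be scripted. This is a minimal Python sketch that only wraps the four flags used in this section; the function names `inspect_commands` and `run_inspections` are hypothetical, and the sandbox path container/ is the one assumed throughout this tutorial.

```python
import shutil
import subprocess

# The four `singularity inspect` flags used in this section.
INSPECT_FLAGS = ("--labels", "--helpfile", "--environment", "--deffile")


def inspect_commands(target):
    """Build one `singularity inspect` argument list per flag."""
    return [["singularity", "inspect", flag, target] for flag in INSPECT_FLAGS]


def run_inspections(target):
    """Run each inspect command and return a {flag: output} dictionary."""
    results = {}
    for cmd in inspect_commands(target):
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        results[cmd[2]] = done.stdout
    return results


if __name__ == "__main__":
    if shutil.which("singularity"):
        for flag, text in run_inspections("container/").items():
            print("==", flag, "==")
            print(text)
    else:
        print("singularity not found on PATH; nothing to inspect.")
```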

Install Data Science and Chemistry Packages

Once the core writable sandbox is built we will install the additional data science and chemistry packages.

To do that execute:
sudo singularity shell --writable container/

Then execute the following lines in the shell environment.

# Install Python packages.
    python3 -m pip --no-cache-dir install numpy pandas h5py pyarrow scikit-learn statsmodels matplotlib seaborn plotly

# Install Tensorflow.
    python3 -m pip --no-cache-dir install tensorflow==2.2.0 

# Install R packages.
    R --quiet --slave -e 'install.packages("tidyverse", version = "1.3.0", repos="")'
    R --quiet --slave -e 'install.packages("tidymodels", version = "0.1.0", repos="")'
    R --quiet --slave -e 'install.packages(c("lme4", "glmnet", "yaml", "jsonlite", "rlang"), repos="")'

# Install RDKit
    export RDBASE=/usr/local/share/rdkit
    mkdir -p /tmp/rdkit
    cd /tmp/rdkit
    tar zxf 2020_03_3.tar.gz
    mv rdkit-2020_03_3 $RDBASE
    mkdir $RDBASE/build
    cd $RDBASE/build
    cmake -DPYTHON_EXECUTABLE=/usr/bin/python3 ..
    make -j $(nproc)
    make install

    ln -s /usr/local/share/rdkit/rdkit /usr/local/lib/python3.6/dist-packages/

# Install OpenBabel.
    apt-get -qq -y update
    apt-get -qq install -y --no-install-recommends openbabel python-openbabel

# Install Mordred Molecular Descriptor Calculator.
    python3 -m pip --no-cache-dir install mordred

# Cleanup
    rm -rf /tmp/rdkit
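
Before converting the sandbox, it can help to confirm that the Python packages installed above can actually be found by python3. The following is a minimal sketch using only the standard library; the list of import names is taken from the install commands in this section, and the function name `check_modules` is hypothetical. OpenBabel is left out because it was installed through apt rather than through pip for python3.

```python
import importlib.util

# Import names for the Python packages installed in this section.
# Note that scikit-learn is imported as `sklearn`.
EXPECTED_MODULES = [
    "numpy", "pandas", "h5py", "pyarrow", "sklearn", "statsmodels",
    "matplotlib", "seaborn", "plotly", "tensorflow", "rdkit", "mordred",
]


def check_modules(names):
    """Return {name: True/False} depending on whether each module can be located.

    find_spec only locates a top-level module; it does not import it,
    so this stays fast even for heavy packages such as tensorflow.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    report = check_modules(EXPECTED_MODULES)
    for name, found in report.items():
        print("{:<12} {}".format(name, "ok" if found else "MISSING"))
    missing = [name for name, found in report.items() if not found]
    print("missing:", ", ".join(missing) if missing else "none")
```

Save the sketch to a file of your choice and run it with python3 inside the sandbox, for example through singularity exec container/ python3 followed by the file name. A located module can still fail at import time (for example, a missing shared library), so treat this as a first check only.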

Convert a Writable Sandbox to a Read Only Compressed Container

Once you are satisfied that you have installed all the required packages, you can convert the writable sandbox to a read-only SquashFS image. SquashFS is a compressed read-only file system for Linux.

sudo singularity build container.sif container/

Install Kernel Specs for Jupyter Notebook for R

Kernel specs are installed from outside the container in the host's home environment.

singularity exec container.sif R --quiet --slave -e 'IRkernel::installspec()'

NOTE: You only need to install the kernel spec once per host.
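
To confirm that the R kernel was registered, one option is to ask Jupyter for its kernel list. This is a minimal Python sketch that parses the output of `jupyter kernelspec list --json`, assuming the output is a JSON object with a top-level "kernelspecs" key and that IRkernel registers itself under its default name "ir". The function name `installed_kernels` is hypothetical.

```python
import json
import shutil
import subprocess


def installed_kernels(listing_json):
    """Return the sorted kernel names from `jupyter kernelspec list --json` output."""
    data = json.loads(listing_json)
    return sorted(data.get("kernelspecs", {}))


if __name__ == "__main__":
    if shutil.which("singularity"):
        cmd = ["singularity", "exec", "container.sif",
               "jupyter", "kernelspec", "list", "--json"]
        done = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
        try:
            names = installed_kernels(done.stdout)
        except ValueError:
            names = []
        print("kernels:", names)
        # "ir" is assumed to be the default name used by IRkernel::installspec().
        print("R kernel present:", "ir" in names)
    else:
        print("singularity not found on PATH; nothing to check.")
```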

Test Script(s)

Tensorflow GPU

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')

if gpus:
    # Run a large matrix multiplication on the first GPU.
    with tf.device('/GPU:0'):
        a = tf.random.normal([10000,20000], 0, 1, tf.float32, seed=1)
        b = tf.random.normal([20000,10000], 0, 1, tf.float32, seed=1)
        c = tf.matmul(a, b)
else:
    print("No GPUs found.")

print("Num GPUs:", len(gpus))

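To execute the script, run python3 inside the container and pass it the path to the script. The --nv flag makes the host's NVIDIA driver libraries and GPU devices available inside the container:

singularity exec --nv container.sif python3

To monitor NVIDIA GPU usage while the script runs:

nvidia-smi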
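
For logging GPU usage rather than watching it interactively, nvidia-smi can emit CSV. The following is a minimal Python sketch that parses the output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The function name `parse_gpu_rows` and the dictionary keys are hypothetical, and rows with non-numeric fields (some GPUs report values such as "[N/A]") are simply skipped.

```python
import shutil
import subprocess

# Query one CSV row per GPU: index, utilization (%), memory used and total (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse the CSV rows printed by QUERY into a list of dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            index, util, used, total = (int(field) for field in fields)
        except ValueError:
            # Skip rows where a value is not reported as a number.
            continue
        rows.append({
            "index": index,
            "util_percent": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
        })
    return rows


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        done = subprocess.run(QUERY, stdout=subprocess.PIPE, universal_newlines=True)
        for row in parse_gpu_rows(done.stdout):
            print(row)
    else:
        print("nvidia-smi not found on PATH; no GPU to monitor.")
```

Running it in a second terminal while the Tensorflow test executes should show non-zero utilization on the GPU in use.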