Scientific Programming Best Practices

Tom Westerhout
2 February 2021

Disclaimer

Contents

  • Reproducibility;
    → keep track of your files
        → Version Control Systems (VCS)
  • Correctness;
  • Performance — some other time.

What is version control?

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

Source: Pro Git Book

Example:


              $ ls
              main_fixed_eq5.tex  main_prb.tex  main_prl_final.tex
              main_prl.tex        main.tex      main_v2.tex
              $ # Thinking...
              $ cp main_prb.tex main_prb_final.tex
            

Local version control systems

  • How to collaborate?
  • Files are all in one place!

Centralized version control systems

  • Collaboration works: everyone checks files out from one central server.
  • Files are still all in one place!

Distributed version control systems

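  • Collaboration works: changes are exchanged between full copies of the repository.
  • Files are no longer all in one place: every clone contains the complete history.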

Git

Of the professional developers who responded to the survey, almost 82% use GitHub as a collaborative tool

Source: Stack Overflow Developer Survey 2020

  • Concepts;
  • Examples & commands;

Repository (or Git project)

repository = files + history

Files:


              quantum_skyrmions/
              ├── Analysis
              │   ├── 19_site_cluster.yml
              │   ├── 7_site_cluster.yml
              │   ├── slurm_main.sh
              │   └── SpinED-x86_64.AppImage
              ├── Drafts
              │   ├── paper.tex
              │   └── references.bib
              ├── Figures
              │   ├── ground_state_energy.pdf
              │   └── topological_invariant.pdf
              ├── Proofs
              ├── Published
              ├── Raw data
              │   ├── exact_diagonalization_result_19.h5
              │   └── exact_diagonalization_result_7.h5
              └── Submitted
            

i.e. your whole project folder

History

The file history appears as snapshots in time called commits

Source: Git Handbook

  • Commits are identified by hashes (SHA-1),
    e.g. 9d33835a8e744c5f9cc950f672885dd706c0852f
  • A hash function is any function that can be used to map data of arbitrary size to fixed-size values (Source: Wikipedia)
  • Commits are assembled into linked lists: each commit stores the hash of its parent.
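
To make the linked-list picture concrete, here is a minimal Python sketch of hash-linked snapshots. It is not Git's actual object format: `make_commit`, `history` and the dictionary layout are hypothetical names chosen for illustration. Only the idea, that a commit's hash covers both its content and its parent's hash, is taken from Git.

```python
import hashlib
import json


def make_commit(message, files, parent=None):
    """Return (hash, commit) for a toy commit; not Git's real object format."""
    commit = {"message": message, "files": files, "parent": parent}
    # Serialize deterministically so that equal commits always get equal hashes.
    payload = json.dumps(commit, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest(), commit


def history(store, head):
    """Walk the linked list from `head` back to the first commit."""
    messages = []
    while head is not None:
        messages.append(store[head]["message"])
        head = store[head]["parent"]
    return messages


store = {}
first, entry = make_commit("Add draft", {"paper.tex": "v1"})
store[first] = entry
second, entry = make_commit("Fix eq. 5", {"paper.tex": "v2"}, parent=first)
store[second] = entry

print(history(store, second))  # newest commit first
```

Because each hash covers the parent's hash, changing an old snapshot changes every hash after it, so a single hash pins down the whole history behind it.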

Git

  • Concepts (repository, commit, hash, branch);

    Questions?

  • Examples & commands;

Example: creating a repository

  • Choose a server: GitHub, GitLab.com, Science GitLab

    GitLab is technically superior,
    but all the cool kids hang out on GitHub.

    → we will use GitHub for examples

Example: following GitHub's instructions


              $ cd quantum_skyrmions/
              $ git init
              Initialized empty Git repository in .../quantum_skyrmions/.git/
              $ git remote add origin https://github.com/twesterhout/quantum_skyrmions.git
          

Example: cloning a repository

What your collaborators will do after you have created a repository:


              $ git clone https://github.com/twesterhout/quantum_skyrmions.git
              Cloning into 'quantum_skyrmions'...
              Username for 'https://github.com': username
              Password for 'https://username@github.com': password
              remote: Enumerating objects: 147, done.
              remote: Counting objects: 100% (147/147), done.
              remote: Compressing objects: 100% (119/119), done.
              remote: Total 147 (delta 30), reused 140 (delta 26), pack-reused 0
              Receiving objects: 100% (147/147), 2.46 MiB | 777.00 KiB/s, done.
              Resolving deltas: 100% (30/30), done.
              $ cd quantum_skyrmions/
          

Example: making changes to files


              $ vi "Analysis/SimpleTests.wl" # Making the changes...
              $ WolframKernel -script Analysis/SimpleTests.wl # Testing...
              Hello world!
              $ git add Analysis/SimpleTests.wl # Stage SimpleTests.wl for the next commit
              $ git commit -m "Implement hello world" # Commit the staged changes
              $ git push origin main # Uploading to remote server...
              Username for 'https://github.com': username
              Password for 'https://username@github.com': password
              Enumerating objects: 9, done.
              Counting objects: 100% (9/9), done.
              Delta compression using up to 8 threads
              Compressing objects: 100% (6/6), done.
              Writing objects: 100% (6/6), 496.66 KiB | 12.42 MiB/s, done.
              Total 6 (delta 2), reused 0 (delta 0), pack-reused 0
              remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
              To https://github.com/twesterhout/quantum_skyrmions
                 bd922d6..817e53d  main -> main
          

Now you are safe:

  • Code is backed up
  • Trivial to revert to a previous state

Example: updating


              $ cd quantum_skyrmions/
              $ git pull origin main # fetch & merge changes from remote
              Username for 'https://github.com': username
              Password for 'https://username@github.com': password
              remote: Enumerating objects: 5, done.
              remote: Counting objects: 100% (5/5), done.
              remote: Compressing objects: 100% (1/1), done.
              remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0
              Unpacking objects: 100% (3/3), 1.76 KiB | 451.00 KiB/s, done.
              From https://github.com/twesterhout/quantum_skyrmions
               * branch            main       -> FETCH_HEAD
                 817e53d..88e8c26  main       -> origin/main
              Updating 817e53d..88e8c26
              Fast-forward
               SimpleTests.wl | 139 ++++++++++++++++++++++++++++++++--------
               1 file changed, 114 insertions(+), 25 deletions(-)
          

Git

  • Concepts (repository, commit, hash, branch, remote)
  • Commands:
        git init        create a new local repository
        git remote add  add a new remote repository
        git clone       clone an existing repository
        git add         stage changes for the next commit
        git commit      record the staged changes
        git push        publish local commits to a remote
        git pull        fetch changes from a remote and merge them locally
        git status      view changes in the working directory

Guideline: have a repository for every project you work on

What to do with data?

Git is not meant to be used with large files!

Solutions:

  • Git Large File Storage (LFS)
  • Data Version Control (DVC)

Git Large File Storage (LFS)

Git LFS handles large files by storing references to the file in the repository, but not the actual file itself.

Source: GitHub Docs

  • Git servers take care of storing the files;
  • Tell git-lfs to track specific files:
    
                        $ git lfs track "*.h5" # Track all HDF5 files
                        Adding path *.h5
                        $ git add "heisenberg_37.h5" # Work using standard git commands
                    
  • Storage limits:
        GitHub Free       2 GB
        GitLab.com Free  10 GB
        Science GitLab   10 GB (probably)
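
The "reference" that ends up in the repository is a small text pointer file. The sketch below builds one in Python from the file's SHA-256 digest and size. The layout is meant to follow the Git LFS v1 pointer format, but treat it as an illustration rather than a replacement for git-lfs; `lfs_pointer` is a hypothetical helper, not part of any tool.

```python
import hashlib


def lfs_pointer(data: bytes) -> str:
    """Build the text of a Git LFS style pointer for the given file content."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )


# Stand-in for a large HDF5 file; real data would be read from disk.
payload = b"\x00" * 1_000_000
pointer = lfs_pointer(payload)
print(pointer)
print(f"{len(payload)} bytes of data -> {len(pointer)} bytes in the repository")
```

The repository history only ever stores the few lines of the pointer, while the server keeps the actual content addressed by its digest.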

Data Version Control (DVC)

DVC is built to make ML models shareable and reproducible. It is designed to handle large files, data sets, machine learning models, and metrics as well as code.

ML = Machine Learning, Source: DVC Homepage

  • Git-like workflow, e.g. dvc add, dvc push etc.
  • You choose where data is stored. Suggestion:
      if you are broke (like me)
        then Surfdrive (500G per RU employee)
        else Amazon S3 (easier to collaborate)

    (In the near future, the Ceph storage cluster managed by C&CZ might give you the best of both.)

Keeping track of files

  • Have a repository for every project you work on!
  • Git LFS for keeping track of production datasets
    (i.e. data to reproduce your figures).
  • (optional) DVC for keeping track of "work-in-progress" data.

Questions?

Reproducibility

  • Sharing your environment
  • Making your code portable

Sharing your environment: virtualization

Virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.

Source: Wikipedia

  • Virtual machines;
  • Hardware drivers;
  • etc.

Virtualization: levels

  • Virtual machines (e.g. VirtualBox, Windows Subsystem for Linux 2):
    • abstract everything, including the kernel;
    • noticeable performance overhead;
  • Containers (e.g. Docker, Singularity):
    • reuse the host kernel, but abstract everything else;
    • negligible performance overhead;
  • Virtual environments (e.g. venv, Conda):
    • abstract only certain applications and libraries;
    • no performance overhead.
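
As a small illustration of the lightest level, the sketch below creates a virtual environment with Python's standard `venv` module and checks whether the running interpreter is itself inside one. The temporary directory and the name `demo-env` are arbitrary choices for the example, and `in_virtual_env` is a hypothetical helper.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_env() -> bool:
    """True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo-env"
    # with_pip=False skips bootstrapping pip, which keeps the example fast.
    venv.create(env_dir, with_pip=False)
    # The environment is just a directory: a config file plus an interpreter.
    has_config = (env_dir / "pyvenv.cfg").is_file()
    print("created:", has_config)

print("running inside a virtual environment:", in_virtual_env())
```

Nothing is emulated here: the environment only redirects where the interpreter looks for packages, which is why this level adds no runtime overhead.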

Conda

Package, dependency and environment management for any language — Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.

Source: Conda documentation

  • Packages — programs, libraries, etc.
  • Environments consist of packages

Pre-installed on the TCM cluster: run
module load Tcm; module load Anaconda-3-2020.07
and you are good to go.

Example: creating environments from files


            $ # Creating an environment...
            $ conda env create -f conda-devel.yml
            $ # Adding to git...
            $ git add conda-devel.yml
            $ git commit -m "Create environment"
            

conda-devel.yml:


                name: lattice_symmetries_devel
                channels:
                  - defaults
                  - weinbe58 # for QuSpin
                dependencies:
                  - python
                  - pip # listed explicitly so that conda can process the pip section below
                  - pip:
                    - black
                    - neovim
                    - loguru
                  - numpy
                  - scipy
                  # Stuff to compile the package locally
                  - gcc_linux-64
                  - gxx_linux-64
                  - cmake
                  - ninja
                  # For benchmarks and testing
                  - numba ==0.48 # QuSpin doesn't work with the latest version
                  - omp # Get multi-threading support for QuSpin
                  - quspin
                  ...
              

Conda

Guideline: have a Conda environment for every project you work on

If you already know what Conda is:

  • Do not use the base environment!
  • Do not use conda install (except for testing)! List packages in the environment file instead, so that the file stays a complete record.

Singularity

Singularity containers can be built to include all of the programs, libraries, data and scripts such that an entire demonstration can be contained and either archived or distributed for others to replicate no matter what version of Linux they are presently running.

Source: Singularity User Guide

  • Container is a single file;
  • Can easily run Docker containers
    (i.e. Container Library + Docker Hub)
  • Imagine having admin rights on the cluster 😏

Pre-installed on the TCM cluster: run
module load Singularity
and you are good to go.

Example: creating a container


              $ # no GPU locally...
              $ nvcc hello.cu -o hello
              bash: nvcc: command not found
              $ # Singularity to the rescue!
              $ singularity build hello.sif Singularity
              INFO:    Starting build...
              INFO:    Running setup scriptlet
              + mkdir -p /workdir
              INFO:    Copying hello.cu to /workdir/
              INFO:    Running post scriptlet
              + /bin/bash /.post.script
              INFO:    Adding runscript
              INFO:    Creating SIF file...
              INFO:    Build complete: hello.sif
              

hello.cu:


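              #include <cstdio>

              __global__ void cuda_hello() {
                  printf("Hello World from GPU!\n");
              }

              int main() {
                  cuda_hello<<<1, 1>>>();
                  // Wait for the kernel to finish so that its output is flushed
                  cudaDeviceSynchronize();
                  return 0;
              }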
              

Singularity:


                Bootstrap: docker
                From: nvidia/cuda:11.0-devel-ubuntu20.04

                %setup
                    mkdir -p ${SINGULARITY_ROOTFS}/workdir

                %files
                    hello.cu /workdir/

                %post
                    cd /workdir
                    nvcc hello.cu -o hello

                %runscript
                    /workdir/hello
              

Disclaimer: this is an advanced example

Reproducibility

  • Sharing your environment (Conda & Singularity).
    Questions?

  • Making your code portable:
    → Static executables
    → AppImages

Static executables


              $ # Compiling locally...
              $ g++-10 -std=c++20 thread.cpp -o thread -lpthread
              $ # Works locally
              $ ./thread
              Stopping...
              $ # And on lilo6
              $ scp thread lilo.science.ru.nl:
              $ ssh lilo6.science.ru.nl ./thread
              Stopping...
              $ # But breaks on lilo5
              $ ssh lilo5.science.ru.nl ./thread
              [...] version `GLIBCXX_3.4.22' not found [...]
              $ # Compile statically!
              $ g++-10 -std=c++20 thread.cpp -o thread \
                  -static -static-libgcc -static-libstdc++ \
                  -Wl,--whole-archive -lpthread -Wl,--no-whole-archive
              $ # Now works on lilo5 as well!
              $ scp thread lilo.science.ru.nl:
              $ ssh lilo5.science.ru.nl ./thread
              Stopping...
              

thread.cpp:


              #include <chrono>
              #include <cstdio>
              #include <thread>

              auto main() -> int {
                using namespace std::chrono_literals;
                // A sleepy worker thread
                auto sleepy_worker = std::jthread{
                  [](std::stop_token stoken) {
                    for (;;) {
                      std::this_thread::sleep_for(100ms);
                      if (stoken.stop_requested()) {
                        std::printf("Stopping...\n");
                        return;
                      }
                    }
                }};
                sleepy_worker.request_stop();
                sleepy_worker.join();
              }
              

AppImage

Linux apps that run anywhere

Download an application, make it executable, and run! No need to install. No system libraries or system preferences are altered.

Source: AppImage Homepage

Example:


              $ # Hmmm, our cluster does not have NeoVim installed...
              $ # No problem!
              $ wget -q https://github.com/neovim/neovim/releases/download/v0.4.4/nvim.appimage
              $ chmod +x nvim.appimage
              $ # Yay!
              $ ./nvim.appimage
            
  • Main idea: bundle all dependencies together with the executable.
  • Great for cases when static linking is not an option.

Reproducibility

  • Sharing your environment (Conda & Singularity).
  • Making your code portable (Static linking & AppImages).

    Questions?

Correctness

Source: Code Complete, by Steve McConnell

  • Code reviews.
  • Tests.
  • Proofs & formal verification methods — some other time.
  • etc. — some other time.

Code reviews

Guideline: have at least one other person read and understand your code

Tests: 0

Guideline: have tests for every project you work on

Level 0:

  • Play around in Python/Matlab/Julia interpreter
    & copy to your script/notebook when it starts working.
  • Have a small Fortran/C/C++ file with print statements
    & ...

Tests: 1

Level 1: put your tests in a file & use a testing framework, e.g.


                $ pytest
                ========================== test session starts ===========================
                platform linux -- Python 3.x.y, pytest-6.x.y, py-1.x.y, pluggy-0.x.y
                cachedir: $PYTHON_PREFIX/.pytest_cache
                rootdir: $REGENDOC_TMPDIR
                collected 1 item

                test_sample.py F                                                    [100%]

                ================================ FAILURES ================================
                ______________________________ test_answer _______________________________

                    def test_answer():
                >       assert inc(3) == 5
                E       assert 4 == 5
                E        +  where 4 = inc(3)

                test_sample.py:6: AssertionError
                ======================== short test summary info =========================
                FAILED test_sample.py::test_answer - assert 4 == 5
                =========================== 1 failed in 0.12s ============================
              

test_sample.py:


                  def inc(x):
                      return x + 1

                  def test_answer():
                      assert inc(3) == 5
                

Source: pytest documentation

Tests: 0.5

Level 0.5 (i.e. a half measure): use asserts liberally, e.g.


              # Python
              assert x > 0, "real log is undefined for non-positive inputs"
            

              // C and C++
              assert(x > 0 && "real log is undefined for non-positive inputs");
            

              # Julia
              @assert x > 0 "real log is undefined for non-positive inputs"
            

Note: asserts can be disabled for production runs (e.g. python -O, or -DNDEBUG for C and C++), so no, they will not slow down your code.
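
A minimal Python sketch of the same idea inside a function; `real_log` is a hypothetical name used only for this example.

```python
import math


def real_log(x: float) -> float:
    """Natural logarithm with an explicit precondition check."""
    assert x > 0, "real log is undefined for non-positive inputs"
    return math.log(x)


print(real_log(1.0))

try:
    real_log(-1.0)
except (AssertionError, ValueError) as err:
    # With asserts enabled (the default) this is an AssertionError with our message;
    # under `python -O` the assert is stripped and math.log raises ValueError instead.
    print("caught:", err)
```

The assert documents the assumption right where it matters and fails with a readable message at the call that violates it, instead of somewhere deep inside a later computation.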

Tests: 2

Level 2: Continuous Integration

CI.yml:


                name: Ubuntu
                on: [push, pull_request] # assumed triggers; a workflow needs an `on` key to run
                env:
                  BUILD_TYPE: Debug
                  INSTALL_LOCATION: .local
                jobs:
                  build:
                    strategy:
                      matrix:
                        gcc-version: [7, 8, 9, 10]
                    runs-on: ubuntu-latest
                    steps:
                    - uses: actions/checkout@v2
                      with:
                        submodules: true
                    - name: configure
                      run: |
                        cmake -Bbuild \
                          -DCMAKE_CXX_COMPILER=g++-${{ matrix.gcc-version }} \
                          -DCMAKE_C_COMPILER=gcc-${{ matrix.gcc-version }} \
                          -DCMAKE_BUILD_TYPE=$BUILD_TYPE \
                          -DCMAKE_INSTALL_PREFIX=$GITHUB_WORKSPACE/$INSTALL_LOCATION
                    - name: build
                      run: cmake --build build -j4
                    - name: run tests
                      run: cd build && ctest -VV
                    - name: install project
                      run: cmake --build build --target install
              

Tests

Guideline: reach at least level 0.5 when developing code

Guideline: reach at least level 1 when publishing a paper

How to come up with test cases?

  • Sanity checks: domain-specific knowledge;
  • Work out a few examples analytically;
  • Compare with a different algorithm;
  • Reproduce results from an old paper;
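
The first three ideas can be sketched in a few lines of Python. The integration routines below are hypothetical stand-ins for "your code" and "a different algorithm"; the test functions follow pytest's naming convention but use nothing beyond the standard library.

```python
import math


def trapezoid(f, a, b, n=1000):
    """Integrate f over [a, b] with the composite trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def simpson(f, a, b, n=1000):
    """Integrate f over [a, b] with composite Simpson's rule (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def test_sanity_constant():
    # Sanity check: the integral of 1 over [0, 2] is the interval length.
    assert math.isclose(trapezoid(lambda x: 1.0, 0.0, 2.0), 2.0)


def test_analytic_sine():
    # Worked out by hand: the integral of sin over [0, pi] equals 2.
    assert math.isclose(trapezoid(math.sin, 0.0, math.pi), 2.0, rel_tol=1e-5)


def test_against_simpson():
    # Compare with a different algorithm on a less trivial integrand.
    f = lambda x: math.exp(-x * x)
    assert math.isclose(trapezoid(f, 0.0, 1.0), simpson(f, 0.0, 1.0), rel_tol=1e-5)
```

Saved as a file whose name starts with `test_`, these functions are picked up by running `pytest` in the same directory.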

Recap