Scientific Programming Guidelines

Scientific Programming Best Practices

Tom Westerhout
2 February 2021

Disclaimer

Reproducibility;
→ keep track of your files
→ Version Control Systems (VCS)
Correctness;
Performance — some other time.

What is version control?

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

Source: Pro Git Book

Example:


              $ ls
              main_fixed_eq5.tex  main_prb.tex  main_prl_final.tex
              main_prl.tex        main.tex      main_v2.tex
              $ # Thinking...
              $ cp main_prb.tex main_prb_final.tex

Local version control systems

How to collaborate?
Files are all in one place!

Centralized version control systems

Collaboration is solved: everyone talks to the same server.
Files are still all in one place!

Distributed version control systems

Collaboration is solved.
Files are no longer all in one place: every clone holds the full history.

Git

Of the professional developers who responded to the survey, almost 82% use GitHub as a collaborative tool

Source: Stack Overflow Developer Survey 2020

Concepts;
Examples & commands;

Repository (or Git project)


            repository = files + history

Files:


              quantum_skyrmions/
              ├── Analysis
              │   ├── 19_site_cluster.yml
              │   ├── 7_site_cluster.yml
              │   ├── slurm_main.sh
              │   └── SpinED-x86_64.AppImage
              ├── Drafts
              │   ├── paper.tex
              │   └── references.bib
              ├── Figures
              │   ├── ground_state_energy.pdf
              │   └── topological_invariant.pdf
              ├── Proofs
              ├── Published
              ├── Raw data
              │   ├── exact_diagonalization_result_19.h5
              │   └── exact_diagonalization_result_7.h5
              └── Submitted

i.e. your whole project folder

History

The file history appears as snapshots in time called commits

Source: Git Handbook

Commits are identified by hashes (SHA-1),
e.g. 9d33835a8e744c5f9cc950f672885dd706c0852f
A hash function is any function that can be used to map data of arbitrary size to fixed-size values (Source: Wikipedia)
Commits are assembled into linked lists:
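
A minimal sketch of that idea, assuming a toy commit that holds only a message, a stand-in snapshot string and the hash of its parent. The names `Commit`, `commit` and `log` are made up for illustration. Real Git hashes a different object format, and a merge commit has more than one parent, so real history is a graph and not a plain list.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    message: str
    snapshot: str           # stand-in for the state of all files
    parent: Optional[str]   # hash of the previous commit, None for the first one

    def sha(self) -> str:
        # Hash everything that defines the commit, including the parent's hash.
        payload = f"{self.parent}\n{self.message}\n{self.snapshot}".encode()
        return hashlib.sha1(payload).hexdigest()


def commit(store: Dict[str, Commit], head: Optional[str], message: str, snapshot: str) -> str:
    """Store a new commit on top of `head` and return its hash."""
    new = Commit(message, snapshot, head)
    store[new.sha()] = new
    return new.sha()


def log(store: Dict[str, Commit], head: Optional[str]) -> List[str]:
    """Follow the parent links from `head` back to the first commit."""
    messages = []
    while head is not None:
        messages.append(store[head].message)
        head = store[head].parent
    return messages


store: Dict[str, Commit] = {}
head = commit(store, None, "Add draft", "paper.tex v1")
head = commit(store, head, "Fix eq. 5", "paper.tex v2")
print(len(head), log(store, head))  # 40 ['Fix eq. 5', 'Add draft']
```

Because the parent's hash is part of the hashed data, changing an old commit changes the hash of every commit after it.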

Git

Concepts (repository, commit, hash, branch);

Questions?
Examples & commands;

Example: creating a repository

Choose a server: GitHub, GitLab.com, Science GitLab

GitLab is technically superior,
but all the cool kids hang out on GitHub.

→ we will use GitHub for examples

Create a repository using the web interface.

Example: following GitHub's instructions


              $ cd quantum_skyrmions/
              $ git init
              Initialized empty Git repository in .../quantum_skyrmions/.git/
              $ git remote add origin https://github.com/twesterhout/quantum_skyrmions.git

Example: cloning a repository

What your collaborators will do after you have created a repository:


              $ git clone https://github.com/twesterhout/quantum_skyrmions.git
              Cloning into 'quantum_skyrmions'...
              Username for 'https://github.com': username
              Password for 'https://username@github.com': password
              remote: Enumerating objects: 147, done.
              remote: Counting objects: 100% (147/147), done.
              remote: Compressing objects: 100% (119/119), done.
              remote: Total 147 (delta 30), reused 140 (delta 26), pack-reused 0
              Receiving objects: 100% (147/147), 2.46 MiB | 777.00 KiB/s, done.
              Resolving deltas: 100% (30/30), done.
              $ cd quantum_skyrmions/

Example: making changes to files


              $ vi "Analysis/SimpleTests.wl" # Making the changes...
              $ WolframKernel -script Analysis/SimpleTests.wl # Testing...
              Hello world!
              $ git add Analysis/SimpleTests.wl # Track SimpleTests.wl
              $ git commit -m "Implement hello world" # Commit the changes
              $ git push origin main # Uploading to remote server...
              Username for 'https://github.com': username
              Password for 'https://username@github.com': password
              Enumerating objects: 9, done.
              Counting objects: 100% (9/9), done.
              Delta compression using up to 8 threads
              Compressing objects: 100% (6/6), done.
              Writing objects: 100% (6/6), 496.66 KiB | 12.42 MiB/s, done.
              Total 6 (delta 2), reused 0 (delta 0), pack-reused 0
              remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
              To https://github.com/twesterhout/quantum_skyrmions
                 bd922d6..817e53d  main -> main

Now you are safe:

Code is backed up
Trivial to revert to a previous state

Example: updating


              $ cd quantum_skyrmions/
              $ git pull origin main # fetch & merge changes from remote
              Username for 'https://github.com': username
              Password for 'https://username@github.com': password
              remote: Enumerating objects: 5, done.
              remote: Counting objects: 100% (5/5), done.
              remote: Compressing objects: 100% (1/1), done.
              remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0
              Unpacking objects: 100% (3/3), 1.76 KiB | 451.00 KiB/s, done.
              From https://github.com/twesterhout/quantum_skyrmions
               * branch            main       -> FETCH_HEAD
                 817e53d..88e8c26  main       -> origin/main
              Updating 817e53d..88e8c26
              Fast-forward
               SimpleTests.wl | 139 ++++++++++++++++++++++++++++++++--------
               1 file changed, 114 insertions(+), 25 deletions(-)
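
"Fast-forward" in the output above means that the local branch had no commits of its own, so Git only had to move the branch pointer. A rough sketch of that check, using the abbreviated hashes from the transcripts and assuming each commit is the direct parent of the next (the transcripts do not show this). The helper `is_ancestor` is hypothetical and not a Git API.

```python
from typing import Dict, Optional

# Parent links for the three commits that appear in the transcripts.
# The parent of bd922d6 is not shown there, so it is left as None.
parents: Dict[str, Optional[str]] = {
    "bd922d6": None,
    "817e53d": "bd922d6",
    "88e8c26": "817e53d",
}


def is_ancestor(old: str, new: str) -> bool:
    """Return True if `old` is reachable from `new` by following parent links."""
    current: Optional[str] = new
    while current is not None:
        if current == old:
            return True
        current = parents[current]
    return False


local_main, origin_main = "817e53d", "88e8c26"
if is_ancestor(local_main, origin_main):
    local_main = origin_main  # fast-forward: only the branch pointer moves
print(local_main)  # 88e8c26
```

If the local branch had commits that the remote lacks, the check would fail and a real merge would be needed.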

Git

Concepts (repository, commit, hash, branch, remote)
git init — create new local repository
    remote add — add new remote repository
    clone — clone an existing repository
    add — add some changes to the next commit
    commit — commit changes
    push — publish local changes on a remote
    pull — get all changes from remote to local repository
    status — view changes in working directory

Guideline: have a repository for every project you work on

What to do with data?

Git is not meant to be used with large files!

Solutions:

Git Large File Storage (LFS)
Data Version Control (DVC)

Git Large File Storage (LFS)

Git LFS handles large files by storing references to the file in the repository, but not the actual file itself.

Source: GitHub Docs

Git servers take care of storing the files;

Tell git-lfs to track specific files:


                    $ git lfs track "*.h5" # Track all HDF5 files
                    Adding path *.h5
                    $ git add .gitattributes # The tracking rules are stored here
                    $ git add "heisenberg_37.h5" # Work using standard git commands
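
The "reference" that ends up in the repository is a small text file called a pointer. A sketch of its layout as described in the Git LFS specification, where `lfs_pointer` is a hypothetical helper and the zero-filled buffer merely stands in for a large HDF5 file:

```python
import hashlib


def lfs_pointer(data: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file content."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )


# Stand-in for the content of a large data file (made-up data).
content = b"\x00" * 1_000_000
pointer = lfs_pointer(content)
print(pointer)
print(len(pointer), "bytes committed instead of", len(content))
```

Git versions only these few lines; the content itself is uploaded to the server's LFS storage and fetched on checkout.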

Storage limits:

GitHub Free      2 GB
GitLab.com Free  10 GB
Science GitLab   10 GB (probably)

Data Version Control (DVC)

DVC is built to make ML models shareable and reproducible. It is designed to handle large files, data sets, machine learning models, and metrics as well as code.

ML = Machine Learning, Source: DVC Homepage

Git-like workflow, e.g. dvc add, dvc push etc.
You choose where data is stored. Suggestion:
  if you are broke (like me)
    then Surfdrive (500G per RU employee)
    else Amazon S3 (easier to collaborate)

(In the near future, a Ceph storage cluster managed by C&CZ might give you the best of both.)

Keeping track of files

Have a repository for every project you work on!
Git LFS for keeping track of production datasets
(i.e. data to reproduce your figures).
(optional) DVC for keeping track of "work-in-progress" data.

Questions?

Reproducibility

Sharing your environment
Making your code portable

Sharing your environment: virtualization

Virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.

Source: Wikipedia

Virtual machines;
Hardware drivers;
etc.

Virtualization: levels

Virtual machines (e.g. VirtualBox, Windows Subsystem for Linux):
- abstract everything, including the kernel;
- performance overhead.
Containers (e.g. Docker, Singularity):
- reuse the host kernel, but abstract everything else;
- negligible performance overhead.
Virtual environments (e.g. venv, Conda):
- abstract only certain applications;
- no performance overhead.
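
To make the lightest level concrete, here is a small sketch that uses Python's standard `venv` module to create a throwaway environment and look at what it contains. The directory name `env` is arbitrary, and `with_pip=False` is chosen only to keep the sketch fast and offline.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "env"
    # Create the environment without installing pip into it.
    venv.create(env_dir, with_pip=False)
    has_config = (env_dir / "pyvenv.cfg").is_file()
    top_level = sorted(p.name for p in env_dir.iterdir())

print(has_config, top_level)
```

The environment is an ordinary directory with its own interpreter entry point and package folder. The kernel and the system libraries are shared with the host, which is why there is no overhead.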

Conda

Package, dependency and environment management for any language — Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.

Source: Conda documentation

Packages — programs, libraries, etc.
Environments consist of packages

Pre-installed on the TCM cluster. Run
module load Tcm; module load Anaconda-3-2020.07
and you are good to go.

Example: creating environments from files


            $ # Creating an environment...
            $ conda env create -f conda-devel.yml
            $ # Adding to git...
            $ git add conda-devel.yml
            $ git commit -m "Create environment"

conda-devel.yml:


                name: lattice_symmetries_devel
                channels:
                  - defaults
                  - weinbe58 # for QuSpin
                dependencies:
                  - python
                  - pip # conda warns when pip packages are listed without pip itself
                  - pip:
                    - black
                    - neovim
                    - loguru
                  - numpy
                  - scipy
                  # Stuff to compile the package locally
                  - gcc_linux-64
                  - gxx_linux-64
                  - cmake
                  - ninja
                  # For benchmarks and testing
                  - numba ==0.48 # QuSpin doesn't work with the latest version
                  - omp # Get multi-threading support for QuSpin
                  - quspin
                  ...

Conda

Guideline: have a Conda environment for every project you work on

If you already know what Conda is:

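Do not use the base environment!
Do not use conda install (except for testing)! Add the package to the environment file instead, so that the file stays a complete description of the environment.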

Singularity

Singularity containers can be built to include all of the programs, libraries, data and scripts such that an entire demonstration can be contained and either archived or distributed for others to replicate no matter what version of Linux they are presently running.

Source: Singularity User Guide

A container is a single file;
It can run Docker containers directly
(i.e. images come from both the Container Library and Docker Hub);
Imagine having admin rights on the cluster 😏

Pre-installed on the TCM cluster. Run
module load Singularity
and you are good to go.

Example: creating a container


              $ # no GPU locally...
              $ nvcc hello.cu -o hello
              bash: nvcc: command not found
              $ # Singularity to the rescue!
              $ singularity build hello.sif Singularity
              INFO:    Starting build...
              INFO:    Running setup scriptlet
              + mkdir -p /workdir
              INFO:    Copying hello.cu to /workdir/
              INFO:    Running post scriptlet
              + /bin/bash /.post.script
              INFO:    Adding runscript
              INFO:    Creating SIF file...
              INFO:    Build complete: hello.sif

hello.cu:


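              #include <cstdio>

              // Kernel: runs on the GPU
              __global__ void cuda_hello() {
                  printf("Hello World from GPU!\n");
              }

              int main() {
                  cuda_hello<<<1, 1>>>();
                  // Wait for the kernel to finish so that its output is flushed
                  cudaDeviceSynchronize();
                  return 0;
              }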

Singularity:


                Bootstrap: docker
                From: nvidia/cuda:11.0-devel-ubuntu20.04

                %setup
                    mkdir -p ${SINGULARITY_ROOTFS}/workdir

                %files
                    hello.cu /workdir/

                %post
                    cd /workdir
                    nvcc hello.cu -o hello

                %runscript
                    /workdir/hello

Disclaimer: this is an advanced example

Reproducibility

Sharing your environment (Conda & Singularity).
Questions?
Making your code portable:
→ Static executables
→ AppImages

Static executables


              $ # Compiling locally...
              $ g++-10 -std=c++20 thread.cpp -o thread -lpthread
              $ # Works locally
              $ ./thread
              Stopping...
              $ # And on lilo6
              $ scp thread lilo.science.ru.nl:
              $ ssh lilo6.science.ru.nl ./thread
              Stopping...
              $ # But breaks on lilo5
              $ ssh lilo5.science.ru.nl ./thread
              [...] version `GLIBCXX_3.4.22' not found [...]
              $ # Compile statically!
              $ g++-10 -std=c++20 thread.cpp -o thread \
                  -static -static-libgcc -static-libstdc++ \
                  -Wl,--whole-archive -lpthread -Wl,--no-whole-archive
              $ # Now works on lilo5 as well!
              $ scp thread lilo.science.ru.nl:
              $ ssh lilo5.science.ru.nl ./thread
              Stopping...

thread.cpp:


              #include <chrono>
              #include <cstdio>
              #include <thread>

              auto main() -> int {
                using namespace std::chrono_literals;
                // A sleepy worker thread
                auto sleepy_worker = std::jthread{
                  [](std::stop_token stoken) {
                    for (;;) {
                      std::this_thread::sleep_for(100ms);
                      if (stoken.stop_requested()) {
                        std::printf("Stopping...\n");
                        return;
                      }
                    }
                }};
                sleepy_worker.request_stop();
                sleepy_worker.join();
              }

More info:

A case for static linking in scientific computing
Difference between static and shared libraries
GCC link options (e.g. -static)

AppImage

Linux apps that run anywhere

Download an application, make it executable, and run! No need to install. No system libraries or system preferences are altered.

Source: AppImage Homepage

Example:


              $ # Hmmm, our cluster does not have NeoVim installed...
              $ # No problem!
              $ wget -q https://github.com/neovim/neovim/releases/download/v0.4.4/nvim.appimage
              $ chmod +x nvim.appimage
              $ # Yay!
              $ ./nvim.appimage

Main idea: bundle all dependencies together with the executable.
Great for cases when static linking is not an option.

Reproducibility

Sharing your environment (Conda & Singularity).
Making your code portable (Static linking & AppImages).

Questions?

Correctness

Source: Code Complete, by Steve McConnell

Code reviews.
Tests.
Proofs & formal verification methods — some other time.
etc. — some other time.

Code reviews

Guideline: have at least one other person read and understand your code

Tests: 0

Guideline: have tests for every project you work on

Level 0:

Play around in Python/Matlab/Julia interpreter
& copy to your script/notebook when it starts working.
Have a small Fortran/C/C++ file with print statements
& ...

Tests: 1

Level 1: put your tests in a file & use a testing framework, e.g.


                $ pytest
                ========================== test session starts ===========================
                platform linux -- Python 3.x.y, pytest-6.x.y, py-1.x.y, pluggy-0.x.y
                cachedir: $PYTHON_PREFIX/.pytest_cache
                rootdir: $REGENDOC_TMPDIR
                collected 1 item

                test_sample.py F                                                    [100%]

                ================================ FAILURES ================================
                ______________________________ test_answer _______________________________

                    def test_answer():
                >       assert inc(3) == 5
                E       assert 4 == 5
                E        +  where 4 = inc(3)

                test_sample.py:6: AssertionError
                ======================== short test summary info =========================
                FAILED test_sample.py::test_answer - assert 4 == 5
                =========================== 1 failed in 0.12s ============================

test_sample.py:


                  def inc(x):
                      return x + 1

                  def test_answer():
                      assert inc(3) == 5

Source: pytest documentation

Tests: 0.5

Level 0.5 (i.e. a half measure): use asserts liberally, e.g.


              # Python
              assert x > 0, "real log is undefined for non-positive inputs"


              // C and C++
              assert(x > 0 && "real log is undefined for non-positive inputs");


              # Julia
              @assert x > 0 "real log is undefined for non-positive inputs"

Note: asserts can be disabled for production runs (e.g. python -O, or -DNDEBUG for C and C++), so no, they will not slow down your code.
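
A small sketch of how such an assert behaves in practice. The function name `real_log` is made up for illustration, and the script is assumed to run without `-O`, because with `-O` the assert is skipped and `math.log` raises its own `ValueError` instead.

```python
import math


def real_log(x: float) -> float:
    # Precondition check: documents the assumption and fails loudly if it breaks.
    assert x > 0, "real log is undefined for non-positive inputs"
    return math.log(x)


print(real_log(1.0))  # 0.0

caught = None
try:
    real_log(-1.0)
except AssertionError as error:
    caught = str(error)
print("caught:", caught)
```

The failing call stops at the line that states the violated assumption, which is usually much closer to the actual bug than the place where a wrong number would eventually surface.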

Tests: 2

Level 2: Continuous Integration

CI.yml:


                name: Ubuntu
                # Trigger assumed here; a GitHub Actions workflow needs an `on:` key
                on: [push, pull_request]
                env:
                  BUILD_TYPE: Debug
                  INSTALL_LOCATION: .local
                jobs:
                  build:
                    strategy:
                      matrix:
                        gcc-version: [7, 8, 9, 10]
                    runs-on: ubuntu-latest
                    steps:
                    - uses: actions/checkout@v2
                      with:
                        submodules: true
                    - name: configure
                      run: |
                        cmake -Bbuild \
                          -DCMAKE_CXX_COMPILER=g++-${{ matrix.gcc-version }} \
                          -DCMAKE_C_COMPILER=gcc-${{ matrix.gcc-version }} \
                          -DCMAKE_BUILD_TYPE=$BUILD_TYPE \
                          -DCMAKE_INSTALL_PREFIX=$GITHUB_WORKSPACE/$INSTALL_LOCATION
                    - name: build
                      run: cmake --build build -j4
                    - name: run tests
                      run: cd build && ctest -VV
                    - name: install project
                      run: cmake --build build --target install

Tests

Guideline: reach at least level 0.5 when developing code

Guideline: reach at least level 1 when publishing a paper

How to come up with test cases?

Sanity checks: domain-specific knowledge;
Work out a few examples analytically;
Compare with a different algorithm;
Reproduce results from an old paper;
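
The first three ideas can be sketched for a toy problem, numerical integration. The functions `trapezoid` and `midpoint` and the tolerances are assumptions chosen for this illustration; the test functions follow the naming that pytest collects automatically.

```python
import math
from typing import Callable


def trapezoid(f: Callable[[float], float], a: float, b: float, n: int = 1000) -> float:
    """Integrate f over [a, b] with the composite trapezoidal rule."""
    h = (b - a) / n
    inner = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


def midpoint(f: Callable[[float], float], a: float, b: float, n: int = 1000) -> float:
    """A different algorithm for the same integral (composite midpoint rule)."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def test_sanity_swapping_limits_flips_sign():
    # Domain knowledge: reversing the integration limits changes the sign.
    assert math.isclose(trapezoid(math.exp, 0.0, 1.0), -trapezoid(math.exp, 1.0, 0.0))


def test_analytic_example():
    # Worked out by hand: the integral of sin over [0, pi] is exactly 2.
    assert math.isclose(trapezoid(math.sin, 0.0, math.pi), 2.0, rel_tol=1e-5)


def test_against_different_algorithm():
    # Two independent methods should agree within their discretization error.
    assert math.isclose(
        trapezoid(math.exp, 0.0, 1.0), midpoint(math.exp, 0.0, 1.0), rel_tol=1e-6
    )
```

Saved as a file whose name starts with `test_`, these functions are picked up by running `pytest` in the same directory.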

Recap

Reproducibility:
- Git repository for every project;
- Per-project Conda environments;
Correctness:
- At least one other person has read and understood your code;
- Asserts when writing new code;
- Unit testing before producing publishable results;
These slides:
https://twesterhout.github.io/programming-practices-talk-2021
