Building and running vibe-qc on the twin cluster (Uni Bonn)

This page covers provisioning vibe-qc on the University of Bonn theoretical chemistry twin cluster (a TORQUE batch system). twin is unlike the ordinary fleet hosts (planetx / mars / maru) in two ways that shape the whole approach, so it gets its own installer pair — scripts/install_cluster.sh and scripts/update_cluster.sh — rather than the plain install.sh / update.sh.

TL;DR

From the twin login node, in a vibe-qc checkout:

./scripts/install_cluster.sh --release      # first time (or --dev, or both)
./scripts/update_cluster.sh  --release      # refresh later

The interpreter a job should invoke is then ~/vibeqc-release/.venv/bin/python (or ~/vibeqc-dev/.venv/bin/python). Heavy work runs on a compute node via qsub; the login node only fetches + submits.

Why twin needs its own installer

  1. Split tiers. The login node is openSUSE Leap 15.6 with gcc 7.5 (too old for libint/libecpint’s C++17) and only 4 cores / 15 GB — not a build host. The atokat compute nodes are openSUSE Tumbleweed with gcc 13+, 20 cores / ~97 GB. So we bring our own toolchain (Miniforge) and compile on a compute node via qsub.

  2. No compute-node internet. Only the login node has egress. So every fetch — Miniforge, the git clone, the vendored native-dep sources, and the Python wheels — happens on the login node, and the compute node builds entirely offline from the NFS-shared /home.

The result is a two-phase model:

Phase     Where                  Network   Does
login     login node             online    Miniforge → toolchain env → clone → stage native-dep sources (fetch-only) → download a complete wheelhouse
compute   atokat node via qsub   offline   activate the toolchain, point pip at the wheelhouse, run the ordinary scripts/install.sh

Everything else reuses the normal build system: install.sh → setup_native_deps.sh → build_{libint,libxc,spglib,fftw,libecpint}.sh source-build the vendored native deps into third_party/<dep>/install/ for byte-parity with the rest of the fleet (and because vibe-qc’s libint needs custom max_am / deriv_order settings no conda-forge libint provides). Miniforge supplies only the prerequisite layer: Python 3.14, a modern gcc/g++/gfortran, cmake/ninja, and the build-time headers/libs (boost, eigen, gmp, openblas+lapack).

Access (2-hop SSH)

twin is reachable only through the institute gateway ssh3.thch.uni-bonn.de (inbound to twin is closed). Collapse the two hops with a ProxyJump in your ~/.ssh/config:

Host ssh3 ssh3.thch.uni-bonn.de
    HostName ssh3.thch.uni-bonn.de
    User <your-user>
    IdentityFile ~/.ssh/id_ed25519

Host twin
    HostName twin
    User <your-user>
    ProxyJump ssh3
    IdentityFile ~/.ssh/twin_ed25519
    IdentitiesOnly yes

Then ssh twin, scp twin:, and rsync -e ssh twin: all route through the gateway transparently.

Cloning needs a read-only Deploy Key

The compute nodes have no internet, so the clone happens on the login node, which has no cached git credentials. install_cluster.sh generates a read-only ed25519 key at ~/.ssh/gitlab_vibeqc_deploy and prints it if it is not yet authorized. Add it once under GitLab → vibeqc → Settings → Repository → Deploy keys (leave Grant write permissions unchecked), then re-run.

On-twin layout

install_cluster.sh creates the following under your $HOME (/home/$USER, which is NFS-shared to every compute node). This is the layout the vq TORQUE dispatcher targets — keep it in sync with the dispatcher’s probed-facts section.

$HOME/miniforge3/                       Miniforge base (python 3.13)
$HOME/miniforge3/envs/vqbuild/          toolchain env: python 3.14, gcc 14,
                                        cmake/ninja, boost/eigen/gmp/openblas
$HOME/vibeqc-dev/      (branch main)    dev checkout      + .venv/
$HOME/vibeqc-release/  (branch release) release checkout  + .venv/
$HOME/.vibeqc-cluster/                  wheelhouses, generated *.pbs, build logs

The interpreter to invoke vibe-qc in a job is $HOME/vibeqc-<variant>/.venv/bin/python (console scripts: .venv/bin/vibe-qc). That venv is built on the conda Python; the toolchain’s shared libraries are reachable from it via baked RPATH, so a job does not strictly need to conda activate. Doing so (or exporting LD_LIBRARY_PATH=$HOME/miniforge3/envs/vqbuild/lib) is a harmless belt-and-suspenders measure.
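For scripts that need to pick the interpreter by variant, the path rule above can be wrapped in a small function. This is a minimal sketch: vq_python is a hypothetical helper name, not something the installer provides, and it assumes only the two variants (dev, release) described on this page.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with vibe-qc): map a variant name to the
# per-variant interpreter path described in the layout above.
vq_python() {
    local variant="${1:-}"
    case "$variant" in
        dev|release)
            printf '%s/vibeqc-%s/.venv/bin/python\n' "$HOME" "$variant"
            ;;
        *)
            echo "unknown variant: '$variant' (expected dev or release)" >&2
            return 1
            ;;
    esac
}
```

In a job script this could be used as `"$(vq_python release)" my_h2.py`.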

Bootstrap (first time)

On the twin login node, in any vibe-qc checkout (e.g. clone once by hand, or let the installer clone the managed trees):

./scripts/install_cluster.sh --release        # release variant only
./scripts/install_cluster.sh --dev            # dev variant only
./scripts/install_cluster.sh                  # both (default)
./scripts/install_cluster.sh --release --wait # ...and block until built

What it does, in order:

  1. Miniforge — installs to ~/miniforge3 (rootless, single HTTPS installer) if absent.

  2. Toolchain env: conda create -p ~/miniforge3/envs/vqbuild with python 3.14 + c/cxx/fortran-compiler + cmake/ninja/make/pkg-config + libboost-devel eigen gmp openblas liblapack liblapacke libcblas.

  3. Deploy key — generates + checks the read-only gitlab key (see above).

  4. Per variant: clone → stage native sources (VIBEQC_FETCH_ONLY=1 setup_native_deps.sh, login node) → wheelhouse (pip download of the build + runtime + selected-extras requirements read straight from pyproject.toml) → submit the offline build to qsub.
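The fetch-only staging in step 4 depends on each build script stopping after the download when VIBEQC_FETCH_ONLY=1 is set. The sketch below shows one way such a guard could look. stage_dep is a hypothetical function, and the fetch and compile phases are simulated with mkdir so the example runs anywhere; the real build_*.sh scripts may be organised differently.

```shell
#!/bin/bash
# Sketch of a fetch-only guard. stage_dep and its simulated fetch/compile
# steps are hypothetical stand-ins, not the real build_*.sh contents.
stage_dep() {
    local dep="$1" root="$2"
    local src="$root/third_party/$dep/src"

    # Fetch phase: needs network in a real script; simulated here by
    # creating the source directory.
    mkdir -p "$src"
    echo "fetched $dep into $src"

    # Guard: with VIBEQC_FETCH_ONLY=1, stop before compiling.
    if [ "${VIBEQC_FETCH_ONLY:-0}" = "1" ]; then
        echo "fetch-only: skipping compile of $dep"
        return 0
    fi

    # Compile phase: simulated by creating the install prefix.
    mkdir -p "$root/third_party/$dep/install"
    echo "built $dep"
}
```

Because the guard only triggers when the variable is set to 1, a call without it runs both phases, which mirrors the "zero change to ordinary builds" property the installer relies on.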

Useful flags: --extras GROUP (default none — core runtime only; the cluster runs vibe-qc, it does not test it), --queue (default atokat), --ppn (default 20), --walltime (default 02:00:00), --no-submit (login prep only), --wait (poll to completion). See --help.

The compute build’s progress is tee’d live to ~/.vibeqc-cluster/logs/<variant>-build.<timestamp>.log (TORQUE otherwise only copies the job’s stdout back at job end).
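To follow a build as it runs, it helps to locate the newest log for a variant. A minimal sketch, assuming the log naming shown above; latest_build_log and the VQ_LOG_DIR override are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Print the newest live build log for a variant (dev or release).
# VQ_LOG_DIR is a hypothetical override; the default is the log directory
# named above.
latest_build_log() {
    local variant="$1"
    local dir="${VQ_LOG_DIR:-$HOME/.vibeqc-cluster/logs}"
    # ls -t sorts newest first; the error for "no matching file" is silenced,
    # so the function prints nothing when no log exists yet.
    ls -t "$dir/$variant"-build.*.log 2>/dev/null | head -n 1
}

# Usage on the login node:
#   tail -f "$(latest_build_log release)"
```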

Refresh

./scripts/update_cluster.sh --release                       # git pull + rebuild
./scripts/update_cluster.sh --dev --rebuild-native-deps     # after a vendored lib bump

update_cluster.sh fast-forwards the checkout’s branch on the login node (refusing a dirty tree — the cluster checkouts are tool-managed and must stay clean), re-stages sources, refreshes the wheelhouse, and submits the offline rebuild. --rebuild-native-deps wipes + re-fetches the vendored third_party/* trees first.
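The clean-tree requirement can be checked by hand before a refresh. The following is a sketch only: refuse_if_dirty is a hypothetical name, and counting untracked files as "dirty" is an assumption, since the page does not say how update_cluster.sh defines a dirty tree.

```shell
#!/bin/bash
# Sketch of a clean-tree guard before a fast-forward. refuse_if_dirty is a
# hypothetical name; treating untracked files as dirty is an assumption.
refuse_if_dirty() {
    local tree="$1" changes
    # --porcelain prints one line per changed or untracked path and nothing
    # when the tree is clean. A git failure (e.g. not a repository) returns 2.
    changes=$(git -C "$tree" status --porcelain) || return 2
    if [ -n "$changes" ]; then
        echo "refusing: $tree has local changes" >&2
        return 1
    fi
    return 0
}

# Usage (login node, needs network for the pull):
#   refuse_if_dirty ~/vibeqc-release && git -C ~/vibeqc-release pull --ff-only
```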

Running a vibe-qc job (and for the dispatcher)

Submit from the login node; jobs run on a compute node and read everything from the NFS /home. Job scripts must be pure ASCII — TORQUE 2.5.12’s qsub rejects anything else (qsub: file must be an ascii script), so avoid em-dashes / smart quotes in generated job files.

cat > ~/h2.pbs <<'EOF'
#!/bin/bash
#PBS -N h2scf
#PBS -q atokat
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00
#PBS -j oe
cd "$PBS_O_WORKDIR"
~/vibeqc-release/.venv/bin/python my_h2.py     # the interpreter
EOF
qsub ~/h2.pbs
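Since qsub rejects non-ASCII job files, a quick check before submitting can save a round trip. A minimal sketch: check_ascii is a hypothetical helper, not part of vibe-qc or TORQUE, and it is slightly stricter than necessary because it also flags non-whitespace control characters.

```shell
#!/bin/bash
# Return 0 if the file contains only printable ASCII and whitespace;
# otherwise report the matches and return 1.
# check_ascii is a hypothetical helper, not part of vibe-qc or TORQUE.
check_ascii() {
    local file="$1"
    # In the C locale every byte >= 0x80 falls outside [:print:] and
    # [:space:], so grep matches it. Control bytes other than whitespace
    # are matched as well.
    if LC_ALL=C grep -n '[^[:print:][:space:]]' "$file"; then
        echo "non-ASCII bytes found in $file" >&2
        return 1
    fi
    return 0
}

# Usage:
#   check_ascii ~/h2.pbs && qsub ~/h2.pbs
```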

There is no dedicated /scratch and $TMPDIR/$SCRATCH are unset; use a working directory under /home/$USER (NFS, visible to login + compute).

How the offline build works (internals)

  • fetch-only mode. A guarded VIBEQC_FETCH_ONLY=1 path in setup_native_deps.sh and each build_*.sh clones/downloads the vendored source then exits before compiling, so the internet-connected login node can stage third_party/<dep>/src for the offline compute build. Guarded on the env var → zero change to ordinary fleet builds.

  • conda-aware preflight. build_libint.sh and setup_native_deps.sh also search $CONDA_PREFIX/include + $CONDA_PREFIX/lib (and add $CONDA_PREFIX to CMAKE_PREFIX_PATH) when a conda env is active, because boost/eigen/gmp/openblas live there rather than under /usr. Guarded on $CONDA_PREFIX, so non-conda builds are byte-for-byte unchanged.

  • offline pip. The compute job exports PIP_NO_INDEX=1 + PIP_FIND_LINKS=<wheelhouse>; PEP 517 build isolation inherits both, so pip install -e . resolves the build backend (scikit-build-core, pybind11) and all runtime deps from the wheelhouse with no network.

  • toolchain isolation. The build job sets VIBEQC_BUILD_NICED=1 (it owns the node; this also avoids install.sh’s nice/ionice re-exec) and CMAKE_PREFIX_PATH=$CONDA_PREFIX so vibe-qc’s own CMake finds conda’s Eigen.

  • BLAS pinning. The build passes -DVIBEQC_BLAS_VENDOR=OpenBLAS so vibe-qc links conda’s OpenBLAS instead of the broken system Intel MKL the atokat nodes leave on LD_LIBRARY_PATH (libmkl_avx512.so: undefined symbol). conda’s libopenblas is first in the extension’s RPATH, so this fixes both link time and run time, and jobs need no LD_LIBRARY_PATH scrubbing.

  • scipy. Installed explicitly (VQC_EXTRA_PIP, offline from the wheelhouse) because vibe-qc imports it unconditionally (python/vibeqc/density_fitting.py) but does not yet declare it in pyproject.toml, so import vibeqc fails without it on an extras-free install. Drop VQC_EXTRA_PIP once scipy is a declared dependency.
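The offline-pip and toolchain-isolation bullets above boil down to a handful of exported variables. The sketch below collects them in one place for reference. offline_build_env is a hypothetical name, and the job script that install_cluster.sh actually generates may set more than this.

```shell
#!/bin/bash
# Sketch of the environment an offline compute build needs, following the
# bullets above. offline_build_env is a hypothetical name; the generated
# job script may differ.
offline_build_env() {
    local wheelhouse="$1"
    # pip maps PIP_<OPTION> variables to command-line options: these two
    # behave like --no-index and --find-links.
    export PIP_NO_INDEX=1
    export PIP_FIND_LINKS="$wheelhouse"
    # The job owns the node, so skip install.sh's nice/ionice re-exec.
    export VIBEQC_BUILD_NICED=1
    # Let CMake find conda-provided packages when a conda env is active.
    if [ -n "${CONDA_PREFIX:-}" ]; then
        export CMAKE_PREFIX_PATH="$CONDA_PREFIX"
    fi
}
```

With these set, an ordinary `pip install -e .` in the checkout resolves everything from the wheelhouse directory and never contacts an index.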

Troubleshooting

  • qsub: file must be an ascii script — a non-ASCII byte (commonly an em-dash) crept into the job file. Keep job scripts ASCII.

  • Clone Permission denied (publickey) — the deploy key is not authorized yet; add ~/.ssh/gitlab_vibeqc_deploy.pub as a read-only Deploy Key.

  • pip download cannot find a wheel — a dependency lacks a cp314 wheel for linux. Drop the offending extra (--extras none) or pin a version that ships one.

  • Build seems stuck with no log — TORQUE copies the -o file back only at job end; watch the live log under ~/.vibeqc-cluster/logs/ instead, or watch third_party/*/install appear.

  • INTEL MKL ERROR libmkl_avx512.so: undefined symbol at job run time means the extension linked the broken system MKL. Rebuild with -DVIBEQC_BLAS_VENDOR=OpenBLAS (install_cluster.sh does this by default).

  • A NumPy error containing "(X86_V2) but your machine doesn't support" appears only on the login node (the Xeon E5420 predates x86-64-v2, so modern numpy wheels cannot run there). Run vibe-qc on the atokat compute nodes, never on the login node, which is for fetch + submit only.
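To see why a given host trips the X86_V2 error, its CPU flags can be compared against the x86-64-v2 feature set. A minimal sketch: missing_v2_flags is a hypothetical helper, and the flag names are the ones Linux uses in /proc/cpuinfo.

```shell
#!/bin/bash
# Report which x86-64-v2 CPU flags are absent from a /proc/cpuinfo-style
# flags string. missing_v2_flags is a hypothetical helper for illustration.
missing_v2_flags() {
    local flags=" $1 " missing="" f
    # Linux /proc/cpuinfo names for the x86-64-v2 additions
    # (pni is the kernel's name for SSE3).
    for f in cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; do
        case "$flags" in
            *" $f "*) ;;
            *) missing="$missing $f" ;;
        esac
    done
    # Strip the leading space; prints an empty line if nothing is missing.
    echo "${missing# }"
}

# Usage on a Linux host:
#   missing_v2_flags "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

An empty result means the host meets the x86-64-v2 baseline; any listed flag explains the NumPy refusal.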

Coordination: the vq TORQUE dispatcher

The vq v1.0 TORQUE dispatcher backend (Arch 2: a vq daemon on planetx drives qsub/qstat/qdel over SSH — see vibe-queue/docs/pbs_dispatcher_backend_design.md §16) submits jobs that invoke vibe-qc on twin. It must target the layout above: the per-variant interpreter $HOME/vibeqc-<variant>/.venv/bin/python, the atokat queue, an NFS /home working directory, and ASCII-only job scripts. This page is the source of truth for those paths; update both together if the layout changes.