Building and running vibe-qc on the twin cluster (Uni Bonn)
This page covers provisioning vibe-qc on the University of Bonn theoretical
chemistry twin cluster (a TORQUE batch system). twin is unlike the
ordinary fleet hosts (planetx / mars / maru) in two ways that shape the whole
approach, so it gets its own installer pair, scripts/install_cluster.sh and
scripts/update_cluster.sh, rather than the plain install.sh / update.sh.
TL;DR
From the twin login node, in a vibe-qc checkout:
./scripts/install_cluster.sh --release # first time (or --dev, or both)
./scripts/update_cluster.sh --release # refresh later
The interpreter a job should invoke is then ~/vibeqc-release/.venv/bin/python
(or ~/vibeqc-dev/.venv/bin/python). Heavy work runs on a compute node via
qsub; the login node only fetches + submits.
Why twin needs its own installer
- Split tiers. The login node is openSUSE Leap 15.6 with gcc 7.5 (too old for libint/libecpint’s C++17) and only 4 cores / 15 GB, so it is not a build host. The atokat compute nodes are openSUSE Tumbleweed with gcc 13+, 20 cores / ~97 GB. So we bring our own toolchain (Miniforge) and compile on a compute node via qsub.
- No compute-node internet. Only the login node has egress. So every fetch (Miniforge, the git clone, the vendored native-dep sources, and the Python wheels) happens on the login node, and the compute node builds entirely offline from the NFS-shared /home.
The result is a two-phase model:
| Phase | Where | Network | Does |
|---|---|---|---|
| login | login node | online | Miniforge → toolchain env → clone → stage native-dep sources (fetch-only) → download a complete wheelhouse |
| compute | | offline | activate the toolchain, point |
Everything else reuses the normal build system: install.sh →
setup_native_deps.sh → build_{libint,libxc,spglib,fftw,libecpint}.sh
source-build the vendored native deps into third_party/<dep>/install/ for
byte-parity with the rest of the fleet (and because vibe-qc’s libint needs
custom max_am / deriv_order settings no conda-forge libint provides).
Miniforge supplies only the prerequisite layer: Python 3.14, a modern
gcc/g++/gfortran, cmake/ninja, and the build-time headers/libs
(boost, eigen, gmp, openblas+lapack).
Access (2-hop SSH)
twin is reachable only through the institute gateway ssh3.thch.uni-bonn.de
(inbound to twin is closed). Collapse the two hops with a ProxyJump in your
~/.ssh/config:
Host ssh3 ssh3.thch.uni-bonn.de
HostName ssh3.thch.uni-bonn.de
User <your-user>
IdentityFile ~/.ssh/id_ed25519
Host twin
HostName twin
User <your-user>
ProxyJump ssh3
IdentityFile ~/.ssh/twin_ed25519
IdentitiesOnly yes
Then ssh twin, scp … twin:, and rsync -e ssh … twin: all route through
the gateway transparently.
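
To confirm that the ProxyJump stanza is actually in effect before debugging anything else, one option is to inspect the effective client configuration that OpenSSH prints with ssh -G. The sketch below is illustrative only: the helper name proxyjump_of is hypothetical, and it assumes an OpenSSH client recent enough to support both -G and ProxyJump.

```shell
# Print the ProxyJump value found in `ssh -G <host>` output read from stdin.
# `ssh -G` prints the effective client options as lowercase "key value" lines.
# (proxyjump_of is a hypothetical helper name, not part of vibe-qc.)
proxyjump_of() {
    awk '$1 == "proxyjump" { print $2 }'
}

# Usage on your workstation (not run here, since it needs your ssh config):
#   ssh -G twin | proxyjump_of
# With the stanza above in place this should name the gateway alias (ssh3).
```

If the helper prints nothing, the Host twin block is probably not being matched, for example because of a typo in the alias or a conflicting earlier Host pattern.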
Cloning needs a read-only Deploy Key
The compute nodes have no internet, so the clone happens on the login
node, which has no cached git credentials. install_cluster.sh generates a
read-only ed25519 key at ~/.ssh/gitlab_vibeqc_deploy and prints it if it is
not yet authorized. Add it once under GitLab → vibeqc → Settings →
Repository → Deploy keys (leave Grant write permissions unchecked), then
re-run.
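
For readers who want to see roughly what the key step amounts to, here is a minimal sketch of generating a key only when none exists and printing the public half. It is not the installer's actual code: the function name ensure_deploy_key and the key comment are assumptions, and the check against GitLab is left out because the server address is not given here.

```shell
# Create a passphrase-less ed25519 key at the given path unless one already
# exists, then print the public half for pasting into the Deploy keys page.
# (ensure_deploy_key and the -C comment text are hypothetical.)
ensure_deploy_key() {
    key_path=$1
    if [ ! -f "$key_path" ]; then
        mkdir -p "$(dirname "$key_path")" || return 1
        ssh-keygen -q -t ed25519 -N "" -C "vibeqc read-only deploy key" \
            -f "$key_path" || return 1
    fi
    cat "$key_path.pub"
}

# Usage on the login node:
#   ensure_deploy_key "$HOME/.ssh/gitlab_vibeqc_deploy"
```

Because an existing key is never overwritten, re-running the step after authorizing the key in GitLab keeps the same public key.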
On-twin layout
install_cluster.sh creates the following under your $HOME (/home/$USER,
which is NFS-shared to every compute node). This is the layout the vq
TORQUE dispatcher targets — keep it in sync with the dispatcher’s probed-facts
section.
$HOME/miniforge3/ Miniforge base (python 3.13)
$HOME/miniforge3/envs/vqbuild/ toolchain env: python 3.14, gcc 14,
cmake/ninja, boost/eigen/gmp/openblas
$HOME/vibeqc-dev/ (branch main) dev checkout + .venv/
$HOME/vibeqc-release/ (branch release) release checkout + .venv/
$HOME/.vibeqc-cluster/ wheelhouses, generated *.pbs, build logs
The interpreter to invoke vibe-qc in a job is
$HOME/vibeqc-<variant>/.venv/bin/python (console scripts: .venv/bin/vibe-qc).
That venv is built on the conda Python; the toolchain’s shared libraries are
reachable from it via baked RPATH, so a job does not strictly need to
conda activate — but doing so (or exporting
LD_LIBRARY_PATH=$HOME/miniforge3/envs/vqbuild/lib) is a harmless
belt-and-suspenders.
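
Job generators that need the interpreter path can derive it from the variant name instead of hard-coding it. A small sketch, assuming only the layout listed above; the helper name vibeqc_python is hypothetical:

```shell
# Map a variant name to the interpreter path used in job scripts.
# (vibeqc_python is a hypothetical helper, not shipped with vibe-qc.)
vibeqc_python() {
    case $1 in
        dev|release)
            printf '%s\n' "$HOME/vibeqc-$1/.venv/bin/python"
            ;;
        *)
            echo "unknown variant: $1 (expected dev or release)" >&2
            return 1
            ;;
    esac
}

# Usage inside a job script:
#   "$(vibeqc_python release)" my_h2.py
```

Rejecting unknown variants early avoids submitting a job that only fails once it reaches a compute node.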
Bootstrap (first time)
On the twin login node, in any vibe-qc checkout (e.g. clone once by hand, or let the installer clone the managed trees):
./scripts/install_cluster.sh --release # release variant only
./scripts/install_cluster.sh --dev # dev variant only
./scripts/install_cluster.sh # both (default)
./scripts/install_cluster.sh --release --wait # ...and block until built
What it does, in order:
1. Miniforge: installs to ~/miniforge3 (rootless, single HTTPS installer) if absent.
2. Toolchain env: conda create -p ~/miniforge3/envs/vqbuild with python 3.14 + c/cxx/fortran-compiler + cmake/ninja/make/pkg-config + libboost-devel eigen gmp openblas liblapack liblapacke libcblas.
3. Deploy key: generates + checks the read-only GitLab key (see above).
4. Per variant: clone → stage native sources (VIBEQC_FETCH_ONLY=1 setup_native_deps.sh, login node) → wheelhouse (pip download of the build + runtime + selected-extras requirements read straight from pyproject.toml) → submit the offline build to qsub.
Useful flags: --extras GROUP (default none — core runtime only; the
cluster runs vibe-qc, it does not test it), --queue (default atokat),
--ppn (default 20), --walltime (default 02:00:00), --no-submit (login
prep only), --wait (poll to completion). See --help.
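
To make the scheduling flags concrete, the sketch below shows how --queue, --ppn and --walltime could translate into TORQUE directives, using the defaults listed above. The function name pbs_header is hypothetical, and the single-node nodes=1 request is an assumption borrowed from the example job on this page; the header that install_cluster.sh really generates may differ.

```shell
# Emit TORQUE directives for a queue, cores-per-node and walltime,
# falling back to the documented defaults when an argument is omitted.
# (pbs_header is a hypothetical helper; nodes=1 is an assumption.)
pbs_header() {
    queue=${1:-atokat}
    ppn=${2:-20}
    walltime=${3:-02:00:00}
    printf '#PBS -q %s\n' "$queue"
    printf '#PBS -l nodes=1:ppn=%s\n' "$ppn"
    printf '#PBS -l walltime=%s\n' "$walltime"
}

# Usage:
#   pbs_header                     # documented defaults
#   pbs_header atokat 4 00:10:00   # a small test job
```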
The compute build’s progress is tee’d live to
~/.vibeqc-cluster/logs/<variant>-build.<timestamp>.log (TORQUE otherwise only
copies the job’s stdout back at job end).
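
Since each build writes a new timestamped log, a small helper that picks the most recent one is handy for following a running build. This is a sketch under the assumption that file modification time is a good proxy for "latest"; the function name latest_build_log is hypothetical.

```shell
# Print the most recently modified build log for a variant.
# The optional second argument overrides the log directory.
# (latest_build_log is a hypothetical helper name.)
latest_build_log() {
    variant=$1
    log_dir=${2:-$HOME/.vibeqc-cluster/logs}
    # ls -t lists newest first; the glob matches <variant>-build.<timestamp>.log
    ls -t "$log_dir/$variant"-build.*.log 2>/dev/null | head -n 1
}

# Usage on the login node while the job runs:
#   tail -f "$(latest_build_log release)"
```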
Refresh
./scripts/update_cluster.sh --release # git pull + rebuild
./scripts/update_cluster.sh --dev --rebuild-native-deps # after a vendored lib bump
update_cluster.sh fast-forwards the checkout’s branch on the login node
(refusing a dirty tree — the cluster checkouts are tool-managed and must stay
clean), re-stages sources, refreshes the wheelhouse, and submits the offline
rebuild. --rebuild-native-deps wipes + re-fetches the vendored
third_party/* trees first.
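
The "refuse a dirty tree, then fast-forward" behaviour can be approximated with two plain git calls. The sketch below is not the script's actual implementation: the function names are hypothetical, and treating untracked files as dirty is an assumption.

```shell
# Succeed only if the checkout has no modified or untracked paths.
# (tree_is_clean / refresh_checkout are hypothetical names.)
tree_is_clean() {
    # --porcelain prints one line per changed or untracked path
    status=$(git -C "$1" status --porcelain 2>/dev/null) || return 2
    [ -z "$status" ]
}

# Fast-forward the checkout, refusing if it carries local changes.
refresh_checkout() {
    if ! tree_is_clean "$1"; then
        echo "refusing to update: $1 is not a clean git checkout" >&2
        return 1
    fi
    # --ff-only never creates a merge commit; it fails if history diverged
    git -C "$1" pull --ff-only
}

# Usage on the login node:
#   refresh_checkout "$HOME/vibeqc-release"
```

Using --ff-only keeps the managed checkout a faithful copy of the remote branch, which matches the requirement that the cluster trees stay tool-managed.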
Running a vibe-qc job (and for the dispatcher)
Submit from the login node; jobs run on a compute node and read everything
from the NFS /home. Job scripts must be pure ASCII — TORQUE 2.5.12’s
qsub rejects anything else (qsub: file must be an ascii script), so avoid
em-dashes / smart quotes in generated job files.
cat > ~/h2.pbs <<'EOF'
#!/bin/bash
#PBS -N h2scf
#PBS -q atokat
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00
#PBS -j oe
cd "$PBS_O_WORKDIR"
~/vibeqc-release/.venv/bin/python my_h2.py # the interpreter
EOF
qsub ~/h2.pbs
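
Because qsub rejects non-ASCII job files, it can help to check a generated script before submitting it. A minimal sketch, with hypothetical function names, that counts the bytes left after deleting all 7-bit characters:

```shell
# Succeed only if the file contains nothing but 7-bit ASCII bytes.
# (is_ascii_file / submit_ascii_only are hypothetical helper names.)
is_ascii_file() {
    # Delete every byte in the range 0-127; whatever remains is non-ASCII.
    leftover=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' ')
    [ "$leftover" -eq 0 ]
}

# Submit a job file only after it passes the ASCII check.
submit_ascii_only() {
    if ! is_ascii_file "$1"; then
        echo "$1: contains non-ASCII bytes, not submitting" >&2
        return 1
    fi
    qsub "$1"
}

# Usage on the login node:
#   submit_ascii_only ~/h2.pbs
```

Running the byte filter under LC_ALL=C makes tr operate on raw bytes, so multi-byte UTF-8 characters such as an em-dash are caught regardless of the login shell's locale.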
There is no dedicated /scratch and $TMPDIR/$SCRATCH are unset; use a
working directory under /home/$USER (NFS, visible to login + compute).
How the offline build works (internals)
- fetch-only mode. A guarded VIBEQC_FETCH_ONLY=1 path in setup_native_deps.sh and each build_*.sh clones/downloads the vendored source then exits before compiling, so the internet-connected login node can stage third_party/<dep>/src for the offline compute build. Guarded on the env var → zero change to ordinary fleet builds.
- conda-aware preflight. build_libint.sh and setup_native_deps.sh also search $CONDA_PREFIX/include + $CONDA_PREFIX/lib (and add $CONDA_PREFIX to CMAKE_PREFIX_PATH) when a conda env is active, because boost/eigen/gmp/openblas live there rather than under /usr. Guarded on $CONDA_PREFIX, so non-conda builds are byte-for-byte unchanged.
- offline pip. The compute job exports PIP_NO_INDEX=1 + PIP_FIND_LINKS=<wheelhouse>; PEP 517 build isolation inherits both, so pip install -e . resolves the build backend (scikit-build-core, pybind11) and all runtime deps from the wheelhouse with no network.
- toolchain isolation. The build job sets VIBEQC_BUILD_NICED=1 (it owns the node; this also avoids install.sh’s nice/ionice re-exec) and CMAKE_PREFIX_PATH=$CONDA_PREFIX so vibe-qc’s own CMake finds conda’s Eigen.
- BLAS pinning. The build passes -DVIBEQC_BLAS_VENDOR=OpenBLAS so vibe-qc links conda’s OpenBLAS instead of the broken system Intel MKL the atokat nodes leave on LD_LIBRARY_PATH (libmkl_avx512.so: undefined symbol). conda’s libopenblas is first in the extension’s RPATH, so this fixes both link time and run time, and jobs need no LD_LIBRARY_PATH scrubbing.
- scipy. Installed explicitly (VQC_EXTRA_PIP, offline from the wheelhouse) because vibe-qc imports it unconditionally (python/vibeqc/density_fitting.py) but does not yet declare it in pyproject.toml, so import vibeqc fails without it on an extras-free install. Drop VQC_EXTRA_PIP once scipy is a declared dependency.
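
Two of the mechanisms above, the fetch-only guard and offline pip, are plain environment-variable switches. The sketch below illustrates the pattern only: the function names and echoed messages are hypothetical stand-ins, and the real build_*.sh scripts are not reproduced here. PIP_NO_INDEX and PIP_FIND_LINKS are pip's standard environment equivalents of --no-index and --find-links.

```shell
# True only when the caller explicitly requested fetch-only mode.
# (All function names in this block are hypothetical.)
fetch_only_requested() {
    [ "${VIBEQC_FETCH_ONLY:-0}" = "1" ]
}

# Stand-in for a build script: fetch always, compile unless fetch-only.
build_dep() {
    echo "fetch $1"
    if fetch_only_requested; then
        echo "skip compile $1 (fetch-only)"
        return 0
    fi
    echo "compile $1"
}

# Point pip at a local wheelhouse and forbid any package-index access.
use_offline_pip() {
    export PIP_NO_INDEX=1
    export PIP_FIND_LINKS="$1"
}

# Usage:
#   VIBEQC_FETCH_ONLY=1 build_dep libxc               # login node: stage only
#   use_offline_pip "$HOME/.vibeqc-cluster/wheels"    # path is illustrative
```

Because the guard defaults to off when the variable is unset, callers that never set it follow the ordinary compile path.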
Troubleshooting
- "qsub: file must be an ascii script": a non-ASCII byte (commonly an em-dash) crept into the job file. Keep job scripts ASCII.
- Clone fails with "Permission denied (publickey)": the deploy key is not authorized yet; add ~/.ssh/gitlab_vibeqc_deploy.pub as a read-only Deploy Key.
- pip download cannot find a wheel: a dependency lacks a cp314 wheel for linux. Drop the offending extra (--extras none) or pin a version that ships one.
- Build seems stuck with no log: TORQUE copies the -o file back only at job end; watch the live log under ~/.vibeqc-cluster/logs/ instead, or watch third_party/*/install appear.
- "INTEL MKL ERROR … libmkl_avx512.so: undefined symbol" at job run time means the extension linked the broken system MKL. Rebuild with -DVIBEQC_BLAS_VENDOR=OpenBLAS (install_cluster.sh does this by default).
- "NumPy … (X86_V2) but your machine doesn't support" appears only on the login node (the Xeon E5420 predates x86-64-v2, so modern numpy wheels cannot run there). Run vibe-qc on the atokat compute nodes, never on the login node, which is for fetch + submit only.
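
To check whether a given host meets the x86-64-v2 level that the NumPy message refers to, the CPU flags can be compared against the features that level requires. The sketch below assumes Linux /proc/cpuinfo flag names (pni is how the kernel reports SSE3, lahf_lm is LAHF/SAHF in 64-bit mode); the function name is hypothetical.

```shell
# Succeed if the space-separated flag list covers the x86-64-v2 feature set.
# (supports_x86_64_v2 is a hypothetical helper name.)
supports_x86_64_v2() {
    flags=" $1 "
    for need in cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; do
        case $flags in
            *" $need "*) ;;   # feature present, keep checking
            *) echo "missing CPU feature: $need" >&2; return 1 ;;
        esac
    done
    return 0
}

# Usage on a Linux host:
#   supports_x86_64_v2 "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

Padding the flag list with spaces makes each lookup a whole-word match, so sse4_1 cannot accidentally satisfy a check for sse4_2.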
Coordination: the vq TORQUE dispatcher
The vq v1.0 TORQUE dispatcher backend (Arch 2: a vq daemon on planetx drives
qsub/qstat/qdel over SSH — see
vibe-queue/docs/pbs_dispatcher_backend_design.md §16) submits jobs that
invoke vibe-qc on twin. It must target the layout above: the per-variant
interpreter $HOME/vibeqc-<variant>/.venv/bin/python, the atokat queue, an
NFS /home working directory, and ASCII-only job scripts. This page is the
source of truth for those paths; update both together if the layout changes.