BLAS + LAPACK backend
vibe-qc’s C++ core uses Eigen for
dense linear algebra. When a BLAS+LAPACK library is linked, Eigen
delegates its matrix products, eigendecompositions, Cholesky
factorisations, and dense LU / QR / SVD to that library
(EIGEN_USE_BLAS
catches dense *, .noalias() products and triangular solves;
EIGEN_USE_LAPACKE catches Eigen::LLT, Eigen::LDLT,
Eigen::SelfAdjointEigenSolver, Eigen::FullPivLU, etc.). The
delegation is set at compile time; no source-code change is
required to take advantage of it.
This page describes:
which BLAS gets linked on which platform, and how to check;
how vibe-qc’s OpenMP layer interacts with BLAS-internal threading;
when the vendored OpenBLAS path is worth it (and when the system install is enough);
a candid note on what the linkage does not fix, most notably RIJCOSX performance, where the main optimisation lever lies outside the BLAS backend.
Backend selection per platform
macOS: Apple Accelerate (default)
Accelerate ships with the OS and is highly tuned for Apple
Silicon and recent Intel Macs. CMake picks it via
find_package(BLAS) with BLA_VENDOR=Apple. Nothing to install,
nothing to configure.
otool -L on the compiled _vibeqc_core*.so confirms:
$ otool -L .../vibeqc/_vibeqc_core.cpython-*-darwin.so | grep Accelerate
/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate
EIGEN_USE_LAPACKE is not set on macOS — Accelerate’s
LAPACKE bridge has version-specific quirks vs Eigen’s
expectations, and Accelerate’s BLAS catches the SCF matrix
products that matter. To force OpenBLAS on macOS anyway (mostly
for reproducible cross-platform builds), build the vendored
OpenBLAS (see below) and pass -DVIBEQC_BLAS_VENDOR=OpenBLAS to
CMake.
Linux: system OpenBLAS / MKL / netlib auto-detect
CMake’s FindBLAS scans well-known vendors in order. With no
override, the first one found wins. If you have admin access,
the cleanest path is to install OpenBLAS via the package manager:
# Arch / Manjaro:
sudo pacman -S blas-openblas # replaces reference BLAS system-wide
# Debian / Ubuntu:
sudo apt install libopenblas-dev liblapacke-dev
# Fedora / RHEL:
sudo dnf install openblas-devel lapack-devel
Then pip install -e . will pick up the system OpenBLAS
automatically — no rebuild flag needed.
If only liblapack / libblas (netlib reference) is installed,
CMake will still link, but the banner reads blas netlib BLAS —
a perf trap (see “When BLAS linkage does and does not matter”
below). scripts/setup_native_deps.sh probes for this case and
prints a setup-time recommendation.
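As a rough aid for the "how to check" step, the sketch below classifies the BLAS vendor from the text printed by `otool -L` (macOS) or `ldd` (Linux). The function name `classify_blas` and the file-name patterns are illustrative assumptions, not part of vibe-qc. It matches on library file names only, so a `libblas.so` that a distribution has symlinked to OpenBLAS would still be reported as netlib.

```python
import re

# Hypothetical file-name patterns, checked in order. Adjust to local naming.
_VENDOR_PATTERNS = [
    ("Accelerate", re.compile(r"Accelerate\.framework")),
    ("OpenBLAS", re.compile(r"libopenblas")),
    ("MKL", re.compile(r"libmkl")),
    ("netlib", re.compile(r"lib(blas|lapack)\.")),
]


def classify_blas(listing: str) -> str:
    """Return the first vendor whose pattern appears in ldd / otool -L output."""
    for vendor, pattern in _VENDOR_PATTERNS:
        if pattern.search(listing):
            return vendor
    return "none"


# Illustrative input line in the style of ldd output (not a real log).
sample = "libopenblas.so.0 => /usr/lib/x86_64-linux-gnu/libopenblas.so.0"
print(classify_blas(sample))
```

In practice the listing would come from running `ldd` or `otool -L` on the compiled extension and passing its standard output to the function.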
Linux: vendored OpenBLAS (no-sudo / HPC / CI)
For HPC users without root, locked-down workstations, or CI
boxes that want a byte-identical BLAS across builds, an opt-in
vendored OpenBLAS lives at third_party/openblas/install/:
WITH_OPENBLAS=1 ./scripts/setup_native_deps.sh
# ...or directly:
./scripts/build_openblas.sh
Requirements:
A Fortran compiler (gfortran). Netlib LAPACK’s Fortran sources are bundled into the resulting libopenblas.so to give EIGEN_USE_LAPACKE-grade dense solvers. If build_openblas.sh cannot find a compiler, it prints per-distro install hints.
About 5–10 minutes of build time on first run.
Once third_party/openblas/install/ exists, CMake auto-prepends
it to CMAKE_PREFIX_PATH and pins BLA_VENDOR=OpenBLAS, so the
vendored library wins deterministically over any system BLAS.
The next pip install -e . rebuilds the C extension against
libopenblas.so (with RPATH baked, so import vibeqc finds it
without LD_LIBRARY_PATH fiddling) and the banner switches to
blas OpenBLAS +LAPACKE.
Build flags used (see scripts/build_openblas.sh for the full
list and rationale):
DYNAMIC_ARCH=1: runtime CPU dispatch. The vendored binary picks the right kernel on Haswell / Skylake / Zen / Apple Silicon / … at load time, so the build host’s SIMD support doesn’t constrain where the binary can run.
USE_LAPACK=1 USE_LAPACKE=1: a single libopenblas.so carries BLAS, LAPACK, and the LAPACKE C interface (so both Eigen delegation macros activate).
USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=128 NO_AFFINITY=1: pthreads-internal threading, capped at 128 threads, no CPU pinning. Combined with OPENBLAS_NUM_THREADS=1 (set by default in python/vibeqc/__init__.py), this gives a sane BLAS-serial, OpenMP-parallel posture; see the next section.
Threading: BLAS-serial, app-parallel
vibe-qc parallelises its own inner loops with OpenMP, controlled
by OMP_NUM_THREADS. If the linked BLAS also threads
internally — OpenBLAS-pthreads, MKL — you get N×K threads
contending for cores: measurably worse than either layer alone.
python/vibeqc/__init__.py pins BLAS-internal threading to 1 by
default before the C extension loads:
import os

def _pin_blas_threads() -> None:
    # Default each BLAS threading variable to "1" unless it is already set.
    for key in (
        "OPENBLAS_NUM_THREADS",
        "MKL_NUM_THREADS",
        "VECLIB_MAXIMUM_THREADS",
        "BLIS_NUM_THREADS",
    ):
        os.environ.setdefault(key, "1")
os.environ.setdefault only fills a key that isn’t already set,
so an explicit OPENBLAS_NUM_THREADS=4 in your shell wins. If
you specifically want BLAS-internal threading (e.g., for a
single-threaded outer loop where BLAS parallelism is the only
parallelism available), export it before import vibeqc.
The default — BLAS serial, OpenMP parallel via OMP_NUM_THREADS
— is the standard posture for mixed OpenMP-and-BLAS codes (ORCA,
NWChem, PySCF, Psi4 all do the same).
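The override rule and the N×K oversubscription argument can be checked without touching the real environment. The sketch below is a minimal illustration, assuming a plain dict stands in for `os.environ`; `pin_blas_threads` and `total_threads` are hypothetical helper names, not vibe-qc functions.

```python
BLAS_THREAD_KEYS = (
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "BLIS_NUM_THREADS",
)


def pin_blas_threads(env: dict) -> dict:
    """Return a copy of env with every unset BLAS thread variable set to "1"."""
    pinned = dict(env)
    for key in BLAS_THREAD_KEYS:
        pinned.setdefault(key, "1")  # leaves an existing value untouched
    return pinned


def total_threads(omp_threads: int, blas_threads: int) -> int:
    """Worst case: every OpenMP thread enters a BLAS call that also threads."""
    return omp_threads * blas_threads


# A shell that exported OPENBLAS_NUM_THREADS=4 keeps its explicit value.
env = pin_blas_threads({"OMP_NUM_THREADS": "8", "OPENBLAS_NUM_THREADS": "4"})
print(env["OPENBLAS_NUM_THREADS"])  # explicit value is kept
print(env["MKL_NUM_THREADS"])       # unset key is pinned to "1"
print(total_threads(8, int(env["OPENBLAS_NUM_THREADS"])))
```

With the explicit override, 8 OpenMP threads times 4 BLAS threads can put up to 32 threads in contention, which is the case the default pin avoids.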
When BLAS linkage does and does not matter
vibe-qc inherits its perf model from Eigen + the vendored numerical libraries (libint, libxc, libecpint) and its own hand-rolled kernels. EIGEN_USE_BLAS+LAPACKE matters where Eigen ops are the bottleneck — and not where they aren’t.
Where it helps: density-fitting SCF (density_fit=True) at
larger systems, periodic-SCF Cholesky factorisations, MP2
amplitude updates, hessian assembly. Anything dominated by
dense matrix products on Fock-sized matrices (≳ 500 BFs)
benefits.
Where it’s near-flat: RIJCOSX-heavy HF / hybrid DFT at small
systems. The cost there is in the COSX grid loop, AO integral
assembly, and DF 3-center kernel evaluation — paths that don’t
go through Eigen’s dense kernels, so EIGEN_USE_BLAS doesn’t
touch them. Measured on butanethiol (C₄H₁₀S) HF / def2-TZVP /
RIJCOSX, the wall times for no-BLAS, system netlib BLAS, and
vendored OpenBLAS+LAPACKE agree to within 7% (~298 s in each
case).
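Amdahl's law gives one way to see why a faster BLAS barely moves such a run: only the fraction of wall time spent inside dense Eigen kernels can speed up. The sketch below is illustrative only; the function name and the example fraction are assumptions, not measurements from vibe-qc.

```python
def overall_speedup(blas_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-run speedup when only blas_fraction gets faster."""
    if not 0.0 <= blas_fraction <= 1.0:
        raise ValueError("blas_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - blas_fraction) + blas_fraction / kernel_speedup)


# Hypothetical run: 5% of wall time in dense products, BLAS makes them 10x faster.
print(f"{overall_speedup(0.05, 10.0):.3f}")
```

Under those assumed numbers the whole run gains under 5%, the same order as the spread reported above, while a run that spends most of its time in dense products would see most of the kernel speedup.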
If your run is RIJCOSX-bound and you need it faster, the
inner-loop work is in cpp/src/cosx.cpp (Q-junction caching,
shell-pair Schwarz screening on the K build, A_g block
sparsity) rather than the BLAS backend. Track that work in the
perf chat on claude/perf-vs-orca.
Disabling BLAS (debugging)
For numerical regressions where you suspect Eigen / BLAS
roundoff differences (rare — should produce ulp-level
discrepancies at most, easily within SCF tolerances), pass
-DVIBEQC_USE_BLAS=OFF to the CMake configure to force
Eigen’s generic kernels:
pip install --no-build-isolation -e . \
-Ccmake.define.VIBEQC_USE_BLAS=OFF
The banner will then read blas none. Run side-by-side against
a default build to isolate which kernel the discrepancy lives
in.
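One way to organise the side-by-side comparison is to collect a few labelled scalars from each build and flag those that differ beyond a tolerance. The sketch below assumes the values have already been extracted by hand or by script; the labels, numbers, and tolerances are hypothetical and should be adapted to the SCF convergence settings in use.

```python
import math


def find_discrepancies(default_build, no_blas_build, rel_tol=1e-10, abs_tol=1e-12):
    """Return sorted labels whose values differ beyond tolerance between runs."""
    shared = default_build.keys() & no_blas_build.keys()
    return sorted(
        label
        for label in shared
        if not math.isclose(
            default_build[label],
            no_blas_build[label],
            rel_tol=rel_tol,
            abs_tol=abs_tol,
        )
    )


# Hypothetical values for illustration only, not vibe-qc output.
with_blas = {"e_scf": -1.0, "e_mp2": -0.5}
without_blas = {"e_scf": -1.0 + 1e-14, "e_mp2": -0.5001}
print(find_discrepancies(with_blas, without_blas))
```

A quantity that shows up in the returned list points at the stage worth inspecting first; differences at roundoff level fall inside the tolerance and are ignored.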
Citation
OpenBLAS: github.com/OpenMathLib/OpenBLAS, BSD 3-Clause. The LAPACK + LAPACKE sources bundled inside the vendored libopenblas.so are Reference-LAPACK (github.com/Reference-LAPACK/lapack), modified BSD.
Apple Accelerate is system-bundled on macOS; it has no separate citation, and the Accelerate framework reference is the entry point.
Full license inventory: docs/license.md.