Submitting a job to a remote machine with vq¶
A laptop is fine for organic chemistry up to ~30 atoms / cc-pVDZ.
Past that — supercells, dense k-meshes, transition-metal clusters,
big-basis hybrid DFT — you want a bigger box. vq is vibe-qc’s
companion job queue: a small one-binary daemon that lives on the
compute machine, accepts work from your laptop over SSH, and
streams back the outputs when each job finishes. This tutorial
walks through one complete submit-and-fetch cycle using the
dry-run pre-flight so the queue knows in advance which files
the job will write, then vq fetch to pull them home.
If you’ve used SLURM or PBS the verbs map cleanly:
sbatch → vq submit, squeue → vq queue, scancel →
vq kill. Two differences are worth noting: vq is single-host (there is
no job-array primitive yet, so submit each input as its own job), and
it understands vibe-qc inputs specifically, so it can tell which
files to fetch back rather than tarring the whole workspace.
Important
This tutorial assumes you have already installed vq on both
your laptop and a remote compute machine with default_host
pointed at the remote, per
user_guide/queue.md § Installation.
If you haven’t, do that first — it’s a one-time ~5-minute step.
The system¶
A small but representative test job: MgO (rocksalt) at PBE0 / pob-TZVP / Γ-only via the native GDF driver. Small enough that it fits in <2 GB RAM, big enough that the laptop wants ~30 s where a beefier box does it in ~5 s.
Working directory on your laptop:
~/vibeqc-runs/mgo-rocksalt/
    input-mgo-pbe0.py
input-mgo-pbe0.py:
import numpy as np
import vibeqc as vq

# MgO rocksalt at the experimental lattice constant a = 4.21 Å.
a = 4.21 * 1.8897259886  # Å → bohr: a ≈ 7.956 bohr
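# Rocksalt is an FCC lattice with a two-atom basis, so the two-atom cell
# uses the FCC primitive vectors (a simple-cubic cell with O at the body
# centre would be the CsCl structure instead).
sysp = vq.PeriodicSystem(
    dim=3,
    lattice=0.5 * a * np.array([
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0],
    ]),
    unit_cell=[
        vq.Atom(12, [0.0, 0.0, 0.0]),  # Mg
        vq.Atom(8, [a/2, a/2, a/2]),   # O
    ],
)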
vq.run_periodic_job(
    sysp,
    basis="pob-tzvp",
    method="RHF",
    functional="PBE0",
    kmesh=[1, 1, 1],  # Γ-only
    output="output-mgo-pbe0",
)
This is just a normal vibe-qc input — the same file works whether
you run it locally with python input-mgo-pbe0.py or hand it to
vq submit. No vq-specific markup in the Python file.
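A quick way to confirm that a two-atom cell really describes rocksalt is to count the nearest Mg–O neighbours: rocksalt has six at a/2, whereas a simple-cubic cell with the anion at the body centre has eight at a√3/2 (the CsCl arrangement). The sketch below uses only NumPy; the helper name `nearest_neighbours` is illustrative and not part of vibe-qc, and it assumes one lattice vector per row.

```python
import itertools
import numpy as np

def nearest_neighbours(lattice, origin_atom, other_atom, shells=2, tol=1e-8):
    """Return (distance, count) for the nearest periodic images of other_atom.

    lattice: 3x3 array with one lattice vector per row (assumed convention).
    """
    lattice = np.asarray(lattice, dtype=float)
    delta = np.asarray(other_atom, dtype=float) - np.asarray(origin_atom, dtype=float)
    dists = []
    # Scan a small block of neighbouring cells around the origin cell.
    for n in itertools.product(range(-shells, shells + 1), repeat=3):
        shift = np.array(n, dtype=float) @ lattice
        dists.append(np.linalg.norm(delta + shift))
    dists = np.array(dists)
    dmin = dists.min()
    return float(dmin), int(np.sum(np.abs(dists - dmin) < tol))

a = 4.21  # Å
mg, ox = [0.0, 0.0, 0.0], [a / 2, a / 2, a / 2]
fcc = 0.5 * a * np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
cubic = np.eye(3) * a

print("FCC primitive cell:", nearest_neighbours(fcc, mg, ox))
print("simple-cubic cell :", nearest_neighbours(cubic, mg, ox))
```

The FCC primitive cell gives six neighbours at 2.105 Å, while the simple-cubic cell gives eight at about 3.646 Å.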
Step 1 — dry-run pre-flight (locally)¶
Before queueing, sanity-check what the job will write:
cd ~/vibeqc-runs/mgo-rocksalt/
VIBEQC_DRY_RUN=1 python input-mgo-pbe0.py
This short-circuits the runner after the method resolves but
before any compute. It writes a one-shot .system manifest with
[outputs].status = "dry_run" and exits with the declared
artefacts summary on stdout:
vibe-qc dry-run pre-flight — output stem: output-mgo-pbe0
method=RHF basis=pob-tzvp functional=PBE0
Will produce:
output-mgo-pbe0.out (log, always)
output-mgo-pbe0.system (manifest, always)
output-mgo-pbe0.molden (orbitals, always)
output-mgo-pbe0.xyz (geometry, always)
output-mgo-pbe0.POSCAR (geometry, always — periodic)
output-mgo-pbe0.xsf (geometry, always — periodic)
output-mgo-pbe0.bibtex (citations, always)
output-mgo-pbe0.references (citations, always)
output-mgo-pbe0.population.{txt,json} (properties, always)
No SCF run.
The same dry-run is what vq submit --vibeqc-preflight will do
on your behalf in the next step. Running it by hand is optional;
it’s there because the inspection is useful when you want to know
the file family without paying any compute.
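If you want the declared file list in a script rather than on screen, the summary can be parsed with the standard library. This is a minimal sketch that assumes the layout of the sample output above (one `<name> (<kind>, <when>)` entry per line after `Will produce:`); the function names are illustrative, and the `.system` manifest is the more robust source if you have it.

```python
import re

# Excerpt of the dry-run summary (assumed layout, copied from the sample).
SUMMARY = """\
Will produce:
output-mgo-pbe0.out (log, always)
output-mgo-pbe0.system (manifest, always)
output-mgo-pbe0.molden (orbitals, always)
output-mgo-pbe0.population.{txt,json} (properties, always)
No SCF run.
"""

def expand_braces(name):
    """Expand a single {a,b} group: 'x.{txt,json}' -> ['x.txt', 'x.json']."""
    match = re.search(r"\{([^{}]*)\}", name)
    if match is None:
        return [name]
    head, tail = name[:match.start()], name[match.end():]
    return [head + alt + tail for alt in match.group(1).split(",")]

def declared_files(summary):
    """Return the file names listed after the 'Will produce:' line."""
    files, in_list = [], False
    for raw in summary.splitlines():
        line = raw.strip()
        if line.startswith("Will produce:"):
            in_list = True
        elif in_list and "(" in line:
            # Everything before the first "(" is the file name.
            files.extend(expand_braces(line.split("(", 1)[0].strip()))
    return files

print(declared_files(SUMMARY))
```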
Step 2 — submit to the remote queue¶
vq submit --vibeqc-preflight input-mgo-pbe0.py
What vq does, in order:
1. Runs your script once with VIBEQC_DRY_RUN=1 on the laptop, harvesting the resulting .system manifest’s [plan] section into JobSpec.expected_outputs and JobSpec.output_stem. If the preflight times out or the script doesn’t import run_job, the submit proceeds anyway without the plan: the --vibeqc-preflight flag is best-effort.
2. Copies input-mgo-pbe0.py to a fresh per-job workspace on the remote (/var/lib/vq/workspaces/<jobid>/).
3. Enqueues the job in the daemon’s queue with default priority and resource caps (1 CPU, no memory cap, default wall-time-seconds from vq config).
4. Returns the jobid to stdout (a 12-character hex slug):
    queued c0ff50a06462 on planetx
    workspace: /var/lib/vq/workspaces/c0ff50a06462
    watch: vq watch c0ff50a06462
    fetch when done: vq fetch c0ff50a06462 ./outputs/
The Python file is not yet running — it’s in the queue.
Whether it starts immediately depends on --max-jobs and the
queue depth.
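When driving vq from a script, you will usually want to capture the jobid from the submit message. A minimal sketch, assuming the message keeps the `queued <jobid> on <host>` shape shown above; the function name is illustrative.

```python
import re

# Matches "queued <12 hex chars> on <host>", per the sample submit output.
SUBMIT_RE = re.compile(r"queued\s+([0-9a-f]{12})\s+on\s+(\S+)")

def parse_submit_output(text):
    """Return (jobid, host) from vq submit's stdout, or None if absent."""
    match = SUBMIT_RE.search(text)
    return (match.group(1), match.group(2)) if match else None

sample = (
    "queued c0ff50a06462 on planetx\n"
    "workspace: /var/lib/vq/workspaces/c0ff50a06462\n"
)
print(parse_submit_output(sample))
```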
Tip
If your job needs more than one CPU or has a known peak memory, pass them at submit time so the resource cap is right from the start:
vq submit --cpus 4 --mem-mb 8000 --wall-time-seconds 1800 \
--vibeqc-preflight input-mgo-pbe0.py
The cgroup-v2 caps are enforced by systemd-run --user; vq
won’t OOM your remote box because one job got greedy.
Step 3 — monitor¶
Three ways to watch:
# Snapshot of the queue:
vq queue
# JOBID         STATE    ELAPSED   NAME            SCRIPT
# c0ff50a06462  running  00:00:08  input-mgo-pbe0  input-mgo-pbe0.py
# Per-job detail (state machine + tails of stdout/stderr):
vq status c0ff50a06462
# Live-tail the job until it finishes (Ctrl-C exits the watcher,
# the job keeps running):
vq watch c0ff50a06462
vq queue is the equivalent of squeue; vq status is more
like scontrol show job. Both refresh on demand — there’s no
poll loop running between calls.
The state machine you’ll see, in order:
queued → starting → running → done (happy path)
queued → starting → running → failed (SCF crashed or non-zero exit)
queued → starting → running → timed_out (wall-time enforcement)
queued → cancelled (you called vq kill)
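The paths above can be written down as a small transition table, which is handy if you script around vq and want to reject impossible state sequences. This is an illustrative sketch, not vq's internal representation; the `running → cancelled` edge is an assumption taken from the “Killing a runaway job” section rather than from the list above.

```python
# Allowed transitions. "running" -> "cancelled" is an assumption based on
# the "Killing a runaway job" section; the rest are the listed paths.
TRANSITIONS = {
    "queued": {"starting", "cancelled"},
    "starting": {"running"},
    "running": {"done", "failed", "timed_out", "cancelled"},
}
TERMINAL = {"done", "failed", "timed_out", "cancelled"}

def is_complete_path(states):
    """True if states starts at 'queued', follows allowed edges, and ends terminal."""
    if not states or states[0] != "queued":
        return False
    for current, following in zip(states, states[1:]):
        if following not in TRANSITIONS.get(current, set()):
            return False
    return states[-1] in TERMINAL

print(is_complete_path(["queued", "starting", "running", "done"]))
print(is_complete_path(["queued", "running", "done"]))
```

The first call prints `True`; the second prints `False` because it skips `starting`.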
The .system manifest’s [outputs].status field tracks the
output-side state — "running" while the job is alive,
"complete" / "crashed" once it finishes. This lets vq’s
liveness detection distinguish “the SCF crashed and wrote a
.dump” from “the daemon got killed and the job is orphaned”.
Step 4 — fetch the outputs¶
Once the job state is done:
vq fetch c0ff50a06462 -o ./outputs/
This streams the workspace back via SSH + tar. With --job-name
at submit time the destination is
./outputs/<jobname>-<jobid>/; otherwise it’s
./outputs/<jobid>/:
outputs/c0ff50a06462/
    input-mgo-pbe0.py                # the script you submitted
    output-mgo-pbe0.out              # SCF log
    output-mgo-pbe0.system           # manifest with plan + outputs status
    output-mgo-pbe0.molden           # MOs
    output-mgo-pbe0.xyz              # geometry (extended XYZ)
    output-mgo-pbe0.POSCAR           # VASP-style cell
    output-mgo-pbe0.xsf              # XCrySDen structure
    output-mgo-pbe0.bibtex           # citations
    output-mgo-pbe0.references
    output-mgo-pbe0.population.txt
    output-mgo-pbe0.population.json
    stdout.log                       # vq-captured stdout from python
    stderr.log                       # vq-captured stderr from python
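The destination rule described above is simple enough to mirror in a post-processing script, so that it can find a job's outputs without listing the directory. A minimal sketch; the function name and the example job name `mgo` are hypothetical.

```python
from pathlib import Path

def fetch_destination(base, jobid, job_name=None):
    """Mirror the naming rule: <base>/<jobname>-<jobid>/ or <base>/<jobid>/."""
    folder = f"{job_name}-{jobid}" if job_name else jobid
    return Path(base) / folder

print(fetch_destination("./outputs", "c0ff50a06462"))
print(fetch_destination("./outputs", "c0ff50a06462", job_name="mgo"))
```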
The .bibtex / .references are auto-assembled per
user_guide/citations.md — drop the
BibTeX file into your manuscript and \cite{...} away.
Step 5 — read the result on the laptop¶
The fetched directory is everything you’d have if you’d run the job locally. Inspect the energy:
grep -E "Total energy|converged" outputs/c0ff50a06462/output-mgo-pbe0.out
# Total energy: -274.7821345 Ha
# SCF converged in 14 iterations.
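The same extraction in Python is convenient when you collect energies from many fetched jobs. This sketch assumes the two log lines look exactly like the grep output above; if the log prints a total energy on every iteration, taking the last match gives the final value.

```python
import re

ENERGY_RE = re.compile(r"Total energy:\s*(-?\d+\.\d+)\s*Ha")
ITER_RE = re.compile(r"SCF converged in (\d+) iterations")

def parse_scf_log(text):
    """Return the last reported total energy (Ha) and the iteration count."""
    energies = ENERGY_RE.findall(text)
    iterations = ITER_RE.search(text)
    return {
        "energy_ha": float(energies[-1]) if energies else None,
        "iterations": int(iterations.group(1)) if iterations else None,
    }

sample = "Total energy: -274.7821345 Ha\nSCF converged in 14 iterations.\n"
result = parse_scf_log(sample)
print(result)
```

A missing `iterations` value is a cheap signal that the SCF did not converge and the energy should not be trusted.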
Cross-check the manifest’s hardware block to know what produced the number:
python -c '
import tomllib, sys
with open(sys.argv[1], "rb") as f:
    m = tomllib.load(f)
print("CPU :", m["cpu"]["model"])
print("OMP :", m["cpu"]["omp_threads_used"])
print("RAM :", m["memory"]["total_gb"], "GB")
print("vibeqc:", m["vibeqc"]["version"], m["vibeqc"]["git_sha"])
' outputs/c0ff50a06462/output-mgo-pbe0.system
Common operations¶
Re-running the same job¶
vq resubmit c0ff50a06462
# → new jobid with a fresh workspace, same inputs.
Useful when the original run hit a transient failure (a flaky mount, OOM from a co-tenant job) and you just want to try again without rebuilding the workspace from scratch.
Killing a runaway job¶
vq kill c0ff50a06462
# → systemd-run sends SIGTERM, escalates to SIGKILL after the
# grace period.
The state flips to cancelled and [outputs].status becomes
"crashed" (since the SCF didn’t finish).
Cleaning up old workspaces¶
vq cleanup --older-than 14d
# Removes workspaces of done/cancelled/failed jobs older than 14
# days from the remote.
Default workspace retention is set in vq config; manual cleanup
is for the case where you’ve fetched everything you need and want
the disk back.
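To make the selection rule concrete, the sketch below applies "finished state and older than the cutoff" to an in-memory job list. It is purely illustrative: the job records, the extra job ids, and the set of accepted unit suffixes are assumptions (only `14d` appears in this tutorial), and vq's own bookkeeping may differ.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; only "d" is confirmed by the example above.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
FINISHED = {"done", "cancelled", "failed"}

def parse_age(spec):
    """Turn a string such as '14d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", spec)
    if match is None:
        raise ValueError(f"unrecognised age: {spec!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})

def eligible_for_cleanup(jobs, older_than, now):
    """jobs: iterable of (jobid, state, finished_at) tuples."""
    cutoff = now - parse_age(older_than)
    return [jobid for jobid, state, finished_at in jobs
            if state in FINISHED and finished_at < cutoff]

now = datetime(2025, 1, 31, tzinfo=timezone.utc)
jobs = [
    ("c0ff50a06462", "done", now - timedelta(days=20)),
    ("aaaaaaaaaaaa", "failed", now - timedelta(days=3)),    # too recent
    ("bbbbbbbbbbbb", "running", now - timedelta(days=30)),  # not finished
]
print(eligible_for_cleanup(jobs, "14d", now))
```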
Submitting an entire directory¶
For multi-file inputs (geometry file + Python script that reads it, sweep over several functionals, …):
vq submit -d ./my_sweep_dir/ --vibeqc-preflight -- python run.py
# -d <dir> = the directory to copy across
# -- = end of vq flags
# python ... = literal command to run inside the workspace
Pausing the queue¶
vq pause # daemon stops dispatching new jobs;
# running jobs keep going.
vq resume # back to normal dispatch.
vq throttle --max-jobs 1 # cap concurrency without pausing
Handy when you’re running an interactive session on the remote box and don’t want vq to fill the CPU on top of you.
Why --vibeqc-preflight matters¶
Without the preflight, vq’s JobSpec.expected_outputs is empty
— vq doesn’t know what files the job will write, so it has to
tar-stream the entire workspace at fetch time. That’s fine for a
single small job but wasteful for sweeps over many parameters
(each workspace carries the same input geometry, the same stdout
log shape, etc.).
With the preflight, the daemon knows the exact file list before
the job runs. vq fetch becomes selective; vq queue can
display “outputs 5/8 written” by reading the .system manifest
without running the SCF; the dashboard can show real progress
bars per declared artefact.
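A declared file list is what makes a progress figure such as “outputs 5/8 written” computable at all. The sketch below derives the same kind of string by checking which declared files exist in a directory; note that this swaps in a plain filesystem check for the manifest-based approach the text describes, and the function name is illustrative.

```python
from pathlib import Path

def output_progress(workspace, expected):
    """Count how many declared output files already exist in workspace."""
    workspace = Path(workspace)
    written = sum(1 for name in expected if (workspace / name).is_file())
    return f"outputs {written}/{len(expected)} written"

expected = ["output-mgo-pbe0.out", "output-mgo-pbe0.system", "output-mgo-pbe0.molden"]
print(output_progress(".", expected))
```

Without a declared list there is no denominator, which is why the progress display depends on the preflight.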
The trade-off is the ~1-2 s pre-flight cost at submit time. It
runs your Python file (with VIBEQC_DRY_RUN=1) inside a 10-s
timeout; if your input file has a slow import or sets up
non-trivial state, that runtime grows accordingly. The flag is
opt-in for that reason — set it when the gain (output-aware
fetch, progress visibility, crash detection) outweighs the cost.
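The best-effort behaviour can be approximated with `subprocess.run` and its `timeout` argument. This is a sketch of the pattern under the stated 10-second limit, not vq's actual code; the function name is illustrative.

```python
import os
import subprocess
import sys

def run_preflight(script, timeout_s=10.0):
    """Run script with VIBEQC_DRY_RUN=1.

    Returns the captured stdout, or None if the run timed out or exited
    with a non-zero status (the caller can then submit without a plan).
    """
    env = dict(os.environ, VIBEQC_DRY_RUN="1")
    try:
        proc = subprocess.run(
            [sys.executable, script],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout if proc.returncode == 0 else None
```

Calling `run_preflight("input-mgo-pbe0.py")` returns the dry-run summary when the script finishes in time, and `None` otherwise, so a slow import degrades to a plan-less submit rather than an error.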
What’s still local-only¶
A few things have NOT been wired through vq yet:
- Remote pre-flight. --vibeqc-preflight runs the dry-run on the submitting laptop, not on the remote. The submit fails loudly if the laptop can’t import vibe-qc (you need at least a light vibe-qc install for the preflight to work). Remote pre-flight is the next milestone.
- Job arrays. Submit each input as its own vq submit call for now; the queue dispatches them in priority order.
- GPU resource claims. Single-host CPU + memory caps via cgroup-v2 are honoured; no GPU claim machinery yet.
- Cluster scheduling across multiple nodes. vq is single-host by design: it pairs nicely with one beefy compute box, not with a multi-node HPC cluster (use SLURM for that).
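Until job arrays exist, a short loop covers the common sweep case. The sketch below only builds and prints one `vq submit --vibeqc-preflight <file>` command per input, using the flags shown in this tutorial; the `input-*.py` naming pattern and the function name are assumptions.

```python
import shlex
from pathlib import Path

def submit_commands(input_dir, pattern="input-*.py"):
    """Build one 'vq submit' command per matching input file, sorted by name."""
    return [
        ["vq", "submit", "--vibeqc-preflight", str(path)]
        for path in sorted(Path(input_dir).glob(pattern))
    ]

for cmd in submit_commands("."):
    print(shlex.join(cmd))
    # To actually submit, replace the print with:
    #   subprocess.run(cmd, check=True)   # requires "import subprocess"
```

Printing first makes it easy to review the sweep before anything reaches the queue.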
Resources¶
The MgO/PBE0/pob-TZVP/Γ job in this tutorial: ~5 s wall on a modern desktop CPU, peak RAM ~1.5 GB. The submit→fetch cycle including the preflight + tar-stream is ~10-15 s end-to-end. For larger jobs (multi-k mesh, hybrid DFT on transition-metal clusters, MP2 on def2-TZVP) the wall time of the SCF dominates; the vq overhead stays ~constant.
References¶
- vq design document. user_guide/queue.md: full 802-line reference for the queue (state machine, configuration, multi-venv routing, external-program workflows, operator controls).
- Phase O4 design (dry-run pre-flight). docs/design_output_module.md § “vq integration contract”: why the preflight pattern is shaped the way it is.
Next¶
- User guide: vq job queue. Every command and flag, with long-form coverage of the config file, multi-venv routing, external-program workflows (CRYSTAL14, ORCA, PySCF parity runs), web dashboard, and admin commands.
- Tutorial 40: auto-citations. The .bibtex / .references siblings vq fetches back are drop-in for your manuscript bibliography.
- Tutorial 26: cross-validation. Running the same input through vibe-qc + PySCF + ORCA over vq for parity work.