Submitting a job to a remote machine with `vq`

A laptop is fine for organic chemistry up to ~30 atoms / cc-pVDZ. Past that (supercells, dense k-meshes, transition-metal clusters, big-basis hybrid DFT) you want a bigger box. vq is vibe-qc’s companion job queue: a small one-binary daemon that lives on the compute machine, accepts work from your laptop over SSH, and streams the outputs back once a job has finished. This tutorial walks through one complete submit-and-fetch cycle: the dry-run pre-flight tells the queue in advance which files the job will write, and vq fetch pulls them home.

If you’ve used SLURM or PBS the verbs map cleanly: sbatch → vq submit, squeue → vq queue, scancel → vq kill. Two differences are worth noting: vq is single-host (there is no job-array primitive yet, so submit each input as its own job), and it understands vibe-qc inputs specifically, so it can tell which files to fetch back rather than tarring the whole workspace.

Important

This tutorial assumes you have already installed vq on both your laptop and a remote compute machine with default_host pointed at the remote, per user_guide/queue.md § Installation. If you haven’t, do that first — it’s a one-time ~5-minute step.

The system

A small but representative test job: MgO (rocksalt) at PBE0 / pob-TZVP / Γ-only via the native GDF driver. Small enough that it fits in <2 GB RAM, big enough that the laptop wants ~30 s where a beefier box does it in ~5 s.

Working directory on your laptop:

~/vibeqc-runs/mgo-rocksalt/
    input-mgo-pbe0.py

input-mgo-pbe0.py:

import numpy as np
import vibeqc as vq

# MgO rocksalt at experimental lattice constant a = 4.21 Å.
a = 4.21 * 1.8897259886    # → 7.956 bohr

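sysp = vq.PeriodicSystem(
    dim=3,
    # fcc primitive vectors. A cubic np.eye(3) * a cell with this two-atom
    # basis would describe the CsCl structure, not rocksalt.
    lattice=(a / 2) * np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]]),
    unit_cell=[
        vq.Atom(12, [0.0, 0.0, 0.0]),                # Mg
        vq.Atom(8,  [a/2, a/2, a/2]),                # O
    ],
)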

vq.run_periodic_job(
    sysp,
    basis="pob-tzvp",
    method="RHF",
    functional="PBE0",
    kmesh=[1, 1, 1],          # Γ-only
    output="output-mgo-pbe0",
)

This is just a normal vibe-qc input — the same file works whether you run it locally with python input-mgo-pbe0.py or hand it to vq submit. No vq-specific markup in the Python file.
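A cheap way to sanity-check a periodic cell before spending compute on it is to count nearest neighbours: in rocksalt every Mg has six O neighbours at a/2. The sketch below is plain Python and independent of vibe-qc; `nearest_neighbours` is a made-up helper for this illustration. It compares a two-atom face-centred primitive cell with a two-atom simple-cubic cell that uses the same atomic positions.

```python
import itertools
import math


def nearest_neighbours(lattice, origin, other, tol=1e-6):
    """Return (distance, count) of the closest periodic images of `other` around `origin`.

    lattice: three lattice vectors given as rows, in Cartesian coordinates.
    origin, other: Cartesian positions in the same length unit.
    """
    distances = []
    # Scan a 5x5x5 block of neighbouring cells; enough for compact cells like these.
    for i, j, k in itertools.product(range(-2, 3), repeat=3):
        image = [
            other[c] + i * lattice[0][c] + j * lattice[1][c] + k * lattice[2][c]
            for c in range(3)
        ]
        distances.append(math.dist(origin, image))
    d_min = min(distances)
    count = sum(1 for d in distances if abs(d - d_min) < tol)
    return d_min, count


a = 4.21  # lattice constant in Å
h = a / 2
fcc_primitive = [[0.0, h, h], [h, 0.0, h], [h, h, 0.0]]
simple_cubic = [[a, 0.0, 0.0], [0.0, a, 0.0], [0.0, 0.0, a]]
mg, o = [0.0, 0.0, 0.0], [h, h, h]

# Rocksalt: six O neighbours at a/2.
print(nearest_neighbours(fcc_primitive, mg, o))
# CsCl-type: eight O neighbours at a*sqrt(3)/2.
print(nearest_neighbours(simple_cubic, mg, o))
```

If the count comes out as eight rather than six, the lattice vectors describe a different structure from the one intended.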

Step 1: dry-run pre-flight (locally)

Before queueing, sanity-check what the job will write:

cd ~/vibeqc-runs/mgo-rocksalt/
VIBEQC_DRY_RUN=1 python input-mgo-pbe0.py

This short-circuits the runner after the method resolves but before any compute. It writes a one-shot .system manifest with [outputs].status = "dry_run", prints a summary of the declared artefacts to stdout, and exits:

vibe-qc dry-run pre-flight — output stem: output-mgo-pbe0
  method=RHF basis=pob-tzvp functional=PBE0
  Will produce:
    output-mgo-pbe0.out          (log, always)
    output-mgo-pbe0.system       (manifest, always)
    output-mgo-pbe0.molden       (orbitals, always)
    output-mgo-pbe0.xyz          (geometry, always)
    output-mgo-pbe0.POSCAR       (geometry, always — periodic)
    output-mgo-pbe0.xsf          (geometry, always — periodic)
    output-mgo-pbe0.bibtex       (citations, always)
    output-mgo-pbe0.references   (citations, always)
    output-mgo-pbe0.population.{txt,json}  (properties, always)
  No SCF run.

The same dry-run is what vq submit --vibeqc-preflight will do on your behalf in the next step. Running it by hand is optional; it’s there because the inspection is useful when you want to know the file family without paying any compute.
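The summary uses shell-style brace shorthand (`output-mgo-pbe0.population.{txt,json}`) for file families. The following is a minimal sketch of how such entries could be expanded into concrete filenames and checked against a directory. `expand_braces` and `missing_outputs` are hypothetical helpers written for this tutorial, not part of vibe-qc or vq.

```python
import re
import tempfile
from pathlib import Path

_BRACE = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern):
    """Expand `{a,b,...}` groups one at a time; plain names pass through unchanged."""
    match = _BRACE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    names = []
    for option in match.group(1).split(","):
        names.extend(expand_braces(head + option + tail))
    return names


def missing_outputs(directory, declared):
    """Return the declared files (after expansion) that do not exist in `directory`."""
    directory = Path(directory)
    expected = [name for entry in declared for name in expand_braces(entry)]
    return [name for name in expected if not (directory / name).exists()]


declared = [
    "output-mgo-pbe0.out",
    "output-mgo-pbe0.system",
    "output-mgo-pbe0.population.{txt,json}",
]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend only the log has been written so far.
    (Path(tmp) / "output-mgo-pbe0.out").write_text("log\n")
    print(missing_outputs(tmp, declared))
```

Run against a fetched directory, the same check gives a quick answer to "did everything the dry-run promised actually arrive?".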

Step 2: submit to the remote queue

vq submit --vibeqc-preflight input-mgo-pbe0.py

What vq does, in order:

1. Runs your script once with VIBEQC_DRY_RUN=1 on the laptop, harvesting the resulting .system manifest’s [plan] section into JobSpec.expected_outputs and JobSpec.output_stem. If the preflight times out or the script doesn’t import run_job, the submit proceeds anyway without the plan: the --vibeqc-preflight flag is best-effort.
2. Copies input-mgo-pbe0.py to a fresh per-job workspace on the remote (/var/lib/vq/workspaces/<jobid>/).
3. Enqueues the job in the daemon’s queue with default priority and resource caps (1 CPU, no memory cap, default wall-time-seconds from vq config).

Returns the jobid to stdout (a 12-character hex slug):

queued c0ff50a06462 on planetx
  workspace: /var/lib/vq/workspaces/c0ff50a06462
  watch:     vq watch c0ff50a06462
  fetch when done: vq fetch c0ff50a06462 ./outputs/

The Python file is not yet running — it’s in the queue. Whether it starts immediately depends on --max-jobs and the queue depth.
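When a script drives the submission, the jobid has to be recovered from that printed text. The sketch below assumes the output format shown above (a `queued <jobid> on <host>` first line with a 12-character hex jobid); that format may differ between vq versions. `parse_submit_output` and `submit` are illustrative names, not part of vq.

```python
import re
import subprocess

_QUEUED = re.compile(r"^queued\s+([0-9a-f]{12})\s+on\s+(\S+)", re.MULTILINE)


def parse_submit_output(text):
    """Extract (jobid, host) from the text printed at submit time; None if absent."""
    match = _QUEUED.search(text)
    return (match.group(1), match.group(2)) if match else None


def submit(script):
    """Run `vq submit --vibeqc-preflight <script>` and return (jobid, host)."""
    result = subprocess.run(
        ["vq", "submit", "--vibeqc-preflight", script],
        capture_output=True, text=True, check=True,
    )
    parsed = parse_submit_output(result.stdout)
    if parsed is None:
        raise RuntimeError(f"no jobid found in: {result.stdout!r}")
    return parsed


# Parsing demo on the sample output from this tutorial (no vq needed).
sample = """queued c0ff50a06462 on planetx
  workspace: /var/lib/vq/workspaces/c0ff50a06462
  watch:     vq watch c0ff50a06462
"""
print(parse_submit_output(sample))  # ('c0ff50a06462', 'planetx')
```

Keeping the parsing separate from the subprocess call makes it testable on a machine without vq installed.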

Tip

If your job needs more than one CPU or has a known peak memory, pass them at submit time so the resource cap is right from the start:

vq submit --cpus 4 --mem-mb 8000 --wall-time-seconds 1800 \
    --vibeqc-preflight input-mgo-pbe0.py

The cgroup-v2 caps are enforced by systemd-run --user; vq won’t OOM your remote box because one job got greedy.
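For readers unfamiliar with systemd-run, the sketch below shows what equivalent limits look like when written by hand. How vq builds its own units is not documented in this tutorial, so treat this purely as an illustration: `systemd_run_argv` is a hypothetical helper, while `CPUQuota=`, `MemoryMax=` and `RuntimeMaxSec=` are standard systemd unit properties.

```python
import shlex


def systemd_run_argv(command, cpus=1, mem_mb=None, wall_time_seconds=None):
    """Build a `systemd-run --user` command line that applies cgroup-v2 limits."""
    argv = [
        "systemd-run", "--user", "--wait", "--collect",
        # CPUQuota is a percentage of one CPU, so 4 CPUs -> 400%.
        "-p", f"CPUQuota={cpus * 100}%",
    ]
    if mem_mb is not None:
        argv += ["-p", f"MemoryMax={mem_mb}M"]
    if wall_time_seconds is not None:
        argv += ["-p", f"RuntimeMaxSec={wall_time_seconds}"]
    return argv + list(command)


argv = systemd_run_argv(
    ["python", "input-mgo-pbe0.py"],
    cpus=4, mem_mb=8000, wall_time_seconds=1800,
)
print(shlex.join(argv))
```

Note that `CPUQuota=` limits CPU time rather than pinning the job to specific cores.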

Step 3: monitor

Three ways to watch:

# Snapshot of the queue:
vq queue
# JOBID         STATE     ELAPSED   NAME              SCRIPT
# c0ff50a06462  running   00:00:08  input-mgo-pbe0    input-mgo-pbe0.py

# Per-job detail (state machine + tails of stdout/stderr):
vq status c0ff50a06462

# Live-tail the job until it finishes (Ctrl-C exits the watcher,
# the job keeps running):
vq watch c0ff50a06462

vq queue is the equivalent of squeue; vq status is more like scontrol show job. Both refresh on demand — there’s no poll loop running between calls.

The state machine you’ll see, in order:

queued → starting → running → done            (happy path)
queued → starting → running → failed          (SCF crashed or non-zero exit)
queued → starting → running → timed_out       (wall-time enforcement)
queued → cancelled                            (you called vq kill)

The .system manifest’s [outputs].status field tracks the output-side state — "running" while the job is alive, "complete" / "crashed" once it finishes. This lets vq’s liveness detection distinguish “the SCF crashed and wrote a .dump” from “the daemon got killed and the job is orphaned”.
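The job states listed above can be written down as a transition table, which is handy when scripting around the queue. This is a sketch, not vq's internal representation. It encodes the four listed paths plus `running → cancelled`, which is inferred from the "Killing a runaway job" section rather than from the list. The `poll` argument is a hypothetical callable that returns the current state string, for example by wrapping `vq status`.

```python
TRANSITIONS = {
    "queued": {"starting", "cancelled"},
    "starting": {"running"},
    # running -> cancelled is inferred from the kill section, not the list above.
    "running": {"done", "failed", "timed_out", "cancelled"},
    "done": set(),
    "failed": set(),
    "timed_out": set(),
    "cancelled": set(),
}


def is_terminal(state):
    """A state is terminal when no further transition leaves it."""
    return not TRANSITIONS[state]


def is_valid_history(states):
    """Check that a sequence of states starts at 'queued' and follows the table."""
    if not states or states[0] != "queued":
        return False
    return all(b in TRANSITIONS[a] for a, b in zip(states, states[1:]))


def wait_for_terminal(poll, max_polls=1000):
    """Call `poll()` until it reports a terminal state; return the distinct states seen."""
    history = []
    for _ in range(max_polls):
        state = poll()
        if not history or state != history[-1]:
            history.append(state)
        if is_terminal(state):
            return history
    raise TimeoutError("job did not reach a terminal state")


# Demo with canned states standing in for repeated status queries.
canned = iter(["queued", "queued", "starting", "running", "running", "done"])
print(wait_for_terminal(lambda: next(canned)))
```

A real polling loop would sleep between calls, and may skip short-lived states such as `starting`.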

Step 4: fetch the outputs

Once the job state is done:

vq fetch c0ff50a06462 -o ./outputs/

This streams the workspace back via SSH + tar. With --job-name at submit time the destination is ./outputs/<jobname>-<jobid>/; otherwise it’s ./outputs/<jobid>/:

outputs/c0ff50a06462/
    input-mgo-pbe0.py             # the script you submitted
    output-mgo-pbe0.out           # SCF log
    output-mgo-pbe0.system        # manifest with plan + outputs status
    output-mgo-pbe0.molden        # MOs
    output-mgo-pbe0.xyz           # geometry (extended XYZ)
    output-mgo-pbe0.POSCAR        # VASP-style cell
    output-mgo-pbe0.xsf           # XCrySDen structure
    output-mgo-pbe0.bibtex        # citations
    output-mgo-pbe0.references
    output-mgo-pbe0.population.txt
    output-mgo-pbe0.population.json
    stdout.log                    # vq-captured stdout from python
    stderr.log                    # vq-captured stderr from python
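Scripts that post-process results need to know where a fetch landed. The sketch below mirrors the naming rule described above (`<jobname>-<jobid>` when a job name was given, `<jobid>` otherwise). `fetch_destination` is an illustrative helper, not a vq API.

```python
from pathlib import Path


def fetch_destination(output_root, jobid, job_name=None):
    """Return the directory a fetch is expected to populate under `output_root`."""
    leaf = f"{job_name}-{jobid}" if job_name else jobid
    return Path(output_root) / leaf


print(fetch_destination("./outputs", "c0ff50a06462"))
print(fetch_destination("./outputs", "c0ff50a06462", job_name="mgo"))
```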

The .bibtex / .references are auto-assembled per user_guide/citations.md — drop the BibTeX file into your manuscript and \cite{...} away.

Step 5: read the result on the laptop

The fetched directory is everything you’d have if you’d run the job locally. Inspect the energy:

grep -E "Total energy|converged" outputs/c0ff50a06462/output-mgo-pbe0.out
# Total energy:                  -274.7821345  Ha
# SCF converged in 14 iterations.
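The same extraction in Python is convenient when collecting energies from many jobs. This sketch assumes the two log lines look exactly like the grep output above; the real log may contain other variants, and `summarise_log` is a name invented here.

```python
import re

_ENERGY = re.compile(r"Total energy:\s*(-?\d+\.\d+)\s*Ha")
_CONVERGED = re.compile(r"SCF converged in (\d+) iterations")


def summarise_log(text):
    """Return the last reported total energy (Ha) and the SCF iteration count."""
    energies = _ENERGY.findall(text)
    converged = _CONVERGED.search(text)
    return {
        # Use the last match in case the log prints intermediate energies.
        "energy_ha": float(energies[-1]) if energies else None,
        "iterations": int(converged.group(1)) if converged else None,
    }


sample_log = """\
Total energy:                  -274.7821345  Ha
SCF converged in 14 iterations.
"""
print(summarise_log(sample_log))
```

A missing `iterations` value is a quick signal that the SCF did not report convergence.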

Cross-check the manifest’s hardware block to know what produced the number:

python -c '
# tomllib is in the standard library from Python 3.11 onwards.
import sys
import tomllib
with open(sys.argv[1], "rb") as f:
    m = tomllib.load(f)
print("CPU :", m["cpu"]["model"])
print("OMP :", m["cpu"]["omp_threads_used"])
print("RAM :", m["memory"]["total_gb"], "GB")
print("vibeqc:", m["vibeqc"]["version"], m["vibeqc"]["git_sha"])
' outputs/c0ff50a06462/output-mgo-pbe0.system

Common operations

Re-running the same job

vq resubmit c0ff50a06462
# → new jobid with a fresh workspace, same inputs.

Useful when the original run hit a transient failure (a flaky mount, OOM from a co-tenant job) and you just want to try again without re-staging the inputs by hand.

Killing a runaway job

vq kill c0ff50a06462
# → systemd-run sends SIGTERM, escalates to SIGKILL after the
#   grace period.

The state flips to cancelled and [outputs].status becomes "crashed" (since the SCF didn’t finish).

Cleaning up old workspaces

vq cleanup --older-than 14d
# Removes workspaces of done/cancelled/failed jobs older than 14
# days from the remote.

Default workspace retention is set in vq config; manual cleanup is for the case where you’ve fetched everything you need and want the disk back.
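To preview which workspaces an age cut-off would catch, the rule can be sketched in a few lines. Only the `14d` form appears in this tutorial; the other unit suffixes accepted below (`s`, `m`, `h`) are assumptions, and the second jobid is made up for the example.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_age(text):
    """Turn a string such as '14d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised age: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def stale_jobs(finished_at, max_age, now):
    """Return jobids whose finish time lies more than `max_age` before `now`."""
    return sorted(job for job, t in finished_at.items() if now - t > max_age)


now = datetime(2025, 1, 31, 12, 0)
finished_at = {
    "c0ff50a06462": datetime(2025, 1, 1, 9, 0),   # 30 days old
    "deadbeef0001": datetime(2025, 1, 25, 9, 0),  # 6 days old (made-up jobid)
}
print(stale_jobs(finished_at, parse_age("14d"), now))  # ['c0ff50a06462']
```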

Submitting an entire directory

For multi-file inputs (geometry file + Python script that reads it, sweep over several functionals, …):

vq submit -d ./my_sweep_dir/ --vibeqc-preflight -- python run.py
# -d <dir>   = the directory to copy across
# --         = end of vq flags
# python ... = literal command to run inside the workspace

See user_guide/queue.md § Submission forms.

Pausing the queue

vq pause                  # daemon stops dispatching new jobs;
                          # running jobs keep going.
vq resume                 # back to normal dispatch.
vq throttle --max-jobs 1  # cap concurrency without pausing

Handy when you’re running an interactive session on the remote box and don’t want vq to fill the CPU on top of you.

Why `--vibeqc-preflight` matters

Without the preflight, vq’s JobSpec.expected_outputs is empty — vq doesn’t know what files the job will write, so it has to tar-stream the entire workspace at fetch time. That’s fine for a single small job but wasteful for sweeps over many parameters (each workspace carries the same input geometry, the same stdout log shape, etc.).

With the preflight, the daemon knows the exact file list before the job runs. vq fetch becomes selective; vq queue can display “outputs 5/8 written” by reading the .system manifest without running the SCF; the dashboard can show real progress bars per declared artefact.
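The difference between whole-workspace and selective transfer can be illustrated with the standard library's tarfile module. This is a sketch of the idea only, not vq's transfer code: `pack_outputs` and `archive_names` are hypothetical helpers, and the archive is built in memory rather than streamed over SSH.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def pack_outputs(workspace, expected_outputs=None):
    """Tar the whole workspace, or only `expected_outputs` when a plan exists."""
    workspace = Path(workspace)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        if expected_outputs:
            for name in expected_outputs:
                path = workspace / name
                if path.is_file():  # skip declared files the job never wrote
                    archive.add(path, arcname=name)
        else:
            for path in sorted(workspace.rglob("*")):
                if path.is_file():
                    archive.add(path, arcname=path.relative_to(workspace).as_posix())
    return buffer.getvalue()


def archive_names(data):
    """List the member names of a gzip-compressed tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        return sorted(archive.getnames())


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "output-mgo-pbe0.out").write_text("log\n")
    (Path(tmp) / "scratch.tmp").write_text("large intermediate\n")
    everything = pack_outputs(tmp)
    selective = pack_outputs(tmp, ["output-mgo-pbe0.out", "output-mgo-pbe0.molden"])
    print(archive_names(everything))
    print(archive_names(selective))
```

With a plan, scratch files stay on the remote; without one, they travel with everything else.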

The trade-off is the ~1-2 s pre-flight cost at submit time. It runs your Python file (with VIBEQC_DRY_RUN=1) inside a 10-s timeout; if your input file has a slow import or sets up non-trivial state, that runtime grows accordingly. The flag is opt-in for that reason — set it when the gain (output-aware fetch, progress visibility, crash detection) outweighs the cost.
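The best-effort pattern described here (set `VIBEQC_DRY_RUN=1`, run under a timeout, carry on without a plan if anything goes wrong) can be sketched with subprocess. This is an illustration, not vq's implementation. `run_preflight` is a hypothetical helper, and the 10 s default simply mirrors the figure quoted above.

```python
import os
import subprocess
import sys


def run_preflight(script_args, timeout_s=10.0):
    """Run the interpreter with VIBEQC_DRY_RUN=1; return stdout, or None on failure.

    Best-effort: a timeout or a non-zero exit yields None instead of raising,
    so the caller can proceed without a plan.
    """
    env = dict(os.environ, VIBEQC_DRY_RUN="1")
    try:
        result = subprocess.run(
            [sys.executable, *script_args],
            env=env, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout if result.returncode == 0 else None


# Demo: a stand-in script that just echoes the environment variable.
print(run_preflight(["-c", "import os; print(os.environ['VIBEQC_DRY_RUN'])"]))
```

For a real input the call would be `run_preflight(["input-mgo-pbe0.py"])`. A slow import in that file counts against the timeout, which is why heavy module-level setup makes the preflight less useful.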

What’s still local-only

A few things have NOT been wired through vq yet:

Remote pre-flight. --vibeqc-preflight runs the dry-run on the submitting laptop, not on the remote. The submit fails loudly if the laptop can’t import vibe-qc (you need at least a light vibe-qc install for the preflight to work). Remote pre-flight is the next milestone.
Job arrays. Submit each input as its own vq submit call for now; the queue dispatches them in priority order.
GPU resource claims. Single-host CPU + memory caps via cgroup-v2 are honoured; no GPU claim machinery yet.
Cluster scheduling across multiple nodes. vq is single-host by design — it pairs nicely with one beefy compute box, not with a multi-node HPC cluster (use SLURM for that).

Resources

The MgO/PBE0/pob-TZVP/Γ job in this tutorial: ~5 s wall on a modern desktop CPU, peak RAM ~1.5 GB. The submit→fetch cycle including the preflight + tar-stream is ~10-15 s end-to-end. For larger jobs (multi-k mesh, hybrid DFT on transition-metal clusters, MP2 on def2-TZVP) the wall time of the SCF dominates; the vq overhead stays ~constant.

References

vq design document. user_guide/queue.md — full 802-line reference for the queue: state machine, configuration, multi-venv routing, external-program workflows, operator controls.
Phase O4 design — dry-run pre-flight. docs/design_output_module.md § “vq integration contract” — why the preflight pattern is shaped the way it is.

Next

User guide: vq job queue. Every command and flag, with long-form coverage of the config file, multi-venv routing, external-program workflows (CRYSTAL14, ORCA, PySCF parity runs), web dashboard, and admin commands.
Tutorial 40: auto-citations. The .bibtex / .references siblings vq fetches back are drop-in for your manuscript bibliography.
Tutorial 26: cross-validation. Running the same input through vibe-qc + PySCF + ORCA over vq for parity work.