Handover: running ORCA & CRYSTAL test sets on the vq fleet (2026-05-20)

From: queue-system dev chat (vq).
To: a test-set chat running ORCA + CRYSTAL jobs.
Scope: how to dispatch a test set across planetx and mars via vq: what the machines can do, how to submit, how to monitor, and the gotchas.

What vq is

vq is the cross-machine job queue. You submit from your laptop with vq submit <host> …; a daemon on each host dispatches the work into isolated systemd-run scopes. Both hosts currently run vq 0.6.24; daemons healthy.

You do not SSH into the hosts to run jobs — you vq submit and the daemon handles dispatch, monitoring, and bookkeeping.

The two machines

Host	Cores	RAM	Concurrency	Best for
planetx	32	125 GB	`--max-jobs 1` — one job at a time, but it gets the whole box	big parallel CRYSTAL / ORCA runs
mars	16	62 GB	`--max-jobs 2`, `--max-cpus 12` — 2 jobs concurrently, 12 CPU slots, 4 threads reserved for the desktop	smaller jobs, 2-wide throughput

Strategy for a test set: fan small / serial jobs to mars (runs 2-wide), send the big parallel ones to planetx (1-wide, full machine). Submit everything up front — vq queues it and drains in priority-then-FIFO order.
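The routing rule above can be sketched as a small dry-run planner. This is an illustration, not part of vq: `pick_host`, `plan_submissions`, the job directory names, and the 6-CPU cutoff (half of mars's 12 slots, so two such jobs can run side by side) are all assumptions. It only prints the vq submit lines it would issue.

```shell
# Hypothetical helper: route a job by the CPU count it asks for.
# Cutoff 6 = half of mars's 12 CPU slots (an assumed policy, tune as needed).
pick_host() {
    if [ "$1" -le 6 ]; then
        echo mars
    else
        echo planetx
    fi
}

# Reads "<job_dir> <cpus> <wall_seconds>" lines on stdin and prints one
# vq submit command per line. Nothing is submitted.
plan_submissions() {
    while read -r dir cpus wall; do
        [ -n "$dir" ] || continue          # skip blank lines
        echo "vq submit $(pick_host "$cpus") -d $dir --wall-time-seconds $wall --cpus $cpus -- bash run.sh"
    done
}

# Example plan for two hypothetical job directories.
printf '%s\n' './h2o_opt 4 3600' './mgo_slab 32 14400' | plan_submissions
```

Review the printed lines first; piping them to sh would submit them for real.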

Engines available

The same engine binaries are installed on both hosts, except psi4, which is on planetx only (verify live with vq programs planetx / vq programs mars):

`vq programs` name	Engine
`crystal` / `Pcrystal`	CRYSTAL14 serial / parallel (OpenMPI)
`properties` / `Pproperties`	CRYSTAL14 properties serial / parallel
`crystal23demo` / `properties23demo`	CRYSTAL23 demo — serial only, 10-atom-per-primitive-cell limit
`orca`	ORCA 6.1.1 (shared OpenMPI 4.1.8 build)
`psi4`	planetx only (mars has no psi4)

Cells larger than 10 atoms must use crystal / Pcrystal (CRYSTAL14) — the CRYSTAL23 demo binary refuses them.
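A pre-submit check for the 10-atom limit might look like the sketch below. The function names are hypothetical, and the atom count is supplied by the caller; nothing here parses a CRYSTAL deck.

```shell
# Succeeds (exit 0) if the engine can take a primitive cell of this size.
# Only the CRYSTAL23 demo binaries carry the 10-atom limit.
engine_accepts() {
    case $1 in
        crystal23demo|properties23demo) [ "$2" -le 10 ] ;;
        *) return 0 ;;
    esac
}

# Prints the engine to use: the requested one if it accepts the cell,
# otherwise the serial CRYSTAL14 counterpart.
resolve_engine() {
    if engine_accepts "$1" "$2"; then
        echo "$1"
    else
        case $1 in
            crystal23demo)    echo crystal ;;
            properties23demo) echo properties ;;
        esac
    fi
}

resolve_engine crystal23demo 8     # small cell, the demo binary is acceptable
resolve_engine crystal23demo 24    # too large, falls back to crystal
```

For a parallel run on a large cell, substitute Pcrystal for the fallback by hand; the sketch only covers the serial names.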

How to submit

# single input file as the workspace:
vq submit planetx --wall-time-seconds 7200 --cpus 8 input.inp -- ~/bin/orca_6_1_1_linux_x86-64_shared_openmpi418_nodmrg/orca input.inp

# whole directory as the workspace (recommended for CRYSTAL):
vq submit mars -d ./job_dir --wall-time-seconds 3600 -- bash run.sh

# a tarball, extracted on the daemon host:
vq submit planetx -c job.tar.bz2 --wall-time-seconds 7200 -- bash run.sh

Key flags:

--wall-time-seconds N — required. The watchdog SIGTERMs at the limit, then SIGKILLs after a grace period; the job lands in TIME_EXCEEDED.
--cpus N — CPU slots claimed against the host budget.
--mem-mb N — memory budget (cgroup-enforced).
--tag LABEL — repeatable; filter the test set later with vq queue --tag LABEL.
--priority N — higher dispatches first (does not preempt running jobs, only reorders PENDING).
--retry N — re-enqueue on plain non-zero exit, with exponential backoff. Watchdog kills are NOT retried.
--auto-resume — if the host reboots mid-run, the daemon resubmits (same command + workspace). Your job must restart from on-disk state: CRYSTAL fort.20 (GUESSP), ORCA .gbw.
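As an illustration of how these flags combine, the sketch below prints a submit command for one tagged, retried job. The function name, job directory, tag names, and numeric values are placeholders, and it is an assumption that vq accepts all of these flags together in a single invocation; check vq's own help before relying on it.

```shell
# Prints (does not run) a vq submit line that combines the flags above.
# Arguments: host, job directory, wall seconds, CPUs, memory in MB, tag.
build_submit() {
    echo "vq submit $1 -d $2" \
         "--wall-time-seconds $3 --cpus $4 --mem-mb $5" \
         "--tag $6 --tag crystal" \
         "--priority 5 --retry 2" \
         "-- bash run.sh"
}

build_submit planetx ./mgo_bulk 14400 32 64000 testset-a

# Later, the same tag narrows the queue listing to this set:
echo "vq queue planetx --tag testset-a"
```

Keeping one tag per test set and a second tag per engine makes it easy to filter either way afterwards.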

ORCA specifics

Always invoke ORCA with its full absolute path:

~/bin/orca_6_1_1_linux_x86-64_shared_openmpi418_nodmrg/orca input.inp

ORCA uses argv[0] to locate its own MPI sub-executables — a bare orca breaks parallel runs. The binary also lives in a subdirectory of ~/bin/, so it is not on the daemon’s bare PATH regardless. The full path is non-negotiable for ORCA.

For parallel ORCA, set %pal nprocs N end in the .inp and match --cpus N on submit so the daemon’s budget accounting is correct.
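One way to keep --cpus and nprocs in step is to read the value out of the input file when building the submit line. The sketch below does that in dry-run form; `orca_nprocs` and `orca_submit_cmd` are hypothetical helpers, and the parsing is deliberately simple (it takes the first "nprocs N" it finds and does not skip comment lines).

```shell
# Path copied from the line above. It must resolve on the fleet host, so
# replace "~" with the host's absolute home if your laptop's home differs.
ORCA_BIN='~/bin/orca_6_1_1_linux_x86-64_shared_openmpi418_nodmrg/orca'

# Prints N from the first "nprocs N" in an ORCA input, or 1 if none is set.
orca_nprocs() {
    n=$(sed -n 's/.*[Nn][Pp][Rr][Oo][Cc][Ss][[:space:]]\{1,\}\([0-9]\{1,\}\).*/\1/p' "$1" | head -n 1)
    echo "${n:-1}"
}

# Prints (does not run) a submit line whose --cpus matches the input file.
# Arguments: host, input file, wall seconds.
orca_submit_cmd() {
    host=$1 inp=$2 wall=$3
    echo "vq submit $host --wall-time-seconds $wall --cpus $(orca_nprocs "$inp") $inp -- $ORCA_BIN $inp"
}
```

For example, `orca_submit_cmd planetx input.inp 7200` prints a command for review; an input without a %pal block yields --cpus 1.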

CRYSTAL specifics

Bare crystal / Pcrystal / crystal23demo resolve via the daemon PATH — ~/bin/ was added to the vq-daemon environment on 2026-05-20, so no absolute path is needed for CRYSTAL.

CRYSTAL reads its deck from a file named INPUT in the working directory (or from stdin) and writes fort.* outputs into the cwd. Submit the deck as a directory (-d) so the workspace is self-contained and the fort.* files land together. If a wrapper script reads stdin in a loop, guard the CRYSTAL invocation with < /dev/null so it doesn’t consume the loop’s input.
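The directory layout described above can be staged with a few lines of shell. This is a sketch under assumptions: the helper names, the JOB_ROOT variable, the one-deck-per-directory layout, and the one-line run.sh are illustrative choices, and the printed submit line uses placeholder host and wall-time values.

```shell
# Creates <root>/<deck name without extension>/ containing the deck as INPUT
# plus a minimal run.sh, and prints the directory. JOB_ROOT defaults to ./jobs.
stage_deck() {
    deck=$1
    name=$(basename "$deck")
    job="${JOB_ROOT:-jobs}/${name%.*}"
    mkdir -p "$job" || return 1
    cp "$deck" "$job/INPUT" || return 1
    # run.sh runs on the host inside the workspace; stdin is detached so
    # CRYSTAL cannot swallow input meant for an enclosing loop.
    printf '%s\n' '#!/bin/bash' 'crystal < /dev/null' > "$job/run.sh"
    echo "$job"
}

# Stages every deck named on the command line and prints a submit line for each.
plan_crystal_jobs() {
    for deck in "$@"; do
        job=$(stage_deck "$deck") || continue
        echo "vq submit mars -d $job --wall-time-seconds 3600 -- bash run.sh"
    done
}
```

Calling `plan_crystal_jobs decks/*.d12` (with a hypothetical decks/ folder) leaves one self-contained directory per deck, so each job's fort.* files stay separate.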

Monitoring

vq summary — fleet overview: per host, running / queued / idle time / daemon health / memory pressure / versions. (vq overview is the same command.)
vq queue <host> — job list; add --tag LABEL to filter your test set.
vq status <jobid> — one job’s detail.
vq tail <jobid> — stream a job’s output live.
vq wait <jobid> — block until the job finishes; or submit with vq submit --wait for synchronous one-offs.

Job end states you’ll see: COMPLETED, FAILED (non-zero exit), TIME_EXCEEDED (hit --wall-time-seconds), OOM_KILLED, STARVED (watchdog saw no CPU activity), KILLED (vq kill).
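When sweeping a finished test set, it helps to map each end state to a next step. The function below is a hypothetical triage table, not vq behaviour: the suggested actions are one reading of the states listed above, and the state string is passed in by the caller because the output format of vq status is not described in this handover.

```shell
# Maps a vq end state to a suggested follow-up (hypothetical action labels).
triage() {
    case $1 in
        COMPLETED)     echo "collect-results" ;;
        FAILED)        echo "inspect-output" ;;        # non-zero exit
        TIME_EXCEEDED) echo "raise-wall-time" ;;       # hit --wall-time-seconds
        OOM_KILLED)    echo "raise-mem-mb" ;;
        STARVED)       echo "check-for-hang" ;;        # no CPU activity seen
        KILLED)        echo "killed-on-request" ;;     # vq kill
        *)             echo "not-an-end-state" ;;
    esac
}

for state in COMPLETED TIME_EXCEEDED OOM_KILLED; do
    echo "$state -> $(triage "$state")"
done
```

Since watchdog kills are not retried by --retry, the TIME_EXCEEDED and STARVED branches are the ones that need a manual resubmit.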

Gotchas

planetx runs one job at a time. A long job blocks the rest of planetx’s queue. Size jobs accordingly or split the set across both hosts.
Don’t run vq admin update while your test set runs: it pauses the queue. If another chat updates the fleet, your jobs pause and resume cleanly, but plan around it.
vibeqc-dev venv on the fleet is stale (separate handover: docs/handover_vibeqc_dev_fleet_state_2026_05_20.md). Irrelevant if you only run ORCA / CRYSTAL binaries — but do not rely on --branch main vibe-qc jobs until the dev chat rebuilds that venv.
Watchdog host-pressure auto-pause (v0.6.20): if a host crosses ~85 % memory used, the watchdog SIGSTOPs running jobs until pressure drops. Long CRYSTAL/ORCA jobs that balloon RAM can trip this — size --mem-mb realistically and prefer mars for memory-light work, planetx for the heavy ones (it has 125 GB).
Wall time is mandatory. A job with no --wall-time-seconds is rejected at submit. Estimate generously; a TIME_EXCEEDED kill loses the run.
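To make the ~85 % pause threshold concrete, the arithmetic below turns the RAM figures from the machine table into a per-job --mem-mb ceiling. It is a rough sketch: it treats one GB as 1024 MB, ignores memory used by the OS and the mars desktop, and assumes concurrent jobs share the headroom equally. The function names are hypothetical.

```shell
# Used-memory level (whole GB, rounded down) at which the auto-pause would trigger.
# Argument: host RAM in GB.
pause_threshold_gb() {
    echo $(( $1 * 85 / 100 ))
}

# Largest --mem-mb per job that keeps N concurrent jobs under the threshold.
# Arguments: host RAM in GB, number of concurrent jobs.
max_mem_mb_per_job() {
    echo $(( $1 * 1024 * 85 / 100 / $2 ))
}

echo "planetx pauses near $(pause_threshold_gb 125) GB; cap one job at $(max_mem_mb_per_job 125 1) MB"
echo "mars pauses near $(pause_threshold_gb 62) GB; cap each of two jobs at $(max_mem_mb_per_job 62 2) MB"
```

Treat the results as upper bounds and leave extra margin, since other processes on the host also count toward memory pressure.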

Fleet status as of this handover (2026-05-20)

vq 0.6.24 on both hosts; both daemons verdict: OK.
planetx: rebooted earlier today, daemon auto-started under systemd, idle.
mars: systemd --user manager was restored after an incident earlier today; daemon healthy and supervised.
All engine binaries (crystal, Pcrystal, crystal23demo, orca, …) report OK via vq programs on both hosts.