Handover: running ORCA & CRYSTAL test sets on the vq fleet (2026-05-20)
From: queue-system dev chat (vq).
To: a test-set chat running ORCA + CRYSTAL jobs.
Scope: how to dispatch a test set across planetx and
mars via vq — what the machines can do, how to submit, how
to monitor, and the gotchas.
What vq is
vq is the cross-machine job queue. You submit from your
laptop with vq submit <host> …; a daemon on each host
dispatches the work into isolated systemd-run scopes.
Both hosts currently run vq 0.6.24; daemons healthy.
You do not SSH into the hosts to run jobs — you vq submit
and the daemon handles dispatch, monitoring, and bookkeeping.
The two machines
| Host | Cores | RAM | Concurrency | Best for |
|---|---|---|---|---|
| planetx | 32 | 125 GB | 1-wide | big parallel CRYSTAL / ORCA runs |
| mars | 16 | 62 GB | 2-wide | smaller jobs, 2-wide throughput |
Strategy for a test set: fan small / serial jobs to mars (runs 2-wide), send the big parallel ones to planetx (1-wide, full machine). Submit everything up front — vq queues it and drains in priority-then-FIFO order.
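The routing rule above can be sketched as a small shell helper. The function name `pick_host` and the 8-core cutoff are assumptions for illustration (8 is simply mars's 16 cores split across its 2 concurrent jobs); neither is part of vq, so tune the cutoff to your test set.

```shell
# Hypothetical helper: choose a host from the number of cores a job needs.
# The cutoff of 8 is an assumption; the handover only says small / serial
# jobs go to mars and big parallel ones go to planetx.
pick_host() {
    cpus=$1
    if [ "$cpus" -le 8 ]; then
        echo mars
    else
        echo planetx
    fi
}

pick_host 1     # prints: mars
pick_host 32    # prints: planetx
```

The returned name can be used as the `<host>` argument of `vq submit`.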
Engines available
The same engine binaries are installed on both hosts, apart from the
planetx-only entry noted below (verify live with
vq programs planetx / vq programs mars):
| | Engine |
|---|---|
| `crystal` / `Pcrystal` | CRYSTAL14 serial / parallel (OpenMPI) |
| | CRYSTAL14 properties serial / parallel |
| `crystal23demo` | CRYSTAL23 demo: serial only, 10-atom-per-primitive-cell limit |
| `orca` | ORCA 6.1.1 (shared OpenMPI 4.1.8 build) |
| `psi4` | planetx only (mars has no psi4) |
Cells larger than 10 atoms must use crystal / Pcrystal
(CRYSTAL14) — the CRYSTAL23 demo binary refuses them.
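A pre-submit guard for the 10-atom limit might look like the sketch below. The function name `engine_ok` is hypothetical, and the atom count is something you supply yourself; nothing here parses a CRYSTAL deck.

```shell
# Hypothetical pre-submit check: reject crystal23demo for cells above the
# demo's 10-atom-per-primitive-cell limit. Other engines pass unchecked.
engine_ok() {
    engine=$1
    natoms=$2
    if [ "$engine" = "crystal23demo" ] && [ "$natoms" -gt 10 ]; then
        echo "crystal23demo refuses $natoms atoms; use crystal or Pcrystal" >&2
        return 1
    fi
    return 0
}

engine_ok crystal23demo 8 && echo "ok for the demo binary"
engine_ok crystal23demo 12 || echo "switch to CRYSTAL14"
```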
How to submit
# single input file as the workspace:
vq submit planetx --wall-time-seconds 7200 --cpus 8 input.inp -- orca input.inp
# whole directory as the workspace (recommended for CRYSTAL):
vq submit mars -d ./job_dir --wall-time-seconds 3600 -- bash run.sh
# a tarball, extracted on the daemon host:
vq submit planetx -c job.tar.bz2 --wall-time-seconds 7200 -- bash run.sh
Key flags:
- `--wall-time-seconds N`: required. The watchdog SIGTERMs at the limit, then SIGKILLs after a grace period; the job lands in `TIME_EXCEEDED`.
- `--cpus N`: CPU slots claimed against the host budget.
- `--mem-mb N`: memory budget (cgroup-enforced).
- `--tag LABEL`: repeatable; filter the test set later with `vq queue --tag LABEL`.
- `--priority N`: higher dispatches first (does not preempt running jobs, only reorders PENDING).
- `--retry N`: re-enqueue on plain non-zero exit, with exponential backoff. Watchdog kills are NOT retried.
- `--auto-resume`: if the host reboots mid-run, the daemon resubmits (same command + workspace). Your job must restart from on-disk state: CRYSTAL `fort.20` (GUESSP), ORCA `.gbw`.
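As a rough illustration of combining these flags across a test set, the sketch below only prints the submit commands (a dry run). The tag, the resource numbers, and the directory names are placeholders, and it assumes the flags can be combined in one call in this order; drop the `echo` to submit for real.

```shell
# Dry-run sketch: print one vq submit command per job directory.
# TAG, the resource numbers, and the directory names are placeholders.
TAG=testset-a

print_submit() {
    echo vq submit mars -d "$1" \
        --wall-time-seconds 3600 --cpus 4 --mem-mb 8000 \
        --tag "$TAG" --priority 5 --retry 1 \
        -- bash run.sh
}

for dir in ./jobs/mgo ./jobs/nacl; do
    print_submit "$dir"
done
```

Tagging every job with the same label is what makes the later `vq queue --tag LABEL` filter useful.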
ORCA specifics
Always invoke ORCA with its full absolute path:
~/bin/orca_6_1_1_linux_x86-64_shared_openmpi418_nodmrg/orca input.inp
ORCA uses argv[0] to locate its own MPI sub-executables — a
bare orca breaks parallel runs. The binary also lives in a
subdirectory of ~/bin/, so it is not on the daemon’s bare
PATH regardless. The full path is non-negotiable for ORCA.
For parallel ORCA, set %pal nprocs N end in the .inp and
match --cpus N on submit so the daemon’s budget accounting
is correct.
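One way to keep `--cpus` in step with the input is to read the process count out of the `.inp` file. This is a minimal sketch: `orca_nprocs` is a hypothetical helper, it falls back to 1 when no `nprocs` line is found, and it only looks at the first match.

```shell
# Hypothetical helper: read N from "%pal nprocs N end" (single-line or
# multi-line form) so the same number can be passed to --cpus.
# Prints 1 if the input has no nprocs setting.
orca_nprocs() {
    n=$(sed -n 's/.*[Nn][Pp][Rr][Oo][Cc][Ss][[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    echo "${n:-1}"
}

# Demo on a throwaway file.
tmp=$(mktemp)
printf '%s\n' '! B3LYP def2-SVP' '%pal nprocs 8 end' > "$tmp"
echo "cpus=$(orca_nprocs "$tmp")"    # prints: cpus=8
rm -f "$tmp"
```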
CRYSTAL specifics
Bare crystal / Pcrystal / crystal23demo resolve via the
daemon PATH — ~/bin/ was added to the vq-daemon environment
on 2026-05-20, so no absolute path is needed for CRYSTAL.
CRYSTAL reads its deck from a file named INPUT in the
working directory (or from stdin) and writes fort.* outputs
into the cwd. Submit the deck as a directory (-d) so the
workspace is self-contained and the fort.* files land
together. If a wrapper script reads stdin in a loop, guard
the CRYSTAL invocation with < /dev/null so it doesn’t
consume the loop’s input.
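The stdin guard can be sketched as follows. `CRYSTAL_CMD` and the output name `crystal.out` are hypothetical choices made here so the loop can be tried without CRYSTAL installed, and the sketch assumes, as described above, that CRYSTAL picks up `INPUT` from the cwd when stdin is empty.

```shell
# Sketch of a wrapper that runs one CRYSTAL job per directory named on stdin.
# CRYSTAL_CMD is a hypothetical override; on the fleet it would be crystal
# or Pcrystal.
CRYSTAL_CMD=${CRYSTAL_CMD:-crystal}

run_decks() {
    while read -r dir; do
        # Without "< /dev/null" the program inherits the loop's stdin and
        # can swallow the remaining directory names.
        ( cd "$dir" && $CRYSTAL_CMD < /dev/null > crystal.out 2>&1 )
        echo "finished $dir"
    done
}
```

A call such as `printf '%s\n' mgo nacl | run_decks` would then visit both directories in turn.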
Monitoring
- `vq summary`: fleet overview per host (running / queued / idle time / daemon health / memory pressure / versions). `vq overview` is the same command.
- `vq queue <host>`: job list; add `--tag LABEL` to filter your test set.
- `vq status <jobid>`: one job’s detail.
- `vq tail <jobid>`: stream a job’s output live.
- `vq wait <jobid>`: block until the job finishes; or submit with `vq submit --wait` for synchronous one-offs.
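A quick check on a tagged test set might chain these commands as below. The `VQ` variable and the function name `watch_testset` are hypothetical; `VQ` exists only so the sketch can be dry-run with `VQ=echo`.

```shell
# Hypothetical wrapper: fleet overview, then the tagged queue on each host.
# Set VQ=echo to print the commands instead of running them.
VQ=${VQ:-vq}

watch_testset() {
    tag=$1
    $VQ summary
    for host in planetx mars; do
        $VQ queue "$host" --tag "$tag"
    done
}
```

Calling `watch_testset testset-a` runs `vq summary` once and `vq queue <host> --tag testset-a` for each host.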
Job end states you’ll see: COMPLETED, FAILED (non-zero
exit), TIME_EXCEEDED (hit --wall-time-seconds),
OOM_KILLED, STARVED (watchdog saw no CPU activity),
KILLED (vq kill).
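For triage after a test set drains, the end states can be mapped to a next step. The state names come from this handover; the suggested actions and the function name `next_step` are assumptions, not vq behaviour.

```shell
# Hypothetical triage table: map a vq end state to a suggested next step.
next_step() {
    case $1 in
        COMPLETED)     echo "collect outputs" ;;
        FAILED)        echo "read the job output; --retry covers this case" ;;
        TIME_EXCEEDED) echo "raise --wall-time-seconds and resubmit" ;;
        OOM_KILLED)    echo "raise --mem-mb or move the job to planetx" ;;
        STARVED)       echo "check for a hung process or blocked input" ;;
        KILLED)        echo "stopped by vq kill; resubmit if unintended" ;;
        *)             echo "unknown state: $1" ;;
    esac
}

next_step TIME_EXCEEDED    # prints: raise --wall-time-seconds and resubmit
```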
Gotchas
- planetx runs one job at a time. A long job blocks the rest of planetx’s queue. Size jobs accordingly or split the set across both hosts.
- Don’t run `vq admin update` while your test set runs; it pauses the queue. If another chat updates the fleet your jobs pause and resume cleanly, but plan around it.
- The `vibeqc-dev` venv on the fleet is stale (separate handover: `docs/handover_vibeqc_dev_fleet_state_2026_05_20.md`). Irrelevant if you only run ORCA / CRYSTAL binaries, but do not rely on `--branch main` vibe-qc jobs until the dev chat rebuilds that venv.
- Watchdog host-pressure auto-pause (v0.6.20): if a host crosses ~85 % memory used, the watchdog SIGSTOPs running jobs until pressure drops. Long CRYSTAL/ORCA jobs that balloon RAM can trip this; size `--mem-mb` realistically and prefer mars for memory-light work, planetx for the heavy ones (it has 125 GB).
- Wall time is mandatory. A job with no `--wall-time-seconds` is rejected at submit. Estimate generously; a `TIME_EXCEEDED` kill loses the run.
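The ~85 % pressure threshold translates into a rough memory ceiling per host. The arithmetic below is a sketch that assumes 1 GB = 1024 MB and that job budgets are the only significant memory users, which will not hold exactly on a live host; the function names are hypothetical.

```shell
# Rough ceiling in MB before the host-pressure auto-pause (~85 % of RAM).
pressure_limit_mb() {
    echo $(( $1 * 1024 * 85 / 100 ))
}

# Succeeds if <width> concurrent jobs of <per_job_mb> each stay under it.
under_pressure_limit() {
    host_gb=$1
    per_job_mb=$2
    width=$3
    [ $(( per_job_mb * width )) -le "$(pressure_limit_mb "$host_gb")" ]
}

pressure_limit_mb 62     # mars: prints 53964
pressure_limit_mb 125    # planetx: prints 108800
```

On mars, two concurrent jobs at `--mem-mb 24000` each total 48000 MB and stay under the ceiling, while two at 30000 each would not.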
Fleet status as of this handover (2026-05-20)
- `vq` 0.6.24 on both hosts; both daemons report `verdict: OK`.
- planetx: rebooted earlier today, daemon auto-started under systemd, idle.
- mars: `systemd --user` manager was restored after an incident earlier today; daemon healthy and supervised.
- All engine binaries (`crystal`, `Pcrystal`, `crystal23demo`, `orca`, …) report `OK` via `vq programs` on both hosts.