Handover — running ORCA & CRYSTAL test sets on the vq fleet (2026-05-20)

From: queue-system dev chat (vq). To: a test-set chat running ORCA + CRYSTAL jobs. Scope: how to dispatch a test set across planetx and mars via vq — what the machines can do, how to submit, how to monitor, and the gotchas.


What vq is

vq is the cross-machine job queue. You submit from your laptop with vq submit <host>; a daemon on each host dispatches the work into isolated systemd-run scopes. Both hosts currently run vq 0.6.24, and both daemons are healthy.

You do not SSH into the hosts to run jobs — you vq submit and the daemon handles dispatch, monitoring, and bookkeeping.


The two machines

Host | Cores | RAM | Concurrency | Best for
planetx | 32 | 125 GB | --max-jobs 1: one job at a time, but it gets the whole box | big parallel CRYSTAL / ORCA runs
mars | 16 | 62 GB | --max-jobs 2, --max-cpus 12: 2 jobs concurrently, 12 CPU slots, 4 threads reserved for the desktop | smaller jobs, 2-wide throughput

Strategy for a test set: fan small / serial jobs to mars (runs 2-wide), send the big parallel ones to planetx (1-wide, full machine). Submit everything up front — vq queues it and drains in priority-then-FIFO order.
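
The routing rule above can be sketched as a small shell helper. The function name and the cut-off are assumptions, not vq settings: 6 CPUs is used here only because two such jobs fit inside the 12 CPU slots on mars.

```shell
# Hypothetical helper: choose a host from the CPU count a job will request.
# MARS_MAX_CPUS is an assumed cut-off, not something vq defines.
MARS_MAX_CPUS="${MARS_MAX_CPUS:-6}"

pick_host() {
    cpus="$1"
    if [ "$cpus" -le "$MARS_MAX_CPUS" ]; then
        echo mars       # small or serial: runs 2-wide
    else
        echo planetx    # big parallel: gets the whole box
    fi
}
```

The result can be passed straight to the submit command, for example vq submit "$(pick_host 4)" -d ./job_dir --wall-time-seconds 3600 --cpus 4 -- bash run.sh.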


Engines available

The engine binaries are identical on both hosts, except psi4, which is installed on planetx only (verify live with vq programs planetx / vq programs mars):

vq programs name | Engine
crystal / Pcrystal | CRYSTAL14 serial / parallel (OpenMPI)
properties / Pproperties | CRYSTAL14 properties serial / parallel
crystal23demo / properties23demo | CRYSTAL23 demo: serial only, 10-atom-per-primitive-cell limit
orca | ORCA 6.1.1 (shared OpenMPI 4.1.8 build)
psi4 | planetx only (mars has no psi4)

Cells larger than 10 atoms must use crystal / Pcrystal (CRYSTAL14) — the CRYSTAL23 demo binary refuses them.
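
A pre-submit guard for the 10-atom limit might look like the sketch below. The function name is hypothetical, and the atom count per primitive cell is supplied by the caller; nothing here parses a CRYSTAL deck.

```shell
# Hypothetical pre-submit check: refuse the CRYSTAL23 demo engines for
# cells above the 10-atom limit. Returns 0 if the pairing is allowed.
engine_ok() {
    engine="$1"
    natoms="$2"
    case "$engine" in
        crystal23demo|properties23demo)
            [ "$natoms" -le 10 ]   # demo binaries: at most 10 atoms
            ;;
        *)
            return 0               # CRYSTAL14 engines: no limit checked here
            ;;
    esac
}
```

For example, engine_ok crystal23demo 12 fails, which is the cue to switch that job to crystal or Pcrystal.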


How to submit

# single input file as the workspace:
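# (ORCA needs its full path; see "ORCA specifics" below)
vq submit planetx --wall-time-seconds 7200 --cpus 8 input.inp -- ~/bin/orca_6_1_1_linux_x86-64_shared_openmpi418_nodmrg/orca input.inp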

# whole directory as the workspace (recommended for CRYSTAL):
vq submit mars -d ./job_dir --wall-time-seconds 3600 -- bash run.sh

# a tarball, extracted on the daemon host:
vq submit planetx -c job.tar.bz2 --wall-time-seconds 7200 -- bash run.sh

Key flags:

  • --wall-time-seconds N: required. The watchdog SIGTERMs at the limit, then SIGKILLs after a grace period; the job lands in TIME_EXCEEDED.

  • --cpus N — CPU slots claimed against the host budget.

  • --mem-mb N — memory budget (cgroup-enforced).

  • --tag LABEL — repeatable; filter the test set later with vq queue --tag LABEL.

  • --priority N — higher dispatches first (does not preempt running jobs, only reorders PENDING).

  • --retry N — re-enqueue on plain non-zero exit, with exponential backoff. Watchdog kills are NOT retried.

  • --auto-resume — if the host reboots mid-run, the daemon resubmits (same command + workspace). Your job must restart from on-disk state: CRYSTAL fort.20 (GUESSP), ORCA .gbw.
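
The flags above can be combined when fanning a whole test set out to one host. The following is a sketch only: the function name, the numeric values, and the assumption that these flags can all be given together in this order are illustrative. It runs dry by default and prints the commands rather than submitting them.

```shell
# Dry run by default: VQ="echo vq" prints each command instead of running it.
# Set VQ=vq in the environment to submit for real.
VQ="${VQ:-echo vq}"

# Hypothetical helper: submit_set HOST TAG DIR...
# Submits each job directory with the same budget and a shared tag.
submit_set() {
    host="$1"; tag="$2"; shift 2
    for dir in "$@"; do
        # 3600 / 4 / 8000 / 2 are placeholder values, not recommendations.
        $VQ submit "$host" -d "$dir" \
            --wall-time-seconds 3600 --cpus 4 --mem-mb 8000 \
            --tag "$tag" --retry 2 --auto-resume \
            -- bash run.sh
    done
}
```

Calling submit_set mars ts1 ./job_a ./job_b prints two vq submit lines; the set can later be filtered with vq queue --tag ts1. With --auto-resume, run.sh has to restart from on-disk state as described above.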

ORCA specifics

Always invoke ORCA with its full absolute path:

~/bin/orca_6_1_1_linux_x86-64_shared_openmpi418_nodmrg/orca input.inp

ORCA uses argv[0] to locate its own MPI sub-executables — a bare orca breaks parallel runs. The binary also lives in a subdirectory of ~/bin/, so it is not on the daemon’s bare PATH regardless. The full path is non-negotiable for ORCA.

For parallel ORCA, set %pal nprocs N end in the .inp and match --cpus N on submit so the daemon’s budget accounting is correct.
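
One way to keep --cpus and %pal in step is to read the count out of the input file instead of typing it twice. This is a minimal sketch with a hypothetical function name; it only handles the %pal nprocs form named above and does not skip commented-out lines.

```shell
# Hypothetical helper: print N from "%pal nprocs N end" (case-insensitive).
# Prints 1 when the input has no nprocs setting, i.e. a serial run.
orca_nprocs() {
    n=$(grep -i -o 'nprocs[[:space:]]\{1,\}[0-9]\{1,\}' "$1" \
        | head -n 1 \
        | grep -o '[0-9]\{1,\}$' || true)
    echo "${n:-1}"
}
```

The value can then feed the submit line, for example --cpus "$(orca_nprocs input.inp)".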

CRYSTAL specifics

Bare crystal / Pcrystal / crystal23demo resolve via the daemon PATH — ~/bin/ was added to the vq-daemon environment on 2026-05-20, so no absolute path is needed for CRYSTAL.

CRYSTAL reads its deck from a file named INPUT in the working directory (or from stdin) and writes fort.* outputs into the cwd. Submit the deck as a directory (-d) so the workspace is self-contained and the fort.* files land together. If a wrapper script reads stdin in a loop, guard the CRYSTAL invocation with < /dev/null so it doesn’t consume the loop’s input.
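
The stdin guard matters in wrapper scripts shaped like the sketch below, where a loop reads job directories from stdin. The function name, the CRYSTAL_CMD variable, and the OUTPUT file name are assumptions for illustration; each directory is expected to contain its own INPUT file.

```shell
# Engine to run; override to Pcrystal or crystal23demo as needed.
CRYSTAL_CMD="${CRYSTAL_CMD:-crystal}"

# Hypothetical wrapper: reads one job directory per line from stdin.
run_decks() {
    while IFS= read -r dir; do
        # CRYSTAL picks up INPUT from the cwd. The </dev/null redirect
        # stops it from swallowing the remaining lines of the loop's stdin.
        if ( cd "$dir" && $CRYSTAL_CMD < /dev/null > OUTPUT 2>&1 ); then
            echo "finished $dir"
        else
            echo "failed $dir"
        fi
    done
}
```

Without the redirect, the first invocation can consume the rest of the directory list and the loop ends after one job.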


Monitoring

  • vq summary — fleet overview: per host, running / queued / idle time / daemon health / memory pressure / versions. (vq overview is the same command.)

  • vq queue <host> — job list; add --tag LABEL to filter your test set.

  • vq status <jobid> — one job’s detail.

  • vq tail <jobid> — stream a job’s output live.

  • vq wait <jobid> — block until the job finishes; or submit with vq submit --wait for synchronous one-offs.

Job end states you’ll see: COMPLETED, FAILED (non-zero exit), TIME_EXCEEDED (hit --wall-time-seconds), OOM_KILLED, STARVED (watchdog saw no CPU activity), KILLED (vq kill).
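
When triaging a finished test set, the end states can be mapped to a next step. The state names below come from the list above; the function name and the hints are suggestions, and how the state string is extracted from vq status output is not covered here.

```shell
# Hypothetical helper: print a short triage hint for a vq end state.
# Returns 1 for a state that is not in the list above.
state_hint() {
    case "$1" in
        COMPLETED)     echo "done" ;;
        FAILED)        echo "non-zero exit: check the output; --retry covers this case" ;;
        TIME_EXCEEDED) echo "raise --wall-time-seconds and resubmit" ;;
        OOM_KILLED)    echo "raise --mem-mb or move the job to planetx" ;;
        STARVED)       echo "no CPU activity seen: check for a hung process" ;;
        KILLED)        echo "stopped by vq kill" ;;
        *)             echo "unknown state: $1"; return 1 ;;
    esac
}
```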


Gotchas

  • planetx runs one job at a time. A long job blocks the rest of planetx’s queue. Size jobs accordingly or split the set across both hosts.

  • Don’t run vq admin update while your test set runs: it pauses the queue. If another chat updates the fleet, your jobs pause and resume cleanly, but plan around it.

  • vibeqc-dev venv on the fleet is stale (separate handover: docs/handover_vibeqc_dev_fleet_state_2026_05_20.md). Irrelevant if you only run ORCA / CRYSTAL binaries — but do not rely on --branch main vibe-qc jobs until the dev chat rebuilds that venv.

  • Watchdog host-pressure auto-pause (v0.6.20): if a host crosses ~85 % memory used, the watchdog SIGSTOPs running jobs until pressure drops. Long CRYSTAL/ORCA jobs that balloon RAM can trip this — size --mem-mb realistically and prefer mars for memory-light work, planetx for the heavy ones (it has 125 GB).

  • Wall time is mandatory. A job with no --wall-time-seconds is rejected at submit. Estimate generously; a TIME_EXCEEDED kill loses the run.
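
For the wall-time estimate, a padded value can be computed from a rough runtime guess. This is a sketch with a hypothetical function name; the 50 % margin is an assumption, not a vq default.

```shell
# Hypothetical helper: convert an estimated runtime in minutes into a
# padded value for --wall-time-seconds (estimate x 1.5, integer arithmetic).
wall_seconds() {
    minutes="$1"
    echo $(( minutes * 60 * 3 / 2 ))
}
```

An 80-minute estimate gives wall_seconds 80, which prints 7200.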


Fleet status as of this handover (2026-05-20)

  • vq 0.6.24 on both hosts; both daemons verdict: OK.

  • planetx: rebooted earlier today, daemon auto-started under systemd, idle.

  • mars: systemd --user manager was restored after an incident earlier today; daemon healthy and supervised.

  • All engine binaries (crystal, Pcrystal, crystal23demo, orca, …) report OK via vq programs on both hosts.