The vq calculation queue

vq is vibe-qc’s calculation queue: a small SSH-backed job-submission tool that lets you run vibe-qc (and CRYSTAL / ORCA) calculations on a remote compute box without writing shell glue. Configure it once, then run vq submit my_calc.py from your laptop, and the job is queued, dispatched, resource-capped, and watched on the remote host. Outputs come back the same way.

vq is co-shipped with vibe-qc in the vibe-queue/ subpackage but is independently versioned (at the time of writing it’s vq, version 0.5.25). It’s engine-agnostic: vibe-qc is the primary workload, but anything you can call from a shell — CRYSTAL14, ORCA 6.1, PySCF scripts — submits the same way through contrib/ wrappers.

When to use vq

  • Laptop runs out of cores or RAM. Your MacBook has 16 GB and 10 cores; the remote has 128 GB and 32 cores. Queue the big runs; keep the laptop for development.

  • You want a record of what you ran. Every submission is a JobSpec stored on the daemon, with a unique short-hash id, full command, environment, resource caps, terminal state, and outputs.

  • You’re running many jobs. vq dispatches one at a time (by default; see § Concurrency below) and records every one, so you don’t lose track when a sweep takes hours.

  • You want resource enforcement. cgroup-v2 caps mean a runaway job doesn’t bring down the box.

When NOT to use vq

  • Tiny molecules on the laptop. .venv/bin/python h2o.py runs in 3 s; the queue + ssh round-trip adds latency for zero gain.

  • Truly interactive sessions. vq is batch-shaped; use ssh directly instead.

  • HPC cluster job arrays. vq targets a single-node host. SLURM is the right tool for cluster scheduling; a SLURM backend for vq is on the v1.0 roadmap but doesn’t ship yet.

Architecture

┌────────────────────┐         SSH        ┌──────────────────────────┐
│  Your laptop       │ ─────────────────→ │  Remote compute host     │
│                    │                    │                          │
│  vq CLI            │                    │  vq-daemon.service       │
│  ~/.config/vq/     │  vq submit         │  (systemd --user)        │
│   config.toml      │                    │                          │
│                    │ ←───── stdout ──── │  Queue (SQLite-backed)   │
│  ssh-key auth      │                    │   ↓                      │
│                    │                    │  systemd-run scope       │
│                    │                    │   (cgroup-v2 caps)       │
│                    │                    │   ↓                      │
│                    │                    │  your Python / ORCA /    │
│                    │                    │  CRYSTAL14 process       │
│                    │                    │                          │
│                    │  vq-web.service    │  Web UI (FastAPI+htmx)   │
│  browser ──────────┼───── port 8765 ───→│  :8765/queue,            │
│                    │  bearer token      │  /jobs/<id>              │
└────────────────────┘                    └──────────────────────────┘

Pieces that need to be running:

  • vq-daemon.service on the remote — accepts submissions (over SSH), maintains the queue, dispatches jobs into cgroup scopes, survives reboots via loginctl enable-linger.

  • vq-web.service on the remote — read-only-plus-write REST + HTML UI, port 8765 by default, bearer-token-protected.

  • vq CLI on the laptop — wraps ssh remote vq so the laptop never deals with the queue state directly.

Installation

Two sides: local (laptop) and remote (compute box). Both install the same package with pip; the remote additionally pulls in the [web] extra.

Local (laptop)

# Inside your vibe-qc checkout
cd vibe-queue
python3 -m venv .venv
.venv/bin/pip install -e .

# Put vq on PATH:
ln -s ~/path/to/vibe-queue/.venv/bin/vq ~/.local/bin/vq
# or in zshrc:
#   alias vq="$HOME/path/to/vibe-queue/.venv/bin/vq"

The local install needs only the CLI dependencies (no FastAPI / systemd). Test:

vq --version            # vq, version 0.5.25

Remote (compute box)

# 1. Install vq from a vibeqc-queue clone:
git clone https://gitlab.peintinger.com/mpei/vibeqc.git ~/vibeqc-queue
cd ~/vibeqc-queue/vibe-queue
python3 -m venv .venv
.venv/bin/pip install -e '.[web]'      # [web] pulls FastAPI + uvicorn

# 2. Install the systemd-user units:
mkdir -p ~/.config/systemd/user
cp contrib/vq-daemon.service ~/.config/systemd/user/
cp contrib/vq-web.service    ~/.config/systemd/user/

# Edit ExecStart in both unit files to use the venv's
# absolute vq path (e.g. /home/user/vibeqc-queue/vibe-queue/.venv/bin/vq).

# 3. Enable the daemon to start at boot (linger keeps the
# user instance alive without an active session):
sudo loginctl enable-linger $USER

systemctl --user daemon-reload
systemctl --user enable --now vq-daemon.service
systemctl --user enable --now vq-web.service

# 4. Verify:
systemctl --user status vq-daemon vq-web
journalctl --user -u vq-daemon -f       # live log tail

The bearer token for the web UI is generated on the first daemon start and written to ~/.config/vq/web-token (mode 0600) on the remote. Print it once and store it locally; you’ll need it to access the web UI from a browser. Re-generate by deleting the file and restarting the daemon.

Configuration

vq reads ~/.config/vq/config.toml on the laptop. The remote daemon needs no config file for basic use (the [programs.X] registry described under Daemon admin is a remote-side exception). Copy the template from the repository and edit:

mkdir -p ~/.config/vq      # the directory may not exist on a fresh laptop
cp vibe-queue/docs/config.toml.example ~/.config/vq/config.toml

A working minimal config:

# ~/.config/vq/config.toml on your laptop

# Default host when you omit it from `vq <subcommand> ...`.
# Match a [hosts.<name>] block below.
default_host = "compute"

[hosts.compute]
ssh = "compute"
# 'compute' must be an SSH alias defined in ~/.ssh/config,
# or a literal user@host.example.com. Test with:
#   ssh compute hostname

# Absolute path to vq on the remote. The remote shell's
# default PATH usually doesn't include the venv vq lives in.
remote_vq = "/home/USER/vibeqc-queue/vibe-queue/.venv/bin/vq"

# Default Python interpreter for single-file submits. Point
# at a venv where vibe-qc is installed.
remote_python = "/home/USER/vibeqc-dev/.venv/bin/python"

# Optional: multi-venv routing for --branch (v0.5.6+).
# Lets `vq submit foo.py --branch release` pick the right
# vibe-qc clone without hard-coding the path.
[hosts.compute.branches]
main    = "/home/USER/vibeqc-dev/.venv/bin/python"
release = "/home/USER/vibeqc-release/.venv/bin/python"

[hosts.compute.branch_aliases]
dev         = "main"
development = "main"
latest      = "release"

The full annotated example is at vibe-queue/docs/config.toml.example.

Multi-host

Add another [hosts.<name>] block:

[hosts.compute2]
ssh = "compute2"
remote_vq = "/home/USER/vibeqc-queue/vibe-queue/.venv/bin/vq"
remote_python = "/home/USER/vibeqc-dev/.venv/bin/python"

Then vq submit foo.py --host compute2 routes to that machine. Omit --host to use default_host.

Your first job

# A trivial vibe-qc water RHF script.
cat > water.py <<'EOF'
from vibeqc import Atom, Molecule, run_job
mol = Molecule([
    Atom(8, [0.0,  0.00,  0.00]),
    Atom(1, [0.0,  1.43, -0.98]),
    Atom(1, [0.0, -1.43, -0.98]),
])
run_job(mol, basis="sto-3g", method="rhf", output="water")
EOF

# Submit it.
vq submit water.py
# → printed to stdout: jobid (e.g. "c0ff50a06462") + a watch hint.

# Poll the queue:
vq list

# Wait for it (Ctrl-C exits the watcher; the job keeps running):
vq watch c0ff50a06462

# Once it finishes, pull the outputs back:
vq fetch c0ff50a06462 ./outputs/
# → ./outputs/water.out / .molden / .traj / stdout.log / stderr.log

That’s the entire core workflow.
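When the workflow is driven from a script rather than typed by hand, the job id has to be pulled out of the submit output. The sketch below assumes the id is a 12-character lowercase hex string, as in the example above; the exact stdout layout of vq submit is not specified here, so the sample text and the helper name extract_job_id are hypothetical.

```python
import re

# Assumption: job ids look like the documented example "c0ff50a06462"
# (12 lowercase hex characters). Adjust the pattern if vq's format differs.
JOB_ID = re.compile(r"\b[0-9a-f]{12}\b")


def extract_job_id(submit_stdout: str) -> str:
    """Return the first job-id-shaped token found in the submit output."""
    match = JOB_ID.search(submit_stdout)
    if match is None:
        raise ValueError("no job id found in vq submit output")
    return match.group(0)


# Hypothetical sample of what `vq submit water.py` might print.
sample = "c0ff50a06462\nwatch with: vq watch c0ff50a06462\n"
print(extract_job_id(sample))
```

In a real script the text could come from subprocess.run(["vq", "submit", "water.py"], capture_output=True, text=True).stdout.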

Submission forms

vq accepts three submission shapes:

Single file (most common)

vq submit my_script.py
# Equivalent to:
#   ssh <host> 'cd <remote-workspace> && <remote_python> my_script.py'

The laptop copies my_script.py into a fresh per-job workspace on the remote, runs it with the configured remote_python (or the --branch-resolved one), captures stdout / stderr, and tracks the result.

Directory submit (sweeps + multi-file inputs)

vq submit -d ./my_sweep_dir -- python run.py --basis def2-svp
# -d <path>                = the directory to copy across to the workspace
# --                       = end of vq flags
# python run.py …          = the literal command to run inside the workspace

Use this when:

  • Your script imports local modules (from helpers import ...).

  • You need multiple input files in the workspace (run.py reads geometry.xyz, basis_def.g94, etc.).

  • You want to encode the interpreter / engine in the command (e.g. running ORCA: -- orca input.inp).

Pre-packed tarball

vq submit -t my_inputs.tar.gz -- bash run.sh
# vq unpacks the tarball into the workspace before dispatching.

For reproducibility — the tarball + command + JobSpec are a complete reproducible-run unit.

Resource caps

Every job dispatched after v0.4.0 runs inside its own systemd-run user scope so cgroup-v2 memory + CPU caps apply. Wall-time enforcement is Python-watchdog-based (vq.watchdog); cgroup RuntimeMaxSec was tried in v0.4 → v0.5.7 and dropped in v0.5.8 as not runtime-mutable via systemctl --user set-property (see vibe-queue/docs/wall_time_design.md for the postmortem). The watchdog subtracts paused_seconds_total from elapsed, so wall-time is naturally pause-aware.

vq submit my_calc.py \
    --cpus 8 \
    --mem-mb 16000 \
    --wall-time-seconds 7200    # 2-hour cap (watchdog-enforced)
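The pause-aware wall-time rule can be written down as simple arithmetic. The sketch below is not the code in vq.watchdog; the JobClock dataclass and both function names are hypothetical, and it assumes paused_seconds_total already includes any pause that is still in progress.

```python
from dataclasses import dataclass


@dataclass
class JobClock:
    started_at: float            # epoch seconds when the job was dispatched
    paused_seconds_total: float  # accumulated time spent paused (SIGSTOP)
    wall_time_seconds: float     # the --wall-time-seconds cap


def effective_elapsed(clock: JobClock, now: float) -> float:
    """Wall time that counts against the cap: raw elapsed minus paused time."""
    return (now - clock.started_at) - clock.paused_seconds_total


def over_wall_time(clock: JobClock, now: float) -> bool:
    """True once the pause-adjusted elapsed time exceeds the cap."""
    return effective_elapsed(clock, now) > clock.wall_time_seconds


# A 2-hour cap with 30 minutes spent paused.
clock = JobClock(started_at=0.0, paused_seconds_total=1800.0, wall_time_seconds=7200.0)
print(over_wall_time(clock, now=8000.0))  # 6200 s effective, under the cap
print(over_wall_time(clock, now=9100.0))  # 7300 s effective, over the cap
```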

If the job exceeds any cap, the cgroup or the watchdog kills it cleanly and the queue records a labelled terminal state:

| Terminal state | Trigger | Owner | Recovery |
|---|---|---|---|
| COMPLETED | exit code 0 | | nothing; outputs ready to fetch |
| FAILED | non-zero exit code | | inspect stderr.log; re-submit |
| OOM_KILLED | exceeded --mem-mb cap | cgroup | bump --mem-mb or split the job |
| TIME_EXCEEDED | exceeded --wall-time-seconds cap | watchdog | bump the cap, or checkpoint if vibe-qc supports it for the workload |
| STARVED | CPU-underutilisation watchdog (5 min < 10% CPU summed over the pgid descendants) | watchdog | check stderr.log (typically a hanging worker); v0.5.12+ samples the whole pgroup so the bash wrapper no longer false-positives |
| ABORTED_BY_QUEUE | terminated by vq kill, queue-wide pause→kill, or daemon restart that lost the exit code | daemon | intentional; resubmit if needed |

Always pass --wall-time-seconds N for non-trivial jobs — that’s the only guard against a wedged SCF eating cores indefinitely.

Orphan exit-code recovery (v0.5.9+)

If the daemon restarts mid-job (deliberately via systemctl --user restart vq-daemon or via Restart=on-failure), the dispatched bash wrapper writes the inner process’s exit code to <workspace>/_vq/exit-code on graceful exit. When the new daemon reconciles orphans, it reads the marker and classifies as COMPLETED (rc=0) or FAILED (rc≠0). Pre-v0.5.9 behaviour was to mark every restart-orphan as ABORTED_BY_QUEUE even on clean completion; v0.5.9 fixes that and is what makes vq admin update (v0.5.20+) safe to use — it deliberately pause-restart-resumes the daemon.
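The reconciliation step can be illustrated with a short sketch. It assumes the marker file holds the exit code as plain decimal text, which is not confirmed here, and classify_orphan is a hypothetical name rather than a function in vq. A missing or unreadable marker falls back to ABORTED_BY_QUEUE, matching the "daemon restart that lost the exit code" row in the table above.

```python
import tempfile
from pathlib import Path


def classify_orphan(workspace: Path) -> str:
    """Map the <workspace>/_vq/exit-code marker to a terminal state.

    Assumption: the marker contains the exit code as decimal text.
    """
    marker = workspace / "_vq" / "exit-code"
    try:
        rc = int(marker.read_text().strip())
    except (FileNotFoundError, ValueError):
        # No usable marker: the exit code was lost.
        return "ABORTED_BY_QUEUE"
    return "COMPLETED" if rc == 0 else "FAILED"


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    print(classify_orphan(workspace))                 # marker not written yet
    (workspace / "_vq").mkdir()
    (workspace / "_vq" / "exit-code").write_text("0\n")
    print(classify_orphan(workspace))                 # clean exit
    (workspace / "_vq" / "exit-code").write_text("2\n")
    print(classify_orphan(workspace))                 # non-zero exit
```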

Multi-venv --branch routing (v0.5.6+)

The remote may host multiple vibe-qc clones — typically vibeqc-dev (tracking main) and vibeqc-release (tracking the latest tag). Pick one per submit:

vq submit my_calc.py                        # default_host's default
vq submit my_calc.py --branch main          # dev venv
vq submit my_calc.py --branch release       # release venv
vq submit my_calc.py --branch latest        # = release (alias)

--branch is mutually exclusive with --python, and only applies to single-file submits. For -d / -t submits, encode the interpreter in the explicit command.

The mapping is per-host config — [hosts.<name>.branches] + [hosts.<name>.branch_aliases]. Add new entries by editing ~/.config/vq/config.toml on the laptop; no remote restart needed.

External-program workflows (CRYSTAL / ORCA / PySCF)

vibe-qc treats other QC programs as external — see CLAUDE.md § 10 for the policy. vq dispatches them through contrib/ wrappers that handle each program’s I/O conventions:

CRYSTAL14 (Pcrystal + PROPERTIES14)

# Parallel CRYSTAL14 (default --np 14):
vq submit -d ./calc --cpus 14 \
    -- bash /home/USER/vibeqc-queue/vibe-queue/contrib/run-crystal.sh INPUT.d12

# Serial:
vq submit -d ./calc --cpus 1 \
    -- bash /home/USER/vibeqc-queue/vibe-queue/contrib/run-crystal.sh --serial INPUT.d12

# Custom MPI rank count:
vq submit -d ./calc --cpus 8 \
    -- bash /home/USER/vibeqc-queue/vibe-queue/contrib/run-crystal.sh --np 8 INPUT.d12

# PROPERTIES14 (parallel):
vq submit -d ./prop --cpus 14 \
    -- bash /home/USER/vibeqc-queue/vibe-queue/contrib/run-crystal.sh --properties prop.d3

The wrapper stages the input file as ./INPUT, runs mpirun -np N Pcrystal > out.out, and restores any pre-existing INPUT on exit.

ORCA 6.1

ORCA spawns its own MPI internally — don’t wrap with mpirun:

vq submit -d ./orca_run --cpus 8 -- orca input.inp

ORCA takes its core count from the ! PAL N line in the input file, not from vq. Pass a matching --cpus N so the cgroup accounting agrees with what ORCA will actually use.
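Keeping the two numbers in sync by hand is easy to get wrong, so a submit script might read the count from the input file. The sketch below is an assumption-laden helper, not part of vq: it recognises a PAL keyword on a "!" line (with or without a space before the number) and the %pal ... nprocs N block form, and falls back to 1 when neither is present. The name orca_core_count is hypothetical.

```python
import re

# "! ... PAL8" or "! ... PAL 8" on a simple-keyword line.
BANG_PAL = re.compile(r"^\s*!.*?\bPAL\s*(\d+)\b", re.IGNORECASE | re.MULTILINE)
# "%pal nprocs 8 end", possibly spread over several lines.
PAL_BLOCK = re.compile(r"%pal\b.*?\bnprocs\s+(\d+)", re.IGNORECASE | re.DOTALL)


def orca_core_count(input_text: str) -> int:
    """Best-effort core count from an ORCA input; 1 if no parallel directive."""
    for pattern in (PAL_BLOCK, BANG_PAL):
        match = pattern.search(input_text)
        if match:
            return int(match.group(1))
    return 1


text = "! B3LYP def2-SVP PAL8\n* xyz 0 1\nH 0 0 0\nH 0 0 0.74\n*\n"
cpus = orca_core_count(text)
# Build the matching submit command (same shape as the example above).
command = ["vq", "submit", "-d", "./orca_run", "--cpus", str(cpus), "--", "orca", "input.inp"]
print(" ".join(command))
```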

PySCF (as a comparison / parity reference)

vq submit my_pyscf_script.py                # PySCF is in both vibe-qc venvs

Both the dev and release vibe-qc venvs have PySCF installed (it’s in [test]), so PySCF scripts submit the same way as vibe-qc scripts.

Monitoring + management

# Snapshot the queue:
vq queue                                  # all states
vq queue --active                         # running + pending + suspended
vq queue -s running                       # only running (v0.5.27)
vq queue -s running -s pending            # explicit two-state filter
vq queue -s failed -s killed              # terminal-failure forensics

# Per-job snapshot (metadata + tail of stdout/stderr):
vq status <jobid>                         # last 50 lines
vq status <jobid> -n 200                  # last 200 lines
vq status <jobid> -n 0                    # full output

# Live tail of a workspace file (v0.5.26):
vq tail <jobid>                           # tail of stdout.log (the default file)
vq tail <jobid> -f                        # live-stream (Ctrl-C to stop)
vq tail <jobid> --name vibeqc.log -f      # custom logger file
vq tail <jobid> --name mgo.out -f         # CRYSTAL output
vq tail <jobid> --name h2.out -f          # ORCA / Psi4 output

# Fetch outputs back to the laptop (live or completed job: copies the
# workspace dir; archived job: un-tars from the .tar.bz2):
vq fetch <jobid> -o ./results

# Cancel:
vq kill <jobid>                           # SIGTERM the process group,
                                          # then SIGKILL after grace
                                          # → terminal state KILLED

# Pause / resume (v0.5.1+):
vq pause <jobid>                          # SIGSTOP the job
vq resume <jobid>                         # SIGCONT
vq pause --all                            # pause every running job
vq resume --all                           # resume every paused job

vq tail is the canonical “watch the SCF converge live” verb: it execs tail -f directly (locally) or via ssh (remotely), so SIGINT goes straight through and there’s no Python buffering layer between the job’s logger and your terminal. Use --name to target whatever file vibe-qc’s logger is writing to. For example, if the script calls logging.basicConfig(filename='vibeqc.log'), run vq tail JOBID --name vibeqc.log -f.

The pause / resume flow is the right tool when you need to free the box temporarily (kids gaming, an interactive workload) without losing in-flight jobs. For automated venv refresh use vq admin update instead (it pauses-pulls-builds-resumes in one verb; see Refreshing the remote vibe-qc venv below).

Web dashboard

If vq-web.service is running, open http://<remote>:8765/queue in a browser. First-time access prompts for the bearer token (stored at ~/.config/vq/web-token on the remote).

Endpoints:

| Endpoint | Purpose |
|---|---|
| GET /queue | Live queue table (htmx auto-refresh) |
| GET /jobs/<id> | Per-job detail: spec, resource history, log tail, exit status |
| GET /health/{live,ready} | Kubernetes-style probes for external monitoring |
| POST /api/v1/jobs/<id>/{kill,pause,resume} | Per-job write actions (v0.5.1+) |
| POST /api/v1/queue/{pause,resume} | Queue-wide actions (v0.5.2+) |

All write endpoints require the bearer token in an Authorization: Bearer <token> header. For browser use, htmx + a small form prompts once and stores it in sessionStorage.
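For scripted access, the header can be attached with the standard library alone. The sketch below only builds the request object so it runs without a live server; the host name compute, the placeholder token, and the helper name build_action_request are assumptions, and whether the endpoint expects a request body is not covered here (see vibe-queue/docs/web.md).

```python
import urllib.request

ALLOWED_ACTIONS = {"kill", "pause", "resume"}


def build_action_request(base_url: str, job_id: str, action: str, token: str) -> urllib.request.Request:
    """Build a POST to /api/v1/jobs/<id>/<action> carrying the bearer token."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url}/api/v1/jobs/{job_id}/{action}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


# "compute" and "TOKEN" are placeholders; read the real token from
# ~/.config/vq/web-token on the remote.
req = build_action_request("http://compute:8765", "c0ff50a06462", "pause", "TOKEN")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```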

Architecture detail (auth, request shapes, error handling) is in vibe-queue/docs/web.md.

Fetching outputs

When a job completes, the workspace on the remote contains the outputs your script wrote (water.out, water.molden, …) plus the queue-side capture files (stdout.log, stderr.log, _vq/events.jsonl, _vq/exit-code).

vq fetch <jobid> ./local_outputs/       # rsync the whole workspace
vq fetch <jobid> ./outputs/ --files stdout.log water.out
                                         # specific files only

vq fetch is archive-aware (v0.5.11+): if the workspace was archived via vq cleanup --archive (see next section), fetch streams the .tar.bz2 over SSH and reconstructs the original directory layout on the laptop. No special flag needed; the same vq fetch <jobid> <local-dir> command works for both live and archived workspaces.

Operator controls (pause / resume / throttle / drain)

When the box gets busy for non-queue reasons (kids gaming, an interactive session, an urgent job from another chat), three knobs let vq step aside without losing in-flight work:

# Hard freeze (SIGSTOP); RAM stays allocated, no CPU used.
vq pause <jobid>             # one job
vq pause --all               # every running job
vq resume <jobid>             # SIGCONT
vq resume --all

# Soft throttle (cgroup CPUWeight, renice fallback v0.5.21+).
# weight=100 is default; weight=20 = "step aside" under contention.
vq throttle <jobid> --weight 20
vq throttle --all --weight 20 --persist     # persist across new dispatches
vq throttle --all --weight 20 --persist --duration 2h   # auto-release after 2h
vq throttle --release-persist                # clear persistent state
vq throttle --status                         # what's the current state?

# Drain (don't dispatch NEW jobs; running ones continue).
vq drain                     # full drain (no new dispatches)
vq drain --max-jobs 0        # explicit full drain
vq drain --max-jobs 2        # partial drain (cap at 2 concurrent)
vq drain --release           # back to daemon's configured max
vq drain --duration 1h       # auto-release after 1h
vq drain --status

The controls compose: vq drain + vq pause --all + vq throttle --all cover the operator-control story. All four pieces of state (drain.json, throttle.json, auto-cleanup.json, plus the per-job suspended state on the spec) live under <state_root> and survive daemon restarts.

Workspace cleanup (v0.5.10+)

Long-running queues accumulate workspaces. vq cleanup is the manual housekeeping verb; it operates only on terminal-state jobs (active / pending / suspended jobs are never touched).

# List terminal-state jobs and their workspace ages:
vq cleanup
# → table: jobid, terminal_state, finished_at, workspace_size_mb

# Dry-run preview — show what would be archived:
vq cleanup --archive --older-than 30d

# Actually archive (add -x to "execute"):
vq cleanup --archive --older-than 30d -x
# → workspaces become tar.bz2 files under <state_root>/archive/

# Hard delete archived workspaces older than 90 days:
vq cleanup --delete --older-than 90d -x

# Restore an archived workspace (un-tar in place):
vq cleanup --restore <jobid> -x

The archive→restore round-trip is lossless: the directory tree after --restore is byte-identical to what was archived.

Auto-policy (v0.5.17+): instead of running the verb manually, register a daemon-side policy:

# Daemon runs the sweep once per --interval (default 24h):
vq cleanup --auto-enable --archive-after 30d --delete-after 90d

# Per-state retention (v0.5.23+) — keep failed-job forensics longer:
vq cleanup --auto-enable --archive-after 30d \
           --archive-after-state failed:90d --delete-after 180d

# Read-only status:
vq cleanup --auto-status

# Disable:
vq cleanup --auto-disable
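The retention logic behind --archive-after and --archive-after-state can be sketched as a lookup with a per-state override. This is an illustration only: the helper names are hypothetical, and the parser handles just the h and d suffixes that appear in the examples on this page; which other units vq accepts is not stated here.

```python
from datetime import timedelta

# Only the suffixes used in this page's examples ("2h", "30d").
UNITS = {"h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '30d' or '2h' into a timedelta."""
    value, unit = text[:-1], text[-1:]
    if unit not in UNITS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{UNITS[unit]: int(value)})


def due_for_archive(state: str, age: timedelta, default: str, overrides: dict[str, str]) -> bool:
    """A workspace is due once its age reaches the retention for its state."""
    retention = parse_duration(overrides.get(state, default))
    return age >= retention


# Mirrors: --archive-after 30d --archive-after-state failed:90d
overrides = {"failed": "90d"}
age = timedelta(days=45)
print(due_for_archive("completed", age, "30d", overrides))  # past the 30d default
print(due_for_archive("failed", age, "30d", overrides))     # still inside the 90d override
```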

Configurable archive location (v0.5.22+): default <state_root>/archive/ may live on a small partition. Override via:

  • $VQ_ARCHIVE_DIR env var on the daemon host (applies to all archive paths globally)

  • --archive-dir DIR flag on the verb (per-policy with --auto-enable, per-invocation with one-shot --archive)

Why this matters: when the queue gets busy, workspaces add up fast (~10s of MB per typical SCF, ~hundreds of MB for big periodic + Molden + cube + .traj). Without cleanup, the <state_root> filesystem fills. With cleanup, you get a straightforward archive → delete pipeline that preserves the artefact history (every spec + final outputs) at small storage cost (~5× compression for typical output mixes).

Daemon admin

What happens at host reboot

The daemon comes back up after a reboot only if loginctl enable-linger is set. What happens to running jobs depends on what restarted:

  • Daemon restart only — running jobs become orphans with their pgids preserved; the new daemon re-attaches at startup. Job completes normally; exit code is read from the dispatched job’s _vq/exit-code file (so re-attach works even after a restart that wiped the Popen handle). This is v0.5.9’s _vq/exit-code marker — pre-v0.5.9 restart-orphans got marked ABORTED_BY_QUEUE even on clean completion.

  • Full host reboot — kernel kills everything; all RUNNING jobs are marked ABORTED_BY_QUEUE on next daemon start. Resubmit using the JobSpecs in the queue history.

Note

Wall-time enforcement gap when the daemon is down. Because v0.5.8 dropped cgroup RuntimeMaxSec (it wasn’t runtime-mutable on pause), wall-time enforcement is now the Python watchdog only. If the daemon crashes and stays down beyond the watchdog’s poll interval, a job that should have hit its --wall-time-seconds cap during the outage isn’t killed by the kernel; it keeps running until the daemon comes back and the watchdog catches up. In practice, Restart=on-failure on the systemd unit keeps the gap to a few seconds. The trade-off is documented in vibe-queue/docs/wall_time_design.md.

Refreshing the remote vibe-qc venv after a release (v0.5.20+)

As of vq v0.5.20, this is one verb:

vq admin update vibeqc-release

This runs pause-all → git -C <git_dir> pull → bash <update_script> → resume-all. The resume step always runs, even on Ctrl-C or pull failure: it sits in a finally block, so the queue always comes back up. git_dir, branch, and update_script are read from the host’s [programs.X] registry (see vq programs below).
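The "resume always runs" guarantee is the ordinary try/finally pattern. The sketch below shows the control flow with stand-in callables; it is not vq's implementation, and admin_update here is just an illustrative function name.

```python
from typing import Callable


def admin_update(
    pause_all: Callable[[], None],
    pull: Callable[[], None],
    build: Callable[[], None],
    resume_all: Callable[[], None],
) -> None:
    """Pause, pull, build, and resume; resume runs even if pull or build fails."""
    pause_all()
    try:
        pull()
        build()
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt (Ctrl-C).
        resume_all()


# Demonstration with a pull step that fails.
calls: list[str] = []


def failing_pull() -> None:
    calls.append("pull")
    raise RuntimeError("git pull failed")


try:
    admin_update(
        lambda: calls.append("pause"),
        failing_pull,
        lambda: calls.append("build"),
        lambda: calls.append("resume"),
    )
except RuntimeError:
    pass

print(calls)  # → ['pause', 'pull', 'resume']  (build skipped, resume still ran)
```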

Verifying a tagged release (v0.5.24+):

git push --tags
vq admin update vibeqc-release --tag v0.8.0
vq submit smoke_test.py --branch release

--tag v0.X.Y runs git describe --exact-match --tags HEAD after the pull and fails the update (skips the build, exits non-zero) if HEAD isn’t exactly at the expected tag. This guards against the libint-vanishing class of “pull succeeded but landed on the wrong commit” failures.
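The same check can be reproduced by hand or in a CI script. The sketch below shells out to the git command named above and compares the result with the expected tag; the function names are hypothetical, and how vq itself wires this into the update is not shown here.

```python
import subprocess
from typing import Optional


def describe_exact(git_dir: str) -> Optional[str]:
    """Return the tag HEAD sits on exactly, or None if HEAD is not at a tag."""
    result = subprocess.run(
        ["git", "-C", git_dir, "describe", "--exact-match", "--tags", "HEAD"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # git exits non-zero when no tag points exactly at HEAD.
        return None
    return result.stdout.strip()


def head_is_at_tag(described: Optional[str], expected_tag: str) -> bool:
    """True only when the described tag equals the expected one."""
    return described == expected_tag


# Pure-logic examples (no repository needed):
print(head_is_at_tag("v0.8.0", "v0.8.0"))  # HEAD is at the expected tag
print(head_is_at_tag(None, "v0.8.0"))      # HEAD is not at any tag
```

Against a real checkout, head_is_at_tag(describe_exact("/path/to/clone"), "v0.8.0") gives the yes/no answer that decides whether the build step should run.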

Checking remote state (v0.5.25+):

vq admin status
# NAME            BRANCH   SHA           DESCRIBE             DIRTY  LAST_UPDATED_AT             LAST OK
# vibeqc-dev      main     abc12345defg  v0.7.3-12-gabc1234   no     2026-05-13T14:30:00+00:00   True
# vibeqc-release  release  fedcba987654  v0.8.0                no     2026-05-13T14:35:12+00:00   True

Compare SHA to your laptop’s git rev-parse --short=12 HEAD to answer “is the remote at the commit I just pushed?” without ssh.

Workflow for testing a just-pushed feature:

git push                                    # laptop
vq admin update vibeqc-dev                  # refresh the remote
vq submit my_feature_test.py --branch main  # exercise the new code

This is the canonical pattern: always run vq admin update between push and submit if you need the remote at your latest commit.

vq programs (v0.5.18+) — list registered programs:

vq programs              # human-readable table
vq programs --json       # machine-readable; for scripts

The registry lives at ~/.config/vq/config.toml on the remote host under [programs.X]. Three kinds:

  • binary — CRYSTAL, ORCA, Psi4 (an executable on disk)

  • venv — vibeqc-dev, vibeqc-release (a Python venv + git checkout that vq admin update knows how to refresh)

  • import — pyscf (a module that should be importable from a specific Python)

Watching the daemon

journalctl --user -u vq-daemon -f       # live tail
systemctl --user status vq-daemon       # service health

Concurrency

The default daemon configuration is single-job dispatch (--max-jobs 1 in the systemd unit). This is the test-phase default — change to --max-jobs N in the unit file’s ExecStart and restart the daemon to parallel-dispatch.

Set --max-jobs honestly against the CPU budget: if jobs declare --cpus 8 and the box has 32 cores, --max-jobs 4 is the safe ceiling. The daemon does not currently enforce this; it accepts whatever you set.
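Since the daemon does not check this for you, the ceiling is worth computing before editing the unit file. A minimal sketch of the arithmetic follows; safe_max_jobs is a hypothetical helper, and for mixed workloads the largest declared --cpus value is the conservative input.

```python
def safe_max_jobs(total_cores: int, cpus_per_job: int) -> int:
    """Largest --max-jobs that keeps the declared CPUs within the core count."""
    if total_cores < 1 or cpus_per_job < 1:
        raise ValueError("core counts must be positive")
    # Integer division: partial jobs do not fit. Never go below single-job
    # dispatch, which is the daemon's default.
    return max(1, total_cores // cpus_per_job)


print(safe_max_jobs(32, 8))   # the example from the text
print(safe_max_jobs(32, 14))  # e.g. the default 14-rank CRYSTAL14 wrapper
```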

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| vq: command not found on laptop | venv not on PATH | symlink to ~/.local/bin/vq or add the venv bin to PATH |
| ssh: command not found on submit | local SSH client not installed | install OpenSSH client |
| Permission denied (publickey) on submit | SSH key not authorised on remote | add laptop’s ~/.ssh/id_*.pub to remote’s ~/.ssh/authorized_keys |
| Job hangs in QUEUED indefinitely | daemon not running or --max-jobs 0 | systemctl --user status vq-daemon; restart if dead |
| Job terminates STARVED at the 5-min mark | pre-v0.5.12, the watchdog read CPU from the wrapper PID only (bash sleeping in wait() shows 0% CPU even when the child process is using 16 cores) | upgrade to vq v0.5.12+; the watchdog now sums CPU across the whole pgid descendant set. As a workaround on older versions: exec command inside the wrapper so the worker becomes the dispatched PID |
| import vibeqc fails on remote | wrong remote_python (system Python instead of vibe-qc venv) | check ~/.config/vq/config.toml [hosts.X].remote_python |
| Web UI says “401 unauthorised” | bearer token expired or wrong | re-read ~/.config/vq/web-token on remote; for browsers, clear sessionStorage |

A more comprehensive troubleshooting table is in vibe-queue/docs/handover.md § Troubleshooting.

Version history (recent)

| vq version | Headline |
|---|---|
| v0.5.27 | vq queue --state STATE + --active shortcut: filter the listing |
| v0.5.26 | vq tail [HOST] JOBID --name FILENAME -f: live-stream workspace files (default stdout.log; --name for vibe-qc’s logger output / engine-native files) |
| v0.5.25 | vq admin status: live SHA / DESCRIBE / DIRTY + last-update record per venv env |
| v0.5.24 | vq admin update --tag v0.X.Y: verify HEAD is at the expected tag post-pull |
| v0.5.23 | per-state retention overrides: --archive-after-state failed:7d |
| v0.5.22 | configurable archive_dir ($VQ_ARCHIVE_DIR env + --archive-dir flag + AutoCleanupPolicy.archive_dir) |
| v0.5.21 | renice fallback for vq throttle on non-cgroup hosts |
| v0.5.20 | vq admin update <env> minimal: one verb for pause / git pull / build / resume |
| v0.5.19 | smoke test consumes absolute paths from vq programs --json |
| v0.5.18 | vq programs verb + [programs.X] registry (binary / venv / import kinds) |
| v0.5.17 | auto-cleanup policy (daemon main-loop hook reads auto-cleanup.json) |
| v0.5.13–.16 | vq throttle / vq drain + --persist + --duration auto-release |
| v0.5.12 | watchdog samples pgid descendants (fixes STARVED false-positive when bash-wrapped jobs sleep in wait()) |
| v0.5.11 | archive-aware remote vq fetch from the laptop (v0.5.10’s cleanup verb produced archives the SSH-side fetch couldn’t see) |
| v0.5.10 | vq cleanup verb (archive / delete / restore terminal-state workspaces) |
| v0.5.9 | orphan exit-code recovery via _vq/exit-code marker: restart-orphans that complete normally are now COMPLETED/FAILED, not ABORTED_BY_QUEUE |
| v0.5.8 | drop broken cgroup RuntimeMaxSec; Python watchdog is the single owner of wall-time |
| v0.5.7 | run-crystal.sh cleans per-rank scratch on success; --keep-scratch opt-out |
| v0.5.6 | --branch multi-venv routing (vibeqc-dev vs vibeqc-release) |
| v0.5.0–.5 | web dashboard, pause/resume, bearer-token auth, CRYSTAL14 parallel dispatch |
| v0.4 | cgroup-v2 enforcement, pgid recovery, event log |
| v0.3 | resource watchdog (mem cap, wall-time, terminal-state machine) |

Full per-version detail at vibe-queue/docs/handover.md § “What’s NEW in …” (the handover is the deeper reference; this page is the user-facing entry).

Roadmap (vq’s own)

vq has its own roadmap independent of vibe-qc — see vibe-queue/docs/roadmap.md. Near-term:

  • v0.6.0 full scope (some items already shipped in v0.5.20–v0.5.25; remaining: vq admin update --all multi-env, vq admin update vq self-update with daemon restart, admin-update-in-progress marker file for crash recovery, multi-user / per-uid).

  • v0.7 — job priority (--priority N), per-user quotas, retry on failure (--retry N with backoff), webhook notifications, opt-in --auto-resume after host reboot.

  • v1.0 — SLURM / PBS backend so the same vq submit shape works against HPC clusters.

See also