External structures (vqfetch)

vqfetch is a console script, introduced in v0.8.0, that pulls crystal structures from open databases and emits two artefacts:

  1. A regression PeriodicSpec module (under examples/regression/systems/periodic/) so the structure becomes part of the regression matrix.

  2. An executable vibe-qc input script (under examples/) so you can run an SCF on the fetched cell with one command.

The pull preserves full per-record provenance: source DB, ID, permalink URL, original DOI (where available), license string, and fetched-at timestamp. A vqfetch-pulled structure therefore already carries everything needed to answer “where did this come from?” when the work is published.

Sources

| Source | Default per-record license | Use it for | CLI subcommand |
|---|---|---|---|
| OPTIMADE federation | per-provider (varies) | formula-based federated query across providers | vqfetch optimade --formula MgO |
| Materials Project | CC-BY 4.0 | computed structures + properties; primary for many slugs | vqfetch mp --id mp-1265 |
| COD (Crystallography Open Database) | CC0 / public domain | experimentally determined CIFs | vqfetch cod --id 1011027 |
| NOMAD | CC-BY 4.0 (data); CC0 (metadata) | computed materials data | (use optimade --provider nomad) |
| Canonical set | (whichever the slug’s primary provider applies) | the five round-trip-verified structures used as smoke tests | vqfetch canonical mgo_rocksalt |

The full license inventory is in docs/license.md.

Install

vqfetch is part of the optional [fetch] extra:

pip install -e '.[fetch]'   # development install
# OR
pip install 'vibe-qc[fetch]' # once published

This pulls in optimade>=1.0,<2, ase>=3.22, beautifulsoup4>=4.12,<5, and lxml>=4.9. Without the extra, vqfetch will not be on $PATH.

Quick start: round-trip MgO from Materials Project

# 1. Fetch the structure → emit SPEC + input script.
vqfetch mp --id mp-1265 --basis sto-3g --method rks-lda

# Output (one path per line; both are written to disk):
# examples/regression/systems/periodic/mp_mp-1265.py
# examples/scf-mp_mp-1265.py

# 2. Run the SCF (uses the venv's Python).
.venv/bin/python examples/scf-mp_mp-1265.py

The SCF run produces:

  • mp_mp-1265.out — banner, SCF trace, energy breakdown, orbital table, the source DB / ID / DOI / license recorded in the run header, plus wall-clock timings.

  • mp_mp-1265.molden — molecular orbitals.

  • mp_mp-1265.traj — ASE trajectory (single frame for static SCF; multi-frame for optimize=True).

Reference: live planetx round-trip on 2026-05-09 produced E = −950.4204308512 Ha (13 SCF iterations, about 2 h 20 min on 16 cores).
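
The two quick-start steps can also be driven from Python. The sketch below relies only on what is documented above: vqfetch prints the emitted paths to stdout, one per line. It additionally assumes that the input script is the path whose file name starts with "scf-", as in the example output. The helper names parse_emitted_paths and fetch_and_run are hypothetical and not part of vibe-qc.

```python
"""Drive the vqfetch quick start from Python (illustrative sketch)."""
import subprocess
import sys
from pathlib import Path
from typing import Tuple


def parse_emitted_paths(stdout: str) -> Tuple[Path, Path]:
    """Split vqfetch's stdout into (spec_path, script_path).

    Assumes one path per line, and that the input script's file name
    starts with "scf-" (as in the example output above).
    """
    paths = [Path(line.strip()) for line in stdout.splitlines() if line.strip()]
    scripts = [p for p in paths if p.name.startswith("scf-")]
    specs = [p for p in paths if not p.name.startswith("scf-")]
    if len(scripts) != 1 or len(specs) != 1:
        raise ValueError(f"expected one SPEC and one script path, got {paths}")
    return specs[0], scripts[0]


def fetch_and_run(mp_id: str, basis: str = "sto-3g", method: str = "rks-lda") -> None:
    """Hypothetical helper: fetch a structure, then run the emitted SCF script."""
    fetch = subprocess.run(
        ["vqfetch", "mp", "--id", mp_id, "--basis", basis, "--method", method],
        check=True,
        capture_output=True,
        text=True,
    )
    _spec, script = parse_emitted_paths(fetch.stdout)
    # Use the current interpreter so the active virtual environment is reused.
    subprocess.run([sys.executable, str(script)], check=True)
```

Calling fetch_and_run("mp-1265") from inside the project's virtual environment would then mirror the two shell commands above.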

Five canonical structures (round-trip-verified)

The canonical subcommand walks a hand-curated five-structure table that the v1 acceptance harness round-trips end-to-end on every commit:

vqfetch canonical mgo_rocksalt    # MgO via Materials Project
vqfetch canonical nacl_rocksalt   # NaCl via Materials Project
vqfetch canonical lih_rocksalt    # LiH via Materials Project
vqfetch canonical si_diamond      # Si via Materials Project
vqfetch canonical c_diamond       # C  via Materials Project

Use --quick to replace the recommended basis with sto-3g for a fast smoke test. Without it, each structure uses the heuristic-recommended basis (see § Recommended basis below).

Per-record provenance

Every fetched record carries the following fields (visible in the SPEC module’s Provenance dataclass and surfaced in the SCF log + per-run .system manifest):

Provenance(
    source_db="materials_project",
    source_id="mp-1265",
    source_url="https://next-gen.materialsproject.org/materials/mp-1265",
    original_doi="10.17188/1199994",          # if known
    license="CC-BY 4.0",
    fetched_at="2026-05-09T21:30:00+00:00",
    provider="mp",                             # OPTIMADE provider key
    notes="...",                               # free-form, e.g. "via canonical_set"
)

When you re-run a calculation later, the provenance bundle travels with the SPEC. For published work, cite each source database according to its terms of use: Materials Project data is CC-BY 4.0, so attribution is required; COD is CC0, so citation is appreciated but not legally required; for NOMAD, cite both the contributing author and NOMAD.
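
To make the shape of the record concrete, here is a minimal stand-in for the Provenance dataclass together with a function that renders it as header comment lines. The field names come from the example above; which fields are optional, their order, and the DOI line in the rendered header are assumptions made for illustration, and header_lines is a hypothetical helper rather than a vibe-qc function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Provenance:
    """Illustrative stand-in for the SPEC module's Provenance record."""
    source_db: str
    source_id: str
    source_url: str
    license: str
    fetched_at: str                      # ISO 8601 timestamp with UTC offset
    original_doi: Optional[str] = None   # assumed optional ("if known")
    provider: Optional[str] = None       # OPTIMADE provider key
    notes: str = ""


def header_lines(p: Provenance) -> List[str]:
    """Render the provenance bundle as comment lines for a log header."""
    lines = [
        f"#   Source: {p.source_db} {p.source_id}",
        f"#   URL:    {p.source_url}",
    ]
    if p.original_doi:
        # Assumed format: the document does not show a DOI line verbatim.
        lines.append(f"#   DOI:    {p.original_doi}")
    lines.append(f"#   License: {p.license}")
    lines.append(f"#   Fetched: {p.fetched_at}")
    return lines


record = Provenance(
    source_db="materials_project",
    source_id="mp-1265",
    source_url="https://next-gen.materialsproject.org/materials/mp-1265",
    license="CC-BY 4.0",
    fetched_at="2026-05-09T21:30:00+00:00",
    original_doi="10.17188/1199994",
)
print("\n".join(header_lines(record)))
```

Freezing the dataclass reflects the idea that a provenance record is written once at fetch time and not edited afterwards.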

Cache + offline mode

vqfetch caches every successful fetch on disk, following the XDG base-directory convention (default: ~/.cache/vqfetch/). Repeating vqfetch mp --id mp-1265 within the TTL does not hit the API again. Default TTLs:

  • 30 days for OPTIMADE / MP / NOMAD.

  • Infinite for COD (CIFs are immutable post-publication).

Two relevant flags:

  • --no-cache — bypass cache reads but still write through after a live fetch. Useful when you suspect the upstream record has been updated.

  • --cache-only — refuse live HTTP entirely; fail fast if the record isn’t in the cache. Useful for offline / reproducible runs (e.g. on cluster compute nodes without network).
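
The interaction between the TTLs and the two flags can be sketched as a small lookup function. This is not vqfetch's implementation: an in-memory dict stands in for the on-disk cache, get_record and CacheMiss are hypothetical names, and treating an expired entry as a miss in cache-only mode is an assumption the text does not settle.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DAY = 24 * 3600
# TTLs from the list above; None means the entry never expires.
TTL_SECONDS: Dict[str, Optional[float]] = {
    "optimade": 30 * DAY,
    "mp": 30 * DAY,
    "nomad": 30 * DAY,
    "cod": None,
}

Cache = Dict[Tuple[str, str], Tuple[float, str]]  # key -> (stored_at, payload)


class CacheMiss(RuntimeError):
    """Raised in cache-only mode when no usable entry exists."""


def get_record(
    source: str,
    record_id: str,
    cache: Cache,
    fetch_live: Callable[[str, str], str],
    *,
    no_cache: bool = False,
    cache_only: bool = False,
    now: Optional[float] = None,
) -> str:
    if no_cache and cache_only:
        raise ValueError("--no-cache and --cache-only are mutually exclusive")
    now = time.time() if now is None else now
    key = (source, record_id)

    # --no-cache skips the read path only.
    if not no_cache and key in cache:
        stored_at, payload = cache[key]
        ttl = TTL_SECONDS[source]
        if ttl is None or now - stored_at < ttl:
            return payload

    # --cache-only never touches the network (assumption: stale == miss).
    if cache_only:
        raise CacheMiss(f"{source}:{record_id} is not in the cache")

    payload = fetch_live(source, record_id)
    cache[key] = (now, payload)  # write-through, also under --no-cache
    return payload
```

Passing the clock in through now keeps the TTL logic testable without waiting 30 days.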

Common flags

All structure subcommands accept the same emission flags:

| Flag | Default | Purpose |
|---|---|---|
| --basis | (heuristic; see below) | Override the recommended basis (e.g. --basis pob-tzvp). |
| --method | rks-lda for periodic; rhf for molecules | SCF method baked into the emitted input script. Choices: rhf, rks-lda, rks-pbe, rks-blyp, rks-b3lyp. |
| --quick | off | Force --basis sto-3g. Smoke-test mode. |
| --out | examples/regression/systems/{periodic,molecules}/ | Output directory for the emitted SPEC module. |
| --input-script | examples/ | Output directory for the emitted input script. |
| --no-cache | off | Bypass cache reads (still writes through). |
| --cache-only | off | Refuse live HTTP. |
| --slug | (auto-generated) | Override the SPEC id slug. |
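
As an illustration of how these flags might translate into the emitted script, the sketch below splits a --method value into the method and functional arguments and resolves the basis. Only the rks-lda case (method="RKS", functional="LDA") is confirmed by the emitted script shown later in this page; the handling of rhf and the rule that --quick wins over an explicit --basis are assumptions, and both function names are hypothetical.

```python
from typing import Optional, Tuple

# --method choices as listed in the flag table.
METHOD_CHOICES = ("rhf", "rks-lda", "rks-pbe", "rks-blyp", "rks-b3lyp")


def split_method(choice: str) -> Tuple[str, Optional[str]]:
    """Map a --method string to (method, functional).

    "rks-lda" -> ("RKS", "LDA") matches the emitted input script;
    returning ("RHF", None) for "rhf" is an assumption.
    """
    if choice not in METHOD_CHOICES:
        raise ValueError(f"unknown method {choice!r}; choose from {METHOD_CHOICES}")
    method, _, functional = choice.partition("-")
    return method.upper(), (functional.upper() or None)


def resolve_basis(recommended: str, basis: Optional[str] = None, quick: bool = False) -> str:
    """Pick the basis: --quick forces sto-3g, else --basis, else the heuristic."""
    if quick:
        return "sto-3g"
    return basis or recommended


print(split_method("rks-lda"))
print(resolve_basis("pob-tzvp", quick=True))
```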

What the emitted files look like

After vqfetch mp --id mp-1265:

examples/regression/systems/periodic/mp_mp-1265.py — import-as-module SPEC for the regression suite:

"""MgO rocksalt — fetched from Materials Project mp-1265."""
from examples.regression.core.spec import (
    PeriodicSpec, Provenance, ReferenceKind,
)

mp_mp_1265 = PeriodicSpec(
    id="mp_mp-1265",
    formula="MgO",
    lattice_vectors=[[0.0, 2.106, 2.106],
                     [2.106, 0.0, 2.106],
                     [2.106, 2.106, 0.0]],
    atoms=[("Mg", [0.0, 0.0, 0.0]),
           ("O",  [2.106, 2.106, 2.106])],
    recommended_basis="pob-tzvp",
    provenance=Provenance(
        source_db="materials_project",
        source_id="mp-1265",
        source_url="https://next-gen.materialsproject.org/materials/mp-1265",
        license="CC-BY 4.0",
        fetched_at="2026-05-09T21:30:00+00:00",
    ),
)

examples/scf-mp_mp-1265.py — runnable SCF script:

"""SCF on mp_mp-1265 — generated by vqfetch on 2026-05-09."""
from vibeqc import Atom, PeriodicSystem, run_periodic_job

cell = PeriodicSystem(
    lattice_vectors=[[0.0, 2.106, 2.106],
                     [2.106, 0.0, 2.106],
                     [2.106, 2.106, 0.0]],
    atoms=[Atom(12, [0.0, 0.0, 0.0]),
           Atom(8,  [2.106, 2.106, 2.106])],
)

run_periodic_job(
    cell,
    basis="pob-tzvp",
    method="RKS",
    functional="LDA",
    output="mp_mp-1265",
)

# Provenance bundle (preserved in the SCF log header):
#   Source: materials_project mp-1265
#   URL:    https://next-gen.materialsproject.org/materials/mp-1265
#   License: CC-BY 4.0
#   Fetched: 2026-05-09T21:30:00+00:00
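
One practical detail when inspecting an emitted SPEC by hand: the file name mp_mp-1265.py contains a hyphen, so it cannot be loaded with a plain import statement. The sketch below loads a module by file path using the standard library's importlib. The helper name load_module_from_path is hypothetical, and nothing here is meant to describe how the regression suite itself discovers SPEC modules.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: str) -> ModuleType:
    """Import a Python file by path, even if its stem is not a valid identifier."""
    file = Path(path)
    # Derive a legal module name, e.g. "mp_mp-1265.py" -> "mp_mp_1265".
    name = file.stem.replace("-", "_")
    spec = importlib.util.spec_from_file_location(name, str(file))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {file}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module
```

With the repository root on sys.path (so that examples.regression.core.spec resolves), load_module_from_path("examples/regression/systems/periodic/mp_mp-1265.py").mp_mp_1265 would give access to the PeriodicSpec shown above.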

Combining with the regression suite

Once a SPEC lands in examples/regression/systems/periodic/, the regression suite picks it up automatically. Run the full matrix:

python -m examples.regression.run_suite \
    --include examples/regression/systems/periodic/mp_mp-1265.py \
    --output-md

…and the new system shows up alongside the hand-curated test set, with the per-source provenance footer included in the generated summary.md.

When NOT to use vqfetch

  • You already have a CIF on disk. Use ASE’s CIF reader directly: ase.io.read("my.cif") then build a PeriodicSystem from the result. vqfetch’s value is the fetch + provenance + emission round-trip, not the CIF parsing itself.

  • You want a structure NOT in any open database. vqfetch’s sources cover ~99% of well-studied solids; for exotic / proprietary structures, hand-build the PeriodicSystem and document provenance manually.

  • You need cell relaxation. vqfetch emits a static-SCF script; add optimize=True to the emitted call yourself if you want geometry optimisation. The fetched cell is the input geometry; the optimised cell is yours to record.
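
For the first case above (a CIF already on disk), a minimal sketch of the ASE route follows. It assumes that PeriodicSystem and Atom accept Cartesian coordinates in ångström and atomic numbers, as the emitted input script suggests; the helper names are hypothetical. The ase and vibeqc imports are kept inside the second function so that the conversion helper can be used and tested without either package installed.

```python
from typing import List, Tuple

LatticeVectors = List[List[float]]
AtomList = List[Tuple[int, List[float]]]


def cell_data_from_ase(atoms) -> Tuple[LatticeVectors, AtomList]:
    """Extract lattice vectors and (Z, xyz) pairs from an ase.Atoms object.

    ASE reports cell vectors and positions in Cartesian angstrom.
    """
    lattice = [[float(x) for x in row] for row in atoms.get_cell()[:]]
    sites = [
        (int(z), [float(x) for x in xyz])
        for z, xyz in zip(atoms.get_atomic_numbers(), atoms.get_positions())
    ]
    return lattice, sites


def periodic_system_from_cif(path: str):
    """Read a CIF with ASE and build a vibe-qc PeriodicSystem from it."""
    from ase.io import read                   # needs the ase package
    from vibeqc import Atom, PeriodicSystem   # names as used in the emitted script

    lattice, sites = cell_data_from_ase(read(path))
    return PeriodicSystem(
        lattice_vectors=lattice,
        atoms=[Atom(z, xyz) for z, xyz in sites],
    )
```

A structure built this way has no provenance bundle attached, so the origin of the CIF has to be recorded manually.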

Phase 3 (deferred to v0.9.0+)

These items are designed but not shipped in v0.8.0:

  • WebBook fallback for structures missing from MP / COD.

  • Bulk-sweep tooling (vqfetch list-candidates) for multi-candidate workflows.

  • OQMD as a primary provider (currently reachable via OPTIMADE federation, not as a top-level subcommand).

  • Cross-provider consistency check (alert when MP / COD / NOMAD disagree on the same canonical structure).

The vqfetch chat owns these; tracked in docs/roadmap.md § vqfetch Phase 3.

See also

  • reference_data.md — vqfetch’s CCCBDB integration for experimental reference data (atomization energies, ΔHf°, vibrational frequencies, IE).

  • docs/license.md — full per-source licensing + bundled-data inventory.