Γ-CCM (aiccm2026dev-a) dense four-center scalability — analysis & strategy

Companion to handovers/HANDOVER_AICCM_A_SCALABILITY.md. This is the Phase 1 (analysis) + Phase 2 (strategy) deliverable for the aiccm2026dev-a (Γ-CCM) memory workstream: why the dense symmetric four-center OOMs on real 3-D crystals, and the surveyed phased plan to make it fit in RAM.

Status: analysis + strategy, for review. No code-behaviour change yet. The recommended plan must be reviewed before any implementation (Phase 3) — see the handover mandate (CLAUDE.md §11/§14).

All numbers below are analytical (cheap: N_pad and the implied tensor bytes need no ERI build) or from tiny-cell (1-D H-chain) memory profiles run under a hard RSS cap. The genuine c-diamond (2,2,2) reproduction is a declared-cap vq job (id recorded in the handover), not a local run — a parallel run already spiked this shared box to 137 GB.


1. Phase 1 — analysis: where the memory goes

1.1 The failure path (confirmed by reading the source)

run_ccm_rhf (periodic/ccm/scf.py:117)
  -> _ccm_eri_for_method (scf.py:45)
    -> ccm_eri_symmetric / ccm_eri (periodic/ccm/padded.py:361 / :289)
      -> eri_pad = np.asarray(compute_eri(pad.basis))    <-- the killer
         # pad.basis is the PADDED cluster: home cell + every ±2t image cell.

The pipeline has two large allocations (#1 and #2, in order of severity); the J/K contraction (#3) is listed for completeness:

| # | Allocation | Site | Size | Severity |
|---|---|---|---|---|
| 1 | Padded ERI compute_eri(pad.basis) | padded.py:321 / :401 (C++ core) | N_pad⁴ × 8 B | the wall |
| 2 | Folded effective tensor eff | padded.py:333/:411, returned to scf.py | N_ref⁴ × 8 B | secondary wall (pob 3-D) |
| 3 | J/K build einsum("mnrs,rs->mn", eff, D) | scf.py:157–158 | works on eff, O(N_ref²) extra | not a wall |

  • N_pad = N_ref_ao × N_eri_cells, where N_eri_cells = len(eri_cells(ccm)) is the number of ±2t image cells the four-center fold materialises (padded.eri_cells).

  • The padded ERI (#1) is allocated inside the C++ libint core (compute_eri), so a Python tracemalloc does not see it — process RSS does. It is dense and screening-free (a validation builder).

  • The folded eff (#2) is the object scf.py actually contracts; it is N_ref⁴, independent of the image count, and is what survives once the padded ERI is freed.

1.2 Analytical scaling table (the wall)

N_ref_ao (= ccm.nbf), N_eri_cells, N_pad, dense-padded-ERI (N_pad⁴·8), folded-effective-ERI (N_ref⁴·8) for the canonical test cells. STO-3G unless noted. Computed from geometry only — no ERI built.

| system | nrep | basis | n_atoms | N_ref | wssc cells | eri cells | N_pad | padded ERI | folded ERI |
|---|---|---|---|---|---|---|---|---|---|
| h-chain | (1,1,1) | sto-3g | 2 | 2 | 1 | 1 | 2 | 128 B | 128 B |
| h-chain | (2,1,1) | sto-3g | 4 | 4 | 3 | 5 | 20 | 1.2 MiB | 2 KiB |
| h-chain | (4,1,1) | sto-3g | 8 | 8 | 3 | 5 | 40 | 19.5 MiB | 32 KiB |
| h-chain | (8,1,1) | sto-3g | 16 | 16 | 3 | 5 | 80 | 0.31 GiB | 0.5 MiB |
| c-diamond | (1,1,1) | sto-3g | 2 | 10 | 7 | 25 | 250 | 29.1 GiB | 78 KiB |
| c-diamond | (2,1,1) | sto-3g | 4 | 20 | 11 | 45 | 900 | 4.77 TiB | 1.2 MiB |
| c-diamond | (2,2,1) | sto-3g | 8 | 40 | 25 | 117 | 4680 | 3490 TiB | 19.5 MiB |
| c-diamond | (2,2,2) | sto-3g | 16 | 80 | 25 | 117 | 9360 | 55 800 TiB | 0.31 GiB |
| si-diamond | (1,1,1) | sto-3g | 2 | 18 | 7 | 25 | 450 | 306 GiB | 0.8 MiB |
| si-diamond | (2,1,1) | sto-3g | 4 | 36 | 11 | 45 | 1620 | 50.1 TiB | 12.8 MiB |
| si-diamond | (2,2,2) | sto-3g | 16 | 144 | 25 | 117 | 16848 | 586 000 TiB | 3.2 GiB |
| mgo | (1,1,1) | sto-3g | 2 | 14 | 13 | 57 | 798 | 2.95 TiB | 0.3 MiB |
| mgo | (2,1,1) | sto-3g | 4 | 28 | 17 | 79 | 2212 | 174 TiB | 4.7 MiB |
| mgo | (2,2,2) | sto-3g | 16 | 112 | 27 | 125 | 14000 | 280 000 TiB | 1.17 GiB |
| cscl | (2,2,2) | pob-tzvp-rev2 | 16 | 176 | | 125 | 22000 | 1.7e6 TiB | 7.1 GiB |
| c-diamond | (2,2,2) | pob-tzvp-rev2 | 16 | 288 | | 117 | 33696 | 9.4e6 TiB | 51.3 GiB |
| mgo | (2,2,2) | pob-tzvp-rev2 | 16 | 296 | | 125 | 37000 | 1.4e7 TiB | 57.2 GiB |

AO density: STO-3G is 1 (H) / 5 (C) / 7 (MgO avg) / 9 (Si) AO/atom; pob-tzvp-rev2 is ~11–18 AO/atom.
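The table's size columns follow from two integers per row, so they can be recomputed without touching the ERI code. A minimal sketch, assuming only the relations stated above (N_pad = N_ref × eri cells, 8 bytes per element); the helper names `dense_eri_bytes` and `human` are illustrative and not part of the codebase:

```python
"""Analytical size of the padded and folded four-center tensors (no ERI build)."""

BYTES_PER_DOUBLE = 8


def dense_eri_bytes(n_ao: int) -> int:
    """Bytes of a dense rank-4 float64 tensor with side length n_ao."""
    return n_ao**4 * BYTES_PER_DOUBLE


def human(n_bytes: float) -> str:
    """Format a byte count with binary (1024-based) units."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n_bytes < 1024:
            return f"{n_bytes:,.1f} {unit}"
        n_bytes /= 1024
    return f"{n_bytes:,.1f} TiB"


# (system, nrep, N_ref, eri cells) copied from STO-3G rows of the table.
ROWS = [
    ("h-chain", (8, 1, 1), 16, 5),
    ("c-diamond", (1, 1, 1), 10, 25),
    ("c-diamond", (2, 1, 1), 20, 45),
    ("c-diamond", (2, 2, 2), 80, 117),
]

for system, nrep, n_ref, n_eri_cells in ROWS:
    n_pad = n_ref * n_eri_cells  # padded-cluster AO count
    padded = human(dense_eri_bytes(n_pad))  # allocation #1
    folded = human(dense_eri_bytes(n_ref))  # allocation #2
    print(f"{system:10s} {nrep} N_pad={n_pad:5d} padded={padded} folded={folded}")
```

Because the padded size grows as (N_ref × eri cells)⁴, the image-cell count enters at the fourth power, which is why the 3-D cells leave the GiB range at the first supercell.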

Where the wall is.

  • Dense padded ERI (#1) dies at the first real 3-D cell. c-diamond (1,1,1) STO-3G already needs 29 GiB; (2,1,1) needs 4.8 TiB; (2,2,2) needs 55 800 TiB. Every dense-four-center route (aiccm-hf, -ks, -viz, -localize, -mp2, -ccsd) inherits this and OOMs identically — this is the reproduced blocker.

  • Folded eff (#2) is fine at STO-3G but is the second wall at pob-tzvp-rev2 3-D. c-diamond/MgO (2,2,2) pob-tzvp-rev2 fold to a 51–57 GiB eff tensor even if #1 were solved — so a fix that only removes the padded ERI still leaves a near-64-GB ceiling at production basis. An integral-direct J/K (never form eff) removes both walls; anything that still materialises eff only removes #1.

1.3 Tiny-cell memory profile (confirms the dominant allocation + exponent)

1-D H-chain STO-3G, the only family safe to actually build locally (largest dense ERI here is 80⁴·8 = 312 MiB), under a 4 GB in-process RSS watchdog:

| nrep | N_ref | N_pad | compute_eri tensor (exact) | process RSS high-water |
|---|---|---|---|---|
| (2,1,1) | 4 | 20 | 1.221 MiB | 115.6 MiB |
| (4,1,1) | 8 | 40 | 19.531 MiB | 135.6 MiB |
| (8,1,1) | 16 | 80 | 312.500 MiB | 448.3 MiB |

  • The dominant allocation is unambiguously compute_eri’s N_pad⁴ padded tensor: its exact size scales as N_pad → 2·N_pad ⇒ tensor ×16, a fitted exponent of 4.000 (tensor ∝ N_pad⁴). Process RSS tracks it (the ~313 MiB tensor adds ~313 MiB to the high-water at the largest case).

  • tracemalloc (Python allocator) reports ~0 for compute_eri because the tensor is a C++-core allocation — RSS / .nbytes are the correct instruments, and both agree with the N_pad⁴ law.
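The exponent and the RSS step quoted in the bullets can be re-derived from the three profile rows alone. The sketch below is an illustration using only the tabulated values; `fitted_exponent` is a hypothetical helper, not the profiling script used for the measurements:

```python
import math

# (N_pad, exact tensor size in MiB, RSS high-water in MiB) from the H-chain profile.
PROFILE = [
    (20, 1.221, 115.6),
    (40, 19.531, 135.6),
    (80, 312.500, 448.3),
]


def fitted_exponent(points):
    """Least-squares slope of log(tensor size) against log(N_pad)."""
    xs = [math.log(n_pad) for n_pad, _, _ in points]
    ys = [math.log(size) for _, size, _ in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


exponent = fitted_exponent(PROFILE)

# RSS growth between the two largest cases, to compare with the 312.5 MiB tensor.
rss_step = PROFILE[2][2] - PROFILE[1][2]

print(f"fitted exponent = {exponent:.3f}")
print(f"RSS step (4,1,1) -> (8,1,1) = {rss_step:.1f} MiB")
```

The fit uses the rounded MiB values from the table, so the slope is 4 only to within that rounding.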

1.4 The lean counter-examples (quantified)

The fix is not hypothetical — two routes already avoid the N_pad⁴ tensor:

  1. run_ccm_rhf_gdf (RI / multi-k GDF, ri.py). Because CCM ≡ SCM-Γ, this runs the validated native multi-k GDF on the unit cell with the nrep k-mesh — so its working AO dimension is the unit-cell nbf (10 for c-diamond, not the supercell’s 80), and its 3-index RI tensor is N_aux × n_unit² × n_k — a few MiB k-resolved cderi, never N_pad⁴:

    | system | unit nbf | GDF 3-index RI (order) | dense padded ERI |
    |---|---|---|---|
    | c-diamond (2,2,2) | 10 | ~3 MiB | 55 800 TiB |
    | mgo (2,2,2) | 14 | ~8 MiB | 280 000 TiB |

    The handover records it ran the same c-diamond (2,2,2) in ~6 GB / ~23 min, exact under CCM ≡ SCM-Γ (E/atom −37.4155 Ha). This is the existence proof that lean and exact is achievable. (The submitted vq job re-confirms the peak RSS on a fresh box.)

  2. Symmetry-unique atom pairs (symmetry.py ccm_symmetry_unique_atom_pairs). The cluster-invariant space group reduces the ordered home atom-pair count that the ERI/Fock build must touch — measured on the (2,2,2) cells:

    | system | pairs | unique | reduction | cluster group order |
    |---|---|---|---|---|
    | c-diamond | 256 | 19 | 13.47× | 48 |
    | si-diamond | 256 | 19 | 13.47× | 48 |
    | mgo | 256 | 32 | 8.0× | 48 |

    An unused lever: only one representative pair per orbit needs its block built, the rest follow by the AO rotation P. It is a constant-factor (not complexity-class) win, so it stacks on top of an integral-direct fix rather than replacing it.
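To make the orbit idea in item 2 concrete, here is a toy sketch of how ordered atom pairs collapse into symmetry orbits. It is not the implementation in symmetry.py: `unique_pair_orbits` is a hypothetical helper, and the four-atom ring with its cyclic rotations is an invented example standing in for the cluster-invariant space group.

```python
from itertools import product


def unique_pair_orbits(n_atoms, perms):
    """Partition ordered atom pairs (a, b) into orbits under atom permutations.

    perms is a list of tuples g with g[i] = image of atom i. It is assumed to
    be a full group (identity included, closed under composition), so one pass
    over the group elements yields the complete orbit of a pair.
    """
    seen = set()
    orbits = []
    for a, b in product(range(n_atoms), repeat=2):
        if (a, b) in seen:
            continue
        orbit = {(g[a], g[b]) for g in perms}
        seen |= orbit
        orbits.append(sorted(orbit))
    return orbits


# Toy system: 4 atoms on a ring, symmetry group = the 4 cyclic rotations.
n = 4
rotations = [tuple((i + shift) % n for i in range(n)) for shift in range(n)]

orbits = unique_pair_orbits(n, rotations)
representatives = [orbit[0] for orbit in orbits]  # one block to build per orbit

print(f"ordered pairs: {n * n}, unique: {len(orbits)}, "
      f"reduction: {n * n / len(orbits):.2f}x")
print("representatives:", representatives)
```

Only the representative pairs would need their integral blocks built; the remaining pairs in each orbit are images under a group element, which is the role the AO rotation P plays in the text.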

Aside (not in scope here): build_padded_cluster raises KeyError(0) on a pob-tzvp-rev2 cell (cscl) — an ECP/ghost-atom basis-mapping bug in the padded assembly, independent of memory. It belongs to the method/basis chain (HANDOVER_AICCM_FOLLOWON.md), not this workstream; flagged for them. The pob N_pad rows above use the analytical N_pad = nbf × eri_cells, which is unaffected.



3. Phase 3 — implementation

3b — integral-direct J/K in the C++ kernel (DONE, 2026-06-25)

Maintainer-approved ordering: 3b first (the review chose the integral-direct C++ kernel over the Phase-3a block-batched Python fold). After FR-2 routed aiccm-viz/-localize/-pao to run_ccm_rhf_scalable, that driver is the production HF path and already ran STO-3G 3-D — so the sharp remaining blocker is wall #2 inside its C++ kernel (build_jk_ccm_weighted): both production methods (bra_home_full for union12, aiccm2026dev-a for the method of record) allocated a thread-local effective tensor Vj_tls[n_threads] of nbf²×nbf², then reduced to V_full, symmetrised to V_sym, and contracted — a peak of (n_threads+2)·N_ref⁴·8 bytes (~300 GiB on a pob-tzvp-rev2 c-diamond (2,2,2) cell at nbf=288).
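The quoted peak is easy to reproduce from the formula (n_threads+2)·N_ref⁴·8. The sketch below evaluates it for nbf = 288 over a few thread counts; the thread counts are assumptions chosen for illustration (the text does not state which value produced the ~300 GiB figure), and `full_kernel_peak_bytes` is a hypothetical name:

```python
GIB = 2**30


def full_kernel_peak_bytes(nbf: int, n_threads: int) -> int:
    """Peak of the full-tensor kernel: n_threads thread-local copies of the
    nbf^2 x nbf^2 tensor, plus V_full and V_sym, at 8 bytes per element."""
    return (n_threads + 2) * nbf**4 * 8


nbf = 288  # pob-tzvp-rev2 c-diamond (2,2,2), from the text
for n_threads in (1, 2, 4, 8, 16):  # assumed values, for illustration only
    peak = full_kernel_peak_bytes(nbf, n_threads)
    print(f"n_threads={n_threads:2d}  peak = {peak / GIB:7.1f} GiB")
```

The peak grows linearly with the thread count on top of the nbf⁴ term, so adding cores makes the full-tensor kernel harder to fit rather than easier.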

What landed. Two new C++ kernels in cpp/src/periodic_fock.cpp, aiccm2026dev-a-direct and bra_home_full-direct, reuse the identical verified quartet loop and WSSC weight, but fold each weighted block straight into thread-local J/K instead of Vj. The fold is exact: with V_sym = ½(V + Vᵀ) and the full-branch contractions J[μν] = Σ_λσ P[λσ] V_sym[μν,λσ], K[μν] = Σ_λσ P[λσ] V_sym[μσ,λν], a single block t = (μν|λσ)·w contributes

J[μ,ν] += ½ t P[λ,σ]      J[λ,σ] += ½ t P[μ,ν]      (the V and Vᵀ halves)
K[μ,σ] += ½ t P[λ,ν]      K[λ,ν] += ½ t P[μ,σ]

followed by the same final Hermitisation. Peak JK-build memory drops from (n_threads+2)·N_ref⁴·8 to n_threads·2·N_ref²·8. The builder is already rebuilt every SCF iteration (no V cache — CCMWeightedGammaJKBuilder::build_g_rhf calls build_jk_ccm_weighted(…D…) each call), so this is zero recompute penalty — same integrals, no tensor, and it drops the separate nbf⁴ contraction pass too.
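The four update rules can be checked numerically against the dense contraction. The following NumPy sketch is a small-scale illustration, not the C++ kernel: V is a random rank-4 array standing in for the weighted blocks t, P is a random symmetric matrix, and single elements play the role of shell-quartet blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# V[mu, nu, lam, sig] stands in for the weighted block t = (mu nu|lam sig) * w.
# No bra/ket symmetry is assumed, so the V and V^T halves really differ.
V = rng.standard_normal((n, n, n, n))
P = rng.standard_normal((n, n))
P = P + P.T  # symmetric density stand-in

# Full branch: materialise V_sym = 1/2 (V + V^T), then contract.
V_sym = 0.5 * (V + V.transpose(2, 3, 0, 1))   # V^T swaps the bra and ket pairs
J_full = np.einsum("mnls,ls->mn", V_sym, P)   # J[mn] = sum_ls P[ls] V_sym[mn,ls]
K_full = np.einsum("msln,ls->mn", V_sym, P)   # K[mn] = sum_ls P[ls] V_sym[ms,ln]

# Direct branch: fold every block straight into J and K; no rank-4 V_sym exists.
J = np.zeros((n, n))
K = np.zeros((n, n))
for mu, nu, lam, sig in np.ndindex(n, n, n, n):
    t = V[mu, nu, lam, sig]
    J[mu, nu] += 0.5 * t * P[lam, sig]    # V half
    J[lam, sig] += 0.5 * t * P[mu, nu]    # V^T half
    K[mu, sig] += 0.5 * t * P[lam, nu]    # V half
    K[lam, nu] += 0.5 * t * P[mu, sig]    # V^T half

print("J matches:", np.allclose(J, J_full))
print("K matches:", np.allclose(K, K_full))
```

The comparison is made before the final Hermitisation, which is applied identically in both branches and so does not affect the equality. In this sketch the direct loop holds only two n×n accumulators, which is the source of the N_ref² scaling quoted above.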

run_ccm_rhf_scalable gains a four_center= keyword: scf.py maps "direct" (default → the -direct kernels) or "full"/"dense" (→ the preserved full-tensor kernels, the small-cluster comparison reference). The dense Python run_ccm_rhf and the full C++ branches are untouched — the full four-center path stays runnable for comparison (maintainer constraint, 2026-06-25).
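The keyword-to-kernel mapping described here can be pictured as a small resolver. This is a hypothetical sketch only: the function name `resolve_cxx_method` and its signature are invented for illustration, and the real logic lives in scf.py, whose exact interface is not shown in this document. It assumes the direct kernels are named by appending "-direct" to the full kernel name, as the two kernel names above suggest.

```python
def resolve_cxx_method(method: str, four_center: str = "direct") -> str:
    """Map a production method name and a four_center mode to a kernel name.

    Hypothetical helper: "direct" selects the tensor-free kernel, while
    "full" and "dense" both select the preserved full-tensor kernel.
    """
    mode = four_center.lower()
    if mode == "direct":
        return f"{method}-direct"
    if mode in ("full", "dense"):
        return method
    raise ValueError(
        f"four_center must be 'direct', 'full' or 'dense', got {four_center!r}"
    )


print(resolve_cxx_method("aiccm2026dev-a"))           # default mode
print(resolve_cxx_method("bra_home_full", "dense"))   # comparison reference
```

Rejecting unknown modes explicitly keeps a typo from silently falling back to the O(nbf⁴) kernel.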

Verification (gated, byte-for-byte).

| cell | basis | direct vs full | note |
|---|---|---|---|
| He square (2,2,1) 2-D | sto-3g | ΔE … | |
| H₄ chain (4,1,1) 1-D | sto-3g | ΔE … | |
| c-diamond (2,2,2) 3-D | sto-3g | ΔE … | |

Tests in tests/test_ccm_scalable.py: test_scalable_direct_matches_full (parametrised, the byte-for-byte gate) and test_scalable_direct_memory_regression (slow; the c-diamond (2,2,2) RSS contrast — direct stays O(nbf²) while full allocates the O(nbf⁴) tensor, same energy). Full CCM suite green (185 tests). The authoritative production-basis (pob-tzvp-rev2, ~300 GiB full vs lean direct) reproduction is a vq job (testing chat, §0).

Consumer adoption. properties.py (viz figures) and convergence.py route through run_ccm_rhf_scalable → they pick up four_center="direct" automatically. dft.py (run_ccm_rks/run_ccm_uks) builds the JK builder directly; it now takes the same four_center="direct" default via the shared scf._ccm_scalable_cxx_method helper, so KS-CCM scales too (landed with maintainer authorization to cross the §0 method-layer boundary for this one-liner). The post-HF stack (mp2/ccsd) still rides the dense Python run_ccm_rhf — out of scope here (small-cluster + the byte-for-byte reference); RI-MP2 via the neutral cderi is the 3d route for those.

3a / 3c / 3d — status

  • 3a (block-batched Python fold): demoted. With scalable as the production path, the dense Python run_ccm_rhf now serves only mp2/ccsd small-clusters + the reference; 3b solved the production blocker directly. 3a remains available if the dense Python path itself needs STO-3G 3-D.

  • 3c (symmetry-unique pairs, 13.5×): unchanged plan — a constant-factor throughput win that stacks on the 3b quartet loop; deferred.

  • 3d (RI/GDF for consumers): cross-chat, opt-in; the GDF route is the 3-D accuracy path (the bare four-center is a 3-D model number). Unchanged.