WW-PGD: WeightWatcher Projected Gradient Descent

Status: Experimental  |  GitHub  |  WeightWatcher

Abstract

WW-PGD is a lightweight spectral projection add-on for PyTorch optimizers (SGD, Adam, AdamW, Muon, etc.). It wraps your existing optimizer and periodically projects each layer’s weight spectrum toward a critical heavy-tailed manifold motivated by HTSR/SETOL: the tail exponent α is driven toward α ≈ 2, and the SETOL ERG condition, namely that the trace of log(λ) over the tail equals 0 (equivalently detX = 1), is enforced on the tail subspace. In practice, WW-PGD uses WeightWatcher diagnostics (detX_num and num_pl_spikes) to select the tail via a midpoint rule at each projection step (epoch or batch boundary).


1 · What WW-PGD Does

1.1 “Add-on” optimizer design

WW-PGD does not replace your optimizer. Instead, it wraps the existing optimizer (SGD, Adam, AdamW, Muon, etc.), lets it take its normal gradient steps, and periodically applies a spectral projection to each layer’s weights at epoch or batch boundaries.

1.2 Tail selection via WeightWatcher midpoint rule

At each projection step, WeightWatcher provides two per-layer tail-size diagnostics:
  • detX_num: the number of tail eigenvalues selected by the SETOL detX = 1 (ERG) condition.
  • num_pl_spikes: the number of eigenvalues assigned to the power-law tail by the HTSR fit.

WW-PGD selects the working tail size using a midpoint rule: k_mid = floor((detX_num + num_pl_spikes)/2). As the model approaches the critical regime (α → 2), SETOL predicts these two quantities converge, so the midpoint becomes effectively exact.
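For reference, the rule is simple enough to state in a few lines. The sketch below is ours, not part of the ww_pgd API; it assumes the diagnostics come from a WeightWatcher analyze() call (the detX flag shown is an assumption about how to enable the detX columns; WW-PGD computes these diagnostics internally).

import math
import weightwatcher as ww

def midpoint_tail_size(detX_num: int, num_pl_spikes: int, min_tail: int = 1) -> int:
    # k_mid = floor((detX_num + num_pl_spikes) / 2), with a floor on the tail size
    return max(min_tail, math.floor((detX_num + num_pl_spikes) / 2))

watcher = ww.WeightWatcher(model=model)   # `model` is your PyTorch model
details = watcher.analyze(detX=True)      # per-layer diagnostics as a DataFrame

for layer_id, row in details.iterrows():
    print(layer_id, midpoint_tail_size(int(row["detX_num"]), int(row["num_pl_spikes"])))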

1.3 Projection target: HTSR + SETOL critical conditions

The projection targets the two critical conditions described in the abstract: the HTSR tail exponent is driven toward α ≈ 2, and the SETOL ERG condition is enforced on the tail subspace.
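Written out, with λ_1, …, λ_k the tail eigenvalues of a layer’s spectrum (our notation, following the terminology above), the ERG condition is

    Σ_{i=1..k} log λ_i = 0   ⇔   Π_{i=1..k} λ_i = detX = 1,

i.e. the product of the tail eigenvalues equals one, which is exactly what the retraction step in 1.4 enforces.
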
1.4 How the projection is applied (high level)

For each layer weight matrix, WW-PGD:

  1. Computes an SVD / eigen-spectrum (only the tail is modified).
  2. Constructs a rank-ordered power-law template for the tail (target α schedule, never targeting α < 2).
  3. Applies a stable Cayley-style update in log-eigenvalue space.
  4. Retracts to satisfy the ERG trace-log condition on the tail.
  5. Reconstructs the weight matrix and blends it back into the model.
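
To make the five steps concrete, here is a minimal, self-contained sketch for a single 2-D weight matrix, assuming the tail size k has already been chosen by the midpoint rule. The function name, its arguments, and the plain damped update in log-eigenvalue space (standing in for the library’s Cayley-style step) are illustrative, not ww_pgd internals.

import torch

def project_tail(W: torch.Tensor, k: int, alpha_target: float = 2.0,
                 cayley_eta: float = 0.25, blend_eta: float = 0.5) -> torch.Tensor:
    """Illustrative tail projection for one 2-D weight matrix."""
    # 1. Spectrum: only the k largest eigenvalues (the tail) are modified.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    lam = S ** 2                               # eigenvalues corresponding to the singular values
    tail = lam[:k]

    # 2. Rank-ordered power-law template: lambda_(i) ~ i^(-1/(alpha-1)),
    #    scaled to the current largest eigenvalue; alpha_target is never < 2.
    ranks = torch.arange(1, k + 1, dtype=lam.dtype, device=lam.device)
    template = ranks ** (-1.0 / (alpha_target - 1.0)) * tail[0]

    # 3. Damped update in log-eigenvalue space (stand-in for the Cayley-style step).
    log_tail = (1 - cayley_eta) * torch.log(tail) + cayley_eta * torch.log(template)

    # 4. Retraction: shift so that sum(log lambda) = 0 on the tail, i.e. detX = 1.
    log_tail = log_tail - log_tail.mean()

    # 5. Reconstruct the weights and blend them back into the model.
    S_new = S.clone()
    S_new[:k] = torch.exp(0.5 * log_tail)      # singular values = sqrt(eigenvalues)
    W_proj = U @ torch.diag(S_new) @ Vh
    return (1 - blend_eta) * W + blend_eta * W_proj

In the library itself, this whole step is triggered through WWPGDWrapper.apply_tail_projection (see 3.2).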

2 · Results & Figures

2.1 MNIST: Plain vs Augmented Test

The following plots show mean ± std across runs for: plain test accuracy, augmented test accuracy, and layer-wise α trajectories from WeightWatcher. The augmented evaluation uses mild, in-distribution perturbations (small rotation/translation + light blur), intended as a robustness proxy.
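For concreteness, an augmented-test transform of this kind can be written with torchvision; the parameter values below are placeholders, not the exact settings used for these plots.

from torchvision import transforms

# Mild, in-distribution perturbations used as a robustness proxy:
# small rotation/translation plus light blur (values are placeholders).
augmented_eval_tf = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05)),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
])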

Figure 1. MNIST plain test accuracy (mean ± std): Baseline vs WW-PGD.
Figure 2. MNIST augmented test accuracy (mean ± std): Baseline vs WW-PGD.
Figure 3. Layer-wise HTSR exponent α (mean ± std): WW-PGD tends to stabilize α trajectories toward the critical regime.

2.2 FashionMNIST summary (QuickStart notebook)

The FashionMNIST experiments are documented in the QuickStart notebook: WW_PGD_QuickStart.ipynb.

Final results (epoch 35, mean ± std)
  • Baseline: plain = 98.05% ± 0.13, augmented = 96.24% ± 0.17
  • WW-PGD: plain = 97.99% ± 0.17, augmented = 96.23% ± 0.20

Interpretation (early read): WW-PGD matches the baseline within one standard deviation on both plain and augmented accuracy, so these runs show no measurable accuracy cost from the projection, but also no clear robustness gain yet.


3 · How to Use WW-PGD

3.1 Install

pip install git+https://github.com/CalculatedContent/WW_PGD.git

3.2 Minimal usage (wrap any optimizer)

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

import ww_pgd

# Toy model and data so the example runs end-to-end (replace with your own).
model = nn.Linear(10, 10)
loader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 10, (256,))),
    batch_size=32,
    shuffle=True,
)
num_epochs = 10

# Any standard optimizer works as the base optimizer.
base_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Tail-projection configuration (see 3.3 Practical knobs).
cfg = ww_pgd.WWTailConfig(
    warmup_epochs=0,
    ramp_epochs=5,
    min_tail=5,
    blend_eta=0.5,
    cayley_eta=0.25,
)

# Wrap the base optimizer with WW-PGD.
opt = ww_pgd.WWPGDWrapper(model, base_opt, cfg)

for epoch in range(num_epochs):
    for xb, yb in loader:
        loss = F.cross_entropy(model(xb), yb)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()

    # Project the tail spectrum once per epoch (epoch-boundary projection).
    opt.apply_tail_projection(epoch=epoch, num_epochs=num_epochs)

3.3 Practical knobs

  • warmup_epochs / ramp_epochs: delay the first projection and ramp its strength over the early epochs.
  • min_tail: a floor on the tail size chosen by the midpoint rule.
  • blend_eta: how strongly the projected weights are blended back into the model.
  • cayley_eta: step size of the Cayley-style update in log-eigenvalue space.
  • Projection frequency: epoch-boundary projections (as in 3.2) keep overhead low; batch-boundary projections are also possible but more expensive.

4 · Performance Notes & Feedback

WW-PGD can be slower than plain training because it performs spectral analysis and reconstruction. This overhead is the main reason we use warmup, ramping, and (often) epoch-boundary projections rather than per-step projections.

We are actively working on performance improvements (faster decompositions, better batching of diagnostics, and more selective projection policies). Feedback is valuable—especially cases where: