SETOL: A Semi-Empirical Theory of (Deep) Learning

Over the past several years we’ve been studying deep neural networks using tools from complex systems, inspired by Per Bak’s self-organized criticality and the econophysics work of Didier Sornette (RG, critical cascades) and Jean-Philippe Bouchaud (heavy-tailed RMT).

Using WeightWatcher, we’ve measured hundreds of real models and found a striking pattern: their empirical spectral densities (ESDs) are heavy-tailed with robust power-law behavior, remarkably similar across architectures and datasets. The fitted exponents fall in a narrow, near-universal range (roughly 2 ≤ α ≤ 6, with the best-trained layers near α ≈ 2), highly suggestive of systems sitting near a critical point.
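For readers who want to reproduce this kind of measurement, here is a minimal sketch using the open-source WeightWatcher tool; the pretrained torchvision ResNet is just an example choice, any supported PyTorch or Keras model works:

import weightwatcher as ww
from torchvision import models

# Any pretrained model will do; ResNet-18 is only an example
model = models.resnet18(weights="IMAGENET1K_V1")

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()

# One row per layer; 'alpha' is the fitted power-law exponent of its ESD
print(details[["layer_id", "alpha"]])
print("mean alpha:", details["alpha"].mean())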

SETOL builds on this and provides something more unexpected: a derivation showing that trained networks at convergence behave as if they undergo a single step of the Wilson Exact Renormalization Group (ERG). This RG signature appears directly in the measured spectra.

What may interest complex-systems researchers most: the RG structure behind the detX condition and its alignment with the Heavy-Tailed Self-Regularization (HTSR) phenomenology, summarized in the overview figure below.

[Figure: SETOL overview (RG, detX, and HTSR alignment)]

For full details, see the SETOL paper: Semi-Empirical Theory of (Deep) Learning (SETOL).

RG View of a Layer: detX and the Wilson ERG Step

In SETOL, each trained layer is treated as a matrix model whose free energy is constrained by a Wilson-style Exact Renormalization Group (ERG) transformation. The key idea is that a scale-invariant transformation on the eigenvalues of WᵀW imposes a constraint on the tail of the spectrum: this is the detX condition. Concretely, the start of the tail is chosen so that the truncated correlation matrix has unit determinant, det X̃ = 1, or equivalently so that the log-eigenvalues in the tail sum to zero.
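To make the selection rule concrete, here is a toy sketch of the idea (an illustration, not the WeightWatcher internals); detX_tail_start is a hypothetical helper that scans a measured ESD for the point where the product of the tail eigenvalues crosses one:

import numpy as np

def detX_tail_start(evals):
    # Scan tails of increasing size, starting from the largest
    # eigenvalues, for the point where the product of the tail
    # eigenvalues is closest to one (sum of logs closest to zero)
    evals = np.sort(np.asarray(evals))[::-1]
    log_cumsum = np.cumsum(np.log(evals))
    idx = int(np.argmin(np.abs(log_cumsum)))
    return evals[idx]

# Example on a random Wishart-like matrix X = WᵀW / N
W = np.random.randn(500, 300)
evals = np.linalg.eigvalsh(W.T @ W / 500)
print("detX tail start:", detX_tail_start(evals))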

In practice, WeightWatcher exposes this through the option detX=True. When enabled, WeightWatcher computes the start of the detX region for each layer, marks it with a purple vertical line on the log-linear ESD plots, and records the result in the details dataframe (the detX_val column used in the code below).
Empirically we observe that by adjusting the layer learning rate, we can often achieve a configuration where the start of the power-law fit coincides with the start of the detX region and the fitted exponent satisfies α ≈ 2.

[Figure: Alignment of alpha and detX across learning rates]

In the SETOL paper, this appears as the alignment of the red vertical lines (start of the WeightWatcher power-law fit) and the purple vertical lines (start of the detX region) in the log-linear ESD plots. When the red and purple lines overlap and α ≈ 2, the ERG (detX) condition is effectively satisfied.


import weightwatcher as ww

# your_model: any supported PyTorch or Keras model
watcher = ww.WeightWatcher(model=your_model)

# Enable detX to evaluate the ERG constraint
details_df = watcher.analyze(plot=True, detX=True)

# Layers near the ERG fixed point: alpha ≈ 2 and detX_val ≈ 0
# (tolerances here are illustrative)
erg_layers = details_df[
    ((details_df["alpha"] - 2).abs() < 0.1)
    & (details_df["detX_val"].abs() < 1e-2)
]


Effective Correlation Space (ECS) and Truncated SVD

SETOL also motivates the Effective Correlation Space (ECS) hypothesis. The idea is that when a layer has α ≈ 2 and satisfies the detX condition, its meaningful correlations live in a low-dimensional subspace that can be captured by a truncated SVD with a rank determined by the power-law tail.
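As a concrete picture of the truncation, here is a minimal numpy sketch; truncate_layer is a hypothetical helper, not the WeightWatcher implementation, and in practice the rank k would be set by the number of eigenvalues in the fitted power-law tail:

import numpy as np

def truncate_layer(W, k):
    # Keep only the top-k singular components of a weight matrix:
    # the rank-k approximation spanning the Effective Correlation Space
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

# Example: truncate a random 512x256 'layer' to rank 40
W = np.random.randn(512, 256)
W_ecs = truncate_layer(W, k=40)
print(np.linalg.matrix_rank(W_ecs))  # 40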

Experimentally, we have been using WeightWatcher to test this: when a layer reaches α = 2 (within numerical precision), we truncate its SVD to the rank set by the power-law tail and compare the resulting model against the original on the same train and test sets.

The surprising result: for such layers, the test accuracy of the full model and the test accuracy of the model with the truncated layer are essentially identical. The differences in train error, test error, and generalization gap all go to zero. In other words, we can compress the layer without retraining.

[Figure: Effective Correlation Space (ECS) schematic]

This is strong evidence that, at α ≈ 2 with detX satisfied, the layer is sitting in an Effective Correlation Space where the RG/ERG picture is not just a metaphor, but quantitatively predictive. This can be implemented using the (experimental) SVDSmoothing() method:


import weightwatcher as ww

watcher = ww.WeightWatcher(model=your_model)

# Returns a copy of the model with each analyzed layer replaced by
# its low-rank (ECS) approximation, without any retraining
new_model = watcher.SVDSmoothing()

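To check the ECS prediction end to end, one can then compare the smoothed model against the original on held-out data; a minimal sketch, where evaluate() and test_loader are hypothetical placeholders for your own accuracy function and test set:

# evaluate() and test_loader are placeholders, not WeightWatcher APIs
full_acc = evaluate(your_model, test_loader)
smoothed_acc = evaluate(new_model, test_loader)

# SETOL predicts these match when alpha ≈ 2 and detX_val ≈ 0
print(f"full: {full_acc:.4f}  truncated: {smoothed_acc:.4f}")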

In our experiments on a simple 3-layer MLP, this ECS-based truncation reproduces the original test accuracy almost exactly when α ≈ 2 and detX_val ≈ 0 for the target layer, confirming the SETOL prediction at that RG fixed point.