Yongzhong Xu

Studying the geometry of how neural networks learn.

AI / ML Research

I started with toy models — modular addition, modular multiplication — using the commutator defect to look for what predicted grokking. The signal was real, but local. Scaling up to TinyStories pretraining, I found the backbone story: training trajectories concentrate along a single dominant direction. In GPT2-124M, the same structure appears with two dominant directions.

The natural question was the gap between the backbone and the rest — when does the dominant direction separate cleanly from the bulk? Going back to the toy models with that question, σ₂/σ₃ on the rolling parameter-update Gram fell out as the right object.

From there, two further applications: the σ₂/σ₃ eigengap as a Stage III lock-in detector in Tian’s Li₂ framework, and per-task gradient SVD revealing 100–330× Linear-Centroid coupling. Others are in progress.

ModAdd+ModMul phase portrait showing grokking transition in spectral gap / non-commutativity phase space

Phase Portrait of Grokking

Left: Training trajectory in the (spectral gap, non-commutativity) plane, colored by training step. The system wanders in the competition zone (σ₁ ≈ σ₂) with near-zero accuracy, then the gap opens, non-commutativity peaks at step 15200, and the trajectory crosses into the alignment quadrant — at which point both grokking events occur (ModMul at step 16300, ModAdd at step 16400). Right: The same trajectory colored by test accuracy. Accuracy is essentially zero until the phase portrait dictates otherwise.

Phase space comparison: grokking (wd=1) vs memorizing (wd=0)

Grokking vs. Memorizing in Phase Space

Two runs, same architecture, same seed. With weight decay (left), the spectral gap opens and the system generalizes. Without weight decay (right), the gap never opens — the trajectory expands in the vertical direction but stays compressed along the spectral axis. The distinction between generalization and memorization is visible in the geometry of the training trajectory.

Notes & writeups

Writeup · mechanistic interpretability

Spectral fingerprints of an attention circuit during pretraining

An unsupervised spectral signal pre-identifies causally-relevant attention heads across an ~20× parameter scale range (51M synthetic to OLMoE 1B-7B), two architectures (dense + 64-expert MoE), and three pretraining datasets (Pile, FineWeb, DCLM). Universal cross-architecture finding: L0/L1 have zero BOS-classified heads in every model tested; a 3–4 head induction circuit is identified in every 1B-class model via a capability-specific screen at selectivity ≥ 50×.

Read Code

Research Overview

My research applies geometric and dynamical-systems methods — geometric flows, Morse theory, critical-point theory, variational methods — to questions about training dynamics, capability emergence, and representation structure.

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

Develops the analytical machinery for the spectral edge phenomena observed across earlier work: gap dynamics equations, a spectral loss decomposition, and an adiabatic parameter for training stability. Verified across modular arithmetic, Dyck-1, SCAN, and GPT2-class transformer training.

Note on the scope. The empirical predictions — phase transition timing, grokking onset — are robust across architectures and scales. The theoretical definition of the privileged subspace boundary k* remains an open problem. For simple datasets the definition is operationally clean; for complex datasets like natural language, the edge may be a more diffuse object without a single privileged direction. This is the next theoretical question the framework raises, not a limitation of the empirical results.

arXiv

Spectral-Edge Notes & Papers

Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

Per-task gradient SVD bridges Linear-Centroid features and spectral-edge optimizer dynamics with 100–330× coupling — 1–2 orders of magnitude stronger than the AdamW-update SVD used in prior work.

arXiv Note Code

Technical note · arXiv in preparation

Spectral Marker of Feature Interaction in Tian’s Li₂ Grokking Framework

Bridges to Yuandong Tian’s Li₂ grokking framework (arXiv:2509.21519): the σ₂/σ₃ eigengap on the rolling parameter-update Gram pinpoints Tian’s Stage III lock-in (fires at epoch 174 ± 1 in 15/15 grok seeds, 0/15 controls), with Theorem 6 sign-match saturating at the same epoch — mechanistic verification of when feature repulsion becomes dominant.

PDF Code

The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

At grokking, the spectral edge transitions from a gradient-driven functional axis to a weight-decay compression axis — flat under perturbation yet >4000× more ablation-critical than random directions.

arXiv Code

Spectral Edge Dynamics Reveal Functional Modes of Learning

Each spectral-edge direction induces a structured function over the input domain — a single Fourier mode for modular addition, a discrete-log mode for multiplication, a multi-mode family for subtraction — invisible to standard interpretability tools.

arXiv Code

Empirical Evidence

Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales

Scales spectral-edge analysis to GPT2-class transformer training (51M–124M parameters): the coherent/noise boundary follows a universal three-phase evolution and effective signal rank scales with task complexity.

arXiv Code

Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

Transformer trajectories under AdamW concentrate along a “backbone” direction capturing 60–80% of displacement from initialization; the effect vanishes under SGD, identifying the structure as optimizer-induced rather than gradient-induced.

arXiv Code

Global Low-Rank, Local Full-Rank: The Holographic Encoding of Learned Algorithms

Multi-task modular-arithmetic trajectories live in a 2–6 dimensional global subspace while individual weights remain full-rank: learned algorithms are globally low-rank in learning directions but locally full-rank in parameter space.

arXiv Code

The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure

Weight decay drives a phase diagram of multi-task grokking dynamics — staggered task ordering, universal trajectory integrability, holographic incompressibility, and transverse fragility — consistent with compression onto a superposition subspace.

arXiv Code

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

Training evolves within a single dominant component (68–83% variance) while orthogonal curvature spikes ahead of grokking with a power-law lead time; causal intervention confirms motion along the subspace is necessary for generalization.

arXiv Code

Early-Warning Signals of Grokking via Loss-Landscape Geometry

The commutator defect rises before grokking on SCAN and Dyck-1, accelerates grokking by 32–50% when amplified, and prevents it when suppressed — a causally implicated, architecture-agnostic early-warning signal.

arXiv Code

Low-Dimensional Execution Manifolds in Transformer Learning Dynamics

Transformer training on modular arithmetic collapses onto 3–4 dimensional execution manifolds, with most parameters absorbing optimization interference while computation occurs in a dramatically reduced subspace.

arXiv Code

Collaboration

I’m interested in how neural networks learn, and how they could learn better. I want the work to be useful. Everything I do comes out of that.

Collaborations are welcome — theoretical or empirical, on current questions or new ones. Visiting positions, joint research projects, or other engagements with researchers in mechanistic interpretability, alignment, or large-scale training all welcome.

Background

I received my Ph.D. in Mathematics from Rutgers University, where I worked with Abbas Bahri in nonlinear PDE and contact geometry. I was subsequently a Courant Instructor at the Courant Institute of Mathematical Sciences (NYU) and a member of the Institute for Advanced Study. I later served as a Distinguished Research Fellow and Associate Professor at Shanghai Jiao Tong University.

Research Monograph

Recent Progress in Conformal Geometry

Abbas Bahri and Yongzhong Xu

Imperial College Press · ICP Advanced Texts in Mathematics, Vol. 1 · 2007 · 524 pp.

Personal

Outside of work, I have been practicing meditation in the Nyingma tradition since sixteen. I enjoy hiking and cooking. I am a vegetarian.