Publications
* denotes equal contribution
UNDER REVIEW
Duality of Attention Sinks: Two Algorithms, Two Solutions
Lukas Fesser*,
Mozes Jacobs*,
Thomas Fel*,
Andy Keller,
Sham Kakade
Attention sinks share a visual signature but hide two distinct algorithms: nop, where a head suppresses its update by routing to a null token, and broadcast, where a sink aggregates and redistributes global information. Each mechanism leaves distinct traces — nop sinks have negligible value norms; broadcast sinks induce low-rank outputs — which we use to derive practical diagnostics. Applied to pretrained vision transformers, we find both mechanisms coexist at scale. Gating and registers, the two dominant interventions, each implicitly target only one mechanism; combining them yields complementary gains. Training LeJepa with both gating and registers improves downstream semantic segmentation performance beyond either alone.
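A minimal sketch of the two diagnostics, assuming access to one head's attention weights and value vectors from a hooked ViT block (the function name, the 0.5 sink-mass cutoff, and the rank tolerance are illustrative choices, not the paper's exact procedure):

```python
import torch

def sink_diagnostics(attn, V, sink_idx=0, rank_tol=1e-3):
    """Heuristic read-out of an attention-sink head.

    attn: (tokens, tokens) attention weights for one head
    V:    (tokens, d_head) value vectors for the same head
    """
    # nop signature: the sink token contributes a near-zero value vector.
    value_norm = V[sink_idx].norm().item()
    # broadcast signature: outputs of sink-dominated queries are low rank.
    out = attn @ V                          # per-query head output
    mask = attn[:, sink_idx] > 0.5          # queries dominated by the sink
    if mask.any():
        sv = torch.linalg.svdvals(out[mask])            # descending order
        eff_rank = int((sv > rank_tol * sv[0]).sum())   # numerical rank
    else:
        eff_rank = 0
    return value_norm, eff_rank
```

Under this reading, a tiny `value_norm` marks a nop sink (the head effectively writes nothing), while a large sink mass with an effective rank of 1 or 2 marks a broadcast sink redistributing a shared global summary.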
ICLR 2026
Block-Recurrent Dynamics in ViTs
Mozes Jacobs*,
Thomas Fel*,
Richard Hakim*,
Alessandra Brondetta,
Demba Ba,
T. Andy Keller
We introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure. To validate this, we train recurrent surrogates called Raptor and show that a Raptor model recovers 96% of DINOv2 ImageNet-1k linear-probe accuracy with only 2 blocks at equivalent runtime. We then leverage the hypothesis to perform dynamical interpretability, revealing directional convergence into class-dependent basins, token-specific trajectory dynamics, and low-rank attractor structure in late layers.
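A sketch of the block-recurrent idea, assuming a standard pre-norm transformer encoder block; the class name, layer hyperparameters, and iteration counts are illustrative, not the exact Raptor configuration:

```python
import torch.nn as nn

class BlockRecurrentSurrogate(nn.Module):
    """A deep ViT collapsed into a few weight-tied blocks,
    each iterated over depth instead of stacking new layers."""

    def __init__(self, dim=768, heads=12, iters=(6, 6)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(
                dim, heads, activation="gelu",
                norm_first=True, batch_first=True,
            )
            for _ in iters
        )
        self.iters = iters

    def forward(self, x):                    # x: (batch, tokens, dim)
        for block, n in zip(self.blocks, self.iters):
            for _ in range(n):               # same weights reused across depth
                x = block(x)
        return x
```

Weight tying is what makes the surrogate block-recurrent: depth becomes iteration of a fixed map, which is what enables the trajectory and attractor analyses described above.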
CCN 2025 (Oral)
Traveling Waves Integrate Spatial Information Through Time
Mozes Jacobs,
Robert C. Budzinski,
Lyle Muller,
Demba Ba,
T. Andy Keller
blog / talk
We investigate how traveling waves of neural activity enable spatial information integration in convolutional recurrent networks. Our models learn to generate traveling waves in response to visual stimuli, effectively expanding the receptive fields of locally connected neurons. This mechanism substantially outperforms local feed-forward networks on semantic segmentation tasks requiring global spatial context, matching the performance of non-local U-Nets while using significantly fewer parameters.
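A minimal sketch of the kind of locally connected recurrent update that can carry such waves, assuming a simple convolutional RNN cell (channel count, kernel size, and nonlinearity are illustrative):

```python
import torch
import torch.nn as nn

class ConvRNNCell(nn.Module):
    """Each unit sees only a small spatial neighborhood per step,
    so global context must propagate as waves of activity
    across many recurrent iterations."""

    def __init__(self, channels=16, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.inp = nn.Conv2d(channels, channels, kernel, padding=pad)
        self.rec = nn.Conv2d(channels, channels, kernel, padding=pad)

    def forward(self, x, h):                 # x, h: (batch, C, H, W)
        return torch.tanh(self.inp(x) + self.rec(h))
```

Unrolling the cell for T steps lets a unit's effective receptive field grow with T, far beyond its one-step 3x3 neighborhood, without the parameter cost of a deeper or non-local architecture.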
PREPRINT
HyperSINDy: Deep Generative Modeling of Nonlinear Stochastic Governing Equations
Mozes Jacobs,
Bingni W. Brunton,
Steven L. Brunton,
J. Nathan Kutz,
Ryan V. Raut
arXiv, 2023
HyperSINDy is a deep generative framework for discovering stochastic governing equations from data. A variational encoder and hypernetwork produce sparse differential equations — learned via a trainable binary mask — whose coefficients are driven by Gaussian white noise. HyperSINDy accurately recovers ground-truth stochastic dynamics and provides uncertainty quantification that scales to high-dimensional systems.
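A schematic of the generative structure, assuming a precomputed polynomial feature library Theta(x); the layer sizes, names, and the hard mask threshold are illustrative simplifications, not the released implementation:

```python
import torch
import torch.nn as nn

class HyperSINDyDecoder(nn.Module):
    """Hypernetwork mapping a latent sample z to the coefficients
    of a sparse ODE, x_dot = Theta(x) @ (mask * coeffs(z))."""

    def __init__(self, latent_dim=8, n_terms=10, state_dim=3):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_terms * state_dim),
        )
        # Trainable mask logits enforcing sparsity; a hard threshold is
        # shown for clarity (training would need a relaxed estimator).
        self.mask_logits = nn.Parameter(torch.zeros(n_terms, state_dim))
        self.shape = (n_terms, state_dim)

    def forward(self, z, theta_x):  # z: (batch, latent), theta_x: (batch, n_terms)
        coeffs = self.hyper(z).view(-1, *self.shape)
        mask = (torch.sigmoid(self.mask_logits) > 0.5).float()
        return torch.einsum("bt,btd->bd", theta_x, mask * coeffs)  # x_dot
```

Sampling z from a standard normal at each integration step is what makes the recovered equations stochastic: the sparse structure is shared, while the coefficients fluctuate as noise-driven random variables.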