This post was produced as part of the Iliad Fellowship under the mentorship of Dmitry Vaintrob.

Tl;dr: Power-law ("heavy-tailed") distributions have universality theorems similar to those which make Gaussians common. We observe many things in ML are power-law distributed, most robustly and interestingly, the spectra of weight matrices. I explain how we can think of power-laws as being a natural generalization of the idea of 'sparsity', interpolating between true sparsity and Gaussianity according to the 'tail-index' of the distribution. I share some hypotheses about how this might relate to the 'sparse'/'discrete'/'factored' representations that neural networks seem to learn.

I promise this is not a Santa-Fe-Institute encomium for power laws or "black swans"; different genre.

Contents

1. The generalized central limit theorem proves power-law distributions are universality classes
2. Power laws observed in NNs might help us understand representation learning
    2.A. HTSR: phase changes in weight-matrix spectra and data-free prediction of generalization
    2.B. BBP transition as a quantum of learning
    2.C. HTSR as an extended BBP transition
    2.D. Training evidence for heavy tails is mixed, and I'm not sure if they're important
3. The tail exponent α is a smooth proxy for sparsity and compressibility
    3.A. α captures compressibility across heavy tails
    3.B. α-stable noise can make discrete codebooks optimal
    3.C. Heavy-tailed noise can convert analog inputs into discrete codebooks
4. Summ…
