EP23: Can AI Understand Music? Latent Space and Information Bottleneck
Overview
“Mathematics can measure out shape; it cannot measure out meaning.”
The previous episode (EP22) showed how EnCodec compresses audio into a discrete codebook of 2048 vectors in $\mathbb{R}^{128}$. This episode asks the next question: what did the model learn to encode? Tonal structure? Rhythmic patterns? Something humans cannot name?
We develop three mathematical frameworks to answer this:
- t-SNE reduces the 128-dimensional codebook to a 2D map, revealing that codewords cluster by musical key — without any supervision.
- Linear probes formalize the notion of “linearly decodable information” in learned representations.
- Information Bottleneck theory explains why a model forced to compress discovers musically meaningful structure: it must discard irrelevant detail while preserving predictive content.
The episode closes with Adam Neely’s challenge: even if the geometry is real, structure is not meaning. Jazz cannot be interpolated from blues and ragtime, and individually generated music cannot replace communal listening. The central thesis:
AI learns the geometric structure of music. Whether that constitutes understanding depends on what we mean by the word.
Prerequisites
- All-Interval Rows and $\mathbb{Z}_{12}$ (EP04) — the cyclic group $\mathbb{Z}_{12}$ of pitch classes, circle of fifths
- Entropy and Information (EP08) — Shannon entropy, KL divergence
- Tonnetz Hodge Duality (EP14) — Tonnetz topology, torus structure of pitch space
- From Markov to Diffusion (EP21) — attention mechanism, Transformer architecture
- EnCodec and RVQ (EP22) — codebook vectors, residual vector quantization
Part I: Probing the Codebook
23.1 t-SNE: Visualizing High-Dimensional Structure
“The human eye cannot see a 128-dimensional space, but there is a method called t-distributed stochastic neighbor embedding that can project high-dimensional vectors onto a two-dimensional map.”
The EnCodec codebook consists of 2048 vectors in $\mathbb{R}^{128}$. We need a method that faithfully represents local neighborhood relationships in two dimensions.
Given points $x_1, \dots, x_n \in \mathbb{R}^{128}$, define the conditional similarity of $x_j$ to $x_i$ as

$$p_{j \mid i} = \frac{\exp\!\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}{\sum_{k \neq i} \exp\!\big(-\|x_i - x_k\|^2 / 2\sigma_i^2\big)},$$

where $\sigma_i$ is chosen so that the perplexity of the conditional distribution $P_i = (p_{j \mid i})_{j \neq i}$ equals a user-specified target (typically 5–50). The symmetrized affinity is

$$p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}.$$
Worked example. Suppose three points $x_1, x_2, x_3$ with $\sigma_i = 1$ for all $i$. If $\|x_1 - x_2\| = 1$ and $\|x_1 - x_3\| = 3$, then

$$p_{2 \mid 1} = \frac{e^{-1/2}}{e^{-1/2} + e^{-9/2}} \approx 0.98, \qquad p_{3 \mid 1} \approx 0.02.$$
Nearly all probability mass concentrates on the nearest neighbor. This is precisely the “local neighborhood preservation” property.
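As a quick check, here is a minimal NumPy sketch of that conditional-affinity computation; the distances and $\sigma$ are the illustrative values from the worked example, not data from the episode.

```python
import numpy as np

# Conditional t-SNE affinities for the toy worked example:
# three points, ||x1 - x2|| = 1, ||x1 - x3|| = 3, sigma_1 = 1.
d12, d13, sigma = 1.0, 3.0, 1.0

logits = np.array([-d12**2, -d13**2]) / (2 * sigma**2)
p = np.exp(logits) / np.exp(logits).sum()   # [p_{2|1}, p_{3|1}]

print(p)  # approx [0.982, 0.018]: mass concentrates on the nearest neighbor
```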
For the embedding coordinates $y_1, \dots, y_n \in \mathbb{R}^2$, define

$$q_{ij} = \frac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}{\sum_{k \neq l} \big(1 + \|y_k - y_l\|^2\big)^{-1}}.$$

This is a Student-t distribution with 1 degree of freedom (i.e., a Cauchy kernel).
Worked example. If $\|y_i - y_j\| = 1$, the numerator is $(1 + 1)^{-1} = 0.5$. If $\|y_i - y_j\| = 3$, the numerator is $(1 + 9)^{-1} = 0.1$. The heavy tail of the Student-t means that moderately distant points in high dimensions can be pushed far apart in 2D without paying a large cost — solving the crowding problem.
The perplexity of the conditional distribution $P_i$ is

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j \neq i} p_{j \mid i} \log_2 p_{j \mid i}.$$

It can be interpreted as the effective number of neighbors. Setting perplexity = 30 means each point “sees” roughly 30 neighbors. The bandwidth $\sigma_i$ is found by binary search so that $\mathrm{Perp}(P_i)$ matches the target.
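A minimal sketch of that bandwidth search, assuming a plain NumPy implementation rather than the reference code of van der Maaten & Hinton; the function name and inputs are illustrative.

```python
import numpy as np

def sigma_for_perplexity(dists_sq, target_perp=30.0, tol=1e-5, max_iter=50):
    """Binary search over sigma so that Perp(P_i) = 2^{H(P_i)} matches the target.

    dists_sq: squared distances from point i to every other point (1-D array).
    """
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = np.sqrt(lo * hi)                      # geometric midpoint of the bracket
        logits = -dists_sq / (2 * sigma**2)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        h = -np.sum(p * np.log2(p + 1e-12))           # Shannon entropy in bits
        perp = 2.0 ** h                               # effective number of neighbors
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = sigma                                # too many neighbors: shrink sigma
        else:
            lo = sigma
    return sigma

# Example: 100 random squared distances, aim for roughly 30 effective neighbors.
rng = np.random.default_rng(0)
d2 = rng.uniform(0.5, 5.0, size=100)
print(sigma_for_perplexity(d2))
```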
The t-SNE embedding minimizes the Kullback–Leibler divergence

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where $P$ is fixed from the high-dimensional data and $Q$ depends on the embedding coordinates.
The gradient with respect to embedding point $y_i$ is

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)\,\big(1 + \|y_i - y_j\|^2\big)^{-1}.$$
To derive this, write $C = \sum_{i \neq j} p_{ij} \log p_{ij} - \sum_{i \neq j} p_{ij} \log q_{ij}$. The first term is constant. For the second, substitute $q_{ij} = E_{ij} / Z$ where $E_{ij} = (1 + \|y_i - y_j\|^2)^{-1}$ and $Z = \sum_{k \neq l} E_{kl}$, and differentiate both the numerator and the normalizer with respect to $y_i$. The chain rule through $\|y_i - y_j\|^2$ contributes the factor $2(y_i - y_j)$, the Cauchy kernel contributes $-E_{ij}^2$, and combining with the normalizer terms yields the stated gradient.
Optimization proceeds by gradient descent (with momentum and early exaggeration). Since $C$ is bounded below by 0 (achieved when $Q = P$) and the gradient is well-defined for distinct points, the procedure converges to a local minimum.
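The gradient formula translates directly into code. The sketch below, assuming NumPy and a random toy affinity matrix `P`, performs one plain gradient-descent step (no momentum or early exaggeration).

```python
import numpy as np

def tsne_gradient(P, Y):
    """Gradient of C = KL(P || Q) with respect to the 2-D embedding Y.

    P: (n, n) symmetrized affinities (zero diagonal, entries sum to 1).
    Y: (n, 2) embedding coordinates.
    """
    diff = Y[:, None, :] - Y[None, :, :]                 # y_i - y_j, shape (n, n, 2)
    dist_sq = (diff ** 2).sum(axis=-1)
    E = 1.0 / (1.0 + dist_sq)                            # Cauchy kernel (1 + ||.||^2)^-1
    np.fill_diagonal(E, 0.0)
    Q = E / E.sum()                                      # low-dimensional affinities q_ij
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1
    return 4.0 * (((P - Q) * E)[:, :, None] * diff).sum(axis=1)

# One gradient-descent step on a random toy problem.
rng = np.random.default_rng(0)
n = 10
P = rng.random((n, n))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = rng.normal(scale=1e-2, size=(n, 2))
Y -= 200.0 * tsne_gradient(P, Y)                         # typical t-SNE learning rate
```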
Why Student-t instead of Gaussian? In high dimensions, the volume of a thin spherical shell at radius $r$ grows as $r^{d-1}$, so most neighbors of a point are at nearly the same distance. Mapping these to 2D with a Gaussian kernel forces moderately-distant points into a crowded annulus. The heavy tail of the Student-t kernel decays as $\|y_i - y_j\|^{-2}$ rather than exponentially, providing room for moderately-distant points to spread out while keeping true neighbors close.
23.2 Tonal Clusters in the Codebook
“Count which keys’ audio segments each codeword appears in most often, then color the points by key: C major red, G major blue, D minor green.”
The procedure: (1) run t-SNE on the 2048 codebook vectors; (2) for each codeword, count which musical keys it most frequently encodes; (3) color by dominant key.
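A sketch of that three-step procedure using scikit-learn’s `TSNE`; the arrays `codebook` and `dominant_key` are placeholders standing in for the real EnCodec codebook and the per-codeword key statistics, which are not reproduced here.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders: in practice these come from the codec and from key-annotated audio.
codebook = np.random.randn(2048, 128)           # stand-in for the EnCodec codebook
dominant_key = np.random.randint(0, 24, 2048)   # stand-in for each codeword's dominant key

# Step 1: reduce the 128-d codebook to 2-D.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(codebook)

# Steps 2-3: color each codeword by its dominant key.
plt.scatter(emb[:, 0], emb[:, 1], c=dominant_key, cmap="hsv", s=4)
plt.colorbar(label="dominant key (0..23)")
plt.title("Codebook t-SNE colored by dominant key")
plt.show()
```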
“The result is striking: codewords of the same key tend to fall in neighboring regions.”
The observation is that codewords associated with the same key form contiguous clusters in the t-SNE map. Moreover, the neighborhood structure echoes the circle of fifths from EP04: keys separated by a fifth (e.g., C major and G major) tend to have adjacent clusters, while tritone-related keys (e.g., C and F♯) are distant.
“No one told the model what major and minor are. Simply by compressing audio, it discovered the geometric structure of tonality on its own.”
23.3 Linear Probes: Testing What Is Linearly Decodable
Given a frozen encoder $f$ producing representations $z = f(x) \in \mathbb{R}^d$ for input $x$ with label $y \in \{1, \dots, K\}$, a linear probe is a classifier $\hat{y} = \mathrm{softmax}(Wz + b)$ with $W \in \mathbb{R}^{K \times d}$, $b \in \mathbb{R}^{K}$, trained by

$$\min_{W, b}\; \sum_i \ell\big(\mathrm{softmax}(W z_i + b),\, y_i\big),$$

where $\ell$ is the cross-entropy loss. The encoder weights are not updated during probe training.
Worked example. For a key-detection probe with $K = 24$ (12 major + 12 minor keys) and $d = 128$, we train a matrix $W \in \mathbb{R}^{24 \times 128}$ and bias $b \in \mathbb{R}^{24}$. If the probe achieves 85% accuracy on held-out data, we conclude that key information is linearly separable in the codebook space.
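A minimal probe sketch, assuming scikit-learn’s `LogisticRegression` as the softmax classifier; the representations `Z` and labels `keys` are random placeholders, so the reported accuracy here is only chance level.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholders: Z would come from the frozen encoder, keys from annotated audio.
Z = np.random.randn(5000, 128)            # frozen representations z = f(x)
keys = np.random.randint(0, 24, 5000)     # labels: 12 major + 12 minor keys

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, keys, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)  # softmax(Wz + b), cross-entropy training
probe.fit(Z_tr, y_tr)                      # encoder stays frozen; only W, b are learned
print("probe accuracy:", probe.score(Z_te, y_te))   # ~1/24 on random placeholder data
```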
Important caveat. A low probe accuracy does not imply the information is absent. It may be encoded non-linearly. A representation could store key information in the norms of subvectors or in interactions between dimensions — patterns invisible to a linear classifier but accessible to a nonlinear one. Linear probes test a sufficient condition, not a necessary one.
Part II: Emergent Specialization
23.4 Attention Head Specialization
“The attention weights form a matrix; each row tells you which past positions the current position is attending to.”
Recall from EP21 that a Transformer layer computes attention weights $A = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big)$ with $\sum_j A_{ij} = 1$. In multi-head attention, each head has its own projection matrices and thus its own attention pattern.
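A small NumPy sketch of that formula for a single head, with randomly initialized projections standing in for trained weights.

```python
import numpy as np

def attention_weights(X, Wq, Wk):
    """Attention pattern A = softmax(Q K^T / sqrt(d_k)) for one head.

    X: (T, d_model) token representations; Wq, Wk: (d_model, d_k) head projections.
    """
    Q, K = X @ Wq, X @ Wk
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=-1, keepdims=True)            # each row sums to 1

# Each of H heads has its own projections, hence its own attention pattern.
rng = np.random.default_rng(0)
T, d_model, d_k, H = 16, 64, 8, 4
X = rng.normal(size=(T, d_model))
patterns = [attention_weights(X, rng.normal(size=(d_model, d_k)),
                              rng.normal(size=(d_model, d_k))) for _ in range(H)]
print(patterns[0].shape)   # (16, 16): row i is position i's attention distribution
```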
“Different attention heads divide up the work on their own. Some heads show periodic weights, rising every fixed number of steps, as if counting beats. Others are especially sensitive to harmonic changes.”
Let $\mathcal{L}$ be the next-token prediction loss. If the data distribution decomposes over generative factors $g_1, \dots, g_K$ (e.g., rhythm and harmony) that are statistically independent, then the gradient signal $\nabla_\theta \mathcal{L}$ decomposes into contributions from each factor. In an overparameterized network, gradient descent tends to assign different heads to different factors, producing modular internal structure.
Write $\mathcal{L} = -\mathbb{E}\big[\log p_\theta(x_{t+1} \mid x_{\le t})\big]$. By the chain rule of mutual information and the independence of the $g_k$, predicting $x_{t+1}$ requires capturing $I(g_k; x_{t+1})$ for each factor separately. The gradient is a sum over contributions from each factor:

$$\nabla_\theta \mathcal{L} = \sum_{k=1}^{K} \nabla_\theta \mathcal{L}_k,$$

where $\mathcal{L}_k$ is the loss component attributable to factor $g_k$.
In an overparameterized regime, heads that are randomly initialized near the gradient direction of factor $g_k$ will specialize toward that factor, because the loss landscape admits many solutions and gradient descent follows the path of least resistance. This is analogous to the lottery ticket hypothesis (Frankle & Carbin, 2019): the overparameterized network contains sparse subnetworks, each specialized to a factor, and training selects them.
The specialization is not guaranteed to be clean or complete — heads may partially respond to multiple factors — but the statistical tendency toward modular decomposition is observed empirically across architectures.
“No one labeled ‘this is rhythm’ or ‘this is harmony’ during training. The model had a single objective: predict the next discrete token. From that single objective, functionally differentiated internal structure emerged on its own.”
Phase transitions in specialization. Empirically, specialization does not emerge gradually. There is evidence of phase transitions: during training, heads remain unspecialized until the loss drops below a critical threshold, at which point distinct functional roles crystallize rapidly. This parallels phase transitions in statistical physics and may be related to the information-theoretic framework of Part III.
Part III: Information Bottleneck Theory
23.5 Mutual Information
For random variables $X$ and $Y$ with joint distribution $p(x, y)$, the mutual information is

$$I(X; Y) = \sum_{x, y} p(x, y)\, \log \frac{p(x, y)}{p(x)\,p(y)}.$$

Equivalently, $I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big)$ — the divergence between the joint and the product of marginals.
Worked example. Let $X$ = audio frame (continuous), $Y$ = musical key (discrete, 24 values). If knowing the key reduces uncertainty about the audio spectrum by 2 bits, then $I(X; Y) = 2$ bits.
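For discrete variables, the definition can be evaluated directly from a joint probability table. A short NumPy sketch with toy joints (not audio data):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a discrete joint distribution table p_xy[x, y]."""
    px = p_xy.sum(axis=1, keepdims=True)          # marginal p(x)
    py = p_xy.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])).sum())

# Sanity checks: X = Y uniform on 4 symbols gives I = H(X) = 2 bits;
# independent X and Y give I = 0 bits.
joint_equal = np.eye(4) / 4.0
joint_indep = np.full((4, 4), 1.0 / 16.0)
print(mutual_information(joint_equal))   # 2.0
print(mutual_information(joint_indep))   # 0.0
```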
For any Markov chain $X \to Y \to Z$ (i.e., $Y$ is a function of $X$, and $Z$ depends on $X$ only through $Y$):

$$I(X; Z) \le I(X; Y)$$

(the data processing inequality). Processing cannot create information. If $W$ is a further function of $Z$, then $I(X; W) \le I(X; Z) \le I(X; Y)$.
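A numerical illustration of the data processing inequality on a toy Markov chain $X \to Y \to Z$, assuming simple symmetric noise channels (an illustration, not part of the episode’s material):

```python
import numpy as np

def mi(p_xy):
    """I(X;Y) in bits from a discrete joint table."""
    px, py = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log2(p_xy[m] / (px @ py)[m])).sum())

# X -> Y -> Z: Y is a noisy copy of X, Z a noisy copy of Y.
n = 8
p_x = np.full(n, 1 / n)
channel = 0.8 * np.eye(n) + 0.2 / n        # keep the symbol w.p. ~0.8, else spread uniformly

p_xy = p_x[:, None] * channel              # joint of (X, Y): p(x) p(y|x)
p_xz = p_xy @ channel                      # joint of (X, Z): marginalize Y through p(z|y)

print(round(mi(p_xy), 3), ">=", round(mi(p_xz), 3))   # DPI: I(X;Y) >= I(X;Z)
```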
23.6 The Information Bottleneck
Given a Markov chain $Y \to X \to Z$, the Information Bottleneck (IB) objective seeks a stochastic mapping $p(z \mid x)$ that solves

$$\min_{p(z \mid x)}\; I(X; Z) - \beta\, I(Z; Y),$$
where:
- $I(X; Z)$ = compression term: how much of the raw input is retained.
- $I(Z; Y)$ = prediction term: how much predictive information about the target is preserved.
- $\beta$ = Lagrange multiplier controlling the trade-off.
Worked example. For EnCodec: $X$ = raw audio waveform, $Z$ = quantized codebook index, $Y$ = next audio frame. The codec must compress (small $I(X; Z)$ means fewer bits) while preserving enough to reconstruct/predict (large $I(Z; Y)$). At $\beta = 0$, the optimal $Z$ is trivial (one cluster). As $\beta \to \infty$, $Z$ retains everything about $X$.
The optimal IB solution satisfies the self-consistent equations:

$$p(z \mid x) = \frac{p(z)}{Z(x, \beta)}\, \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big[p(y \mid x) \,\|\, p(y \mid z)\big]\Big),$$

$$p(z) = \sum_x p(x)\, p(z \mid x), \qquad p(y \mid z) = \frac{1}{p(z)} \sum_x p(y \mid x)\, p(z \mid x)\, p(x),$$

where $Z(x, \beta)$ is a normalizing constant. As $\beta$ varies, the optimal solutions trace the IB curve in the $\big(I(X; Z),\, I(Z; Y)\big)$ plane. This curve is concave and monotonically non-decreasing.
Introduce the Lagrangian with the constraint that $p(z \mid x)$ is a valid conditional distribution:

$$\mathcal{F} = I(X; Z) - \beta\, I(Z; Y) - \sum_x \lambda(x) \Big[\sum_z p(z \mid x) - 1\Big].$$

Taking the functional derivative $\delta \mathcal{F} / \delta p(z \mid x)$: the first term gives $p(x)\, \log \frac{p(z \mid x)}{p(z)}$ (up to additive constants). For the second term, since $p(z)$ and $p(y \mid z)$ both depend on $p(z \mid x)$, the variational derivative yields $\beta\, p(x)\, D_{\mathrm{KL}}\big[p(y \mid x) \,\|\, p(y \mid z)\big]$, up to terms independent of $z$, after applying Bayes' rule. Combining and solving:

$$\log p(z \mid x) = \log p(z) - \beta\, D_{\mathrm{KL}}\big[p(y \mid x) \,\|\, p(y \mid z)\big] - \tilde{\lambda}(x).$$

Recognizing that the remaining constant $\tilde{\lambda}(x)$ is independent of $z$, and absorbing it into the normalizer $Z(x, \beta)$:

$$p(z \mid x) = \frac{p(z)}{Z(x, \beta)}\, \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big[p(y \mid x) \,\|\, p(y \mid z)\big]\Big).$$
The concavity of the IB curve follows from the concavity of mutual information in the channel $p(z \mid x)$ (a standard result in information theory). Monotonicity: increasing $\beta$ increases the weight on preserving $I(Z; Y)$, so the optimal $I(Z; Y)$ is non-decreasing in $\beta$.
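The self-consistent equations suggest an alternating-update algorithm in the spirit of Blahut–Arimoto. The sketch below, for a small discrete joint distribution, is a minimal illustration under those assumptions, not a reproduction of Tishby, Pereira & Bialek’s implementation; convergence is only to a local optimum.

```python
import numpy as np

def information_bottleneck(p_xy, n_z, beta, n_iter=200, seed=0):
    """Iterate the self-consistent IB equations for a discrete joint p(x, y)."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                               # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]

    p_z_given_x = rng.random((p_xy.shape[0], n_z))       # random initial encoder
    p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_z = p_x @ p_z_given_x                          # p(z) = sum_x p(x) p(z|x)
        # p(y|z) = sum_x p(y|x) p(z|x) p(x) / p(z)
        p_y_given_z = (p_z_given_x * p_x[:, None]).T @ p_y_given_x / p_z[:, None]
        # KL[p(y|x) || p(y|z)] for every (x, z) pair
        kl = (p_y_given_x[:, None, :] *
              np.log((p_y_given_x[:, None, :] + 1e-12) /
                     (p_y_given_z[None, :, :] + 1e-12))).sum(axis=2)
        # p(z|x) proportional to p(z) exp(-beta KL), then renormalize
        p_z_given_x = p_z[None, :] * np.exp(-beta * kl)
        p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    return p_z_given_x

# Toy example: 8 inputs, 2 relevant classes; large beta favors preserving I(Z;Y).
p_xy = np.kron(np.eye(2), np.ones((4, 1))) / 8.0         # shape (8, 2), sums to 1
print(information_bottleneck(p_xy, n_z=2, beta=10.0).round(2))
```

With a large $\beta$ the learned $p(z \mid x)$ becomes nearly deterministic, grouping the inputs that predict the same $y$ into the same cluster, which is exactly the mechanism invoked below to explain tonal clustering.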
Connection to rate-distortion theory. The IB framework generalizes Shannon’s rate-distortion theory. In rate-distortion, the objective is $\min_{p(\hat{x} \mid x)} I(X; \hat{X})$ subject to $\mathbb{E}[d(X, \hat{X})] \le D$, where $d$ is a distortion measure. The IB replaces the explicit distortion constraint with the requirement of preserving $I(Z; Y)$, making the “relevant” aspects of $X$ task-dependent rather than reconstruction-dependent.
Why this explains tonal clustering. When the model compresses audio (minimizing ) while preserving predictability of future audio (maximizing ), the optimal codebook groups acoustically different signals that make the same predictions about what comes next. Signals in the same key share harmonic expectations — so key information is retained, while irrelevant timbral details are discarded. The tonal clusters in the t-SNE map are a visible consequence of this information-theoretic trade-off.
Part IV: The Meaning Gap
23.7 The Convex Hull Argument
“If jazz had never existed, could you prompt an AI into producing jazz from blues and ragtime data alone? Impossible.”
Let $\phi: \mathcal{X} \to \mathbb{R}^d$ be a feature map (e.g., a neural network encoder). The convex hull of the training distribution $\mathcal{D}_{\mathrm{train}}$ in feature space is

$$\mathrm{Hull}(\mathcal{D}_{\mathrm{train}}) = \Big\{ \textstyle\sum_i \lambda_i\, \phi(x_i) \;:\; x_i \in \mathcal{D}_{\mathrm{train}},\ \lambda_i \ge 0,\ \sum_i \lambda_i = 1 \Big\}.$$

Generative models that interpolate in latent space produce samples whose representations lie in or near $\mathrm{Hull}(\mathcal{D}_{\mathrm{train}})$.
Let $\mathcal{D}_{\mathrm{blues}}$ and $\mathcal{D}_{\mathrm{ragtime}}$ be the training distributions and $\phi$ any feature map learned from this data. If jazz involves structural innovations — harmonic substitutions (tritone subs), polyrhythmic superposition, modal interchange — that are not expressible as convex combinations of blues and ragtime features, then

$$\phi(x_{\mathrm{jazz}}) \notin \mathrm{Hull}(\mathcal{D}_{\mathrm{blues}} \cup \mathcal{D}_{\mathrm{ragtime}}).$$

No amount of interpolation or prompting can produce jazz from a model trained only on blues and ragtime.
The proof proceeds by contradiction. Suppose $\phi(x_{\mathrm{jazz}}) = \sum_i \lambda_i\, \phi(x_i)$ for some $\lambda_i \ge 0$, $\sum_i \lambda_i = 1$, $x_i \in \mathcal{D}_{\mathrm{blues}} \cup \mathcal{D}_{\mathrm{ragtime}}$.
Consider the feature dimension $\phi_k$ corresponding to tritone-substitution frequency. In blues and ragtime, the tritone substitution is either absent or extremely rare, so $\phi_k(x_i) \approx 0$ for all training points. Then $\sum_i \lambda_i\, \phi_k(x_i) \approx 0$. But bebop jazz uses tritone substitutions systematically (e.g., substituting D♭7 for G7 in a ii–V–I), so $\phi_k(x_{\mathrm{jazz}}) \gg 0$. Contradiction.
The argument generalizes: any structural feature absent from the training corpus that is present in the target genre creates a coordinate direction along which the target lies outside the convex hull. Since the convex hull is a closed convex set, the separating hyperplane theorem guarantees a linear functional that separates the target from the hull.
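The hull-membership question is itself computable: deciding whether a feature vector lies in the convex hull of training features is a linear-programming feasibility problem. A sketch using `scipy.optimize.linprog`, with entirely hypothetical two-dimensional “features” (syncopation, tritone-substitution frequency):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(target, points):
    """True if `target` is a convex combination of the rows of `points` (LP feasibility)."""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])    # match each feature + sum(lambda) = 1
    b_eq = np.concatenate([target, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

# Hypothetical training features: coordinate 2 (tritone-sub frequency) stays near zero.
blues   = np.array([[0.2, 0.00], [0.4, 0.05], [0.3, 0.02]])
ragtime = np.array([[0.7, 0.00], [0.9, 0.03], [0.8, 0.01]])
train = np.vstack([blues, ragtime])

print(in_convex_hull(np.array([0.5, 0.02]), train))   # True: an interpolation
print(in_convex_hull(np.array([0.5, 0.60]), train))   # False: jazz-like tritone-sub use
```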
This formalizes Neely’s intuition: new musical genres are not interpolations — they are cultural jumps to points outside the convex hull of existing data.
23.8 Structure versus Meaning
“The model has geometric structure inside it, no question. But if no one is truly listening, comparing, passing the music on, then a chasm opens between structure and meaning.”
- Syntactic structure: measurable geometric properties of representations — distances, clusters, spectral gaps, curvatures. Formally: any property computable from the metric space $(\mathcal{Z}, d)$, where $\mathcal{Z}$ is the latent space and $d$ is a distance.
- Semantic meaning: requires grounding in embodied experience, social context, and cultural memory. Not computable from geometric properties alone.
This distinction echoes Searle’s Chinese Room argument (1980): a system that manipulates symbols according to syntactic rules may pass any behavioral test for understanding, yet possess no semantic comprehension. Applied to music: a model that clusters keys, separates rhythms, and predicts the next token is performing syntactic operations on acoustic symbols — operations that are geometrically sophisticated but semantically ungrounded.
“Meaning does not live in vectors. Meaning lives in community, tradition, and listening.”
“I will keep making music for myself and my community, because it makes me happy. But I am afraid it can no longer connect with others the way it used to.”
The grounding problem asks: can statistical co-occurrence alone give meaning? The IB framework shows that the model captures all the predictively useful structure. But predictive utility is not the same as meaning. A weather model captures the structure of atmospheric dynamics without understanding what rain feels like.
“Have you ever sung the same song with thousands of strangers? Neely says that when the music everyone consumes is fully personalized, that experience disappears.”
Neely’s deeper point is sociological: music’s meaning emerges from shared experience. A concert where thousands sing the same melody creates a collective state that no individually generated playlist can replicate. The information-theoretic framework captures the structure of music but not the structure of community. This is not a failure of mathematics — it is a boundary condition on what mathematical formalism can express.
Numerical Examples
Example 23.1: t-SNE on a toy codebook. Consider six points in $\mathbb{R}^2$ representing codewords from two keys (illustrative values):

| Codeword | Vector | Key |
|---|---|---|
| $c_1$ | $(0, 0)$ | C major |
| $c_2$ | $(1, 0)$ | C major |
| $c_3$ | $(0, 1)$ | C major |
| $f_1$ | $(10, 10)$ | F major |
| $f_2$ | $(11, 10)$ | F major |
| $f_3$ | $(10, 11)$ | F major |

Within-cluster distances: $1$ to $\sqrt{2} \approx 1.4$. Between-cluster distances: roughly $13.5$ to $14.9$.
With $\sigma_i = 1$, the high-dimensional affinity within a key is $p_{c_2 \mid c_1} \approx p_{c_3 \mid c_1} \approx 0.5$, while the affinity between keys is below $10^{-39}$. t-SNE will place $c_1, c_2, c_3$ close together and $f_1, f_2, f_3$ far away, reproducing the key-based clustering.
Example 23.2: IB trade-off. Suppose $H(X) = 11$ bits (raw audio complexity, matching the $2048 = 2^{11}$ codewords of EP22) and, say, $I(X; Y) = 3$ bits (next-frame prediction). By the data processing inequality, $I(Z; Y) \le I(X; Y) = 3$ bits. A codebook with 11 bits of capacity can afford $I(X; Z) \le 11$ bits. On a favorable IB curve, with $I(X; Z) = 6$ bits (substantial compression from 11) one can still achieve $I(Z; Y)$ close to the 3-bit ceiling — retaining most predictive information while discarding 5 bits of irrelevant detail (noise, exact timbre, recording artifacts).
Example 23.3: Convex hull failure. In $\mathbb{R}^2$, let blues features fill the triangle with vertices $(0, 0)$, $(1, 0)$, $(1, 0.05)$ and ragtime features the triangle with vertices $(0, 0)$, $(0, 1)$, $(0.05, 1)$ — two thin triangles hugging the axes (illustrative coordinates). Every training point satisfies $x_1 + x_2 \le 1.05$, so every convex combination does too. Jazz at $(0.7, 0.7)$ has $x_1 + x_2 = 1.4 > 1.05$ and therefore lies outside the convex hull of the union.
Musical Connection
From the Tonnetz to the Codebook: The Geometry Persists
The tonal clusters discovered by t-SNE in the EnCodec codebook echo the topological structure of the Tonnetz studied in EP14. There, the twelve pitch classes form a simplicial complex on a torus, with Betti numbers encoding two independent non-contractible loops: the circle of fifths and the major-third cycle. In the codebook, the same neighborhood relationships reappear: keys a fifth apart cluster nearby, and the circular ordering of key clusters reproduces the cyclic group structure from EP04.
“The local neighborhood structure that emerges in the codebook bears a resemblance to the circle of fifths.”
The circle of fifths defines an adjacency relation on keys: C is “near” G and F, “far” from F♯. This same relation — which EP04 identified as a generator of $\mathbb{Z}_{12}$ — emerges unsupervised in the codebook geometry. The model has rediscovered, through compression alone, a structure that took Western music theory centuries to articulate.
But the critical difference: the Tonnetz is a mathematical construction with known topology. The codebook geometry is empirical and depends on the training data, the architecture, and the compression rate. Whether the codebook consistently recovers the full torus topology (both generators, the correct Betti numbers) across different models and datasets is an open empirical question, one that connects to the study of robustness elsewhere in this series.
The perfect-fifth connection. In EP14, two pitch classes connected by an edge of the Tonnetz are separated by a consonant interval (perfect fifth, major third, or minor third). The codebook t-SNE map recovers primarily the fifth-based adjacency. This is consistent with the IB framework: the perfect fifth (frequency ratio 3:2) is the strongest harmonic relationship after the octave, so it carries the most predictive information about harmonic context. The major-third axis (frequency ratio 5:4) is weaker, and its recovery in the codebook is less consistent — a quantitative prediction that could be tested by measuring probe accuracy for fifth-related versus third-related key pairs.
Attention heads as discrete differential operators. In EP14, the Hodge Laplacian decomposed interval flows into gradient, curl, and harmonic components. Speculatively, the specialized attention heads of Part II may perform an analogous decomposition: rhythm heads detect periodic structure (analogous to divergence), harmony heads detect chord transitions (analogous to curl), and global context heads track large-scale tonal motion (analogous to harmonic flow). This analogy is suggestive but unproven.
Limits and Open Questions
Conjecture. For any audio codec trained with sufficient capacity on tonal music, the first codebook’s t-SNE embedding recovers a neighborhood graph homeomorphic to a quotient of the circle of fifths. That is, the topological structure is not an artifact of a particular model but an inevitable consequence of the IB trade-off applied to tonal music.
Status: Unresolved. Requires systematic comparison across codecs (EnCodec, SoundStream, DAC) and training corpora.
Conjecture. In a Transformer trained on music tokens, there exists a critical training loss $\mathcal{L}^*$ such that for $\mathcal{L} > \mathcal{L}^*$, no attention head shows statistically significant specialization, while for $\mathcal{L} < \mathcal{L}^*$, at least $K$ heads specialize to distinct musical features (where $K$ is the number of independent generative factors). The transition is sharp in the sense that the mutual information between head attention patterns and musical features has a discontinuous derivative at $\mathcal{L}^*$.
Status: Partially supported by visualization studies. Formal proof requires a tractable model of the training dynamics.
Open questions:
- Non-linear probing: If a linear probe fails but a two-layer MLP succeeds, what is the minimal geometric complexity of the encoding? Can we characterize this by the intrinsic dimension of the feature manifold?
- IB tightness: How close do real codecs come to the IB curve? Is there a measurable gap, and does closing it improve downstream music generation quality?
- Grounding beyond syntax: Can a model that interacts with embodied agents (e.g., a robot musician responding to audience reactions) develop something closer to semantic understanding? Or is the grounding problem fundamentally unsolvable for statistical models?
- Convex hull escape: Are there architectures (e.g., neuro-symbolic systems, models with explicit rule-learning modules) that can generate samples outside the convex hull of their training data? What mathematical framework captures “genuine novelty” as opposed to interpolation? Forward reference: EP25 explores related questions about generalization bounds.
- Perplexity sensitivity: t-SNE visualizations depend strongly on the perplexity parameter. Do tonal clusters persist across the full range of typical values (5–50)? If clusters fragment at low perplexity and merge at high perplexity, the “true” cluster structure may be scale-dependent, requiring persistent homology (a tool from topological data analysis) to resolve.
- Cross-cultural codebooks: The tonal clustering discussed here assumes Western tonal music. For music based on maqam (Arabic), raga (Indian), or pentatonic scales (Chinese), does the codebook geometry reflect the relevant scale structure? The circle of fifths is not universal — other tuning systems would produce different geometric signatures.
Academic References
- van der Maaten, L. & Hinton, G. (2008). “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9, 2579–2605.
- Tishby, N., Pereira, F. & Bialek, W. (2000). “The Information Bottleneck Method.” Proceedings of the 37th Allerton Conference, 368–377.
- Tishby, N. & Zaslavsky, N. (2015). “Deep Learning and the Information Bottleneck Principle.” IEEE Information Theory Workshop (ITW), 1–5.
- Alain, G. & Bengio, Y. (2017). “Understanding Intermediate Layers Using Linear Classifier Probes.” ICLR Workshop.
- Searle, J. (1980). “Minds, Brains, and Programs.” Behavioral and Brain Sciences 3(3), 417–424.
- Frankle, J. & Carbin, M. (2019). “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” ICLR 2019.
- Voita, E., Talbot, D., Moiseev, F., Sennrich, R. & Titov, I. (2019). “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.” ACL 2019, 5797–5808.
- Castellon, R., Donahue, C. & Liang, P. (2021). “Codified Audio Language Modeling Learns Useful Representations for Music Information Retrieval.” ISMIR 2021.
- Defossez, A., Copet, J., Synnaeve, G. & Adi, Y. (2023). “High Fidelity Neural Audio Compression.” ICLR 2023.
- Neely, A. (2024). “The Death of Music.” YouTube. Accessed 2026-02-20.
- Shannon, C. (1959). “Coding Theorems for a Discrete Source with a Fidelity Criterion.” IRE National Convention Record 7(4), 142–163.
- Cover, T. & Thomas, J. (2006). Elements of Information Theory, 2nd ed. Wiley.
- Shwartz-Ziv, R. & Tishby, N. (2017). “Opening the Black Box of Deep Neural Networks via Information.” arXiv:1703.00810.
- Zeghidour, N., Luebs, A., Omran, A., Skoglund, J. & Tagliasacchi, M. (2022). “SoundStream: An End-to-End Neural Audio Codec.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 495–507.