
EP21: From Markov to Diffusion — Sixty Years of AI Composition

Markov chains, Transformer attention, score matching/diffusion
8:35 · Statistics/ML · Information Theory

Overview

Three mathematical frameworks have dominated algorithmic music composition over sixty years: Markov chains (1957), Transformer attention (2017), and diffusion models (2020). Each replaces its predecessor not by discarding the core idea — sampling from a learned probability distribution over musical events — but by changing which probability distribution and how it is parameterized.

Narration: "They are solving the same mathematical problem. The answer to that problem has changed only three times in seventy years: from dice, to attention, to noise."

This companion page formalizes the mathematics the video sketches: transition matrices on $\mathbb{Z}_{12}$, the softmax attention mechanism as a content-dependent stochastic matrix, and the DDPM forward/reverse processes with score matching.


Prerequisites

EP04: pitch classes and the cyclic group ℤ₁₂.

Part I: Markov Chains on ℤ₁₂

Narration: "In 1957, Hiller and Isaacson ran an experiment on the Illiac computer at the University of Illinois. They arranged the twelve pitches in a circle. As EP04 explained, this is Z12, the cyclic group of order twelve. Then they assigned a transition probability to each pair of notes."

1.1 Definitions

Definition 21.1 (Time-Homogeneous Markov Chain on a Finite State Space)

Let $S = \{s_1, \dots, s_n\}$ be a finite set of states. A time-homogeneous Markov chain on $S$ is a sequence of random variables $X_0, X_1, X_2, \dots$ taking values in $S$ such that for all $t \ge 0$ and all states $i, j, x_0, \dots, x_{t-1}$:

$$P(X_{t+1} = j \mid X_t = i, X_{t-1} = x_{t-1}, \dots, X_0 = x_0) = P(X_{t+1} = j \mid X_t = i) = p_{ij}.$$

The constant $p_{ij}$ is called the transition probability from state $i$ to state $j$.

Example (Illiac Suite, 1957): Set $S = \mathbb{Z}_{12}$, the twelve pitch classes (see EP04). Hiller and Isaacson assigned a transition probability $p_{ij}$ to each ordered pair of pitch classes $i, j$. To generate a melody: start at some pitch $X_0$, sample $X_1$ from row $X_0$ of the transition matrix, then sample $X_2$ from row $X_1$, and so on.

Narration: "How to generate a melody: stand on the current note, roll the dice according to the probabilities, jump to the next note. Repeat. That is a Markov chain. In essence, AI composition in 1957 was rolling weighted dice on Z12."

Definition 21.2 (Stochastic (Transition) Matrix)

The transition matrix $P$ of a Markov chain on $n$ states has entries $p_{ij} = P(X_{t+1} = j \mid X_t = i)$ satisfying:

  1. Non-negativity: $p_{ij} \ge 0$ for all $i, j$.
  2. Row-stochastic: $\sum_{j=1}^{n} p_{ij} = 1$ for every row $i$.

A matrix satisfying both conditions is called a (row-)stochastic matrix.

Worked example: A toy transition matrix on states $\{C, E, G\}$, for instance:

$$P = \begin{pmatrix} 0.2 & 0.3 & 0.5 \\ 0.3 & 0.2 & 0.5 \\ 0.3 & 0.3 & 0.4 \end{pmatrix}$$

Row 1: from C, probability 0.2 to stay on C, 0.3 to jump to E, 0.5 to jump to G. Each row sums to 1.
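The generation procedure is a few lines of code. A minimal sketch in Python/NumPy, using the toy matrix above (`generate_melody` is a hypothetical helper name, not from the original sources):

```python
import numpy as np

STATES = ["C", "E", "G"]
P = np.array([
    [0.2, 0.3, 0.5],   # from C
    [0.3, 0.2, 0.5],   # from E
    [0.3, 0.3, 0.4],   # from G
])

def generate_melody(P, start=0, length=16, seed=0):
    """Sample a melody by repeatedly drawing the next state from the
    current state's row of the transition matrix."""
    rng = np.random.default_rng(seed)
    state = start
    melody = [state]
    for _ in range(length - 1):
        state = rng.choice(len(P), p=P[state])  # weighted dice roll
        melody.append(state)
    return [STATES[s] for s in melody]

print(generate_melody(P))  # e.g. ['C', 'G', 'G', 'E', ...]
```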

1.2 Multi-Step Transitions

Theorem 21.1 (Chapman-Kolmogorov Equation)
Let $P$ be the transition matrix of a time-homogeneous Markov chain. The $m$-step transition probabilities satisfy

$$P(X_{t+m} = j \mid X_t = i) = (P^m)_{ij},$$

i.e., the probability of going from state $i$ to state $j$ in exactly $m$ steps is the $(i, j)$-entry of the matrix power $P^m$.
Proof.

Base case ($m = 1$): $P(X_{t+1} = j \mid X_t = i) = p_{ij} = (P^1)_{ij}$ by definition.

Inductive step: Assume $P(X_{t+m} = j \mid X_t = i) = (P^m)_{ij}$. For the $(m+1)$-step probability:

$$P(X_{t+m+1} = j \mid X_t = i) = \sum_{k} P(X_{t+m} = k \mid X_t = i)\, P(X_{t+m+1} = j \mid X_{t+m} = k).$$

The first equality uses the law of total probability and the Markov property. The inductive hypothesis gives $\sum_k (P^m)_{ik}\, p_{kj} = (P^{m+1})_{ij}$. Therefore $P(X_{t+m+1} = j \mid X_t = i) = (P^{m+1})_{ij}$.

Musical meaning: $(P^m)_{C,G}$ is the probability that a melody starting on C will be on G after exactly $m$ notes.
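Theorem 21.1 is easy to check numerically. A sketch reusing the toy matrix from §1.1 (`matrix_power` is NumPy's built-in matrix exponentiation):

```python
import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.2, 0.3, 0.5],
              [0.3, 0.2, 0.5],
              [0.3, 0.3, 0.4]])

# (P^m)[i, j] = probability of being at state j, m notes after state i.
P2 = matrix_power(P, 2)
print(P2[0, 2])  # P(melody starting on C is on G after 2 notes) = 0.45

# Sanity check: every power of a stochastic matrix is stochastic.
assert np.allclose(matrix_power(P, 8).sum(axis=1), 1.0)
```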

1.3 Stationary Distribution

Definition 21.3 (Stationary Distribution)
A probability vector $\pi = (\pi_1, \dots, \pi_n)$ (with $\pi_i \ge 0$ and $\sum_i \pi_i = 1$) is a stationary distribution of the Markov chain with transition matrix $P$ if

$$\pi P = \pi.$$

That is, $\pi$ is a left eigenvector of $P$ with eigenvalue 1.
Theorem 21.2 (Existence and Uniqueness of Stationary Distribution (Perron-Frobenius))

Let $P$ be the transition matrix of an irreducible, aperiodic Markov chain on a finite state space. Then:

  1. Existence: There exists a unique stationary distribution $\pi$ with $\pi P = \pi$ and $\pi_i > 0$ for all $i$.
  2. Convergence: For any initial distribution $\mu$, $\mu P^m \to \pi$ as $m \to \infty$.
  3. Long-run frequency: With probability 1, the fraction of time spent in state $i$ converges to $\pi_i$.
Proof.
(Sketch via Perron-Frobenius.) An irreducible chain has $P^m > 0$ (entry-wise) for some $m$ (aperiodicity + irreducibility ensure this). The Perron-Frobenius theorem for positive matrices states that such a matrix has a unique largest eigenvalue $\lambda_1 = 1$ (since $P$ is stochastic, the all-ones vector is a right eigenvector with eigenvalue 1), and the corresponding left eigenvector $\pi$ can be chosen with all entries positive and summing to 1. All other eigenvalues satisfy $|\lambda| < 1$, so $\mu P^m \to \pi$ as $m \to \infty$, which gives convergence and the frequency interpretation.

Musical meaning: The stationary distribution gives the long-run frequency of each pitch class. If the Markov chain on $\mathbb{Z}_{12}$ has $\pi_C = 0.12$ and $\pi_G = 0.10$, then in a long enough generated melody, roughly 12% of notes will be C and 10% will be G, regardless of which note started the chain. This is a stylistic fingerprint of the transition matrix.

Worked example: For the toy matrix above, solving $\pi P = \pi$ with $\sum_i \pi_i = 1$ gives the system $\pi_C = 0.2\pi_C + 0.3\pi_E + 0.3\pi_G$, etc. The solution is $\pi = (3/11, 3/11, 5/11) \approx (0.27, 0.27, 0.45)$: G dominates in the long run.
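In code, $\pi$ is the left eigenvector of $P$ with eigenvalue 1, or equivalently the limit of the rows of $P^m$. A sketch, again assuming the toy matrix:

```python
import numpy as np

P = np.array([[0.2, 0.3, 0.5],
              [0.3, 0.2, 0.5],
              [0.3, 0.3, 0.4]])

# Left eigenvector of P = right eigenvector of P transposed.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvalue 1 (Perron root)
pi /= pi.sum()                                   # normalize to a distribution
print(pi)  # approx [0.2727 0.2727 0.4545] = (3/11, 3/11, 5/11)

# Convergence check (Theorem 21.2): every row of P^m approaches pi.
print(np.linalg.matrix_power(P, 50)[0])
```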

1.4 Why Markov Chains Fail at Music

Narration: "How did the results sound? Locally like music, globally without memory, because a Markov chain only looks one step back. ... Returning choruses, thematic development: it captures none of that."

Definition 21.4 (Mixing Time)
The mixing time $t_{\mathrm{mix}}(\epsilon)$ of an irreducible aperiodic Markov chain is the smallest $t$ such that the total variation distance between $\mu P^t$ and $\pi$ is at most $\epsilon$ for any initial distribution $\mu$:

$$t_{\mathrm{mix}}(\epsilon) = \min\{\, t \ge 0 : \max_{\mu} \|\mu P^t - \pi\|_{\mathrm{TV}} \le \epsilon \,\}.$$

The mixing time measures how quickly the chain “forgets” its starting state. For a first-order chain on $\mathbb{Z}_{12}$, this is typically very small (a few steps), meaning the chain loses all memory of its beginning after a handful of notes.

One might try a $k$-th order Markov chain: condition on the previous $k$ notes instead of just one. But this requires a transition matrix of size $n^k \times n$: for $k = 16$ (a single musical phrase), the state space has $12^{16} \approx 1.8 \times 10^{17}$ states. The number of parameters grows exponentially, making estimation from finite training data intractable. This exponential blowup is the fundamental limitation.

Prop 21.1 (Exponential State Space of k-th Order Chains)
A $k$-th order Markov chain on an alphabet of size $n$ requires a transition matrix with $n^{k+1}$ entries. For $n = 12$ and $k = 16$ (two bars of eighth notes), this exceeds $10^{18}$.
Proof.
The state space of a $k$-th order chain is $S^k$ (all sequences of length $k$), with $n^k$ states. Each state has $n$ possible next-state probabilities (one per element of $S$), so the transition matrix has $n^k$ rows and $n - 1$ free parameters per row (the last is determined by the row-sum constraint, but the matrix still has $n^{k+1}$ entries). For $k = 16$, $n = 12$: $12^{16} \approx 1.8 \times 10^{17}$ states and $12^{17} \approx 2.2 \times 10^{18}$ entries.
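The blowup is easy to tabulate. A two-line sketch of the entry count $n^{k+1}$ for $n = 12$ pitch classes:

```python
# Transition-matrix entries n^(k+1) for a k-th order chain on n = 12 states.
n = 12
for k in (1, 2, 4, 8, 16):
    print(k, f"{n ** (k + 1):.2e}")  # k=16 -> ~2.2e18 entries
```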

Part II: Attention and Transformers

Narration: "The real breakthrough had to wait until 2017. The core of the Transformer is the attention mechanism. The key intuition is this: a Markov chain has a fixed table of transition probabilities, each row summing to one. The attention mechanism also has such a table, but it is not fixed: it is computed on the fly from the current content."

2.1 Scaled Dot-Product Attention

Definition 21.5 (Query, Key, Value Projections)
Let $X \in \mathbb{R}^{n \times d_{\text{model}}}$ be an input matrix whose rows are $n$ token embeddings of dimension $d_{\text{model}}$. Define three learned weight matrices $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$. The query, key, and value matrices are:

$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V.$$
Definition 21.6 (Scaled Dot-Product Attention)
Given $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$, the scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where softmax is applied row-wise: for a row $z$, $\mathrm{softmax}(z)_j = e^{z_j} / \sum_{l} e^{z_l}$.

Why the scaling: If the entries of $q$ and $k$ are independent with mean 0 and variance 1, then the dot product $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$. Without scaling, for large $d_k$ the dot products grow large in magnitude, pushing softmax into saturation (one entry near 1, all others near 0). Dividing by $\sqrt{d_k}$ restores unit variance, keeping softmax in its informative regime.
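A quick numeric illustration of this variance argument (a sketch; the dimension and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10_000, d_k))  # unit-variance query entries
k = rng.standard_normal((10_000, d_k))  # unit-variance key entries

dots = (q * k).sum(axis=1)              # 10,000 raw dot products
print(dots.var())                        # ~ d_k = 512: far too spread out
print((dots / np.sqrt(d_k)).var())       # ~ 1 after scaling
```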

Worked example: Let $n = 3$ (three tokens: C, E, G), $d_k = 2$. Suppose after projection we have, for instance:

$$Q = K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}.$$

Then $QK^\top = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{pmatrix}$. Dividing by $\sqrt{2}$ and applying row-wise softmax gives a matrix where each row sums to 1: an attention weight matrix.
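The same computation as a runnable sketch, using the toy $Q$, $K$ above and an arbitrary $V$:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Definition 21.6)."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))      # row-stochastic weight matrix
    return A @ V, A

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = Q.copy()
V = np.array([[0., 1.], [1., 0.], [1., 1.]])  # arbitrary value vectors

out, A = attention(Q, K, V)
print(A.sum(axis=1))  # [1. 1. 1.]: each row sums to 1 (see Theorem 21.3)
```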

2.2 Multi-Head Attention

Definition 21.7 (Multi-Head Attention)
Given $h$ attention heads, each with its own projections $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, define:

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q,\, XW_i^K,\, XW_i^V), \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$

where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is a learned output projection.

Why multiple heads: Each head can learn a different “type” of relationship. In music: one head might attend to rhythmic patterns, another to harmonic intervals, a third to melodic contour. Concatenating and projecting combines these views.
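A minimal multi-head sketch under Definition 21.7; the weights here are random placeholders standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 3, 8, 2
d_k = d_v = d_model // h

X = rng.standard_normal((n, d_model))        # token embeddings
W_Q = rng.standard_normal((h, d_model, d_k))  # per-head query projections
W_K = rng.standard_normal((h, d_model, d_k))  # per-head key projections
W_V = rng.standard_normal((h, d_model, d_v))  # per-head value projections
W_O = rng.standard_normal((h * d_v, d_model))  # output projection

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    A = softmax(Q @ K.T / np.sqrt(d_k))       # each head: its own stochastic matrix
    heads.append(A @ V)

out = np.concatenate(heads, axis=-1) @ W_O    # concat + project: (n, d_model)
print(out.shape)
```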

2.3 Attention as a Dynamic Transition Matrix

Narration: "In other words, the Transformer turns the fixed probability table into a dynamic one."

Theorem 21.3 (Softmax Attention Yields a Row-Stochastic Matrix)

Let $A = \mathrm{softmax}(QK^\top / \sqrt{d_k})$. Then:

  1. $A_{ij} > 0$ for all $i, j$.
  2. $\sum_j A_{ij} = 1$ for every row $i$.

That is, $A$ is a (strictly) positive row-stochastic matrix: it satisfies the same algebraic properties as a Markov chain transition matrix. However, unlike a Markov chain, $A$ depends on the input $X$ and changes at every position.

Proof.

(1) For any real vector $z$, the softmax function outputs $\mathrm{softmax}(z)_j = e^{z_j} / \sum_l e^{z_l}$. Since $e^{z_j} > 0$ for all $z_j \in \mathbb{R}$, we have $\mathrm{softmax}(z)_j > 0$.

(2) By definition, $\sum_j \mathrm{softmax}(z)_j = \sum_j e^{z_j} / \sum_l e^{z_l} = 1$.

Applying this to each row of $QK^\top / \sqrt{d_k}$ establishes both properties for the matrix $A$.

The key conceptual shift: a Markov chain uses a fixed stochastic matrix $P$, determined once from training data. Attention computes a different stochastic matrix $A(X)$ for each input sequence $X$. This allows the model to attend to distant positions when the content warrants it, capturing long-range dependencies like returning choruses and thematic development that Markov chains cannot.

| Property | Markov chain | Attention |
| --- | --- | --- |
| Stochastic matrix | Fixed $P$ | Dynamic $A(X)$ |
| Entries | $p_{ij} \ge 0$, rows sum to 1 | $A_{ij} > 0$, rows sum to 1 |
| Context window | 1 step (or $k$ for order-$k$) | Entire sequence |
| Parameters | $n^{k+1}$ | $O(d_{\text{model}}^2)$ per layer, independent of sequence length |

Part III: Diffusion Models (DDPM)

Narration: "The third leap is even more counterintuitive. ... Diffusion models ... start from a field of pure noise, remove the noise step by step, and gradually recover the structure of the music."

3.1 The Forward Noising Process

Definition 21.8 (DDPM Forward Process)
Let $x_0 \sim q(x_0)$ be a data sample (e.g., a mel-spectrogram of music). The forward (noising) process is a Markov chain $x_0, x_1, \dots, x_T$ defined by:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right),$$

where $\beta_1, \dots, \beta_T$ is a fixed noise schedule with $0 < \beta_t < 1$.

Narration: "The most counterintuitive part: the forward noising process of diffusion is, mathematically, a Markov chain. ... Except the state is no longer one of 12 discrete pitches but an entire field of continuous noise."

At each step, the signal is scaled down by $\sqrt{1 - \beta_t}$ and Gaussian noise of variance $\beta_t$ is added. For large $T$, $x_T$ is approximately pure Gaussian noise.

Worked example: For a one-dimensional signal $x_0 = 1.0$ with $\beta_1 = 0.01$:

$$x_1 = \sqrt{0.99} \cdot 1.0 + \sqrt{0.01}\,\varepsilon \approx 0.995 + 0.1\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1).$$

The signal is barely perturbed. After many steps, the signal is destroyed.
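One forward step is one line of code. A sketch for a scalar signal with a constant $\beta = 0.01$ (an illustrative choice, not a realistic schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, beta, rng):
    """One step of q(x_t | x_{t-1}): scale the signal down, add Gaussian noise."""
    return np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal()

x = 1.0
for t in range(1000):
    x = forward_step(x, beta=0.01, rng=rng)
print(x)  # after 1000 steps: indistinguishable from a N(0, 1) sample
```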

3.2 The Reparameterization Trick

Definition 21.9 (Cumulative Noise Schedule)
Define $\alpha_t = 1 - \beta_t$ and the cumulative product:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.$$

Note $\bar{\alpha}_t$ is decreasing in $t$, approaching 0 as $t \to T$.
Theorem 21.4 (DDPM Forward Reparameterization)
For any $t$, the marginal $q(x_t \mid x_0)$ can be written in closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right).$$

Equivalently, we can sample $x_t$ directly from $x_0$ without iterating through intermediate steps:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).$$
Proof.

Base case ($t = 1$): By definition, $x_1 = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\,\varepsilon$ since $\bar\alpha_1 = \alpha_1$ and $1 - \alpha_1 = \beta_1$.

Inductive step: Assume $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1 - \bar\alpha_{t-1}}\,\varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, I)$. Then $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\,\varepsilon_2$. Substituting the inductive hypothesis:

$$x_t = \sqrt{\alpha_t \bar\alpha_{t-1}}\, x_0 + \sqrt{\alpha_t (1 - \bar\alpha_{t-1})}\,\varepsilon_1 + \sqrt{1 - \alpha_t}\,\varepsilon_2.$$

Since $\varepsilon_1$ and $\varepsilon_2$ are independent standard Gaussians, the sum is Gaussian with mean 0 and variance:

$$\alpha_t (1 - \bar\alpha_{t-1}) + (1 - \alpha_t) = \alpha_t - \bar\alpha_t + 1 - \alpha_t = 1 - \bar\alpha_t,$$

using $\bar\alpha_t = \alpha_t \bar\alpha_{t-1}$. Therefore:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).$$

Musical significance: This result means we can jump directly from a clean spectrogram $x_0$ to any noise level $t$ in one step, which is essential for efficient training, where $t$ is sampled uniformly at random.
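Theorem 21.4 can be verified empirically: noise a signal step by step, then compare the resulting statistics with one-shot sampling from the closed form. A sketch (the schedule values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of alpha_t

x0, n_trials = 1.0, 100_000

# Iterative: apply q(x_t | x_{t-1}) a total of T times.
x = np.full(n_trials, x0)
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(n_trials)

# Closed form: x_T = sqrt(abar_T) * x0 + sqrt(1 - abar_T) * eps.
y = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(n_trials)

print(x.mean(), y.mean())  # both ~ sqrt(alpha_bar[-1]) * x0
print(x.var(), y.var())    # both ~ 1 - alpha_bar[-1]
```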

3.3 The Reverse (Denoising) Process

Definition 21.10 (DDPM Reverse Process)
The reverse process is a learned Markov chain running backward from $x_T \sim \mathcal{N}(0, I)$ to $x_0$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$

where $\mu_\theta$ is a neural network predicting the denoised mean, and $\Sigma_\theta$ is a fixed or learned variance schedule.
Theorem 21.5 (ELBO for Diffusion Models)

The log-likelihood of the data is bounded below by:

$$\log p_\theta(x_0) \ge \mathbb{E}_q\!\left[\log p_\theta(x_0 \mid x_1)\right] - \sum_{t=2}^{T} \mathbb{E}_q\!\left[D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\right] - D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big).$$

The key insight: the posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable (it is Gaussian), so each KL term can be computed in closed form. Training reduces to making each reverse step $p_\theta(x_{t-1} \mid x_t)$ match the true posterior $q(x_{t-1} \mid x_t, x_0)$.

Proof.
(Sketch.) Start with $\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T}$. Introduce the forward process $q(x_{1:T} \mid x_0)$ as a variational distribution:

$$\log p_\theta(x_0) \ge \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$

by Jensen's inequality. Factorizing both $p_\theta$ and $q$ as Markov chains, then rewriting each forward factor using Bayes' rule, $q(x_t \mid x_{t-1}) = q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0) / q(x_{t-1} \mid x_0)$, and telescoping yields the stated decomposition into KL divergence terms. Each $q(x_{t-1} \mid x_t, x_0)$ is Gaussian (as the product of two Gaussians), making the KL terms tractable.

In practice, Ho et al. (2020) showed that the simplified loss

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \varepsilon}\!\left[\left\|\varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\varepsilon,\ t\right)\right\|^2\right]$$

(where $\varepsilon_\theta$ is a neural network predicting the noise added at step $t$) works well and corresponds to a reweighted version of the ELBO.
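A training-step sketch of $L_{\text{simple}}$. The `model` argument stands in for any noise-prediction network; here a trivial zero-predicting lambda is used so the snippet runs without a deep-learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def simple_loss(model, x0_batch, rng):
    """L_simple: MSE between the true and predicted noise at a random step t."""
    t = rng.integers(1, T)                     # t sampled uniformly at random
    eps = rng.standard_normal(x0_batch.shape)  # the true noise
    # One-shot noising via the reparameterization (Theorem 21.4):
    x_t = np.sqrt(alpha_bar[t]) * x0_batch + np.sqrt(1 - alpha_bar[t]) * eps
    eps_pred = model(x_t, t)                   # network's noise prediction
    return np.mean((eps - eps_pred) ** 2)

# Stand-in "network" that predicts zero noise everywhere (a terrible model).
zero_model = lambda x_t, t: np.zeros_like(x_t)
x0 = rng.standard_normal((16, 64))             # a batch of 16 "spectrogram" rows
print(simple_loss(zero_model, x0, rng))        # ~1.0: the variance of eps itself
```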

3.4 Score Function and Score Matching

Definition 21.11 (Score Function)
The score function of a distribution $q(x_t)$ at noise level $t$ is the gradient of the log-density:

$$s(x_t, t) = \nabla_{x_t} \log q(x_t).$$

A neural network $s_\theta(x_t, t)$ trained to approximate the score is called a score network.
Theorem 21.6 (Score-Noise Equivalence)
The score function and the noise prediction network are related by:

$$s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}.$$

That is, predicting the score is equivalent to predicting the noise, up to a known scaling factor.
Proof.

From the reparameterization $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\varepsilon$, the conditional density is:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right) \propto \exp\!\left(-\frac{\|x_t - \sqrt{\bar\alpha_t}\, x_0\|^2}{2(1 - \bar\alpha_t)}\right).$$

Taking the gradient with respect to $x_t$:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t} = -\frac{\varepsilon}{\sqrt{1 - \bar\alpha_t}}.$$

Since the optimal noise predictor satisfies $\varepsilon_\theta(x_t, t) \approx \mathbb{E}[\varepsilon \mid x_t]$, we have $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}$. Marginalizing over $x_0$ extends this to the unconditional score $\nabla_{x_t} \log q(x_t)$.

Musical interpretation: The score points toward regions of higher probability density — i.e., toward more “music-like” spectrograms. Denoising by following the score is equivalent to gradually sculpting noise into music by climbing the probability landscape.
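The score-noise relation can be checked directly on the conditional density: for a known $x_0$, the analytic score of the Gaussian $q(x_t \mid x_0)$ equals $-\varepsilon / \sqrt{1 - \bar\alpha_t}$. A sketch with an arbitrary noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar_t = 0.5                    # some intermediate noise level
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# Analytic score of q(x_t | x0) = N(sqrt(abar) x0, (1 - abar) I):
score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1 - alpha_bar_t)

# Theorem 21.6: the same vector, written in terms of the noise.
score_from_eps = -eps / np.sqrt(1 - alpha_bar_t)

assert np.allclose(score, score_from_eps)
print(score[:3])
```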

3.5 The Landscape: Token vs Continuous

Narration: "The horizontal axis: open or closed source. The vertical axis: autoregressive over discrete tokens, or diffusion in a continuous space."

Definition 21.12 (Autoregressive (Token-Based) Generative Model)
Given a vocabulary $V$ (e.g., quantized audio tokens), an autoregressive model factorizes the joint distribution of a sequence $w_1, \dots, w_n$ as:

$$p(w_1, \dots, w_n) = \prod_{t=1}^{n} p(w_t \mid w_1, \dots, w_{t-1}).$$

Each conditional is parameterized by a Transformer that attends to all previous tokens. Sampling proceeds left-to-right.
Definition 21.13 (Continuous Diffusion Generative Model)
A diffusion model defines $p_\theta(x_0)$ implicitly via the reverse process (Definition 21.10). The data lives in a continuous space $\mathbb{R}^d$ (e.g., a mel-spectrogram or a latent embedding). Sampling starts from $x_T \sim \mathcal{N}(0, I)$ and iteratively denoises.

The fundamental mathematical distinction: autoregressive models operate on discrete sequences with an explicit likelihood via the chain rule of probability. Diffusion models operate on continuous vectors with an implicit likelihood accessible only through the ELBO. This is not merely an implementation choice — it determines the inductive bias: autoregressive models excel at capturing sequential dependencies but struggle with global coherence; diffusion models naturally capture global structure (the entire spectrogram is generated jointly) but must be coaxed into temporal coherence.


Numerical Examples

Example 21.1: Markov Chain Melody Generation

Consider a simple Markov chain on $\{C, D, E, G, A\}$ (a pentatonic subset) with a transition matrix $P$ whose rows are indexed by these five states.

Starting at $X_0 = C$: row 1 says we go to D with probability 0.3, to G with probability 0.3. Suppose we sample D. From D (row 2), we might sample E (probability 0.3). From E (row 3), we might sample G (probability 0.3). Generated melody fragment: C-D-E-G-…

The two-step transition entry is $(P^2)_{C,E} = \sum_k p_{C,k}\, p_{k,E}$; the path C→D→E alone contributes $p_{C,D}\, p_{D,E} = 0.3 \times 0.3 = 0.09$.

Example 21.2: Forward Diffusion on a Spectrogram Pixel

Take one pixel of a mel-spectrogram, say $x_0 = 1.0$ (log-amplitude). With the standard linear schedule for $\beta_t$ ($\beta_1 = 10^{-4}$ rising to $\beta_T = 0.02$, $T = 1000$, the Ho et al. defaults):

  • At $t = 1$: $\bar\alpha_1 = 0.9999$, so $x_1 \approx 1.00\, x_0 + 0.01\,\varepsilon$ (nearly unchanged).
  • At $t = 500$: $\bar\alpha_{500} \approx 0.08$, so $x_{500} \approx 0.28\, x_0 + 0.96\,\varepsilon$ (mostly noise).
  • At $t = 1000$: $\bar\alpha_{1000} \approx 4 \times 10^{-5}$, so $x_{1000} \approx 0.006\, x_0 + \varepsilon$ (pure noise).
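The $\bar\alpha_t$ values above come straight from a cumulative product over the schedule. A sketch, assuming the linear schedule of Ho et al. (2020):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule (Ho et al., 2020)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

for t in (1, 500, 1000):
    ab = alpha_bar[t - 1]
    print(t, f"signal={np.sqrt(ab):.3f}", f"noise={np.sqrt(1 - ab):.3f}")
# t=1:    signal ~ 1.000, noise ~ 0.010
# t=500:  signal ~ 0.280, noise ~ 0.960
# t=1000: signal ~ 0.006, noise ~ 1.000
```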

Musical Connection


Narration: "Three episodes, one question: how do we turn music into an object a probability model can operate on?"

The Representation War: The narration frames the current landscape as a battle between two representations:

Narration: "Today's model wars are really representation wars: should music be written as tokens, or sculpted in a continuous latent space?"

| Approach | Representation | Framework | Examples |
| --- | --- | --- | --- |
| Token-based | Discrete tokens (MIDI, audio codecs) | Autoregressive | MusicLM, MusicGen |
| Continuous | Spectrograms, latent vectors | Diffusion | Riffusion, Stable Audio, ACE-Step |

The token-based approach factors the joint distribution as a product of conditionals:

$$p(w_1, \dots, w_n) = \prod_{t=1}^{n} p(w_t \mid w_1, \dots, w_{t-1}),$$

and predicts one token at a time. The continuous approach treats the entire musical signal as a point in a high-dimensional space and sculpts it from noise by reversing a Markov chain. Both are valid factorizations of the same underlying probability $p(\text{music})$.

Narration: "Riffusion: first turn the sound into a spectrogram, a two-dimensional image. Then use an image diffusion model to generate new spectrograms…"

The arc from EP21 to EP23: This episode (EP21) asks how the math changed. EP22 examines the specific architectures (EnCodec, RVQ, DiT). EP23 addresses evaluation: how do we measure whether generated music is “good”?


Limits and Open Questions

  1. Markov chains are not dead: Higher-order Markov models with clever state representations (e.g., learned embeddings rather than raw $k$-grams) remain useful as baselines and components of hybrid systems.

  2. Attention complexity: Standard attention is $O(n^2)$ in sequence length $n$. For long musical pieces (minutes of audio tokenized at high resolution), this is a bottleneck. Linear attention, sparse attention, and state-space models (EP24) are active research directions.

  3. Diffusion speed: DDPM requires hundreds of denoising steps. Accelerated samplers (DDIM, consistency models, flow matching) reduce this to tens or single-digit steps — but the quality-speed tradeoff is not fully understood.

  4. The representation question is open: Whether music is better represented as discrete tokens or continuous signals is an empirical question with no theoretical resolution. Hybrid models (e.g., diffusion in a discrete-token latent space) blur the boundary.

Narration: "Seven models, each with its own tricks on the surface. But the underlying math offers only two paths: either slice music into tokens and predict them one by one, or treat music as a continuous space and sculpt it as a whole."

  5. Controllability vs quality: Markov chains offer full control (just edit the transition matrix) but poor quality. Diffusion models offer high quality but limited fine-grained control. Bridging this gap is the central engineering challenge of AI music generation.

Historical Timeline

| Year | Development | Mathematical core |
| --- | --- | --- |
| 1906 | Markov formalizes dependent random variables | Transition matrix |
| 1957 | Hiller & Isaacson: Illiac Suite | Markov chain on $\mathbb{Z}_{12}$ |
| 1986 | Elman / Jordan: recurrent neural networks | Hidden state, backpropagation through time |
| 1997 | Hochreiter & Schmidhuber: LSTM | Gated memory cells |
| 2017 | Vaswani et al.: Transformer | Softmax attention = dynamic stochastic matrix |
| 2019 | Music Transformer | Long-range attention for symbolic music |
| 2020 | Ho et al.: DDPM | Score matching, forward/reverse diffusion |
| 2022 | Riffusion | Image diffusion on spectrograms |
| 2023 | MusicGen, MusicLM | Token-based autoregressive audio generation |
| 2024 | ACE-Step, Stable Audio | Full-song diffusion in latent space |

Academic References

  1. Hiller, L. & Isaacson, L. (1957). Musical Composition with a High-Speed Digital Computer. Experimental Music, McGraw-Hill.
  2. Norris, J.R. (1997). Markov Chains. Cambridge University Press.
  3. Seneta, E. (2006). Non-negative Matrices and Markov Chains, 3rd ed. Springer.
  4. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS).
  5. Huang, C.-Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M. & Eck, D. (2019). “Music Transformer: Generating Music with Long-Term Structure.” ICLR 2019.
  6. Ho, J., Jain, A. & Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems 33 (NeurIPS).
  7. Song, Y. & Ermon, S. (2019). “Generative Modeling by Estimating Gradients of the Data Distribution.” NeurIPS 2019.
  8. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. & Poole, B. (2021). “Score-Based Generative Modeling through Stochastic Differential Equations.” ICLR 2021.
  9. Forsyth, S. (2022). Riffusion — Stable Diffusion for Real-Time Music Generation. https://www.riffusion.com
  10. Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y. & Defossez, A. (2023). “Simple and Controllable Music Generation.” NeurIPS 2023. (MusicGen)
  11. Agostinelli, A., Denk, T.I., Borsos, Z., Engel, J., Verzetti, M., Tagliasacchi, A., Marafioti, A., Ye, Z., Le Roux, J. & Frank, J. (2023). “MusicLM: Generating Music From Text.” arXiv:2301.11325.
  12. Levin, D.A., Peres, Y. & Wilmer, E.L. (2009). Markov Chains and Mixing Times. AMS.
  13. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” ICML 2015.
  14. Hochreiter, S. & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation 9(8), 1735-1780.
  15. Briot, J.-P., Hadjeres, G. & Pachet, F. (2020). Deep Learning Techniques for Music Generation. Springer.