
EP21: From Markov to Diffusion — Sixty Years of AI Composition

Markov chains, Transformer attention, score matching/diffusion
8:35 · Statistics/ML · Information Theory

Overview

Three mathematical frameworks have dominated algorithmic music composition over sixty years: Markov chains (1957), Transformer attention (2017), and diffusion models (2020). Each replaces its predecessor not by discarding the core idea — sampling from a learned probability distribution over musical events — but by changing which probability distribution and how it is parameterized.

Narration: "They are solving the same mathematical problem. The answer to that problem has changed only three times in seventy years: from dice, to attention, to noise."

This companion page formalizes the mathematics the video sketches: transition matrices on $\mathbb{Z}_{12}$, the softmax attention mechanism as a content-dependent stochastic matrix, and the DDPM forward/reverse processes with score matching.


Prerequisites

EP04: pitch classes and the cyclic group ℤ₁₂.

Part I: Markov Chains on ℤ₁₂

Narration: "In 1957, Hiller and Isaacson ran an experiment on the Illiac computer at the University of Illinois. They arranged the twelve pitches in a circle. As EP04 explained, this is Z12, the cyclic group of order twelve. Then they assigned a transition probability to each pair of notes."

1.1 Definitions

Definition 21.1 (Time-Homogeneous Markov Chain on a Finite State Space)

Let $S = \{s_1, \dots, s_n\}$ be a finite set of states. A time-homogeneous Markov chain on $S$ is a sequence of random variables $X_0, X_1, X_2, \dots$ taking values in $S$ such that for all $t \ge 0$ and all states $i, j, x_0, \dots, x_{t-1}$:

$$P(X_{t+1} = j \mid X_t = i, X_{t-1} = x_{t-1}, \dots, X_0 = x_0) = P(X_{t+1} = j \mid X_t = i) = p_{ij}.$$

The constant $p_{ij}$ is called the transition probability from state $i$ to state $j$.

Example (Illiac Suite, 1957): Set $S = \mathbb{Z}_{12}$, the twelve pitch classes (see EP04). Hiller and Isaacson assigned a transition probability $p_{ij}$ to each ordered pair of pitch classes $i, j$. To generate a melody: start at some pitch $X_0$, sample $X_1$ from row $X_0$ of the transition matrix, then sample $X_2$ from row $X_1$, and so on.

Narration: "How to generate a melody: stand on the current note, roll the dice according to the probabilities, jump to the next note. Repeat. That is a Markov chain. In essence, AI composition in 1957 was rolling weighted dice on Z12."

Definition 21.2 (Stochastic (Transition) Matrix)

The transition matrix $P$ of a Markov chain on $n$ states has entries $p_{ij} = P(X_{t+1} = j \mid X_t = i)$ satisfying:

  1. Non-negativity: $p_{ij} \ge 0$ for all $i, j$.
  2. Row-stochastic: $\sum_{j=1}^{n} p_{ij} = 1$ for every row $i$.

A matrix satisfying both conditions is called a (row-)stochastic matrix.

Worked example: A toy transition matrix on states $\{C, E, G\}$, for instance:

$$P = \begin{pmatrix} 0.2 & 0.3 & 0.5 \\ 0.3 & 0.2 & 0.5 \\ 0.3 & 0.3 & 0.4 \end{pmatrix}$$

Row 1: from C, probability 0.2 to stay on C, 0.3 to jump to E, 0.5 to jump to G. Each row sums to 1.
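The generation procedure is a few lines of code. A minimal sketch in Python/NumPy, using the toy matrix above (`generate_melody` is a hypothetical helper name, not from the original sources):

```python
import numpy as np

STATES = ["C", "E", "G"]
P = np.array([
    [0.2, 0.3, 0.5],   # from C
    [0.3, 0.2, 0.5],   # from E
    [0.3, 0.3, 0.4],   # from G
])

def generate_melody(P, start=0, length=16, seed=0):
    """Sample a melody by repeatedly drawing the next state from the
    current state's row of the transition matrix."""
    rng = np.random.default_rng(seed)
    state = start
    melody = [state]
    for _ in range(length - 1):
        state = rng.choice(len(P), p=P[state])  # weighted dice roll
        melody.append(state)
    return [STATES[s] for s in melody]

print(generate_melody(P))  # e.g. ['C', 'G', 'G', 'E', ...]
```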

1.2 Multi-Step Transitions

Theorem 21.1 (Chapman-Kolmogorov Equation)
Let $P$ be the transition matrix of a time-homogeneous Markov chain. The $m$-step transition probabilities satisfy

$$P(X_{t+m} = j \mid X_t = i) = (P^m)_{ij},$$

i.e., the probability of going from state $i$ to state $j$ in exactly $m$ steps is the $(i, j)$-entry of the matrix power $P^m$.
Proof.

Base case ($m = 1$): $P(X_{t+1} = j \mid X_t = i) = p_{ij} = (P^1)_{ij}$ by definition.

Inductive step: Assume $P(X_{t+m} = j \mid X_t = i) = (P^m)_{ij}$. For the $(m+1)$-step probability:

$$P(X_{t+m+1} = j \mid X_t = i) = \sum_{k} P(X_{t+m} = k \mid X_t = i)\, P(X_{t+m+1} = j \mid X_{t+m} = k).$$

The first equality uses the law of total probability and the Markov property. The inductive hypothesis gives $\sum_k (P^m)_{ik}\, p_{kj} = (P^{m+1})_{ij}$. Therefore $P(X_{t+m+1} = j \mid X_t = i) = (P^{m+1})_{ij}$.

Musical meaning: $(P^m)_{C,G}$ is the probability that a melody starting on C will be on G after exactly $m$ notes.
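Theorem 21.1 is easy to check numerically. A sketch reusing the toy matrix from §1.1 (`matrix_power` is NumPy's built-in matrix exponentiation):

```python
import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.2, 0.3, 0.5],
              [0.3, 0.2, 0.5],
              [0.3, 0.3, 0.4]])

# (P^m)[i, j] = probability of being at state j, m notes after state i.
P2 = matrix_power(P, 2)
print(P2[0, 2])  # P(melody starting on C is on G after 2 notes) = 0.45

# Sanity check: every power of a stochastic matrix is stochastic.
assert np.allclose(matrix_power(P, 8).sum(axis=1), 1.0)
```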

1.3 Stationary Distribution

Definition 21.3 (Stationary Distribution)
A probability vector $\pi = (\pi_1, \dots, \pi_n)$ (with $\pi_i \ge 0$ and $\sum_i \pi_i = 1$) is a stationary distribution of the Markov chain with transition matrix $P$ if

$$\pi P = \pi.$$

That is, $\pi$ is a left eigenvector of $P$ with eigenvalue 1.
Theorem 21.2 (Existence and Uniqueness of Stationary Distribution (Perron-Frobenius))

Let $P$ be the transition matrix of an irreducible, aperiodic Markov chain on a finite state space. Then:

  1. Existence: There exists a unique stationary distribution $\pi$ with $\pi P = \pi$ and $\pi_i > 0$ for all $i$.
  2. Convergence: For any initial distribution $\mu$, $\mu P^m \to \pi$ as $m \to \infty$.
  3. Long-run frequency: With probability 1, the fraction of time spent in state $i$ converges to $\pi_i$.
Proof.
(Sketch via Perron-Frobenius.) An irreducible chain has $P^m > 0$ (entry-wise) for some $m$ (aperiodicity + irreducibility ensure this). The Perron-Frobenius theorem for positive matrices states that such a matrix has a unique largest eigenvalue $\lambda_1 = 1$ (since $P$ is stochastic, the all-ones vector is a right eigenvector with eigenvalue 1), and the corresponding left eigenvector $\pi$ can be chosen with all entries positive and summing to 1. All other eigenvalues satisfy $|\lambda| < 1$, so $\mu P^m \to \pi$ as $m \to \infty$, which gives convergence and the frequency interpretation.

Musical meaning: The stationary distribution gives the long-run frequency of each pitch class. If the Markov chain on $\mathbb{Z}_{12}$ has $\pi_C = 0.12$ and $\pi_G = 0.10$, then in a long enough generated melody, roughly 12% of notes will be C and 10% will be G, regardless of which note started the chain. This is a stylistic fingerprint of the transition matrix.

Worked example: For the toy matrix above, solving $\pi P = \pi$ with $\sum_i \pi_i = 1$ gives the system $\pi_C = 0.2\pi_C + 0.3\pi_E + 0.3\pi_G$, etc. The solution is $\pi = (3/11, 3/11, 5/11) \approx (0.27, 0.27, 0.45)$: G dominates in the long run.
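In code, $\pi$ is the left eigenvector of $P$ with eigenvalue 1, or equivalently the limit of the rows of $P^m$. A sketch, again assuming the toy matrix:

```python
import numpy as np

P = np.array([[0.2, 0.3, 0.5],
              [0.3, 0.2, 0.5],
              [0.3, 0.3, 0.4]])

# Left eigenvector of P = right eigenvector of P transposed.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvalue 1 (Perron root)
pi /= pi.sum()                                   # normalize to a distribution
print(pi)  # approx [0.2727 0.2727 0.4545] = (3/11, 3/11, 5/11)

# Convergence check (Theorem 21.2): every row of P^m approaches pi.
print(np.linalg.matrix_power(P, 50)[0])
```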

1.4 Why Markov Chains Fail at Music

Narration: "How did the results sound? Locally like music, globally without memory, because a Markov chain only looks one step back. ... Returning choruses, thematic development: it captures none of that."

Definition 21.4 (Mixing Time)
The mixing time $t_{\mathrm{mix}}(\epsilon)$ of an irreducible aperiodic Markov chain is the smallest $t$ such that the total variation distance between $\mu P^t$ and $\pi$ is at most $\epsilon$ for any initial distribution $\mu$:

$$t_{\mathrm{mix}}(\epsilon) = \min\{\, t \ge 0 : \max_{\mu} \|\mu P^t - \pi\|_{\mathrm{TV}} \le \epsilon \,\}.$$

The mixing time measures how quickly the chain “forgets” its starting state. For a first-order chain on $\mathbb{Z}_{12}$, this is typically very small (a few steps), meaning the chain loses all memory of its beginning after a handful of notes.

One might try a $k$-th order Markov chain: condition on the previous $k$ notes instead of just one. But this requires a transition matrix of size $n^k \times n$: for $k = 16$ (a single musical phrase), the state space has $12^{16} \approx 1.8 \times 10^{17}$ states. The number of parameters grows exponentially, making estimation from finite training data intractable. This exponential blowup is the fundamental limitation.

Prop 21.1 (Exponential State Space of k-th Order Chains)
A $k$-th order Markov chain on an alphabet of size $n$ requires a transition matrix with $n^{k+1}$ entries. For $n = 12$ and $k = 16$ (two bars of eighth notes), this exceeds $10^{18}$.
Proof.
The state space of a $k$-th order chain is $S^k$ (all sequences of length $k$), with $n^k$ states. Each state has $n$ possible next-state probabilities (one per element of $S$), so the transition matrix has $n^k$ rows and $n - 1$ free parameters per row (the last is determined by the row-sum constraint, but the matrix still has $n^{k+1}$ entries). For $k = 16$, $n = 12$: $12^{16} \approx 1.8 \times 10^{17}$ states and $12^{17} \approx 2.2 \times 10^{18}$ entries.
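The blowup is easy to tabulate. A two-line sketch of the entry count $n^{k+1}$ for $n = 12$ pitch classes:

```python
# Transition-matrix entries n^(k+1) for a k-th order chain on n = 12 states.
n = 12
for k in (1, 2, 4, 8, 16):
    print(k, f"{n ** (k + 1):.2e}")  # k=16 -> ~2.2e18 entries
```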

Part II: Attention and Transformers

Narration: "The real breakthrough had to wait until 2017. The core of the Transformer is the attention mechanism. The key intuition is this: a Markov chain has a fixed table of transition probabilities, each row summing to one. The attention mechanism also has such a table, but it is not fixed: it is computed on the fly from the current content."

2.1 Scaled Dot-Product Attention

Definition 21.5 (Query, Key, Value Projections)
Let $X \in \mathbb{R}^{n \times d_{\text{model}}}$ be an input matrix whose rows are $n$ token embeddings of dimension $d_{\text{model}}$. Define three learned weight matrices $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$. The query, key, and value matrices are:

$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V.$$
Definition 21.6 (Scaled Dot-Product Attention)
Given $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$, the scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where softmax is applied row-wise: for a row $z$, $\mathrm{softmax}(z)_j = e^{z_j} / \sum_{l} e^{z_l}$.

Why the scaling: If the entries of $q$ and $k$ are independent with mean 0 and variance 1, then the dot product $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$. Without scaling, for large $d_k$ the dot products grow large in magnitude, pushing softmax into saturation (one entry near 1, all others near 0). Dividing by $\sqrt{d_k}$ restores unit variance, keeping softmax in its informative regime.
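A quick numeric illustration of this variance argument (a sketch; the dimension and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10_000, d_k))  # unit-variance query entries
k = rng.standard_normal((10_000, d_k))  # unit-variance key entries

dots = (q * k).sum(axis=1)              # 10,000 raw dot products
print(dots.var())                        # ~ d_k = 512: far too spread out
print((dots / np.sqrt(d_k)).var())       # ~ 1 after scaling
```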

Worked example: Let $n = 3$ (three tokens: C, E, G), $d_k = 2$. Suppose after projection we have, for instance:

$$Q = K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}.$$

Then $QK^\top = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{pmatrix}$. Dividing by $\sqrt{2}$ and applying row-wise softmax gives a matrix where each row sums to 1: an attention weight matrix.
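The same computation as a runnable sketch, using the toy $Q$, $K$ above and an arbitrary $V$:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Definition 21.6)."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))      # row-stochastic weight matrix
    return A @ V, A

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = Q.copy()
V = np.array([[0., 1.], [1., 0.], [1., 1.]])  # arbitrary value vectors

out, A = attention(Q, K, V)
print(A.sum(axis=1))  # [1. 1. 1.]: each row sums to 1 (see Theorem 21.3)
```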

2.2 Multi-Head Attention

Definition 21.7 (Multi-Head Attention)
Given $h$ attention heads, each with its own projections $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, define:

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q,\, XW_i^K,\, XW_i^V), \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$

where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is a learned output projection.

Why multiple heads: Each head can learn a different “type” of relationship. In music: one head might attend to rhythmic patterns, another to harmonic intervals, a third to melodic contour. Concatenating and projecting combines these views.
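A minimal multi-head sketch under Definition 21.7; the weights here are random placeholders standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 3, 8, 2
d_k = d_v = d_model // h

X = rng.standard_normal((n, d_model))        # token embeddings
W_Q = rng.standard_normal((h, d_model, d_k))  # per-head query projections
W_K = rng.standard_normal((h, d_model, d_k))  # per-head key projections
W_V = rng.standard_normal((h, d_model, d_v))  # per-head value projections
W_O = rng.standard_normal((h * d_v, d_model))  # output projection

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    A = softmax(Q @ K.T / np.sqrt(d_k))       # each head: its own stochastic matrix
    heads.append(A @ V)

out = np.concatenate(heads, axis=-1) @ W_O    # concat + project: (n, d_model)
print(out.shape)
```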

2.3 Attention as a Dynamic Transition Matrix

Narration: "In other words, the Transformer turns the fixed probability table into a dynamic one."

Theorem 21.3 (Softmax Attention Yields a Row-Stochastic Matrix)

Let $A = \mathrm{softmax}(QK^\top / \sqrt{d_k})$. Then:

  1. $A_{ij} > 0$ for all $i, j$.
  2. $\sum_j A_{ij} = 1$ for every row $i$.

That is, $A$ is a (strictly) positive row-stochastic matrix: it satisfies the same algebraic properties as a Markov chain transition matrix. However, unlike a Markov chain, $A$ depends on the input $X$ and changes at every position.

Proof.

(1) For any real vector $z$, the softmax function outputs $\mathrm{softmax}(z)_j = e^{z_j} / \sum_l e^{z_l}$. Since $e^{z_j} > 0$ for all $z_j \in \mathbb{R}$, we have $\mathrm{softmax}(z)_j > 0$.

(2) By definition, $\sum_j \mathrm{softmax}(z)_j = \sum_j e^{z_j} / \sum_l e^{z_l} = 1$.

Applying this to each row of $QK^\top / \sqrt{d_k}$ establishes both properties for the matrix $A$.

The key conceptual shift: a Markov chain uses a fixed stochastic matrix $P$, determined once from training data. Attention computes a different stochastic matrix $A(X)$ for each input sequence $X$. This allows the model to attend to distant positions when the content warrants it, capturing long-range dependencies like returning choruses and thematic development that Markov chains cannot.

| Property | Markov chain | Attention |
| --- | --- | --- |
| Stochastic matrix | Fixed $P$ | Dynamic $A(X)$ |
| Entries | $p_{ij} \ge 0$, rows sum to 1 | $A_{ij} > 0$, rows sum to 1 |
| Context window | 1 step (or $k$ for order-$k$) | Entire sequence |
| Parameters | $n^{k+1}$ | $O(d_{\text{model}}^2)$ per layer, independent of sequence length |

Part III: Diffusion Models (DDPM)

Narration: "The third leap is even more counterintuitive. ... Diffusion models ... start from a field of pure noise, remove the noise step by step, and gradually recover the structure of the music."

3.1 The Forward Noising Process

Definition 21.8 (DDPM Forward Process)
Let $x_0 \sim q(x_0)$ be a data sample (e.g., a mel-spectrogram of music). The forward (noising) process is a Markov chain $x_0, x_1, \dots, x_T$ defined by:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right),$$

where $\beta_1, \dots, \beta_T$ is a fixed noise schedule with $0 < \beta_t < 1$.

Narration: "The most counterintuitive part: the forward noising process of diffusion is, mathematically, a Markov chain. ... Except the state is no longer one of 12 discrete pitches but an entire field of continuous noise."

At each step, the signal is scaled down by $\sqrt{1 - \beta_t}$ and Gaussian noise of variance $\beta_t$ is added. For large $T$, $x_T$ is approximately pure Gaussian noise.

Worked example: For a one-dimensional signal $x_0 = 1.0$ with $\beta_1 = 0.01$:

$$x_1 = \sqrt{0.99} \cdot 1.0 + \sqrt{0.01}\,\varepsilon \approx 0.995 + 0.1\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1).$$

The signal is barely perturbed. After many steps, the signal is destroyed.
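One forward step is one line of code. A sketch for a scalar signal with a constant $\beta = 0.01$ (an illustrative choice, not a realistic schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, beta, rng):
    """One step of q(x_t | x_{t-1}): scale the signal down, add Gaussian noise."""
    return np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal()

x = 1.0
for t in range(1000):
    x = forward_step(x, beta=0.01, rng=rng)
print(x)  # after 1000 steps: indistinguishable from a N(0, 1) sample
```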

3.2 The Reparameterization Trick

Definition 21.9 (Cumulative Noise Schedule)
Define $\alpha_t = 1 - \beta_t$ and the cumulative product:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.$$

Note $\bar{\alpha}_t$ is decreasing in $t$, approaching 0 as $t \to T$.
Theorem 21.4 (DDPM Forward Reparameterization)
For any $t$, the marginal $q(x_t \mid x_0)$ can be written in closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right).$$

Equivalently, we can sample $x_t$ directly from $x_0$ without iterating through intermediate steps:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).$$
Proof.

Base case ($t = 1$): By definition, $x_1 = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\,\varepsilon$ since $\bar\alpha_1 = \alpha_1$ and $1 - \alpha_1 = \beta_1$.

Inductive step: Assume $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1 - \bar\alpha_{t-1}}\,\varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, I)$. Then $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\,\varepsilon_2$. Substituting the inductive hypothesis:

$$x_t = \sqrt{\alpha_t \bar\alpha_{t-1}}\, x_0 + \sqrt{\alpha_t (1 - \bar\alpha_{t-1})}\,\varepsilon_1 + \sqrt{1 - \alpha_t}\,\varepsilon_2.$$

Since $\varepsilon_1$ and $\varepsilon_2$ are independent standard Gaussians, the sum is Gaussian with mean 0 and variance:

$$\alpha_t (1 - \bar\alpha_{t-1}) + (1 - \alpha_t) = \alpha_t - \bar\alpha_t + 1 - \alpha_t = 1 - \bar\alpha_t,$$

using $\bar\alpha_t = \alpha_t \bar\alpha_{t-1}$. Therefore:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).$$

Musical significance: This result means we can jump directly from a clean spectrogram $x_0$ to any noise level $t$ in one step, which is essential for efficient training, where $t$ is sampled uniformly at random.
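Theorem 21.4 can be verified empirically: noise a signal step by step, then compare the resulting statistics with one-shot sampling from the closed form. A sketch (the schedule values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of alpha_t

x0, n_trials = 1.0, 100_000

# Iterative: apply q(x_t | x_{t-1}) a total of T times.
x = np.full(n_trials, x0)
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(n_trials)

# Closed form: x_T = sqrt(abar_T) * x0 + sqrt(1 - abar_T) * eps.
y = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(n_trials)

print(x.mean(), y.mean())  # both ~ sqrt(alpha_bar[-1]) * x0
print(x.var(), y.var())    # both ~ 1 - alpha_bar[-1]
```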

3.3 The Reverse (Denoising) Process

Definition 21.10 (DDPM Reverse Process)
The reverse process is a learned Markov chain running backward from $x_T \sim \mathcal{N}(0, I)$ to $x_0$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$

where $\mu_\theta$ is a neural network predicting the denoised mean, and $\Sigma_\theta$ is a fixed or learned variance schedule.
Theorem 21.5 (ELBO for Diffusion Models)

The log-likelihood of the data is bounded below by:

$$\log p_\theta(x_0) \ge \mathbb{E}_q\!\left[\log p_\theta(x_0 \mid x_1)\right] - \sum_{t=2}^{T} \mathbb{E}_q\!\left[D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\right] - D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big).$$

The key insight: the posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable (it is Gaussian), so each KL term can be computed in closed form. Training reduces to making each reverse step $p_\theta(x_{t-1} \mid x_t)$ match the true posterior $q(x_{t-1} \mid x_t, x_0)$.

Proof.
(Sketch.) Start with $\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T}$. Introduce the forward process $q(x_{1:T} \mid x_0)$ as a variational distribution:

$$\log p_\theta(x_0) \ge \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$

by Jensen's inequality. Factorizing both $p_\theta$ and $q$ as Markov chains, then rewriting each forward factor using Bayes' rule, $q(x_t \mid x_{t-1}) = q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0) / q(x_{t-1} \mid x_0)$, and telescoping yields the stated decomposition into KL divergence terms. Each $q(x_{t-1} \mid x_t, x_0)$ is Gaussian (as the product of two Gaussians), making the KL terms tractable.

In practice, Ho et al. (2020) showed that the simplified loss

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \varepsilon}\!\left[\left\|\varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\varepsilon,\ t\right)\right\|^2\right]$$

(where $\varepsilon_\theta$ is a neural network predicting the noise added at step $t$) works well and corresponds to a reweighted version of the ELBO.
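A training-step sketch of $L_{\text{simple}}$. The `model` argument stands in for any noise-prediction network; here a trivial zero-predicting lambda is used so the snippet runs without a deep-learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def simple_loss(model, x0_batch, rng):
    """L_simple: MSE between the true and predicted noise at a random step t."""
    t = rng.integers(1, T)                     # t sampled uniformly at random
    eps = rng.standard_normal(x0_batch.shape)  # the true noise
    # One-shot noising via the reparameterization (Theorem 21.4):
    x_t = np.sqrt(alpha_bar[t]) * x0_batch + np.sqrt(1 - alpha_bar[t]) * eps
    eps_pred = model(x_t, t)                   # network's noise prediction
    return np.mean((eps - eps_pred) ** 2)

# Stand-in "network" that predicts zero noise everywhere (a terrible model).
zero_model = lambda x_t, t: np.zeros_like(x_t)
x0 = rng.standard_normal((16, 64))             # a batch of 16 "spectrogram" rows
print(simple_loss(zero_model, x0, rng))        # ~1.0: the variance of eps itself
```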

3.4 Score Function and Score Matching

Definition 21.11 (Score Function)
The score function of a distribution $q(x_t)$ at noise level $t$ is the gradient of the log-density:

$$s(x_t, t) = \nabla_{x_t} \log q(x_t).$$

A neural network $s_\theta(x_t, t)$ trained to approximate the score is called a score network.
Theorem 21.6 (Score-Noise Equivalence)
The score function and the noise prediction network are related by:

$$s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}.$$

That is, predicting the score is equivalent to predicting the noise, up to a known scaling factor.
Proof.

From the reparameterization $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\varepsilon$, the conditional density is:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right) \propto \exp\!\left(-\frac{\|x_t - \sqrt{\bar\alpha_t}\, x_0\|^2}{2(1 - \bar\alpha_t)}\right).$$

Taking the gradient with respect to $x_t$:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t} = -\frac{\varepsilon}{\sqrt{1 - \bar\alpha_t}}.$$

Since the optimal noise predictor satisfies $\varepsilon_\theta(x_t, t) \approx \mathbb{E}[\varepsilon \mid x_t]$, we have $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}$. Marginalizing over $x_0$ extends this to the unconditional score $\nabla_{x_t} \log q(x_t)$.

Musical interpretation: The score points toward regions of higher probability density — i.e., toward more “music-like” spectrograms. Denoising by following the score is equivalent to gradually sculpting noise into music by climbing the probability landscape.
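The score-noise relation can be checked directly on the conditional density: for a known $x_0$, the analytic score of the Gaussian $q(x_t \mid x_0)$ equals $-\varepsilon / \sqrt{1 - \bar\alpha_t}$. A sketch with an arbitrary noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar_t = 0.5                    # some intermediate noise level
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# Analytic score of q(x_t | x0) = N(sqrt(abar) x0, (1 - abar) I):
score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1 - alpha_bar_t)

# Theorem 21.6: the same vector, written in terms of the noise.
score_from_eps = -eps / np.sqrt(1 - alpha_bar_t)

assert np.allclose(score, score_from_eps)
print(score[:3])
```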

3.5 The Landscape: Token vs Continuous

Narration: "The horizontal axis: open or closed source. The vertical axis: autoregressive over discrete tokens, or diffusion in a continuous space."

Definition 21.12 (Autoregressive (Token-Based) Generative Model)
Given a vocabulary $V$ (e.g., quantized audio tokens), an autoregressive model factorizes the joint distribution of a sequence $w_1, \dots, w_n$ as:

$$p(w_1, \dots, w_n) = \prod_{t=1}^{n} p(w_t \mid w_1, \dots, w_{t-1}).$$

Each conditional is parameterized by a Transformer that attends to all previous tokens. Sampling proceeds left-to-right.
Definition 21.13 (Continuous Diffusion Generative Model)
A diffusion model defines $p_\theta(x_0)$ implicitly via the reverse process (Definition 21.10). The data lives in a continuous space $\mathbb{R}^d$ (e.g., a mel-spectrogram or a latent embedding). Sampling starts from $x_T \sim \mathcal{N}(0, I)$ and iteratively denoises.

The fundamental mathematical distinction: autoregressive models operate on discrete sequences with an explicit likelihood via the chain rule of probability. Diffusion models operate on continuous vectors with an implicit likelihood accessible only through the ELBO. This is not merely an implementation choice — it determines the inductive bias: autoregressive models excel at capturing sequential dependencies but struggle with global coherence; diffusion models naturally capture global structure (the entire spectrogram is generated jointly) but must be coaxed into temporal coherence.


Numerical Examples

Example 21.1: Markov Chain Melody Generation

Consider a simple Markov chain on $\{C, D, E, G, A\}$ (a pentatonic subset) with a transition matrix $P$ whose rows are indexed by these five states.

Starting at $X_0 = C$: row 1 says we go to D with probability 0.3, to G with probability 0.3. Suppose we sample D. From D (row 2), we might sample E (probability 0.3). From E (row 3), we might sample G (probability 0.3). Generated melody fragment: C-D-E-G-…

The two-step transition entry is $(P^2)_{C,E} = \sum_k p_{C,k}\, p_{k,E}$; the path C→D→E alone contributes $p_{C,D}\, p_{D,E} = 0.3 \times 0.3 = 0.09$.

Example 21.2: Forward Diffusion on a Spectrogram Pixel

Take one pixel of a mel-spectrogram, say $x_0 = 1.0$ (log-amplitude). With the standard linear schedule for $\beta_t$ ($\beta_1 = 10^{-4}$ rising to $\beta_T = 0.02$, $T = 1000$, the Ho et al. defaults):

  • At $t = 1$: $\bar\alpha_1 = 0.9999$, so $x_1 \approx 1.00\, x_0 + 0.01\,\varepsilon$ (nearly unchanged).
  • At $t = 500$: $\bar\alpha_{500} \approx 0.08$, so $x_{500} \approx 0.28\, x_0 + 0.96\,\varepsilon$ (mostly noise).
  • At $t = 1000$: $\bar\alpha_{1000} \approx 4 \times 10^{-5}$, so $x_{1000} \approx 0.006\, x_0 + \varepsilon$ (pure noise).
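The $\bar\alpha_t$ values above come straight from a cumulative product over the schedule. A sketch, assuming the linear schedule of Ho et al. (2020):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule (Ho et al., 2020)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

for t in (1, 500, 1000):
    ab = alpha_bar[t - 1]
    print(t, f"signal={np.sqrt(ab):.3f}", f"noise={np.sqrt(1 - ab):.3f}")
# t=1:    signal ~ 1.000, noise ~ 0.010
# t=500:  signal ~ 0.280, noise ~ 0.960
# t=1000: signal ~ 0.006, noise ~ 1.000
```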

Musical Connection


Narration: "Three episodes, one question: how do we turn music into an object a probability model can operate on?"

The Representation War: The narration frames the current landscape as a battle between two representations:

Narration: "Today's model wars are really representation wars: should music be written as tokens, or sculpted in a continuous latent space?"

| Approach | Representation | Framework | Examples |
| --- | --- | --- | --- |
| Token-based | Discrete tokens (MIDI, audio codecs) | Autoregressive | MusicLM, MusicGen |
| Continuous | Spectrograms, latent vectors | Diffusion | Riffusion, Stable Audio, ACE-Step |

The token-based approach factors the joint distribution as a product of conditionals:

$$p(w_1, \dots, w_n) = \prod_{t=1}^{n} p(w_t \mid w_1, \dots, w_{t-1}),$$

and predicts one token at a time. The continuous approach treats the entire musical signal as a point in a high-dimensional space and sculpts it from noise by reversing a Markov chain. Both are valid factorizations of the same underlying probability $p(\text{music})$.

Narration: "Riffusion: first turn the sound into a spectrogram, a two-dimensional image. Then use an image diffusion model to generate new spectrograms…"

The arc from EP21 to EP23: This episode (EP21) asks how the math changed. EP22 examines the specific architectures (EnCodec, RVQ, DiT). EP23 addresses evaluation: how do we measure whether generated music is “good”?


Limits and Open Questions

  1. Markov chains are not dead: Higher-order Markov models with clever state representations (e.g., learned embeddings rather than raw $k$-grams) remain useful as baselines and components of hybrid systems.

  2. Attention complexity: Standard attention is $O(n^2)$ in sequence length $n$. For long musical pieces (minutes of audio tokenized at high resolution), this is a bottleneck. Linear attention, sparse attention, and state-space models (EP24) are active research directions.

  3. Diffusion speed: DDPM requires hundreds of denoising steps. Accelerated samplers (DDIM, consistency models, flow matching) reduce this to tens or single-digit steps — but the quality-speed tradeoff is not fully understood.

  4. The representation question is open: Whether music is better represented as discrete tokens or continuous signals is an empirical question with no theoretical resolution. Hybrid models (e.g., diffusion in a discrete-token latent space) blur the boundary.

Narration: "Seven models, each with its own tricks on the surface. But the underlying math offers only two paths: either slice music into tokens and predict them one by one, or treat music as a continuous space and sculpt it as a whole."

  5. Controllability vs quality: Markov chains offer full control (just edit the transition matrix) but poor quality. Diffusion models offer high quality but limited fine-grained control. Bridging this gap is the central engineering challenge of AI music generation.

Historical Timeline

| Year | Development | Mathematical core |
| --- | --- | --- |
| 1906 | Markov formalizes dependent random variables | Transition matrix |
| 1957 | Hiller & Isaacson: Illiac Suite | Markov chain on $\mathbb{Z}_{12}$ |
| 1986 | Elman / Jordan: recurrent neural networks | Hidden state, backpropagation through time |
| 1997 | Hochreiter & Schmidhuber: LSTM | Gated memory cells |
| 2017 | Vaswani et al.: Transformer | Softmax attention = dynamic stochastic matrix |
| 2019 | Music Transformer | Long-range attention for symbolic music |
| 2020 | Ho et al.: DDPM | Score matching, forward/reverse diffusion |
| 2022 | Riffusion | Image diffusion on spectrograms |
| 2023 | MusicGen, MusicLM | Token-based autoregressive audio generation |
| 2024 | ACE-Step, Stable Audio | Full-song diffusion in latent space |

Academic References

  1. Hiller, L. & Isaacson, L. (1957). Musical Composition with a High-Speed Digital Computer. Experimental Music, McGraw-Hill.
  2. Norris, J.R. (1997). Markov Chains. Cambridge University Press.
  3. Seneta, E. (2006). Non-negative Matrices and Markov Chains, 3rd ed. Springer.
  4. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS).
  5. Huang, C.-Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M. & Eck, D. (2019). “Music Transformer: Generating Music with Long-Term Structure.” ICLR 2019.
  6. Ho, J., Jain, A. & Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems 33 (NeurIPS).
  7. Song, Y. & Ermon, S. (2019). “Generative Modeling by Estimating Gradients of the Data Distribution.” NeurIPS 2019.
  8. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. & Poole, B. (2021). “Score-Based Generative Modeling through Stochastic Differential Equations.” ICLR 2021.
  9. Forsyth, S. (2022). Riffusion — Stable Diffusion for Real-Time Music Generation. https://www.riffusion.com
  10. Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y. & Defossez, A. (2023). “Simple and Controllable Music Generation.” NeurIPS 2023. (MusicGen)
  11. Agostinelli, A., Denk, T.I., Borsos, Z., Engel, J., Verzetti, M., Tagliasacchi, A., Marafioti, A., Ye, Z., Le Roux, J. & Frank, J. (2023). “MusicLM: Generating Music From Text.” arXiv:2301.11325.
  12. Levin, D.A., Peres, Y. & Wilmer, E.L. (2009). Markov Chains and Mixing Times. AMS.
  13. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” ICML 2015.
  14. Hochreiter, S. & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation 9(8), 1735-1780.
  15. Briot, J.-P., Hadjeres, G. & Pachet, F. (2020). Deep Learning Techniques for Music Generation. Springer.