EP21: From Markov to Diffusion — Sixty Years of AI Composition
Further Extensions
Overview
Three mathematical frameworks have dominated algorithmic music composition over sixty years: Markov chains (1957), Transformer attention (2017), and diffusion models (2020). Each replaces its predecessor not by discarding the core idea — sampling from a learned probability distribution over musical events — but by changing which probability distribution and how it is parameterized.
Narration: "They are solving the same mathematical problem. In seventy years, the answer to that problem has changed only three times: from dice, to attention, to noise."
This companion page formalizes the mathematics the video sketches: transition matrices on ℤ₁₂, the softmax attention mechanism as a content-dependent stochastic matrix, and the DDPM forward/reverse processes with score matching.
Prerequisites
- Cyclic groups and ℤ₁₂ (EP04) — pitch classes as elements of the cyclic group
- Probability and entropy basics (EP07) — probability distributions, expectation
- Linear algebra: matrix multiplication, eigenvalues, positive semi-definiteness
- Basic multivariate calculus: gradients, Gaussian densities
Part I: Markov Chains on ℤ₁₂
Narration: "In 1957, Hiller and Isaacson ran an experiment on the Illiac computer at the University of Illinois. They arranged the twelve pitches in a circle. As EP04 showed, this is ℤ₁₂, the cyclic group of order twelve. Then they assigned a transition probability to each pair of notes."
1.1 Definitions
Let $S$ be a finite set of states. A time-homogeneous Markov chain on $S$ is a sequence of random variables $X_0, X_1, X_2, \dots$ taking values in $S$ such that for all $n \ge 0$ and all states $i, j, s_0, \dots, s_{n-1} \in S$:
$$P(X_{n+1} = j \mid X_n = i,\ X_{n-1} = s_{n-1},\ \dots,\ X_0 = s_0) = P(X_{n+1} = j \mid X_n = i) = p_{ij}.$$
The constant $p_{ij}$ is called the transition probability from state $i$ to state $j$.
Example (Illiac Suite, 1957): Set $S = $ ℤ₁₂, the twelve pitch classes (see EP04). Hiller and Isaacson assigned a transition probability to each ordered pair of pitch classes. To generate a melody: start at some pitch $X_0$, sample $X_1$ from row $X_0$ of the transition matrix, then sample $X_2$ from row $X_1$, and so on.
Narration: "To generate a melody: stand on the current note, roll the dice according to the probabilities, and jump to the next note. Repeat. That is a Markov chain. In essence, AI composition in 1957 was rolling weighted dice on ℤ₁₂."
The transition matrix $P = (p_{ij})$ of a Markov chain on $n$ states has entries satisfying:
- Non-negativity: $p_{ij} \ge 0$ for all $i, j$.
- Row-stochastic: $\sum_j p_{ij} = 1$ for every row $i$.
A matrix satisfying both conditions is called a (row-)stochastic matrix.
Worked example: A toy transition matrix on the states $\{C, E, G\}$ whose first row (from C) is $(0.2,\ 0.3,\ 0.5)$: probability 0.2 to stay on C, 0.3 to jump to E, 0.5 to jump to G. Each row sums to 1.
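A minimal sketch of sampling a melody from such a chain. The text fixes only the row for C; the rows for E and G below are made-up values for illustration:

```python
import random

STATES = ["C", "E", "G"]
P = [
    [0.2, 0.3, 0.5],  # from C (matches the worked example)
    [0.4, 0.1, 0.5],  # from E (hypothetical)
    [0.3, 0.3, 0.4],  # from G (hypothetical)
]

def generate_melody(start: str, length: int, seed: int = 0) -> list[str]:
    """Walk the chain: at each step, sample the next pitch from the
    current pitch's row of the transition matrix."""
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        row = P[STATES.index(melody[-1])]
        melody.append(rng.choices(STATES, weights=row)[0])
    return melody

print(generate_melody("C", 8))
```

Each call to `rng.choices` is one roll of the weighted dice on the current row.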
1.2 Multi-Step Transitions
Base case ($n = 1$): $p_{ij}^{(1)} = p_{ij} = (P^1)_{ij}$ by definition.
Inductive step: Assume $p_{ij}^{(n)} = (P^n)_{ij}$. For the $(n+1)$-step probability:
$$p_{ij}^{(n+1)} = \sum_{k \in S} p_{ik}^{(n)}\, p_{kj}.$$
This identity uses the law of total probability and the Markov property. The inductive hypothesis gives $p_{ik}^{(n)} = (P^n)_{ik}$. Therefore $p_{ij}^{(n+1)} = \sum_k (P^n)_{ik}\, p_{kj} = (P^{n+1})_{ij}$.
Musical meaning: $(P^n)_{CG}$ is the probability that a melody starting on C will be on G after exactly $n$ notes.
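The $n$-step result can be checked numerically. A NumPy sketch, reusing the toy matrix above (the rows for E and G are hypothetical, since the text specifies only the row for C):

```python
import numpy as np

P = np.array([
    [0.2, 0.3, 0.5],  # from C
    [0.4, 0.1, 0.5],  # from E (hypothetical)
    [0.3, 0.3, 0.4],  # from G (hypothetical)
])

# n-step transition probabilities are entries of the matrix power P^n.
P4 = np.linalg.matrix_power(P, 4)

# (P^4)[0, 2]: probability of being on G four notes after starting on C.
print(P4[0, 2])
# Matrix powers of a stochastic matrix are stochastic: rows still sum to 1.
print(P4.sum(axis=1))
```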
1.3 Stationary Distribution
Let $P$ be the transition matrix of an irreducible, aperiodic Markov chain on a finite state space. Then:
- Existence: There exists a unique stationary distribution $\pi$ with $\pi P = \pi$, $\sum_i \pi_i = 1$, and $\pi_i > 0$ for all $i$.
- Convergence: For any initial distribution $\mu$, $\mu P^n \to \pi$ as $n \to \infty$.
- Long-run frequency: With probability 1, the fraction of time spent in state $i$ converges to $\pi_i$.
Musical meaning: The stationary distribution gives the long-run frequency of each pitch class. If the Markov chain on ℤ₁₂ has $\pi_C = 0.12$ and $\pi_G = 0.10$, then in a long enough generated melody, roughly 12% of notes will be C and 10% will be G — regardless of which note started the chain. This is a stylistic fingerprint of the transition matrix.
Worked example: For the toy matrix above, solving $\pi P = \pi$ together with $\sum_i \pi_i = 1$ gives a small linear system in the unknowns $\pi_C, \pi_E, \pi_G$. Its solution puts the largest mass on G: G dominates in the long run.
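A numerical sketch of computing the stationary distribution, again with hypothetical rows for E and G (the text specifies only the row for C):

```python
import numpy as np

P = np.array([
    [0.2, 0.3, 0.5],  # from C
    [0.4, 0.1, 0.5],  # from E (hypothetical)
    [0.3, 0.3, 0.4],  # from G (hypothetical)
])

# pi is a left eigenvector of P for eigenvalue 1: pi @ P = pi.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()                           # normalize to a probability vector

print(pi)                                # long-run pitch-class frequencies
print(np.linalg.matrix_power(P, 50)[0])  # a row of P^50: converges to pi
```

The second print illustrates the convergence statement: high powers of $P$ have every row close to $\pi$, regardless of the starting note.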
1.4 Why Markov Chains Fail at Music
Narration: "How did the result sound? Locally like music, globally without memory. Because a Markov chain only looks one step back. … Returning choruses, developing themes: it captures none of that."
The mixing time measures how quickly the chain “forgets” its starting state. For a first-order chain on ℤ₁₂, this is typically very small (a few steps), meaning the chain loses all memory of its beginning after a handful of notes.
One might try a $k$-th order Markov chain: condition on the previous $k$ notes instead of just one. But this requires a transition table of size $12^k \times 12$: even for a $k$ that spans just a single musical phrase, the state space has $12^k$ contexts. The number of parameters grows exponentially, making estimation from finite training data intractable. This exponential blowup is the fundamental limitation.
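A quick sketch of the blowup for the 12 pitch classes:

```python
def order_k_params(k: int, alphabet: int = 12) -> int:
    """Size of the transition table for an order-k chain on `alphabet`
    symbols: one probability row of length `alphabet` per k-note context."""
    return alphabet ** k * alphabet

# The table grows exponentially in the order k.
for k in (1, 2, 4, 8):
    print(k, order_k_params(k))
```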
Part II: Attention and Transformers
Narration: "The real breakthrough had to wait until 2017. The core of the Transformer is the attention mechanism. The key intuition is this: a Markov chain has a fixed table of transition probabilities, with each row summing to one. Attention also has such a table, but it is not fixed: it is computed on the fly from the current content."
2.1 Scaled Dot-Product Attention
Scaled dot-product attention maps queries $Q$, keys $K$, and values $V$ to
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
Why the scaling: If the entries of $q$ and $k$ are independent with mean 0 and variance 1, then the dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$. Without scaling, for large $d_k$ the dot products grow large in magnitude, pushing softmax into saturation (one entry near 1, all others near 0). Dividing by $\sqrt{d_k}$ restores unit variance, keeping softmax in its informative regime.
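A small Monte Carlo check of the variance claim; the dimension and sample count here are arbitrary choices:

```python
import numpy as np

# Dot products of d_k-dimensional unit-variance vectors have variance
# close to d_k; dividing by sqrt(d_k) restores unit variance.
rng = np.random.default_rng(0)
d_k, n_samples = 64, 50_000
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = (q * k).sum(axis=1)
print(dots.var())                    # close to d_k = 64
print((dots / np.sqrt(d_k)).var())   # close to 1
```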
Worked example: Let $n = 3$ (three tokens: C, E, G) with some key dimension $d_k$. Suppose after projection we obtain $Q, K \in \mathbb{R}^{3 \times d_k}$. Then $QK^\top$ is a $3 \times 3$ matrix of raw similarity scores. Dividing by $\sqrt{d_k}$ and applying row-wise softmax gives a $3 \times 3$ matrix where each row sums to 1 — an attention weight matrix.
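A minimal NumPy sketch of the attention-weight computation; the $Q$ and $K$ values below are hypothetical, since the text does not specify them:

```python
import numpy as np

def attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Row-wise softmax of Q K^T / sqrt(d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Hypothetical 2-dimensional projections for the three tokens C, E, G.
Q = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
K = np.array([[1.0, 0.2], [0.3, 0.9], [0.8, 0.6]])

A = attention_weights(Q, K)
print(A.sum(axis=1))  # every row sums to 1: a stochastic matrix
```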
2.2 Multi-Head Attention
Multi-head attention runs $h$ attention operations in parallel with separate learned projections and concatenates the results:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V).$$
Why multiple heads: Each head can learn a different “type” of relationship. In music: one head might attend to rhythmic patterns, another to harmonic intervals, a third to melodic contour. Concatenating and projecting combines these views.
2.3 Attention as a Dynamic Transition Matrix
Narration: "In other words, the Transformer turned the fixed probability table into a dynamic one."
Let $A = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right)$, with the softmax applied row-wise. Then:
- $A_{ij} > 0$ for all $i, j$.
- $\sum_j A_{ij} = 1$ for every row $i$.
That is, $A$ is a (strictly) positive row-stochastic matrix — it satisfies the same algebraic properties as a Markov chain transition matrix. However, unlike a Markov chain, $A$ depends on the input and changes at every position.
(1) For any real vector $z = (z_1, \dots, z_n)$, the softmax function outputs $\sigma(z)_j = e^{z_j} / \sum_k e^{z_k}$. Since $e^{z_j} > 0$ for all $z_j \in \mathbb{R}$, we have $\sigma(z)_j > 0$.
(2) By definition, $\sum_j \sigma(z)_j = \left(\sum_j e^{z_j}\right) / \sum_k e^{z_k} = 1$.
Applying this to each row of $QK^\top / \sqrt{d_k}$ establishes both properties for the matrix $A$.
The key conceptual shift: a Markov chain uses a fixed stochastic matrix $P$, determined once from training data. Attention computes a different stochastic matrix $A$ for each input sequence. This allows the model to attend to distant positions when the content warrants it — capturing long-range dependencies like returning choruses and thematic development that Markov chains cannot.
| Property | Markov Chain | Attention |
|---|---|---|
| Stochastic matrix | Fixed $P$ | Dynamic $A$, recomputed per input |
| Entries | $p_{ij} \ge 0$, rows sum to 1 | $A_{ij} > 0$, rows sum to 1 |
| Context window | 1 step (or $k$ for order-$k$) | Entire sequence |
| Parameters | One entry per state pair | Projection weights $W^Q, W^K, W^V$, independent of sequence length |
Part III: Diffusion Models (DDPM)
Narration: "The third leap is even more counterintuitive. … Diffusion models … start from a field of pure noise and remove the noise step by step, gradually recovering the structure of the music."
3.1 The Forward Noising Process
Narration: "The most counterintuitive part: the forward noising process of diffusion is, mathematically, a Markov chain. … Only the state is no longer one of twelve discrete pitches but an entire field of continuous noise."
The forward process is a Markov chain in continuous space:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right).$$
At each step, the signal is scaled down by $\sqrt{1-\beta_t}$ and Gaussian noise of variance $\beta_t$ is added. For large $T$, $x_T$ is approximately pure Gaussian noise.
Worked example: For a one-dimensional signal $x_0$ with a small first noise level $\beta_1$, one step gives $x_1 = \sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$: the signal is barely perturbed. After many steps, the signal is destroyed.
3.2 The Reparameterization Trick
Claim: with $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \prod_{s=1}^{t} \alpha_s$,
$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).$$
Base case ($t = 1$): $x_1 = \sqrt{\alpha_1}\,x_0 + \sqrt{1-\alpha_1}\,\varepsilon$ by definition, since $\bar\alpha_1 = \alpha_1$ and $1 - \alpha_1 = \beta_1$.
Inductive step: Assume $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\varepsilon'$ with $\varepsilon' \sim \mathcal{N}(0, I)$. Then $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{\beta_t}\,\varepsilon''$. Substituting the inductive hypothesis:
$$x_t = \sqrt{\alpha_t \bar\alpha_{t-1}}\,x_0 + \sqrt{\alpha_t(1-\bar\alpha_{t-1})}\,\varepsilon' + \sqrt{\beta_t}\,\varepsilon''.$$
Since $\varepsilon'$ and $\varepsilon''$ are independent standard Gaussians, the sum of the two noise terms is Gaussian with mean 0 and variance
$$\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = \alpha_t - \bar\alpha_t + 1 - \alpha_t = 1 - \bar\alpha_t,$$
using $\bar\alpha_t = \alpha_t \bar\alpha_{t-1}$ and $\beta_t = 1 - \alpha_t$. Therefore $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$.
Musical significance: This result means we can jump directly from a clean spectrogram $x_0$ to any noise level $t$ in one step — essential for efficient training, where $t$ is sampled uniformly at random.
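A sketch of the one-step jump under a linear schedule; the endpoints $10^{-4}$ to $0.02$ with $T = 1000$ follow the choice in Ho et al. (2020):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule from Ho et al. (2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0: np.ndarray, t: int) -> np.ndarray:
    """One-step jump to noise level t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(np.shape(x0))
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

print(alpha_bar[0])    # near 1: x_1 is almost x_0
print(alpha_bar[-1])   # near 0: x_T is almost pure noise
```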
3.3 The Reverse (Denoising) Process
The log-likelihood of the data is bounded below by the evidence lower bound (ELBO):
$$\log p_\theta(x_0) \ \ge\ \mathbb{E}_q\big[\log p_\theta(x_0 \mid x_1)\big] \;-\; \sum_{t=2}^{T} \mathbb{E}_q\Big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\big\|\,p_\theta(x_{t-1} \mid x_t)\big)\Big] \;-\; D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\big\|\,p(x_T)\big).$$
The key insight: the posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable (it is Gaussian), so each KL term can be computed in closed form. Training reduces to making each reverse step $p_\theta(x_{t-1} \mid x_t)$ match the true posterior $q(x_{t-1} \mid x_t, x_0)$.
In practice, Ho et al. (2020) showed that the simplified loss
$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\Big[\big\|\varepsilon - \varepsilon_\theta(x_t, t)\big\|^2\Big]$$
(where $\varepsilon_\theta$ is a neural network predicting the noise added at step $t$) works well and corresponds to a reweighted version of the ELBO.
3.4 Score Function and Score Matching
From the reparameterization $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$, the conditional density is
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right).$$
Taking the gradient of the log-density with respect to $x_t$:
$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t} = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}}.$$
Since the optimal noise predictor satisfies $\varepsilon_\theta(x_t, t) \approx \mathbb{E}[\varepsilon \mid x_t]$, we have $\varepsilon_\theta(x_t, t) \approx -\sqrt{1-\bar\alpha_t}\,\nabla_{x_t} \log q(x_t)$. Marginalizing over $x_0$ extends the conditional score to the unconditional score $\nabla_{x_t} \log q(x_t)$.
Musical interpretation: The score points toward regions of higher probability density — i.e., toward more “music-like” spectrograms. Denoising by following the score is equivalent to gradually sculpting noise into music by climbing the probability landscape.
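The conditional score formula can be sanity-checked against a finite-difference gradient for a single scalar pixel; the noise level and evaluation point below are arbitrary illustrative values:

```python
import numpy as np

alpha_bar_t = 0.5        # illustrative noise level
x0, xt = 1.2, 0.4        # illustrative clean value and noisy point

def log_q(x: float) -> float:
    """log N(x; sqrt(alpha_bar_t) * x0, 1 - alpha_bar_t)."""
    mu, var = np.sqrt(alpha_bar_t) * x0, 1.0 - alpha_bar_t
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

# Closed-form score vs central-difference numerical gradient.
closed_form = -(xt - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)
h = 1e-6
numeric = (log_q(xt + h) - log_q(xt - h)) / (2.0 * h)
print(closed_form, numeric)  # the two agree
```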
3.5 The Landscape: Token vs Continuous
Narration: "The horizontal axis is open versus closed source. The vertical axis is autoregressive over discrete tokens versus diffusion in continuous space."
The fundamental mathematical distinction: autoregressive models operate on discrete sequences with an explicit likelihood via the chain rule of probability. Diffusion models operate on continuous vectors with an implicit likelihood accessible only through the ELBO. This is not merely an implementation choice — it determines the inductive bias: autoregressive models excel at capturing sequential dependencies but struggle with global coherence; diffusion models naturally capture global structure (the entire spectrogram is generated jointly) but must be coaxed into temporal coherence.
Numerical Examples
Example 21.1: Markov Chain Melody Generation
Consider a simple Markov chain on a pentatonic subset of the pitch classes, say $\{C, D, E, G, A\}$, with a $5 \times 5$ row-stochastic transition matrix $P$.
Starting at C: row 1 says we go to D with probability 0.3 and to G with probability 0.3. Suppose we sample D. From D (row 2), we might sample E (probability 0.3). From E (row 3), we might sample G (probability 0.3). Generated melody fragment: C-D-E-G-…
A two-step transition entry such as $(P^2)_{CE}$ is computed as $\sum_k p_{Ck}\,p_{kE}$, summing over every intermediate pitch $k$.
Example 21.2: Forward Diffusion on a Spectrogram Pixel
Take one pixel of a mel-spectrogram with log-amplitude $x_0$. With a linear schedule for $\beta_t$:
- At small $t$: $\bar\alpha_t \approx 1$, so $x_t \approx x_0$ (nearly unchanged).
- At intermediate $t$: $\bar\alpha_t$ has decayed well below 1, so $x_t$ is mostly noise.
- At $t = T$: $\bar\alpha_T \approx 0$, so $x_T \sim \mathcal{N}(0, 1)$ (pure noise).
Musical Connection
Narration: "Three episodes, one question: how do we turn music into an object that a probability model can manipulate?"
The Representation War: The narration frames the current landscape as a battle between two representations:
Narration: "Today's model wars are at bottom representation wars: should music be written as tokens, or sculpted as a continuous latent space?"
| Approach | Representation | Framework | Examples |
|---|---|---|---|
| Token-based | Discrete tokens (MIDI, audio codecs) | Autoregressive | MusicLM, MusicGen |
| Continuous | Spectrograms, latent vectors | Diffusion | Riffusion, Stable Audio, ACE-Step |
The token-based approach factors the joint distribution as a product of conditionals:
$$p(x_1, \dots, x_n) = \prod_{t=1}^{n} p(x_t \mid x_1, \dots, x_{t-1}),$$
and predicts one token at a time. The continuous approach treats the entire musical signal as a point in a high-dimensional space and sculpts it from noise by reversing a Markov chain. Both are valid ways of modeling the same underlying distribution over music.
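The factorization can be made concrete with a toy bigram model, the simplest autoregressive instance (the states and probabilities below are illustrative, not from the text):

```python
import numpy as np

STATES = ["C", "E", "G"]
# Illustrative bigram conditionals p(x_t | x_{t-1}) as a stochastic matrix.
P = np.array([[0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])
start = np.array([1.0, 0.0, 0.0])  # p(x_1): the melody starts on C

def log_likelihood(melody: list[str]) -> float:
    """Chain rule: log p(x_1..x_n) = log p(x_1) + sum_t log p(x_t | x_{t-1})."""
    idx = [STATES.index(s) for s in melody]
    ll = float(np.log(start[idx[0]]))
    for a, b in zip(idx, idx[1:]):
        ll += float(np.log(P[a, b]))
    return ll

print(log_likelihood(["C", "E", "G"]))  # log(1.0 * 0.3 * 0.5)
```

A Transformer replaces the bigram conditional with a conditional on the full prefix, but the chain-rule factorization is identical.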
Narration: "Riffusion: first turn the sound into a spectrogram, a two-dimensional image. Then use an image diffusion model to generate new spectrograms…"
The arc from EP21 to EP23: This episode (EP21) asks how the math changed. EP22 examines the specific architectures (EnCodec, RVQ, DiT). EP23 addresses evaluation — how do we measure whether generated music is “good”?
Limits and Open Questions
- Markov chains are not dead: Higher-order Markov models with clever state representations (e.g., learned embeddings rather than raw $n$-grams) remain useful as baselines and components of hybrid systems.
- Attention complexity: Standard attention is $O(n^2)$ in sequence length. For long musical pieces (minutes of audio tokenized at high resolution), this is a bottleneck. Linear attention, sparse attention, and state-space models (EP24) are active research directions.
- Diffusion speed: DDPM requires hundreds of denoising steps. Accelerated samplers (DDIM, consistency models, flow matching) reduce this to tens or single-digit steps — but the quality-speed tradeoff is not fully understood.
- The representation question is open: Whether music is better represented as discrete tokens or continuous signals is an empirical question with no theoretical resolution. Hybrid models (e.g., diffusion in a discrete-token latent space) blur the boundary.
Narration: "Seven models, each with its own tricks on the surface. But underneath, the mathematics offers only two paths: either chop music into tokens and predict them one by one, or treat music as a continuous space and sculpt it as a whole."
- Controllability vs quality: Markov chains offer full control (just edit the transition matrix) but poor quality. Diffusion models offer high quality but limited fine-grained control. Bridging this gap is the central engineering challenge of AI music generation.
Historical Timeline
| Year | Development | Mathematical core |
|---|---|---|
| 1906 | Markov formalizes dependent random variables | Transition matrix |
| 1957 | Hiller & Isaacson: Illiac Suite | Markov chain on ℤ₁₂ |
| 1986 | Elman / Jordan: recurrent neural networks | Hidden state, backpropagation through time |
| 1997 | Hochreiter & Schmidhuber: LSTM | Gated memory cells |
| 2017 | Vaswani et al.: Transformer | Softmax attention = dynamic stochastic matrix |
| 2019 | Music Transformer | Long-range attention for symbolic music |
| 2020 | Ho et al.: DDPM | Score matching, forward/reverse diffusion |
| 2022 | Riffusion | Image diffusion on spectrograms |
| 2023 | MusicGen, MusicLM | Token-based autoregressive audio generation |
| 2024 | ACE-Step, Stable Audio | Full-song diffusion in latent space |
Academic References
- Hiller, L. & Isaacson, L. (1957). Musical Composition with a High-Speed Digital Computer. Experimental Music, McGraw-Hill.
- Norris, J.R. (1997). Markov Chains. Cambridge University Press.
- Seneta, E. (2006). Non-negative Matrices and Markov Chains, 3rd ed. Springer.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS).
- Huang, C.-Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M. & Eck, D. (2019). “Music Transformer: Generating Music with Long-Term Structure.” ICLR 2019.
- Ho, J., Jain, A. & Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems 33 (NeurIPS).
- Song, Y. & Ermon, S. (2019). “Generative Modeling by Estimating Gradients of the Data Distribution.” NeurIPS 2019.
- Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. & Poole, B. (2021). “Score-Based Generative Modeling through Stochastic Differential Equations.” ICLR 2021.
- Forsyth, S. (2022). Riffusion — Stable Diffusion for Real-Time Music Generation. https://www.riffusion.com
- Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y. & Defossez, A. (2023). “Simple and Controllable Music Generation.” NeurIPS 2023. (MusicGen)
- Agostinelli, A., Denk, T.I., Borsos, Z., Engel, J., Verzetti, M., Tagliasacchi, A., Marafioti, A., Ye, Z., Le Roux, J. & Frank, J. (2023). “MusicLM: Generating Music From Text.” arXiv:2301.11325.
- Levin, D.A., Peres, Y. & Wilmer, E.L. (2009). Markov Chains and Mixing Times. AMS.
- Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” ICML 2015.
- Hochreiter, S. & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation 9(8), 1735-1780.
- Briot, J.-P., Hadjeres, G. & Pachet, F. (2020). Deep Learning Techniques for Music Generation. Springer.