EP22: How AI Writes Music — EnCodec and RVQ

EnCodec RVQ, causal Transformer, diffusion DiT
8:10 · Statistics/ML · Signal Processing

Overview

Narration: "You type a line of text into an AI music tool: 'a Chinese-style piano piece.' Ten seconds later, a melody flows out. What exactly does the data go through in those ten seconds?"

This episode traces the complete data pipeline of a text-to-music system, from the moment you type a text prompt to the moment audio emerges from the speaker. We follow two parallel architectures:

  1. MusicGen (Meta, 2023): text → T5 encoder → Transformer with delay pattern → discrete tokens → RVQ decode → EnCodec decoder → waveform.
  2. ACE-Step (2024): text → T5 encoder → DiT diffusion in continuous latent space → DCAE decoder → vocoder → waveform.

The mathematical core: how to compress audio into a small discrete alphabet (EnCodec + RVQ), how to generate sequences from that alphabet conditioned on text (cross-attention + delay pattern), and how diffusion offers a continuous alternative (DiT + classifier-free guidance).

In EP21, we surveyed sixty years of AI composition paradigms: Markov chains, neural sequence models, diffusion. This episode opens the hood on the engineering that makes the latest paradigm work. EP23 will ask what the learned codebook entries actually encode.


Prerequisites


Station 1: Text Encoding (T5)

Narration: "Station one: your text enters a text encoder called T5. What comes out is not notes and not a melody, but a set of hidden-state vectors. Think of it as a 'style consultant.'"

22.1 Tokenization: SentencePiece

Before T5 can process text, the raw string must be broken into subword units. T5 uses SentencePiece (Kudo & Richardson, 2018), a language-independent tokenizer trained on raw text via a unigram language model or BPE. The vocabulary size is 32,000 subword tokens.

Definition 22.1 (Subword Tokenization)
Given a vocabulary $V$ of subword units, a tokenizer maps an input string $s$ to a sequence of token indices $(x_1, \dots, x_n)$, $x_i \in \{1, \dots, |V|\}$, where $n$ depends on the string. SentencePiece (in its unigram mode) finds the segmentation maximizing the unigram log-likelihood:

$$x^* = \arg\max_{x \in S(s)} \sum_{i=1}^{n} \log p(x_i),$$

where $S(s)$ is the set of all valid segmentations of $s$.

Worked example: The prompt “a Chinese-style piano piece” might tokenize as ["▁a", "▁Chinese", "-", "style", "▁piano", "▁piece"], six tokens, each mapped to an integer index in $\{0, 1, \dots, 31999\}$.
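To make Definition 22.1 concrete, here is a minimal sketch of unigram segmentation by dynamic programming over a toy vocabulary. The vocabulary and its log-probabilities are invented for illustration; this is not SentencePiece's actual model or API.

```python
import math

# Toy unigram vocabulary: log-probabilities are invented for illustration.
log_p = {
    "▁a": math.log(0.05), "▁piano": math.log(0.01), "▁pia": math.log(0.002),
    "no": math.log(0.004), "▁p": math.log(0.01), "iano": math.log(0.001),
}

def segment(text):
    """Return the segmentation maximizing the summed unigram log-likelihood
    (Viterbi over all split points), as in Definition 22.1."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)   # best[i] = (score, split point) for prefix text[:i]
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in log_p and best[j][0] + log_p[piece] > best[i][0]:
                best[i] = (best[j][0] + log_p[piece], j)
    # Backtrack to recover the winning pieces.
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return list(reversed(pieces)), best[n][0]

print(segment("▁a▁piano"))   # (['▁a', '▁piano'], best log-likelihood)
```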

22.2 T5 Encoder Architecture

T5 (Raffel et al., 2020) is an encoder-decoder Transformer pre-trained on the C4 corpus (Colossal Clean Crawled Corpus, ~750 GB of English text). MusicGen uses only the encoder half.

Each token index is embedded into a vector in $\mathbb{R}^{d_{\text{model}}}$ ($d_{\text{model}} = 768$ for T5-base, $1024$ for T5-large). The encoder applies a stack of Transformer blocks (self-attention + feed-forward), producing a sequence of hidden states:

$$H = \mathrm{T5Encoder}(x_1, \dots, x_n) \in \mathbb{R}^{n \times d_{\text{model}}}.$$

The output is a sequence of $n$ vectors, one per subword token. These are not audio features; they are linguistic representations that will later be queried by the audio generator via cross-attention.

Why T5? Its pre-training on a massive text corpus gives it rich semantic representations. The phrase “Chinese-style piano” activates different hidden-state patterns than “jazz drum solo,” encoding genre, instrumentation, and cultural associations — all without any music-specific training.
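A minimal sketch of extracting these hidden states with the Hugging Face transformers library, assuming the public t5-base checkpoint (hidden size 768) is available; the exact token count depends on the tokenizer and includes the end-of-sequence token.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")   # encoder half only, no decoder

prompt = "a Chinese-style piano piece"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (1, n_tokens, 768)

print(hidden.shape)   # e.g. torch.Size([1, 7, 768]); one 768-dim vector per subword token
```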


Station 2: Audio Compression (EnCodec + RVQ)

Narration: "First, pre-train an encoder: Meta's EnCodec compresses the raw waveform layer by layer, squeezing every 640 samples into one 128-dimensional vector. A 640-to-1 compression in time."

22.3 EnCodec Encoder: Strided Convolutions

Definition 22.2 (EnCodec Encoder)

The EnCodec encoder is a stack of strided 1-D convolutional layers. Each layer $i$ downsamples time by a factor $s_i$. For 32 kHz audio the strides are 4, 4, 5, and 8, giving a total downsampling factor of

$$\prod_i s_i = 4 \times 4 \times 5 \times 8 = 640.$$

The encoder maps each window of 640 raw audio samples to a single latent vector:

$$E: \mathbb{R}^{640} \to \mathbb{R}^{128}, \qquad z_t = E\big(x_{640t}, \dots, x_{640t + 639}\big).$$

At a 32 kHz sample rate, the encoder produces $32{,}000 / 640 = 50$ latent vectors per second of audio.

Compression ratio: 640 samples of 16-bit audio occupy $640 \times 16 = 10{,}240$ bits. The 128-dimensional continuous vector will be further quantized to roughly 44 bits (four codebook indices of $\log_2 2048 = 11$ bits each), a compression ratio of approximately $10{,}240 / 44 \approx 233{:}1$.
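A quick back-of-the-envelope check of these numbers in Python (the values 640, 128, four codebooks of 2048 entries, and 32 kHz all come from the text above):

```python
import math

sample_rate = 32_000                            # Hz
hop = 4 * 4 * 5 * 8                             # total stride product = 640 samples per latent frame
frames_per_sec = sample_rate // hop
bits_per_frame_raw = hop * 16                   # 16-bit PCM
bits_per_frame_rvq = 4 * int(math.log2(2048))   # 4 codebooks x 11 bits

print(frames_per_sec)                                       # 50 frames per second
print(bits_per_frame_raw, bits_per_frame_rvq)               # 10240 vs 44 bits per frame
print(round(bits_per_frame_raw / bits_per_frame_rvq))       # ~233:1 compression
print(frames_per_sec * bits_per_frame_rvq / 1000, "kbps")   # 2.2 kbps token bitrate
```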

22.4 The EnCodec Decoder

The decoder mirrors the encoder with transposed convolutions, upsampling by the same factors in reverse order:

$$D: \mathbb{R}^{128} \to \mathbb{R}^{640}, \qquad \hat{x}_{640t}, \dots, \hat{x}_{640t+639} = D(\hat{z}_t).$$

Each transposed-convolution layer upsamples by its stride factor, so the stack expands one 128-dimensional latent vector back into 640 waveform samples.
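A minimal PyTorch sketch of this downsample/upsample skeleton. It is an illustrative toy, not the real EnCodec architecture (which uses overlapping kernels, residual blocks, and an LSTM), but it reproduces the 640x stride structure and the shape bookkeeping.

```python
import torch
import torch.nn as nn

strides = [4, 4, 5, 8]                     # product = 640
channels = [1, 32, 64, 96, 128]            # channel widths are illustrative

# Encoder: each Conv1d strides time down by s_i. Kernel size == stride keeps the
# shape arithmetic exact; the real codec uses overlapping kernels.
encoder = nn.Sequential(*[
    nn.Sequential(nn.Conv1d(channels[i], channels[i + 1], kernel_size=s, stride=s),
                  nn.ELU())
    for i, s in enumerate(strides)
])

# Decoder: transposed convolutions with the same strides in reverse order.
decoder = nn.Sequential(*[
    nn.Sequential(nn.ConvTranspose1d(channels[-1 - i], channels[-2 - i],
                                     kernel_size=s, stride=s),
                  nn.ELU())
    for i, s in enumerate(reversed(strides))
])

x = torch.randn(1, 1, 32_000)     # one second of mono audio at 32 kHz
z = encoder(x)
print(z.shape)                    # torch.Size([1, 128, 50]): 50 latent frames per second
print(decoder(z).shape)           # torch.Size([1, 1, 32000]): waveform length restored
```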

22.5 EnCodec Training Loss

The encoder-decoder pair is trained end-to-end with a multi-component loss:

| Term | Formula (schematic) | Purpose |
| --- | --- | --- |
| Reconstruction $\ell_{\text{rec}}$ | $\lVert x - \hat{x} \rVert_1 + \sum_i \lVert \mathrm{Mel}_i(x) - \mathrm{Mel}_i(\hat{x}) \rVert$ over several spectrogram scales | Waveform + spectral fidelity |
| Adversarial $\ell_{\text{adv}}$ | GAN discriminator loss | Perceptual quality |
| Feature matching $\ell_{\text{feat}}$ | $\sum_l \lVert D^{(l)}(x) - D^{(l)}(\hat{x}) \rVert_1$ | Feature matching across discriminator layers |
| Commitment $\ell_{\text{commit}}$ | $\lVert z_e - \mathrm{sg}[q(z_e)] \rVert_2^2$ | Encoder commits to nearest codebook entry |

Here $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.

22.6 Vector Quantization (VQ)

Narration: "Then residual vector quantization turns the continuous vector into discrete indices: four codebook layers, 2048 centroids per layer, each layer correcting the error left by the previous one."

Definition 22.3 (Vector Quantization)
A codebook $\mathcal{C} = \{c_1, \dots, c_K\} \subset \mathbb{R}^d$ is a finite set of $K$ centroids (codewords). The quantization function maps a continuous vector $z \in \mathbb{R}^d$ to its nearest centroid:

$$Q(z) = c_{k^*}, \qquad k^* = \arg\min_{k \in \{1, \dots, K\}} \lVert z - c_k \rVert_2.$$

Worked example (illustrative values): Let $d = 2$, $K = 4$, with centroids $c_1 = (0, 0)$, $c_2 = (1, 0)$, $c_3 = (0, 1)$, $c_4 = (1, 1)$. For $z = (0.9, 0.2)$:

| Centroid | Distance $\lVert z - c_k \rVert_2$ |
| --- | --- |
| $c_1 = (0, 0)$ | $\sqrt{0.81 + 0.04} \approx 0.92$ |
| $c_2 = (1, 0)$ | $\sqrt{0.01 + 0.04} \approx 0.22$ |
| $c_3 = (0, 1)$ | $\sqrt{0.81 + 0.64} \approx 1.20$ |
| $c_4 = (1, 1)$ | $\sqrt{0.01 + 0.64} \approx 0.81$ |

So $Q(z) = c_2 = (1, 0)$, and we store only the index $k^* = 2$.
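The same nearest-centroid lookup in a few lines of NumPy, using the illustrative centroids from the example above:

```python
import numpy as np

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # K=4 centroids, d=2
z = np.array([0.9, 0.2])

# Quantize: index of the nearest centroid in Euclidean distance.
dists = np.linalg.norm(codebook - z, axis=1)
k_star = int(np.argmin(dists))

print(dists.round(2))            # [0.92 0.22 1.2  0.81]
print(k_star, codebook[k_star])  # 1 [1. 0.]  (0-based index; the text's c_2)
```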

The gradient problem: The $\arg\min$ operation is not differentiable. VQ-VAE (van den Oord et al., 2017) solves this with the straight-through estimator: during backpropagation, the gradient at the quantized output $z_q$ is simply copied to the encoder output $z_e$:

$$\frac{\partial \mathcal{L}}{\partial z_e} \approx \frac{\partial \mathcal{L}}{\partial z_q}.$$

The full VQ-VAE loss decomposes as:

$$\mathcal{L} = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}} + \underbrace{\lVert \mathrm{sg}[z_e] - e \rVert_2^2}_{\text{codebook}} + \beta \underbrace{\lVert z_e - \mathrm{sg}[e] \rVert_2^2}_{\text{commitment}}, \qquad e = Q(z_e).$$
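A minimal PyTorch sketch of the straight-through trick: the quantized vector is used in the forward pass, but gradients flow to the encoder output as if quantization were the identity. Variable and function names here are mine.

```python
import torch

def quantize_st(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-centroid quantization with a straight-through gradient.

    z_e:      (batch, d) encoder outputs
    codebook: (K, d) centroids
    """
    dists = torch.cdist(z_e, codebook)    # (batch, K) pairwise distances
    idx = dists.argmin(dim=1)             # nearest centroid index per vector
    z_q = codebook[idx]                   # quantized vectors

    # Forward value is z_q; backward gradient w.r.t. z_e is copied unchanged,
    # because the (z_q - z_e) term is detached from the graph.
    return z_e + (z_q - z_e).detach()

z_e = torch.randn(8, 128, requires_grad=True)
codebook = torch.randn(2048, 128)
quantize_st(z_e, codebook).sum().backward()
print(torch.allclose(z_e.grad, torch.ones_like(z_e)))  # True: gradient passed straight through
```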

22.7 Residual Vector Quantization (RVQ)

A single codebook with $K$ centroids in $\mathbb{R}^d$ cannot represent the full diversity of audio. Naively increasing $K$ is impractical: storage and nearest-neighbor search both scale as $O(K)$, and matching 44 bits of precision with one codebook would require $K = 2^{44}$ centroids. RVQ solves this by stacking multiple codebooks that successively correct the residual error.

Definition 22.4 (Residual Vector Quantization (RVQ))

Given $Q$ codebooks $\mathcal{C}_1, \dots, \mathcal{C}_Q$, each with $K$ centroids in $\mathbb{R}^d$, define the RVQ encoding recursively:

$$r_1 = z, \qquad k_q = \arg\min_{k} \lVert r_q - c_k^{(q)} \rVert_2, \qquad r_{q+1} = r_q - c_{k_q}^{(q)}, \qquad q = 1, \dots, Q,$$

where $c_k^{(q)} \in \mathcal{C}_q$, so layer $q$ quantizes the residual $r_q$ using codebook $\mathcal{C}_q$. The RVQ reconstruction is:

$$\hat{z}_Q = \sum_{q=1}^{Q} c_{k_q}^{(q)}.$$

Narration: "In the end, one frame of audio is just four integers. These four integers are the 'letters' in the AI's musical alphabet."

Worked example (RVQ with $Q = 3$, $d = 2$, $K = 16$; illustrative values): Let $z = (0.90, -0.30)$.

Layer 1: $r_1 = z$. Nearest centroid in $\mathcal{C}_1$: $c^{(1)}_{7} = (1.00, -0.25)$, index 7. Residual: $r_2 = r_1 - c^{(1)}_{7} = (-0.10, -0.05)$.

Layer 2: Nearest in $\mathcal{C}_2$: $c^{(2)}_{12} = (-0.08, -0.06)$, index 12. Residual: $r_3 = (-0.02, 0.01)$.

Layer 3: Nearest in $\mathcal{C}_3$: $c^{(3)}_{0} = (-0.02, 0.00)$, index 0. Residual: $r_4 = (0.00, 0.01)$.

Reconstruction: $\hat{z} = c^{(1)}_{7} + c^{(2)}_{12} + c^{(3)}_{0} = (0.90, -0.31)$. Error: $\lVert z - \hat{z} \rVert_2 = 0.01$.

The frame is stored as three integers: $(7, 12, 0)$.
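A compact NumPy sketch of RVQ encode and decode. The codebooks here are random, purely for illustration (real codebooks are learned, for example by k-means on the previous layer's residuals), and the test frame is planted so the toy codebooks have something meaningful to find.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, K, d = 4, 2048, 128
# One random codebook per layer; later layers use smaller scales, mimicking the
# shrinking residuals they are meant to quantize.
codebooks = [rng.normal(scale=0.5 ** q, size=(K, d)) for q in range(Q)]

def rvq_encode(z):
    """Return the Q codebook indices and the residual norm after each layer."""
    indices, norms, r = [], [], z.copy()
    for C in codebooks:
        k = int(np.argmin(np.linalg.norm(C - r, axis=1)))
        indices.append(k)
        r = r - C[k]                                   # the next layer quantizes this residual
        norms.append(round(float(np.linalg.norm(r)), 3))
    return indices, norms

def rvq_decode(indices):
    """Reconstruct the frame by summing one centroid per layer."""
    return sum(C[k] for C, k in zip(codebooks, indices))

# A frame the codebooks can actually describe: one centroid per layer plus small noise.
true_idx = [int(rng.integers(K)) for _ in range(Q)]
z = sum(codebooks[q][true_idx[q]] for q in range(Q)) + 0.01 * rng.normal(size=d)

idx, norms = rvq_encode(z)
print(idx, true_idx)   # four integers per frame; here they should recover the planted indices
print(norms)           # residual norm shrinks layer by layer (cf. Theorem 22.1)
print(round(float(np.linalg.norm(z - rvq_decode(idx))), 3))   # equals the final residual norm
```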

Theorem 22.1 (RVQ Successive Approximation)
The RVQ reconstruction error is non-increasing in the number of quantization layers. Formally, for every $q \geq 1$:

$$\lVert z - \hat{z}_{q+1} \rVert_2 \;\leq\; \lVert z - \hat{z}_q \rVert_2,$$

with equality if and only if layer $q+1$ quantizes its residual to the zero vector (in particular, whenever the residual is already zero).
Proof.

It suffices to show that adding one more layer does not increase the error. After $q$ layers the error is exactly the current residual: by the recursion in Definition 22.4,

$$z - \hat{z}_q = z - \sum_{j \leq q} c^{(j)}_{k_j} = r_{q+1}.$$

Layer $q+1$ picks the centroid of $\mathcal{C}_{q+1}$ nearest to $r_{q+1}$, so the new error is

$$\lVert z - \hat{z}_{q+1} \rVert_2 = \lVert r_{q+1} - c^{(q+1)}_{k_{q+1}} \rVert_2 = \min_{c \in \mathcal{C}_{q+1}} \lVert r_{q+1} - c \rVert_2.$$

Suppose the zero vector is available as a centroid in $\mathcal{C}_{q+1}$ (a "do nothing" option). Then the minimum over the codebook is at most the value attained at $c = 0$:

$$\min_{c \in \mathcal{C}_{q+1}} \lVert r_{q+1} - c \rVert_2 \;\leq\; \lVert r_{q+1} - 0 \rVert_2 = \lVert r_{q+1} \rVert_2 = \lVert z - \hat{z}_q \rVert_2.$$

Equality holds exactly when the zero vector is among the nearest centroids, i.e. when layer $q+1$ cannot improve on doing nothing; in particular this happens when $r_{q+1} = 0$.

In practice the zero vector need not be an explicit codebook entry, but the conclusion still holds to good approximation: codebooks are learned by k-means on the residuals of the previous layer, those residuals have near-zero mean, so every trained codebook contains centroids very close to the origin. Empirically, each additional layer shrinks the residual norm roughly geometrically in trained RVQ systems.

Effective codebook size: With $Q = 4$ layers of $K = 2048$ centroids each, the effective number of representable vectors is $K^Q = 2048^4 = 2^{44} \approx 1.8 \times 10^{13}$, far larger than any single codebook could achieve.

Bit rate: Each frame requires $Q \cdot \log_2 K = 4 \times 11 = 44$ bits. At 50 frames/second, the total bitrate is $50 \times 44 = 2{,}200$ bits/s $= 2.2$ kbps. Compare: uncompressed 32 kHz 16-bit audio is $32{,}000 \times 16 = 512$ kbps.


Station 3: Token Generation (Delay Pattern + Cross-Attention)

Narration: "Station three: the Transformer starts generating. But how do you handle four codebook layers at once? The key innovation is called the delay pattern: instead of flattening the four codebooks into one very long sequence, each layer is offset by one step."

22.8 The Flattening Problem

After RVQ encoding, each time frame $t$ is represented by $Q = 4$ integer tokens $(k_{1,t}, \dots, k_{4,t})$. For $T$ frames, the naive approach is to flatten all tokens into a single sequence of length $Q \cdot T$. At 50 frames/second for 30 seconds of audio, this gives $4 \times 1500 = 6000$ tokens. That is manageable, but it has a fundamental problem: the Transformer's self-attention has $O(n^2)$ complexity, and the flattened ordering imposes artificial sequential dependencies between codebook layers.

22.9 The Delay Pattern

Definition 22.5 (Delay Pattern)

In the delay pattern (Copet et al., 2023), codebook $q$ is offset by $q - 1$ timesteps, so the token for codebook $q$ at frame $t$ is emitted at generation step $s = t + (q - 1)$. At generation step $s$, the model predicts in parallel:

$$\big(k_{1,s},\; k_{2,s-1},\; \dots,\; k_{Q,\,s-Q+1}\big).$$

The causal attention mask allows the token at position $(q, t)$ (codebook $q$, time frame $t$) to attend to all tokens at positions $(q', t')$ satisfying:

$$t' + (q' - 1) \;<\; t + (q - 1).$$

Why this works: At generation step $s$, the model generates $k_{1,s}$ (the coarsest new token) using only past information. Simultaneously, it generates $k_{2,s-1}$, the second-layer refinement of the previous frame, which can now condition on $k_{1,s-1}$ (already generated one step earlier). The offsets ensure each codebook layer can see the coarser layers of the same frame.
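A small NumPy sketch of applying and undoing the delay pattern on a Q x T token matrix, using a pad token for the triangular edges; the pad value and helper names are mine.

```python
import numpy as np

PAD = -1  # placeholder for positions with no real token (the triangular edges)

def apply_delay(tokens):
    """Shift codebook row q right by q steps -> a (Q, T + Q - 1) generation grid."""
    Q, T = tokens.shape
    grid = np.full((Q, T + Q - 1), PAD, dtype=tokens.dtype)
    for q in range(Q):
        grid[q, q:q + T] = tokens[q]
    return grid

def undo_delay(grid, T):
    """Inverse operation: recover the original (Q, T) token matrix."""
    Q = grid.shape[0]
    return np.stack([grid[q, q:q + T] for q in range(Q)])

tokens = np.arange(12).reshape(4, 3)   # Q=4 codebooks, T=3 frames (toy values)
grid = apply_delay(tokens)
print(grid)
# [[ 0  1  2 -1 -1 -1]
#  [-1  3  4  5 -1 -1]
#  [-1 -1  6  7  8 -1]
#  [-1 -1 -1  9 10 11]]
# Column s holds the tokens generated at step s: frame s of codebook 1,
# frame s-1 of codebook 2, and so on (Definition 22.5).
print(np.array_equal(undo_delay(grid, T=3), tokens))  # True
```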

Comparison:

| Method | Sequence length | Steps for $T$ frames | Quality |
| --- | --- | --- | --- |
| Flat (concatenate) | $Q \cdot T$ | $Q \cdot T$ | Baseline |
| Parallel (all at once) | $T$ | $T$ | Lower (no inter-codebook dependencies) |
| Delay pattern | $T + Q - 1$ | $T + Q - 1$ | Best (captures inter-codebook structure) |

For $Q = 4$, $T = 1500$ (30 s at 50 frames/s): flat requires 6000 steps; the delay pattern requires 1503, a roughly $4\times$ speedup with no quality loss.

Theorem 22.2 (Delay Pattern Causal Ordering)
The delay pattern defines a valid autoregressive factorization. That is, the joint distribution over all tokens factorizes as:

$$p\big(\{k_{q,t}\}\big) = \prod_{s=1}^{T+Q-1} p\Big(\{k_{q,\,s-q+1} : 1 \leq q \leq Q\} \;\Big|\; \text{all tokens generated at steps} < s\Big),$$

provided we define $k_{q,t}$ to be a special padding token for $t < 1$ or $t > T$.
Proof.

We must verify that the dependency order induced by the delay pattern is acyclic. Define a directed graph with one vertex per token $k_{q,t}$ and an edge $(q', t') \to (q, t)$ whenever $(q', t')$ is in the attention set of $(q, t)$.

Assign each vertex the score $s(q, t) = t + (q - 1)$, the generation step at which that token is produced. By the causal mask of Definition 22.5, an edge $(q', t') \to (q, t)$ exists only when

$$t' + (q' - 1) \;<\; t + (q - 1),$$

i.e. only when the source has a strictly smaller score than the target. Every dependency edge therefore increases the score strictly, so no directed cycle can exist: the graph is a DAG, the generation steps give a topological order, and conditioning each step's tokens on all tokens from strictly earlier steps yields a valid autoregressive factorization by the chain rule of probability.

22.10 Cross-Attention: Text Conditions the Audio

Narration: "Every time it generates a column of tokens, the Transformer looks back at the text hidden states parked at station one, via cross-attention. The audio tokens act as queries; the text vectors act as keys and values. That is how the 'Chinese style' information gets injected, step by step. Note: this is consultation, not translation."

Definition 22.6 (Cross-Attention)
Let $X \in \mathbb{R}^{m \times d}$ be the current audio token embeddings and $H \in \mathbb{R}^{n \times d_{\text{model}}}$ be the T5 hidden states. Cross-attention computes:

$$\mathrm{CrossAttn}(X, H) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q, \quad K = HW_K, \quad V = HW_V,$$

where $W_Q$, $W_K$, $W_V$ are learned projection matrices.
Theorem 22.3 (Cross-Attention as Conditional Generation)
Cross-attention implements conditional generation: the output distribution over audio tokens is modulated by the text embedding. Formally, the cross-attention output for audio position $i$ is a convex combination of text value vectors:

$$o_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right),$$

where $\alpha_{ij} \geq 0$ and $\sum_j \alpha_{ij} = 1$.
Proof.
The softmax function outputs a probability distribution by construction: for any input vector $u \in \mathbb{R}^n$, $\mathrm{softmax}(u)_j = e^{u_j} / \sum_k e^{u_k} \geq 0$ and $\sum_j \mathrm{softmax}(u)_j = 1$. Setting $u_j = q_i \cdot k_j / \sqrt{d_k}$ gives the attention weights $\alpha_{ij}$. The output is $o_i = \sum_j \alpha_{ij} v_j$, a convex combination (non-negative coefficients summing to 1) of the text value vectors. Each audio position therefore "selects" a weighted mixture of text representations, so the text embedding modulates the audio generation at every position.

Key distinction from self-attention: In self-attention, queries, keys, and values all come from the same sequence. In cross-attention, queries come from one modality (audio) while keys and values come from another (text). The audio “asks questions” and the text “provides answers.”
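A minimal single-head cross-attention in NumPy, following Definition 22.6; the dimensions and weight matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 6                     # m audio positions (queries), n text tokens (keys/values)
d, d_model, d_k = 64, 1024, 64

X = rng.normal(size=(m, d))          # audio token embeddings
H = rng.normal(size=(n, d_model))    # T5 hidden states

W_Q = rng.normal(size=(d, d_k)) / np.sqrt(d)
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

Q, K, V = X @ W_Q, H @ W_K, H @ W_V
scores = Q @ K.T / np.sqrt(d_k)                        # (m, n): audio queries vs text keys
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)              # row-wise softmax -> attention weights
out = alpha @ V                                        # each row: convex combination of text values

print(alpha.shape, out.shape)             # (10, 6) (10, 64)
print(np.allclose(alpha.sum(axis=1), 1))  # True: weights per audio position sum to 1 (Theorem 22.3)
```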


Alternative Path: Diffusion (DiT + CFG)

Narration: "ACE-Step takes a completely different direction: skip the alphabet, skip discretization. The starting point is a ball of pure noise."

22.11 DiT: Diffusion Transformer

Definition 22.7 (Diffusion Transformer (DiT))

The Diffusion Transformer (Peebles & Xie, 2023) replaces the U-Net backbone of standard diffusion models with a Transformer operating on patches of the continuous latent representation.

For ACE-Step, the input is a continuous latent $z \in \mathbb{R}^{C \times T \times F}$ (channels $\times$ time $\times$ frequency), not discrete tokens. The pipeline:

  1. Patchify: Divide $z$ into non-overlapping patches, embed each as a vector.
  2. Transformer blocks: Self-attention and feed-forward layers with adaptive layer normalization (adaLN) conditioned on the diffusion timestep and text embedding.
  3. Unpatchify: Reshape the output back to $\mathbb{R}^{C \times T \times F}$.

The adaLN conditioning works by predicting scale and shift parameters from the conditioning signal:

$$(\gamma, \beta) = \mathrm{MLP}(c_{\text{cond}}), \qquad \mathrm{adaLN}(h) = \gamma \odot \mathrm{LayerNorm}(h) + \beta,$$

where $c_{\text{cond}}$ combines the diffusion timestep embedding and the text embedding.

This is fundamentally different from MusicGen: no codebook, no VQ, no discrete tokens. The entire generation happens in continuous space.
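A PyTorch sketch of adaLN modulation. The conditioning vector here stands in for the sum of timestep and text embeddings, and the module is a simplification of DiT's adaLN block (the paper's adaLN-Zero variant additionally gates residual branches).

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Layer norm whose scale and shift are predicted from a conditioning vector."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # affine comes from the condition
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_patches, dim) patch embeddings; c: (batch, cond_dim) conditioning
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)     # (batch, dim) each
        return gamma.unsqueeze(1) * self.norm(h) + beta.unsqueeze(1)

block = AdaLN(dim=256, cond_dim=512)
h = torch.randn(2, 100, 256)      # 100 latent patches per item
c = torch.randn(2, 512)           # timestep embedding + text embedding (placeholder)
print(block(h, c).shape)          # torch.Size([2, 100, 256])
```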

22.12 The Forward Diffusion Process

Narration: "Classic diffusion models like DDPM have a forward noising process that is... a Markov chain. The very first AI-composition paradigm from last episode, sixty-year-old mathematics, hiding at the heart of a modern generative model."

As discussed in EP21, the forward process of DDPM is a Markov chain that gradually adds Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\big), \qquad t = 1, \dots, T.$$

The reverse process (denoising) is learned by a neural network $\epsilon_\theta(x_t, t, c)$ that predicts the noise added at each step. DiT uses a Transformer as the architecture for $\epsilon_\theta$.
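A small NumPy sketch of the forward process, using the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ obtained by composing the per-step Gaussians above; the linear beta schedule is a common default, not something specified in this episode.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (common default)
alpha_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(8, 128))           # a toy batch of continuous latents
for t in (0, 100, 500, 999):
    xt = q_sample(x0, t)
    print(t, round(float(np.sqrt(alpha_bar[t])), 3))   # signal scale decays toward 0 as t grows
```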

22.13 Classifier-Free Guidance (CFG)

Definition 22.8 (Classifier-Free Guidance)
During training, the text condition $c$ is randomly dropped (replaced with the null condition $\varnothing$) with probability $p_{\text{uncond}}$. At inference, the conditional and unconditional noise predictions are interpolated:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big),$$

where $w$ is the guidance scale.

Interpretation by cases:

| Guidance scale $w$ | Behavior |
| --- | --- |
| $w = 1$ | Pure conditional generation: $\tilde{\epsilon} = \epsilon_\theta(x_t, c)$ |
| $w > 1$ | Amplified conditioning; moves away from the unconditional prediction |
| $w \gg 1$ | Extreme adherence to text (often causes artifacts) |
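The inference-time combination itself is one line; here is a sketch with a stand-in denoiser to show how the two forward passes are mixed. The model function is a placeholder, not a real checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_theta(x_t, cond):
    """Stand-in for the trained noise predictor; cond=None means the null condition."""
    bias = 0.0 if cond is None else 0.3 * cond           # pretend the condition shifts the prediction
    return 0.1 * x_t + bias + 0.01 * rng.normal(size=x_t.shape)

def cfg_eps(x_t, cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_uncond = eps_theta(x_t, None)
    eps_cond = eps_theta(x_t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

x_t = rng.normal(size=(1, 128))
cond = rng.normal(size=(1, 128))        # e.g. a pooled text embedding
for w in (1.0, 3.0, 10.0):
    print(w, round(float(np.linalg.norm(cfg_eps(x_t, cond, w))), 3))
# Larger w pushes the prediction further along the (conditional - unconditional) direction.
```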
Theorem 22.4 (Classifier-Free Guidance Interpolation)
CFG implicitly performs gradient ascent on $\log p(c \mid x_t)$. Specifically, the guided score satisfies:

$$\tilde{\nabla}_{x_t} = \nabla_{x_t} \log p(x_t \mid c) + (w - 1)\, \nabla_{x_t} \log p(c \mid x_t),$$

where $p(c \mid x_t)$ is the implicit classifier.
Proof.

By Bayes' rule:

$$\log p(c \mid x_t) = \log p(x_t \mid c) + \log p(c) - \log p(x_t).$$

Taking the gradient with respect to $x_t$ ($\log p(c)$ is constant in $x_t$):

$$\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t).$$

The CFG output is:

$$\tilde{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big).$$

Since the noise prediction relates to the score by $\epsilon_\theta(x_t, c) \approx -\sigma_t \nabla_{x_t} \log p(x_t \mid c)$ and $\epsilon_\theta(x_t, \varnothing) \approx -\sigma_t \nabla_{x_t} \log p(x_t)$, substituting gives:

$$\tilde{\epsilon} \approx -\sigma_t \Big[ \nabla_{x_t} \log p(x_t) + w\,\big(\nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)\big) \Big] = -\sigma_t \Big[ \nabla_{x_t} \log p(x_t \mid c) + (w - 1)\, \nabla_{x_t} \log p(c \mid x_t) \Big].$$

Therefore the guided score corresponds to $\nabla_{x_t} \log p(x_t \mid c) + (w - 1)\, \nabla_{x_t} \log p(c \mid x_t)$.

The guidance scale $w$ controls how strongly the model steers toward outputs that "look like" they were generated from condition $c$, as measured by the implicit classifier $p(c \mid x_t)$.


Station 4: Decoding

Narration: "On the MusicGen side: the integer tokens are mapped back to continuous vectors by reverse RVQ lookup, then inflated back into a waveform by the EnCodec decoder, 640 samples reconstructed from each vector."

22.14 MusicGen Decoding Pipeline

The decoding reverses the encoding:

  1. RVQ lookup: For each frame, retrieve the centroid vectors from each codebook: $c^{(q)}_{k_q}$ for $q = 1, \dots, 4$, where $k_q$ is the predicted token index for codebook $q$.
  2. Sum: $\hat{z} = \sum_{q=1}^{4} c^{(q)}_{k_q}$.
  3. EnCodec decoder: Transposed convolutions expand $\hat{z}$ back to 640 samples: $\hat{x} = D(\hat{z})$.
  4. Overlap-add: Adjacent frames are combined with windowing to produce the continuous waveform.

22.15 ACE-Step Decoding Pipeline

Narration: "On the ACE-Step side: the denoised latent representation first goes through the DCAE decoder to become a spectrogram, and only then through a vocoder to become a waveform."

  1. DCAE decoder: The denoised continuous latent is decoded to a mel-spectrogram.
  2. Vocoder (e.g., HiFi-GAN): Converts the mel-spectrogram to a time-domain waveform via learned upsampling convolutions.

The two-stage decode (latent → spectrogram → waveform) adds latency but avoids the information bottleneck of discrete tokenization.


Numerical Examples

Complete MusicGen Pipeline for a 10-Second Clip

| Stage | Input | Output | Dimensions |
| --- | --- | --- | --- |
| Text tokenization | "Chinese-style piano" | Token indices | $n$ integers (a handful) |
| T5 encoder | $n$ token indices | Hidden states | $n \times 1024$ (T5-large) |
| Audio encoding (EnCodec) | 320,000 samples (10 s @ 32 kHz) | Latent vectors | $500 \times 128$ |
| RVQ | 500 latent vectors | Token indices | $500 \times 4 = 2{,}000$ integers |
| Transformer generation | T5 states + past tokens | Next tokens | $500 + 3 = 503$ steps (delay pattern) |
| RVQ decode | 2,000 indices | Latent vectors | $500 \times 128$ |
| EnCodec decode | 500 latent vectors | Waveform | 320,000 samples |

Storage Comparison

| Representation | Size for 10 s of audio |
| --- | --- |
| Raw 32 kHz 16-bit | 640,000 bytes |
| EnCodec continuous latent | 256,000 bytes ($500 \times 128 \times 4$ bytes) |
| RVQ tokens (4 codebooks) | 2,000 indices $\times$ 11 bits $\approx$ 2,750 bytes |
| Compression ratio (raw to RVQ) | $\approx 233{:}1$ |

RVQ Reconstruction Quality by Layer Count

| Layers ($Q$) | Bits/frame | Bitrate | Typical SI-SDR (dB) |
| --- | --- | --- | --- |
| 1 | 11 | 550 bps | ~5 |
| 2 | 22 | 1,100 bps | ~10 |
| 4 | 44 | 2,200 bps | ~15 |
| 8 | 88 | 4,400 bps | ~20 |

Each additional layer roughly halves the remaining error (Theorem 22.1), with diminishing returns as residuals approach the noise floor.


Musical Connection


From Notation to Codebooks: Two Alphabets for Music

Narration: "These four integers are the 'letters' in the AI's musical alphabet."

Western staff notation is a human-designed alphabet for music: pitch (line/space), duration (note shape), dynamics (marking). It captures what a trained musician needs to reproduce a performance — but discards timbre, room acoustics, and micro-timing.

EnCodec’s RVQ codebook is an AI-discovered alphabet. Its “letters” — four integers per frame — encode everything the decoder needs to reconstruct perceptually faithful audio: pitch, timbre, dynamics, stereo field, room characteristics. But unlike staff notation, the codebook entries have no human-interpretable labels. Entry 1742 in codebook 1 might encode “bright attack with upper harmonics” — or it might encode a pattern that has no name in any human language.

The parallel is striking: both systems solve the same problem (compress music into a finite symbol set) but optimize for different decoders (human performer vs. neural network).

EP23 will probe what these learned letters actually encode.


Limits and Open Questions

  1. Codebook collapse: In practice, many RVQ codebook entries go unused during training. Techniques like exponential moving average updates and codebook reset heuristics mitigate this, but the effective codebook utilization remains below 100%.

  2. Tokenization granularity: EnCodec produces 50 tokens/second. Is this the right temporal resolution for music? Speech models use similar rates, but music has faster transients (drum attacks at sub-millisecond scale) that may be lost.

  3. Cross-modal alignment: Cross-attention assumes text and audio share a meaningful latent geometry. But “Chinese-style” is a high-dimensional cultural concept — how well can a T5 encoder trained on English text capture it?

  4. Guidance scale sensitivity: CFG with $w$ too high produces artifacts; with $w$ too low it ignores the prompt. The optimal $w$ varies by prompt and is typically hand-tuned.

  5. Continuous vs. discrete: MusicGen (discrete) and ACE-Step (continuous) represent fundamentally different philosophical choices. Which is better for music? The answer may depend on the downstream task.

Narration: "Next episode we ask a deeper question: what do the 'letters' in this alphabet actually encode? Tonality? Emotion? Or something humans simply cannot read?"


Academic References

  1. Raffel, C. et al. (2020). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” JMLR 21(140), 1-67.
  2. Kudo, T. & Richardson, J. (2018). “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.” EMNLP 2018.
  3. van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). “Neural Discrete Representation Learning.” NeurIPS 2017.
  4. Defossez, A. et al. (2022). “High Fidelity Neural Audio Compression.” arXiv:2210.13438. — The EnCodec paper.
  5. Zeghidour, N. et al. (2021). “SoundStream: An End-to-End Neural Audio Codec.” IEEE/ACM Trans. Audio, Speech, Lang. Process. 30, 495-507.
  6. Copet, J. et al. (2023). “Simple and Controllable Music Generation.” NeurIPS 2023. — The MusicGen paper; introduces the delay pattern.
  7. Peebles, W. & Xie, S. (2023). “Scalable Diffusion Models with Transformers.” ICCV 2023. — The DiT paper.
  8. Ho, J. & Salimans, T. (2022). “Classifier-Free Diffusion Guidance.” NeurIPS 2021 Workshop.
  9. Ho, J., Jain, A. & Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
  10. Vaswani, A. et al. (2017). “Attention Is All You Need.” NeurIPS 2017. — The Transformer paper; defines scaled dot-product attention.
  11. Agostinelli, A. et al. (2023). “MusicLM: Generating Music From Text.” arXiv:2301.11325.
  12. Huang, R. et al. (2024). “ACE-Step: A Step Towards Music Generation Foundation Model.” arXiv. — Continuous diffusion approach to music generation.
  13. Kong, J., Kim, J. & Bae, J. (2020). “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” NeurIPS 2020.