EP22: How AI Writes Music — EnCodec and RVQ

EnCodec RVQ, causal Transformer, diffusion DiT
8:10 · Statistics/ML · Signal Processing

Overview

Narration: "You type a line of text into an AI music tool: 'a Chinese-style piano piece.' Ten seconds later, a melody flows out. What exactly does the data go through in those ten seconds?"

This episode traces the complete data pipeline of a text-to-music system, from the moment you type a text prompt to the moment audio emerges from the speaker. We follow two parallel architectures:

  1. MusicGen (Meta, 2023): text → T5 encoder → Transformer with delay pattern → discrete tokens → RVQ decode → EnCodec decoder → waveform.
  2. ACE-Step (2024): text → T5 encoder → DiT diffusion in continuous latent space → DCAE decoder → vocoder → waveform.

The mathematical core: how to compress audio into a small discrete alphabet (EnCodec + RVQ), how to generate sequences from that alphabet conditioned on text (cross-attention + delay pattern), and how diffusion offers a continuous alternative (DiT + classifier-free guidance).

In EP21, we surveyed sixty years of AI composition paradigms: Markov chains, neural sequence models, diffusion. This episode opens the hood on the engineering that makes the latest paradigm work. EP23 will ask what the learned codebook entries actually encode.


Prerequisites


Station 1: Text Encoding (T5)

Narration: "Station one: your text enters a text encoder called T5. What comes out is not notes and not a melody, but a set of hidden-state vectors. Think of it as a 'style consultant.'"

22.1 Tokenization: SentencePiece

Before T5 can process text, the raw string must be broken into subword units. T5 uses SentencePiece (Kudo & Richardson, 2018), a language-independent tokenizer trained on raw text via a unigram language model or BPE. The vocabulary size is 32,000 subword tokens.

Definition 22.1 (Subword Tokenization)
Given a vocabulary $V$ of subword units, a tokenizer maps an input string $s$ to a sequence of token indices $(x_1, \dots, x_n)$, $x_i \in \{1, \dots, |V|\}$, where $n$ depends on the string. SentencePiece (in its unigram mode) finds the segmentation maximizing the unigram log-likelihood:

$$x^* = \arg\max_{x \in S(s)} \sum_{i=1}^{n} \log p(x_i),$$

where $S(s)$ is the set of all valid segmentations of $s$.

Worked example: The prompt “a Chinese-style piano piece” might tokenize as ["▁a", "▁Chinese", "-", "style", "▁piano", "▁piece"], six tokens, each mapped to an integer index in $\{0, 1, \dots, 31999\}$.
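To make Definition 22.1 concrete, here is a minimal sketch of unigram segmentation by dynamic programming over a toy vocabulary. The vocabulary and its log-probabilities are invented for illustration; this is not SentencePiece's actual model or API.

```python
import math

# Toy unigram vocabulary: log-probabilities are invented for illustration.
log_p = {
    "▁a": math.log(0.05), "▁piano": math.log(0.01), "▁pia": math.log(0.002),
    "no": math.log(0.004), "▁p": math.log(0.01), "iano": math.log(0.001),
}

def segment(text):
    """Return the segmentation maximizing the summed unigram log-likelihood
    (Viterbi over all split points), as in Definition 22.1."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)   # best[i] = (score, split point) for prefix text[:i]
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in log_p and best[j][0] + log_p[piece] > best[i][0]:
                best[i] = (best[j][0] + log_p[piece], j)
    # Backtrack to recover the winning pieces.
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return list(reversed(pieces)), best[n][0]

print(segment("▁a▁piano"))   # (['▁a', '▁piano'], best log-likelihood)
```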

22.2 T5 Encoder Architecture

T5 (Raffel et al., 2020) is an encoder-decoder Transformer pre-trained on the C4 corpus (Colossal Clean Crawled Corpus, ~750 GB of English text). MusicGen uses only the encoder half.

Each token index is embedded into a vector in $\mathbb{R}^{d_{\text{model}}}$ ($d_{\text{model}} = 768$ for T5-base, $1024$ for T5-large). The encoder applies a stack of Transformer blocks (self-attention + feed-forward), producing a sequence of hidden states:

$$H = \mathrm{T5Encoder}(x_1, \dots, x_n) \in \mathbb{R}^{n \times d_{\text{model}}}.$$

The output is a sequence of $n$ vectors, one per subword token. These are not audio features; they are linguistic representations that will later be queried by the audio generator via cross-attention.

Why T5? Its pre-training on a massive text corpus gives it rich semantic representations. The phrase “Chinese-style piano” activates different hidden-state patterns than “jazz drum solo,” encoding genre, instrumentation, and cultural associations — all without any music-specific training.
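A minimal sketch of extracting these hidden states with the Hugging Face transformers library, assuming the public t5-base checkpoint (hidden size 768) is available; the exact token count depends on the tokenizer and includes the end-of-sequence token.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")   # encoder half only, no decoder

prompt = "a Chinese-style piano piece"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (1, n_tokens, 768)

print(hidden.shape)   # e.g. torch.Size([1, 7, 768]); one 768-dim vector per subword token
```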


Station 2: Audio Compression (EnCodec + RVQ)

Narration: "First, pre-train an encoder: Meta's EnCodec compresses the raw waveform layer by layer, squeezing every 640 samples into one 128-dimensional vector. A 640-to-1 compression in time."

22.3 EnCodec Encoder: Strided Convolutions

Definition 22.2 (EnCodec Encoder)

The EnCodec encoder is a stack of strided 1-D convolutional layers. Each layer $i$ downsamples time by a factor $s_i$. For 32 kHz audio the strides are 4, 4, 5, and 8, giving a total downsampling factor of

$$\prod_i s_i = 4 \times 4 \times 5 \times 8 = 640.$$

The encoder maps each window of 640 raw audio samples to a single latent vector:

$$E: \mathbb{R}^{640} \to \mathbb{R}^{128}, \qquad z_t = E\big(x_{640t}, \dots, x_{640t + 639}\big).$$

At a 32 kHz sample rate, the encoder produces $32{,}000 / 640 = 50$ latent vectors per second of audio.

Compression ratio: 640 samples of 16-bit audio occupy $640 \times 16 = 10{,}240$ bits. The 128-dimensional continuous vector will be further quantized to roughly 44 bits (four codebook indices of $\log_2 2048 = 11$ bits each), a compression ratio of approximately $10{,}240 / 44 \approx 233{:}1$.
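A quick back-of-the-envelope check of these numbers in Python (the values 640, 128, four codebooks of 2048 entries, and 32 kHz all come from the text above):

```python
import math

sample_rate = 32_000                            # Hz
hop = 4 * 4 * 5 * 8                             # total stride product = 640 samples per latent frame
frames_per_sec = sample_rate // hop
bits_per_frame_raw = hop * 16                   # 16-bit PCM
bits_per_frame_rvq = 4 * int(math.log2(2048))   # 4 codebooks x 11 bits

print(frames_per_sec)                                       # 50 frames per second
print(bits_per_frame_raw, bits_per_frame_rvq)               # 10240 vs 44 bits per frame
print(round(bits_per_frame_raw / bits_per_frame_rvq))       # ~233:1 compression
print(frames_per_sec * bits_per_frame_rvq / 1000, "kbps")   # 2.2 kbps token bitrate
```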

22.4 The EnCodec Decoder

The decoder mirrors the encoder with transposed convolutions, upsampling by the same factors in reverse order:

$$D: \mathbb{R}^{128} \to \mathbb{R}^{640}, \qquad \hat{x}_{640t}, \dots, \hat{x}_{640t+639} = D(\hat{z}_t).$$

Each transposed-convolution layer upsamples by its stride factor, so the stack expands one 128-dimensional latent vector back into 640 waveform samples.
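A minimal PyTorch sketch of this downsample/upsample skeleton. It is an illustrative toy, not the real EnCodec architecture (which uses overlapping kernels, residual blocks, and an LSTM), but it reproduces the 640x stride structure and the shape bookkeeping.

```python
import torch
import torch.nn as nn

strides = [4, 4, 5, 8]                     # product = 640
channels = [1, 32, 64, 96, 128]            # channel widths are illustrative

# Encoder: each Conv1d strides time down by s_i. Kernel size == stride keeps the
# shape arithmetic exact; the real codec uses overlapping kernels.
encoder = nn.Sequential(*[
    nn.Sequential(nn.Conv1d(channels[i], channels[i + 1], kernel_size=s, stride=s),
                  nn.ELU())
    for i, s in enumerate(strides)
])

# Decoder: transposed convolutions with the same strides in reverse order.
decoder = nn.Sequential(*[
    nn.Sequential(nn.ConvTranspose1d(channels[-1 - i], channels[-2 - i],
                                     kernel_size=s, stride=s),
                  nn.ELU())
    for i, s in enumerate(reversed(strides))
])

x = torch.randn(1, 1, 32_000)     # one second of mono audio at 32 kHz
z = encoder(x)
print(z.shape)                    # torch.Size([1, 128, 50]): 50 latent frames per second
print(decoder(z).shape)           # torch.Size([1, 1, 32000]): waveform length restored
```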

22.5 EnCodec Training Loss

The encoder-decoder pair is trained end-to-end with a multi-component loss:

| Term | Formula (schematic) | Purpose |
| --- | --- | --- |
| Reconstruction $\ell_{\text{rec}}$ | $\lVert x - \hat{x} \rVert_1 + \sum_i \lVert \mathrm{Mel}_i(x) - \mathrm{Mel}_i(\hat{x}) \rVert$ over several spectrogram scales | Waveform + spectral fidelity |
| Adversarial $\ell_{\text{adv}}$ | GAN discriminator loss | Perceptual quality |
| Feature matching $\ell_{\text{feat}}$ | $\sum_l \lVert D^{(l)}(x) - D^{(l)}(\hat{x}) \rVert_1$ | Feature matching across discriminator layers |
| Commitment $\ell_{\text{commit}}$ | $\lVert z_e - \mathrm{sg}[q(z_e)] \rVert_2^2$ | Encoder commits to nearest codebook entry |

Here $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.

22.6 Vector Quantization (VQ)

Narration: "Then residual vector quantization turns the continuous vector into discrete indices: four codebook layers, 2048 centroids per layer, each layer correcting the error left by the previous one."

Definition 22.3 (Vector Quantization)
A codebook $\mathcal{C} = \{c_1, \dots, c_K\} \subset \mathbb{R}^d$ is a finite set of $K$ centroids (codewords). The quantization function maps a continuous vector $z \in \mathbb{R}^d$ to its nearest centroid:

$$Q(z) = c_{k^*}, \qquad k^* = \arg\min_{k \in \{1, \dots, K\}} \lVert z - c_k \rVert_2.$$

Worked example (illustrative values): Let $d = 2$, $K = 4$, with centroids $c_1 = (0, 0)$, $c_2 = (1, 0)$, $c_3 = (0, 1)$, $c_4 = (1, 1)$. For $z = (0.9, 0.2)$:

| Centroid | Distance $\lVert z - c_k \rVert_2$ |
| --- | --- |
| $c_1 = (0, 0)$ | $\sqrt{0.81 + 0.04} \approx 0.92$ |
| $c_2 = (1, 0)$ | $\sqrt{0.01 + 0.04} \approx 0.22$ |
| $c_3 = (0, 1)$ | $\sqrt{0.81 + 0.64} \approx 1.20$ |
| $c_4 = (1, 1)$ | $\sqrt{0.01 + 0.64} \approx 0.81$ |

So $Q(z) = c_2 = (1, 0)$, and we store only the index $k^* = 2$.
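The same nearest-centroid lookup in a few lines of NumPy, using the illustrative centroids from the example above:

```python
import numpy as np

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # K=4 centroids, d=2
z = np.array([0.9, 0.2])

# Quantize: index of the nearest centroid in Euclidean distance.
dists = np.linalg.norm(codebook - z, axis=1)
k_star = int(np.argmin(dists))

print(dists.round(2))            # [0.92 0.22 1.2  0.81]
print(k_star, codebook[k_star])  # 1 [1. 0.]  (0-based index; the text's c_2)
```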

The gradient problem: The $\arg\min$ operation is not differentiable. VQ-VAE (van den Oord et al., 2017) solves this with the straight-through estimator: during backpropagation, the gradient at the quantized output $z_q$ is simply copied to the encoder output $z_e$:

$$\frac{\partial \mathcal{L}}{\partial z_e} \approx \frac{\partial \mathcal{L}}{\partial z_q}.$$

The full VQ-VAE loss decomposes as:

$$\mathcal{L} = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}} + \underbrace{\lVert \mathrm{sg}[z_e] - e \rVert_2^2}_{\text{codebook}} + \beta \underbrace{\lVert z_e - \mathrm{sg}[e] \rVert_2^2}_{\text{commitment}}, \qquad e = Q(z_e).$$
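A minimal PyTorch sketch of the straight-through trick: the quantized vector is used in the forward pass, but gradients flow to the encoder output as if quantization were the identity. Variable and function names here are mine.

```python
import torch

def quantize_st(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-centroid quantization with a straight-through gradient.

    z_e:      (batch, d) encoder outputs
    codebook: (K, d) centroids
    """
    dists = torch.cdist(z_e, codebook)    # (batch, K) pairwise distances
    idx = dists.argmin(dim=1)             # nearest centroid index per vector
    z_q = codebook[idx]                   # quantized vectors

    # Forward value is z_q; backward gradient w.r.t. z_e is copied unchanged,
    # because the (z_q - z_e) term is detached from the graph.
    return z_e + (z_q - z_e).detach()

z_e = torch.randn(8, 128, requires_grad=True)
codebook = torch.randn(2048, 128)
quantize_st(z_e, codebook).sum().backward()
print(torch.allclose(z_e.grad, torch.ones_like(z_e)))  # True: gradient passed straight through
```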

22.7 Residual Vector Quantization (RVQ)

A single codebook with $K$ centroids in $\mathbb{R}^d$ cannot represent the full diversity of audio. Naively increasing $K$ is impractical: storage and nearest-neighbor search both scale as $O(K)$, and matching 44 bits of precision with one codebook would require $K = 2^{44}$ centroids. RVQ solves this by stacking multiple codebooks that successively correct the residual error.

Definition 22.4 (Residual Vector Quantization (RVQ))

Given $Q$ codebooks $\mathcal{C}_1, \dots, \mathcal{C}_Q$, each with $K$ centroids in $\mathbb{R}^d$, define the RVQ encoding recursively:

$$r_1 = z, \qquad k_q = \arg\min_{k} \lVert r_q - c_k^{(q)} \rVert_2, \qquad r_{q+1} = r_q - c_{k_q}^{(q)}, \qquad q = 1, \dots, Q,$$

where $c_k^{(q)} \in \mathcal{C}_q$, so layer $q$ quantizes the residual $r_q$ using codebook $\mathcal{C}_q$. The RVQ reconstruction is:

$$\hat{z}_Q = \sum_{q=1}^{Q} c_{k_q}^{(q)}.$$

Narration: "In the end, one frame of audio is just four integers. These four integers are the 'letters' in the AI's musical alphabet."

Worked example (RVQ with $Q = 3$, $d = 2$, $K = 16$; illustrative values): Let $z = (0.90, -0.30)$.

Layer 1: $r_1 = z$. Nearest centroid in $\mathcal{C}_1$: $c^{(1)}_{7} = (1.00, -0.25)$, index 7. Residual: $r_2 = r_1 - c^{(1)}_{7} = (-0.10, -0.05)$.

Layer 2: Nearest in $\mathcal{C}_2$: $c^{(2)}_{12} = (-0.08, -0.06)$, index 12. Residual: $r_3 = (-0.02, 0.01)$.

Layer 3: Nearest in $\mathcal{C}_3$: $c^{(3)}_{0} = (-0.02, 0.00)$, index 0. Residual: $r_4 = (0.00, 0.01)$.

Reconstruction: $\hat{z} = c^{(1)}_{7} + c^{(2)}_{12} + c^{(3)}_{0} = (0.90, -0.31)$. Error: $\lVert z - \hat{z} \rVert_2 = 0.01$.

The frame is stored as three integers: $(7, 12, 0)$.
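A compact NumPy sketch of RVQ encode and decode. The codebooks here are random, purely for illustration (real codebooks are learned, for example by k-means on the previous layer's residuals), and the test frame is planted so the toy codebooks have something meaningful to find.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, K, d = 4, 2048, 128
# One random codebook per layer; later layers use smaller scales, mimicking the
# shrinking residuals they are meant to quantize.
codebooks = [rng.normal(scale=0.5 ** q, size=(K, d)) for q in range(Q)]

def rvq_encode(z):
    """Return the Q codebook indices and the residual norm after each layer."""
    indices, norms, r = [], [], z.copy()
    for C in codebooks:
        k = int(np.argmin(np.linalg.norm(C - r, axis=1)))
        indices.append(k)
        r = r - C[k]                                   # the next layer quantizes this residual
        norms.append(round(float(np.linalg.norm(r)), 3))
    return indices, norms

def rvq_decode(indices):
    """Reconstruct the frame by summing one centroid per layer."""
    return sum(C[k] for C, k in zip(codebooks, indices))

# A frame the codebooks can actually describe: one centroid per layer plus small noise.
true_idx = [int(rng.integers(K)) for _ in range(Q)]
z = sum(codebooks[q][true_idx[q]] for q in range(Q)) + 0.01 * rng.normal(size=d)

idx, norms = rvq_encode(z)
print(idx, true_idx)   # four integers per frame; here they should recover the planted indices
print(norms)           # residual norm shrinks layer by layer (cf. Theorem 22.1)
print(round(float(np.linalg.norm(z - rvq_decode(idx))), 3))   # equals the final residual norm
```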

Theorem 22.1 (RVQ Successive Approximation)
The RVQ reconstruction error is non-increasing in the number of quantization layers. Formally, for every $q \geq 1$:

$$\lVert z - \hat{z}_{q+1} \rVert_2 \;\leq\; \lVert z - \hat{z}_q \rVert_2,$$

with equality if and only if layer $q+1$ quantizes its residual to the zero vector (in particular, whenever the residual is already zero).
Proof.

It suffices to show that adding one more layer does not increase the error. After $q$ layers the error is exactly the current residual: by the recursion in Definition 22.4,

$$z - \hat{z}_q = z - \sum_{j \leq q} c^{(j)}_{k_j} = r_{q+1}.$$

Layer $q+1$ picks the centroid of $\mathcal{C}_{q+1}$ nearest to $r_{q+1}$, so the new error is

$$\lVert z - \hat{z}_{q+1} \rVert_2 = \lVert r_{q+1} - c^{(q+1)}_{k_{q+1}} \rVert_2 = \min_{c \in \mathcal{C}_{q+1}} \lVert r_{q+1} - c \rVert_2.$$

Suppose the zero vector is available as a centroid in $\mathcal{C}_{q+1}$ (a "do nothing" option). Then the minimum over the codebook is at most the value attained at $c = 0$:

$$\min_{c \in \mathcal{C}_{q+1}} \lVert r_{q+1} - c \rVert_2 \;\leq\; \lVert r_{q+1} - 0 \rVert_2 = \lVert r_{q+1} \rVert_2 = \lVert z - \hat{z}_q \rVert_2.$$

Equality holds exactly when the zero vector is among the nearest centroids, i.e. when layer $q+1$ cannot improve on doing nothing; in particular this happens when $r_{q+1} = 0$.

In practice the zero vector need not be an explicit codebook entry, but the conclusion still holds to good approximation: codebooks are learned by k-means on the residuals of the previous layer, those residuals have near-zero mean, so every trained codebook contains centroids very close to the origin. Empirically, each additional layer shrinks the residual norm roughly geometrically in trained RVQ systems.

Effective codebook size: With $Q = 4$ layers of $K = 2048$ centroids each, the effective number of representable vectors is $K^Q = 2048^4 = 2^{44} \approx 1.8 \times 10^{13}$, far larger than any single codebook could achieve.

Bit rate: Each frame requires $Q \cdot \log_2 K = 4 \times 11 = 44$ bits. At 50 frames/second, the total bitrate is $50 \times 44 = 2{,}200$ bits/s $= 2.2$ kbps. Compare: uncompressed 32 kHz 16-bit audio is $32{,}000 \times 16 = 512$ kbps.


Station 3: Token Generation (Delay Pattern + Cross-Attention)

Narration: "Station three: the Transformer starts generating. But how do you handle four codebook layers at once? The key innovation is called the delay pattern: instead of flattening the four codebooks into one very long sequence, each layer is offset by one step."

22.8 The Flattening Problem

After RVQ encoding, each time frame $t$ is represented by $Q = 4$ integer tokens $(k_{1,t}, \dots, k_{4,t})$. For $T$ frames, the naive approach is to flatten all tokens into a single sequence of length $Q \cdot T$. At 50 frames/second for 30 seconds of audio, this gives $4 \times 1500 = 6000$ tokens. That is manageable, but it has a fundamental problem: the Transformer's self-attention has $O(n^2)$ complexity, and the flattened ordering imposes artificial sequential dependencies between codebook layers.

22.9 The Delay Pattern

Definition 22.5 (Delay Pattern)

In the delay pattern (Copet et al., 2023), codebook $q$ is offset by $q - 1$ timesteps, so the token for codebook $q$ at frame $t$ is emitted at generation step $s = t + (q - 1)$. At generation step $s$, the model predicts in parallel:

$$\big(k_{1,s},\; k_{2,s-1},\; \dots,\; k_{Q,\,s-Q+1}\big).$$

The causal attention mask allows the token at position $(q, t)$ (codebook $q$, time frame $t$) to attend to all tokens at positions $(q', t')$ satisfying:

$$t' + (q' - 1) \;<\; t + (q - 1).$$

Why this works: At generation step $s$, the model generates $k_{1,s}$ (the coarsest new token) using only past information. Simultaneously, it generates $k_{2,s-1}$, the second-layer refinement of the previous frame, which can now condition on $k_{1,s-1}$ (already generated one step earlier). The offsets ensure each codebook layer can see the coarser layers of the same frame.
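A small NumPy sketch of applying and undoing the delay pattern on a Q x T token matrix, using a pad token for the triangular edges; the pad value and helper names are mine.

```python
import numpy as np

PAD = -1  # placeholder for positions with no real token (the triangular edges)

def apply_delay(tokens):
    """Shift codebook row q right by q steps -> a (Q, T + Q - 1) generation grid."""
    Q, T = tokens.shape
    grid = np.full((Q, T + Q - 1), PAD, dtype=tokens.dtype)
    for q in range(Q):
        grid[q, q:q + T] = tokens[q]
    return grid

def undo_delay(grid, T):
    """Inverse operation: recover the original (Q, T) token matrix."""
    Q = grid.shape[0]
    return np.stack([grid[q, q:q + T] for q in range(Q)])

tokens = np.arange(12).reshape(4, 3)   # Q=4 codebooks, T=3 frames (toy values)
grid = apply_delay(tokens)
print(grid)
# [[ 0  1  2 -1 -1 -1]
#  [-1  3  4  5 -1 -1]
#  [-1 -1  6  7  8 -1]
#  [-1 -1 -1  9 10 11]]
# Column s holds the tokens generated at step s: frame s of codebook 1,
# frame s-1 of codebook 2, and so on (Definition 22.5).
print(np.array_equal(undo_delay(grid, T=3), tokens))  # True
```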

Comparison:

| Method | Sequence length | Steps for $T$ frames | Quality |
| --- | --- | --- | --- |
| Flat (concatenate) | $Q \cdot T$ | $Q \cdot T$ | Baseline |
| Parallel (all at once) | $T$ | $T$ | Lower (no inter-codebook dependencies) |
| Delay pattern | $T + Q - 1$ | $T + Q - 1$ | Best (captures inter-codebook structure) |

For $Q = 4$, $T = 1500$ (30 s at 50 frames/s): flat requires 6000 steps; the delay pattern requires 1503, a roughly $4\times$ speedup with no quality loss.

Theorem 22.2 (Delay Pattern Causal Ordering)
The delay pattern defines a valid autoregressive factorization. That is, the joint distribution over all tokens factorizes as:

$$p\big(\{k_{q,t}\}\big) = \prod_{s=1}^{T+Q-1} p\Big(\{k_{q,\,s-q+1} : 1 \leq q \leq Q\} \;\Big|\; \text{all tokens generated at steps} < s\Big),$$

provided we define $k_{q,t}$ to be a special padding token for $t < 1$ or $t > T$.
Proof.

We must verify that the dependency order induced by the delay pattern is acyclic. Define a directed graph with one vertex per token $k_{q,t}$ and an edge $(q', t') \to (q, t)$ whenever $(q', t')$ is in the attention set of $(q, t)$.

Assign each vertex the score $s(q, t) = t + (q - 1)$, the generation step at which that token is produced. By the causal mask of Definition 22.5, an edge $(q', t') \to (q, t)$ exists only when

$$t' + (q' - 1) \;<\; t + (q - 1),$$

i.e. only when the source has a strictly smaller score than the target. Every dependency edge therefore increases the score strictly, so no directed cycle can exist: the graph is a DAG, the generation steps give a topological order, and conditioning each step's tokens on all tokens from strictly earlier steps yields a valid autoregressive factorization by the chain rule of probability.

22.10 Cross-Attention: Text Conditions the Audio

Narration: "Every time it generates a column of tokens, the Transformer looks back at the text hidden states parked at station one, via cross-attention. The audio tokens act as queries; the text vectors act as keys and values. That is how the 'Chinese style' information gets injected, step by step. Note: this is consultation, not translation."

Definition 22.6 (Cross-Attention)
Let $X \in \mathbb{R}^{m \times d}$ be the current audio token embeddings and $H \in \mathbb{R}^{n \times d_{\text{model}}}$ be the T5 hidden states. Cross-attention computes:

$$\mathrm{CrossAttn}(X, H) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q, \quad K = HW_K, \quad V = HW_V,$$

where $W_Q$, $W_K$, $W_V$ are learned projection matrices.
Theorem 22.3 (Cross-Attention as Conditional Generation)
Cross-attention implements conditional generation: the output distribution over audio tokens is modulated by the text embedding. Formally, the cross-attention output for audio position $i$ is a convex combination of text value vectors:

$$o_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right),$$

where $\alpha_{ij} \geq 0$ and $\sum_j \alpha_{ij} = 1$.
Proof.
The softmax function outputs a probability distribution by construction: for any input vector $u \in \mathbb{R}^n$, $\mathrm{softmax}(u)_j = e^{u_j} / \sum_k e^{u_k} \geq 0$ and $\sum_j \mathrm{softmax}(u)_j = 1$. Setting $u_j = q_i \cdot k_j / \sqrt{d_k}$ gives the attention weights $\alpha_{ij}$. The output is $o_i = \sum_j \alpha_{ij} v_j$, a convex combination (non-negative coefficients summing to 1) of the text value vectors. Each audio position therefore "selects" a weighted mixture of text representations, so the text embedding modulates the audio generation at every position.

Key distinction from self-attention: In self-attention, queries, keys, and values all come from the same sequence. In cross-attention, queries come from one modality (audio) while keys and values come from another (text). The audio “asks questions” and the text “provides answers.”
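A minimal single-head cross-attention in NumPy, following Definition 22.6; the dimensions and weight matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 6                     # m audio positions (queries), n text tokens (keys/values)
d, d_model, d_k = 64, 1024, 64

X = rng.normal(size=(m, d))          # audio token embeddings
H = rng.normal(size=(n, d_model))    # T5 hidden states

W_Q = rng.normal(size=(d, d_k)) / np.sqrt(d)
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

Q, K, V = X @ W_Q, H @ W_K, H @ W_V
scores = Q @ K.T / np.sqrt(d_k)                        # (m, n): audio queries vs text keys
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)              # row-wise softmax -> attention weights
out = alpha @ V                                        # each row: convex combination of text values

print(alpha.shape, out.shape)             # (10, 6) (10, 64)
print(np.allclose(alpha.sum(axis=1), 1))  # True: weights per audio position sum to 1 (Theorem 22.3)
```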


Alternative Path: Diffusion (DiT + CFG)

Narration: "ACE-Step takes a completely different direction: skip the alphabet, skip discretization. The starting point is a ball of pure noise."

22.11 DiT: Diffusion Transformer

Definition 22.7 (Diffusion Transformer (DiT))

The Diffusion Transformer (Peebles & Xie, 2023) replaces the U-Net backbone of standard diffusion models with a Transformer operating on patches of the continuous latent representation.

For ACE-Step, the input is a continuous latent $z \in \mathbb{R}^{C \times T \times F}$ (channels $\times$ time $\times$ frequency), not discrete tokens. The pipeline:

  1. Patchify: Divide $z$ into non-overlapping patches, embed each as a vector.
  2. Transformer blocks: Self-attention and feed-forward layers with adaptive layer normalization (adaLN) conditioned on the diffusion timestep and text embedding.
  3. Unpatchify: Reshape the output back to $\mathbb{R}^{C \times T \times F}$.

The adaLN conditioning works by predicting scale and shift parameters from the conditioning signal:

$$(\gamma, \beta) = \mathrm{MLP}(c_{\text{cond}}), \qquad \mathrm{adaLN}(h) = \gamma \odot \mathrm{LayerNorm}(h) + \beta,$$

where $c_{\text{cond}}$ combines the diffusion timestep embedding and the text embedding.

This is fundamentally different from MusicGen: no codebook, no VQ, no discrete tokens. The entire generation happens in continuous space.
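A PyTorch sketch of adaLN modulation. The conditioning vector here stands in for the sum of timestep and text embeddings, and the module is a simplification of DiT's adaLN block (the paper's adaLN-Zero variant additionally gates residual branches).

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Layer norm whose scale and shift are predicted from a conditioning vector."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # affine comes from the condition
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_patches, dim) patch embeddings; c: (batch, cond_dim) conditioning
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)     # (batch, dim) each
        return gamma.unsqueeze(1) * self.norm(h) + beta.unsqueeze(1)

block = AdaLN(dim=256, cond_dim=512)
h = torch.randn(2, 100, 256)      # 100 latent patches per item
c = torch.randn(2, 512)           # timestep embedding + text embedding (placeholder)
print(block(h, c).shape)          # torch.Size([2, 100, 256])
```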

22.12 The Forward Diffusion Process

Narration: "Classic diffusion models like DDPM have a forward noising process that is... a Markov chain. The very first AI-composition paradigm from last episode, sixty-year-old mathematics, hiding at the heart of a modern generative model."

As discussed in EP21, the forward process of DDPM is a Markov chain that gradually adds Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\big), \qquad t = 1, \dots, T.$$

The reverse process (denoising) is learned by a neural network $\epsilon_\theta(x_t, t, c)$ that predicts the noise added at each step. DiT uses a Transformer as the architecture for $\epsilon_\theta$.
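A small NumPy sketch of the forward process, using the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ obtained by composing the per-step Gaussians above; the linear beta schedule is a common default, not something specified in this episode.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (common default)
alpha_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(8, 128))           # a toy batch of continuous latents
for t in (0, 100, 500, 999):
    xt = q_sample(x0, t)
    print(t, round(float(np.sqrt(alpha_bar[t])), 3))   # signal scale decays toward 0 as t grows
```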

22.13 Classifier-Free Guidance (CFG)

Definition 22.8 (Classifier-Free Guidance)
During training, the text condition $c$ is randomly dropped (replaced with the null condition $\varnothing$) with probability $p_{\text{uncond}}$. At inference, the conditional and unconditional noise predictions are interpolated:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big),$$

where $w$ is the guidance scale.

Interpretation by cases:

| Guidance scale $w$ | Behavior |
| --- | --- |
| $w = 1$ | Pure conditional generation: $\tilde{\epsilon} = \epsilon_\theta(x_t, c)$ |
| $w > 1$ | Amplified conditioning; moves away from the unconditional prediction |
| $w \gg 1$ | Extreme adherence to text (often causes artifacts) |
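The inference-time combination itself is one line; here is a sketch with a stand-in denoiser to show how the two forward passes are mixed. The model function is a placeholder, not a real checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_theta(x_t, cond):
    """Stand-in for the trained noise predictor; cond=None means the null condition."""
    bias = 0.0 if cond is None else 0.3 * cond           # pretend the condition shifts the prediction
    return 0.1 * x_t + bias + 0.01 * rng.normal(size=x_t.shape)

def cfg_eps(x_t, cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_uncond = eps_theta(x_t, None)
    eps_cond = eps_theta(x_t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

x_t = rng.normal(size=(1, 128))
cond = rng.normal(size=(1, 128))        # e.g. a pooled text embedding
for w in (1.0, 3.0, 10.0):
    print(w, round(float(np.linalg.norm(cfg_eps(x_t, cond, w))), 3))
# Larger w pushes the prediction further along the (conditional - unconditional) direction.
```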
Theorem 22.4 (Classifier-Free Guidance Interpolation)
CFG implicitly performs gradient ascent on $\log p(c \mid x_t)$. Specifically, the guided score satisfies:

$$\tilde{\nabla}_{x_t} = \nabla_{x_t} \log p(x_t \mid c) + (w - 1)\, \nabla_{x_t} \log p(c \mid x_t),$$

where $p(c \mid x_t)$ is the implicit classifier.
Proof.

By Bayes' rule:

$$\log p(c \mid x_t) = \log p(x_t \mid c) + \log p(c) - \log p(x_t).$$

Taking the gradient with respect to $x_t$ ($\log p(c)$ is constant in $x_t$):

$$\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t).$$

The CFG output is:

$$\tilde{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big).$$

Since the noise prediction relates to the score by $\epsilon_\theta(x_t, c) \approx -\sigma_t \nabla_{x_t} \log p(x_t \mid c)$ and $\epsilon_\theta(x_t, \varnothing) \approx -\sigma_t \nabla_{x_t} \log p(x_t)$, substituting gives:

$$\tilde{\epsilon} \approx -\sigma_t \Big[ \nabla_{x_t} \log p(x_t) + w\,\big(\nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)\big) \Big] = -\sigma_t \Big[ \nabla_{x_t} \log p(x_t \mid c) + (w - 1)\, \nabla_{x_t} \log p(c \mid x_t) \Big].$$

Therefore the guided score corresponds to $\nabla_{x_t} \log p(x_t \mid c) + (w - 1)\, \nabla_{x_t} \log p(c \mid x_t)$.

The guidance scale $w$ controls how strongly the model steers toward outputs that "look like" they were generated from condition $c$, as measured by the implicit classifier $p(c \mid x_t)$.


Station 4: Decoding

Narration: "On the MusicGen side: the integer tokens are mapped back to continuous vectors by reverse RVQ lookup, then inflated back into a waveform by the EnCodec decoder, 640 samples reconstructed from each vector."

22.14 MusicGen Decoding Pipeline

The decoding reverses the encoding:

  1. RVQ lookup: For each frame, retrieve the centroid vectors from each codebook: $c^{(q)}_{k_q}$ for $q = 1, \dots, 4$, where $k_q$ is the predicted token index for codebook $q$.
  2. Sum: $\hat{z} = \sum_{q=1}^{4} c^{(q)}_{k_q}$.
  3. EnCodec decoder: Transposed convolutions expand $\hat{z}$ back to 640 samples: $\hat{x} = D(\hat{z})$.
  4. Overlap-add: Adjacent frames are combined with windowing to produce the continuous waveform.

22.15 ACE-Step Decoding Pipeline

Narration: "On the ACE-Step side: the denoised latent representation first goes through the DCAE decoder to become a spectrogram, and only then through a vocoder to become a waveform."

  1. DCAE decoder: The denoised continuous latent is decoded to a mel-spectrogram.
  2. Vocoder (e.g., HiFi-GAN): Converts the mel-spectrogram to a time-domain waveform via learned upsampling convolutions.

The two-stage decode (latent → spectrogram → waveform) adds latency but avoids the information bottleneck of discrete tokenization.


Numerical Examples

Complete MusicGen Pipeline for a 10-Second Clip

| Stage | Input | Output | Dimensions |
| --- | --- | --- | --- |
| Text tokenization | "Chinese-style piano" | Token indices | $n$ integers (a handful) |
| T5 encoder | $n$ token indices | Hidden states | $n \times 1024$ (T5-large) |
| Audio encoding (EnCodec) | 320,000 samples (10 s @ 32 kHz) | Latent vectors | $500 \times 128$ |
| RVQ | 500 latent vectors | Token indices | $500 \times 4 = 2{,}000$ integers |
| Transformer generation | T5 states + past tokens | Next tokens | $500 + 3 = 503$ steps (delay pattern) |
| RVQ decode | 2,000 indices | Latent vectors | $500 \times 128$ |
| EnCodec decode | 500 latent vectors | Waveform | 320,000 samples |

Storage Comparison

| Representation | Size for 10 s of audio |
| --- | --- |
| Raw 32 kHz 16-bit | 640,000 bytes |
| EnCodec continuous latent | 256,000 bytes ($500 \times 128 \times 4$ bytes) |
| RVQ tokens (4 codebooks) | 2,000 indices $\times$ 11 bits $\approx$ 2,750 bytes |
| Compression ratio (raw to RVQ) | $\approx 233{:}1$ |

RVQ Reconstruction Quality by Layer Count

| Layers ($Q$) | Bits/frame | Bitrate | Typical SI-SDR (dB) |
| --- | --- | --- | --- |
| 1 | 11 | 550 bps | ~5 |
| 2 | 22 | 1,100 bps | ~10 |
| 4 | 44 | 2,200 bps | ~15 |
| 8 | 88 | 4,400 bps | ~20 |

Each additional layer roughly halves the remaining error (Theorem 22.1), with diminishing returns as residuals approach the noise floor.


Musical Connection


From Notation to Codebooks: Two Alphabets for Music

Narration: "These four integers are the 'letters' in the AI's musical alphabet."

Western staff notation is a human-designed alphabet for music: pitch (line/space), duration (note shape), dynamics (marking). It captures what a trained musician needs to reproduce a performance — but discards timbre, room acoustics, and micro-timing.

EnCodec’s RVQ codebook is an AI-discovered alphabet. Its “letters” — four integers per frame — encode everything the decoder needs to reconstruct perceptually faithful audio: pitch, timbre, dynamics, stereo field, room characteristics. But unlike staff notation, the codebook entries have no human-interpretable labels. Entry 1742 in codebook 1 might encode “bright attack with upper harmonics” — or it might encode a pattern that has no name in any human language.

The parallel is striking: both systems solve the same problem (compress music into a finite symbol set) but optimize for different decoders (human performer vs. neural network).

EP23 will probe what these learned letters actually encode.


Limits and Open Questions

  1. Codebook collapse: In practice, many RVQ codebook entries go unused during training. Techniques like exponential moving average updates and codebook reset heuristics mitigate this, but the effective codebook utilization remains below 100%.

  2. Tokenization granularity: EnCodec produces 50 tokens/second. Is this the right temporal resolution for music? Speech models use similar rates, but music has faster transients (drum attacks at sub-millisecond scale) that may be lost.

  3. Cross-modal alignment: Cross-attention assumes text and audio share a meaningful latent geometry. But “Chinese-style” is a high-dimensional cultural concept — how well can a T5 encoder trained on English text capture it?

  4. Guidance scale sensitivity: CFG with $w$ too high produces artifacts; with $w$ too low it ignores the prompt. The optimal $w$ varies by prompt and is typically hand-tuned.

  5. Continuous vs. discrete: MusicGen (discrete) and ACE-Step (continuous) represent fundamentally different philosophical choices. Which is better for music? The answer may depend on the downstream task.

Narration: "Next episode we ask a deeper question: what do the 'letters' in this alphabet actually encode? Tonality? Emotion? Or something humans simply cannot read?"


Academic References

  1. Raffel, C. et al. (2020). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” JMLR 21(140), 1-67.
  2. Kudo, T. & Richardson, J. (2018). “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.” EMNLP 2018.
  3. van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). “Neural Discrete Representation Learning.” NeurIPS 2017.
  4. Defossez, A. et al. (2022). “High Fidelity Neural Audio Compression.” arXiv:2210.13438. — The EnCodec paper.
  5. Zeghidour, N. et al. (2021). “SoundStream: An End-to-End Neural Audio Codec.” IEEE/ACM Trans. Audio, Speech, Lang. Process. 30, 495-507.
  6. Copet, J. et al. (2023). “Simple and Controllable Music Generation.” NeurIPS 2023. — The MusicGen paper; introduces the delay pattern.
  7. Peebles, W. & Xie, S. (2023). “Scalable Diffusion Models with Transformers.” ICCV 2023. — The DiT paper.
  8. Ho, J. & Salimans, T. (2022). “Classifier-Free Diffusion Guidance.” NeurIPS 2021 Workshop.
  9. Ho, J., Jain, A. & Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
  10. Vaswani, A. et al. (2017). “Attention Is All You Need.” NeurIPS 2017. — The Transformer paper; defines scaled dot-product attention.
  11. Agostinelli, A. et al. (2023). “MusicLM: Generating Music From Text.” arXiv:2301.11325.
  12. Huang, R. et al. (2024). “ACE-Step: A Step Towards Music Generation Foundation Model.” arXiv. — Continuous diffusion approach to music generation.
  13. Kong, J., Kim, J. & Bae, J. (2020). “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” NeurIPS 2020.