EP22: How AI Writes Music — EnCodec and RVQ
Overview
Narration: "You type a line into an AI music tool: 'a Chinese-style piano piece.' Ten seconds later, a melody flows out. In those ten seconds, what exactly happens to the data?"
This episode traces the complete data pipeline of a text-to-music system, from the moment you type a text prompt to the moment audio emerges from the speaker. We follow two parallel architectures:
- MusicGen (Meta, 2023): text → T5 encoder → Transformer with delay pattern → discrete tokens → RVQ decode → EnCodec decoder → waveform.
- ACE-Step (2024): text → T5 encoder → DiT diffusion in continuous latent space → DCAE decoder → vocoder → waveform.
The mathematical core: how to compress audio into a small discrete alphabet (EnCodec + RVQ), how to generate sequences from that alphabet conditioned on text (cross-attention + delay pattern), and how diffusion offers a continuous alternative (DiT + classifier-free guidance).
In EP21, we surveyed sixty years of AI composition paradigms — Markov chains, neural sequence models, diffusion. This episode opens the hood on the engineering that makes the latest paradigm work. EP23 will ask what the learned codebook entries actually encode.
Prerequisites
- Markov chains, Transformer attention, score matching / diffusion (EP21)
- Basic linear algebra: matrix multiplication, inner products, argmin
- Familiarity with neural network training (loss functions, gradient descent, backpropagation)
Station 1: Text Encoding (T5)
Narration: "Station one: your text goes into a text encoder called T5. What comes out is not notes, and not a melody: it is a set of hidden-state vectors. Think of it as a 'style consultant.'"
22.1 Tokenization: SentencePiece
Before T5 can process text, the raw string must be broken into subword units. T5 uses SentencePiece (Kudo & Richardson, 2018), a language-independent tokenizer trained on raw text via a unigram language model or BPE. The vocabulary size is 32,000 subword tokens.
Worked example: The prompt “a Chinese-style piano piece” might tokenize as ["▁a", "▁Chinese", "-", "style", "▁piano", "▁piece"] — six tokens, each mapped to an integer index in $\{0, 1, \dots, 31999\}$.
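A minimal sketch of this step with the Hugging Face transformers T5 tokenizer (requires the sentencepiece package; the exact subword split can differ slightly from the example above, depending on the checkpoint):

```python
# Sketch: tokenize a music prompt with T5's SentencePiece vocabulary.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # 32,000-token SentencePiece vocab

prompt = "a Chinese-style piano piece"
pieces = tokenizer.tokenize(prompt)   # subword strings, e.g. ["▁a", "▁Chinese", ...]
ids = tokenizer.encode(prompt)        # integer indices in {0, ..., 31999} (+ EOS)

print(pieces)
print(ids)
```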
22.2 T5 Encoder Architecture
T5 (Raffel et al., 2020) is an encoder-decoder Transformer pre-trained on the C4 corpus (Colossal Clean Crawled Corpus, ~750 GB of English text). MusicGen uses only the encoder half.
Each token index is embedded into $\mathbb{R}^{d}$ (with $d = 768$ for T5-base, $d = 1024$ for T5-large). The encoder applies Transformer blocks (self-attention + feed-forward), producing a sequence of hidden states:

$$H = [h_1, h_2, \dots, h_N] \in \mathbb{R}^{N \times d}.$$
The output is a sequence of vectors, one per subword token. These are not audio features — they are linguistic representations that will later be queried by the audio generator via cross-attention.
Why T5? Its pre-training on a massive text corpus gives it rich semantic representations. The phrase “Chinese-style piano” activates different hidden-state patterns than “jazz drum solo,” encoding genre, instrumentation, and cultural associations — all without any music-specific training.
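To see these hidden states concretely, the encoder half can be run on its own; a minimal sketch using transformers' T5EncoderModel with T5-base ($d = 768$):

```python
# Sketch: run only the T5 encoder and inspect the hidden-state matrix H.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")   # encoder half only, d = 768

inputs = tokenizer("a Chinese-style piano piece", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # shape (1, N, 768)

print(hidden.shape)   # one 768-dim vector per subword token (plus EOS)
```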
Station 2: Audio Compression (EnCodec + RVQ)
Narration: "First, pre-train an encoder: Meta's EnCodec squeezes the raw waveform down layer by layer, compressing every 640 samples into one 128-dimensional vector. A compression ratio of 640 to 1."
22.3 EnCodec Encoder: Strided Convolutions
The EnCodec encoder is a stack of strided convolutional layers. Each layer downsamples by a factor $r_i$. For 32 kHz audio, the strides are $(4, 4, 5, 8)$, giving a total downsampling factor:

$$4 \times 4 \times 5 \times 8 = 640.$$
The encoder maps a window of 640 raw audio samples to a single 128-dimensional latent vector:

$$E: \mathbb{R}^{640} \to \mathbb{R}^{128}, \qquad z_t = E\big(x_{640t}, \dots, x_{640(t+1)-1}\big).$$

At a 32 kHz sample rate, the encoder produces $32{,}000 / 640 = 50$ latent vectors per second of audio.
Compression ratio: 640 samples of 16-bit audio = $640 \times 16 = 10{,}240$ bits. The 128-dimensional continuous vector will be further quantized to roughly $4 \times 11 = 44$ bits (four codebook indices of 11 bits each). That is a compression ratio of approximately $10{,}240 / 44 \approx 233{:}1$.
22.4 The EnCodec Decoder
The decoder mirrors the encoder with transposed convolutions (upsampling by the same factors in reverse order):

$$D: \mathbb{R}^{128} \to \mathbb{R}^{640}, \qquad \big(\hat{x}_{640t}, \dots, \hat{x}_{640(t+1)-1}\big) = D(\hat{z}_t).$$

Each transposed convolution layer upsamples by its stride factor $r_i$, reconstructing the waveform from the latent representation.
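A minimal, untrained sketch of this encoder/decoder pair in PyTorch. The stride order, kernel sizes, and channel widths are illustrative assumptions chosen only to reproduce the 640x downsampling; the real EnCodec adds residual units, an LSTM, and many more channels:

```python
# Sketch: strided Conv1d encoder (640x downsampling) and mirrored transposed-conv decoder.
# Simplified stand-in for EnCodec; strides/kernels/channels are illustrative assumptions.
import torch
import torch.nn as nn

STRIDES = (4, 4, 5, 8)   # product = 640 samples per latent frame
LATENT_DIM = 128

class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        layers, ch = [], 1
        for s in STRIDES:
            layers += [nn.Conv1d(ch, ch * 2, kernel_size=s, stride=s), nn.ELU()]
            ch *= 2
        layers.append(nn.Conv1d(ch, LATENT_DIM, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (batch, 1, samples)
        return self.net(x)       # (batch, 128, samples / 640)

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        ch = 2 ** len(STRIDES)
        layers = [nn.Conv1d(LATENT_DIM, ch, kernel_size=3, padding=1)]
        for s in reversed(STRIDES):
            layers += [nn.ELU(), nn.ConvTranspose1d(ch, ch // 2, kernel_size=s, stride=s)]
            ch //= 2
        layers.append(nn.Conv1d(ch, 1, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, z):        # z: (batch, 128, frames)
        return self.net(z)       # (batch, 1, frames * 640)

x = torch.randn(1, 1, 32000)     # one second of 32 kHz audio
z = TinyEncoder()(x)             # -> (1, 128, 50): 50 latent frames per second
x_hat = TinyDecoder()(z)         # -> (1, 1, 32000): back to the waveform length
print(z.shape, x_hat.shape)
```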
22.5 EnCodec Training Loss
The encoder-decoder pair is trained end-to-end with a multi-component loss:

$$\mathcal{L} = \lambda_t \mathcal{L}_t + \lambda_f \mathcal{L}_f + \lambda_g \mathcal{L}_g + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}} + \lambda_w \mathcal{L}_w.$$
| Term | Formula | Purpose |
|---|---|---|
| $\mathcal{L}_t + \mathcal{L}_f$ | $\lVert x - \hat{x} \rVert_1 + \sum_i \lVert S_i(x) - S_i(\hat{x}) \rVert_1$ | Waveform + spectral fidelity |
| $\mathcal{L}_g$ | $\sum_k \mathbb{E}\big[\max(0,\ 1 - D_k(\hat{x}))\big]$ (GAN discriminator loss) | Perceptual quality |
| $\mathcal{L}_{\text{feat}}$ | $\sum_{k,l} \lVert D_k^{(l)}(x) - D_k^{(l)}(\hat{x}) \rVert_1$ | Feature matching across discriminator layers |
| $\mathcal{L}_w$ | $\lVert z - \mathrm{sg}[q(z)] \rVert_2^2$ | Encoder commits to nearest codebook entry |
Here $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, $S_i$ a mel-spectrogram at scale $i$, and $D_k^{(l)}$ the $l$-th feature map of discriminator $k$.
22.6 Vector Quantization (VQ)
Narration: "Then residual vector quantization turns the continuous vectors into discrete indices: four codebook layers, 2,048 centroids per layer, each layer correcting the previous layer's error."
Vector quantization replaces a continuous vector $z \in \mathbb{R}^d$ with the nearest of $K$ learned centroids: $q(z) = c_{k^*}$, where $k^* = \arg\min_k \lVert z - c_k \rVert$. Only the index $k^*$ is stored.

Worked example (illustrative values): Let $d = 2$, $K = 4$, with centroids $c_1 = (1.0,\, 0.0)$, $c_2 = (0.0,\, 1.0)$, $c_3 = (0.7,\, 0.7)$, $c_4 = (-0.5,\, 0.5)$. For $z = (0.8,\, 0.6)$:

| Centroid | Distance $\lVert z - c_k \rVert$ |
|---|---|
| $c_1 = (1.0,\, 0.0)$ | $\sqrt{0.40} \approx 0.632$ |
| $c_2 = (0.0,\, 1.0)$ | $\sqrt{0.80} \approx 0.894$ |
| $c_3 = (0.7,\, 0.7)$ | $\sqrt{0.02} \approx 0.141$ |
| $c_4 = (-0.5,\, 0.5)$ | $\sqrt{1.70} \approx 1.304$ |

So $q(z) = c_3$, and we store only the index $3$.
The gradient problem: The $\arg\min$ operation is not differentiable. VQ-VAE (van den Oord et al., 2017) solves this with the straight-through estimator: during backpropagation, the gradient of the quantized output $q(z)$ is simply copied to the input $z$:

$$\frac{\partial \mathcal{L}}{\partial z} \approx \frac{\partial \mathcal{L}}{\partial q(z)}.$$

The full VQ-VAE loss decomposes as:

$$\mathcal{L} = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}} + \underbrace{\lVert \mathrm{sg}[z] - q(z) \rVert_2^2}_{\text{codebook}} + \beta\, \underbrace{\lVert z - \mathrm{sg}[q(z)] \rVert_2^2}_{\text{commitment}}.$$
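A minimal sketch of a single VQ layer with the straight-through estimator and the commitment term, loosely following VQ-VAE (the class name and the β value are illustrative):

```python
# Sketch: one vector-quantization layer with a straight-through gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=2048, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)   # K centroids in R^d
        self.beta = beta

    def forward(self, z):                              # z: (batch, dim)
        # Nearest-centroid search (the non-differentiable argmin).
        dists = torch.cdist(z, self.codebook.weight)   # (batch, K)
        idx = dists.argmin(dim=-1)                     # integer codes
        z_q = self.codebook(idx)                       # quantized vectors

        # Codebook loss pulls centroids toward encoder outputs;
        # commitment loss (scaled by beta) pulls encoder outputs toward centroids.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: forward pass uses z_q, backward copies grads to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

vq = VectorQuantizer()
z = torch.randn(8, 128, requires_grad=True)
z_q, idx, loss = vq(z)
loss.backward()                 # gradients reach z despite the argmin
print(idx.shape, z_q.shape)
```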
22.7 Residual Vector Quantization (RVQ)
A single codebook with $K$ centroids in $\mathbb{R}^{128}$ cannot represent the full diversity of audio. Naively increasing $K$ is impractical (nearest-neighbor search cost scales as $O(Kd)$). RVQ solves this by stacking multiple codebooks that successively correct the residual error.

Given $Q$ codebooks $\mathcal{C}_1, \dots, \mathcal{C}_Q$, each with $K$ centroids in $\mathbb{R}^d$, define the RVQ encoding recursively:

$$r_1 = z, \qquad k_q = \arg\min_{k} \lVert r_q - c_{q,k} \rVert, \qquad r_{q+1} = r_q - c_{q,k_q} \quad (q = 1, \dots, Q),$$

where $c_{q,k}$ is the $k$-th centroid of codebook $\mathcal{C}_q$, so layer $q$ quantizes the residual $r_q$ using codebook $\mathcal{C}_q$. The RVQ reconstruction is:

$$\hat{z} = \sum_{q=1}^{Q} c_{q,k_q}.$$
Narration: "In the end, one frame of audio is just four integers. These four integers are the 'letters' of the AI's musical alphabet."
Worked example (RVQ with $Q = 3$, $d = 2$, $K = 16$; illustrative values): Let $z = (0.83,\, -0.41)$.

Layer 1: $r_1 = z = (0.83,\, -0.41)$. Nearest centroid in $\mathcal{C}_1$: $c_{1,7} = (0.80,\, -0.40)$, index 7. Residual: $r_2 = (0.03,\, -0.01)$.

Layer 2: $r_2 = (0.03,\, -0.01)$. Nearest in $\mathcal{C}_2$: $c_{2,12} = (0.025,\, -0.013)$, index 12. Residual: $r_3 = (0.005,\, 0.003)$.

Layer 3: $r_3 = (0.005,\, 0.003)$. Nearest in $\mathcal{C}_3$: $c_{3,0} = (0.004,\, 0.002)$, index 0. Residual: $(0.001,\, 0.001)$.

Reconstruction: $\hat{z} = c_{1,7} + c_{2,12} + c_{3,0} = (0.829,\, -0.411)$. Error: $\lVert z - \hat{z} \rVert \approx 0.0014$, versus $\lVert z \rVert \approx 0.926$ before quantization.

The frame is stored as three integers: $(7,\, 12,\, 0)$.
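The residual stacking itself is only a few lines; a minimal numpy sketch with random (untrained) codebooks, just to show the encode/decode recursion. The shrinking scale of later codebooks is an assumption that mimics how trained residuals cluster near zero:

```python
# Sketch: residual vector quantization with Q stacked (random, untrained) codebooks.
import numpy as np

rng = np.random.default_rng(0)
Q, K, d = 4, 2048, 128
# Later codebooks drawn with smaller scale, mimicking ever-smaller residuals.
codebooks = [rng.normal(scale=1.0 / (2 ** q), size=(K, d)) for q in range(Q)]

def rvq_encode(z, codebooks):
    residual, indices = z.copy(), []
    for C in codebooks:
        k = np.argmin(((residual[None, :] - C) ** 2).sum(axis=1))  # nearest centroid
        indices.append(int(k))
        residual -= C[k]                                            # pass the error down
    return indices, residual

def rvq_decode(indices, codebooks):
    return sum(C[k] for C, k in zip(codebooks, indices))            # sum of centroids

z = rng.normal(size=d)
indices, residual = rvq_encode(z, codebooks)
z_hat = rvq_decode(indices, codebooks)
print(indices)                                               # Q integers for this frame
print(np.linalg.norm(z - z_hat), np.linalg.norm(residual))   # identical reconstruction errors
```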
Theorem 22.1 (adding a layer never increases the error). It suffices to show that adding one more quantization layer does not increase the reconstruction error. After $Q$ layers, the error is the final residual:

$$\lVert r_{Q+1} \rVert = \Big\lVert z - \sum_{q=1}^{Q} c_{q,k_q} \Big\rVert.$$

Layer $Q+1$ quantizes $r_{Q+1}$ to its nearest centroid, so the new error is

$$\lVert r_{Q+2} \rVert = \min_{c \,\in\, \mathcal{C}_{Q+1}} \lVert r_{Q+1} - c \rVert.$$

If the codebook $\mathcal{C}_{Q+1}$ contains the zero vector, then $c = 0$ is one of the candidates in this minimum, and

$$\lVert r_{Q+2} \rVert \le \lVert r_{Q+1} - 0 \rVert = \lVert r_{Q+1} \rVert,$$

so the error cannot grow: the "do nothing" option is always available. In practice, codebooks are learned by running k-means on the residuals of the previous layer; those residuals cluster around zero (each layer corrects an ever-smaller error), so the learned codebooks contain centroids very close to the origin and the bound holds up to a negligible term. Empirically, in trained RVQ systems each added layer shrinks the remaining error roughly geometrically.
Effective codebook size: With $Q$ layers of $K$ centroids each, the effective number of representable vectors is $K^Q$ (for $Q = 4$, $K = 2048$: $2048^4 \approx 1.8 \times 10^{13}$) — far larger than any single codebook could achieve.

Bit rate: Each frame requires $Q \log_2 K = 4 \times 11 = 44$ bits. At 50 frames/second, the total bitrate is $50 \times 44 = 2{,}200$ bits/s $= 2.2$ kbps. Compare: uncompressed 32 kHz 16-bit mono audio is $32{,}000 \times 16 = 512$ kbps.
Station 3: Token Generation (Delay Pattern + Cross-Attention)
Narration: "Station three: the Transformer starts generating. But how can four codebook layers be handled at once? The key innovation is called the delay pattern: instead of concatenating the four codebooks into one extremely long sequence, they are staggered by one step."
22.8 The Flattening Problem
After RVQ encoding, each time frame $t$ is represented by $Q = 4$ integer tokens $(k_{1,t}, \dots, k_{4,t})$. For $T$ frames, the naive approach is to flatten all tokens into a single sequence of length $QT$. At 50 frames/second for 30 seconds of audio, this gives $4 \times 1500 = 6000$ tokens — manageable, but with a fundamental problem: the Transformer’s self-attention has $O(L^2)$ complexity, and the flattened ordering imposes artificial sequential dependencies between codebook layers.
22.9 The Delay Pattern
In the delay pattern (Copet et al., 2023), codebook $q$ is offset by $q - 1$ timesteps. At generation step $s$, the model predicts in parallel:

$$\big(k_{1,s},\ k_{2,s-1},\ k_{3,s-2},\ k_{4,s-3}\big),$$

i.e., codebook $q$'s token for frame $s - (q - 1)$.

The causal attention mask allows the token at position (codebook $q$, time $t$) to attend to all tokens at positions (codebook $q'$, time $t'$) satisfying:

$$t' + (q' - 1) < t + (q - 1),$$

that is, all tokens produced at strictly earlier generation steps.

Why this works: At generation step $s$, the model generates $k_{1,s}$ (the coarsest new token) using only past information. Simultaneously, it generates $k_{2,s-1}$ — the second-layer refinement of the previous frame — which can now condition on $k_{1,s-1}$ (already generated one step ago). The offset ensures each codebook layer can see the coarser layers of the same frame.
Comparison:
| Method | Sequence length | Steps for $T$ frames | Quality |
|---|---|---|---|
| Flat (concatenate) | $QT$ | $QT$ | Baseline |
| Parallel (all at once) | $T$ | $T$ | Lower (no inter-codebook dependencies) |
| Delay pattern | $T + Q - 1$ | $T + Q - 1$ | Best (captures inter-codebook structure) |
For $Q = 4$, $T = 1500$: flat requires 6000 steps; the delay pattern requires $1500 + 3 = 1503$ — a roughly $4\times$ speedup with no quality loss.
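A minimal sketch of applying and undoing the delay pattern on a $(Q, T)$ grid of token indices. The PAD value is a placeholder assumption for positions that have no real token yet (a real model uses a dedicated pad token id):

```python
# Sketch: stagger codebook rows by their layer index (the delay pattern) and undo it.
import numpy as np

PAD = -1  # placeholder for "no token yet"

def apply_delay(tokens):                    # tokens: (Q, T) integer array
    Q, T = tokens.shape
    out = np.full((Q, T + Q - 1), PAD, dtype=tokens.dtype)
    for q in range(Q):
        out[q, q : q + T] = tokens[q]       # codebook q is shifted right by q steps
    return out                              # (Q, T + Q - 1): one column per generation step

def undo_delay(delayed, T):
    Q = delayed.shape[0]
    return np.stack([delayed[q, q : q + T] for q in range(Q)])

tokens = np.arange(4 * 6).reshape(4, 6)     # toy example: Q=4 codebooks, T=6 frames
delayed = apply_delay(tokens)
assert (undo_delay(delayed, 6) == tokens).all()
print(delayed)   # column s holds (k1[s], k2[s-1], k3[s-2], k4[s-3]); PAD where undefined
```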
We must verify that the dependency structure induced by the delay pattern contains no cycles, so that it defines a valid generation order. Define a directed graph with one vertex per token position $(q, t)$ and an edge $(q', t') \to (q, t)$ whenever $(q', t')$ is in the attention set of $(q, t)$.

Assign each vertex the score $s(q, t) = t + (q - 1)$, the generation step at which that token is produced. By the masking rule above, an edge $(q', t') \to (q, t)$ exists only if $s(q', t') < s(q, t)$.

Every dependency edge therefore goes from a strictly smaller score to a strictly larger one. A cycle would force the score to increase strictly around a closed loop, which is impossible, so the graph is a DAG. Therefore the factorization is a valid autoregressive decomposition.
22.10 Cross-Attention: Text Conditions the Audio
Narration: "Each time it generates a column of tokens, the Transformer looks back at the text hidden states parked at station one, via cross-attention. The audio tokens act as queries; the text vectors act as keys and values. That is how the 'Chinese-style' information gets injected, step by step. Note: this is consultation, not translation."
Key distinction from self-attention: In self-attention, queries, keys, and values all come from the same sequence. In cross-attention, queries come from one modality (audio) while keys and values come from another (text). The audio “asks questions” and the text “provides answers.”
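A minimal sketch of one cross-attention step using scaled dot-product attention (Vaswani et al., 2017), with audio hidden states as queries and T5 text states as keys and values; the dimensions and single-head layout are illustrative assumptions:

```python
# Sketch: cross-attention, where audio queries attend over text keys/values.
import torch
import torch.nn.functional as F

d_audio, d_text, d_attn = 1024, 1024, 64          # illustrative dimensions
W_q = torch.nn.Linear(d_audio, d_attn, bias=False)
W_k = torch.nn.Linear(d_text, d_attn, bias=False)
W_v = torch.nn.Linear(d_text, d_attn, bias=False)

audio_h = torch.randn(1, 503, d_audio)            # one vector per generation step
text_h = torch.randn(1, 6, d_text)                # T5 hidden states, one per subword

Q, K, V = W_q(audio_h), W_k(text_h), W_v(text_h)
scores = Q @ K.transpose(-1, -2) / d_attn ** 0.5  # (1, 503, 6): each audio step vs each word
weights = F.softmax(scores, dim=-1)               # how much each step "consults" each word
context = weights @ V                             # (1, 503, 64): text information injected
print(weights.shape, context.shape)
```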
Alternative Path: Diffusion (DiT + CFG)
Narration: "ACE-Step goes in a completely different direction: it skips the alphabet and never discretizes. The starting point is a blob of pure noise."
22.11 DiT: Diffusion Transformer
The Diffusion Transformer (Peebles & Xie, 2023) replaces the U-Net backbone of standard diffusion models with a Transformer operating on patches of the continuous latent representation.
For ACE-Step, the input is a continuous latent $z \in \mathbb{R}^{C \times T \times F}$ (channels × time × frequency), not discrete tokens. The pipeline:
- Patchify: Divide $z$ into non-overlapping patches, embed each as a vector.
- Transformer blocks: Self-attention and feed-forward layers with adaptive layer normalization (adaLN) conditioned on the diffusion timestep and text embedding.
- Unpatchify: Reshape the output back to $C \times T \times F$.
The adaLN conditioning works by predicting scale and shift parameters from the conditioning signal $c$ (timestep embedding + text embedding):

$$\mathrm{adaLN}(h;\, c) = \gamma(c) \odot \frac{h - \mu(h)}{\sigma(h)} + \beta(c),$$

where $\gamma$ and $\beta$ are small networks and $\mu(h)$, $\sigma(h)$ are the usual layer-norm statistics.
This is fundamentally different from MusicGen: no codebook, no VQ, no discrete tokens. The entire generation happens in continuous space.
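A minimal sketch of adaLN conditioning inside one Transformer block, where a linear layer predicts per-channel scale and shift from the timestep-plus-text embedding. Widths are illustrative, and the real DiT adaLN-Zero variant additionally predicts gating coefficients:

```python
# Sketch: adaptive LayerNorm (adaLN) conditioning in a DiT-style block.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim=768, n_heads=12, cond_dim=768):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)   # no learned affine here
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning layer predicts (scale, shift) for the two normalizations.
        self.to_scale_shift = nn.Linear(cond_dim, 4 * dim)

    def forward(self, x, cond):                     # x: (B, patches, dim), cond: (B, cond_dim)
        g1, b1, g2, b2 = self.to_scale_shift(cond).unsqueeze(1).chunk(4, dim=-1)
        h = (1 + g1) * self.norm(x) + b1            # adaLN before self-attention
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = (1 + g2) * self.norm(x) + b2            # adaLN before the feed-forward MLP
        return x + self.mlp(h)

block = AdaLNBlock()
patches = torch.randn(2, 256, 768)                  # patchified latent
cond = torch.randn(2, 768)                          # timestep embedding + text embedding
print(block(patches, cond).shape)                   # (2, 256, 768)
```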
22.12 The Forward Diffusion Process
Narration: "The forward noising process of a classic diffusion model such as DDPM... is simply a Markov chain. The first AI-composition paradigm from the last episode, sixty-year-old mathematics, hides at the heart of modern generative models."
As discussed in EP21, the forward process of DDPM is a Markov chain that gradually adds Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big).$$

The reverse process (denoising) is learned by a neural network $\epsilon_\theta(x_t, t, c)$ that predicts the noise added at each step. DiT uses a Transformer as the architecture for $\epsilon_\theta$.
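The forward chain has a well-known closed form, $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$, which lets $x_t$ be sampled in one shot. A minimal sketch, assuming a linear β schedule for illustration:

```python
# Sketch: forward (noising) process of DDPM with a linear beta schedule.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating the Markov chain."""
    noise = torch.randn_like(x0) if noise is None else noise
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise

x0 = torch.randn(8, 128, 64)                 # a batch of continuous audio latents
x_mid = q_sample(x0, t=500)                  # partially noised
x_end = q_sample(x0, t=T - 1)                # nearly pure Gaussian noise
print(x_mid.std().item(), x_end.std().item())
```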
22.13 Classifier-Free Guidance (CFG)
Classifier-free guidance (Ho & Salimans, 2022) trains a single model with the text condition randomly dropped, then blends the conditional and unconditional noise predictions at sampling time:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big),$$

where $w$ is the guidance scale and $\varnothing$ denotes the empty condition. Interpretation by cases:
| $w$ | Behavior |
|---|---|
| $w = 1$ | Pure conditional generation: $\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, c)$ |
| $w > 1$ | Amplified conditioning — moves away from the unconditional prediction |
| $w \gg 1$ | Extreme adherence to text (often causes artifacts) |
By Bayes' rule:

$$\log p(c \mid x_t) = \log p(x_t \mid c) - \log p(x_t) + \log p(c).$$

Taking the gradient with respect to $x_t$ ($\log p(c)$ is constant):

$$\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t).$$

Since the noise prediction satisfies $\epsilon_\theta(x_t, c) \approx -\sigma_t \nabla_{x_t} \log p(x_t \mid c)$ and $\epsilon_\theta(x_t, \varnothing) \approx -\sigma_t \nabla_{x_t} \log p(x_t)$, substituting into the CFG rule gives

$$\tilde{\epsilon}_\theta(x_t, c) \approx -\sigma_t \Big( \nabla_{x_t} \log p(x_t) + w\, \nabla_{x_t} \log p(c \mid x_t) \Big).$$

Therefore the guided prediction follows the score of a sharpened distribution proportional to $p(x_t)\, p(c \mid x_t)^{w}$.

The guidance scale $w$ controls how strongly the model steers toward outputs that “look like” they were generated from condition $c$, as measured by the implicit classifier $p(c \mid x_t)$.
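At inference time the guidance rule itself is one line; a minimal sketch, with `eps_model` standing in for any trained noise predictor that accepts an optional condition (the toy model below exists only to make the snippet runnable):

```python
# Sketch: classifier-free guidance blends conditional and unconditional predictions.
import torch

def cfg_epsilon(eps_model, x_t, t, cond, w):
    """eps_model(x_t, t, cond) -> predicted noise; cond=None means unconditional."""
    eps_uncond = eps_model(x_t, t, None)          # prompt dropped
    eps_cond = eps_model(x_t, t, cond)            # prompt injected via cross-attention
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-in for a trained noise predictor.
def toy_eps_model(x_t, t, cond):
    return x_t * 0.1 if cond is None else x_t * 0.1 + 0.05

x_t = torch.randn(1, 8, 128, 64)
eps = cfg_epsilon(toy_eps_model, x_t, t=500, cond="a Chinese-style piano piece", w=3.0)
print(eps.shape)
```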
Station 4: Decoding
Narration: "On the MusicGen side: the integer tokens go back through the RVQ lookup tables to become continuous vectors, then the EnCodec decoder inflates them back into a waveform: 640 samples reconstructed from each vector."
22.14 MusicGen Decoding Pipeline
The decoding reverses the encoding:
- RVQ lookup: For each frame, retrieve the centroid vectors from each codebook: $c_{1,k_1}, c_{2,k_2}, \dots, c_{Q,k_Q}$, where $k_q$ is the predicted token index for codebook $q$.
- Sum: $\hat{z} = \sum_{q=1}^{Q} c_{q,k_q}$.
- EnCodec decoder: Transposed convolutions expand each $\hat{z}_t$ back to 640 samples: $D(\hat{z}_t)$.
- Overlap-add: Adjacent frames are combined with windowing to produce the continuous waveform.
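End to end, this pipeline is wrapped by off-the-shelf libraries; a sketch using the Hugging Face transformers MusicGen integration. The model name and generation parameters are assumptions for illustration, and the exact API surface may differ by library version:

```python
# Sketch: text -> tokens -> EnCodec decode, via the Hugging Face MusicGen wrapper.
# Assumes `transformers` with the MusicGen integration; names/params are illustrative.
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["a Chinese-style piano piece"], return_tensors="pt")
# Each generated step corresponds to one 50 Hz frame (delay pattern handled internally).
audio = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256)
print(audio.shape)   # (batch, channels, samples) at the model's 32 kHz sample rate
```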
22.15 ACE-Step Decoding Pipeline
Narration: "On the ACE-Step side: the denoised latent representation first passes through the DCAE decoder to become a spectrogram, and only then through a vocoder does it become a waveform."
- DCAE decoder: The denoised continuous latent is decoded to a mel-spectrogram.
- Vocoder (e.g., HiFi-GAN): Converts the mel-spectrogram to a time-domain waveform via learned upsampling convolutions.
The two-stage decode (latent → spectrogram → waveform) adds latency but avoids the information bottleneck of discrete tokenization.
Numerical Examples
Complete MusicGen Pipeline for a 10-Second Clip
| Stage | Input | Output | Dimensions |
|---|---|---|---|
| Text tokenization | “a Chinese-style piano piece” | Token indices | 6 integers |
| T5 encoder | 6 token indices | Hidden states | $6 \times 1024$ (T5-large) |
| Audio encoding (EnCodec) | 320,000 samples (10 s @ 32 kHz) | Latent vectors | $500 \times 128$ |
| RVQ | 500 latent vectors | Token indices | $500 \times 4 = 2{,}000$ integers |
| Transformer generation | T5 states + past tokens | Next tokens | $500 + 3 = 503$ steps (delay pattern) |
| RVQ decode | 2,000 indices | Latent vectors | $500 \times 128$ |
| EnCodec decode | 500 latent vectors | Waveform | 320,000 samples |
Storage Comparison
| Representation | Size for 10s audio |
|---|---|
| Raw 32kHz 16-bit | 640,000 bytes |
| EnCodec continuous latent | 256,000 bytes (500 × 128 × 4 bytes) |
| RVQ tokens (4 codebooks) | 2,000 integers = ~2,750 bytes |
| Compression ratio (raw to RVQ) | ~233:1 |
RVQ Reconstruction Quality by Layer Count
| Layers (Q) | Bits/frame | Bitrate | Typical SI-SDR (dB) |
|---|---|---|---|
| 1 | 11 | 550 bps | ~5 |
| 2 | 22 | 1,100 bps | ~10 |
| 4 | 44 | 2,200 bps | ~15 |
| 8 | 88 | 4,400 bps | ~20 |
Each additional layer roughly halves the remaining error in practice (Theorem 22.1 guarantees it never increases), with diminishing returns as residuals approach the noise floor.
Musical Connection
From Notation to Codebooks: Two Alphabets for Music
Narration: "These four integers are the 'letters' of the AI's musical alphabet."
Western staff notation is a human-designed alphabet for music: pitch (line/space), duration (note shape), dynamics (marking). It captures what a trained musician needs to reproduce a performance — but discards timbre, room acoustics, and micro-timing.
EnCodec’s RVQ codebook is an AI-discovered alphabet. Its “letters” — four integers per frame — encode everything the decoder needs to reconstruct perceptually faithful audio: pitch, timbre, dynamics, stereo field, room characteristics. But unlike staff notation, the codebook entries have no human-interpretable labels. Entry 1742 in codebook 1 might encode “bright attack with upper harmonics” — or it might encode a pattern that has no name in any human language.
The parallel is striking: both systems solve the same problem (compress music into a finite symbol set) but optimize for different decoders (human performer vs. neural network).
EP23 will probe what these learned letters actually encode.
Limits and Open Questions
- Codebook collapse: In practice, many RVQ codebook entries go unused during training. Techniques like exponential moving average updates and codebook reset heuristics mitigate this, but the effective codebook utilization remains below 100%.
- Tokenization granularity: EnCodec produces 50 tokens/second. Is this the right temporal resolution for music? Speech models use similar rates, but music has faster transients (drum attacks at sub-millisecond scale) that may be lost.
- Cross-modal alignment: Cross-attention assumes text and audio share a meaningful latent geometry. But “Chinese-style” is a high-dimensional cultural concept — how well can a T5 encoder trained on English text capture it?
- Guidance scale sensitivity: CFG with $w$ too high produces artifacts; too low ignores the prompt. The optimal $w$ varies by prompt and is typically hand-tuned.
- Continuous vs. discrete: MusicGen (discrete) and ACE-Step (continuous) represent fundamentally different philosophical choices. Which is better for music? The answer may depend on the downstream task.
Narration: "Next episode we ask a deeper question: what do the 'letters' of this alphabet actually encode? Tonality? Emotion? Or something humans simply cannot read?"
Academic References
- Raffel, C. et al. (2020). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” JMLR 21(140), 1-67.
- Kudo, T. & Richardson, J. (2018). “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.” EMNLP 2018.
- van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). “Neural Discrete Representation Learning.” NeurIPS 2017.
- Defossez, A. et al. (2022). “High Fidelity Neural Audio Compression.” arXiv:2210.13438. — The EnCodec paper.
- Zeghidour, N. et al. (2021). “SoundStream: An End-to-End Neural Audio Codec.” IEEE/ACM Trans. Audio, Speech, Lang. Process. 30, 495-507.
- Copet, J. et al. (2023). “Simple and Controllable Music Generation.” NeurIPS 2023. — The MusicGen paper; introduces the delay pattern.
- Peebles, W. & Xie, S. (2023). “Scalable Diffusion Models with Transformers.” ICCV 2023. — The DiT paper.
- Ho, J. & Salimans, T. (2021). “Classifier-Free Diffusion Guidance.” NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
- Ho, J., Jain, A. & Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
- Vaswani, A. et al. (2017). “Attention Is All You Need.” NeurIPS 2017. — The Transformer paper; defines scaled dot-product attention.
- Agostinelli, A. et al. (2023). “MusicLM: Generating Music From Text.” arXiv:2301.11325.
- Huang, R. et al. (2024). “ACE-Step: A Step Towards Music Generation Foundation Model.” arXiv. — Continuous diffusion approach to music generation.
- Kong, J., Kim, J. & Bae, J. (2020). “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” NeurIPS 2020.