
EP40: What MP3 Deletes — Psychoacoustics and MDCT

Psychoacoustic masking (Bark scale), MDCT kernel, TDAC theorem, Huffman coding

Signal Processing, Harmonic Analysis

Prerequisite episodes

EP07 Information Entropy and All-Interval Rows EP22 How AI Writes Music — EnCodec and RVQ EP36 Quanta of Sound — Gabor and the Uncertainty Principle

Overview

Every MP3 file you own was processed through a mathematical black box. That black box is a perceptual audio codec — a lossy compressor that deliberately discards information. Yet a well-encoded 128 kbps MP3 is largely indistinguishable from its source. How?

The answer is three interlocking layers of mathematics. The first layer is psychoacoustics: the human auditory system cannot perceive all frequency components equally, so any component below the absolute hearing threshold or masked by a louder neighbor can be deleted at zero perceptual cost. The second layer is the Modified Discrete Cosine Transform (MDCT): a cleverly designed 50%-overlapping block transform that maps $2N$ time-domain samples to $N$ frequency-domain coefficients without redundancy, while the Time-Domain Aliasing Cancellation (TDAC) theorem guarantees perfect reconstruction. The third layer is bit allocation and Huffman coding: bits are concentrated where the signal-to-mask ratio (SMR) is highest, and the resulting quantized coefficients are entropy-coded for a final lossless squeeze.

This episode is the finale of the 13-episode audio signal chain arc (EP28–EP40).

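Narration (translated from the Chinese): “Every MP3 stored on your phone has passed through a black box. That black box is called a perceptual audio codec. What it does is not lossless compression: it is lossy, and it actively deletes information. Yet you can hardly hear the difference.”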

Prerequisites

Shannon entropy and Huffman coding (EP07) — source coding theorem, Huffman tree construction; the bit-allocation logic in MP3 is a direct application of rate-distortion theory
EnCodec neural audio codec (EP22) — EnCodec learns a neural analogue of the psychoacoustic quantization pipeline described here
Gabor uncertainty principle (EP36) — the tension between time resolution and frequency resolution underlies MP3’s adaptive block switching between long frames (576 samples) and short frames (192 samples)
Discrete Fourier Transform and inner-product transform theory (EP30–EP32)
Basic probability and logarithms (for the entropy coding section)

Definitions

Def 40.1 (Absolute Hearing Threshold)

The absolute hearing threshold $T_q(f)$ (in dB SPL) is the minimum sound pressure level audible to a normal human ear in a quiet environment at frequency $f$ (Hz). A widely used closed-form approximation, due to Terhardt, is

$T_q(f) \approx 3.64\!\left(\tfrac{f}{1000}\right)^{-0.8} - 6.5\,e^{-0.6\!\left(\tfrac{f}{1000}-3.3\right)^{\!2}} + 10^{-3}\!\left(\tfrac{f}{1000}\right)^4 \quad \text{(dB SPL).}$

Any spectral component whose level falls below $T_q(f)$ is inaudible in quiet and can be deleted without any perceptible change: a free bit saving.
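The formula in Definition 40.1 is easy to evaluate directly. The following is a minimal sketch; the function name `absolute_threshold_db` is illustrative and not part of any codec API.

```python
import math

def absolute_threshold_db(f_hz: float) -> float:
    """Approximate threshold in quiet (dB SPL) at f_hz, per Definition 40.1."""
    k = f_hz / 1000.0  # the formula is written in kHz
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)

# The curve is high at the extremes and dips to its minimum near 3.3 kHz.
for f in (50, 1000, 3300, 16000):
    print(f"{f:>6} Hz -> {absolute_threshold_db(f):6.1f} dB SPL")
```

The three terms correspond to the low-frequency rise, the dip of maximum sensitivity around 3.3 kHz, and the steep high-frequency rise.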

Def 40.2 (Bark Scale)

The Bark scale maps physical frequency $f$ (Hz) to a perceptual frequency unit (Bark) that reflects the nonlinear spacing of the cochlea’s critical bands:

$z(f) = 13\arctan(0.00076 f) + 3.5\arctan\!\left[\left(\tfrac{f}{7500}\right)^{2}\right].$

The Bark scale is dense at low frequencies and sparse at high frequencies. The audible range (20 Hz – 20 kHz) spans approximately 24 Bark critical bands, each corresponding roughly to the frequency resolution of the basilar membrane at that location.
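As a quick check of Definition 40.2, the sketch below evaluates the Bark mapping at a few frequencies. The name `hz_to_bark` is illustrative; note that the square applies to $f/7500$ inside the second arctangent.

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Bark position of a frequency in Hz, per Definition 40.2."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

# Doubling the frequency adds fewer and fewer Bark as f grows.
for f in (100, 1000, 2000, 4000, 20000):
    print(f"{f:>6} Hz -> {hz_to_bark(f):5.2f} Bark")
```

The top of the audible range lands between 24 and 25 Bark, in line with the roughly 24 critical bands quoted above.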

Def 40.3 (Simultaneous Masking Threshold)

Given a masking tone at Bark position $z_0$ with level $L_0$ dB SPL, the simultaneous masking threshold at Bark position $z$ is approximated by the spreading function

$T_{\mathrm{mask}}(z) \approx \begin{cases} L_0 - 25|z - z_0| & z < z_0 \\ L_0 - 10(z - z_0) & z \ge z_0. \end{cases}$

The asymmetry (25 dB/Bark downward vs. 10 dB/Bark upward) is called upward spread of masking: louder tones mask weaker tones at higher frequencies more effectively than lower ones. A weaker tone is masked if its level is below $T_{\mathrm{mask}}(z)$ at its Bark position.
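The two-slope spreading function of Definition 40.3 can be sketched in a few lines. The function names and the example masker (70 dB SPL at 10 Bark) are illustrative assumptions, not values from a real encoder.

```python
def masking_threshold_db(z: float, z0: float, level_db: float) -> float:
    """Simultaneous masking threshold at Bark position z for a masker at z0."""
    if z < z0:
        return level_db - 25.0 * (z0 - z)  # steep slope below the masker
    return level_db - 10.0 * (z - z0)      # shallow slope above the masker

def is_masked(probe_db: float, z: float, z0: float, level_db: float) -> bool:
    """True if a probe tone at Bark position z lies below the masking threshold."""
    return probe_db < masking_threshold_db(z, z0, level_db)

# Same 2-Bark distance, very different thresholds on the two sides.
print(masking_threshold_db(12.0, 10.0, 70.0))  # above the masker
print(masking_threshold_db(8.0, 10.0, 70.0))   # below the masker
```

A probe two Bark above the masker must exceed a much higher threshold than one two Bark below it, which is the upward spread of masking in numeric form.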

Def 40.4 (Temporal Masking)

Temporal masking refers to the elevation of the hearing threshold in a time window around a loud transient:

Pre-masking (backward masking): threshold elevated for approximately $20$ ms before the masker onset.
Post-masking (forward masking): threshold elevated for approximately $200$ ms after the masker offset.

MP3 exploits both forms. Pre-masking is the lesser effect; post-masking is longer and stronger because the cochlear mechanical response decays gradually.

Def 40.5 (Signal-to-Mask Ratio)

For a given frequency subband, the signal-to-mask ratio (SMR) is

$\mathrm{SMR} = E_{\mathrm{signal}} - T_{\mathrm{combined}} \quad \text{(dB),}$

where $E_{\mathrm{signal}}$ is the subband signal energy (dB) and $T_{\mathrm{combined}}$ is the combined masking threshold (pointwise maximum of absolute threshold and all simultaneous masking contributions, in dB).

$\mathrm{SMR} > 0$ : the signal is audible above its masking threshold — allocate bits to encode it accurately.
$\mathrm{SMR} \le 0$ : the signal is fully masked — allocate zero bits; the subband is discarded entirely.
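The SMR decision rule above reduces to a maximum and a subtraction. This sketch assumes all levels are already expressed in dB for one subband; the function name and the example numbers are hypothetical.

```python
def signal_to_mask_ratio(signal_db, threshold_quiet_db, masker_thresholds_db):
    """SMR in dB: signal level minus the combined (pointwise maximum) threshold."""
    combined = max([threshold_quiet_db, *masker_thresholds_db])
    return signal_db - combined

def needs_bits(smr_db: float) -> bool:
    """Definition 40.5 policy: only subbands with positive SMR receive bits."""
    return smr_db > 0.0

# One subband, one masker contributing a 14 dB threshold, quiet threshold -0.3 dB.
for level in (20.0, 10.0):
    smr = signal_to_mask_ratio(level, -0.3, [14.0])
    print(level, smr, needs_bits(smr))
```

With no maskers the combined threshold falls back to the threshold in quiet, so the same function covers the absolute-threshold case.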

Def 40.6 (MDCT (Modified Discrete Cosine Transform))

Given a windowed block of $2N$ real samples $x[n]$ , $n = 0, 1, \ldots, 2N-1$ , the MDCT of length $N$ produces $N$ frequency coefficients

$X[k] = \sum_{n=0}^{2N-1} x[n]\,w[n]\cos\!\left[\frac{\pi}{N}\!\left(n + \frac{N}{2} + \frac{1}{2}\right)\!\left(k + \frac{1}{2}\right)\right], \quad k = 0, 1, \ldots, N-1,$

where $w[n]$ is an analysis window satisfying the Princen-Bradley condition (Definition 40.7).

In MP3: $N = 18$ for long blocks ( $2N = 36$ samples per subband per granule, giving 576 coefficients total across 32 subbands); $N = 6$ for short blocks ( $2N = 12$ samples, three windows per granule).
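The sum in Definition 40.6 can be evaluated literally. The sketch below is a direct $O(N^2)$ implementation for illustration only (real encoders use fast algorithms); the test input is an arbitrary cosine, and `sine_window` is the window named in Definition 40.7.

```python
import math

def sine_window(n_half):
    """Sine window of length 2N."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def mdct(block, window):
    """Direct evaluation of the MDCT sum for one 2N-sample block."""
    two_n = len(block)
    if two_n % 2 != 0 or len(window) != two_n:
        raise ValueError("block and window must share an even length 2N")
    n_half = two_n // 2
    shift = n_half / 2 + 0.5  # the (N/2 + 1/2) phase term of the kernel
    return [
        sum(block[n] * window[n]
            * math.cos(math.pi / n_half * (n + shift) * (k + 0.5))
            for n in range(two_n))
        for k in range(n_half)
    ]

# One MP3 long block: 36 subband samples in, 18 coefficients out.
block = [math.cos(2 * math.pi * 3 * n / 36) for n in range(36)]
coeffs = mdct(block, sine_window(18))
print(len(coeffs))
```

Because $2N$ inputs map to $N$ outputs, some nonzero blocks transform to all zeros; a single block is therefore not invertible on its own, which is what the overlap-add of Theorem 40.1 repairs.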

Def 40.7 (Princen-Bradley Window Condition)

A real-valued window $w[n]$ satisfies the Princen-Bradley condition if

$w^2[n] + w^2[n + N] = 1, \quad n = 0, 1, \ldots, N-1.$

The canonical choice in MP3 is the sine window:

$w[n] = \sin\!\left(\frac{\pi}{2N}\!\left(n + \frac{1}{2}\right)\right), \quad n = 0, 1, \ldots, 2N-1.$

Verification: $\sin^2\!\left(\frac{\pi(n+\frac{1}{2})}{2N}\right) + \sin^2\!\left(\frac{\pi(n+N+\frac{1}{2})}{2N}\right) = \sin^2\theta + \cos^2\theta = 1$ , where $\theta = \frac{\pi(n+\frac{1}{2})}{2N}$ .

Def 40.8 (MDCT Inversion (IMDCT))

The inverse MDCT reconstructs $2N$ time-domain samples from $N$ frequency coefficients:

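$\tilde{x}[n] = \frac{2}{N}\sum_{k=0}^{N-1} X[k]\cos\!\left[\frac{\pi}{N}\!\left(n + \frac{N}{2} + \frac{1}{2}\right)\!\left(k + \frac{1}{2}\right)\right], \quad n = 0,\ldots,2N-1.$

The factor $2/N$ (rather than the $1/N$ used for the unwindowed transform) compensates for the Princen-Bradley normalization $w^2[n] + w^2[n+N] = 1$; with $1/N$ the overlap-add output would be $x[n]/2$. The IMDCT output is not a direct reconstruction of $x[n]$; it contains time-domain aliasing artifacts. These artifacts are cancelled by the overlap-add step described in Theorem 40.1.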

Main Theorems

Theorem 40.1 (TDAC Perfect Reconstruction)

Let $\{x[n]\}$ be a discrete signal sampled at a fixed rate. Partition it into consecutive frames of step $N$, so that the $m$-th analysis frame uses samples $\{x[mN + n]\}_{n=0}^{2N-1}$ (a 50% overlap). Apply the MDCT (Definition 40.6) with a symmetric window $w[n] = w[2N-1-n]$ satisfying the Princen-Bradley condition (Definition 40.7) to obtain coefficients $X_m[k]$.

Leave the coefficients $X_m[k]$ unchanged (the lossless case; any quantization of $X_m[k]$ introduces error that this theorem does not cover) and apply the IMDCT (Definition 40.8) to each frame to obtain $\tilde{x}_m[n]$, $n = 0,\ldots,2N-1$. Form the overlap-add output:

$y[mN + n] = \tilde{x}_m[n] \cdot w[n] + \tilde{x}_{m-1}[n + N] \cdot w[n + N], \quad n = 0,\ldots,N-1.$

Then $y[n] = x[n]$ for every sample covered by two consecutive frames: perfect reconstruction without time-domain aliasing.

Proof.

Fix a sample at position $mN + n$ (with $0 \le n < N$ ). It appears in two overlapping analysis frames:

Frame $m$ : at relative index $n$ (in the first half of the frame).
Frame $m-1$ : at relative index $n + N$ (in the second half of the previous frame).

Write $x' = x[mN + N - 1 - n]$ for the sample mirrored about the centre of the overlap region. The cosine kernel is odd about $n = (N-1)/2$ on the first half of a frame and even about $n = (3N-1)/2$ on the second half, so the MDCT cannot distinguish a sample from its mirror image. With the IMDCT normalised by $2/N$, expanding the MDCT/IMDCT pair for frame $m$ gives

$\tilde{x}_m[n] = w[n]\,x[mN + n] + A_m[n], \qquad A_m[n] = -\,w[N-1-n]\,x',$

where $A_m[n]$ is the time-domain aliasing term (a folded-over contribution from the mirrored sample). The same expansion for the second half of frame $m-1$ gives

$\tilde{x}_{m-1}[n + N] = w[n+N]\,x[mN + n] + w[2N-1-n]\,x'.$

The two aliasing terms involve the same sample $x'$ with opposite signs. The overlap-add step forms

$y[mN + n] = w[n]\cdot\tilde{x}_m[n] + w[n+N]\cdot\tilde{x}_{m-1}[n+N]$

$= \bigl(w^2[n] + w^2[n+N]\bigr)\,x[mN+n] + \bigl(w[n+N]\,w[2N-1-n] - w[n]\,w[N-1-n]\bigr)\,x'.$

By the Princen-Bradley condition, $w^2[n] + w^2[n+N] = 1$, so the coefficient of $x[mN+n]$ is 1. For a symmetric window, $w[2N-1-n] = w[n]$ and $w[n+N] = w[N-1-n]$, so the coefficient of $x'$ is

$w[N-1-n]\,w[n] - w[n]\,w[N-1-n] = 0.$

The synthesis window therefore weights the two aliasing terms so that they cancel exactly; this is the Time-Domain Aliasing Cancellation property. Note that the cancellation relies on the symmetry of $w$, not on $w[n]$ and $w[n+N]$ being equal (for the sine window they are $\sin\theta$ and $\cos\theta$, which differ in general). The net aliasing contribution is zero, giving

$y[mN + n] = x[mN + n]. \quad \square$
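The cancellation can be checked numerically. The sketch below runs MDCT, IMDCT and windowed overlap-add on a random signal with the sine window. It assumes an IMDCT scale of $2/N$, which is what exact reconstruction requires when $w^2[n] + w^2[n+N] = 1$; with $1/N$ the same code returns $x/2$. All function names are illustrative, and the zero padding is only there so that every sample is covered by two frames.

```python
import math
import random

def sine_window(n_half):
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def kernel(n, k, n_half):
    """Cosine kernel shared by the MDCT and the IMDCT."""
    return math.cos(math.pi / n_half * (n + n_half / 2 + 0.5) * (k + 0.5))

def mdct(block, window):
    n_half = len(block) // 2
    return [sum(block[n] * window[n] * kernel(n, k, n_half) for n in range(2 * n_half))
            for k in range(n_half)]

def imdct(coeffs, scale):
    n_half = len(coeffs)
    return [scale * sum(coeffs[k] * kernel(n, k, n_half) for k in range(n_half))
            for n in range(2 * n_half)]

def tdac_roundtrip(signal, n_half, scale=None):
    """MDCT -> IMDCT -> windowed overlap-add with hop N; returns the reconstruction."""
    if len(signal) % n_half != 0:
        raise ValueError("signal length must be a multiple of N in this sketch")
    if scale is None:
        scale = 2.0 / n_half  # assumption: scale matching w^2[n] + w^2[n+N] = 1
    window = sine_window(n_half)
    # Pad N zeros on each side so every real sample lies in two frames.
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        frame = imdct(mdct(padded[start:start + 2 * n_half], window), scale)
        for n in range(2 * n_half):
            out[start + n] += window[n] * frame[n]  # synthesis window + overlap-add
    return out[n_half:n_half + len(signal)]

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(24)]
y = tdac_roundtrip(x, n_half=6)
print(max(abs(a - b) for a, b in zip(x, y)))  # rounding-level error only
```

Each individual frame is badly aliased; only the sum of two overlapping, windowed frames reproduces the input.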

Theorem 40.2 (Princen-Bradley Window Characterization)

A symmetric window $w[n] = w[2N-1-n]$, $n = 0,\ldots,2N-1$, used for both analysis and synthesis, enables perfect reconstruction via the MDCT overlap-add scheme if and only if it satisfies

$w^2[n] + w^2[n+N] = 1, \quad n = 0, 1, \ldots, N-1.$

The sine window $w[n] = \sin\!\left(\frac{\pi(n + \frac{1}{2})}{2N}\right)$ is the simplest such window, but it is not the only one: the Kaiser-Bessel-derived (KBD) window used in AAC and the power-sine window used in Vorbis satisfy the same condition.

Proof.

Necessity. From the TDAC cancellation step in Theorem 40.1, the coefficient of $x[mN+n]$ in the reconstructed output is $w^2[n] + w^2[n+N]$. For exact reconstruction this must equal 1 for all $n$, which is precisely the Princen-Bradley condition.

Sufficiency. Any symmetric window satisfying the Princen-Bradley condition makes the aliasing cancellation in the proof of Theorem 40.1 exact. The aliasing terms cancel identically (opposite signs from the IMDCT fold, equal weights by symmetry), and the unity coefficient ensures $y = x$.

The sine window as a solution. The Princen-Bradley equation $w^2[n] + w^2[n+N] = 1$ is a Pythagorean constraint: any window of the form $w[n] = \sin\phi[n]$, $w[n+N] = \cos\phi[n]$ satisfies it. The choice $\phi[n] = \theta = \frac{\pi(n+\frac12)}{2N}$ gives the sine window, which is symmetric, positive, and tapers smoothly to the frame edges (limiting blocking artifacts). Other choices of $\phi[n]$ that preserve symmetry give other valid windows, so the sine window is a convenient solution rather than a unique one. $\square$
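The Princen-Bradley constraint can be checked numerically for more than one window. The sketch below tests the sine window and, as an outside example not taken from this episode, the power-sine window defined by the Vorbis codec, $w[n] = \sin\!\bigl(\tfrac{\pi}{2}\sin^2\tfrac{\pi(n+\frac12)}{2N}\bigr)$. Function names are illustrative.

```python
import math

def sine_window(n_half):
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def vorbis_window(n_half):
    """Power-sine window (Vorbis): sin(pi/2 * sin^2(theta)) with the sine-window theta."""
    return [math.sin(math.pi / 2 * math.sin(math.pi * (n + 0.5) / (2 * n_half)) ** 2)
            for n in range(2 * n_half)]

def princen_bradley_error(window):
    """Largest deviation of w^2[n] + w^2[n+N] from 1 over the first half."""
    n_half = len(window) // 2
    return max(abs(window[n] ** 2 + window[n + n_half] ** 2 - 1.0)
               for n in range(n_half))

def is_symmetric(window, tol=1e-12):
    return all(abs(window[n] - window[-1 - n]) < tol for n in range(len(window)))

for name, make in (("sine", sine_window), ("vorbis", vorbis_window)):
    w = make(18)
    print(name, princen_bradley_error(w), is_symmetric(w))
```

Both windows pass, which illustrates that the Pythagorean form $\sin\phi[n]$, $\cos\phi[n]$ leaves freedom in the choice of $\phi[n]$; a rectangular window of ones, by contrast, gives $w^2[n] + w^2[n+N] = 2$.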

Theorem 40.3 (MDCT Orthogonality and Inversion)

The MDCT of a windowed block factors into a time-domain fold followed by a length-$N$ DCT-IV. The DCT-IV basis functions

$\phi_{k}[n] = \cos\!\left[\frac{\pi}{N}\!\left(n + \frac{1}{2}\right)\!\left(k + \frac{1}{2}\right)\right], \quad k, n \in \{0,\ldots,N-1\},$

which act on the folded input (the $N$-point pre-processing of the windowed $2N$-sample block), form an orthogonal set with

$\sum_{n=0}^{N-1} \phi_k[n]\,\phi_{k'}[n] = \frac{N}{2}\,\delta_{kk'}.$

Consequently, the IMDCT formula (Definition 40.8) correctly inverts the DCT-IV stage of the MDCT: applying the IMDCT to the MDCT coefficients and then overlap-adding recovers the original signal exactly (as guaranteed by Theorem 40.1).

Proof.

The MDCT can be decomposed into two operations: (1) time-domain folding, which maps the $2N$-sample windowed block $x_w[n] = w[n]\,x[n]$ to an $N$-sample vector $v[n]$ via

$v[n] = -x_w[\tfrac{3N}{2}+n] - x_w[\tfrac{3N}{2}-1-n]$

for $0 \le n < N/2$, and $v[n] = x_w[n-\tfrac{N}{2}] - x_w[\tfrac{3N}{2}-1-n]$ for $N/2 \le n < N$, where the signs follow from the symmetries of the cosine modulation; (2) DCT-IV, the length-$N$ DCT-IV applied to $v[n]$. Under Definition 40.6 no further scaling is applied.

The DCT-IV is a real, symmetric, orthogonal transform: its matrix $C$ satisfies $C C^\top = \frac{N}{2} I$ (with the unnormalized kernel $\cos[\frac{\pi}{N}(n+\frac12)(k+\frac12)]$). Therefore:

$\sum_{n=0}^{N-1}\phi_k[n]\phi_{k'}[n] = \frac{N}{2}\,\delta_{kk'}.$

Since the DCT-IV is its own inverse up to the factor $2/N$, the IMDCT recovers the folded vector $v[n]$ exactly. The folding step alone is not invertible (it maps $2N$ samples to $N$), so unfolding $v[n]$ returns the windowed block plus its mirror-image aliasing. It is the overlap-add of Theorem 40.1 that cancels this aliasing and yields perfect reconstruction. $\square$
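The fold-then-DCT-IV factorization can be verified against the direct MDCT sum. In this sketch the fold pairs each sample with its mirror image about $n = \tfrac{N}{2} - \tfrac12$ (first half, subtracted) or $n = \tfrac{3N}{2} - \tfrac12$ (second half, added and negated); $N$ is assumed even, as it is for both MP3 block lengths. Function names and the test vector are illustrative.

```python
import math

def mdct_direct(xw):
    """MDCT sum of Definition 40.6 applied to an already windowed 2N block."""
    n_half = len(xw) // 2
    return [sum(xw[n] * math.cos(math.pi / n_half * (n + n_half / 2 + 0.5) * (k + 0.5))
                for n in range(2 * n_half))
            for k in range(n_half)]

def fold(xw):
    """Fold a windowed 2N block (N even) into N samples."""
    n_half = len(xw) // 2
    q = n_half // 2  # N/2
    v = [0.0] * n_half
    for n in range(q):
        v[n] = -xw[3 * q + n] - xw[3 * q - 1 - n]   # second half, even symmetry
    for n in range(q, n_half):
        v[n] = xw[n - q] - xw[3 * q - 1 - n]        # first half, odd symmetry
    return v

def dct_iv(v):
    """Unnormalized DCT-IV; applying it twice multiplies the input by N/2."""
    n_half = len(v)
    return [sum(v[n] * math.cos(math.pi / n_half * (n + 0.5) * (k + 0.5))
                for n in range(n_half))
            for k in range(n_half)]

xw = [float((7 * n) % 11 - 5) for n in range(12)]  # arbitrary test block, N = 6
direct = mdct_direct(xw)
factored = dct_iv(fold(xw))
print(max(abs(a - b) for a, b in zip(direct, factored)))
```

The same `dct_iv` routine, applied twice, returns $\tfrac{N}{2}$ times its input, which is the orthogonality relation of Theorem 40.3 in matrix form.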

Theorem 40.4 (SMR Bit Allocation Optimality)

Consider $B$ total bits to be allocated across $K$ subbands with signal-to-mask ratios $\mathrm{SMR}_1, \ldots, \mathrm{SMR}_K$ (in dB). A subband is audible if $\mathrm{SMR}_i > 0$ .

The perceptual rate-distortion optimal strategy is:

Assign zero bits to any subband with $\mathrm{SMR}_i \le 0$ .
Among audible subbands, allocate bits in proportion to $\mathrm{SMR}_i$ (greedy water-filling on the perceptual distortion measure).

Under uniform quantization with step size $\Delta_i$ , assigning $b_i$ bits to subband $i$ reduces the quantization noise power by $6.02\,b_i$ dB. The allocation that minimizes total audible distortion subject to $\sum b_i = B$ concentrates bits where SMR is largest.

Proof.

Model the quantization noise in subband $i$ as white noise with power $\sigma_i^2 = C \cdot 2^{-2b_i}$ (uniform quantizer approximation, where $C$ depends on the signal amplitude). The perceptual distortion weight for subband $i$ is 0 if the subband is masked ( $\mathrm{SMR}_i \le 0$ ) and 1 otherwise — because masked subbands contribute no perceived noise regardless of quantization error.

For audible subbands, the total perceptual distortion is

$\mathcal{D} = \sum_{i:\,\mathrm{SMR}_i > 0} \sigma_i^2 = C \sum_{i \in \mathcal{A}} 2^{-2b_i}.$

Minimise $\mathcal{D}$ subject to $\sum_{i \in \mathcal{A}} b_i = B$ using Lagrange multipliers:

$\frac{\partial}{\partial b_i}\left(\sum_{i \in \mathcal{A}} 2^{-2b_i} - \lambda \sum_{i \in \mathcal{A}} b_i\right) = 0$

$\Rightarrow -2\ln(2)\cdot 2^{-2b_i} = \lambda \quad \forall\, i \in \mathcal{A}.$

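This gives $2^{-2b_i} = \text{const}$, i.e., $b_i = \text{const}$ for uniform signal amplitudes: equal bits across audible subbands. When subband energies $E_i$ and masking thresholds $T_i$ (both in linear power units) differ, the quantizer step scales with the signal amplitude, so $\sigma_i^2 \propto E_i\,2^{-2b_i}$, and the audible quantity is the noise relative to the mask, $\sigma_i^2 / T_i \propto 10^{\mathrm{SMR}_i/10}\,2^{-2b_i}$. Repeating the Lagrange argument on this weighted sum equalizes $10^{\mathrm{SMR}_i/10}\,2^{-2b_i}$ across subbands. Taking $10\log_{10}$ of that condition gives $\mathrm{SMR}_i - 6.02\,b_i = \mu$, which leads to the water-filling solution:

$b_i = \frac{\left(\mathrm{SMR}_i - \mu\right)_+}{6.02},$

where $\mu$ (in dB) is chosen so that $\sum b_i = B$ and $(t)_+ = \max(t,0)$. Every subband that receives bits ends up with the same noise-to-mask ratio $\mu$. This allocates more bits to subbands with higher SMR and zero bits to subbands below the masking threshold, which is the stated policy. $\square$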

Prop 40.1 (Gabor Uncertainty and Block Switching)

By the Heisenberg-Gabor uncertainty principle (EP36), no analysis window can achieve arbitrarily high resolution in both time and frequency simultaneously:

$\sigma_t \cdot \sigma_f \ge \frac{1}{4\pi}.$

For an MDCT frame of $2N$ samples at sampling rate $f_s$ :

Time resolution: $\Delta t = N / f_s$ (step size between frames).
Frequency resolution: $\Delta f = f_s / (2N)$ (bin spacing).

Thus $\Delta t \cdot \Delta f = 1/2$ — a fixed product. MP3 resolves this tension by switching block length adaptively:

Block type	$2N$ (per subband)	Total coefficients	$\Delta t$ at 44.1 kHz	Use case
Long	36	576	~13 ms	Sustained tones, instrument harmonics
Short	12	192 (×3)	~4.4 ms	Drum hits, consonant bursts

The encoder detects transients, for example from a sudden rise in subband energy, and switches to short-block mode to limit pre-echo: quantization noise is spread across the whole analysis block, so in a long block it precedes a sharp attack by more than the short pre-masking window (Definition 40.4) can hide.
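The table entries follow from the block length, the 32-subband split and the sampling rate. The sketch below recomputes them; the function name is illustrative, and the bin spacing is taken as the Nyquist band divided by the total number of coefficients.

```python
def block_resolution(n_half, fs_hz=44100.0, subbands=32):
    """Return (time step in ms, bin spacing in Hz) for an MDCT of length N per subband."""
    hop_pcm = n_half * subbands            # PCM samples advanced per block
    dt_ms = 1000.0 * hop_pcm / fs_hz       # time resolution (frame step)
    df_hz = (fs_hz / 2.0) / hop_pcm        # Nyquist band split into hop_pcm bins
    return dt_ms, df_hz

for name, n_half in (("long", 18), ("short", 6)):
    dt_ms, df_hz = block_resolution(n_half)
    print(f"{name:5s} dt = {dt_ms:5.2f} ms  df = {df_hz:6.1f} Hz  "
          f"product = {dt_ms / 1000.0 * df_hz:.2f}")
```

Switching from long to short blocks improves the time step by a factor of three and degrades the bin spacing by the same factor, so the product stays at 1/2.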

Prop 40.2 (Huffman Coding of MDCT Coefficients)

After psychoacoustic-guided quantization, the $N$ MDCT coefficients per frame are integers (mostly near zero due to masking-based zeroing and coarse quantization). By Shannon’s source coding theorem (EP07),

$H(X) \le \mathbb{E}[\ell(X)] < H(X) + 1,$

where $H(X)$ is the entropy of the quantized coefficient distribution and $\ell(X)$ is the assigned Huffman codeword length. MP3 uses a fixed set of 32 Huffman table indices for coefficient pairs (selected per region within each granule), plus two small tables for quadruples of values in $\{-1, 0, 1\}$. The run of zero-quantized coefficients at the top of the spectrum is not transmitted at all; the decoder infers it from the counts of coded values. The combination of perceptual quantization and Huffman coding gives a compression ratio of roughly 11:1 at 128 kbps for 44.1 kHz, 16-bit stereo audio ($1411.2 / 128 \approx 11$).
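The entropy bound can be illustrated on a toy distribution of quantized values. The sketch below builds a Huffman code from scratch with `heapq`; the probabilities are invented for illustration (peaked at zero, as quantized MDCT coefficients tend to be), and MP3 itself uses fixed tables rather than a code built per file.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Codeword length per symbol for a binary Huffman code over probs (dict)."""
    if len(probs) == 1:
        return {sym: 1 for sym in probs}
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:          # every merge adds one bit to these symbols
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = {0: 0.6, 1: 0.15, -1: 0.15, 2: 0.05, -2: 0.05}  # hypothetical values
lengths = huffman_code_lengths(probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
mean_length = sum(probs[s] * lengths[s] for s in probs)
print(f"H = {entropy:.3f} bits, mean codeword length = {mean_length:.3f} bits")
```

The most probable value (zero) gets the shortest codeword, and the mean length sits between $H(X)$ and $H(X) + 1$ as the theorem requires.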

Numerical Examples

Example 1: Absolute threshold deletion. A 50 Hz sine wave at 30 dB SPL:

$T_q(50) \approx 3.64 \cdot (0.05)^{-0.8} - 6.5\,e^{-0.6(0.05-3.3)^2} + 10^{-3}(0.05)^4 \approx 40.0 - 0.01 + 0.00 \approx 40\,\text{dB SPL}.$

Since 30 dB < 40 dB, this component is inaudible. The MP3 encoder assigns it zero bits and discards it completely, saving bits at zero perceptual cost.

Example 2: Simultaneous masking. A 1 kHz tone at 60 dB SPL has Bark position $z_0 \approx 8.5$ Bark. Its masking threshold at 2 kHz ($z \approx 13.1$ Bark by Definition 40.2) is

$T_{\mathrm{mask}}(13.1) \approx 60 - 10 \cdot (13.1 - 8.5) = 60 - 46 = 14\,\text{dB SPL}.$

A 2 kHz tone at 10 dB SPL is therefore masked (10 dB < 14 dB) and receives zero bits, even though it is above the absolute threshold $T_q(2000) \approx -0.3$ dB SPL.

Example 3: MDCT dimensions. At 44,100 Hz sampling rate, one MP3 granule covers 576 time-domain samples per channel, split across 32 subbands. Each subband applies a length-18 MDCT ($N = 18$, $2N = 36$ samples), producing 18 coefficients. Total: $32 \times 18 = 576$ MDCT coefficients per granule. At 128 kbps a granule lasts $576/44100 \approx 13.06$ ms, so its average budget is $0.01306 \times 128000 \approx 1672$ bits (the bit reservoir lets individual granules borrow from their neighbours). That is about 2.9 bits per MDCT coefficient for a single channel, or about 1.45 bits for stereo, before masking-based zeroing.

Example 4: Princen-Bradley verification (sine window, $N = 4$ ). For $n = 1$ :

$w[1] = \sin\!\left(\frac{\pi \cdot 1.5}{8}\right) = \sin(33.75^\circ) \approx 0.5556,$

$w[5] = \sin\!\left(\frac{\pi \cdot 5.5}{8}\right) = \sin(123.75^\circ) \approx 0.8315.$

$w^2[1] + w^2[5] \approx 0.5556^2 + 0.8315^2 \approx 0.3087 + 0.6913 = 1.000. \quad \checkmark$

Musical Connection

Why MP3 sounds different on different instruments

The psychoacoustic model treats all spectral components equally within a critical band, but musical instruments have very different energy distributions:

Piano and harpsichord: attack transients are sharp and broadband. In long-block mode, the 13 ms frame smears these attacks; a listener may detect a faint “pre-echo” ringing slightly before the percussive onset — quantization noise leaking backward in time. Short-block switching mitigates this but at the cost of reduced frequency resolution for the harmonic series.
Sustained strings and winds: their spectral energy is concentrated in narrow harmonic partials. Long-block mode resolves these partials finely, and the masking model correctly identifies that each strong partial masks its immediate neighbors, freeing up bits.
Cymbals and brushed snare: rich in high-frequency noise above 8 kHz, where the absolute threshold rises steeply (under the Definition 40.1 formula, from about 5 dB SPL at 8 kHz to more than 60 dB SPL at 16 kHz). A large fraction of the cymbal’s spectral energy falls below the masking threshold and is discarded, which is why heavily compressed MP3s make cymbals sound “papery” or “sizzly.”

The codec as a model of hearing

The three-layer MP3 pipeline is, in a sense, an engineering model of the auditory system: the cochlear filter bank (critical bands on the Bark scale) is approximated by the polyphase filter bank + MDCT; the basilar membrane’s masking behavior is approximated by the spreading function (Definition 40.3); and the auditory nerve’s efficient spike coding is loosely mirrored by Huffman entropy coding. The mathematical structures developed by Shannon (1948), Princen and Bradley (1986), and the ISO MPEG working group (late 1980s to early 1990s) align remarkably well with auditory physiology.

From MDCT to neural codecs

The same architectural logic — transform coding + perceptual quantization + entropy coding — persists in every subsequent codec:

Codec	Transform	Psychoacoustic model	Entropy coder
MP3 (1993)	MDCT, $N=18/6$	ISO MPEG model 1/2	Huffman (32 tables)
AAC (1997)	MDCT, $N=1024/128$	Psychoacoustic model, plus TNS and long-term prediction tools	Huffman (codebooks)
Opus (2012)	MDCT + SILK LP	Perceptual shaping	Range coder
EnCodec (2022)	Learned encoder + residual vector quantization	Learned (adversarial and perceptual losses)	Optional arithmetic coding of RVQ codes

EnCodec (EP22) replaces the hand-crafted psychoacoustic model with a discriminator trained adversarially to enforce perceptual fidelity. The function it learns plays a role analogous to the masking criterion, now parameterized by a neural network. The overall framework does not change; only the learning mechanism does.

Bit budget as a compositional constraint

At 128 kbps, a 4-minute song requires $128000 \times 240 = 30{,}720{,}000$ bits, about 3.84 MB. A CD-quality PCM encoding of the same song requires $44100 \times 16 \times 2 \times 240 = 338{,}688{,}000$ bits, about 42.3 MB: a factor of about 11 larger. The MP3 encoder achieves this compression by exploiting two complementary forms of redundancy: statistical redundancy (non-uniform distribution of MDCT coefficients, removed by Huffman coding) and perceptual redundancy (masked components, removed by SMR-guided zeroing). Neither alone would suffice; together they enable transparent compression.
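The file-size arithmetic is worth checking with explicit units, since bits and bytes are easy to mix up. This sketch assumes decimal megabytes (1 MB = 8,000,000 bits) and ignores container and header overhead.

```python
def megabytes(bits: int) -> float:
    """Convert a bit count to decimal megabytes."""
    return bits / 8 / 1_000_000

duration_s = 4 * 60
mp3_bits = 128_000 * duration_s          # 128 kbps stream
pcm_bits = 44_100 * 16 * 2 * duration_s  # 44.1 kHz, 16-bit, stereo

print(f"MP3: {mp3_bits:,} bits = {megabytes(mp3_bits):.2f} MB")
print(f"PCM: {pcm_bits:,} bits = {megabytes(pcm_bits):.2f} MB")
print(f"ratio = {pcm_bits / mp3_bits:.3f}")
```

The ratio is independent of the song length: it is simply the PCM bit rate (1411.2 kbps) divided by the MP3 bit rate.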

Limits and Open Questions

Pre-echo on sharp transients. Even with block switching, the MDCT’s inherent 50% overlap means quantization noise can leak into adjacent frames. For very sharp transients (e.g., pizzicato, castanets), pre-echo remains audible at low bit rates. AAC adds temporal noise shaping (TNS), a linear prediction filter applied across frequency within a block, which shapes the quantization noise to follow the signal’s temporal envelope so that it stays under the transient instead of preceding it.
Psychoacoustic model accuracy. The spreading function in Definition 40.3 is a simplified model. Real masking is signal-dependent, varies across listeners, and is affected by binaural processing (signals arriving at both ears). The ISO MPEG masking model is deliberately conservative; it retains some inaudible components that could theoretically be discarded, leaving room for further compression.
Stereo coding. MP3 uses mid/side (M/S) stereo and intensity stereo at high frequencies, exploiting the fact that binaural localization cues above ~2 kHz rely on intensity only (not phase). This introduces additional perceptual approximations not covered in this episode’s single-channel analysis.
Low bit-rate collapse. Below ~64 kbps, the SMR water-filling solution must discard most subbands, and the remaining coarsely quantized coefficients produce audible “musical noise” — isolated spectral components that do not arise naturally in the source. This is the fundamental failure mode of transform codecs. Parametric codecs (HE-AAC, Opus at very low rates) address this by coding parameters of the signal model rather than raw coefficients.
Learned psychoacoustic models. The hand-crafted spreading function is fixed across all content. A learned model (as in EnCodec or SoundStream) adapts its quantization to the content distribution. An open theoretical question is: what is the rate-distortion function for perceptual audio distortion measures, and how close do current neural codecs come to achieving it?
MDCT and the short-time Fourier transform. The MDCT can be viewed as a real-valued, critically sampled STFT with the Princen-Bradley window constraint replacing the standard overlap-add condition. The Balian-Low theorem (see EP36) states that a Gabor frame with good time-frequency localization cannot be an orthonormal basis. The MDCT sidesteps this by using a real cosine basis (not a complex Gabor atom), allowing critical sampling without the Balian-Low obstruction — at the cost of losing phase information.

Academic References

Princen, J. P., & Bradley, A. B. (1986). Analysis/synthesis filter bank design based on time domain aliasing cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(5), 1153–1161. (The foundational paper proving TDAC and the Princen-Bradley window condition.)
Johnston, J. D. (1988). Transform coding of audio signals using perceptual noise criteria. IEEE Journal on Selected Areas in Communications, 6(2), 314–323. (Introduced the psychoacoustic model used as the basis for MP3 Layer III.)
Jayant, N., Johnston, J., & Safranek, R. (1993). Signal compression based on models of human perception. Proceedings of the IEEE, 81(10), 1385–1422. (Comprehensive survey of perceptual audio coding theory.)
Bosi, M., & Goldberg, R. E. (2003). Introduction to Digital Audio Coding and Standards. Springer. Ch. 3 (Psychoacoustic models), Ch. 5 (MDCT and TDAC), Ch. 7 (MP3 encoder).
Painter, T., & Spanias, A. (2000). Perceptual coding of digital audio. Proceedings of the IEEE, 88(4), 451–515. (Tutorial review of the full MP3/AAC pipeline with worked examples.)
Zwicker, E., & Fastl, H. (1999). Psychoacoustics: Facts and Models (2nd ed.). Springer. (The canonical reference for the Bark scale, spreading functions, and temporal masking curves used in all perceptual codecs.)
Malvar, H. S. (1992). Signal Processing with Lapped Transforms. Artech House. Ch. 4 (MDCT orthogonality and inversion proofs in full generality).
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9), 1098–1101. (The original Huffman coding paper; MP3 uses fixed Huffman tables derived from this framework; see EP07.)
Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). High fidelity neural audio compression. arXiv:2210.13438. (EnCodec: the neural codec discussed in EP22, which replaces the ISO psychoacoustic model with an adversarially trained perceptual loss.)
ISO/IEC 11172-3:1993. Information technology — Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s — Part 3: Audio. (The MP3 standard specification, including the normative psychoacoustic model and Huffman table definitions.)