EP35: The Phase Vocoder — Mathematics of Pitch Shifting
Overview
A single recorded note in a sampler like Kontakt can produce 88 different pitches — not by storing 88 separate recordings, but by mathematically stretching or compressing one recording. The tool that makes this possible is the phase vocoder, a signal processing algorithm that separates a sound into magnitude and phase components via the Short-Time Fourier Transform, then manipulates those components to achieve pitch shifting and time stretching independently.
The central mathematical challenge is phase continuity. Naive pitch shifting — simply resampling a signal at a different rate — changes duration alongside pitch. The phase vocoder solves this by working frame-by-frame in the frequency domain: it estimates the true instantaneous frequency in each spectral bin from the phase difference between adjacent frames, scales those frequencies by a desired ratio, and then reconstructs a phase-continuous output signal. The reconstruction is only valid if the analysis and synthesis windows satisfy the Constant Overlap-Add (COLA) condition, which constrains the joint choice of window function and hop size.
This episode builds the full mathematical picture: STFT definition and phase advance, phase unwrapping and instantaneous frequency estimation, pitch-shift by frequency scaling, time-stretch by separating analysis and synthesis hop sizes, and the COLA condition for perfect reconstruction.
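Narration (translated from Chinese): “A single sample in Kontakt can produce 88 different pitches. It does not store 88 recordings; it mathematically stretches or compresses one recording. The tool behind this is called the phase vocoder. Today we take it apart and look at the mathematical core of sampled instruments.”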
Prerequisites / 前置知识
This episode builds directly on:
- EP30: The Sampling Theorem (sinc reconstruction, bandlimited signals, perfect reconstruction from discrete samples). The wavetable synthesis section of this episode uses windowed-sinc interpolation, a truncated form of the ideal reconstruction kernel derived in EP30.
- Fourier analysis (EP02): the STFT is the Fourier transform applied locally through a sliding window. All notation carries over: $\omega$ for angular frequency, $k$ for DFT bins.
- Complex exponentials: phase is the argument $\phi = \arg z$ of a complex number $z = |z|\,e^{i\phi}$; the key operations are extracting $\phi$ and tracking how it evolves between frames.
Definitions
Let $x[n]$ be a discrete-time signal and $w[n]$ a window function of length $N$. The Short-Time Fourier Transform of $x$ at frame $m$ and frequency bin $k$ is
$$X[m,k] = \sum_{n=0}^{N-1} x[n + mH_a]\, w[n]\, e^{-i\omega_k n},$$
where $H_a$ is the analysis hop size (the number of samples the window advances between consecutive frames), and $\omega_k = 2\pi k/N$. The STFT output is a complex-valued time-frequency matrix with magnitude $|X[m,k]|$ and phase $\phi[m,k] = \arg X[m,k]$.
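The definition can be sketched directly in code. The following is a minimal, unoptimized illustration using only the Python standard library. The helper names `hann`, `stft_bin`, and `stft` are hypothetical, the periodic Hann window is an assumed choice, and the phase reference (each frame measured from its own first sample) is an assumption chosen so that a sinusoid's phase advances by frequency times hop between frames. A practical implementation would use an FFT instead of this direct sum.

```python
import cmath
import math


def hann(N):
    """Periodic Hann window of length N (assumed window choice)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]


def stft_bin(x, w, m, k, hop):
    """One STFT coefficient: frame m, bin k, analysis hop `hop`."""
    N = len(w)
    omega_k = 2.0 * math.pi * k / N      # bin center frequency in rad/sample
    start = m * hop                      # first sample of frame m
    return sum(x[start + n] * w[n] * cmath.exp(-1j * omega_k * n)
               for n in range(N))


def stft(x, w, hop):
    """Full STFT as a list of frames, each a list of N complex bins."""
    N = len(w)
    n_frames = (len(x) - N) // hop + 1
    return [[stft_bin(x, w, m, k, hop) for k in range(N)]
            for m in range(n_frames)]
```

Each row of the result is one frame; `abs(X[m][k])` gives the magnitude and `cmath.phase(X[m][k])` the wrapped phase of bin `k` in frame `m`.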
Main Theorems / 主要定理
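To shift all spectral components of $x[n]$ upward by frequency ratio $r$:
- Compute the STFT to obtain $X[m,k]$, with magnitude $|X[m,k]|$ and phase $\phi[m,k]$.
- Estimate the instantaneous frequency $\hat\omega[m,k]$ via Theorem 35.2.
- Replace each true frequency $\hat\omega[m,k]$ by $r\,\hat\omega[m,k]$.
- Propagate a phase-continuous output phase: $\psi[m,k] = \psi[m-1,k] + r\,\hat\omega[m,k]\,H_a$.
- Reconstruct $y[n]$ using the inverse DFT of $Y[m,k] = |X[m,k]|\,e^{i\psi[m,k]}$ followed by overlap-add synthesis.
The output $y[n]$ contains the same spectral magnitudes as $x[n]$ but all instantaneous frequencies scaled by $r$, with no change in duration.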
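The per-bin bookkeeping in the estimation, scaling, and phase-propagation steps can be sketched as follows. This assumes that Theorem 35.2 is the standard estimator (bin center frequency plus the wrapped phase deviation divided by the hop). The function names are illustrative, and the example values (1024-point frames, hop 256, bin 10, a 440 Hz tone at 44.1 kHz) are assumptions, not values fixed by the procedure.

```python
import math


def princarg(theta):
    """Wrap an angle to the interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def instantaneous_frequency(phi_prev, phi_curr, k, N, hop):
    """Estimate the true frequency (rad/sample) in bin k from two frame phases."""
    omega_k = 2.0 * math.pi * k / N
    deviation = princarg(phi_curr - phi_prev - omega_k * hop)
    return omega_k + deviation / hop


def shifted_phases(phis, k, N, hop, ratio):
    """Accumulate output phases for one bin with frequencies scaled by `ratio`."""
    psi = [phis[0]]                      # start from the first analysis phase
    for m in range(1, len(phis)):
        omega_hat = instantaneous_frequency(phis[m - 1], phis[m], k, N, hop)
        psi.append(psi[-1] + ratio * omega_hat * hop)
    return psi


# Assumed example: wrapped analysis phases of a 440 Hz tone seen in bin 10.
N, hop, k, fs = 1024, 256, 10, 44100.0
omega0 = 2.0 * math.pi * 440.0 / fs
phis = [princarg(0.3 + omega0 * hop * m) for m in range(6)]
psi = shifted_phases(phis, k, N, hop, ratio=1.5)
increments = [psi[m] - psi[m - 1] for m in range(1, len(psi))]
print(increments)                        # each is close to 1.5 * omega0 * hop
```

Because the output phase is a running sum, it never has to be unwrapped; only the analysis phase difference is wrapped.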
The proof has three parts: correctness of frequency scaling, phase continuity, and preservation of duration.
Frequency scaling. The synthesis frame is constructed with magnitude $|X[m,k]|$ (unchanged) and a new phase $\psi[m,k]$ whose increment per frame is $r\,\hat\omega[m,k]\,H_a$. By the inverse of Theorem 35.1, an STFT frame with that phase increment, when inverse-transformed, yields a local sinusoid at instantaneous frequency $r\,\hat\omega[m,k]$. Therefore every spectral component is scaled by $r$.
Phase continuity. The output phase is defined by the recurrence $\psi[m,k] = \psi[m-1,k] + r\,\hat\omega[m,k]\,H_a$. By construction this is a cumulative sum, so there are no discontinuities between frames. If the synthesis used the raw (wrapped) analysis phase $\phi[m,k]$ instead, adjacent frames would have inconsistent phase relationships, producing comb-filter artifacts (metallic distortion) when overlap-added. The recurrence guarantees a smooth phase trajectory.
Duration unchanged. Both analysis and synthesis use the same hop size $H_a$, so the same number of frames covers the same total duration. The output signal has the same number of samples as the input.
Duration scaling. For time stretching by a factor $\alpha$, the synthesis hop is $H_s = \alpha H_a$ and the phase recurrence is $\psi[m,k] = \psi[m-1,k] + \hat\omega[m,k]\,H_s$. The analysis processes $M$ frames from $x[n]$, covering a total of about $M H_a$ input samples. Synthesis places those frames with spacing $H_s$, covering about $M H_s$ output samples. The duration ratio is $H_s / H_a = \alpha$.
Pitch preservation. The pitch of the output is determined by the instantaneous frequencies of the synthesis frames. The phase increment per synthesis frame at bin $k$ is $\hat\omega[m,k]\,H_s$. The corresponding angular frequency of the synthesized sinusoid is the increment divided by the synthesis hop: $\hat\omega[m,k]\,H_s / H_s = \hat\omega[m,k]$. The instantaneous frequency is $\hat\omega[m,k]$, identical to the analysis frequency. Since no frequency scaling is applied, the pitch is preserved.
Why naive repositioning fails. If the raw analysis phases $\phi[m,k]$ were used directly in the repositioned synthesis frames without recurrence, consecutive frames would carry phase values appropriate for positions $mH_a$ but placed at positions $mH_s$. The mismatch produces phase discontinuities at frame boundaries, causing comb-filter (metallic) distortion in the overlap-add sum. The recurrence eliminates this.
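The argument above can be turned into a compact time-stretch sketch. This is an illustrative implementation under stated assumptions, not a production algorithm: it uses a periodic Hann window for both analysis and synthesis, normalizes the overlap-add sum by the accumulated squared window, handles every bin independently (no phase locking or transient handling), and relies on a small recursive radix-2 FFT so that it needs only the standard library. The names `fft`, `princarg`, and `time_stretch` are hypothetical.

```python
import cmath
import math


def fft(a, sign=-1):
    """Recursive radix-2 DFT; len(a) must be a power of two.
    sign=+1 gives the unnormalized inverse transform."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], sign), fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def princarg(theta):
    """Wrap an angle to [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def time_stretch(x, alpha, N=256, hop_a=64):
    """Stretch x by factor alpha while leaving its pitch unchanged."""
    hop_s = int(round(alpha * hop_a))            # synthesis hop
    w = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]
    n_frames = (len(x) - N) // hop_a + 1
    y = [0.0] * ((n_frames - 1) * hop_s + N)
    norm = [0.0] * len(y)                        # accumulated w^2 for normalization
    phi_prev = [0.0] * N
    psi = [0.0] * N
    for m in range(n_frames):
        X = fft([x[m * hop_a + n] * w[n] for n in range(N)])
        for k in range(N):
            phi = cmath.phase(X[k])
            if m == 0:
                psi[k] = phi                     # first frame keeps its own phase
            else:
                omega_k = 2.0 * math.pi * k / N
                dev = princarg(phi - phi_prev[k] - omega_k * hop_a)
                psi[k] += (omega_k + dev / hop_a) * hop_s
            phi_prev[k] = phi
            X[k] = abs(X[k]) * cmath.exp(1j * psi[k])
        frame = fft(X, sign=+1)
        for n in range(N):
            y[m * hop_s + n] += frame[n].real / N * w[n]
            norm[m * hop_s + n] += w[n] * w[n]
    return [v / s if s > 1e-6 else 0.0 for v, s in zip(y, norm)]
```

Calling `time_stretch(x, 1.5)` on a steady sinusoid should return a longer sinusoid at the same frequency; percussive material would show the smearing typical of this basic form.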
Numerical Examples
Example 1: Phase advance for a pure tone
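Let $N = 1024$, $H_a = 256$, sample rate $f_s = 44100$ Hz. Consider a pure tone at $f_0 = 440$ Hz (A4).
The angular frequency is $\omega_0 = 2\pi f_0 / f_s \approx 0.06269$ rad/sample. The nearest bin is $k = \mathrm{round}(f_0 N / f_s) = \mathrm{round}(10.22) = 10$. The center frequency of bin 10 is $\omega_{10} = 2\pi \cdot 10 / 1024 \approx 0.06136$ rad/sample.
Expected phase advance per hop: $\omega_{10} H_a = 5\pi \approx 15.7080$ rad.
True phase advance per hop: $\omega_0 H_a \approx 16.0485$ rad.
Phase deviation: $16.0485 - 15.7080 = 0.3405$ rad (no unwrapping needed since $|0.3405| < \pi$).
Instantaneous frequency: $\hat\omega = \omega_{10} + 0.3405 / 256 \approx 0.06269$ rad/sample, recovering $f_0 = 440$ Hz. The estimate is accurate to within measurement resolution.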
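The same estimate can be checked against phases measured from an actual windowed tone instead of the ideal phase advance. The sketch below assumes a 1024-sample periodic Hann window, a hop of 256 samples, and a 44.1 kHz sample rate (values consistent with the tone landing in bin 10); the helper name `bin_phase` is hypothetical. Measured phases are wrapped, so the deviation is wrapped back into $[-\pi, \pi)$ before use.

```python
import cmath
import math

fs, N, hop, f0 = 44100.0, 1024, 256, 440.0      # assumed example parameters
omega0 = 2.0 * math.pi * f0 / fs
x = [math.cos(omega0 * n) for n in range(N + hop)]
w = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]


def bin_phase(m, k):
    """Wrapped phase of STFT bin k in frame m (direct sum, no FFT)."""
    omega_k = 2.0 * math.pi * k / N
    X = sum(x[m * hop + n] * w[n] * cmath.exp(-1j * omega_k * n)
            for n in range(N))
    return cmath.phase(X)


k = round(f0 * N / fs)                           # nearest bin
omega_k = 2.0 * math.pi * k / N
raw = bin_phase(1, k) - bin_phase(0, k) - omega_k * hop
deviation = (raw + math.pi) % (2.0 * math.pi) - math.pi
omega_hat = omega_k + deviation / hop            # rad/sample
f_hat = omega_hat * fs / (2.0 * math.pi)         # Hz
print(k, deviation, f_hat)
```

The small residual error comes from leakage of the tone's negative-frequency image into bin 10, which the Hann window suppresses strongly at this distance.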
Example 2: Pitch shifting by a perfect fifth
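A perfect fifth (just intonation) corresponds to frequency ratio $r = 3/2$.
For the 440 Hz tone above, shifted frequency: $r\,\hat\omega = 1.5 \times 0.06269 \approx 0.09403$ rad/sample, corresponding to $660$ Hz (E5).
Output phase after frame $m$ (starting from $\psi[0,k] = 0$): $\psi[m,k] = m \cdot r\,\hat\omega\,H_a \approx 24.07\,m$ rad. Each frame receives a consistent phase increment of 24.07 rad, ensuring a smooth sinusoid at 660 Hz in the output.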
Example 3: Time stretching by 1.5x
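With $\alpha = 1.5$, $H_a = 256$, the synthesis hop is $H_s = \alpha H_a = 384$ samples.
Phase recurrence for the 440 Hz bin: $\psi[m,10] = \psi[m-1,10] + \hat\omega H_s \approx \psi[m-1,10] + 24.07$ rad. The output instantaneous frequency is $24.07 / 384 \approx 0.06269$ rad/sample, unchanged at 440 Hz. A 5-second input (220,500 samples at 44.1 kHz) produces a 7.5-second output (330,750 samples), stretched by $\alpha = 1.5$. This figure is nominal: the finite frame grid trims a fraction of a window at the end of the signal.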
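The frame-count arithmetic behind the duration figures can be made explicit. The sketch below assumes a 1024-sample window (the example itself only fixes the hops) and counts only frames that fit entirely inside the input; the function name `stretch_lengths` is hypothetical.

```python
def stretch_lengths(n_in, N, hop_a, alpha):
    """Return (synthesis hop, number of frames, output length in samples)."""
    hop_s = round(alpha * hop_a)             # synthesis hop
    n_frames = (n_in - N) // hop_a + 1       # whole frames inside the input
    n_out = (n_frames - 1) * hop_s + N       # span of the repositioned frames
    return hop_s, n_frames, n_out


hop_s, n_frames, n_out = stretch_lengths(220500, 1024, 256, 1.5)
print(hop_s, n_frames, n_out, n_out / 220500)
```

Under these assumptions the output is 330,112 samples, a ratio of about 1.497. Zero-padding the input so the last partial frame is also processed would bring the result closer to the nominal 330,750.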
Example 4: COLA check — Hann at 50% overlap
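With window length $N$ and hop $H_a = N/2$, consecutive Hann windows $w[n] = \tfrac12\bigl(1 - \cos(2\pi n/N)\bigr)$ are offset by $N/2$ samples, so on the overlap the second window contributes $w[n - N/2] = \tfrac12\bigl(1 - \cos(2\pi n/N - \pi)\bigr) = \tfrac12\bigl(1 + \cos(2\pi n/N)\bigr)$. For example, at $n = 5N/8$: $w[5N/8] \approx 0.854$ and the shifted window wraps to $w[N/8] \approx 0.146$. Sum: $w[n] + w[n - N/2] = 1$ for every $n$. This confirms the COLA condition at 50% overlap (Theorem 35.4).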
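A numerical version of this check sums shifted copies of the window and inspects the steady-state region. The sketch assumes the periodic Hann form $w[n] = \tfrac12(1 - \cos(2\pi n/N))$; the function names are hypothetical. A hop of 3N/4 is included as a counterexample where the sum is not constant.

```python
import math


def hann_periodic(N):
    """Periodic Hann window of length N."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]


def ola_sum(w, hop, n_frames):
    """Overlap-add n_frames copies of w spaced hop samples apart."""
    N = len(w)
    total = [0.0] * ((n_frames - 1) * hop + N)
    for m in range(n_frames):
        for n in range(N):
            total[m * hop + n] += w[n]
    return total


def cola_range(w, hop, n_frames=16):
    """Min and max of the overlap-add sum away from the two ends."""
    N = len(w)
    s = ola_sum(w, hop, n_frames)
    steady = s[N:len(s) - N]
    return min(steady), max(steady)


w = hann_periodic(1024)
for hop in (512, 256, 768):
    print(hop, cola_range(w, hop))
```

A constant sum (minimum equal to maximum) means the hop satisfies COLA for this window; the constant itself is the gain that synthesis must divide out.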
Musical Connection / 音乐联系
The phase vocoder is the mathematical engine behind some of the most familiar sounds in modern music production.
Sampler instruments. A sampler records a note at one pitch (the “root note”) and uses the phase vocoder — or its lightweight cousin, wavetable synthesis — to render every other pitch on the keyboard from that single recording. Kontakt, Omnisphere, and similar instruments rely on this: one carefully recorded cello note becomes an entire playable cello section.
The “chipmunk effect” and autotune. Setting the pitch ratio above 1 pushes all frequencies upward, moving the spectral content into a higher frequency range. The voice sounds thin and cartoon-like because harmonic spacing and formant positions all shift together. Professional pitch correction (Melodyne, Auto-Tune) avoids this by formant preservation: the spectral envelope (formant curve) is extracted, separated from the fine harmonic structure, and handled independently so that only the fundamental pitch moves while the vocal timbre (vowel quality) remains natural.
Time-stretching in DAWs. Digital audio workstations such as Logic, Ableton, and Reaper typically offer time-stretch modes built on variants of the phase vocoder for warping audio to a different tempo. The ratio of synthesis hop to analysis hop maps directly to the duration change factor. Phase vocoder artifacts (the “phasiness” or metallic quality on transients) motivate further refinements such as transient detection, which switches to a waveform-copy method around percussive events.
Wavetable synthesis. The lighter-weight alternative described in this episode stores a single waveform period and varies playback speed to control pitch. The interpolation used between stored samples approximates the sinc reconstruction studied in EP30: The Sampling Theorem. Linear interpolation is a crude triangular-kernel approximation to sinc, while windowed-sinc interpolation approaches ideal reconstruction within the Nyquist bandwidth as the kernel length grows.
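A wavetable oscillator with linear interpolation can be sketched in a few lines. This is an illustration only: the table size of 2048 and the function names are assumptions, and the sine table stands in for any stored single-cycle waveform. A windowed-sinc version would replace the two-point blend with a longer weighted sum.

```python
import math


def make_sine_table(size):
    """One period of a sine wave, standing in for a stored waveform."""
    return [math.sin(2.0 * math.pi * i / size) for i in range(size)]


def wavetable_osc(table, freq_hz, sample_rate, n_samples):
    """Read the table at a speed set by freq_hz, interpolating linearly."""
    size = len(table)
    step = freq_hz * size / sample_rate      # table positions per output sample
    phase = 0.0
    out = []
    for _ in range(n_samples):
        i = int(phase)
        frac = phase - i
        a = table[i]
        b = table[(i + 1) % size]            # wrap around at the table end
        out.append(a + frac * (b - a))       # linear interpolation
        phase = (phase + step) % size
    return out


table = make_sine_table(2048)
a4 = wavetable_osc(table, 440.0, 44100.0, 1000)
e5 = wavetable_osc(table, 660.0, 44100.0, 1000)   # same table, faster read
```

Changing `freq_hz` changes only the read speed, which is why one stored period can serve every pitch. For waveforms with many harmonics, a fast read can push partials above the Nyquist frequency, so band-limited tables are commonly used in practice.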
Looking ahead. The Short-Time Fourier Transform has a deeper foundation: in EP36, we will see that the time-frequency resolution of the STFT is constrained by the Gabor uncertainty principle, established in 1946 — a musical analogue of the Heisenberg uncertainty principle that fundamentally limits how precisely a sound can be localized in both time and frequency simultaneously. EP42 will revisit these ideas in the context of wavelet analysis, which adapts the window width to frequency and can overcome some of STFT’s resolution tradeoffs.
Limits and Open Questions / 局限性与开放问题
Transient smearing. The phase vocoder assumes local stationarity — each frame is well modeled by a superposition of sinusoids. Percussive transients (drum hits, plucked strings) violate this assumption severely. Time-stretching a snare drum with a standard phase vocoder produces a characteristic “pre-echo” or “flamming” artifact as the transient energy is spread across adjacent frames. State-of-the-art solutions (transient detection, phase locking, object-based coding) partially address this, but perfect transient preservation under time-stretching remains an active research area.
Polyphony and phase locking. When multiple harmonic partials of the same note are present, the relative phases between bins at harmonically related frequencies encode important perceptual cues (the “shape” of the attack and the coherence of the instrument body). Naively scaling all phases independently can destroy this coherence. Phase locking algorithms propagate phase corrections from the analysis bin at the fundamental frequency up through its harmonics, preserving perceptually important phase relationships. The optimal phase locking strategy for complex polyphonic signals is still an open problem.
Pitch shifting limits. Very large pitch ratios (e.g., a ratio of 4, two octaves up) produce audible artifacts even with phase-continuous processing, because the assumed one-sinusoid-per-bin model breaks down as the scaled frequencies of different bins overlap. This motivates resampling-based approaches for large pitch changes, or source-separation preprocessing.
The uncertainty principle. The STFT window length $N$ controls a fundamental tradeoff: long windows give fine frequency resolution (small $\Delta\omega$) but poor time localization (large $\Delta t$); short windows give the reverse. There is no window for which both are simultaneously arbitrarily small. This is the Gabor uncertainty principle, previewed in the narration and to be treated rigorously in EP36.
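As a rough illustration of the tradeoff, take the DFT bin spacing as the frequency resolution and the window duration as the time resolution. These are simplifying assumptions; the variance-based quantities in Gabor's formulation differ by constant factors. At a sample rate of $f_s = 44100$ Hz:

```latex
\begin{aligned}
\Delta f &\approx \frac{f_s}{N}, \qquad \Delta t \approx \frac{N}{f_s}, \qquad \Delta f \,\Delta t \approx 1,\\
N = 1024:&\quad \Delta f \approx 43.1\ \text{Hz}, \quad \Delta t \approx 23.2\ \text{ms},\\
N = 4096:&\quad \Delta f \approx 10.8\ \text{Hz}, \quad \Delta t \approx 92.9\ \text{ms}.
\end{aligned}
```

Quadrupling the window length sharpens frequency resolution by a factor of four and blurs time localization by the same factor, so the product stays fixed.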
Formant preservation. As noted in the narration, professional pitch shifting requires separating the spectral fine structure (harmonic positions, encoding pitch) from the spectral envelope (formant positions, encoding timbre). The standard engineering approach uses cepstral liftering or LPC envelope estimation, but both are approximations. Perceptually optimal formant preservation in the presence of vibrato, breathiness, and consonant transitions is an unsolved signal processing problem.
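The cepstral liftering mentioned above can be sketched on a toy spectrum. This is a minimal illustration, not a production envelope estimator: the function names and the cutoff value are assumptions, the direct DFT is used only to stay within the standard library, and the "harmonic fine structure" is modeled as a fast ripple added to a slow one in the log-magnitude spectrum.

```python
import cmath
import math


def dft(a, sign=-1):
    """Direct DFT; sign=+1 gives the unnormalized inverse."""
    n = len(a)
    return [sum(a[j] * cmath.exp(sign * 2j * math.pi * j * k / n)
                for j in range(n))
            for k in range(n)]


def spectral_envelope(log_mag, cutoff):
    """Keep only low-quefrency cepstral terms of a log-magnitude spectrum."""
    n = len(log_mag)
    cepstrum = [c / n for c in dft(log_mag, sign=+1)]
    liftered = [c if (q < cutoff or q > n - cutoff) else 0.0
                for q, c in enumerate(cepstrum)]
    return [v.real for v in dft(liftered, sign=-1)]


# Toy log spectrum: slow ripple (envelope) plus fast ripple (harmonics).
n_bins = 128
slow = [1.0 + 0.5 * math.cos(2.0 * math.pi * 2 * k / n_bins) for k in range(n_bins)]
fast = [0.3 * math.cos(2.0 * math.pi * 20 * k / n_bins) for k in range(n_bins)]
log_mag = [s + f for s, f in zip(slow, fast)]
envelope = spectral_envelope(log_mag, cutoff=10)
```

With real voices the envelope and the harmonic structure are not this cleanly separated in quefrency, which is one reason the text describes cepstral envelopes as an approximation.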
Academic References / 参考文献
- Flanagan, J. L., & Golden, R. M. (1966). Phase vocoder. Bell System Technical Journal, 45(9), 1493–1509. The original phase vocoder paper.
- Dolson, M. (1986). The phase vocoder: A tutorial. Computer Music Journal, 10(4), 14–27. Accessible tutorial covering instantaneous frequency estimation and overlap-add reconstruction.
- Laroche, J., & Dolson, M. (1999). Improved phase vocoder time-scale modification of audio. IEEE Transactions on Speech and Audio Processing, 7(3), 323–332. Introduces phase locking for improved polyphonic time-stretching.
- Portnoff, M. R. (1980). Time-frequency representation of digital signals and systems based on short-time Fourier analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1), 55–69. Rigorous derivation of COLA and perfect reconstruction conditions.
- Zölzer, U. (Ed.). (2011). DAFX: Digital Audio Effects (2nd ed.). Wiley. Chapter 7 covers the phase vocoder algorithm in engineering detail, including pitch shifting, time stretching, and transient handling.
- Gabor, D. (1946). Theory of communication. Journal of the Institution of Electrical Engineers, 93(26), 429–441. Foundational paper establishing the time-frequency uncertainty principle; the mathematical basis for EP36.
- Smith, J. O. (2011). Spectral Audio Signal Processing. W3K Publishing. Available online at https://ccrma.stanford.edu/~jos/sasp/. Comprehensive treatment of STFT, phase vocoder, and wavetable synthesis.