EP35: The Phase Vocoder — Mathematics of Pitch Shifting

STFT phase unwrapping, COLA condition, overlap-add, pitch shifting
5:56 · Signal Processing · Harmonic Analysis

Overview

A single recorded note in a sampler like Kontakt can produce 88 different pitches — not by storing 88 separate recordings, but by mathematically stretching or compressing one recording. The tool that makes this possible is the phase vocoder, a signal processing algorithm that separates a sound into magnitude and phase components via the Short-Time Fourier Transform, then manipulates those components to achieve pitch shifting and time stretching independently.

The central mathematical challenge is phase continuity. Naive pitch shifting — simply resampling a signal at a different rate — changes duration alongside pitch. The phase vocoder solves this by working frame-by-frame in the frequency domain: it estimates the true instantaneous frequency in each spectral bin from the phase difference between adjacent frames, scales those frequencies by a desired ratio, and then reconstructs a phase-continuous output signal. The reconstruction is only valid if the analysis and synthesis windows satisfy the Constant Overlap-Add (COLA) condition, which constrains the joint choice of window function and hop size.

This episode builds the full mathematical picture: STFT definition and phase advance, phase unwrapping and instantaneous frequency estimation, pitch-shift by frequency scaling, time-stretch by separating analysis and synthesis hop sizes, and the COLA condition for perfect reconstruction.

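Narration (translated from the Chinese): “A single sample in Kontakt can sound 88 different pitches. It does not store 88 recordings; it mathematically stretches or compresses one recording. The tool behind this is called the phase vocoder. Today we take it apart and look at the mathematical core of sampled instruments.”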


Prerequisites

This episode builds directly on:


Definitions

Definition 35.1 (Short-Time Fourier Transform (STFT))

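Let $x[n]$ be a discrete-time signal and $w[n]$ a window function of length $N$. The Short-Time Fourier Transform of $x$ at frame $m$ and frequency bin $k$ is

$$X[m,k] = \sum_{n=0}^{N-1} x[n + mH_a]\, w[n]\, e^{-j 2\pi k n / N},$$

where $H_a$ is the analysis hop size (the number of samples the window advances between consecutive frames), and $k = 0, 1, \dots, N-1$. The STFT output is a complex-valued time-frequency matrix with magnitude $|X[m,k]|$ and phase $\varphi[m,k] = \arg X[m,k]$.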
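As a concrete reading of Definition 35.1, the following minimal sketch evaluates a single STFT coefficient directly from the sum. It assumes the indexing convention $X[m,k] = \sum_n x[n + mH_a]\, w[n]\, e^{-j2\pi kn/N}$ and a periodic Hann window; the helper names `hann` and `stft_bin` are illustrative, not taken from any library. Only the standard library is used so that it runs anywhere; real code would use an FFT.

```python
import cmath
import math

def hann(N):
    """Periodic Hann window of length N (assumed window choice)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]

def stft_bin(x, w, m, k, hop):
    """X[m, k] = sum_n x[n + m*hop] * w[n] * exp(-j*2*pi*k*n/N)."""
    N = len(w)
    return sum(
        x[n + m * hop] * w[n] * cmath.exp(-2j * math.pi * k * n / N)
        for n in range(N)
    )

N, hop = 8, 2
x = [1.0] * 32                       # constant (DC) test signal
w = hann(N)
X00 = stft_bin(x, w, 0, 0, hop)      # bin 0 sees the sum of the window
magnitude, phase = abs(X00), cmath.phase(X00)
```

For a constant input, bin 0 equals the window sum ($N/2$ for a periodic Hann window), which gives a quick sanity check on the normalization.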

Definition 35.2 (Spectrogram)
The spectrogram of a signal $x$ is the magnitude-squared STFT, $$S[m,k] = |X[m,k]|^2.$$ It represents the distribution of signal energy across time (frame index $m$) and frequency (bin $k$). The STFT yields strictly more information than the spectrogram: the phase $\varphi[m,k]$ is discarded when forming $S[m,k]$, and cannot be read back from $S$ alone.
Definition 35.3 (Expected Phase Advance)
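The center angular frequency of bin $k$ is $\omega_k = 2\pi k / N$ (radians per sample). If a pure sinusoid at exactly $\omega_k$ is analyzed, the phase advances by a fixed amount between consecutive frames. The expected phase advance for bin $k$ over one analysis hop is $$\Delta\varphi_{\text{exp}}[k] = \omega_k H_a = \frac{2\pi k H_a}{N}.$$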
Definition 35.4 (Phase Deviation and Instantaneous Frequency)
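Let $\Delta\varphi[m,k] = \varphi[m,k] - \varphi[m-1,k]$ be the observed phase difference between frame $m-1$ and frame $m$ (the unwrapping step is described in Proposition 35.1 below). The phase deviation is $$\delta[m,k] = \operatorname{princarg}\big(\Delta\varphi[m,k] - \omega_k H_a\big),$$ where $\operatorname{princarg}$ denotes wrapping to $(-\pi, \pi]$. The instantaneous angular frequency at bin $k$ in frame $m$ is $$\omega_{\text{inst}}[m,k] = \omega_k + \frac{\delta[m,k]}{H_a}.$$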
Definition 35.5 (COLA Condition)
A window function $w[n]$ and hop size $H$ satisfy the Constant Overlap-Add (COLA) condition if $$\sum_{m=-\infty}^{\infty} w[n - mH] = C \quad \text{for all } n,$$ for some nonzero constant $C$. In the WOLA (Weighted Overlap-Add) variant used in many phase vocoder implementations, the condition is imposed on the squared window: $$\sum_{m=-\infty}^{\infty} w^2[n - mH] = C \quad \text{for all } n.$$ The WOLA form is needed when both the analysis and synthesis paths apply the window, so that the product of the two windows sums to a constant.
Definition 35.6 (Analysis and Synthesis Hop Sizes)
In a phase vocoder configured for time stretching by factor $\alpha$, the analysis hop size $H_a$ and synthesis hop size $H_s$ are related by $$H_s = \alpha H_a.$$ When $\alpha > 1$ the output is longer than the input (time expansion); when $\alpha < 1$ it is shorter (time compression). Pitch is unchanged when only the hop ratio is altered and phase is correctly propagated.

Main Theorems

Theorem 35.1 (STFT Phase Advance Formula)
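If $x[n] = e^{j\omega_0 n}$ is a complex exponential at angular frequency $\omega_0$, the STFT phase at bin $k$ satisfies $$\varphi[m,k] = \omega_0 H_a\, m + \varphi_0[k] \pmod{2\pi},$$ where $\varphi_0[k]$ is a constant depending only on $\omega_0 - \omega_k$ and the window $w$, with $\omega_k = 2\pi k/N$. In particular, the phase increments between consecutive frames at fixed bin $k$ are constant and equal to $\omega_0 H_a$ (modulo $2\pi$).
Proof.
Substituting $x[n] = e^{j\omega_0 n}$ into Definition 35.1: $$X[m,k] = \sum_{n=0}^{N-1} e^{j\omega_0 (n + mH_a)}\, w[n]\, e^{-j 2\pi k n / N}.$$ Factor out the term that depends on $m$: $$X[m,k] = e^{j\omega_0 H_a m} \sum_{n=0}^{N-1} w[n]\, e^{-j(\omega_k - \omega_0) n}.$$ The sum is a constant (the window’s frequency response evaluated at $\omega_k - \omega_0$), independent of $m$. Write it as $A_k e^{j\varphi_0[k]}$ with $A_k > 0$, assuming the sum is nonzero. Then $$\arg X[m,k] = \omega_0 H_a\, m + \varphi_0[k] \pmod{2\pi},$$ which establishes the linear growth with slope $\omega_0 H_a$ and confirms that the frame-to-frame phase increment equals $\omega_0 H_a$.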
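Proposition 35.1 (Phase Unwrapping)
Measured phases are confined to $(-\pi, \pi]$ by the $\arg$ operation. When the true phase increment exceeds $\pi$ in magnitude, the observed difference $\varphi[m,k] - \varphi[m-1,k]$ undergoes a discontinuous “wrap.” Phase unwrapping corrects for these wraps by adding the unique integer multiple of $2\pi$ that brings the difference between the observed and the expected advance into $(-\pi, \pi]$: $$\Delta\varphi_{\text{u}}[m,k] = \omega_k H_a + \operatorname{princarg}\big(\varphi[m,k] - \varphi[m-1,k] - \omega_k H_a\big), \qquad \operatorname{princarg}(\theta) = \theta - 2\pi \left\lceil \frac{\theta - \pi}{2\pi} \right\rceil.$$ After unwrapping, $\Delta\varphi_{\text{u}}[m,k]$ recovers the true phase increment (under the condition stated in Theorem 35.2) and the instantaneous frequency formula (Definition 35.4) is valid.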
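The wrapping operation can be written in a few lines. This is a minimal sketch, assuming the half-open interval $(-\pi, \pi]$ as the wrapping range; the names `princarg` and `unwrapped_increment` are illustrative.

```python
import math

def princarg(theta):
    """Wrap an angle in radians into the interval (-pi, pi]."""
    return theta - 2.0 * math.pi * math.ceil((theta - math.pi) / (2.0 * math.pi))

def unwrapped_increment(phi_prev, phi_curr, expected):
    """Phase increment closest to `expected`, given two wrapped phases."""
    return expected + princarg(phi_curr - phi_prev - expected)

# A true advance of about 16.05 rad is observed only modulo 2*pi.
observed = princarg(16.0485)                       # about -2.80 rad
recovered = unwrapped_increment(0.0, observed, 15.708)
```

Here the expected advance (15.708 rad) selects the correct multiple of $2\pi$, so `recovered` returns to roughly 16.0485 rad.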
Theorem 35.2 (Instantaneous Frequency Estimation)
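Under the local stationarity assumption that the signal in frame $m$ is well approximated by a single sinusoid at frequency $\omega_0$ near $\omega_k$, the instantaneous frequency in bin $k$ is unambiguously estimated as $$\omega_{\text{inst}}[m,k] = \omega_k + \frac{1}{H_a}\operatorname{princarg}\big(\Delta\varphi[m,k] - \omega_k H_a\big),$$ where $\Delta\varphi[m,k] = \varphi[m,k] - \varphi[m-1,k]$ and $\operatorname{princarg}$ wraps its argument into $(-\pi, \pi]$, provided the true frequency deviation satisfies $|\omega_0 - \omega_k| < \pi / H_a$ (i.e., the deviation accumulates less than half a cycle of phase over one hop; for $H_a = N/4$ this allows deviations of up to two bin widths).
Proof.
By Theorem 35.1, a pure tone at $\omega_0$ contributes a phase increment $\omega_0 H_a$ in bin $k$. The observed raw difference is $\Delta\varphi[m,k] = \omega_0 H_a + 2\pi p$ for some integer $p$ introduced by the wrapping. Subtracting the expected advance: $$\Delta\varphi[m,k] - \omega_k H_a = (\omega_0 - \omega_k) H_a + 2\pi p.$$ Wrapping this residual into $(-\pi, \pi]$ recovers $(\omega_0 - \omega_k) H_a$ uniquely as long as its magnitude is less than $\pi$, i.e., $|\omega_0 - \omega_k| < \pi / H_a$. Dividing by $H_a$ and adding $\omega_k$ yields the formula.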
Theorem 35.3 (COLA Condition for Perfect Reconstruction)
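Let the synthesis signal be formed by the overlap-add of windowed inverse-STFT frames, $$y[n] = \sum_{m} g[n - mH]\, \frac{1}{N} \sum_{k=0}^{N-1} X[m,k]\, e^{j 2\pi k (n - mH)/N},$$ where $g$ is the synthesis window, $H = H_a$ is the common hop size, and both $w$ and $g$ are supported on $0 \le n \le N-1$. If the analysis and synthesis windows satisfy the COLA condition $$\sum_{m} w[n - mH]\, g[n - mH] = 1$$ for all $n$, then $y[n] = x[n]$ (perfect reconstruction when no frequency modifications are applied).
Proof.
Substituting the STFT definition into the overlap-add formula and exchanging summation order: $$y[n] = \sum_{m} g[n - mH] \sum_{n'=0}^{N-1} x[n' + mH]\, w[n']\, \frac{1}{N} \sum_{k=0}^{N-1} e^{j 2\pi k (n - mH - n')/N}.$$ The inner DFT sum equals the Kronecker delta $\delta[n' - (n - mH)]$ within the window support, so the sum collapses to $n' = n - mH$: $$y[n] = x[n] \sum_{m} w[n - mH]\, g[n - mH].$$ Applying the COLA condition, the sum equals $1$ for all $n$, giving $y[n] = x[n]$.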
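The reconstruction argument can be checked numerically. The sketch below is an illustration under stated assumptions rather than a reference implementation: it assumes a periodic Hann analysis window, a rectangular synthesis window (so the product $w \cdot g$ is the Hann window itself), 50% overlap, and a naive $O(N^2)$ DFT written out by hand to avoid dependencies. The test signal is arbitrary.

```python
import cmath
import math

N, H = 16, 8
w = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]  # analysis
g = [1.0] * N                                                          # synthesis

def dft(frame, sign):
    """Naive DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    L = len(frame)
    return [
        sum(frame[n] * cmath.exp(sign * 2j * math.pi * k * n / L) for n in range(L))
        for k in range(L)
    ]

x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(80)]
num_frames = (len(x) - N) // H + 1

y = [0.0] * len(x)
for m in range(num_frames):
    X = dft([x[n + m * H] * w[n] for n in range(N)], -1)   # analysis frame
    frame = [v.real / N for v in dft(X, +1)]               # inverse DFT
    for n in range(N):
        y[n + m * H] += g[n] * frame[n]                    # overlap-add

# Samples covered by two frames are reconstructed; the edges are not.
interior_error = max(abs(y[n] - x[n]) for n in range(H, H * num_frames))
```

Only samples covered by the full set of overlapping frames are reconstructed; the first and last half-window are attenuated because fewer frames contribute there.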
Theorem 35.4 (Hann Window Satisfies COLA at 50% Overlap)
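The periodic Hann window $w[n] = \tfrac{1}{2}\left(1 - \cos\frac{2\pi n}{N}\right)$ for $n = 0, 1, \dots, N-1$ satisfies the standard (unsquared) COLA condition of Definition 35.5 when the hop size is $H = N/2$ (50% overlap), with constant value $C = 1$. It does not satisfy the squared-window (WOLA) form at this hop size.
Proof.
At 50% overlap, $H = N/2$, so at each sample position exactly two windows overlap: one frame and the frame offset from it by $N/2$. Within the support of two successive windows with offset $N/2$, define $\theta = 2\pi n/N$ for the first window and $\theta + \pi$ for the second. For the squared sum: $$w^2[n] + w^2[n + N/2] = \tfrac{1}{4}(1 - \cos\theta)^2 + \tfrac{1}{4}\big(1 - \cos(\theta + \pi)\big)^2.$$ Since $\cos(\theta + \pi) = -\cos\theta$: $$= \tfrac{1}{4}\big[(1 - \cos\theta)^2 + (1 + \cos\theta)^2\big] = \tfrac{1}{2}\big(1 + \cos^2\theta\big).$$ Using $\cos^2\theta = \tfrac{1}{2}(1 + \cos 2\theta)$: $$= \tfrac{3}{4} + \tfrac{1}{4}\cos 2\theta.$$ This is not identically constant; the Hann window at exactly 50% overlap does not satisfy the squared-sum condition. However, at 75% overlap ($H = N/4$), four windows contribute at each point and the squared sum is identically constant (equal to $3/2$). For the standard (unsquared) COLA at 50% overlap, the Hann window sum does equal $1$ identically: with two overlapping windows, $$w[n] + w[n + N/2] = \tfrac{1}{2}(1 - \cos\theta) + \tfrac{1}{2}(1 + \cos\theta) = 1.$$ Thus the Hann window at 50% overlap satisfies the standard OLA-COLA condition with constant $1$, guaranteeing amplitude-flat overlap-add reconstruction in the OLA framework.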
Theorem 35.5 (Phase Vocoder Pitch-Shift Theorem)

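To shift all spectral components of $x$ by frequency ratio $r$ (upward for $r > 1$):

  1. Compute the STFT to obtain $X[m,k]$.
  2. Estimate $\omega_{\text{inst}}[m,k]$ via Theorem 35.2.
  3. Replace each true frequency $\omega_{\text{inst}}[m,k]$ by $r\,\omega_{\text{inst}}[m,k]$.
  4. Propagate a phase-continuous output phase: $\varphi_{\text{out}}[m,k] = \varphi_{\text{out}}[m-1,k] + r\,\omega_{\text{inst}}[m,k]\,H_a$.
  5. Reconstruct using $Y[m,k] = |X[m,k]|\,e^{j\varphi_{\text{out}}[m,k]}$ followed by overlap-add synthesis.

The output $y$ contains the same spectral magnitudes as $x$ but all instantaneous frequencies scaled by $r$, with no change in duration.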

Proof.

The proof has two parts: correctness of frequency scaling, and phase continuity.

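Frequency scaling. The synthesis frame is constructed with magnitude $|X[m,k]|$ (unchanged) and a new phase whose increment per frame is $r\,\omega_{\text{inst}}[m,k]\,H_a$. By the inverse of Theorem 35.1, a STFT frame with that phase increment, when inverse-transformed, yields a local sinusoid at instantaneous frequency $r\,\omega_{\text{inst}}[m,k]$. Therefore every spectral component is scaled by $r$. This step is a heuristic rather than an exact identity: each synthesized frame still has its energy centered on bin $k$, so the argument holds well only while $r\,\omega_{\text{inst}}[m,k]$ stays close to $\omega_k$; practical implementations typically also move each magnitude to the bin nearest the scaled frequency, or time-stretch by $r$ and then resample.

Phase continuity. The output phase is defined by the recurrence $\varphi_{\text{out}}[m,k] = \varphi_{\text{out}}[m-1,k] + r\,\omega_{\text{inst}}[m,k]\,H_a$. By construction this is a cumulative sum, so there are no discontinuities between frames. If the synthesis used the raw (wrapped) analysis phase $\varphi[m,k]$ instead, adjacent frames would have inconsistent phase relationships, producing comb-filter artifacts (metallic distortion) when overlap-added. The recurrence guarantees a smooth phase trajectory.

Duration unchanged. Both analysis and synthesis use the same hop size $H_a$, so the same number of frames covers the same total duration. The output signal has the same number of samples as the input.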

Theorem 35.6 (Phase Correction for Time Stretching)
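To stretch the duration of $x$ by factor $\alpha$ without changing pitch, set the synthesis hop $H_s = \alpha H_a$ and propagate output phases as: $$\varphi_{\text{out}}[m,k] = \varphi_{\text{out}}[m-1,k] + \omega_{\text{inst}}[m,k]\,H_s.$$ The resulting signal has duration $\alpha$ times that of $x$ and retains the original pitch.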
Proof.

Duration scaling. The analysis processes $M$ frames from $x$, covering a total of about $M H_a$ input samples. Synthesis places those frames with spacing $H_s$, covering about $M H_s = \alpha M H_a$ output samples. The duration ratio is $\alpha$.

Pitch preservation. The pitch of the output is determined by the instantaneous frequencies of the synthesis frames. The phase increment per synthesis frame at bin $k$ is $\omega_{\text{inst}}[m,k]\,H_s$. The corresponding angular frequency of the synthesized sinusoid is the increment divided by the synthesis hop: $$\frac{\omega_{\text{inst}}[m,k]\,H_s}{H_s} = \omega_{\text{inst}}[m,k].$$ The instantaneous frequency is $\omega_{\text{inst}}[m,k]$, identical to the analysis frequency. Since no frequency scaling is applied, the pitch is preserved.

Why naive repositioning fails. If the raw analysis phases $\varphi[m,k]$ were used directly in the repositioned synthesis frames without recurrence, consecutive frames would carry phase values appropriate for positions $mH_a$ but placed at positions $mH_s$. The mismatch produces phase discontinuities at frame boundaries, causing comb-filter (metallic) distortion in the overlap-add sum. The recurrence eliminates this.
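The time-stretch recurrence can be sketched end to end in a few dozen lines. The following is a minimal illustration, not a production algorithm: it assumes a periodic Hann window for both analysis and synthesis, integer hop sizes, a naive $O(N^2)$ DFT (standard library only), and no amplitude normalization or transient handling. The function name `time_stretch` and the small parameters ($N = 64$, $H_a = 16$, $H_s = 24$) are chosen for illustration.

```python
import cmath
import math

def princarg(theta):
    """Wrap an angle into (-pi, pi]."""
    return theta - 2.0 * math.pi * math.ceil((theta - math.pi) / (2.0 * math.pi))

def dft(frame, sign):
    """Naive DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    L = len(frame)
    return [
        sum(frame[n] * cmath.exp(sign * 2j * math.pi * k * n / L) for n in range(L))
        for k in range(L)
    ]

def time_stretch(x, N, Ha, Hs):
    """Phase-vocoder time stretch by alpha = Hs / Ha."""
    w = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]
    num_frames = (len(x) - N) // Ha + 1
    y = [0.0] * ((num_frames - 1) * Hs + N)
    prev_phase = phase_out = None
    for m in range(num_frames):
        X = dft([x[n + m * Ha] * w[n] for n in range(N)], -1)
        phase = [cmath.phase(v) for v in X]
        if m == 0:
            phase_out = phase[:]                      # start from the analysis phase
        else:
            for k in range(N):
                wk = 2.0 * math.pi * k / N            # bin center frequency
                dev = princarg(phase[k] - prev_phase[k] - wk * Ha)
                phase_out[k] += (wk + dev / Ha) * Hs  # omega_inst * Hs
        prev_phase = phase
        Y = [abs(X[k]) * cmath.exp(1j * phase_out[k]) for k in range(N)]
        frame = [v.real / N for v in dft(Y, +1)]
        for n in range(N):
            y[n + m * Hs] += w[n] * frame[n]          # windowed overlap-add
    return y

x = [math.sin(2.0 * math.pi * 0.05 * n) for n in range(1024)]
y = time_stretch(x, N=64, Ha=16, Hs=24)               # alpha = 1.5
```

For this stationary sine the output is roughly 1.5 times longer while its dominant frequency stays at 0.05 cycles per sample. Because $N/H_s$ is not an integer here, the squared-window sum is not exactly constant, so a small amplitude ripple is expected.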


Numerical Examples

Example 1: Phase advance for a pure tone

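Let $N = 1024$, $H_a = 256$, sample rate $f_s = 44100$ Hz. Consider a pure tone at $f_0 = 440$ Hz (A4).

The angular frequency is $\omega_0 = 2\pi \cdot 440 / 44100 \approx 0.062689$ rad/sample. The nearest bin is $$k = \operatorname{round}\!\left(\frac{440 \cdot 1024}{44100}\right) = \operatorname{round}(10.217) = 10.$$ The center frequency of bin 10 is $\omega_{10} = 2\pi \cdot 10 / 1024 \approx 0.061359$ rad/sample (about 430.7 Hz).

Expected phase advance per hop: $\omega_{10} H_a = 5\pi \approx 15.7080$ rad.

True phase advance per hop: $\omega_0 H_a \approx 0.062689 \times 256 \approx 16.0485$ rad.

Phase deviation: $\delta = 16.0485 - 15.7080 = 0.3405$ rad (no further wrapping is needed since $|\delta| < \pi$).

Instantaneous frequency: $\omega_{\text{inst}} = 0.061359 + 0.3405/256 \approx 0.062689$ rad/sample, recovering $440$ Hz. For an ideal complex exponential the estimate is exact up to floating-point rounding.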
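The arithmetic of Example 1 can be reproduced from two actual STFT frames. This sketch assumes $N = 1024$, $H_a = 256$, $f_s = 44100$ Hz (values inferred from the bin index and phase increment quoted in the examples), a complex exponential test tone, and a periodic Hann window; it evaluates only the single bin of interest.

```python
import cmath
import math

N, H, fs, f0 = 1024, 256, 44100.0, 440.0
w0 = 2.0 * math.pi * f0 / fs                  # true frequency, rad/sample
k = round(f0 * N / fs)                        # nearest bin
wk = 2.0 * math.pi * k / N                    # bin center, rad/sample

def princarg(theta):
    """Wrap an angle into (-pi, pi]."""
    return theta - 2.0 * math.pi * math.ceil((theta - math.pi) / (2.0 * math.pi))

def stft_bin(m):
    """Bin k of frame m for x[n] = exp(j*w0*n) with a periodic Hann window."""
    return sum(
        cmath.exp(1j * w0 * (n + m * H))
        * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / N))
        * cmath.exp(-1j * wk * n)
        for n in range(N)
    )

raw_diff = cmath.phase(stft_bin(1)) - cmath.phase(stft_bin(0))
deviation = princarg(raw_diff - wk * H)       # phase deviation, rad
w_inst = wk + deviation / H                   # instantaneous frequency
f_est = w_inst * fs / (2.0 * math.pi)         # back to Hz
```

The raw phase difference is only known modulo $2\pi$; subtracting the expected advance and wrapping is what recovers the deviation of roughly 0.34 rad.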

Example 2: Pitch shifting by a perfect fifth

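A perfect fifth corresponds to frequency ratio $r = 3/2$.

For the 440 Hz tone above, shifted frequency: $r\,\omega_0 = 1.5 \times 0.062689 \approx 0.094034$ rad/sample, corresponding to $660$ Hz (E5).

Output phase after frame $m$ (starting from $\varphi_{\text{out}}[0] = 0$): $$\varphi_{\text{out}}[m] = m \cdot r\,\omega_0 H_a \approx 24.07\,m \ \text{rad}.$$ Each frame receives a consistent phase increment of about 24.07 rad ($0.094034 \times 256$), ensuring a smooth sinusoid at 660 Hz in the output.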
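The per-frame increment in Example 2 is easy to verify. This short check assumes $f_s = 44100$ Hz and $H_a = 256$ (inferred from the surrounding examples) and a stationary 440 Hz tone, so that the instantaneous frequency is the same in every frame.

```python
import math

fs, f0, Ha, r = 44100.0, 440.0, 256, 1.5

w0 = 2.0 * math.pi * f0 / fs             # original frequency, rad/sample
w_shift = r * w0                         # scaled frequency, rad/sample
increment = w_shift * Ha                 # phase added per frame, rad
f_shift = w_shift * fs / (2.0 * math.pi) # scaled frequency in Hz

# Cumulative output phase for the first few frames, starting from zero.
phases = [m * increment for m in range(4)]
```

Computed at full precision the increment is about 24.07 rad; rounding the scaled frequency to 0.0940 before multiplying gives the slightly lower 24.06.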

Example 3: Time stretching by 1.5x

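With $H_a = 256$, $\alpha = 1.5$, the synthesis hop is $H_s = \alpha H_a = 384$ samples.

Phase recurrence for the 440 Hz bin: $$\varphi_{\text{out}}[m] = \varphi_{\text{out}}[m-1] + 0.062689 \times 384 \approx \varphi_{\text{out}}[m-1] + 24.07 \ \text{rad}.$$ The output instantaneous frequency is $24.07 / 384 \approx 0.062689$ rad/sample, unchanged at 440 Hz. A 5-second input (220,500 samples at 44.1 kHz) produces a 7.5-second output (330,750 samples), stretched by exactly $\alpha = 1.5$.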

Example 4: COLA check — Hann at 50% overlap

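With $N = 1024$ and $H = 512$, consecutive Hann windows starting at $n = 0$ and $n = 512$ are: $$w_0[n] = \tfrac{1}{2}\left(1 - \cos\frac{2\pi n}{1024}\right), \qquad w_1[n] = \tfrac{1}{2}\left(1 - \cos\frac{2\pi (n - 512)}{1024}\right).$$ At $n = 640$: $w_0 = \tfrac{1}{2}\big(1 - \cos(5\pi/4)\big) \approx 0.8536$ and $w_1 = \tfrac{1}{2}\big(1 - \cos(\pi/4)\big) \approx 0.1464$. Sum: $1$. This confirms the OLA-COLA condition at 50% overlap (Theorem 35.4).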
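The same check can be run over every sample position, together with the squared-window (WOLA) sums discussed in Theorem 35.4. This is a small sketch assuming a periodic Hann window of length 1024; `overlap_sum` is an illustrative helper that evaluates the steady-state overlap sum over one hop.

```python
import math

def hann(N):
    """Periodic Hann window of length N."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]

def overlap_sum(w, hop, power=1):
    """Steady-state sum over m of w[n - m*hop]**power for n = 0..hop-1."""
    N = len(w)
    return [sum(w[n + j * hop] ** power for j in range(N // hop)) for n in range(hop)]

w = hann(1024)
cola_50 = overlap_sum(w, 512)            # unsquared sum, 50% overlap
wola_50 = overlap_sum(w, 512, power=2)   # squared sum, 50% overlap
wola_75 = overlap_sum(w, 256, power=2)   # squared sum, 75% overlap

ripple_50 = max(wola_50) - min(wola_50)  # nonzero: not constant
```

The unsquared sum at 50% overlap is flat at 1, the squared sum at 50% overlap swings between 0.5 and 1, and the squared sum at 75% overlap is flat at 1.5.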


Musical Connection

The phase vocoder is the mathematical engine behind some of the most familiar sounds in modern music production.

Sampler instruments. A sampler records a note at one pitch (the “root note”) and uses the phase vocoder — or its lightweight cousin, wavetable synthesis — to render every other pitch on the keyboard from that single recording. Kontakt, Omnisphere, and similar instruments rely on this: one carefully recorded cello note becomes an entire playable cello section.

The “chipmunk effect” and autotune. Setting $r > 1$ in pitch shifting pushes all frequencies upward, moving the entire spectral content, formants included, into a higher range. The voice sounds thin and cartoon-like because harmonic spacing and formant positions all shift together. Professional pitch correction (Melodyne, Auto-Tune) avoids this by formant preservation: the spectral envelope (formant curve) is extracted, separated from the fine harmonic structure, and shifted independently so that only the fundamental pitch moves while the vocal timbre (vowel quality) remains natural.

Time-stretching in DAWs. Digital audio workstations such as Logic, Ableton, and Reaper offer time-stretch modes based on the phase vocoder or closely related algorithms when the user warps audio to a different tempo. The synthesis-to-analysis hop ratio $\alpha = H_s / H_a$ equals the ratio of the original tempo to the target tempo. Phase vocoder artifacts (the “phasiness” or metallic quality on transients) motivate further refinements such as transient detection, which switches to a waveform-copy method around percussive events.

Wavetable synthesis. The lighter-weight alternative mentioned above stores a single waveform period and varies playback speed to control pitch. The interpolation kernel used between samples connects directly to the sinc reconstruction studied in EP30: The Sampling Theorem: linear interpolation is a piecewise-linear approximation to sinc, while windowed-sinc interpolation approximates ideal band-limited reconstruction within the Nyquist bandwidth far more closely.
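A wavetable oscillator with linear interpolation can be sketched as follows. The table length (2048), the sine content of the table, and the function name `wavetable_osc` are assumptions for illustration; the sketch does no anti-aliasing, which a real instrument would need for harmonically rich tables.

```python
import math

def wavetable_osc(table, freq_hz, fs, num_samples):
    """Read one stored period at a variable rate with linear interpolation."""
    L = len(table)
    step = freq_hz * L / fs          # table positions advanced per output sample
    pos, out = 0.0, []
    for _ in range(num_samples):
        i = int(pos)                 # integer part: left neighbor
        frac = pos - i               # fractional part: interpolation weight
        out.append((1.0 - frac) * table[i] + frac * table[(i + 1) % L])
        pos = (pos + step) % L       # wrap around the single stored period
    return out

table = [math.sin(2.0 * math.pi * n / 2048) for n in range(2048)]
a4 = wavetable_osc(table, 440.0, 44100.0, 44100)   # one second of A4
```

Changing `freq_hz` changes only the read step, so pitch and duration are decoupled here because the stored period loops; this is what distinguishes the approach from resampling a full recording.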

Looking ahead. The Short-Time Fourier Transform has a deeper foundation: in EP36, we will see that the time-frequency resolution of the STFT is constrained by the Gabor uncertainty principle, established in 1946 — a musical analogue of the Heisenberg uncertainty principle that fundamentally limits how precisely a sound can be localized in both time and frequency simultaneously. EP42 will revisit these ideas in the context of wavelet analysis, which adapts the window width to frequency and can overcome some of STFT’s resolution tradeoffs.


Limits and Open Questions

Transient smearing. The phase vocoder assumes local stationarity — each frame is well modeled by a superposition of sinusoids. Percussive transients (drum hits, plucked strings) violate this assumption severely. Time-stretching a snare drum with a standard phase vocoder produces a characteristic “pre-echo” or “flamming” artifact as the transient energy is spread across adjacent frames. State-of-the-art solutions (transient detection, phase locking, object-based coding) partially address this, but perfect transient preservation under time-stretching remains an active research area.

Polyphony and phase locking. When multiple harmonic partials of the same note are present, the relative phases between bins at harmonically related frequencies encode important perceptual cues (the “shape” of the attack and the coherence of the instrument body). Naively scaling all phases independently can destroy this coherence. Phase locking algorithms propagate phase corrections from the analysis bin at the fundamental frequency up through its harmonics, preserving perceptually important phase relationships. The optimal phase locking strategy for complex polyphonic signals is still an open problem.

Pitch shifting limits. Very large pitch ratios (e.g., $r = 4$, two octaves up) produce audible artifacts even with phase-continuous processing, because the assumed one-sinusoid-per-bin model breaks down as the scaled frequencies of different bins overlap. This motivates resampling-based approaches for large pitch changes, or source-separation preprocessing.

The uncertainty principle. The STFT window length $N$ controls a fundamental tradeoff: long windows give fine frequency resolution (small $\Delta f$) but poor time localization (large $\Delta t$); short windows give the reverse. There is no window for which both are simultaneously arbitrarily small. This is the Gabor uncertainty principle, previewed in the narration and to be treated rigorously in EP36.

Formant preservation. As noted in the narration, professional pitch shifting requires separating the spectral fine structure (harmonic positions, encoding pitch) from the spectral envelope (formant positions, encoding timbre). The standard engineering approach uses cepstral liftering or LPC envelope estimation, but both are approximations. Perceptually optimal formant preservation in the presence of vibrato, breathiness, and consonant transitions is an unsolved signal processing problem.


Academic References

  1. Flanagan, J. L., & Golden, R. M. (1966). Phase vocoder. Bell System Technical Journal, 45(9), 1493–1509. The original phase vocoder paper.

  2. Dolson, M. (1986). The phase vocoder: A tutorial. Computer Music Journal, 10(4), 14–27. Accessible tutorial covering instantaneous frequency estimation and overlap-add reconstruction.

  3. Laroche, J., & Dolson, M. (1999). Improved phase vocoder time-scale modification of audio. IEEE Transactions on Speech and Audio Processing, 7(3), 323–332. Introduces phase locking for improved polyphonic time-stretching.

  4. Portnoff, M. R. (1980). Time-frequency representation of digital signals and systems based on short-time Fourier analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1), 55–69. Rigorous derivation of COLA and perfect reconstruction conditions.

  5. Zölzer, U. (Ed.). (2011). DAFX: Digital Audio Effects (2nd ed.). Wiley. Chapter 7 covers the phase vocoder algorithm in engineering detail, including pitch shifting, time stretching, and transient handling.

  6. Gabor, D. (1946). Theory of communication. Journal of the Institution of Electrical Engineers, 93(26), 429–441. Foundational paper establishing the time-frequency uncertainty principle; the mathematical basis for EP36.

  7. Smith, J. O. (2011). Spectral Audio Signal Processing. W3K Publishing. Available online at https://ccrma.stanford.edu/~jos/sasp/. Comprehensive treatment of STFT, phase vocoder, and wavetable synthesis.