
EP35: The Phase Vocoder — Mathematics of Pitch Shifting

STFT phase unwrapping, COLA condition, overlap-add, pitch shifting

Runtime 5:56 · Signal Processing · Harmonic Analysis

Prerequisites

EP30 The Sampling Theorem Is Not What You Think

Follow-up episodes

EP36 Quanta of Sound — Gabor and the Uncertainty Principle
EP42 Chord Recognition — From Chroma Vectors to Viterbi Decoding

Overview

A single recorded note in a sampler like Kontakt can produce 88 different pitches — not by storing 88 separate recordings, but by mathematically stretching or compressing one recording. The tool that makes this possible is the phase vocoder, a signal processing algorithm that separates a sound into magnitude and phase components via the Short-Time Fourier Transform, then manipulates those components to achieve pitch shifting and time stretching independently.

The central mathematical challenge is phase continuity. Naive pitch shifting — simply resampling a signal at a different rate — changes duration alongside pitch. The phase vocoder solves this by working frame-by-frame in the frequency domain: it estimates the true instantaneous frequency in each spectral bin from the phase difference between adjacent frames, scales those frequencies by a desired ratio, and then reconstructs a phase-continuous output signal. The reconstruction is only valid if the analysis and synthesis windows satisfy the Constant Overlap-Add (COLA) condition, which constrains the joint choice of window function and hop size.

This episode builds the full mathematical picture: STFT definition and phase advance, phase unwrapping and instantaneous frequency estimation, pitch-shift by frequency scaling, time-stretch by separating analysis and synthesis hop sizes, and the COLA condition for perfect reconstruction.

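Narration (translated from Chinese): “In Kontakt, a single sample can sound at 88 different pitches. It does not store 88 recordings; it takes one recording and mathematically stretches or compresses it. The tool behind this is called the phase vocoder. Today we take it apart and look at the mathematical core of sampled instruments.”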

Prerequisites

This episode builds directly on:

EP30: The Sampling Theorem: sinc reconstruction, bandlimited signals, perfect reconstruction from discrete samples. The wavetable synthesis section of this episode uses windowed-sinc interpolation, which approximates the ideal reconstruction kernel derived in EP30.
Fourier analysis (EP02): the STFT is the Fourier transform applied locally through a sliding window. All notation carries over: $\omega$ for angular frequency, $X[k]$ for DFT bins.
Complex exponentials: phase is the argument of a complex number $X = |X| e^{i\phi}$; the key operations are extracting $\phi = \arg(X)$ and tracking how $\phi$ evolves between frames.

Definitions

Definition 35.1 (Short-Time Fourier Transform (STFT))

Let $x[n]$ be a discrete-time signal and $w[n]$ a window function of length $L$. The Short-Time Fourier Transform of $x$ at frame $m$ and frequency bin $k$ is

$X[k,m] = \sum_{n=0}^{L-1} x[n + m \cdot H_a]\, w[n]\, e^{-i\, 2\pi k n / L},$

where $H_a$ is the analysis hop size (the number of samples the window advances between consecutive frames), and $k = 0, 1, \ldots, L-1$. The STFT output is a complex-valued time-frequency matrix with magnitude $|X[k,m]|$ and phase $\phi[k,m] = \arg(X[k,m])$.
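As a minimal sketch of Definition 35.1, the following NumPy code computes the STFT frame by frame. The function name `stft`, the test tone, and the choice of a periodic Hann window are illustrative assumptions, not part of the episode; frames that would run past the end of the signal are simply dropped.

```python
import numpy as np

def stft(x, window, hop):
    """STFT per Definition 35.1: X[k, m] = sum_n x[n + m*hop] * w[n] * exp(-2j*pi*k*n/L)."""
    L = len(window)
    n_frames = 1 + (len(x) - L) // hop            # keep only frames that fit inside x
    X = np.empty((L, n_frames), dtype=complex)
    for m in range(n_frames):
        frame = x[m * hop : m * hop + L] * window
        X[:, m] = np.fft.fft(frame)               # np.fft.fft uses the same exp(-2j*pi*k*n/L) kernel
    return X

# Illustrative parameters (assumed): one second of a 440 Hz tone at 44.1 kHz.
fs, L, hop = 44100, 1024, 256
n = np.arange(fs)
x = np.cos(2 * np.pi * 440 * n / fs)
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))   # periodic Hann window

X = stft(x, w, hop)
magnitude, phase = np.abs(X), np.angle(X)          # |X[k,m]| and phi[k,m]
peak_bin = int(np.argmax(magnitude[: L // 2, 0]))  # strongest positive-frequency bin in frame 0
print(X.shape, peak_bin)
```

With these parameters the strongest bin should be the one nearest 440 Hz, since the bin spacing is $f_s / L \approx 43$ Hz.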

Definition 35.2 (Spectrogram)

The spectrogram of a signal is the magnitude-squared STFT,

$\mathcal{S}[k,m] = |X[k,m]|^2.$

It represents the distribution of signal energy across time (frame index $m$) and frequency (bin $k$). The STFT carries strictly more information than the spectrogram: the phase $\phi[k,m]$ is discarded when forming $\mathcal{S}$ and is lost.

Definition 35.3 (Expected Phase Advance)

The center angular frequency of bin $k$ is $\omega_k = 2\pi k / L$. If a pure sinusoid at exactly $\omega_k$ is analyzed, the phase advances by a fixed amount between consecutive frames. The expected phase advance for bin $k$ over one analysis hop $H_a$ is

$\Delta\phi_{\text{expected}}[k] = \omega_k \cdot H_a = \frac{2\pi k H_a}{L}.$

Definition 35.4 (Phase Deviation and Instantaneous Frequency)

Let $\Delta\phi_{\text{actual}}[k,m] = \phi[k,m] - \phi[k,m-1]$ be the observed phase difference between frame $m$ and frame $m-1$, after phase unwrapping (see Proposition 35.1 below). The phase deviation is

$\delta\phi[k,m] = \mathcal{W}\bigl(\Delta\phi_{\text{actual}}[k,m] - \Delta\phi_{\text{expected}}[k]\bigr),$

where $\mathcal{W}(\cdot)$ denotes wrapping to $(-\pi, \pi]$. The instantaneous angular frequency at bin $k$ in frame $m$ is

$\omega_{\text{true}}[k,m] = \omega_k + \frac{\delta\phi[k,m]}{H_a}.$

Definition 35.5 (COLA Condition)

A window function $w[n]$ and hop size $H$ satisfy the Constant Overlap-Add (COLA) condition if

$\sum_{m=-\infty}^{\infty} w[n - m \cdot H] = C \quad \forall\, n,$

for some nonzero constant $C$. In the WOLA (Weighted Overlap-Add) variant used in many phase vocoder implementations, the condition is imposed on the squared window:

$\sum_{m=-\infty}^{\infty} w^2[n - m \cdot H] = C \quad \forall\, n.$

The WOLA form is needed when both the analysis and synthesis paths apply the window, so that the product of the two windows sums to a constant.

Definition 35.6 (Analysis and Synthesis Hop Sizes)

In a phase vocoder configured for time stretching by factor $\alpha > 0$, the analysis hop size $H_a$ and synthesis hop size $H_s$ are related by

$H_s = \alpha \cdot H_a.$

When $\alpha > 1$ the output is longer than the input (time expansion); when $\alpha < 1$ it is shorter (time compression). Pitch is unchanged when only the hop ratio is altered and phase is correctly propagated.

Main Theorems

Theorem 35.1 (STFT Phase Advance Formula)

If $x[n] = A e^{i\omega_0 n}$ is a complex exponential at angular frequency $\omega_0$, the STFT phase at bin $k$ satisfies

$\phi[k, m] \equiv \phi[k, 0] + m \cdot \omega_0 \cdot H_a \pmod{2\pi}, \qquad \phi[k,0] = \arg(A) + \psi_k,$

where $\psi_k$ is a constant that depends on $k$, $\omega_0$, and the window $w$, but not on $m$. In particular, the phase increments between consecutive frames at fixed bin $k$ are constant and equal to $\omega_0 H_a$ (modulo $2\pi$).
Proof.

Substituting $x[n + mH_a] = A e^{i\omega_0(n + mH_a)}$ into Definition 35.1:

$X[k,m] = \sum_{n=0}^{L-1} A e^{i\omega_0(n+mH_a)}\, w[n]\, e^{-i2\pi kn/L}.$

Factor out the term that depends on $m$:

$X[k,m] = A e^{i\omega_0 m H_a} \sum_{n=0}^{L-1} w[n]\, e^{i(\omega_0 - \omega_k)n}.$

The sum is a constant $G_k \in \mathbb{C}$ (the window’s frequency response evaluated at the offset between $\omega_0$ and $\omega_k$), independent of $m$. Assuming $G_k \neq 0$, write $G_k = |G_k| e^{i\psi_k}$. Then, modulo $2\pi$,

$\phi[k,m] = \arg(X[k,m]) = \omega_0 m H_a + \arg(A) + \psi_k,$

which establishes the linear growth with slope $\omega_0 H_a$ and confirms that the frame-to-frame phase increment equals $\omega_0 H_a$. $\square$
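The phase-advance formula can be checked numerically. The sketch below, under assumed parameters (a 440 Hz complex exponential at 44.1 kHz, a periodic Hann window, $L = 1024$, $H_a = 256$), compares the measured frame-to-frame phase increment in a few bins near the tone against $\omega_0 H_a$ reduced to $(-\pi, \pi]$.

```python
import numpy as np

# Assumed illustrative parameters; any omega0, L, and hop would do.
L, hop = 1024, 256
omega0 = 2 * np.pi * 440 / 44100                   # rad/sample
n = np.arange(L + 3 * hop)
x = 0.8 * np.exp(1j * (omega0 * n + 0.3))          # A = 0.8 * exp(0.3j)
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))

# Four consecutive STFT frames, following Definition 35.1.
frames = [np.fft.fft(x[m * hop : m * hop + L] * w) for m in range(4)]

# Phase increment between consecutive frames for a few bins near the tone.
bins = [9, 10, 11, 12]
increments = np.array([
    [np.angle(frames[m][k] * np.conj(frames[m - 1][k])) for m in range(1, 4)]
    for k in bins
])

# omega0 * hop is about 16.05 rad; np.angle reduces it to (-pi, pi].
expected = np.angle(np.exp(1j * omega0 * hop))
print(increments.round(4), round(float(expected), 4))
```

Every entry of `increments` should match `expected` (about -2.80 rad) regardless of the bin, which is the content of the theorem: the increment depends on $\omega_0$ and $H_a$, not on $k$.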

Prop 35.1 (Phase Unwrapping)

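Measured phases $\phi[k,m]$ are confined to $(-\pi, \pi]$ by the $\arg$ operation, so the observed difference $\phi[k,m] - \phi[k,m-1]$ lies in $(-2\pi, 2\pi)$ and equals the true phase increment $\omega_0 H_a$ only up to an integer multiple of $2\pi$. Whenever the accumulated phase crosses $\pm\pi$, the observed difference undergoes a discontinuous “wrap.” The principal-value operator $\mathcal{W}$ corrects for these wraps by adding the unique integer multiple of $2\pi$ that brings the difference into $(-\pi, \pi]$:

$\Delta\phi_{\text{unwrapped}}[k,m] = \Delta\phi_{\text{raw}}[k,m] + 2\pi \cdot \operatorname{round}\!\left(\frac{-\Delta\phi_{\text{raw}}[k,m]}{2\pi}\right).$

Applied directly to the raw difference, this recovers the true increment only when $|\omega_0 H_a| < \pi$. For typical hop sizes the increment is much larger (about 16 rad in Example 1 below), so the wrap is instead applied to the residual that remains after subtracting the expected advance $\omega_k H_a$, as in Definition 35.4. The instantaneous frequency formula is then valid whenever $|\omega_0 - \omega_k| < \pi/H_a$.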

Theorem 35.2 (Instantaneous Frequency Estimation)

Under the local stationarity assumption that the signal in frame $m$ is well approximated, near bin $k$, by a single sinusoid at frequency $\omega_{\text{true}}[k,m]$, the instantaneous frequency in bin $k$ is unambiguously estimated as

$\omega_{\text{true}}[k,m] = \omega_k + \frac{\mathcal{W}\!\left(\phi[k,m] - \phi[k,m-1] - \omega_k H_a\right)}{H_a},$

provided the true frequency deviation satisfies

$|\omega_{\text{true}} - \omega_k| < \pi / H_a$

(i.e., the residual phase advance over one hop stays strictly inside the wrapping range $(-\pi, \pi)$; since the bin spacing is $2\pi/L$, this allows a deviation of up to $L/(2H_a)$ bins, which is two bins when $H_a = L/4$).

Proof.

By Theorem 35.1, a pure tone at $\omega_{\text{true}}$ contributes a phase increment $\omega_{\text{true}} H_a$ in bin $k$. The observed raw difference is

$\Delta\phi_{\text{raw}} = \omega_{\text{true}} H_a + 2\pi p$

for some integer $p$ introduced by the $\arg$ wrapping. Subtracting the expected advance:

$\Delta\phi_{\text{raw}} - \omega_k H_a = (\omega_{\text{true}} - \omega_k) H_a + 2\pi p.$

Wrapping this residual into $(-\pi, \pi]$ recovers $(\omega_{\text{true}} - \omega_k) H_a$ uniquely as long as its magnitude is less than $\pi$, i.e., $|\omega_{\text{true}} - \omega_k| < \pi/H_a$. Dividing by $H_a$ and adding $\omega_k$ yields the formula. $\square$
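The estimator of Theorem 35.2 translates almost line for line into code. The sketch below is one possible rendering: `princarg` and `instantaneous_frequency` are hypothetical helper names, and the test signal (a real 440 Hz cosine with a periodic Hann window) is an assumption chosen to match the numerical examples later in the episode.

```python
import numpy as np

def princarg(theta):
    """Wrap an angle (or array of angles) to the principal range (-pi, pi]."""
    return theta - 2 * np.pi * np.ceil((theta - np.pi) / (2 * np.pi))

def instantaneous_frequency(phi_prev, phi_curr, hop, L):
    """Per-bin frequency estimate (rad/sample) from two consecutive phase spectra."""
    k = np.arange(len(phi_curr))
    omega_k = 2 * np.pi * k / L                        # bin center frequencies
    deviation = princarg(phi_curr - phi_prev - omega_k * hop)
    return omega_k + deviation / hop

# Assumed test setup: two frames of a 440 Hz cosine, L = 1024, hop = 256.
fs, L, hop = 44100, 1024, 256
n = np.arange(L + hop)
x = np.cos(2 * np.pi * 440 * n / fs)
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))

phi0 = np.angle(np.fft.rfft(x[:L] * w))                # phases of frame m-1
phi1 = np.angle(np.fft.rfft(x[hop : hop + L] * w))     # phases of frame m
omega = instantaneous_frequency(phi0, phi1, hop, L)
f_est = omega[10] * fs / (2 * np.pi)                   # estimate from the bin nearest 440 Hz
print(f_est)
```

Bin 10 sits at about 430.7 Hz, so the estimate relies entirely on the wrapped phase deviation to recover the remaining offset. The result is only meaningful in bins where the tone dominates and the deviation bound of the theorem holds.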

Theorem 35.3 (COLA Condition for Perfect Reconstruction)

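Let the synthesis signal be formed by the overlap-add of windowed inverse-STFT frames,

$y[n] = \frac{1}{C}\sum_{m} \left(\frac{1}{L}\sum_{k=0}^{L-1} X[k,m]\, e^{i2\pi k(n - mH)/L}\right) w_s[n - mH],$

where $w_s$ is the synthesis window, $H = H_a$ is the common hop size, both windows are taken to be zero outside $0 \le n \le L-1$, and each inverse DFT (including its $1/L$ factor) is evaluated at the frame-local index $n - mH$ to match Definition 35.1. If the analysis window $w_a$ and the synthesis window $w_s$ satisfy the COLA condition

$\sum_m w_a[n-mH]\, w_s[n-mH] = C$

for all $n$, then $y = x$ (perfect reconstruction when no frequency modifications are applied).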

Proof.

Substituting the STFT definition into the overlap-add formula, writing $j$ for the absolute sample index (the frame-local summation index of Definition 35.1 plus $mH$), and exchanging summation order:

$y[n] = \frac{1}{C}\sum_{m} w_s[n-mH] \sum_{j} x[j]\, w_a[j-mH] \underbrace{\frac{1}{L}\sum_k e^{i2\pi k(n-j)/L}}_{\delta[n-j]}.$

The inner DFT sum equals the Kronecker delta $\delta[n-j]$ (within one frame $|n-j| < L$, so only $j = n$ contributes), and the $j$ sum collapses to $x[n]\, w_a[n-mH]$:

$y[n] = \frac{x[n]}{C} \sum_{m} w_a[n-mH]\, w_s[n-mH].$

Applying the COLA condition, the sum equals $C$ for all $n$, giving $y[n] = x[n]$. $\square$
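A short round-trip experiment illustrates the theorem. The sketch below makes several assumptions for simplicity: a periodic Hann analysis window, a rectangular synthesis window ($w_s = 1$), hop $H = L/2$, and therefore $C = 1$. The helper names `stft_frames` and `overlap_add` are hypothetical.

```python
import numpy as np

def stft_frames(x, w_a, hop):
    """Analysis: windowed DFT of each frame that fits inside x."""
    L = len(w_a)
    return [np.fft.fft(x[s : s + L] * w_a) for s in range(0, len(x) - L + 1, hop)]

def overlap_add(frames, w_s, hop, C):
    """Synthesis: inverse DFT of each frame, synthesis window, overlap-add, divide by C."""
    L = len(w_s)
    y = np.zeros(hop * (len(frames) - 1) + L)
    for m, X in enumerate(frames):
        y[m * hop : m * hop + L] += np.fft.ifft(X).real * w_s   # ifft includes the 1/L factor
    return y / C

L, hop = 512, 256
w_a = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))   # periodic Hann analysis window
w_s = np.ones(L)                                          # rectangular synthesis window (assumed)
C = 1.0                                                   # sum_m w_a * w_s at 50% overlap

rng = np.random.default_rng(0)
x = rng.standard_normal(hop * 20 + L)                     # arbitrary real test signal
y = overlap_add(stft_frames(x, w_a, hop), w_s, hop, C)

# The first and last hop are covered by only one frame, so compare the interior.
interior_error = np.max(np.abs(y[hop:-hop] - x[hop:-hop]))
print(interior_error)
```

The interior error should be at the level of floating-point rounding. The edges are excluded because the sum over $m$ in the COLA condition is incomplete there for a finite signal.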

Theorem 35.4 (Hann Window Satisfies COLA at 50% Overlap)

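The Hann window $w[n] = \tfrac{1}{2}\!\left(1 - \cos\!\tfrac{2\pi n}{L}\right)$ for $n = 0, \ldots, L-1$ satisfies the COLA condition $\sum_m w[n - mH] = 1$ when the hop size is $H = L/2$ (50% overlap). The squared-window (WOLA) sum $\sum_m w^2[n - mH]$ is not constant at 50% overlap; it becomes constant, with value $3/2$, at $H = L/4$ (75% overlap).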

Proof.

At 50% overlap, $H = L/2$, so at each sample position $n$ exactly two windows overlap: the one starting at $mH$ and the one starting at $(m+1)H$. Let $u = 2\pi n/L \in [0, \pi)$ be the argument of the sample in one window; the same sample has argument $u + \pi$ in the other. For the unsquared sum, using $\cos(u+\pi) = -\cos u$,

$w(u) + w(u+\pi) = \tfrac{1}{2}(1-\cos u) + \tfrac{1}{2}(1+\cos u) = 1.$

Thus the Hann window at 50% overlap satisfies the standard OLA-COLA condition with constant $C = 1$, guaranteeing a flat, unity-gain overlap-add in the OLA framework.

For the squared sum,

$w^2(u) + w^2(u+\pi) = \frac{1}{4}(1-\cos u)^2 + \frac{1}{4}(1+\cos u)^2 = \frac{1}{4}\bigl(2 + 2\cos^2 u\bigr) = \frac{1}{2}(1 + \cos^2 u).$

Using $\cos^2 u = (1+\cos 2u)/2$:

$\frac{1}{2}\!\left(1 + \frac{1+\cos 2u}{2}\right) = \frac{3}{4} + \frac{\cos 2u}{4}.$

This is not identically $3/4$: the Hann window at exactly 50% overlap does not satisfy the squared-sum condition. At 75% overlap ($H = L/4$), four windows contribute at each point, with arguments $u + j\pi/2$ for $j = 0, 1, 2, 3$. Writing $w^2(u) = \tfrac{1}{4}\bigl(\tfrac{3}{2} - 2\cos u + \tfrac{1}{2}\cos 2u\bigr)$, the $\cos u$ and $\cos 2u$ terms cancel over the four shifts, leaving the constant $4 \cdot \tfrac{1}{4} \cdot \tfrac{3}{2} = \tfrac{3}{2}$. $\square$
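The three overlap sums discussed in the proof are easy to check numerically. In the sketch below, `overlap_sum` is a hypothetical helper that adds shifted copies of a window and returns one steady-state period; the window length $L = 1024$ is an arbitrary assumption.

```python
import numpy as np

def overlap_sum(w, hop, n_frames=16):
    """Sum shifted copies of w and return one steady-state period (hop samples)."""
    L = len(w)
    total = np.zeros(hop * (n_frames - 1) + L)
    for m in range(n_frames):
        total[m * hop : m * hop + L] += w
    return total[L : L + hop]          # region where every overlapping frame is present

L = 1024
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))   # periodic Hann, as in Theorem 35.4

cola_50 = overlap_sum(w, L // 2)         # unsquared window, 50% overlap
wola_50 = overlap_sum(w**2, L // 2)      # squared window, 50% overlap
wola_75 = overlap_sum(w**2, L // 4)      # squared window, 75% overlap

for name, s in [("cola_50", cola_50), ("wola_50", wola_50), ("wola_75", wola_75)]:
    print(name, "min", s.min().round(6), "max", s.max().round(6))
```

The unsquared sum at 50% overlap and the squared sum at 75% overlap should come out flat, while the squared sum at 50% overlap should ripple between 1/2 and 1, in line with the $\tfrac{3}{4} + \tfrac{1}{4}\cos 2u$ expression in the proof.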

Theorem 35.5 (Phase Vocoder Pitch-Shift Theorem)

To scale all spectral components of $x[n]$ by a frequency ratio $r > 0$ (upward for $r > 1$, downward for $r < 1$):

Compute the STFT to obtain $X[k,m]$.
Estimate $\omega_{\text{true}}[k,m]$ via Theorem 35.2.
Replace each true frequency by $\omega_{\text{shifted}}[k,m] = r \cdot \omega_{\text{true}}[k,m]$.
Propagate a phase-continuous output phase: $\phi_{\text{out}}[k,m] = \phi_{\text{out}}[k,m-1] + r \cdot \omega_{\text{true}}[k,m] \cdot H_a$.
Reconstruct using $Y[k,m] = |X[k,m]|\, e^{i\phi_{\text{out}}[k,m]}$ followed by overlap-add synthesis.

The output $y[n]$ contains the same spectral magnitudes as $x[n]$ but all instantaneous frequencies scaled by $r$, with no change in duration.

Proof.

The proof has two parts: correctness of frequency scaling, and phase continuity.

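Frequency scaling. The synthesis frame is constructed with magnitude $|X[k,m]|$ (unchanged) and a new phase whose increment per frame is $r \cdot \omega_{\text{true}}[k,m] \cdot H_a$. By the converse of Theorem 35.1, a sequence of STFT frames with that phase increment corresponds, after inverse transformation and overlap-add, to a local sinusoid at instantaneous frequency $r \cdot \omega_{\text{true}}[k,m]$. Therefore every spectral component is scaled by $r$. This argument requires the scaled frequency to stay close to the center frequency $\omega_k$ of the bin in which it is synthesized; practical implementations therefore either move each component to the bin nearest $r \cdot \omega_{\text{true}}[k,m]$, or time-stretch by $r$ (Theorem 35.6) and then resample.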

Phase continuity. The output phase is defined by the recurrence $\phi_{\text{out}}[k,m] = \phi_{\text{out}}[k,m-1] + r \cdot \omega_{\text{true}}[k,m] \cdot H_a$. By construction this is a cumulative sum, so there are no discontinuities between frames. If the synthesis used the raw (wrapped) analysis phase instead, adjacent frames would have inconsistent phase relationships, producing comb-filter artifacts (metallic distortion) when overlap-added. The recurrence guarantees a smooth phase trajectory.

Duration unchanged. Both analysis and synthesis use the same hop size $H_a$, so the same number of frames covers the same total duration. The output signal has the same number of samples as the input. $\square$
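A widely used alternative to the per-bin recurrence above, named here as a different technique rather than the construction in the theorem, is to time-stretch by $r$ and then resample so that the signal plays back $r$ times faster. The sketch below shows only the resampling half, using plain linear interpolation; `resample_linear` and `dominant_frequency` are hypothetical helpers. It makes the trade-off concrete: resampling alone raises the pitch by $r$ and shortens the signal by $r$, which is exactly what a preceding time-stretch by $r$ would compensate.

```python
import numpy as np

def resample_linear(x, r):
    """Read x at r times the original speed using linear interpolation."""
    positions = np.arange(0, len(x) - 1, r)            # fractional read positions in x
    return np.interp(positions, np.arange(len(x)), x)

def dominant_frequency(sig, fs):
    """Frequency (Hz) of the largest peak in the Hann-windowed magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(sig * np.hanning(len(sig))))
    return np.argmax(spectrum) * fs / len(sig)

fs = 44100
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)       # one second of A4
y = resample_linear(x, 1.5)                            # r = 3/2, a perfect fifth

print(len(x), dominant_frequency(x, fs))               # input length and pitch
print(len(y), dominant_frequency(y, fs))               # shorter by 1.5, pitch near 660 Hz
```

Linear interpolation is used only to keep the sketch short; it attenuates high frequencies, and a windowed-sinc kernel would be the more faithful choice for real audio.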

Theorem 35.6 (Phase Correction for Time Stretching)

To stretch the duration of $x[n]$ by factor $\alpha$ without changing pitch, set the synthesis hop $H_s = \alpha H_a$ and propagate output phases as:

$\phi_{\text{out}}[k,m] = \phi_{\text{out}}[k,m-1] + \omega_{\text{true}}[k,m] \cdot H_s.$

The resulting signal $y[n]$ has duration $\alpha$ times that of $x[n]$ and retains the original pitch.

Proof.

Duration scaling. The analysis processes $M$ frames from $x$, covering a total of $M \cdot H_a$ input samples. Synthesis places those $M$ frames with spacing $H_s = \alpha H_a$, covering $M \cdot H_s = \alpha M H_a$ output samples. The duration ratio is $\alpha$.

Pitch preservation. The pitch of the output is determined by the instantaneous frequencies of the synthesis frames. The phase increment per synthesis frame at bin $k$ is $\omega_{\text{true}}[k,m] \cdot H_s$. The corresponding angular frequency of the synthesized sinusoid is the increment divided by the synthesis hop: $\frac{\phi_{\text{out}}[k,m] - \phi_{\text{out}}[k,m-1]}{H_s} = \frac{\omega_{\text{true}}[k,m] \cdot H_s}{H_s} = \omega_{\text{true}}[k,m]$. The instantaneous frequency is $\omega_{\text{true}}[k,m]$, identical to the analysis frequency. Since no frequency scaling is applied, the pitch is preserved.

Why naive repositioning fails. If the raw analysis phases $\phi[k,m]$ were used directly in the repositioned synthesis frames without recurrence, consecutive frames would carry phase values appropriate for positions $m H_a$ but placed at positions $m H_s$. The mismatch $\phi[k,m] - \phi[k,m-1] \neq \omega_{\text{true}} H_s$ produces phase discontinuities at frame boundaries, causing comb-filter (metallic) distortion in the overlap-add sum. The recurrence eliminates this. $\square$
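Putting the pieces together, a basic time-stretching phase vocoder can be sketched as follows. This is a minimal illustration under several assumptions, not a production design: a periodic Hann window is used for both analysis and synthesis, the output is normalized by the running sum of squared windows rather than relying on an exact WOLA constant, and no transient handling or phase locking is attempted. The names `princarg` and `time_stretch` are hypothetical.

```python
import numpy as np

def princarg(theta):
    """Wrap angles to (-pi, pi]."""
    return theta - 2 * np.pi * np.ceil((theta - np.pi) / (2 * np.pi))

def time_stretch(x, alpha, L=1024, hop_a=256):
    """Stretch x by factor alpha using the phase recurrence of Theorem 35.6."""
    hop_s = int(round(alpha * hop_a))
    w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))
    omega_k = 2 * np.pi * np.arange(L // 2 + 1) / L      # rfft bin centers
    n_frames = 1 + (len(x) - L) // hop_a
    y = np.zeros(hop_s * (n_frames - 1) + L)
    norm = np.zeros_like(y)
    phi_prev = phi_out = None
    for m in range(n_frames):
        X = np.fft.rfft(x[m * hop_a : m * hop_a + L] * w)
        phi = np.angle(X)
        if m == 0:
            phi_out = phi.copy()                          # start from the analysis phase
        else:
            deviation = princarg(phi - phi_prev - omega_k * hop_a)
            omega_true = omega_k + deviation / hop_a      # Theorem 35.2
            phi_out = phi_out + omega_true * hop_s        # Theorem 35.6 recurrence
        phi_prev = phi
        frame = np.fft.irfft(np.abs(X) * np.exp(1j * phi_out), n=L)
        y[m * hop_s : m * hop_s + L] += frame * w         # synthesis window, overlap-add
        norm[m * hop_s : m * hop_s + L] += w**2           # running sum of squared windows
    return y / np.maximum(norm, 1e-3)                     # floor avoids division by ~0 at the edges

fs = 44100
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)          # one second of A4
y = time_stretch(x, 1.5)

segment = y[8192 : 8192 + 32768] * np.hanning(32768)      # analyze the interior only
f_peak = np.argmax(np.abs(np.fft.rfft(segment))) * fs / 32768
print(len(x), len(y), f_peak)
```

For this steady tone the output should be roughly 1.5 times as long as the input with its spectral peak still near 440 Hz. On percussive or polyphonic material the same code exhibits the artifacts described under Limits and Open Questions.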

Numerical Examples

Example 1: Phase advance for a pure tone

Let $L = 1024$, $H_a = 256$, sample rate $f_s = 44100$ Hz. Consider a pure tone at $f_0 = 440$ Hz (A4).

The angular frequency is $\omega_0 = 2\pi \cdot 440 / 44100 \approx 0.06269$ rad/sample. The nearest bin is $k_0 = \operatorname{round}\!\left(\frac{440 \cdot 1024}{44100}\right) = \operatorname{round}(10.22) = 10$. The center frequency of bin 10 is $\omega_{10} = 2\pi \cdot 10 / 1024 \approx 0.06136$ rad/sample.

Expected phase advance per hop: $\Delta\phi_{\text{expected}} = \omega_{10} \cdot 256 \approx 15.71$ rad.

True phase advance per hop: $\Delta\phi_{\text{true}} = \omega_0 \cdot 256 \approx 16.05$ rad.

Phase deviation: $\delta\phi = 16.05 - 15.71 = 0.34$ rad. In practice the measured phase difference is known only modulo $2\pi$, but because $|\delta\phi| < \pi$ the wrap $\mathcal{W}$ of Definition 35.4 returns this same value.

Instantaneous frequency: $\omega_{\text{true}} = \omega_{10} + 0.34/256 = 0.06136 + 0.00133 = 0.06269$ rad/sample, recovering $f_0 = 0.06269 \cdot 44100 / (2\pi) \approx 440$ Hz. For an ideal pure tone the estimate is exact up to rounding.

Example 2: Pitch shifting by a perfect fifth

A perfect fifth corresponds to frequency ratio $r = 3/2 = 1.5$.

For the 440 Hz tone above, shifted frequency: $\omega_{\text{shifted}} = 1.5 \times 0.06269 = 0.09403$ rad/sample, corresponding to $f = 0.09403 \times 44100/(2\pi) \approx 660$ Hz (a just-intonation E5; the equal-tempered E5 is about 659.3 Hz).

Output phase after frame $m$ (starting from $\phi_{\text{out}}[k,0] = 0$): $\phi_{\text{out}}[k,m] = m \cdot 1.5 \cdot \omega_{\text{true}} \cdot H_a \approx m \cdot 1.5 \times 16.049 \approx m \cdot 24.07 \text{ rad}$. Each frame receives a consistent phase increment of 24.07 rad, ensuring a smooth sinusoid at 660 Hz in the output.

Example 3: Time stretching by 1.5x

With $\alpha = 1.5$, $H_a = 256$, the synthesis hop is $H_s = 384$ samples.

Phase recurrence for the 440 Hz bin: $\phi_{\text{out}}[k,m] = \phi_{\text{out}}[k,m-1] + 0.06269 \times 384 \approx \phi_{\text{out}}[k,m-1] + 24.073 \text{ rad}$. The output instantaneous frequency is $24.073 / 384 = 0.06269$ rad/sample, unchanged at 440 Hz. A 5-second input (220,500 samples at 44.1 kHz) produces a 7.5-second output (330,750 samples), stretched by exactly $\alpha = 1.5$.

Example 4: COLA check — Hann at 50% overlap

With $L = 8$ and $H = 4$, take the Hann window $w[n] = \tfrac{1}{2}\!\left(1 - \cos\frac{2\pi n}{8}\right)$ on $0 \le n \le 7$. The two windows that cover the samples $0 \le n \le 3$ start at $n = -4$ and $n = 0$, so their values at sample $n$ are $w[n+4]$ and $w[n]$. At $n=2$: $w[2] = \tfrac{1}{2}(1-\cos(\pi/2)) = 1/2$ and $w[2+4] = w[6] = \tfrac{1}{2}(1-\cos(3\pi/2)) = 1/2$. Sum: $1/2 + 1/2 = 1$. At $n=1$: $w[1] = \tfrac{1}{2}(1-\cos(\pi/4)) \approx 0.146$ and $w[5] = \tfrac{1}{2}(1-\cos(5\pi/4)) \approx 0.854$, again summing to 1. This confirms the OLA-COLA condition at 50% overlap (Theorem 35.4).

Musical Connection


The phase vocoder is the mathematical engine behind some of the most familiar sounds in modern music production.

Sampler instruments. A sampler records a note at one pitch (the “root note”) and uses the phase vocoder — or its lightweight cousin, wavetable synthesis — to render every other pitch on the keyboard from that single recording. Kontakt, Omnisphere, and similar instruments rely on this: one carefully recorded cello note becomes an entire playable cello section.

The “chipmunk effect” and autotune. Setting $r \gg 1$ in pitch shifting pushes all frequencies upward by the same ratio, moving both the harmonics and the formants toward the high-frequency range. The voice sounds thin and cartoon-like because harmonic spacing and formant positions all shift together. Professional pitch correction (Melodyne, Auto-Tune) avoids this by formant preservation: the spectral envelope (formant curve) is extracted, separated from the fine harmonic structure, and held in place while the harmonics are shifted, so that only the fundamental pitch moves while the vocal timbre (vowel quality) remains natural.

Time-stretching in DAWs. Digital audio workstations such as Logic, Ableton, and Reaper offer time-stretch modes based on variants of the phase vocoder for warping audio to a different tempo. The analysis/synthesis hop ratio $\alpha = H_s/H_a$ is the duration factor, so speeding the tempo up by a factor $t$ corresponds to $\alpha = 1/t$. Phase vocoder artifacts (the “phasiness” or metallic quality on transients) motivate further refinements such as transient detection, which switches to a waveform-copy method around percussive events.

Wavetable synthesis. The lighter-weight alternative described in this episode stores a single waveform period and varies playback speed to control pitch. The interpolation kernel used between samples approximates the sinc reconstruction kernel studied in EP30: The Sampling Theorem. Linear interpolation is a crude piecewise-linear approximation to sinc, while windowed-sinc interpolation approaches ideal reconstruction within the Nyquist bandwidth as the kernel length grows.

Looking ahead. The Short-Time Fourier Transform has a deeper foundation: in EP36, we will see that the time-frequency resolution of the STFT is constrained by the Gabor uncertainty principle, established in 1946 — a musical analogue of the Heisenberg uncertainty principle that fundamentally limits how precisely a sound can be localized in both time and frequency simultaneously. EP42 will revisit these ideas in the context of wavelet analysis, which adapts the window width to frequency and can overcome some of STFT’s resolution tradeoffs.

Limits and Open Questions

Transient smearing. The phase vocoder assumes local stationarity — each frame is well modeled by a superposition of sinusoids. Percussive transients (drum hits, plucked strings) violate this assumption severely. Time-stretching a snare drum with a standard phase vocoder produces a characteristic “pre-echo” or “flamming” artifact as the transient energy is spread across adjacent frames. State-of-the-art solutions (transient detection, phase locking, object-based coding) partially address this, but perfect transient preservation under time-stretching remains an active research area.

Polyphony and phase locking. A single sinusoidal partial spreads over several adjacent bins, and the relative phases of those bins, together with the relative phases of harmonically related partials, encode important perceptual cues (the “shape” of the attack and the coherence of the instrument body). Naively propagating the phase of every bin independently can destroy this coherence. Phase-locking algorithms (Laroche and Dolson, 1999) detect spectral peaks and tie the phases of the neighboring bins to the phase of the nearest peak, preserving the phase relationships within each partial. The optimal phase-locking strategy for complex polyphonic signals is still an open problem.

Pitch shifting limits. Very large pitch ratios (e.g., $r = 4$, two octaves up) produce audible artifacts even with phase-continuous processing, because the assumed one-sinusoid-per-bin model breaks down as the scaled frequencies of different bins overlap. This motivates resampling-based approaches for large pitch changes, or source-separation preprocessing.

The uncertainty principle. The STFT window length $L$ controls a fundamental tradeoff: long windows give fine frequency resolution ($\Delta\omega \sim 1/L$) but poor time localization ($\Delta t \sim L$); short windows give the reverse. There is no window for which both are simultaneously arbitrarily small. This is the Gabor uncertainty principle, previewed in the narration and to be treated rigorously in EP36.

Formant preservation. As noted in the narration, professional pitch shifting requires separating the spectral fine structure (harmonic positions, encoding pitch) from the spectral envelope (formant positions, encoding timbre). The standard engineering approach uses cepstral liftering or LPC envelope estimation, but both are approximations. Perceptually optimal formant preservation in the presence of vibrato, breathiness, and consonant transitions is an unsolved signal processing problem.

Academic References

Flanagan, J. L., & Golden, R. M. (1966). Phase vocoder. Bell System Technical Journal, 45(9), 1493–1509. The original phase vocoder paper.
Dolson, M. (1986). The phase vocoder: A tutorial. Computer Music Journal, 10(4), 14–27. Accessible tutorial covering instantaneous frequency estimation and overlap-add reconstruction.
Laroche, J., & Dolson, M. (1999). Improved phase vocoder time-scale modification of audio. IEEE Transactions on Speech and Audio Processing, 7(3), 323–332. Introduces phase locking for improved polyphonic time-stretching.
Portnoff, M. R. (1980). Time-frequency representation of digital signals and systems based on short-time Fourier analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1), 55–69. Rigorous derivation of COLA and perfect reconstruction conditions.
Zölzer, U. (Ed.). (2011). DAFX: Digital Audio Effects (2nd ed.). Wiley. Chapter 7 covers the phase vocoder algorithm in engineering detail, including pitch shifting, time stretching, and transient handling.
Gabor, D. (1946). Theory of communication. Journal of the Institution of Electrical Engineers, 93(26), 429–441. Foundational paper establishing the time-frequency uncertainty principle; the mathematical basis for EP36.
Smith, J. O. (2011). Spectral Audio Signal Processing. W3K Publishing. Available online at https://ccrma.stanford.edu/~jos/sasp/. Comprehensive treatment of STFT, phase vocoder, and wavetable synthesis.