EP15: Opera Acoustics and the Singer's Formant

Source-Filter Theory, Formant Clustering, and the 3 kHz Window

Runtime: 9:06 · Topics: Physics/Acoustics, Signal Processing, Psychoacoustics

Overview

An opera singer has no microphone. A single voice must cut through 110 orchestral instruments in a live hall. This episode asks: how?

Narration: “The answer lies at 3000 Hz. Here the orchestra’s energy plunges by 18 dB. This is the singer’s only window: be drowned out, or cut through.”

The answer is not louder singing. It is frequency targeting: the singer concentrates acoustic energy precisely where the orchestra is weakest, where the human ear is most sensitive, and where the physics of the vocal tract allows a structural resonance enhancement.

This episode develops a three-layer quantitative model:

Physical layer: The orchestra’s spectrum drops ~18 dB between its 500 Hz peak and the 3 kHz region. This is the frequency window.
Source layer: The Singer’s Formant, a cluster of the vocal resonances $F_3$ , $F_4$ , $F_5$ near 2500–3500 Hz, raises the singer’s output in exactly this window by approximately +10 dB.
Perceptual layer: The human auditory system (Fletcher-Munson equal-loudness curves, ISO 226) is ~10 dB more sensitive at 3 kHz than at 500 Hz at forte dynamics.

Combined, the physical signal-to-noise ratio is 48 dB; the perceptual penetration index reaches 49 dB (with a ±2 dB uncertainty on the Formant term). A fourth factor — room acoustics — modulates this advantage by up to ±7 dB depending on hall design.

The episode also presents spectral centroid measurements from four tenors, connecting vocal technique to quantifiable acoustic parameters.

Prerequisites

Fourier Analysis and Timbre (EP02) — frequency spectrum, harmonics, Fourier decomposition
Combination Tones and Nonlinearity (EP09) — nonlinear acoustics, cochlear response, harmonic generation

Definitions

Definition 15.1 (Acoustic Source-Filter Model)

The source-filter model of voice production separates the voice into two independent stages:

Source (glottal excitation): The vocal folds produce a periodic buzz with fundamental frequency $f_0$ and a harmonic series at $f_0, 2f_0, 3f_0, \ldots$ The spectral envelope of the glottal source decays at approximately $-12$ dB/octave (i.e., the amplitude of harmonic $n$ is proportional to $1/n^2$ ).
Filter (vocal tract): The air column above the larynx acts as an acoustic resonator. Its resonant frequencies $F_1, F_2, F_3, \ldots$ are called formants. The vocal tract transfer function amplifies harmonics near each formant and attenuates those between formants.

The radiated sound spectrum is:

|S(f)| = |G(f)| \cdot |H(f)| \cdot |R(f)|

where $G(f)$ is the glottal source spectrum, $H(f)$ is the vocal tract transfer function, and $R(f)$ is a radiation factor ( $\propto f$ , adding +6 dB/octave to the final output).

Net source spectral slope: $-12 + 6 = -6$ dB/octave (before formant amplification).
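The slope bookkeeping above can be checked numerically. The following is a minimal sketch, assuming an idealized source whose harmonic amplitudes fall exactly as $1/n^2$ and a radiation factor exactly proportional to frequency. The function name `harmonic_level_db` is illustrative and does not come from the episode.

```python
import math

def harmonic_level_db(n: int) -> float:
    """Level of harmonic n relative to the fundamental, before formant filtering.

    Assumes glottal amplitude ~ 1/n^2 (-12 dB/octave) and
    radiation factor ~ n (+6 dB/octave).
    """
    glottal = 1.0 / n**2      # idealized glottal source amplitude
    radiation = float(n)      # radiation factor proportional to frequency
    return 20.0 * math.log10(glottal * radiation)

# Each doubling of the harmonic number is one octave.
for n in (1, 2, 4, 8):
    print(n, round(harmonic_level_db(n), 2))
```

Each octave step in the printed levels should differ by about 6 dB, matching the net $-6$ dB/octave slope.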

Definition 15.2 (Formants and Vocal Tract Transfer Function)

The formant frequencies $F_1, F_2, F_3, F_4, F_5$ are the resonant peaks of the vocal tract transfer function. For a simplified model of uncoupled resonators indexed by $k$ , the transfer function is:

H(f) = \prod_{k} \frac{1}{1 - (f/F_k)^2}

Each term contributes a peak of amplitude $\rightarrow \infty$ at $f = F_k$ (in the lossless approximation). Including damping, each resonator has a finite bandwidth $B_k$ and the actual amplitude at resonance is $F_k / B_k$ .
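To illustrate the damped case, here is a small sketch of a single resonator. It assumes one common second-order form, in which damping adds an imaginary term $j\,(f/F_k)(B_k/F_k)$ to the lossless denominator; this specific form and the 100 Hz bandwidth are assumptions, not values from the episode.

```python
def resonator_gain(f: float, F: float, B: float) -> float:
    """|H| of one damped second-order resonator with centre F and bandwidth B (Hz)."""
    x = f / F
    denom = complex(1.0 - x * x, x * B / F)   # damping adds an imaginary term
    return 1.0 / abs(denom)

F4, B4 = 3000.0, 100.0   # B4 is an assumed bandwidth, not a value from the text
print(resonator_gain(F4, F4, B4))     # peak gain equals F4 / B4
print(resonator_gain(10.0, F4, B4))   # near-unity gain far below resonance
```

Under this form the gain at $f = F_k$ is exactly $F_k / B_k$, so a narrower bandwidth gives a taller peak.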

Typical speech formant frequencies (male voice):

Formant	Speech (Hz)	Opera singing (Hz)	Shift
$F_1$	500	300–500	varies
$F_2$	1500	800–1200	varies
$F_3$	2200	2600	+400 Hz
$F_4$	3100	3000	−100 Hz
$F_5$	3900	3400	−500 Hz

Definition 15.3 (Singer's Formant)

The Singer’s Formant (Sundberg, 1974) is a clustering of $F_3$ , $F_4$ , and $F_5$ into a narrow frequency band near 2500–3500 Hz, producing an energy concentration approximately +10 dB above what would occur with the same formants in their speech positions.

This clustering is produced by a specific laryngeal configuration in which the pharyngeal cross-sectional area to laryngeal tube opening ratio reaches approximately 6:1. This geometric ratio creates an additional coupled resonance mode that draws $F_3$ , $F_4$ , and $F_5$ toward the same frequency region.

Measured range (Sundberg 1974): +8 to +12 dB. Canonical value: $\Delta_\text{Formant} \approx +10$ dB.
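The clustering effect can be explored with a toy calculation. The sketch below takes the $F_3$–$F_5$ values from the table in Definition 15.2, models each formant as a damped second-order resonator with an assumed 100 Hz bandwidth, and compares mean power in the 2.5–3.5 kHz band for the speech and opera positions. $F_1$ and $F_2$ are held fixed and omitted, and the resonator form, bandwidth, and band edges are all assumptions. The printed figure is therefore qualitative and should not be read as a reproduction of Sundberg's measurement.

```python
import math

def resonator_gain(f, F, B):
    """|H| of one damped second-order resonator (centre F, bandwidth B, in Hz)."""
    x = f / F
    return 1.0 / abs(complex(1.0 - x * x, x * B / F))

def band_level_db(formants, bandwidth=100.0, lo=2500, hi=3500, step=10):
    """Mean power of the product of resonator gains over [lo, hi] Hz, in dB."""
    total, count = 0.0, 0
    for f in range(lo, hi + 1, step):
        h = 1.0
        for F in formants:
            h *= resonator_gain(float(f), F, bandwidth)
        total += h * h
        count += 1
    return 10.0 * math.log10(total / count)

speech = [2200.0, 3100.0, 3900.0]    # F3..F5, speech column of the table
singing = [2600.0, 3000.0, 3400.0]   # F3..F5, opera column of the table

gain_db = band_level_db(singing) - band_level_db(speech)
print(f"clustering gain in the 2.5-3.5 kHz band: {gain_db:+.1f} dB")
```

The result depends strongly on the assumed bandwidth, but the sign is robust: moving three resonances into the band raises its level.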

Narration: “This is not a single harmonic, but a clustering of three formants: F3, F4, and F5.”

Definition 15.4 (Equal-Loudness Contour (Fletcher-Munson / ISO 226))

An equal-loudness contour at $L$ phon is the curve of sound pressure level (SPL) in dB as a function of frequency $f$ such that a pure tone at $(f, \text{SPL}(f))$ is judged equally loud to a 1 kHz reference tone at $L$ dB SPL.

Key data (ISO 226:2003):

Level (phon)	SPL at 500 Hz (dB)	SPL at 3000 Hz (dB)	Advantage at 3 kHz
40	45	38	+7 dB
60	60	53	+7 dB
80	77	68	+9 dB
100	98	88	+10 dB

At forte operatic dynamics (~90–100 phon), the ear requires 10 dB less SPL at 3 kHz than at 500 Hz to perceive the same loudness. Equivalently, a 3 kHz tone sounds 10 dB louder than a 500 Hz tone at the same physical SPL.

$\Delta_\text{Fletcher} \approx +10$ dB at operatic forte.

Definition 15.5 (Spectral Centroid)

The spectral centroid of a signal with power spectrum $P(f)$ is:

f_c = \frac{\int_0^\infty f \cdot P(f)\, df}{\int_0^\infty P(f)\, df}

In discrete form (DFT bins):

f_c = \frac{\sum_k f_k \cdot |X_k|^2}{\sum_k |X_k|^2}

The spectral centroid is a single-number proxy for perceived brightness (timbre). Higher $f_c$ correlates with a “brighter,” more “forward” vocal quality; lower $f_c$ with a “darker,” “covered” sound.
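A minimal sketch of the discrete formula follows. It uses a naive $O(N^2)$ DFT from the standard library so that it stays self-contained; a practical implementation would use an FFT. The test signal (two equal-amplitude sinusoids placed on exact DFT bins) is an illustrative assumption chosen so that the expected centroid is easy to verify by hand.

```python
import cmath
import math

def spectral_centroid(samples, sample_rate):
    """Power-weighted mean frequency over DFT bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(samples)
    num = 0.0
    den = 0.0
    for k in range(n // 2 + 1):
        # k-th DFT coefficient X_k
        xk = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                 for i, s in enumerate(samples))
        power = abs(xk) ** 2
        num += (k * sample_rate / n) * power   # f_k * |X_k|^2
        den += power
    return num / den

fs, n = 8000, 64   # bin spacing fs/n = 125 Hz, so both tones sit on exact bins
tone = [math.sin(2 * math.pi * 1000 * i / fs) + math.sin(2 * math.pi * 3000 * i / fs)
        for i in range(n)]
print(round(spectral_centroid(tone, fs)))   # equal-power tones: centroid midway
```

With equal power at 1000 Hz and 3000 Hz, the centroid lands midway at 2000 Hz; adding energy to the upper tone would pull it higher.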

Definition 15.6 (Clarity Index $C_{80}$)

The clarity index $C_{80}$ of a room measures the ratio of early to late reflected energy arriving at a listener position:

C_{80} = 10 \log_{10} \frac{\int_0^{80\,\text{ms}} p^2(t)\, dt}{\int_{80\,\text{ms}}^\infty p^2(t)\, dt} \quad \text{(dB)}

Higher $C_{80}$ (positive values) indicates more early energy — the room sounds cleaner and more “articulate.” Lower $C_{80}$ (negative values) indicates excessive reverberation blurring onsets.

Representative values:

Center orchestra (good hall): $C_{80} \approx +3$ dB
Side balcony seat: $C_{80} \approx -2$ dB
Difference: $\Delta C_{80} \approx 5$ dB of perceived clarity loss
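For intuition about how $C_{80}$ responds to reverberation, the sketch below evaluates the definition for an idealized room whose energy decays as a pure exponential, characterized only by its reverberation time (the time for a 60 dB decay). This ignores the direct sound and discrete early reflections, so it is an assumption-laden toy model and will not reproduce the seat-specific values above. The reverberation times are illustrative.

```python
import math

def c80_exponential(t60: float) -> float:
    """C80 (dB) for an ideal exponential energy decay with reverberation time t60 (s)."""
    k = math.log(1e6) / t60              # energy decay rate: 60 dB drop in t60 seconds
    early = 1.0 - math.exp(-k * 0.080)   # energy fraction arriving in the first 80 ms
    late = math.exp(-k * 0.080)          # energy fraction arriving after 80 ms
    return 10.0 * math.log10(early / late)

for t60 in (1.2, 1.6, 2.0):              # illustrative reverberation times, in seconds
    print(t60, round(c80_exponential(t60), 1))
```

In this model a longer decay shifts energy past the 80 ms boundary, so $C_{80}$ falls as the reverberation time grows.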

Main Theorems

Theorem 15.1 (Orchestra Spectral Attenuation Model)

The long-term average sound pressure level of a full symphony orchestra follows an approximate decay model:

\text{SPL}(f) \approx \text{SPL}_0 - \alpha \cdot \log_2\!\left(\frac{f}{f_0}\right)

with $\text{SPL}_0 = 80$ dB at the peak frequency $f_0 = 500$ Hz and $\alpha \approx 9$ dB/octave (Beranek 2004, Meyer 2009).

Numerical values:

Frequency	Octaves from 500 Hz	Predicted SPL	Rounded
1000 Hz	1	71 dB	71 dB
2000 Hz	2	62 dB	62 dB
3000 Hz	2.585	56.7 dB	~57 dB
4000 Hz	3	53 dB	53 dB

The 3 kHz window: at 3000 Hz, orchestral SPL is approximately 57–62 dB (18–23 dB below peak), leaving a spectral gap that a well-trained singer can exploit.

Proof.

$\log_2(3000/500) = \log_2(6) = \ln 6 / \ln 2 \approx 1.7918 / 0.6931 \approx 2.585$ .

$\text{SPL}(3000) = 80 - 9 \times 2.585 = 80 - 23.3 \approx 56.7$ dB.

The narration uses 62 dB, which corresponds to $\alpha = 7$ dB/octave: $80 - 7 \times 2.585 \approx 62$ dB. Both values (7–9 dB/octave) appear in the literature depending on the ensemble and measurement method. We adopt the more conservative estimate $\text{SPL}(3000) \approx 62$ dB for the SNR calculation, consistent with the video. $\square$
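The model and both slope values can be tabulated with a short sketch. The function name `orchestra_spl` is illustrative; the default parameters are the ones stated in the theorem.

```python
import math

def orchestra_spl(f_hz, spl0=80.0, f0=500.0, alpha=9.0):
    """Long-term average orchestral SPL (dB) under the log-frequency decay model."""
    return spl0 - alpha * math.log2(f_hz / f0)

# Columns: frequency, SPL with alpha = 9 dB/octave, SPL with alpha = 7 dB/octave
for f in (1000, 2000, 3000, 4000):
    print(f, round(orchestra_spl(f), 1), round(orchestra_spl(f, alpha=7.0), 1))
```

At 3000 Hz the two slopes give roughly 56.7 dB and 61.9 dB, which brackets the 57–62 dB range used in the text.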

Theorem 15.2 (Physical Signal-to-Noise Ratio)

When an opera singer performs at fortissimo dynamics with peak SPL of approximately 110 dB at 3 kHz (peak value; orchestra measured as long-term average), the physical signal-to-noise ratio at 3 kHz is:

\text{SNR}_\text{phys} = \text{SPL}_\text{singer}(3\,\text{kHz}) - \text{SPL}_\text{orchestra}(3\,\text{kHz}) = 110 - 62 = 48 \text{ dB}

Important caveat: The singer’s 110 dB is a peak SPL during fortissimo, while the orchestra’s 62 dB is a long-term average. This “peak vs. average” comparison captures the perceptual reality that the singer’s loudest moments cut through the orchestra’s sustained texture.

Theorem 15.3 (Helmholtz Resonator Model of the Singer's Formant)

The pharynx-larynx system can be approximated as a Helmholtz resonator: a cavity of volume $V$ connected to the exterior through a neck of cross-sectional area $A$ and length $L$ . The resonant frequency is:

f_H = \frac{v}{2\pi} \sqrt{\frac{A}{V \cdot L}}

where $v = 343$ m/s is the speed of sound.

Sundberg (1974) identified that when the ratio of pharyngeal cross-section to laryngeal tube opening reaches $\approx 6:1$ , the coupled resonance mode of the pharynx-larynx system places an extra resonance peak near 3000 Hz, drawing $F_3$ , $F_4$ , and $F_5$ into proximity.

Typical values (Ternström 1988): $V \approx 35$ cm $^3$ , $L \approx 2.5$ cm, $A$ tunable by laryngeal action. The 6:1 area ratio is a key degree of freedom controlled by the singer.

Note: The full vocal tract is a multi-cavity coupled system; the Helmholtz model is a pedagogical approximation. The mechanism is directionally correct but not quantitatively exact.

Proof.

Helmholtz resonance derivation (simplified): consider air in the neck as an acoustic mass $m = \rho L / A$ (where $\rho$ is air density), and the cavity volume as an acoustic compliance $C_V = V / (\rho v^2)$ . The resonant frequency of this mass-spring system is:

f_H = \frac{1}{2\pi\sqrt{m \cdot C_V}} = \frac{1}{2\pi} \sqrt{\frac{A / (\rho L)}{V / (\rho v^2)}} = \frac{v}{2\pi}\sqrt{\frac{A}{VL}}

The 6:1 area ratio $A_\text{pharynx} : A_\text{larynx} = 6:1$ makes the laryngeal tube act as a Helmholtz neck relative to the pharyngeal cavity, positioning the resonance near 3 kHz for typical adult male vocal tract geometry. $\square$

Theorem 15.4 (Total Penetration Index)

Define the perceptual penetration index as the sum of the physical SNR and the relevant psychoacoustic advantages:

\text{PI} = \underbrace{\text{SNR}_\text{phys}}_\text{48 dB} + \underbrace{\Delta_\text{Formant}}_\text{+10 dB} + \underbrace{\Delta_\text{Fletcher}}_\text{+10 dB} \approx 49 \text{ dB}

Note: If the 110 dB measurement was taken with the Formant clustering in place, $\Delta_\text{Formant}$ is already embedded in the singer’s SPL figure, and hence in the 48 dB SNR; $\Delta_\text{Fletcher} = +10$ dB is then the pure perceptual bonus. The three layers therefore overlap and are not simply additive (a literal sum of the three terms would give 68 dB). The video presents all three layers separately for pedagogical clarity and states the combined effect as approximately 49 dB.

Under adverse conditions (suboptimal seat, insufficient technique):

\text{PI}_\text{adverse} = 49 - \underbrace{8}_\text{side seat} - \underbrace{10}_\text{no Formant} = 31 \text{ dB}

approaching the ~20 dB empirical threshold below which singers are perceived as “masked.”

Theorem 15.5 (Sabine Reverberation Formula)

The reverberation time $T_{60}$ of a room — the time for a sound to decay 60 dB after the source stops — is given by Sabine’s formula (1900):

T_{60} = \frac{0.161 \cdot V}{A}

where:

$V$ = room volume (m $^3$ )
$A = \sum_i \alpha_i S_i$ = total absorption area (m $^2$ ), with $\alpha_i$ the absorption coefficient of surface $i$ and $S_i$ its area
The constant $0.161 = 4\ln(10^6)/v \approx 4 \times 13.816 / 343 \approx 0.161$ s/m in SI units

Comparison of opera houses:

Hall	$T_{60}$ (mid-frequency)	Character
La Scala, Milan	1.2 s	Clear, articulate
Glyndebourne	1.3 s	Intimate
Royal Opera House	1.4 s	Balanced
Vienna State Opera	1.6 s	Rich
Sydney Opera House	2.0 s	Reverberant, blurry

$C_{80}$ difference between La Scala and Sydney: approximately 4 dB.

Proof.

Sabine’s derivation assumes a diffuse sound field (energy uniformly distributed in all directions). The energy density $E(t)$ in the room decays as:

\frac{dE}{dt} = -\frac{v \cdot A}{4V} E

(each reflection removes a fraction $\bar{\alpha} = A/S_\text{total}$ of the energy hitting each surface). Solving:

E(t) = E_0 \exp\!\left(-\frac{v A}{4V} t\right)

Setting $E(t) = 10^{-6} E_0$ (60 dB decay, factor $10^6$ in power):

t = T_{60} = \frac{4V \ln(10^6)}{v A} = \frac{4V \times 6\ln 10}{v A} = \frac{24 \ln 10}{343} \cdot \frac{V}{A} \approx 0.161 \cdot \frac{V}{A} \quad \square
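As a worked illustration of the formula, the sketch below computes the Sabine constant from $24\ln 10 / v$ and applies it to a hypothetical hall. The volume, surface areas, and absorption coefficients are made-up values chosen only to show the calculation; they do not describe any of the halls in the table.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, as in the text

def sabine_t60(volume_m3, surfaces):
    """Sabine T60 (s). `surfaces` is a list of (area_m2, absorption_coefficient)."""
    absorption = sum(area * alpha for area, alpha in surfaces)  # A = sum(alpha_i * S_i)
    constant = 24.0 * math.log(10.0) / SPEED_OF_SOUND           # about 0.161 s/m
    return constant * volume_m3 / absorption

# Hypothetical hall: 10,000 m^3 with three surface groups (made-up values).
hall = [(1500.0, 0.60),   # audience and seating
        (2000.0, 0.10),   # plaster walls
        (1000.0, 0.05)]   # ceiling
print(round(sabine_t60(10_000.0, hall), 2))
```

Here the total absorption area is 1150 m², giving a reverberation time of about 1.4 s; doubling the absorption would halve it.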

Bayreuth: Architectural Acoustics as Instrument

The Bayreuth Festspielhaus (1876, designed by Wagner) demonstrates that room design can substitute for vocal technique. Wagner submerged the orchestra pit below stage level and covered it with a curved hood that deflects orchestral sound toward the ceiling rather than directly toward the audience.

Effects:

The orchestra’s direct sound to the audience is attenuated by approximately 7 dB through the additional path length and diffraction losses.
The singer’s direct sound arrives at the audience with no obstruction.
The Haas effect (precedence effect): when two sounds arrive within roughly 20–40 ms of each other, the auditory system fuses them and attributes localization to the earlier arrival. The singer’s direct sound precedes the orchestra’s reflected sound, so the voice is perceived as the dominant source even when the later-arriving orchestral sound is somewhat louder (Haas reported the effect holding for a later sound up to about 10 dB stronger).

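Narration: “Wagner sank the orchestra pit below ground, and a curved hood reflects the orchestral sound toward the ceiling. The singer’s direct sound arrives first, and the Haas effect lets the voice float above the orchestra. The orchestra is attenuated by 7 dB, not through the singer’s technique but through architecture.”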

Failure Cases: Quantitative Analysis

Lateral Seating ( $C_{80}$ Loss + High-Frequency Diffraction)

A seat at Ring 3 right side (example: seat B311 at the Canadian Opera Company house) experiences two independent losses compared to center orchestra:

1. Early reflection deficit:

\Delta C_{80} = C_{80}^\text{center} - C_{80}^\text{side} \approx (+3) - (-2) = 5 \text{ dB}

Central seats receive multiple early reflections from lateral walls; side seats in enclosed boxes receive predominantly direct sound plus late reverberation, reducing intelligibility.

2. High-frequency diffraction loss:

At 3000 Hz: wavelength $\lambda = v/f = 343/3000 \approx 11$ cm. Objects of dimension $\sim 10$ cm or larger (seat backs, box partitions) shadow high frequencies effectively. Low frequencies (500 Hz, $\lambda \approx 69$ cm) diffract around these obstacles easily.

Estimated additional HF attenuation for obstructed sightlines: approximately 3 dB at 3 kHz.

Total lateral seat loss: $5 + 3 = 8$ dB reduction in perceived vocal penetration.
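The wavelength argument can be made concrete with a short sketch. The shadowing test below is a rule of thumb (an obstacle shadows sound when it is at least as large as the wavelength), and the 40 cm obstacle size is an assumed example rather than a measured seat dimension.

```python
SPEED_OF_SOUND = 343.0  # m/s

def wavelength_cm(f_hz):
    """Acoustic wavelength in centimetres at frequency f_hz."""
    return 100.0 * SPEED_OF_SOUND / f_hz

def is_shadowed(f_hz, obstacle_cm):
    """Rule of thumb (assumption): an obstacle casts an acoustic shadow
    when it is at least as large as the wavelength."""
    return obstacle_cm >= wavelength_cm(f_hz)

partition_cm = 40.0   # hypothetical box partition or seat back
for f in (500, 3000):
    print(f, round(wavelength_cm(f), 1), is_shadowed(f, partition_cm))
```

For the assumed 40 cm obstacle, 3000 Hz is shadowed while 500 Hz diffracts around it, which is the asymmetry the diffraction-loss estimate relies on.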

Technical Deficiency: Missing Formant Cluster

If the singer does not achieve the 6:1 pharynx-to-larynx area ratio (insufficient laryngeal lowering or pharyngeal expansion), $F_3$ , $F_4$ , $F_5$ remain at their speech positions (dispersed) rather than clustering near 3 kHz. The Formant gain of +10 dB is lost entirely.

Combined adverse scenario:

\text{PI}_\text{adverse} = 49 - 8 - 10 = 31 \text{ dB}

If the room’s $C_{80}$ is also unfavorable (e.g., $C_{80} < -3$ dB), the effective PI drops further, crossing the empirical ~20 dB masking threshold.
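The penalty budget in this section can be written as a small bookkeeping sketch. It takes the 49 dB baseline, the 8 dB lateral-seat loss, the 10 dB Formant loss, and the ~20 dB masking threshold directly from the text; the function and parameter names are illustrative.

```python
MASKING_THRESHOLD_DB = 20.0   # empirical threshold quoted in the text

def penetration_index(baseline_db=49.0, side_seat=False, has_formant=True):
    """Baseline PI minus the penalties listed in the failure cases."""
    pi = baseline_db
    if side_seat:
        pi -= 5.0 + 3.0    # C80 deficit + high-frequency diffraction loss
    if not has_formant:
        pi -= 10.0         # lost Singer's Formant gain
    return pi

worst = penetration_index(side_seat=True, has_formant=False)
print(worst, worst - MASKING_THRESHOLD_DB)   # index and remaining margin
```

In the combined adverse scenario the index is 31 dB, leaving an 11 dB margin above the masking threshold before any further room-related losses.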

Tenor Brightness: Spectral Centroid Measurements

Spectral centroid $f_c$ (Definition 15.5) was computed from recordings of four tenors on comparable high notes (B $\flat$ 4–B4):

Tenor	Spectral Centroid ( $f_c$ )	Brilliance band (5–8 kHz)
Yijie Shi (石倚洁)	3335 Hz	−67 dB (relative)
Juan Diego Flórez	3306 Hz	−68 dB
Dave Monaco (COC, 2026)	3005 Hz	−73.6 dB
Ben Bliss (COC, 2026)	2246 Hz	−90.7 dB

The 17–24 dB gap in the 5–8 kHz brilliance band between Ben Bliss and the other three singers is objectively measurable and perceptually salient. A singer with $f_c \approx 2246$ Hz has most energy concentrated below 3 kHz; the spectral rolloff (the frequency below which 85% of energy falls) is approximately 4818 Hz for Bliss vs. 7354 Hz for Monaco.
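The spectral rolloff mentioned above can be sketched as a cumulative-power search. The two toy spectra below are made-up values chosen to contrast a darker and a brighter distribution; they are not the measured spectra of any singer in the table.

```python
def spectral_rolloff(freqs, powers, fraction=0.85):
    """Lowest frequency at which cumulative power reaches `fraction` of the total."""
    target = fraction * sum(powers)
    running = 0.0
    for f, p in zip(freqs, powers):
        running += p
        if running >= target:
            return f
    return freqs[-1]

# Toy spectra (made-up values): power in 1 kHz-wide bins from 1 to 8 kHz.
freqs = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
dark = [40, 30, 20, 6, 2, 1, 0.5, 0.5]
bright = [20, 20, 20, 15, 12, 6, 4, 3]
print(spectral_rolloff(freqs, dark), spectral_rolloff(freqs, bright))
```

The darker toy spectrum reaches 85% of its power by 3 kHz, the brighter one only by 5 kHz, mirroring the direction of the Bliss and Monaco comparison.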

Narration: “This is not a matter of preference but a measurable acoustic difference.”

This is not a value judgment on vocal quality — a “covered” (voix sombre) sound has different artistic functions than a bright squillo sound. But the acoustic difference is real and measurable, and directly predicts penetration capacity.

Musical Connection

The 3 kHz Window Across the Series

The 3 kHz frequency region appears repeatedly across the arc:

EP02 (Fourier): The spectrum of a complex tone is the sum of harmonic components. The human voice’s harmonic series above $f_0 \sim 200$ Hz reaches 3 kHz at approximately the 15th harmonic — well within the resonance-amplifiable range.
EP09 (Combination tones): The cochlea’s nonlinear response generates combination tones at $|mf_1 - nf_2|$ . For a sung vowel with $F_1 \approx 700$ Hz and $F_2 \approx 1200$ Hz, combination products $2F_2 - F_1 \approx 1700$ Hz and $3F_2 - F_1 \approx 2900$ Hz fall near the Singer’s Formant region.
EP15 (this episode): The Singer’s Formant is the mechanism by which the vocalist intentionally concentrates energy in this region.
EP19 (vibrato): Vibrato-modulated formants create sidebands that sweep through the formant region, dynamically sampling it. The 3 kHz sensitivity window explains why vibrato rate (~5–7 Hz) interacts with formant frequency in ways that maximize perceived warmth.
EP28 (microphone physics): The condenser microphone models the diaphragm as a damped oscillator with resonant frequency in the 2–10 kHz range — the same region the Singer’s Formant targets.

Source-filter separation across instruments

The source-filter decomposition (Definition 15.1) applies beyond the voice:

Violin: The bowed string is the source; the body resonance modes are the filter.
Brass instruments: The lip buzz is the source; the bell and bore are the filter.
Piano: The hammer strike is the source; the soundboard is the filter.

The mathematical structure $S = G \cdot H \cdot R$ is universal. Opera singing is the case where the filter ( $H$ , the vocal tract) is dynamically reconfigured by the performer in real time — a capability not available to instrument players.

Limits and Open Questions

Sabine formula assumes diffuse field. For non-diffuse rooms (long, narrow halls; rooms with strong focusing surfaces) Sabine’s formula can overestimate $T_{60}$ by 20–50%. Eyring’s formula $T_{60} = -0.161 V / [S \ln(1 - \bar{\alpha})]$ is more accurate for high-absorption environments.
Masking threshold is not universal. The empirical “~20 dB SNR for audibility” is a population average. Listeners with musical training show lower masking thresholds (better ability to segregate voices from noise). The cocktail party effect (auditory streaming) provides additional perceptual benefit beyond the physical SNR.
Formant clustering mechanism needs clarification. While Sundberg’s 6:1 area ratio is widely cited, the precise biomechanical mechanism (arytenoid adduction, cricothyroid rotation, laryngeal lowering) remains debated. Recent laryngoscopic studies support multiple pathways to Singer’s Formant production.
Individual vocal tract geometry varies. The 6:1 ratio produces a ~3 kHz resonance for an average adult male vocal tract. Sopranos have shorter vocal tracts; their Singer’s Formant region is higher (~3.5–4 kHz). The model predicts this correctly: smaller $V$ , larger $f_H$ in the Helmholtz formula.
Soprano formant tuning (not in EP15, but related). At high soprano pitches ( $f_0 > 700$ Hz), $F_1$ in normal vowels falls below $f_0$ . The soprano must retune $F_1$ to match $f_0$ to maintain acoustic efficiency, sacrificing vowel intelligibility for loudness. This is a frequency-domain trade-off studied by Joliveau and Sundberg; it forward-connects to EP19 (vibrato and formant interaction).
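The first item in the list above compares Sabine's and Eyring's formulas. The sketch below evaluates both for a hypothetical hall at several mean absorption coefficients; the volume and surface area are made-up values, and the comparison isolates only the absorption dependence, not the non-diffuse-field effects also mentioned there.

```python
import math

def sabine_t60(volume, surface, mean_alpha):
    """Sabine: T60 = 0.161 V / (S * alpha)."""
    return 0.161 * volume / (surface * mean_alpha)

def eyring_t60(volume, surface, mean_alpha):
    """Eyring: T60 = -0.161 V / (S * ln(1 - alpha))."""
    return -0.161 * volume / (surface * math.log(1.0 - mean_alpha))

volume, surface = 10_000.0, 4_500.0   # hypothetical hall: m^3 and m^2
for a in (0.1, 0.3, 0.5):
    s, e = sabine_t60(volume, surface, a), eyring_t60(volume, surface, a)
    print(a, round(s, 2), round(e, 2), f"{100 * (s / e - 1):.0f}%")
```

The ratio of the two estimates is $-\ln(1-\bar{\alpha})/\bar{\alpha}$, independent of room size, so the gap grows from a few percent at low absorption to several tens of percent at $\bar{\alpha} = 0.5$.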

Conjecture (Optimal Formant Placement for Penetration)

For a given orchestral texture, the singer’s optimal formant cluster frequency is the frequency $f^*$ that maximizes the perceptually weighted level margin over the orchestra:

f^* = \arg\max_f \left[ \text{SPL}_\text{singer}(f) - \text{SPL}_\text{orchestra}(f) + w(f) \right]

where $w(f)$ is the ISO 226 perceptual weighting function and $\text{SPL}_\text{singer}(f)$ depends on the singer’s achievable formant placement.

Open question: For different orchestral configurations (Romantic orchestra vs. Baroque ensemble vs. chamber orchestra), does the optimal $f^*$ shift significantly? If so, could this explain stylistic differences in formant placement between Baroque-specialist singers and Verdi tenors?

Academic References

Sundberg, J. (1974). Articulatory interpretation of the “singing formant.” Journal of the Acoustical Society of America, 55(4), 838–844. — Original identification and quantification of the Singer’s Formant; 6:1 area ratio; +8–12 dB clustering gain.
Beranek, L. L. (2004). Concert Halls and Opera Houses: Music, Acoustics, and Architecture (2nd ed.). Springer. — Measured RT60 and C80 for major opera houses worldwide; orchestral SPL measurements.
Fletcher, H., & Munson, W. A. (1933). Loudness, its definition, measurement and calculation. Journal of the Acoustical Society of America, 5(2), 82–108. — Original equal-loudness contour measurements.
ISO 226:2003. Acoustics — Normal equal-loudness-level contours. International Organization for Standardization. — Current standard for equal-loudness data; replaces Fletcher-Munson with modern measurements.
Sabine, W. C. (1900). Reverberation. The American Architect (reprinted in Collected Papers on Acoustics, Sabine, 1922, Harvard University Press). — Original derivation of the reverberation formula.
Meyer, J. (2009). Acoustics and the Performance of Music (5th ed.). Springer. — Orchestral frequency spectra by section; instrument directivity patterns; ensemble SPL measurements. Source for $\alpha \approx 9$ dB/octave attenuation figure.
Ternström, S. (1988). Acoustical aspects of choir singing. Ph.D. dissertation, KTH Stockholm. — Vocal tract parameter measurements; Helmholtz resonator parameters for male singers.
Haas, H. (1951). Über den Einfluss eines Einfachechos auf die Hörsamkeit von Sprache. Acustica, 1, 49–58. (English translation in Journal of the Audio Engineering Society, 1972, 20(2), 146–159.) — The Haas (precedence) effect; perceptual fusion and localization dominance of early sound.
Rossing, T. D., Moore, F. R., & Wheeler, P. A. (2002). The Science of Sound (3rd ed.). Addison-Wesley. Ch. 17 (Voice and Speech). — Source-filter model derivation; formant theory; vocal tract acoustics.
Plomp, R. (1970). Timbre as a multidimensional attribute of complex tones. In Frequency Analysis and Periodicity Detection in Hearing, ed. Plomp & Smoorenburg. Sijthoff. — Perceptual dimensions of timbre; spectral centroid as brightness correlate.
