EP41: The Mathematics of Source Separation — NMF to Demucs
Overview / 概述
In 1999, Daniel Lee and H. Sebastian Seung published a paper in Nature showing that a non-negative matrix can be approximately factored into two low-rank, non-negative factors whose parts are individually interpretable. Twenty-five years later, the descendants of that idea run on your phone, separating vocals from accompaniment in real time. The key insight is deceptively simple: a song is a matrix, and separating sources means factoring that matrix into interpretable parts.
The journey from that 1999 paper to modern systems like HTDemucs spans three conceptual layers. First, non-negative matrix factorization (NMF) provides a mathematically clean linear decomposition of the spectrogram. Second, spectral masking — in particular Wiener filtering — turns that decomposition into an optimal signal estimate under a precise probabilistic model. Third, deep architectures (U-Net, Hybrid Demucs, HTDemucs) replace the linear assumption with learned non-linear representations, achieving signal-to-distortion ratios that would have seemed impossible in 1999.
This episode traces all three layers, making explicit where the mathematics is exact (the Wiener mask is provably MMSE-optimal under its assumptions), where it is a principled heuristic (NMF finds a local minimum, not a global one), and where fundamental signal-processing limits remain (the phase problem is a hard barrier, not an engineering shortcut).
Narration: "In 1999, two people wrote a paper. Twenty-five years later, it runs on your phone, pulling the vocals out of the accompaniment. In that paper, a song is just a matrix."
Prerequisites / 前置知识
- Short-Time Fourier Transform and Spectrograms (EP35) — the STFT that produces the matrix we decompose
- Filter Banks and Mel Spectrogram (EP36) — frequency resolution and perceptual weighting underlying the spectrogram representation
Definitions
41.1 — The Spectrogram Matrix
Let $x(t)$ be a mixed audio signal sampled at rate $f_s$. Apply the Short-Time Fourier Transform (STFT) with window length $N$ and hop size $R$ to obtain the complex matrix $X \in \mathbb{C}^{F \times T}$, where $F$ is the number of frequency bins and $T$ is the number of time frames.
The magnitude spectrogram is $V = |X| \in \mathbb{R}_{\ge 0}^{F \times T}$, where $|\cdot|$ denotes element-wise absolute value.
Typical dimensions for a four-minute pop song at 44.1 kHz with a 2048-point window and 50% overlap (hop 1024): $F = 1025$, $T \approx 10{,}000$, so $V$ has roughly 10 million entries.
Worked example. A single sinusoid at 440 Hz produces a column of $V$ that is nearly zero everywhere except the row corresponding to the 440 Hz bin. A chord of three sinusoids activates three rows simultaneously. A real recording activates hundreds of rows with complex temporal patterns — which is exactly what NMF will try to explain as a product of a few fundamental templates.
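This worked example can be checked numerically. The sketch below (a minimal NumPy/SciPy demonstration; the tone length and window size are illustrative choices) builds the spectrogram of a 440 Hz tone and confirms that the energy concentrates in a single frequency row:

```python
import numpy as np
from scipy.signal import stft

# One second of a pure 440 Hz tone at 44.1 kHz.
fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# STFT with a 2048-point window: F = 2048/2 + 1 = 1025 frequency bins.
freqs, frames, X = stft(x, fs=fs, nperseg=2048)
V = np.abs(X)  # magnitude spectrogram

# The average column should peak at the bin nearest 440 Hz.
peak_bin = V.mean(axis=1).argmax()
print(freqs[peak_bin])  # within one bin width (~21.5 Hz) of 440 Hz
```

With a 2048-point window the bin spacing is $44100/2048 \approx 21.5$ Hz, so the peak lands on the bin nearest 440 Hz rather than exactly on it — a first glimpse of the time–frequency resolution tradeoff from EP35.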
41.2 — Non-Negative Matrix Factorization (NMF)
Given $V \in \mathbb{R}_{\ge 0}^{F \times T}$ and a rank parameter $K \ll \min(F, T)$, NMF seeks matrices $W \in \mathbb{R}_{\ge 0}^{F \times K}$ and $H \in \mathbb{R}_{\ge 0}^{K \times T}$ minimizing a reconstruction cost, most commonly the Frobenius-norm objective:

$$\min_{W \ge 0,\, H \ge 0} \; \|V - WH\|_F^2 = \sum_{f,t} \big( V_{ft} - (WH)_{ft} \big)^2.$$
The columns of are called dictionary atoms (timbre templates); the rows of are called activation envelopes.
Worked example. Set $K = 2$ for a vocal + guitar separation. After convergence, $w_1$ (the first column of $W$) might look like a broad spectral envelope concentrated in the 200–3000 Hz formant range (human voice), while $w_2$ resembles a harmonic comb at guitar string frequencies. The row $h_1$ of $H$ peaks when the singer is active; $h_2$ follows the guitar strumming pattern.
Narration: "W is the dictionary matrix; each column is a timbre template — say, a guitar's spectral profile, or the formant distribution of a human voice. H is the activation matrix; each row is the activation strength of the corresponding template along the time axis."
NMF vs SVD. Singular value decomposition (SVD) achieves the global minimum of $\|V - \hat{V}\|_F$ among all rank-$K$ approximations (the Eckart–Young theorem), but its factors can be negative. Negative entries in a frequency matrix have no physical meaning (there is no such thing as negative spectral energy). NMF enforces $W \ge 0$ and $H \ge 0$, which means each atom represents a genuine non-negative frequency profile that can be interpreted as a timbre shape.
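A quick numerical illustration of this contrast (a NumPy sketch on a toy matrix, not any particular song): the best rank-2 SVD factors of a strictly positive matrix essentially always contain negative entries, because every singular vector after the first must be orthogonal to a strictly positive one:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 8))  # a strictly positive toy "spectrogram"

# Best rank-2 approximation via SVD (Frobenius-optimal).
U, s, Vt = np.linalg.svd(V, full_matrices=False)
W_svd = U[:, :2] * s[:2]   # analogue of the dictionary W
H_svd = Vt[:2]             # analogue of the activations H

# The factors achieve the optimal rank-2 reconstruction error...
err = np.linalg.norm(V - W_svd @ H_svd)

# ...but contain negative entries, which have no spectral interpretation.
has_negative = bool((W_svd < 0).any() or (H_svd < 0).any())
print(has_negative)  # True
```

The reconstruction is optimal in the Frobenius sense, yet the "timbre templates" in `W_svd` oscillate in sign and cancel against `H_svd` — exactly the physical uninterpretability the text describes.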
41.3 — Lee–Seung Multiplicative Update Rules
For the Frobenius objective, the updates are:

$$H \leftarrow H \odot \frac{W^\top V}{W^\top W H}, \qquad W \leftarrow W \odot \frac{V H^\top}{W H H^\top},$$

where $\odot$ and the fraction bars denote element-wise multiplication and division. The key property of these rules is that the multiplying factor is a ratio of non-negative quantities whenever $V, W, H \ge 0$, so multiplying the current value of each entry by this factor can never introduce a negative entry. Non-negativity is preserved exactly, at every step, by construction.
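As a sanity check, here is a minimal NumPy implementation of the updates (the rank, iteration count, and `eps` smoothing term are illustrative choices, not prescribed by the original paper):

```python
import numpy as np

def nmf(V, K, n_iter=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||V - WH||_F^2, V >= 0."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        # Each factor is scaled by a ratio of non-negative quantities,
        # so no entry can ever turn negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# On an exactly rank-2 non-negative matrix, the reconstruction error
# should shrink toward zero.
rng = np.random.default_rng(1)
V = rng.random((30, 2)) @ rng.random((2, 40))
W, H = nmf(V, K=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The `eps` in the denominators guards against division by zero; it does not change the fixed points in any practically relevant way.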
41.4 — Spectral Masking
A spectral mask $M \in [0, 1]^{F \times T}$ is applied element-wise to the magnitude spectrogram to estimate a target source: $\hat{S} = M \odot V$.
The Ideal Binary Mask (IBM) sets each entry to 1 if the target source dominates at that time-frequency bin and 0 otherwise:

$$M_{\mathrm{IBM}}(f,t) = \begin{cases} 1 & \text{if } |S_{\text{target}}(f,t)|^2 > |S_{\text{rest}}(f,t)|^2, \\ 0 & \text{otherwise.} \end{cases}$$
The Wiener soft mask distributes energy proportionally:

$$M_k(f,t) = \frac{|S_k(f,t)|^2}{\sum_j |S_j(f,t)|^2},$$

where the sum runs over all sources.
Worked example. At a time-frequency bin where the vocal power is 3 units and the instrumental power is 1 unit, the IBM gives $M = 1$ (full vocal, discard instrument), while the Wiener mask gives $M = 3/(3+1) = 0.75$ (a soft blend that retains 75% of the mixture energy for the vocal estimate). The Wiener mask is smoother, avoids the hard binary cuts that produce musical noise artifacts, and is provably optimal under the correct model (see Theorem 41.2 below).
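The arithmetic of the worked example, in a few lines (the unit powers are just the numbers above):

```python
# Powers at a single time-frequency bin: vocal 3 units, instrumental 1 unit.
v_vocal, v_instr = 3.0, 1.0

# Ideal Binary Mask: all-or-nothing, keyed to whichever source dominates.
ibm = 1.0 if v_vocal > v_instr else 0.0

# Wiener soft mask: proportional energy split.
wiener = v_vocal / (v_vocal + v_instr)

print(ibm, wiener)  # 1.0 0.75
```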
Narration: "Given the decomposition, how do we separate? The answer is spectral masking. For each time-frequency cell, compute the energy ratio between the vocal component and the accompaniment component, yielding a soft mask matrix M."
41.5 — Signal-to-Distortion Ratio (SDR)
Given a reference source $s$ and an estimate $\hat{s}$, the simplest form of the SDR is

$$\mathrm{SDR} = 10 \log_{10} \frac{\|s\|^2}{\|\hat{s} - s\|^2} \quad \text{(dB)}.$$

Higher is better. The BSS Eval methodology used in the benchmarks below refines this by decomposing the estimation error into interference, noise, and artifact terms.
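A minimal NumPy sketch of this plain (non-BSS-Eval) SDR; the 10%-amplitude interfering tone is an illustrative stand-in for accompaniment bleed:

```python
import numpy as np

def sdr_db(reference, estimate):
    """SDR = 10*log10(||s||^2 / ||s_hat - s||^2), in decibels."""
    err = estimate - reference
    return 10 * np.log10(np.sum(reference**2) / np.sum(err**2))

t = np.arange(8000) / 8000.0
s = np.sin(2 * np.pi * 440 * t)            # reference vocal
leak = 0.1 * np.sin(2 * np.pi * 600 * t)   # 10%-amplitude bleed
print(round(sdr_db(s, s + leak), 1))       # 20.0 dB
```

A 10% amplitude leak is a 1% power leak, hence $10 \log_{10}(100) = 20$ dB — a useful mental calibration point when reading the benchmark table below.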
Main Theorems / 主要定理
Theorem 41.1 — Non-negativity Preservation
If $V \ge 0$ and the initial $W, H$ have strictly positive entries, then every iterate produced by the Lee–Seung multiplicative updates satisfies $W \ge 0$ and $H \ge 0$. Proof sketch: each update multiplies the current entry by a ratio of sums of products of non-negative numbers, which is itself non-negative.
Remark. This is the central advantage of NMF over SVD for spectrograms. SVD computes the globally optimal low-rank approximation in the Frobenius norm, but its factors are unconstrained and will contain negative entries whenever the data matrix has mixed-sign structure. NMF sacrifices global optimality (Theorem 41.3 addresses this) to maintain the non-negativity that gives the factors physical meaning.
Theorem 41.2 — Wiener Mask MMSE Optimality
Assume the mixture is additive: $X(f,t) = \sum_k S_k(f,t)$. Assume the sources are mutually independent and each STFT coefficient $S_k(f,t)$ is modeled as a zero-mean complex Gaussian with variance $v_k(f,t)$.
Under these assumptions, the minimum mean squared error (MMSE) estimate of $S_k(f,t)$ given the mixture is:

$$\hat{S}_k(f,t) = \mathbb{E}\big[ S_k(f,t) \mid X(f,t) \big] = \frac{v_k(f,t)}{\sum_j v_j(f,t)} \, X(f,t),$$

where $M_k(f,t) = v_k(f,t) / \sum_j v_j(f,t)$ is exactly the Wiener soft mask with true source powers.
Under the zero-mean complex Gaussian model, the conditional expectation $\mathbb{E}[S_k \mid X]$ is computed by Bayes' theorem. Since the joint distribution of $(S_k, X)$ is Gaussian (sums of independent Gaussians are Gaussian), the conditional mean is linear in $X$.
The Wiener filter coefficient is the ratio of the cross-covariance $\mathbb{E}[S_k \overline{X}] = v_k$ to the mixture variance $\mathbb{E}[|X|^2] = \sum_j v_j$:

$$\hat{S}_k = \frac{\mathbb{E}[S_k \overline{X}]}{\mathbb{E}[|X|^2]} \, X = \frac{v_k}{\sum_j v_j} \, X,$$

where we used independence ($\mathbb{E}[S_k \overline{S_j}] = 0$ for $j \ne k$) and additivity. This coefficient is precisely $M_k$. For Gaussian distributions, the conditional mean is the MMSE estimator, completing the proof.
Practical use. In an NMF-based system, the true source variances $v_k$ are unknown. They are approximated by the NMF reconstructions: $\hat{v}_{\text{vocal}}(f,t) = \big( (W_{\text{vocal}} H_{\text{vocal}})_{ft} \big)^2$, where $W_{\text{vocal}}$ contains the columns of $W$ assigned to the vocal source (and $H_{\text{vocal}}$ the corresponding rows of $H$). The Wiener mask is then computed from these approximated variances.
Narration: "Under the assumptions of additive mixing, independent sources, and a local complex Gaussian model, the Wiener soft mask is the optimal estimate in the minimum mean squared error sense — when these assumptions hold, this is a rigorously provable result, not a heuristic."
Theorem 41.3 — Monotone Cost Decrease of Lee–Seung Updates
Under the multiplicative updates of Definition 41.3, the objective $\|V - WH\|_F^2$ is non-increasing: each update of $H$ (with $W$ fixed) and each update of $W$ (with $H$ fixed) does not increase the cost. The proof (Lee & Seung, 2000) constructs an auxiliary function that majorizes the objective and is minimized exactly by the multiplicative step.
Important caveat. Theorem 41.3 guarantees that the cost never increases. It does not guarantee convergence to the global minimum. The objective is non-convex in the pair $(W, H)$ jointly — the algorithm finds a local minimum whose quality depends on the initialization.
Narration: "The algorithm guarantees the error never increases, but the final solution depends on the initialization; global optimality is not guaranteed."
Numerical Examples
Example 41.A — Two-Component NMF on a Toy Spectrogram
Consider a spectrogram constructed as the sum of two rank-1 components:

$$V = w_1 h_1^\top + w_2 h_2^\top, \qquad w_k \in \mathbb{R}_{\ge 0}^{F}, \; h_k \in \mathbb{R}_{\ge 0}^{T}.$$
Here $w_1$ is concentrated in the low-frequency rows (a bass-like timbre), while $w_2$ is concentrated in the high-frequency rows (a treble-like timbre). Initializing NMF with $K = 2$ and random positive $W, H$, the Lee–Seung updates converge (up to column scaling and permutation) to the true factors within approximately 50 iterations.
Wiener mask computation. Approximating source variances by the NMF reconstruction:

$$\hat{v}_k(f,t) = \big( w_k(f) \, h_k(t) \big)^2, \qquad M_1(f,t) = \frac{\hat{v}_1(f,t)}{\hat{v}_1(f,t) + \hat{v}_2(f,t)}.$$
At a bin $(f,t)$ in the low-frequency rows during a frame where only the bass component is active, $\hat{v}_1(f,t) > 0$ and $\hat{v}_2(f,t) \approx 0$, so $M_1(f,t) \approx 1$ (pure low-frequency component). At a high-frequency bin in a transition frame, $\hat{v}_1 \approx 0$ while $\hat{v}_2 > 0$, so $M_1 \approx 0$ for the low source, correctly suppressed.
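Example 41.A can be reproduced end to end. The sketch below builds two rank-1 components with non-overlapping frequency supports (the specific template shapes are illustrative), runs the multiplicative updates, and checks that the Wiener mask saturates in the low-frequency, bass-active region (up to NMF's arbitrary component ordering):

```python
import numpy as np

eps = 1e-9
rng = np.random.default_rng(0)

# Rank-1 "bass" (low rows, early frames) + rank-1 "treble" (high rows, late frames).
w1 = np.array([1., 1., 1., 1., 0., 0., 0., 0.])
w2 = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
h1 = np.r_[np.ones(6), np.zeros(6)]
h2 = np.r_[np.zeros(6), np.ones(6)]
V = np.outer(w1, h1) + np.outer(w2, h2)

# Lee-Seung multiplicative updates with K = 2.
W = rng.random((8, 2)) + eps
H = rng.random((2, 12)) + eps
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Wiener mask from the per-component power reconstructions.
v1 = np.outer(W[:, 0], H[0]) ** 2
v2 = np.outer(W[:, 1], H[1]) ** 2
M1 = v1 / (v1 + v2 + eps)

# In the low-frequency / early-time block the mask should saturate toward
# 0 or 1, depending on which component NMF assigned to the bass.
m = M1[:4, :6].mean()
mask_saturation = max(m, 1.0 - m)
```

The `max(m, 1 - m)` step resolves the permutation ambiguity noted in the example: NMF may label the bass as component 1 or component 2, but either way the mask in the bass-active region is nearly binary.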
Example 41.B — SDR Comparison Table
The following table summarizes representative vocal SDR scores reported in the benchmark literature for systems built on the MUSDB18 dataset:
| System | Architecture | Vocal SDR (dB) |
|---|---|---|
| Open-Unmix (2019) | Bidirectional LSTM on magnitude spectrogram | ~6.3 |
| Demucs v2 (2021) | Time-domain convolutional encoder–decoder | ~7.0 |
| Hybrid Demucs (2021) | Time + frequency dual-path | ~7.7 |
| HTDemucs (2023) | Hybrid + cross-attention Transformer | ~9.2 |
Each generational gain — ranging from roughly 0.7 to 1.5 dB in this table — corresponds to a perceptible reduction in residual bleed from the accompaniment into the vocal track.
Musical Connection / 音乐联系
Source separation is the inverse problem of mixing. Every recorded song is a superposition of individually performed parts; separation attempts to undo that superposition. The mathematics reflects the acoustic physics: because different instruments occupy different regions of the time-frequency plane, the spectrogram matrix is approximately low-rank, which is why NMF works at all.
Why non-negativity matters musically. Spectral energy is inherently non-negative: a microphone cannot detect negative sound pressure in the sense of a negative spectrogram entry. NMF's insistence on $W, H \ge 0$ means each dictionary atom represents a plausible timbre profile — a physical frequency shape that a real instrument could produce. SVD's left and right singular vectors, by contrast, will routinely have negative entries that cancel when multiplied, giving mathematically optimal reconstruction at the cost of physically uninterpretable factors.
From NMF to Demucs — what changed musically. NMF treats each time-frequency bin in isolation; a guitar note at 330 Hz and a vocal note at 330 Hz cannot be distinguished by frequency alone. Deep architectures break this local bottleneck by learning temporal context: the neural network “knows” that a vocal note has a characteristic vibrato pattern and a characteristic spectral envelope evolution, while a guitar note has a different attack and decay shape. This contextual reasoning is what pushes SDR from ~6 dB (NMF era) to ~9 dB (HTDemucs).
The HTDemucs dual-domain design mirrors a perceptual insight: the human auditory system simultaneously tracks fine temporal structure (important for rhythmic patterns and consonants in speech) and spectral structure (important for timbre and vowel identity). HTDemucs processes the waveform directly in its time-domain branch (capturing phase and fine timing) and the spectrogram in its frequency-domain branch (capturing harmonic relationships). The cross-attention mechanism allows these two representations to inform each other — analogous to how a skilled listener uses both timing cues and spectral cues simultaneously.
Narration: "What chords are hiding inside the separated vocals and accompaniment? Next episode: from chroma vectors to hidden Markov models, the mathematics behind chord recognition."
Limits and Open Questions / 局限性与开放问题
The Three Fundamental Barriers
The narration identifies three limits that are not engineering shortcomings but reflect deep constraints in signal processing and statistical learning.
1. The Phase Problem. The spectrogram discards the phase of the STFT coefficients. To reconstruct an audio waveform from an estimated magnitude spectrogram $\hat{V}$, one must estimate the phase. The classical approach is the Griffin–Lim algorithm (iterative projection between the time and frequency domains), which converges to a consistent phase but not necessarily the correct one. Incorrect phase manifests as metallic or "phasey" artifacts perceptible to any listener. Time-domain architectures like Demucs sidestep this by never converting to the magnitude spectrogram at all, but then they must implicitly learn phase relationships from data.
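The Griffin–Lim iteration itself is short. Below is a minimal NumPy sketch using a hand-rolled Hann-window STFT with 50% overlap (the frame size, hop, and iteration count are illustrative choices). It recovers *a* phase whose spectrogram is consistent with the given magnitude — which is exactly what Griffin–Lim promises, and no more: the recovered waveform need not match the original phase.

```python
import numpy as np

N, HOP = 256, 128
WIN = np.hanning(N)

def stft_(x):
    """Hann-windowed STFT, shape (freq, frames)."""
    frames = [WIN * x[i:i + N] for i in range(0, len(x) - N + 1, HOP)]
    return np.array([np.fft.rfft(f) for f in frames]).T

def istft_(X):
    """Least-squares overlap-add inverse of stft_."""
    T = X.shape[1]
    out = np.zeros((T - 1) * HOP + N)
    norm = np.zeros_like(out)
    for t in range(T):
        out[t * HOP:t * HOP + N] += WIN * np.fft.irfft(X[:, t], N)
        norm[t * HOP:t * HOP + N] += WIN ** 2
    return out / np.maximum(norm, 1e-12)

def griffin_lim(mag, n_iter=100, seed=0):
    """Alternate between time-domain consistency and the magnitude constraint."""
    rng = np.random.default_rng(seed)
    X = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft_(X)                                # project to a real waveform
        X = mag * np.exp(1j * np.angle(stft_(x)))    # keep phase, restore magnitude
    return istft_(X)

# Reconstruct a 440 Hz tone from its magnitude spectrogram alone.
fs = 8000
x_true = np.sin(2 * np.pi * 440 * np.arange(2048) / fs)
mag = np.abs(stft_(x_true))
x_rec = griffin_lim(mag)

# Consistency: the reconstruction's magnitude matches the target magnitude.
consistency = np.linalg.norm(np.abs(stft_(x_rec)) - mag) / np.linalg.norm(mag)
```

The `consistency` residual shrinks with iterations, but `x_rec` may still differ audibly from `x_true` — the gap between spectrogram consistency and phase correctness is precisely the barrier described above.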
2. Time-Frequency Overlap. When two sources have identical frequency content at the same time instant, the corresponding STFT bin is a superposition of their contributions, and a single scalar mask value cannot recover either source without error. This is an ill-posed (not merely hard) problem at the level of local information: no algorithm that looks only at the single bin value $X(f,t)$ can do better than guessing. Deep networks partially circumvent this by using surrounding frames and harmonic priors to constrain the estimate — but information genuinely lost at a bin is not recoverable from that bin alone.
3. Out-of-Distribution Generalization. Deep models are trained on specific data distributions (MUSDB18 consists predominantly of Western popular music). A model that achieves 9.2 dB vocal SDR on pop music may perform significantly worse on a guqin solo or a Hindustani classical recording, because those timbres are absent from the training set. This is the standard bias-variance tradeoff of statistical learning, not a special deficiency of source separation.
Narration: "The first two are fundamental constraints of signal processing; the third is the generalization boundary of statistical learning."
Open Conjectures
Conjecture 41.A — Magnitude–phase incompatibility. Statement. For a single-channel mixture of two harmonically incoherent sources, no time-frequency masking algorithm (however sophisticated the mask estimation) can simultaneously achieve the MMSE-optimal magnitude estimate and a phase estimate with bounded error in expectation, unless additional structural constraints (e.g., source periodicity) are imposed.
Reasoning. The magnitude-phase decoupling in the STFT means that an optimal mask for the magnitude carries no information about the phase of $S_k$ at bins where both sources contribute. The complex Wiener filter sidesteps this by applying the mask to the complex STFT (not the magnitude), inheriting the mixture phase — which is itself a biased estimate of either source's true phase.
Falsification criterion. Exhibit a mask-based algorithm that achieves both MMSE magnitude reconstruction and phase estimation error strictly below the mixture-phase baseline, on a corpus with controlled harmonic overlap, without using source-specific phase models.
Conjecture 41.B — Perceptual SDR ceiling. Statement. For the current MUSDB18 benchmark (150 tracks, four-stem separation), there exists a perceptual SDR ceiling in the range 12–14 dB for vocal separation beyond which further gains are inaudible to human listeners and driven purely by the evaluation metric's sensitivity to sub-perceptual artifacts.
Reasoning. Informal listening tests suggest that at ~10–11 dB SDR, residual accompaniment bleed is below the threshold of annoyance for most listeners in most contexts. Gains above this level may optimize metric behavior rather than perceptual quality.
Falsification criterion. A rigorous ABX listening study with ≥30 listeners showing statistically significant preference (p < 0.01) for outputs above 12 dB SDR over outputs at 10 dB SDR, controlling for loudness normalization and anchor conditions (ITU-R BS.1534 MUSHRA methodology).
Conjecture 41.C — Dual-domain necessity. Statement. Any source separation architecture that processes only the time domain or only the frequency domain has a representational gap that prevents it from matching the performance of a properly designed dual-domain architecture (one that maintains time-domain and frequency-domain representations simultaneously with cross-domain information exchange) on a sufficiently diverse musical dataset.
Reasoning. The time domain encodes precise phase and transient information that is destroyed by the magnitude STFT; the frequency domain encodes harmonic structure that is difficult to extract from raw waveforms without very large receptive fields. Neither domain alone is complete.
Falsification criterion. A single-domain architecture (pure time-domain or pure frequency-domain) matching or exceeding a dual-domain baseline by ≥0.5 dB SDR on a corpus spanning at least five musical genres, with matched parameter count and training data.
Academic References / 参考文献
- Lee, D. D. & Seung, H. S. (1999). “Learning the Parts of Objects by Non-negative Matrix Factorization.” Nature 401, 788–791.
- Lee, D. D. & Seung, H. S. (2000). “Algorithms for Non-negative Matrix Factorization.” Advances in Neural Information Processing Systems (NeurIPS) 13, 556–562. — Original proof of the multiplicative update monotonicity theorem.
- Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT Press. — Foundation of Wiener filtering.
- Wang, D. L. & Brown, G. J. (2006). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press / Wiley. — Classic reference on the IBM and computational cocktail party problem.
- Févotte, C., Bertin, N. & Durrieu, J.-L. (2009). “Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis.” Neural Computation 21(3), 793–830.
- Smaragdis, P. & Brown, J. C. (2003). “Non-negative Matrix Factorization for Polyphonic Music Transcription.” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 177–180.
- Griffin, D. & Lim, J. (1984). “Signal Estimation from Modified Short-Time Fourier Transform.” IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2), 236–243. — Griffin–Lim phase reconstruction algorithm.
- Ronneberger, O., Fischer, P. & Brox, T. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation.” MICCAI 2015. — Original U-Net paper.
- Stöter, F.-R., Liutkus, A. & Ito, N. (2019). “Open-Unmix — A Reference Implementation for Music Source Separation.” Journal of Open Source Software 4(41), 1667.
- Défossez, A., Usunier, N., Bottou, L. & Bach, F. (2019). “Music Source Separation in the Waveform Domain.” arXiv:1911.13254. — Original Demucs.
- Défossez, A. (2021). “Hybrid Spectrogram and Waveform Source Separation.” ISMIR Workshop on Music Source Separation.
- Rouard, S., Massa, F. & Défossez, A. (2023). “Hybrid Transformers for Music Source Separation.” ICASSP 2023. — HTDemucs: cross-attention dual-domain architecture achieving ~9.2 dB vocal SDR.
- Cano, E., FitzGerald, D., Liutkus, A., Plumbley, M. D. & Stöter, F.-R. (2019). “Musical Source Separation: An Introduction.” IEEE Signal Processing Magazine 36(1), 31–40.
- Févotte, C. & Idier, J. (2011). “Algorithms for Nonnegative Matrix Factorization with the β-Divergence.” Neural Computation 23(9), 2421–2456.
- Liutkus, A. & Badeau, R. (2015). “Generalized Wiener Filtering with Fractional Power Spectrograms.” ICASSP 2015. — Extensions of Wiener masking beyond the Gaussian model.