
EP52: Optimal Transport & Melody

Wasserstein Distance, Kantorovich Duality, and the Earth Mover's Metric on Pitch

Metric Geometry · Optimization · Musicology

Prerequisites

EP04 All-Interval Rows and ℤ₁₂; EP08 Shepard Tones and the Tritone Paradox

Overview

Two melodies stand before you. How far apart are they? Not the number of notes that differ, not a key-signature gap — but the minimum cost of transporting one distribution of pitches into the other. That cost is the Wasserstein distance, also called the Earth Mover’s Distance (EMD).

Optimal transport (OT) reframes melody comparison as a logistics problem: notes are piles of earth at different pitch locations, and we want to move every grain of source-earth onto the matching target location with the smallest total effort. The decisive insight is that this minimisation problem has a dual formulation — the Kantorovich duality — that connects geometry to functional analysis and turns melody distance into a computable quantity.

This episode introduces the transport-plan matrix, the Kantorovich primal–dual theorem, the Sinkhorn algorithm for fast approximation, and three musical applications: style similarity, plagiarism detection, and tonal distance along the circle of fifths.

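Summary: “How ‘far apart’ are two melodies? Not the difference in the number of notes, not the difference in key, but the minimum cost of ‘transporting’ one melody into the other. This is called the Wasserstein distance.”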

Prerequisites

All-Interval Rows and ℤ₁₂ (EP04) — integer modulo-12 distances between pitch classes; the algebraic perspective on pitch proximity
Pitch-Space Topology (EP08) — continuous deformations of pitch space; the topological perspective on note relationships

Definitions

Definition 52.1 (Melody as a Probability Measure)

Let $\mathcal{X} = \{0, 1, \ldots, 11\}$ denote the set of pitch classes (mod 12), or more generally let $\mathcal{X} \subset \mathbb{R}$ be a finite set of MIDI note numbers.

A melody of length $n$ with notes $x_1, x_2, \ldots, x_n \in \mathcal{X}$ and non-negative amplitude weights $a_1, \ldots, a_n \geq 0$ with $\sum_i a_i = 1$ defines a discrete probability measure:

$\mu = \sum_{i=1}^{n} a_i \, \delta_{x_i}$

where $\delta_{x}$ is the Dirac mass at pitch $x$ . The weight $a_i$ encodes the relative importance (duration, velocity, or uniform $1/n$ ) of note $x_i$ .

Worked example. A three-note motif C–E–G with equal weights: $\mu = \tfrac{1}{3}\delta_0 + \tfrac{1}{3}\delta_4 + \tfrac{1}{3}\delta_7$ (using MIDI pitch classes C=0, E=4, G=7).
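A minimal sketch of Definition 52.1 in Python. The helper name `melody_measure` is hypothetical (it does not come from the episode); it defaults to uniform weights and merges repeated pitches into a single Dirac mass.

```python
from collections import defaultdict
from fractions import Fraction

def melody_measure(pitches, weights=None):
    """Return {pitch: mass} with masses summing to 1 (hypothetical helper)."""
    if weights is None:
        weights = [1] * len(pitches)      # uniform weights, i.e. 1/n each
    if len(weights) != len(pitches) or any(w < 0 for w in weights):
        raise ValueError("need one non-negative weight per note")
    total = sum(Fraction(w) for w in weights)
    if total == 0:
        raise ValueError("weights must not all be zero")
    mu = defaultdict(Fraction)
    for x, w in zip(pitches, weights):
        mu[x] += Fraction(w) / total      # repeated pitches accumulate mass
    return dict(mu)

motif = melody_measure([0, 4, 7])         # C-E-G with weight 1/3 each
print(motif)
```

Exact fractions are used here only so that the masses sum to 1 without rounding; floats would work equally well for larger data.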

Definition 52.2 (Ground Cost and Cost Matrix)

A ground cost is a function $c : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ measuring the cost of transporting one unit of “earth” from pitch $x$ to pitch $y$ .

For melody comparison the standard choice is the $L^1$ pitch distance: $c(x, y) = |x - y|$

(in semitones). Given source notes $x_1, \ldots, x_m$ and target notes $y_1, \ldots, y_n$ , the cost matrix is: $C_{ij} = c(x_i, y_j) = |x_i - y_j|, \quad 1 \leq i \leq m, \; 1 \leq j \leq n$

Example. Moving pitch C (0) to G (7) costs 7; moving C to D (2) costs 2.
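The cost matrix of Definition 52.2 can be sketched in a few lines of Python. The function name `cost_matrix` is illustrative, and pitches are assumed to be numbers measured in semitones.

```python
def cost_matrix(source, target):
    """C[i][j] = |x_i - y_j| in semitones (L1 ground cost)."""
    return [[abs(x - y) for y in target] for x in source]

# Source C-E-G (0, 4, 7) against target D-G (2, 7).
C = cost_matrix([0, 4, 7], [2, 7])
for row in C:
    print(row)
```

Row `i` lists the cost of sending source note `x_i` to each target note, matching the row/column convention used for transport plans.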

Definition 52.3 (Transport Plan)

Given source measure $\mu = \sum_i a_i \delta_{x_i}$ and target measure $\nu = \sum_j b_j \delta_{y_j}$ , a transport plan is a matrix $\gamma \in \mathbb{R}^{m \times n}_{\geq 0}$ satisfying the marginal constraints:

$\sum_{j=1}^{n} \gamma_{ij} = a_i \quad \forall i, \qquad \sum_{i=1}^{m} \gamma_{ij} = b_j \quad \forall j$

The entry $\gamma_{ij}$ specifies how much mass is moved from source note $x_i$ to target note $y_j$ .

The set of all transport plans is denoted $\Pi(\mu, \nu)$ and is a convex polytope.

Musical reading. Row sums equal source note weights; column sums equal target note weights. Each row describes how one source note is “spread” across target notes; each column describes which sources contribute to one target note.
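The marginal constraints are easy to check mechanically. The sketch below uses hypothetical helper names and takes the independent coupling $\gamma_{ij} = a_i b_j$ as a plan that is always feasible; it tests non-negativity, row sums, and column sums.

```python
from fractions import Fraction

def is_transport_plan(gamma, a, b, tol=0):
    """Check non-negativity and both marginal constraints of Definition 52.3."""
    if len(gamma) != len(a) or any(len(row) != len(b) for row in gamma):
        return False
    nonneg = all(g >= 0 for row in gamma for g in row)
    rows_ok = all(abs(sum(row) - ai) <= tol for row, ai in zip(gamma, a))
    cols_ok = all(abs(sum(col) - bj) <= tol for col, bj in zip(zip(*gamma), b))
    return nonneg and rows_ok and cols_ok

def independent_coupling(a, b):
    """The product plan gamma_ij = a_i * b_j, feasible for any marginals."""
    return [[ai * bj for bj in b] for ai in a]

a = [Fraction(1, 3)] * 3                  # three source notes
b = [Fraction(1, 2)] * 2                  # two target notes
print(is_transport_plan(independent_coupling(a, b), a, b))
```

With floating-point weights a small positive `tol` (for example `1e-9`) is the safer choice.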

Definition 52.4 (Wasserstein-1 Distance (Earth Mover's Distance))

The Wasserstein-1 distance (or $W_1$ distance, Earth Mover’s Distance) between measures $\mu$ and $\nu$ over a metric space $(\mathcal{X}, c)$ is:

$W_1(\mu, \nu) = \min_{\gamma \in \Pi(\mu, \nu)} \sum_{i,j} \gamma_{ij} \, C_{ij} = \min_{\gamma \in \Pi(\mu, \nu)} \langle \gamma, C \rangle$

where $\langle \gamma, C \rangle = \sum_{i,j} \gamma_{ij} C_{ij}$ is the Frobenius inner product (total transport cost).

Intuition. Each unit of mass must be moved; the cost is distance times mass. The minimisation finds the most efficient re-assignment, avoiding unnecessary long-range transport.
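For tiny uniform-weight melodies of equal length, the minimum in Definition 52.4 can be verified by brute force. This sketch relies on the Birkhoff–von Neumann theorem: with all weights equal to $1/n$, the vertices of the plan polytope are permutation matrices scaled by $1/n$, so it suffices to try every one-to-one assignment. The function name is illustrative, integer pitches are assumed, and the factorial running time makes this a teaching aid rather than a practical solver.

```python
from fractions import Fraction
from itertools import permutations

def w1_bruteforce_uniform(xs, ys):
    """Exact W1 between two uniform-weight melodies of equal length."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("melodies must be non-empty and of equal length")
    best = min(
        sum(abs(x - ys[j]) for x, j in zip(xs, perm))
        for perm in permutations(range(n))
    )
    return Fraction(best, n)              # each assignment carries mass 1/n

print(w1_bruteforce_uniform([0, 4, 7], [2, 5, 9]))   # C-E-G versus D-F-A
```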

Definition 52.5 (Kantorovich Dual Potentials)

The Kantorovich dual problem to $W_1$ seeks a pair of functions $(f, g)$ with $f : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{X} \to \mathbb{R}$ (called dual potentials; $f$ is evaluated at source notes and $g$ at target notes) subject to the constraint:

$f(x_i) + g(y_j) \leq C_{ij} \quad \forall i, j$

The dual objective is:

$\max_{f, g} \sum_i a_i f(x_i) + \sum_j b_j g(y_j)$

The functions $f$ and $g$ can be interpreted as “prices” at each source and target location in a resource-allocation economy. At optimality, $f$ and $g$ are $c$ -conjugates of each other.

Main Theorems

Theorem 52.1 (Wasserstein Is a Metric)

Let $(\mathcal{X}, c)$ be a metric space. Then $W_1$ defines a metric on the space of probability measures $\mathcal{P}(\mathcal{X})$ :

Non-negativity: $W_1(\mu, \nu) \geq 0$ , with equality if and only if $\mu = \nu$ .
Symmetry: $W_1(\mu, \nu) = W_1(\nu, \mu)$ .
Triangle inequality: $W_1(\mu, \rho) \leq W_1(\mu, \nu) + W_1(\nu, \rho)$ for any third measure $\rho$ .

Proof.

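Non-negativity and symmetry follow directly from the definition: $C_{ij} \geq 0$ , and any plan $\gamma$ for $(\mu, \nu)$ can be transposed to give a plan for $(\nu, \mu)$ with the same cost. If $\mu = \nu$ , the plan that leaves every note in place costs 0. Conversely, if $W_1(\mu, \nu) = 0$ , an optimal plan can only carry mass between pairs with $c(x_i, y_j) = 0$ , that is $x_i = y_j$ , so the two marginals coincide and $\mu = \nu$ .

For the triangle inequality, write $\rho = \sum_k r_k \delta_{z_k}$ and take optimal plans $\gamma^* \in \Pi(\mu, \nu)$ and $\eta^* \in \Pi(\nu, \rho)$ . Glue them along their common marginal $\nu$ (the discrete form of the disintegration theorem): $\zeta_{ik} = \sum_{j \,:\, b_j > 0} \frac{\gamma^*_{ij} \, \eta^*_{jk}}{b_j}$ . Summing over $k$ gives $\sum_k \zeta_{ik} = a_i$ and summing over $i$ gives $\sum_i \zeta_{ik} = r_k$ , so $\zeta \in \Pi(\mu, \rho)$ . By the triangle inequality on $c$ , $c(x_i, z_k) \leq c(x_i, y_j) + c(y_j, z_k)$ , hence $\sum_{i,k} \zeta_{ik} \, c(x_i, z_k) \leq \sum_{i,j} \gamma^*_{ij} \, c(x_i, y_j) + \sum_{j,k} \eta^*_{jk} \, c(y_j, z_k) = W_1(\mu,\nu) + W_1(\nu,\rho)$ . Since $W_1(\mu,\rho)$ is the minimum over all plans in $\Pi(\mu,\rho)$ , it is at most the cost of $\zeta$ , which gives the result. $\square$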
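The gluing step can be made concrete for discrete measures. In the finite setting the glued plan can be written $\zeta_{ik} = \sum_j \gamma_{ij} \eta_{jk} / b_j$ (summing over $j$ with $b_j > 0$ ), and the sketch below, with a hypothetical function name `glue`, checks numerically that the result has marginals $\mu$ and $\rho$ . The test data are independent couplings chosen purely for illustration.

```python
from fractions import Fraction

def glue(gamma, eta, b):
    """Compose plans gamma in Pi(mu, nu) and eta in Pi(nu, rho) along nu."""
    m, p = len(gamma), len(eta[0])
    zeta = [[Fraction(0)] * p for _ in range(m)]
    for j, bj in enumerate(b):
        if bj == 0:
            continue                      # no mass passes through y_j
        for i in range(m):
            for k in range(p):
                zeta[i][k] += gamma[i][j] * eta[j][k] / bj
    return zeta

a = [Fraction(1, 2), Fraction(1, 2)]      # weights of mu
b = [Fraction(1, 3), Fraction(2, 3)]      # weights of nu
r = [Fraction(1, 4), Fraction(3, 4)]      # weights of rho
gamma = [[ai * bj for bj in b] for ai in a]
eta = [[bj * rk for rk in r] for bj in b]
zeta = glue(gamma, eta, b)
print([sum(row) for row in zeta] == a, [sum(col) for col in zip(*zeta)] == r)
```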

Theorem 52.2 (Kantorovich Duality)

For discrete measures $\mu = \sum_i a_i \delta_{x_i}$ and $\nu = \sum_j b_j \delta_{y_j}$ with finite support and ground cost $c$ , the primal and dual values are equal:

$W_1(\mu, \nu) = \min_{\gamma \in \Pi(\mu,\nu)} \langle \gamma, C \rangle = \max_{\substack{f, g \\ f(x_i)+g(y_j) \leq C_{ij}}} \left[ \sum_i a_i f(x_i) + \sum_j b_j g(y_j) \right]$

Moreover, for the $L^1$ ground cost $c(x,y)=|x-y|$ , the dual admits the Lipschitz reformulation:

$W_1(\mu, \nu) = \sup_{\|h\|_{\mathrm{Lip}} \leq 1} \left[ \int h \, d\mu - \int h \, d\nu \right]$

where the supremum is over all 1-Lipschitz functions $h : \mathcal{X} \to \mathbb{R}$ .

Proof.

The primal problem is a linear programme in $\gamma$ (minimise a linear objective over a convex polytope). Strong LP duality applies whenever the primal is feasible and bounded — both hold since $\Pi(\mu,\nu)$ is non-empty (e.g., the independent coupling $\gamma_{ij}=a_i b_j$ ) and costs are non-negative. Strong duality yields $\min \langle \gamma, C \rangle = \max \sum_i a_i f_i + \sum_j b_j g_j$ over dual-feasible $(f,g)$ .

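The Lipschitz reformulation follows from $c$ -conjugacy. For fixed $f$ , the best admissible $g$ is the $c$ -transform $f^c(y) = \min_x [c(x,y) - f(x)]$ , and when $c$ is a metric $f^c$ is 1-Lipschitz with $(f^c)^c = -f^c$ . One may therefore restrict the dual to pairs $(h, -h)$ with $h$ 1-Lipschitz, for which the constraint reads $h(x_i) - h(y_j) \leq C_{ij} = |x_i - y_j|$ and the objective becomes $\int h \, d\mu - \int h \, d\nu$ . The supremum over 1-Lipschitz $h$ thus equals the dual optimum. $\square$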
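Duality gives a practical optimality certificate: if a feasible plan and a 1-Lipschitz function $h$ produce the same number, both are optimal. The following sketch (helper names are hypothetical; the two motifs C–E–G and D–F–A are illustrative data) checks this for the sorted plan and the candidate potential $h(x) = -x$ .

```python
from fractions import Fraction

def dual_value(h, mu, nu):
    """Integral of h against mu minus integral of h against nu."""
    return (sum(m * h(x) for x, m in mu.items())
            - sum(m * h(y) for y, m in nu.items()))

def is_1_lipschitz(h, points):
    """Check |h(x) - h(y)| <= |x - y| on a finite set of pitches."""
    return all(abs(h(x) - h(y)) <= abs(x - y) for x in points for y in points)

def plan_cost(plan):
    """Cost of a plan given as {(source_pitch, target_pitch): mass}."""
    return sum(mass * abs(x - y) for (x, y), mass in plan.items())

third = Fraction(1, 3)
mu = {0: third, 4: third, 7: third}       # C-E-G
nu = {2: third, 5: third, 9: third}       # D-F-A
plan = {(0, 2): third, (4, 5): third, (7, 9): third}

def h(x):
    return -x                             # candidate dual potential

print(is_1_lipschitz(h, set(mu) | set(nu)), plan_cost(plan), dual_value(h, mu, nu))
```

Because the primal cost and the dual value agree, no other plan can be cheaper and no other 1-Lipschitz function can score higher.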

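Summary: “Kantorovich duality is the central theorem of optimal transport: minimising the transport cost equals maximising a difference of integrals of a pair of functions. Duality links geometric distance to functional analysis, and that is the mathematical reason optimal transport is such a powerful tool.”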

Theorem 52.3 (Transposition Is Near, Inversion Is Far)

Let $\mu = \sum_{i=1}^n \tfrac{1}{n} \delta_{x_i}$ be a melody with uniform weights. Define:

Transposition by $k$ semitones: $\mu^{+k} = \sum_{i=1}^n \tfrac{1}{n} \delta_{x_i + k}$ .
Inversion about axis $a$ : $\mu^{\text{inv}} = \sum_{i=1}^n \tfrac{1}{n} \delta_{2a - x_i}$ .

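Then: $W_1(\mu, \mu^{+k}) = |k|$ and $W_1(\mu, \mu^{\text{inv}}) \leq \frac{2}{n} \sum_{i=1}^n |x_i - a|$ , with equality whenever all pitches lie on one side of the axis (all $x_i \geq a$ or all $x_i \leq a$ ). In general, with the pitches sorted as $x_{(1)} \leq \cdots \leq x_{(n)}$ , the exact value is $W_1(\mu, \mu^{\text{inv}}) = \frac{1}{n} \sum_{i=1}^n |x_{(i)} + x_{(n+1-i)} - 2a|$ .

Proof.

Transposition. The identity coupling $\gamma_{ii} = \tfrac{1}{n}$ (ship note $x_i$ to $x_i + k$ ) is feasible and has cost $\sum_i \tfrac{1}{n} |k| = |k|$ , so $W_1 \leq |k|$ . No coupling can do better: by the Lipschitz form of the dual (Theorem 52.2), the 1-Lipschitz function $h(x) = x$ satisfies $\int h \, d\mu^{+k} - \int h \, d\mu = k$ and $h(x) = -x$ gives $-k$ , so $W_1 \geq |k|$ .

Inversion. The coupling that ships $x_i$ to $2a - x_i$ costs $|x_i - (2a-x_i)| = 2|x_i - a|$ per note. Summing with weight $\tfrac{1}{n}$ gives the upper bound. This coupling reverses the order of the pitches, so it need not be optimal: for $\mu = \tfrac{1}{2}\delta_0 + \tfrac{1}{2}\delta_2$ and $a = 1$ the inverted melody equals $\mu$ and $W_1 = 0$ , while the bound equals 2. The exact value follows from the sorted matching for 1D transport (Theorem 52.4 below): the $i$ -th smallest pitch of $\mu^{\text{inv}}$ is $2a - x_{(n+1-i)}$ , so the $i$ -th matched pair costs $|x_{(i)} + x_{(n+1-i)} - 2a|$ . When all $x_i \geq a$ , each term equals $(x_{(i)} - a) + (x_{(n+1-i)} - a)$ and the sum is $2\sum_i (x_i - a)$ , which is the bound; the case of all $x_i \leq a$ is symmetric. $\square$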
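A quick numerical check of both transformations, assuming the sorted matching of Theorem 52.4 for the exact value and integer pitches. All function names are illustrative. The pairing $x_i \to 2a - x_i$ reverses pitch order, so its cost $\tfrac{2}{n}\sum_i |x_i - a|$ is exact when the melody lies on one side of the axis and can overshoot when pitches fall on both sides; the sketch shows one case of each.

```python
from fractions import Fraction

def w1_sorted_uniform(xs, ys):
    """W1 between uniform-weight melodies of equal length via sorted matching."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("melodies must be non-empty and of equal length")
    return Fraction(sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))), len(xs))

def transpose(xs, k):
    return [x + k for x in xs]

def invert(xs, axis):
    return [2 * axis - x for x in xs]

def pairing_cost(xs, axis):
    """Cost of shipping each x_i to its own mirror image 2a - x_i."""
    return Fraction(2 * sum(abs(x - axis) for x in xs), len(xs))

melody = [0, 2, 4, 7, 9]                  # C major pentatonic
print(w1_sorted_uniform(melody, transpose(melody, 7)))               # transposition
print(w1_sorted_uniform(melody, invert(melody, 0)), pairing_cost(melody, 0))
print(w1_sorted_uniform(melody, invert(melody, 4)), pairing_cost(melody, 4))
```

With the axis at 0 the two numbers agree; with the axis at 4 (inside the melody) the sorted matching is far cheaper than the mirror pairing.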

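Summary: “Transposition: shift the whole melody up a fifth and the cost stays modest, because every note travels the same distance. It also sounds ‘near’. Inversion: flip every note about an axis and the transport cost can be large, because the notes scatter. It sounds ‘far’. This matches intuition.”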

Theorem 52.4 (Monge Solution in One Dimension)

When $\mathcal{X} = \mathbb{R}$ and $c(x,y) = |x-y|$ , the optimal transport plan between discrete measures is achieved by the sorted matching: sort source pitches $x_1 \leq x_2 \leq \cdots \leq x_m$ and target pitches $y_1 \leq y_2 \leq \cdots \leq y_n$ (with repetition according to weights), then match them in order.

For equal-weight melodies of the same length $n = m$ : $W_1(\mu, \nu) = \frac{1}{n} \sum_{i=1}^n |x_{(i)} - y_{(i)}|$ where $x_{(i)}$ and $y_{(i)}$ denote the $i$ -th smallest source and target pitches.

Proof.

For the $L^1$ cost on $\mathbb{R}$ , a crossing in a transport plan never helps: if $\gamma_{ij} > 0$ and $\gamma_{kl} > 0$ with $x_i < x_k$ and $y_j > y_l$ , then uncrossing (rerouting mass $\min(\gamma_{ij}, \gamma_{kl})$ so that $x_i \to y_l$ and $x_k \to y_j$ ) preserves both marginals and costs the same or less, by the $L^1$ inequality $|a-c| + |b-d| \leq |a-d| + |b-c|$ when $a \leq b$ and $c \leq d$ (here $a = x_i$ , $b = x_k$ , $c = y_l$ , $d = y_j$ ). Removing all crossings yields the sorted matching, which is therefore optimal. $\square$
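For general (unequal) weights, the sorted matching is equivalent to the standard one-dimensional identity $W_1(\mu, \nu) = \int |F_\mu(t) - F_\nu(t)| \, dt$ , where $F$ denotes the cumulative distribution function. The sketch below sweeps the merged support from left to right; the function name `w1_1d` and the dictionary representation `{pitch: mass}` are assumptions of this example.

```python
from fractions import Fraction

def w1_1d(mu, nu):
    """Exact W1 on the real line for measures given as {pitch: mass}."""
    points = sorted(set(mu) | set(nu))
    total, cdf_gap = 0, 0
    for left, right in zip(points, points[1:]):
        cdf_gap += mu.get(left, 0) - nu.get(left, 0)   # F_mu - F_nu on [left, right)
        total += abs(cdf_gap) * (right - left)
    return total

third = Fraction(1, 3)
print(w1_1d({0: third, 4: third, 7: third}, {2: third, 5: third, 9: third}))
print(w1_1d({0: Fraction(1, 2), 10: Fraction(1, 2)},
            {0: Fraction(1, 4), 10: Fraction(3, 4)}))
```

The sweep costs one sort plus a linear pass, so no linear programme is needed for one-dimensional pitch data.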

Prop 52.1 (LP Complexity and Sinkhorn Approximation)

The exact transport plan solves a linear programme with $mn$ variables and $m + n$ equality constraints. The interior-point complexity is $O((mn)^{3/2})$ (roughly $O(n^3)$ for square problems). The Sinkhorn algorithm (Cuturi 2013) adds an entropic regularisation term $-\varepsilon H(\gamma)$ to the objective (where $H(\gamma) = -\sum_{ij} \gamma_{ij} \log \gamma_{ij}$ is the entropy of the plan), yielding an approximate plan at a total cost of roughly $O(n^2 / \varepsilon^2)$ operations, up to logarithmic factors, with each iteration costing $O(n^2)$ .

Proof.

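The regularised problem has a unique solution (the objective $\langle \gamma, C \rangle - \varepsilon H(\gamma)$ is strictly convex because $-H$ is), given by a diagonal scaling of the kernel matrix $K_{ij} = e^{-C_{ij}/\varepsilon}$ : $\gamma^\varepsilon = \mathrm{diag}(u) \, K \, \mathrm{diag}(v)$ for positive scaling vectors $u, v$ chosen so that $\gamma^\varepsilon$ has marginals $a$ and $b$ . The Sinkhorn iteration alternately rescales the rows and columns of $K$ to match these marginals, $u \leftarrow a / (K v)$ and $v \leftarrow b / (K^\top u)$ with elementwise division, converging geometrically. As $\varepsilon \to 0$ , $\gamma^\varepsilon \to \gamma^*$ (the optimal plan of maximal entropy when the optimum is not unique). $\square$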
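A bare-bones Sinkhorn iteration in pure Python, written as a sketch rather than a production solver. The function name, the default `eps`, and the fixed iteration count are assumptions of this example; very small `eps` would underflow in `exp` and calls for a log-domain variant that is not shown here.

```python
import math

def sinkhorn(a, b, C, eps=0.5, iters=2000):
    """Entropic OT: return diag(u) K diag(v) with K = exp(-C / eps)."""
    m, n = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]

xs, ys = [0, 4, 7], [2, 5, 9]             # C-E-G versus D-F-A
a = b = [1 / 3] * 3
C = [[abs(x - y) for y in ys] for x in xs]
plan = sinkhorn(a, b, C)
cost = sum(plan[i][j] * C[i][j] for i in range(3) for j in range(3))
print(cost)                               # slightly above the exact value 5/3
```

The regularised cost sits a little above the exact $W_1 = 5/3$ for this pair because entropy spreads a small amount of mass onto suboptimal routes; lowering `eps` narrows the gap at the price of slower convergence.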

Numerical Examples

Example 1: Transposition by a fifth.

Source: C–E–G–B–D (uniform weights $1/5$ each), pitches $\{0, 4, 7, 11, 14\}$ . Target: G–B–D–F♯–A (shifted up 7 semitones), pitches $\{7, 11, 14, 18, 21\}$ .

By Theorem 52.3: $W_1(\mu, \mu^{+7}) = 7$

The sorted matching ships each note exactly 7 semitones. Total cost $= 5 \times \tfrac{1}{5} \times 7 = 7$ .

Example 2: Inversion about axis C (pitch 0).

Source: the C major pentatonic $\{0, 2, 4, 7, 9\}$ (C–D–E–G–A); inversion about C maps it to $\{0, -2, -4, -7, -9\}$ .

$W_1(\mu, \mu^{\text{inv}}) = \frac{2}{5}(|0| + |2| + |4| + |7| + |9|) = \frac{2}{5} \times 22 = 8.8$

Every source pitch lies at or above every target pitch, so every coupling has this same cost and the value is exact. Inversion costs 8.8 semitone-units versus 7 for transposition by a fifth, so here inversion “moves the melody further.”

Example 3: Transport plan matrix (4×4).

Consider source notes C, E, G, B with weights $(a_1, a_2, a_3, a_4)$ and target notes D, F, A, C with weights $(b_1, b_2, b_3, b_4)$ . A feasible plan:

	D	F	A	C
C	0.8	0.2	0.0	0.0
E	0.0	0.7	0.3	0.0
G	0.0	0.0	0.8	0.2
B	0.1	0.0	0.0	0.9

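As displayed, each row sums to 1, so the entries are best read as the fraction of each source note's mass sent to each target note; the plan itself is $\gamma_{ij} = a_i \times$ (entry), whose rows sum to $a_i$ . For uniform source weights $a_i = \tfrac{1}{4}$ , the column sums give target weights $b = (0.225, 0.225, 0.275, 0.275)$ . The total cost is $\sum_{ij} \gamma_{ij} |x_i - y_j|$ , minimised over all such feasible matrices.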
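The marginals implied by the table can be read off mechanically. The sketch below sums its rows and columns and then rescales by the total so that the result is a probability-normalised plan; the variable names are illustrative and the uniform rescaling is an assumption of this example.

```python
# Entries of the 4x4 table above: rows C, E, G, B; columns D, F, A, C.
table = [
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.8, 0.2],
    [0.1, 0.0, 0.0, 0.9],
]

row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
total = sum(row_sums)

# Dividing by the total mass turns the table into a plan of total mass 1.
gamma = [[entry / total for entry in row] for row in table]
a = [sum(row) for row in gamma]           # implied source weights
b = [sum(col) for col in zip(*gamma)]     # implied target weights
print([round(w, 3) for w in a], [round(w, 3) for w in b])
```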

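Summary: “The optimal transport plan is a matrix: rows are the notes of the source melody, columns are the notes of the target melody, and each cell says how much mass moves from that source note to that target note. The goal is to minimise the total transport cost, that is, the inner product of the plan with the cost matrix. This is a linear programming problem.”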

Example 4: Tonal distance from C major to F major.

Represent C major scale as uniform distribution on $\{0, 2, 4, 5, 7, 9, 11\}$ and F major on $\{0, 2, 4, 5, 7, 9, 10\}$ . The only differing note is B (11) vs. B♭ (10). The sorted matching sends all notes to themselves except B → B♭ at cost 1, giving:

$W_1(\mu_{C}, \mu_{F}) = \frac{1}{7} \times 1 \approx 0.143$

Moving one step on the circle of fifths (C → F, a distance of one fifth) costs this amount, matching the geometric intuition that adjacent keys are close.

Musical Connection

Why Wasserstein beats edit distance for melody

Classical edit distance (Levenshtein) on melody sequences counts insertions, deletions, and substitutions, but assigns cost 1 to every substitution regardless of pitch distance. Under edit distance, changing C to C♯ (1 semitone) is just as costly as changing C to F♯ (6 semitones). Musicians know this is wrong: a chromatic neighbour feels similar; a tritone substitution sounds remote.

Wasserstein distance corrects this by embedding pitch into its natural metric space. The ground cost $c(x,y) = |x - y|$ (or a circular variant $\min(|x-y|, 12-|x-y|)$ on $\mathbb{Z}_{12}$ ) means nearby pitches are cheap to transport. The Wasserstein metric inherits the geometry of pitch space.
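The circular variant of the ground cost takes one line. A minimal sketch, with an illustrative function name and the modulus exposed as a parameter so that other equal divisions of the octave could be tried:

```python
def circular_cost(x, y, modulus=12):
    """Shortest distance between two pitch classes around the pitch-class circle."""
    d = abs(x - y) % modulus
    return min(d, modulus - d)

print(circular_cost(0, 11))   # C to B: one semitone the short way round
print(circular_cost(0, 6))    # C to F-sharp: the tritone, the farthest point
```

Under this cost C and B are neighbours, whereas the linear cost $|x - y|$ places them 11 semitones apart.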

Three applications in music analysis

Style similarity. Represent a composer’s melodic output as a weighted average of melody measures. $W_1$ between two composers' aggregated pitch distributions gives a principled distance. Histogram overlap only registers whether mass sits in the same bins, whereas $W_1$ also accounts for how far apart mismatched pitches are, which makes it a natural basis for clustering stylistic families.
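Plagiarism detection. When $W_1(\mu_A, \mu_B) < \theta$ for a calibrated threshold $\theta$ , the two melodies are flagged for review. Unlike fingerprinting (exact match), OT-based detection tolerates rhythmic variation and ornamental embellishment, which change a pitch distribution little or not at all. Transposition by $k$ semitones adds exactly $|k|$ (Theorem 52.3), so it has to be factored out first, for example by minimising over $k$ . These are precisely the transformations composers use to disguise borrowing.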
Tonal distance and the circle of fifths. Represent each key as its scale distribution. Under the linear cost on pitch classes 0 to 11, the Wasserstein distance between major-scale distributions tracks the circle-of-fifths distance: C → G costs $\tfrac{1}{7}$ while C → F♯ costs $\tfrac{6}{7}$ , because the sorted matching moves one note by a semitone in the first case and six notes in the second. The ordering of keys around the circle of fifths is thus recovered from pitch content alone.
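The tonal-distance application can be tried directly. This sketch assumes uniform weights on the seven scale degrees, the linear cost on pitch-class representatives 0 to 11 (not the circular variant), and the sorted matching of Theorem 52.4; the helper names are illustrative.

```python
from fractions import Fraction

MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)

def major_scale(tonic):
    """Pitch classes of the major scale on `tonic`, as sorted integers 0..11."""
    return sorted((tonic + step) % 12 for step in MAJOR_STEPS)

def w1_sorted_uniform(xs, ys):
    """W1 between uniform-weight pitch sets of equal size via sorted matching."""
    return Fraction(sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))), len(xs))

c_major = major_scale(0)
for fifths in range(7):                   # walk sharpwards: C, G, D, A, E, B, F-sharp
    tonic = (7 * fifths) % 12
    print(fifths, w1_sorted_uniform(c_major, major_scale(tonic)))
```

Each step around the circle adds one sharp, which moves one further scale degree by a semitone, so the printed distance grows by $\tfrac{1}{7}$ per step under these assumptions.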

Transposition and inversion through the OT lens

Transposition by $k$ semitones moves every pitch the same distance, so the transport plan is diagonal and the cost is exactly $|k|$ — a small, controlled motion. Melodic inversion scatters pitches relative to an axis; unless the melody is highly symmetric, inversion produces a large, non-diagonal transport plan and a correspondingly large Wasserstein distance. This quantifies the intuition that transpositions are “the same melody in a new key” while inversions are genuinely different melodic shapes.

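Summary: “Applications of the Wasserstein distance in melody analysis: style similarity, plagiarism detection, and tonal distance. It is more natural than edit distance because the geometry of pitch is built in.”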

Limits and Open Questions

Pitch-only transport ignores rhythm. The current formulation represents a melody as a distribution over pitches, discarding all temporal information. Two melodies with identical pitch-class distributions but entirely different rhythms receive $W_1 = 0$ . Extending OT to joint pitch–time distributions (measures on $\mathbb{R}^2$ ) requires a 2D ground cost and dramatically increases the size of the LP.
Circular pitch space. The linear cost $c(x,y) = |x-y|$ does not respect octave equivalence. On $\mathbb{Z}_{12}$ the natural metric is $\min(|x-y|, 12-|x-y|)$ , making the ground space a circle rather than a line. The 1D Monge sorting theorem no longer applies directly, increasing computational complexity.
Choice of ground cost is a modelling decision. The $L^1$ semitone distance treats all semitone steps as equal. A psychoacoustically motivated cost (e.g., weighted by consonance intervals, or derived from MDS of listener similarity ratings) could yield musically more meaningful distances, but is harder to justify axiomatically.
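Computational cost at scale. Exact LP computation is $O(n^3)$ ; Sinkhorn at precision $\varepsilon$ is $O(n^2/\varepsilon^2)$ . For large corpora (millions of melody pairs), even Sinkhorn becomes expensive. For one-dimensional pitch measures the sorted matching of Theorem 52.4 already gives the exact $W_1$ in $O(n \log n)$ ; for higher-dimensional ground spaces, sliced Wasserstein distance (averaging 1D projections) runs in $O(n \log n)$ per projection but only approximates the full distance.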
Wasserstein vs. Gromov–Wasserstein for structural comparison. $W_1$ compares pitch distributions embedded in a fixed ambient space. When comparing melodies across different modal systems (e.g., Western vs. Indian ragas with different scale structures), the ambient spaces differ. Gromov–Wasserstein distance (Mémoli 2011) compares metric-measure spaces intrinsically, without requiring a common embedding — but is $O(n^3)$ even approximately.

Conjecture (Wasserstein Threshold for Melodic Similarity Perception)

There exists a perceptual threshold $\theta^* > 0$ such that for any two melodies $\mu, \nu$ with $W_1(\mu, \nu) < \theta^*$ , trained listeners judge them as “similar” with probability exceeding 0.75. Conversely, for $W_1(\mu, \nu) > 2\theta^*$ , the similarity judgment probability falls below 0.25.

Falsification criterion. A valid counterexample is a pair of melodies with $W_1 < \theta^*$ that listeners consistently rate as dissimilar (e.g., due to rhythmic or timbral contrast not captured by pitch distribution), or a pair with $W_1 > 2\theta^*$ that listeners rate as similar (e.g., due to motivic imitation or shared contour). Such examples would demonstrate that $W_1$ on pitch distributions is insufficient for perceptual melody similarity, and that a richer ground space (pitch × time × timbre) is necessary.

Academic References

Kantorovich, L. V. (1942). On the translocation of masses. Doklady Akademii Nauk USSR, 37(7–8), 227–229. [Translated: Management Science, 5(1), 1–4, 1958.]
Wasserstein, L. N. (1969). Markov processes over denumerable products of spaces describing large system of automata. Problemy Peredachi Informatsii, 5(3), 47–52.
Villani, C. (2009). Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften, Vol. 338. Springer. — The definitive modern reference; Ch. 1–2 cover discrete OT and Kantorovich duality.
Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport. Foundations and Trends in Machine Learning, 11(5–6), 355–607. Free PDF: https://arxiv.org/abs/1803.00567 — The primary reference for computational methods including Sinkhorn.
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport distances. Advances in Neural Information Processing Systems (NeurIPS), 26.
Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The Earth Mover’s Distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99–121. — The paper that popularised EMD in computer vision and adjacent fields.
Mémoli, F. (2011). Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4), 417–487.
Typke, R., Wiering, F., & Veltkamp, R. C. (2005). A survey of music information retrieval systems using the Earth Mover’s Distance. Proceedings of ISMIR, 112–119.
Ferretti, M. (2021). Measuring melodic similarity: Earth mover’s distance vs. dynamic time warping. Journal of New Music Research, 50(3), 215–230.
Sturm, B. L. (2011). Evaluating music emotion recognition: Lessons from music genre recognition. Proceedings of the IEEE International Conference on Multimedia and Expo.
Agmon, E. (1991). Linear transformations between cyclically generated chords. Musikometrika, 3, 15–40. — Early quantitative approach to inter-key distances that Wasserstein formalises.
Krumhansl, C. L. (1990). Cognitive Foundations of Musical Pitch. Oxford University Press. — Empirical tonal distance data that OT-based methods can be validated against.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151. — KL-divergence alternative to Wasserstein for distribution comparison; useful contrast.