EP07: Information Entropy and All-Interval Rows
Overview
A listener hears two melodies and judges the second as “more chaotic.” But when you measure the distribution of adjacent intervals, the second melody is mathematically more uniform — its Shannon entropy is higher, not lower. The discomfort comes not from disorder but from the absence of the familiar concentrated pattern that tonal music trains the brain to expect.
This episode builds Shannon’s entropy from first principles and applies it to music. The central object is the all-interval row on $\mathbb{Z}_{12}$: a 12-tone sequence whose 11 adjacent intervals hit every value in $\{1, 2, \dots, 11\}$ exactly once. This uniform distribution achieves the theoretical entropy maximum $\log_2 11 \approx 3.459$ bits. In contrast, a Bach prelude’s interval distribution concentrates on steps and thirds, yielding an entropy well below that maximum.
The paradox: the all-interval row looks random (uniform histogram) but is extremely rare (only 1,928 of the $11! = 39{,}916{,}800$ twelve-tone rows with a fixed starting pitch satisfy the condition). It is a pseudorandom object — deterministically constructed to maximize a statistical property. This is the same mathematical principle underlying cryptographic pseudorandom generators.
“Randomness is not the opposite of order, but another manifestation of extreme order. The all-interval row uses the strictest of constraints to generate the most uniform of outputs.”
Prerequisites
- All-interval rows and $\mathbb{Z}_{12}$ (EP04) — the definition of all-interval rows, the Klein mother chord
- Basic probability: discrete probability distributions, expected value
Definitions
Let $X$ be a discrete random variable taking values in $\{x_1, \dots, x_n\}$ with probabilities $p_i = P(X = x_i)$. The Shannon entropy of $X$ is

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i,$$

using the convention $0 \log_2 0 = 0$. Entropy measures the average surprise of an outcome: $-\log_2 p_i$ is the surprise of outcome $x_i$.
Given a melody as a sequence of pitch classes $m_1, m_2, \dots, m_n \in \mathbb{Z}_{12}$, the adjacent interval sequence is $d_k = m_{k+1} - m_k \pmod{12}$ for $k = 1, \dots, n-1$.
The interval distribution is the empirical probability distribution over $\mathbb{Z}_{12}$: $p_j = \#\{k : d_k = j\} / (n-1)$.
The melodic entropy is $H_{\mathrm{mel}} = -\sum_j p_j \log_2 p_j$.
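As a minimal sketch (the helper name `melodic_entropy` is my own, not from the episode), the interval distribution and melodic entropy defined above can be computed directly:

```python
from collections import Counter
from math import log2

def melodic_entropy(pitches):
    """Entropy (bits) of the adjacent-interval distribution of a
    pitch-class sequence, per the definitions above."""
    intervals = [(b - a) % 12 for a, b in zip(pitches, pitches[1:])]
    n = len(intervals)
    counts = Counter(intervals)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A repeated note gives zero entropy; a two-interval alternation gives 1 bit.
print(melodic_entropy([0, 0, 0, 0]))     # intervals 0,0,0 -> 0 bits
print(melodic_entropy([0, 2, 0, 2, 0]))  # intervals 2,10,2,10 -> 1 bit
```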
A twelve-tone row $m_1, \dots, m_{12}$ is a permutation of $\mathbb{Z}_{12}$. It is an all-interval row if the 11 adjacent intervals $d_k = m_{k+1} - m_k \pmod{12}$, $k = 1, \dots, 11$, take each value in $\{1, 2, \dots, 11\}$ exactly once.
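The definition translates into a short membership test. A sketch (the function name is my own); the example row is the chromatic “wedge” ordering, a standard all-interval row:

```python
def is_all_interval(row):
    """True iff `row` is a permutation of 0..11 whose 11 adjacent
    intervals mod 12 cover {1, ..., 11} exactly once."""
    if sorted(row) != list(range(12)):
        return False  # not a twelve-tone row
    intervals = [(b - a) % 12 for a, b in zip(row, row[1:])]
    return sorted(intervals) == list(range(1, 12))

wedge = [0, 1, 11, 2, 10, 3, 9, 4, 8, 5, 7, 6]
print(is_all_interval(wedge))            # True
print(is_all_interval(list(range(12))))  # False: every interval is 1
```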
There are exactly 1,928 all-interval rows (up to starting pitch; if we also mod out by the dihedral group of twelve-tone operations, the count reduces further). Out of the $11! = 39{,}916{,}800$ rows with a fixed starting pitch, this is a fraction of approximately $4.8 \times 10^{-5}$.
A set $D \subseteq \mathbb{Z}_n$ is a perfect difference set if every nonzero element of $\mathbb{Z}_n$ appears exactly once among the differences $d_i - d_j$ with $d_i, d_j \in D$, $i \neq j$.
Perfect difference sets with $q + 1$ elements exist in $\mathbb{Z}_n$ for $n = q^2 + q + 1$ when $q$ is a prime power. For $q = 3$, $n = 13$: the set $\{0, 1, 3, 9\}$ is a perfect difference set in $\mathbb{Z}_{13}$.
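The $\{0, 1, 3, 9\}$ example can be checked by brute force — a sketch with an illustrative helper name:

```python
from itertools import permutations

def is_perfect_difference_set(D, n):
    """True iff every nonzero element of Z_n appears exactly once
    among the ordered differences d_i - d_j (i != j) of D."""
    diffs = [(a - b) % n for a, b in permutations(D, 2)]
    return sorted(diffs) == list(range(1, n))

print(is_perfect_difference_set({0, 1, 3, 9}, 13))  # True
print(is_perfect_difference_set({0, 1, 2, 3}, 13))  # False: 1 repeats
```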
All-interval rows are related but distinct: they concern ordered sequences of adjacent differences over , not unordered multisets of all pairwise differences.
A first-order Markov chain on alphabet $\mathcal{A}$ has transition probabilities $P_{ij} = P(X_{t+1} = j \mid X_t = i)$ and stationary distribution $\mu$ (where $\mu P = \mu$).
The entropy rate is

$$H(\mathcal{X}) = -\sum_{i} \mu_i \sum_{j} P_{ij} \log_2 P_{ij}.$$

For music modeled as a Markov chain on pitch classes, $H(\mathcal{X})$ measures the long-run unpredictability of the melody per note.
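The entropy-rate formula can be sketched with NumPy; the transition matrix below is a toy 3-state example with illustrative numbers, not a fitted musical model:

```python
import numpy as np

def entropy_rate(P):
    """Entropy rate (bits/symbol) of an ergodic Markov chain:
    H = -sum_i mu_i sum_j P_ij log2 P_ij."""
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    mu = mu / mu.sum()
    # 0 log 0 = 0 convention: zero out terms where P_ij = 0.
    logP = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return float(-np.sum(mu[:, None] * P * logP))

P = np.array([[0.80, 0.10, 0.10],
              [0.20, 0.60, 0.20],
              [0.25, 0.25, 0.50]])
print(entropy_rate(P))
```

A sanity check: a chain whose every row is uniform over 2 symbols has entropy rate exactly 1 bit/symbol, and a deterministic 2-cycle has entropy rate 0.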
Main Theorems
Theorem (maximum entropy). For any distribution $(p_1, \dots, p_n)$, $H(X) \le \log_2 n$, with equality iff $p_i = 1/n$ for all $i$.

Proof. Since $\log_2$ is strictly concave, Jensen’s inequality gives

$$H(X) = \sum_i p_i \log_2 \frac{1}{p_i} \;\le\; \log_2 \left( \sum_i p_i \cdot \frac{1}{p_i} \right) = \log_2 n,$$

with equality iff all the values $1/p_i$ are equal, i.e., $p_i = 1/n$ for all $i$. $\square$
Corollary for all-interval rows: The adjacent interval distribution of an all-interval row is uniform over $\{1, \dots, 11\}$ (each value exactly once out of 11), giving $H = \log_2 11 \approx 3.459$ bits — the maximum possible for an 11-symbol alphabet.
Theorem (source coding). Let $X_1, X_2, \dots$ be i.i.d. with distribution $p$ over alphabet $\mathcal{A}$. Every lossless prefix-free code assigning a codeword of length $\ell_i$ to symbol $i$ has expected codeword length $L = \sum_i p_i \ell_i \ge H(X)$, and there exists such a code with $L < H(X) + 1$. No prefix-free code can do better on average than $H(X)$ bits per symbol.
Lower bound ($L \ge H(X)$): For any prefix-free code, the Kraft inequality states $\sum_i 2^{-\ell_i} \le 1$. Let $q_i = 2^{-\ell_i}/c$ where $c = \sum_j 2^{-\ell_j}$. Then

$$L - H(X) = \sum_i p_i \log_2 \frac{p_i}{q_i} - \log_2 c \;\ge\; 0,$$

since the KL divergence $D(p \,\|\, q) \ge 0$ and $\log_2 c \le 0$ (as $c \le 1$).
Upper bound ($L < H(X) + 1$): Choose $\ell_i = \lceil -\log_2 p_i \rceil$. Then $\ell_i < -\log_2 p_i + 1$, so $L = \sum_i p_i \ell_i < H(X) + 1$. The Kraft inequality is satisfied since $\sum_i 2^{-\ell_i} \le \sum_i 2^{\log_2 p_i} = \sum_i p_i = 1$.
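The upper-bound construction ($\ell_i = \lceil -\log_2 p_i \rceil$) can be verified numerically. The distribution below is illustrative, not drawn from a musical corpus:

```python
from math import ceil, log2

def shannon_code_lengths(p):
    """Codeword lengths l_i = ceil(-log2 p_i) for a distribution p."""
    return [ceil(-log2(pi)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]                  # illustrative distribution
lengths = shannon_code_lengths(p)         # -> [2, 2, 3, 4]
H = -sum(pi * log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

assert sum(2 ** -li for li in lengths) <= 1   # Kraft inequality holds
assert H <= L < H + 1                         # source coding bounds hold
print(lengths, H, L)
```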
Musical corollary: A tonal melody, whose interval entropy sits well below the maximum, can be encoded with correspondingly few bits per interval on average. An all-interval row, at the maximum $H = \log_2 11 \approx 3.459$ bits, admits no such savings. The tonal melody is more compressible — its redundancy (repeated steps and thirds) is exactly what permits shorter average codes.
For an ergodic Markov chain with stationary distribution $\mu$ and transition matrix $P$, the entropy rate equals

$$H(\mathcal{X}) = \sum_i \mu_i \left( -\sum_j P_{ij} \log_2 P_{ij} \right).$$

This is the weighted average of the per-state conditional entropies.
By the chain rule for entropy, $H(X_1, \dots, X_n) = \sum_{k=1}^{n} H(X_k \mid X_1, \dots, X_{k-1})$.
For a Markov chain, $H(X_k \mid X_1, \dots, X_{k-1}) = H(X_k \mid X_{k-1})$.
By ergodicity, the distribution of $X_{k-1}$ converges to $\mu$, so

$$\frac{1}{n} H(X_1, \dots, X_n) \;\to\; \sum_i \mu_i \left( -\sum_j P_{ij} \log_2 P_{ij} \right) = H(\mathcal{X}).$$
An all-interval row achieves the maximum adjacent-interval entropy and is incompressible in the Lempel-Ziv sense over its 11 intervals — yet it is the output of a deterministic construction satisfying strict algebraic constraints.
Formally: the 11-symbol string formed by the adjacent intervals of an all-interval row has Kolmogorov complexity of only a few dozen bits, since the row can be specified by a constant-size program plus an index into the list of 1,928 rows ($\log_2 1928 \approx 11$ bits). Its interval histogram is indistinguishable from that of a uniform draw over $S_{11}$ (all permutations of $\{1, \dots, 11\}$), but computationally it is a highly structured object.
Visual: Two histograms side by side: left shows Bach C-major Prelude interval frequencies (tall bars at 1, 2, 3, 4; short bars at 6, 7, 10, 11), right shows a flat histogram for an all-interval row. Caption: concentrated tonal distribution vs. uniform distribution at $\log_2 11 \approx 3.459$ bits.
Musical Connection
The entropy hierarchy of musical styles
Measured entropy of adjacent interval distributions, from various corpora:
| Corpus | Approximate $H$ (bits) | Character |
|---|---|---|
| Gregorian chant | | Almost entirely stepwise |
| Bach chorales | | Steps + thirds dominate |
| Bach C-major Prelude | | Wider range of intervals |
| Bebop jazz heads | | Chromatic passing tones |
| Webern Op. 27 | | Wide leaps, near-uniform |
| All-interval row | $3.459$ | Uniform, theoretical max |
The entropy gradient roughly tracks Western music history: the expansion of harmonic vocabulary from modal chant to twelve-tone serialism corresponds to a monotone increase in interval entropy.
朱载堉 (Zhu Zaiyu) and equal temperament — the same operation
In 1584, the Ming dynasty scholar Zhu Zaiyu solved the tuning problem by choosing a frequency ratio of $2^{1/12}$ for every semitone — mathematically equalizing all twelve intervals at the cost of making each ratio irrational. This is the same mathematical operation as the all-interval row: Zhu uniformized pitch ratios across twelve semitones; Schoenberg uniformized interval frequencies across twelve interval classes. Two revolutions, four centuries apart, unified by the principle of replacing a natural hierarchy with a mathematical uniform distribution.
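Zhu’s ratio can be checked in a few lines; the comparison with the just fifth $3/2$ is an added illustration, not from the source:

```python
semitone = 2 ** (1 / 12)       # Zhu Zaiyu's equal-tempered semitone ratio
print(semitone)                # ~1.059463

# Twelve equal steps close the octave exactly.
assert abs(semitone ** 12 - 2) < 1e-12

# The tempered fifth (7 semitones) is slightly flat of the just fifth 3/2.
tempered_fifth = semitone ** 7
print(tempered_fifth - 1.5)    # small negative number
```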
Entropy rate and musical style fingerprints
Modeling a composer’s style as a first-order Markov chain on $\mathbb{Z}_{12}$, the entropy rate $H(\mathcal{X})$ characterizes how “predictable” their melodic choices are:
- Bach: low entropy rate (strong tonal gravity, concentrated transitions)
- Chopin: higher (more chromatic but still tonal)
- Schoenberg (twelve-tone period): near-maximal
The entropy rate is a single real number that summarizes a composer’s interval vocabulary — a lossy compression of style into one information-theoretic quantity.
This connects forward to EP21 (AI composition uses Markov chain entropy rate as a diversity metric) and EP24 (pop music is low-entropy by design; melodic predictability is a feature, not a bug).
Limits and Open Questions
- First-order vs. higher-order models. Melodic entropy computed from adjacent intervals is a first-order statistic. Long-range dependencies (phrase structure, motivic development) require higher-order Markov models or full probabilistic context-free grammars. The entropy rate of a $k$-th order chain requires much more data to estimate reliably.
- Entropy vs. perceived complexity. High entropy does not equal perceived complexity. A melody of random pitches (maximum entropy) is less interesting than a fugue (moderate entropy, rich long-range structure). Information theory measures statistical uniformity; perceived complexity involves pattern recognition at multiple timescales.
- Combinatorial counting of all-interval rows. The exact count of 1,928 all-interval rows (up to starting pitch) was determined computationally. The algebraic structure that generates them — related to perfect difference sets and the theory of cyclic difference families — is only partially understood. An open question: is there a direct constructive bijection between all-interval rows and a known combinatorial object?
- Entropy of rhythm. Analogously to pitch-interval entropy, one can define the entropy of inter-onset intervals (IOIs) in rhythm. African polyrhythm typically achieves higher rhythmic entropy than Western common-practice meter. A comparative study across global music traditions is an active research area.
- Minimum description length and style. The Kolmogorov complexity $K(m)$ of a melody $m$ is the length of the shortest program that generates $m$. By the source coding theorem, $K(m) \approx nH$ for i.i.d. melodies of length $n$, but $K(m) \ll nH$ for structured compositions, because the structure allows extreme compression. The gap $nH - K(m)$ measures “structural redundancy” — it is large for tonal music, small for twelve-tone music.
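The redundancy gap in the last point can be illustrated with an off-the-shelf compressor standing in (crudely) for shortest description length — a sketch under that assumption, with made-up interval sequences:

```python
import random
import zlib

random.seed(0)

# A highly patterned "tonal-like" interval sequence vs. a uniform-random one.
patterned = bytes([1, 2, 2, 1, 3, 1] * 200)                    # 1200 symbols
uniform = bytes(random.randrange(1, 12) for _ in range(1200))  # 1200 symbols

c_pat = len(zlib.compress(patterned, 9))
c_uni = len(zlib.compress(uniform, 9))
print(c_pat, c_uni)        # the patterned sequence compresses far better
assert c_pat < c_uni       # redundancy permits a shorter description
```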
Academic References
- Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. (The founding paper of information theory.)
- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. Ch. 2 (Entropy), Ch. 5 (Source coding theorem).
- Knopoff, L., & Hutchinson, W. (1983). Entropy as a measure of style: The influence of sample length. Journal of Music Theory, 27(1), 75–97.
- Margulis, E. H., & Beatty, A. P. (2008). Musical style, psychoaesthetics, and prospects for entropy as an analytic tool. Computer Music Journal, 32(4), 64–78.
- Morris, R. (1987). Composition with Pitch Classes. Yale University Press. (All-interval rows and their algebraic properties.)
- Jedrzejewski, F. (2006). Mathematical Theory of Music. Editions Delatour France/IRCAM. (Perfect difference sets and interval content.)
- Nolan, C. (2000). On the mathematical side of Schoenberg’s twelve-tone method. Music Theory Spectrum, 22(2), 162–182.
- Temperley, D. (2007). Music and Probability. MIT Press. (Probabilistic models of musical cognition including Markov chain entropy.)