Classical ML · CS229

Independent Component Analysis

Unmixing what was blended. Hand a machine a tangle of overlapping signals — voices, brain waves, mixed audio — and it pulls out the original, independent sources. The math behind solving the cocktail party problem.

Prerequisites: PCA + the idea of a distribution's shape. We contrast the two.
9 chapters · 7+ simulations · 0 assumed knowledge

Chapter 0: The Cocktail Party Problem

You're at a crowded party. Two people are talking at once, and two microphones are recording in different corners of the room. Each microphone picks up both voices, but in different proportions — mic 1 is closer to Alice so it hears mostly her with a bit of Bob; mic 2 is closer to Bob so it's mostly him with a bit of Alice. Each recording is a mixture, a blend of the two voices.

Now the seemingly impossible question: given only the two mixed recordings, can you recover the two original voices — Alice's speech alone and Bob's speech alone? No information about the room, the microphone positions, or what either person said. Just the tangled mixtures. It sounds like trying to un-bake a cake. Astonishingly, it's not just possible — it's a solved problem, and the algorithm that solves it is Independent Component Analysis, ICA.

Two voices, two mixed recordings

The top two waveforms are the original sources — two independent signals (think Alice's voice, Bob's voice). The bottom two are what the microphones record: each is a different blend of both sources. You can see the mixtures look like neither original cleanly — they're tangled. ICA's job: recover the top two, given only the bottom two.

This "blind source separation" problem is everywhere, far beyond parties. Brain imaging: EEG electrodes on the scalp each record a mixture of signals from many brain regions — ICA separates them into distinct neural sources (and pulls out artifacts like eye-blinks). Medical: separating a fetal heartbeat from the mother's in a combined ECG. Finance: finding independent driving factors behind correlated stock movements. Audio: isolating instruments from a mix. Anywhere independent sources get blended together, ICA can pull them back apart.

The plan. Ch.1: the mixing model (how sources get blended). Ch.2: the goal (find the un-mixing). Ch.3: the key idea — independence and how to detect it. Ch.4: the surprising catch — why ICA needs the sources to be non-Gaussian. Ch.5: a live un-mixing lab. Ch.6: ICA vs PCA — independence vs mere uncorrelatedness. Ch.7: the algorithm. From "un-bake the cake" to a precise, working method.
Common misconception: "You'd need to know something about the voices or the room to separate them." That's the magic — ICA is blind. It uses no prior knowledge of the sources or the mixing process. It succeeds purely from a single, powerful statistical assumption: the original sources are statistically independent of each other (Alice's words don't depend on Bob's). That one assumption, plus non-Gaussianity (Ch.4), is enough to unmix — no room model required.
What is the "cocktail party problem" that ICA solves?

Chapter 1: The Mixing Model

Let's make the blending precise. There are some hidden source signals — call them s — that are independent (Alice's voice, Bob's voice). What each microphone records is a linear mixture: a weighted sum of the sources, where the weights depend on how close that mic is to each speaker. With two sources and two mics:

x_1 = a_{11} s_1 + a_{12} s_2,    x_2 = a_{21} s_1 + a_{22} s_2

The x's are what we observe (the recordings); the s's are the hidden sources we want; and the a's are the mixing weights — how much of each source reaches each microphone. Packaging the weights into a matrix A, the whole thing is just one clean equation:

x = A s

The mixing matrix A captures the entire room geometry in a few numbers. We don't know A (we don't know the room), and we don't know s (the original voices) — all we have is x, the mixed recordings, at many points in time. Two unknowns, one observation. That's the challenge.
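To make the model concrete, here is a minimal NumPy sketch of x = As. The two sources (a sine wave and uniform noise) and the numbers in A are illustrative assumptions, not values from any real room.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Two hypothetical independent, non-Gaussian sources (one row per source).
t = np.linspace(0.0, 8.0, n_samples)
s = np.vstack([
    np.sin(2.0 * t),                     # s_1: a sine wave ("Alice")
    rng.uniform(-1.0, 1.0, n_samples),   # s_2: uniform noise ("Bob")
])

# Illustrative mixing matrix: row i holds microphone i's weights (a_i1, a_i2).
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])

# x = A s, applied to every time sample at once.
x = A @ s   # shape (2, n_samples): one row per microphone

# Spot-check microphone 1 against the scalar equation x_1 = a_11 s_1 + a_12 s_2.
x1_by_hand = A[0, 0] * s[0] + A[0, 1] * s[1]
print(np.allclose(x[0], x1_by_hand))
```

Each column of x is one time instant's pair of microphone readings, which is all ICA ever gets to see.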

Mix two sources yourself

The top two signals are the independent sources. Adjust the mixing weights — how much of each source bleeds into each microphone — and watch the bottom two mixed signals change. With balanced weights the mixtures become a thorough blend; with the identity (no cross-mixing) they stay separate. This blending is exactly what the room does to the voices.

Common misconception: "ICA needs as many microphones as there might be sources, or it can't work." In the standard setup, yes — you need at least as many mixtures (microphones) as sources for the square mixing matrix to be invertible, which is what makes recovery possible. With fewer microphones than sources (the "underdetermined" case) the problem is genuinely harder and needs extra assumptions. The clean version we're learning assumes equal numbers — d sources, d mixtures — so A is a square, invertible matrix.
In the ICA model x = As, what does the mixing matrix A represent, and what do we actually observe?

Chapter 2: The Goal — Find the Unmixing

If mixing is "multiply the sources by A," then un-mixing is just the reverse: multiply the mixtures by A's inverse. Call that the unmixing matrix W = A^{-1}. If we could find W, recovering the sources would be one multiplication away:

s = W x

So the entire problem of separating the voices reduces to one thing: find the right unmixing matrix W. Once you have W, you apply it to each recorded sample and out come the separated sources. The whole game is finding those few numbers in W.
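As a sanity check on the algebra, the sketch below cheats: it assumes A is known (which ICA never gets to assume) and confirms that W = A^{-1} returns the sources exactly. The sources and the entries of A are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(-1.0, 1.0, size=(2, 500))   # hypothetical sources
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])                   # illustrative mixing matrix
x = A @ s                                    # what the microphones record

W = np.linalg.inv(A)                         # unmixing matrix W = A^{-1}
s_recovered = W @ x                          # s = W x

print(np.allclose(s_recovered, s))
```

The hard part of ICA is everything this sketch skips: arriving at a usable W when A is hidden.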

But here's the puzzle: we don't know A, so we can't just invert it — we have to find W directly from the mixtures alone. How can we possibly know when we've found the right W, with nothing to compare against? This is where ICA's brilliant insight comes in, and it's the subject of the next chapter. The short version: we recognize the correct W by the fact that it makes the recovered signals independent. When the un-mixed outputs stop looking like blends and start looking like distinct, unrelated signals, we know we've separated them.

Hunt for the unmixing matrix by hand

The mixtures are fixed. Adjust the two unmixing weights in W and watch the recovered signals at the bottom. Most settings give garbage (still-blended messes). But there's a special W that makes the two outputs match the original clean sources. Hunt for it — or press Auto-solve (ICA) to let the algorithm find it instantly.

Common misconception: "If we don't know A, finding W = A^{-1} is hopeless." It would be, with no assumptions. The breakthrough is that we don't need to know A; we need only a criterion that tells us when W is right, computable from the mixtures alone. That criterion is statistical independence of the outputs. ICA turns "invert an unknown matrix" into "find the W that makes the outputs maximally independent," an optimization we can solve. The unknown A never has to be known directly.
What single object, once found, lets us recover all the original sources from the mixtures?

Chapter 3: Independence — The Signal in the Scatter

How do we recognize the correct unmixing without knowing the sources? The answer lives in a beautiful geometric picture. Forget the waveforms for a moment and instead plot the two mixtures against each other: at every instant in time, take (mic1 value, mic2 value) and drop a point at that coordinate. Over many time samples, you get a scatter cloud — and the shape of that cloud reveals the hidden structure.

Here's the key fact. When two independent sources are mixed, the scatter cloud forms a tilted, sheared shape — for many signals, a parallelogram or rhombus. The edges of that shape point along the original source directions. Mixing rotated and sheared the natural axes; the independent sources are hiding in plain sight as the edges of the cloud. ICA's job is to find those edges — the directions along which the data, once projected, looks like a single clean source rather than a blend.

What makes a recovered signal "a single clean source" rather than "a blend"? Non-Gaussianity. A blend of independent signals tends to look more Gaussian (more bell-curved) than the individual sources. This is the Central Limit Theorem in action: sums of independent things drift toward the bell curve. So to un-blend, we run it backwards: find the directions that make the projected signal least Gaussian. The most non-Gaussian directions are the original, unmixed sources. Independence and maximal non-Gaussianity turn out to be two faces of the same target.
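The drift toward the bell curve can be measured. The sketch below uses excess kurtosis (fourth standardized moment minus 3, which is zero for a Gaussian) as the yardstick. The uniform sources and the mixing weights 0.6 and 0.8 are assumptions chosen for illustration.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

rng = np.random.default_rng(0)
n = 100_000
s1 = rng.uniform(-1.0, 1.0, n)   # flat-topped source, theoretical value -1.2
s2 = rng.uniform(-1.0, 1.0, n)

mixture = 0.6 * s1 + 0.8 * s2    # a blend of the two sources
gaussian = rng.standard_normal(n)

k_source = excess_kurtosis(s1)
k_mixture = excess_kurtosis(mixture)   # theory: -1.2 * (0.6**4 + 0.8**4), about -0.65
k_gauss = excess_kurtosis(gaussian)

print(f"source:   {k_source:.2f}")
print(f"mixture:  {k_mixture:.2f}")
print(f"gaussian: {k_gauss:.2f}")
```

The mixture's value sits between the source's and the Gaussian's zero, which is exactly the gap ICA exploits when it searches for the least Gaussian direction.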

The scatter cloud and its hidden axes

Each dot is one time-instant, plotted as (mic1, mic2). Notice the cloud isn't a round blob — it's a sheared shape whose edges (orange lines) are the independent source directions, found by ICA. Compare them to the PCA axes (blue): PCA finds perpendicular max-variance directions, but the true sources aren't perpendicular — ICA's edges follow the actual cloud geometry. Adjust the mixing and watch ICA's axes track the cloud's true edges.

Common misconception: "Independent just means uncorrelated — the same thing PCA already gives you." Independence is much stronger than uncorrelatedness. Uncorrelated means no linear relationship (zero covariance); independent means no statistical relationship of any kind — knowing one tells you nothing about the other. PCA achieves uncorrelated (it only looks at second-order statistics, the covariance). ICA achieves true independence (it uses higher-order statistics, like the non-Gaussianity). That gap is exactly why ICA can separate sources that PCA cannot (Ch.6).
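A small numerical illustration of that gap, under assumed inputs: two independent unit-variance uniform sources are rotated by 45°, so each output is a blend. The outputs have essentially zero correlation, yet their squares are clearly correlated, which independent signals would never show.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Two independent, unit-variance uniform sources (illustrative choice).
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, n))

# Rotate by 45 degrees: each output is now a blend of both sources.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
u, v = R @ s

corr_linear = np.corrcoef(u, v)[0, 1]                    # second-order check
corr_squares = np.corrcoef(u ** 2, v ** 2)[0, 1]         # one higher-order check
corr_sources = np.corrcoef(s[0] ** 2, s[1] ** 2)[0, 1]   # same check on the true sources

print(f"corr(u, v)       = {corr_linear:+.3f}")
print(f"corr(u^2, v^2)   = {corr_squares:+.3f}")
print(f"corr(s1^2, s2^2) = {corr_sources:+.3f}")
```

A covariance-only method sees nothing wrong with u and v. The dependence only shows up in statistics beyond second order, which is the information ICA uses.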
How does ICA recognize the original source directions in the mixed data?

Chapter 4: Why ICA Needs Non-Gaussian Sources

Now the most surprising and important fact about ICA, one that trips up everyone the first time: ICA cannot separate Gaussian sources. If the original signals were bell-curved, the whole method collapses. Understanding why reveals the deep reason ICA works at all.

The culprit is symmetry. A 2D Gaussian with independent, equal-variance components has a scatter cloud that's a perfectly round, rotationally symmetric blob: a circular splat. And here's the fatal problem: a circle looks identical no matter how you rotate it. So if you mix two Gaussian sources, the resulting cloud is still a featureless round (or elliptical) blob with no distinguishable edges. There's no "shape" to lock onto, because every rotation of a Gaussian blob is just another equally valid Gaussian blob. The original source directions are washed out and mathematically unrecoverable.

Non-Gaussian sources break this symmetry, and that's what saves us. A non-Gaussian signal — a sine wave, a sawtooth, speech, anything with a distinctive non-bell shape — produces a scatter cloud with genuine corners and edges: a parallelogram, a star, a square. These have a preferred orientation, so the source directions are visible as the cloud's edges. The more non-Gaussian the sources, the sharper the corners, the easier the separation. Non-Gaussianity is not a nuisance assumption — it is the very thing that makes the sources identifiable.
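One way to see the symmetry argument numerically is to project unmixed data onto every direction and measure non-Gaussianity along each. This is a sketch under assumed inputs (unit-variance Gaussian versus unit-variance uniform sources, with excess kurtosis as the measure), not a procedure taken from the lesson's simulations.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

def kurtosis_by_angle(data, angles):
    """Excess kurtosis of the data projected onto each unit direction."""
    return np.array([
        excess_kurtosis(np.cos(a) * data[0] + np.sin(a) * data[1])
        for a in angles
    ])

rng = np.random.default_rng(0)
n = 50_000
angles = np.linspace(0.0, np.pi, 37)   # every 5 degrees

gauss = rng.standard_normal((2, n))
unif = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, n))

k_gauss = kurtosis_by_angle(gauss, angles)
k_unif = kurtosis_by_angle(unif, angles)

# Spread = how much the non-Gaussianity score changes as the direction turns.
spread_gauss = k_gauss.max() - k_gauss.min()
spread_unif = k_unif.max() - k_unif.min()
print(f"Gaussian spread over angles: {spread_gauss:.3f}")
print(f"Uniform spread over angles:  {spread_unif:.3f}")
```

For the Gaussian cloud the score is flat in every direction, so there is nothing to climb. For the uniform cloud it varies strongly with angle, and its extremes mark the source axes.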

Gaussian: no edges to find. Non-Gaussian: clear edges.

Toggle the source type. With Gaussian sources, the mixed cloud is a round/elliptical blob — rotationally symmetric, no edges, so ICA has nothing to grab and separation is impossible. With non-Gaussian sources (uniform), the cloud is a sharp parallelogram with obvious edges (the source directions) — ICA locks right on. Same mixing; the only difference is the sources' shape.

The beautiful inversion. For almost every other method in this course, Gaussian was the friendly assumption — least squares, GDA, PCA all love Gaussians. ICA is the dramatic exception: Gaussian is the enemy. Its rotational symmetry destroys exactly the directional information ICA needs. ICA thrives precisely where the data is weird — spiky, skewed, multi-modal. Real-world sources (speech, brain signals, natural images) are wonderfully non-Gaussian, which is exactly why ICA works so well on them.
Common misconception: "If non-Gaussian helps a little, then Gaussian just makes ICA a bit worse." It doesn't make it worse — it makes it impossible. With purely Gaussian sources there is provably no way to recover the original directions from the mixtures, no matter how clever the algorithm or how much data you have. The rotational ambiguity is fundamental, not a numerical issue. Non-Gaussianity isn't a helpful bonus; it's a hard requirement for the problem to even have a unique answer.
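The impossibility claim can be written out in a few lines. The sketch assumes unit-variance Gaussian sources (any other scale can be absorbed into A) and an arbitrary orthogonal matrix R, a symbol introduced here for illustration.

```latex
s \sim \mathcal{N}(0, I), \quad x = A s
\;\Longrightarrow\; x \sim \mathcal{N}\!\left(0,\; A A^{\top}\right).

\text{For any orthogonal } R \ (R R^{\top} = I), \text{ let } A' = A R:
\qquad A' A'^{\top} = A R R^{\top} A^{\top} = A A^{\top}.
```

A zero-mean Gaussian is fully described by its covariance, so the mixtures have exactly the same distribution under A and under A' = AR. No amount of data can tell the two mixing matrices apart.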
Why can't ICA separate Gaussian sources?

Chapter 5: The Unmixing Lab

Time to watch the whole thing work. Below: two original source signals, a mixing matrix that blends them into two tangled recordings, and ICA recovering the originals — with the scatter cloud showing how it does it. This is the cocktail party problem, solved before your eyes.

What to watch:
  • Sources (top): two clean, independent signals. Mixtures (middle): tangled blends — neither voice is cleanly audible.
  • Press Run ICA and watch the scatter cloud's axes rotate to align with the cloud's edges, while the recovered signals (bottom) snap back to match the originals.
  • Re-mix with a new random mixing matrix — ICA recovers the sources every time (up to a possible swap or sign-flip, which doesn't matter for audio).
  • Compare recovered to original: they match in shape, proving the separation worked from the mixtures alone.
ICA Lab — the cocktail party, solved

Top: original sources. Middle: mixed recordings. Bottom: ICA's recovered signals. The scatter (right) shows the (mic1, mic2) cloud and the source directions ICA finds. Press Run ICA to separate.

recovery:

(No quiz — the lab is the test. If you can look at the tangled scatter cloud and point to where the source directions are — the cloud's edges — you understand how ICA sees what we hear as noise.)

Chapter 6: ICA vs. PCA — Independence Beats Uncorrelatedness

Both PCA and ICA find a new set of directions (a new basis) for your data — but they're after fundamentally different things, and seeing the contrast crystallizes what each one does.

| | PCA | ICA |
|---|---|---|
| Goal | Directions of maximum variance | Directions that are statistically independent |
| Achieves | Uncorrelated components | Independent components |
| Uses | 2nd-order stats (covariance) | Higher-order stats (non-Gaussianity) |
| Directions | Always orthogonal (perpendicular) | Need not be orthogonal |
| Gaussian data | Works fine | Fails (no unique answer) |
| Typical use | Compression, dimensionality reduction | Source separation (unmixing) |

The crux is in two rows. PCA's directions are always perpendicular and it only guarantees uncorrelated outputs. But the true source directions in a mixture are usually not perpendicular — the mixing sheared them at an angle. So PCA, forced to find perpendicular axes, cannot align with the real sources; it finds the max-variance axes instead, which are a blend. ICA, free to find non-perpendicular directions and demanding full independence (not just uncorrelatedness), locks onto the actual source edges.
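The sketch below checks this on synthetic data. PCA is computed directly as the eigenvectors of the mixtures' covariance matrix; the uniform sources and the entries of A are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
s = rng.uniform(-1.0, 1.0, size=(2, n))   # independent sources
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])                 # illustrative, non-orthogonal mixing
x = A @ s

# PCA: eigenvectors of the covariance matrix of the mixtures.
eigvals, eigvecs = np.linalg.eigh(np.cov(x))   # columns are the PCA axes
pca_out = eigvecs.T @ (x - x.mean(axis=1, keepdims=True))

# PCA axes are perpendicular; the true source directions (columns of A) are not.
pca_dot = eigvecs[:, 0] @ eigvecs[:, 1]
a1, a2 = A[:, 0], A[:, 1]
source_cos = (a1 @ a2) / (np.linalg.norm(a1) * np.linalg.norm(a2))

# |corr(PCA output i, source j)|: each PCA output correlates with BOTH sources.
blend = np.abs(np.corrcoef(np.vstack([pca_out, s]))[:2, 2:])

print("PCA axes dot product:", round(float(pca_dot), 6))
print("cosine between source directions:", round(float(source_cos), 3))
print("|corr(PCA output, source)|:")
print(blend.round(2))
```

With this A the two source directions are far from perpendicular, so each perpendicular PCA axis ends up carrying a substantial share of both sources.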

PCA axes vs ICA axes on the same mixture

The same sheared scatter cloud of mixed signals, with both methods' directions drawn. PCA (blue) finds perpendicular max-variance axes — which slice across the cloud, not along its edges. ICA (orange) finds the cloud's actual edges — the true source directions, even though they're not perpendicular. For separating sources, only ICA's answer is right.

Notice: PCA's axes stay perpendicular; ICA's follow the cloud's true (non-perpendicular) edges.
"Uncorrelated" is necessary but not sufficient. Here's the precise relationship: independence implies uncorrelatedness, but not vice versa. Two signals can be perfectly uncorrelated (zero linear relationship) yet still strongly dependent in a nonlinear way — and a blend of sources is exactly that. PCA makes the outputs uncorrelated and stops, satisfied. ICA pushes further to make them independent, which is what actually un-blends them. PCA is often run first (to whiten the data), then ICA finishes the job — they're partners, not rivals.
Common misconception: "ICA is just a better PCA — use it instead." They solve different problems. PCA is for reducing dimensions and compression (rank directions by importance, keep the top few). ICA is for separating sources (find all the independent components; there's no inherent ranking). If you want to compress 100 features to 10, that's PCA. If you want to unmix 3 blended signals into 3 clean ones, that's ICA. Different tools, different jobs — and ICA needs the non-Gaussianity PCA doesn't care about.
What is the key difference between what PCA and ICA produce?

Chapter 7: The Algorithm & Its Ambiguities

How does ICA actually find W in practice? It's an optimization, in the same family as everything else in this course. We turn "make the outputs independent" into a number to maximize, then climb it.

The recipe has two stages. First, whiten the data (this is where PCA helps): center it and rescale so the mixtures are uncorrelated with equal variance, turning the sheared cloud into a more regular shape. After whitening, the only freedom left is a rotation, which shrinks the search to finding one angle (in 2D) or one rotation matrix (in general). Second, rotate to maximize non-Gaussianity: search for the rotation that makes the projected outputs as far from bell-curved as possible, measured by statistics like excess kurtosis (a measure of "peakedness/tailedness," zero for a Gaussian) or negentropy. The famous FastICA algorithm does exactly this, climbing toward maximal non-Gaussianity with a fast fixed-point iteration. Equivalently, you can derive ICA as maximum likelihood: choose W to make the recovered sources most probable under an assumed non-Gaussian source distribution, then do gradient ascent.
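The two stages can be sketched in plain NumPy for the 2D case. This is a brute-force illustration, not FastICA: instead of a fixed-point iteration it tries a grid of rotation angles and keeps the one with the largest total absolute excess kurtosis. The function names, the sources, and the mixing matrix are all assumptions made for the example.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

def whiten(x):
    """Center x (shape (2, n)) and transform it to identity covariance."""
    xc = x - x.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(xc))
    V = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # whitening matrix
    return V @ xc, V

def rotation(theta):
    c, sn = np.cos(theta), np.sin(theta)
    return np.array([[c, -sn], [sn, c]])

def ica_2d(x, n_angles=360):
    """Brute-force 2D ICA: whiten, then keep the most non-Gaussian rotation."""
    z, V = whiten(x)
    # A quarter turn only swaps and sign-flips the outputs, so [0, pi/2) suffices.
    angles = np.linspace(0.0, np.pi / 2, n_angles, endpoint=False)
    scores = [sum(abs(excess_kurtosis(row)) for row in rotation(a) @ z)
              for a in angles]
    R = rotation(angles[int(np.argmax(scores))])
    W = R @ V                                   # full unmixing matrix
    return W @ (x - x.mean(axis=1, keepdims=True)), W

rng = np.random.default_rng(0)
n = 20_000
t = np.linspace(0.0, 40.0, n)
s = np.vstack([np.sin(2.0 * t), rng.uniform(-1.0, 1.0, n)])   # hidden sources
A = np.array([[1.0, 0.6], [0.5, 1.0]])                        # hidden mixing
x = A @ s

s_hat, W = ica_2d(x)
# |correlation| between each recovered signal (rows) and each true source (columns).
match = np.abs(np.corrcoef(np.vstack([s_hat, s]))[:2, 2:])
print(match.round(3))
```

Each row of the printed matrix should have one entry near 1 and one near 0, meaning every output lines up with a single source. Which output matches which source, and with what sign and scale, is not determined.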

Two harmless ambiguities

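ICA can recover the sources, but with two unavoidable ambiguities, both of which, happily, don't matter for real applications:
  • Order: the recovered sources can come out in any order. Nothing in the mixtures says which source is "source 1," so the outputs may be swapped.
  • Scale and sign: each recovered source can be rescaled or sign-flipped, because any factor applied to a source can be undone by the inverse factor in the matching column of A.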

Common misconception: "These ambiguities mean ICA doesn't fully solve the problem." For the problems ICA is used on, the ambiguities are cosmetic. You recover each source's exact waveform shape — the actual content — just possibly reordered, rescaled, or sign-flipped, none of which changes what the signal is. A separated voice is a separated voice whether it's labeled "channel 1" or played at a slightly different volume. ICA solves the part that matters and is provably unable to resolve only the parts that don't.
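Why the ambiguities cannot be resolved is easy to check numerically: reordering and rescaling the sources, while compensating in the mixing matrix, yields exactly the same recordings. In the sketch below, P (a swap) and D (a rescale with one sign flip) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 1000))   # hypothetical sources
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])                    # illustrative mixing matrix
x = A @ s

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # swap the two sources
D = np.diag([2.0, -0.5])          # rescale both, and sign-flip the second

s_alt = D @ P @ s                                  # reordered, rescaled sources
A_alt = A @ np.linalg.inv(P) @ np.linalg.inv(D)    # compensating mixing matrix
x_alt = A_alt @ s_alt

print(np.allclose(x, x_alt))
```

Since (A, s) and (A_alt, s_alt) explain the observations equally well, no algorithm that sees only x can prefer one pair over the other.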
In practice, how does an ICA algorithm find the unmixing?

Chapter 8: Connections & Cheat Sheet

You've reached the end of the classical ML core — and ICA is a fitting finale, because it inverts the lesson every other method taught (Gaussian-is-good) and shows the power of going after a stronger goal (independence, not mere uncorrelatedness). You can now separate the inseparable.

The whole lesson on one page

| Concept | What it means |
|---|---|
| Goal | Blind source separation: recover independent sources from their observed mixtures (the cocktail party problem). |
| Mixing model | x = As: observed mixtures = (unknown) mixing matrix A times the hidden sources s. |
| Unmixing | Find W = A^{-1}; then s = Wx recovers the sources. |
| The criterion | The right W makes the outputs statistically independent, or equivalently, maximally non-Gaussian. |
| Non-Gaussian required | Gaussian sources give a rotationally symmetric mix with no recoverable directions. Sources must be non-Gaussian. |
| vs PCA | PCA: uncorrelated, orthogonal, max-variance (2nd-order). ICA: independent, possibly non-orthogonal (higher-order). Independence > uncorrelatedness. |
| Algorithm | Whiten (PCA), then rotate to maximize non-Gaussianity (FastICA / max-likelihood). |
| Ambiguities | Source order and scaling/sign are unrecoverable, but harmless for applications. |

ICA in code

python
import numpy as np
from scipy import signal
from sklearn.decomposition import FastICA

# Classic demo: mix a sine and a sawtooth, then separate them.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), signal.sawtooth(3 * t)]   # 2 non-Gaussian sources
A = np.array([[1.0, 1.0], [0.5, 2.0]])             # mixing matrix (unknown in practice)
X = S @ A.T                                        # observed mixtures, shape (n_samples, n_microphones)

ica = FastICA(n_components=2, whiten='unit-variance', random_state=0)
S_recovered = ica.fit_transform(X)   # the separated independent sources
A_est = ica.mixing_                  # estimated mixing matrix (up to order, scale, and sign)
# X ≈ S_recovered @ A_est.T + ica.mean_  (we unmixed without ever using A)

Where to go next

You've finished the Classical ML core

This is the last of the CS229 classical machine learning lessons. Together they form a complete foundation: linear and logistic regression, unified by GLMs; the generative view; the bias-variance tradeoff and model selection; and unsupervised learning — k-means, EM/GMM, PCA, and now ICA. You have the toolkit a working ML engineer reaches for daily.

You can now teach this. ICA solves blind source separation: given mixtures x = As of independent sources, find the unmixing W = A^{-1} so s = Wx. The trick: the right W makes the outputs maximally independent (equivalently, maximally non-Gaussian), because mixing makes signals more Gaussian (Central Limit Theorem). This is why ICA requires non-Gaussian sources: Gaussian mixtures are rotationally symmetric and unrecoverable. Unlike PCA (uncorrelated, orthogonal, variance), ICA achieves true independence using higher-order statistics. Whiten, then rotate to maximal non-Gaussianity, and the cocktail party is solved.

"The sciences do not try to explain, they hardly even try to interpret, they mainly make models." — John von Neumann. ICA's model — independent sources, linearly mixed — is simple enough to write in one line and powerful enough to pull a single voice from a roaring crowd.