Place Recognition: Bag-of-Words & Loop Closure

Chapter 0: The Robot That Forgot It Had Been Here Before

Picture a ground robot tasked with mapping a university campus. It drives for 10 minutes, collecting thousands of camera frames, and builds a local map using visual odometry (L11). VO is good — each step is accurate. But errors compound over time (the drift math from L11: approximately 1% of path length). After 500 m, the estimated position is off by ~5 m. The map looks like a slightly distorted rubber sheet.

Now the robot turns a corner and its camera captures an image of the same courtyard it left 10 minutes ago. If the robot could recognize "I have been here before," it could add a loop-closure constraint to its factor graph: a measurement that says "position now = position at time t₀." A global optimizer (SLAM, L13) would then pull the entire trajectory taut, distributing that 5-meter drift evenly around the loop and reducing final error to centimeters.

One recognition event undoes 10 minutes of drift accumulation. Loop closure is the difference between a warped map and a metrically accurate one. But it only works if the recognition is correct — a false loop closure (telling the optimizer two different places are the same) is catastrophic: the optimizer will twist the map to satisfy an impossible geometric constraint and it will never recover without manual intervention.

The recognition problem stated precisely

Given a query image q captured at the current time and a database D of images captured at all previous times, find the image in D that shows the same physical place as q — and do it fast enough to run in real time (tens of milliseconds), even when |D| is in the thousands or millions.

This is exactly image retrieval, a classical computer vision problem. The key insight from Sivic & Zisserman (Video Google, ICCV 2003) is that the entire text-retrieval machinery built over 50 years of web search — bag-of-words models, TF-IDF, inverted indexes — translates directly to images if you first map image features into a discrete "visual vocabulary."

The name "Video Google" is deliberate: the paper applied this to every frame of a movie and allowed users to query by example ("find all frames showing this actor's face"), demonstrating real-time retrieval in a 70 000-frame video. The robotics community adopted the same architecture for place recognition within a few years.

The loop-closure advantage: a concrete calculation

Return to the 500-meter exploration loop with 1% VO drift. The estimated endpoint is ~5 m off from the true start. After geometric verification confirms the loop closure, the SLAM back-end has the following constraints: ~1 000 odometry edges saying "I moved forward step-by-step" and 1 loop-closure edge saying "the endpoint is at the same position as the start." The optimizer distributes the 5 m error evenly: each waypoint shifts by about 5 mm. Final map error: on the order of millimeters — a 1000× improvement over raw VO. The only requirement: the recognition must be correct.

Why brute-force matching fails

A SIFT descriptor is 128 bytes. A single image has ~500–2000 keypoints. 10 000 database images → up to 20 million descriptors to compare against, per query. At 1 ns per comparison, that is 20 ms just for a single nearest-neighbor lookup — and we haven't counted the O(N²) cost of verifying geometric consistency. At 100 000 images this becomes unusable. We need a smarter representation.

Query image arrives

1 image, ~1000 SIFT descriptors (128-dim each)

↓

Brute-force option

Compare all 1000 descriptors vs all descriptors in D → O(|D|×N_desc) — too slow

↓ need a smarter representation

Bag-of-Words option

Map each descriptor to a "word" → histogram → fast cosine comparison in vocabulary space

↓ then verify top hit geometrically

Loop-closure decision

Add constraint to SLAM factor graph if verified

A VO system accumulates ~1% drift over path length. After a 600-meter loop, what is the approximate position error, and what does a successful loop-closure constraint do to that error?

0.6 cm error; loop closure reduces it to zero instantly 6 m error; loop closure lets the optimizer distribute the drift around the loop, reducing it to near zero 6 m error; loop closure adds more drift because it introduces a new measurement 60 m error; loop closure halves it

Chapter 1: Image Retrieval — The Right Framing

The key insight that unlocks scalable place recognition: stop thinking about this as geometry matching and start thinking about it as document search. In text retrieval, documents are represented as histograms of word frequencies, and a query document is compared to a database of histograms using fast dot-product similarity. Google can search billions of documents in milliseconds this way.

For images the analog is: replace documents with images, and replace text words with visual words — discrete tokens derived from local image descriptors. The image becomes a histogram of visual-word counts (order-less, just like a document's word counts). Matching is then a fast vector operation, not an exhaustive geometric search.

Two flavors of descriptor

Local descriptors (SIFT, ORB, BRIEF — L6) describe small patches around keypoints. They are invariant to rotation and scale. They capture parts of the scene — edges, corners, blobs. Bag-of-Words uses these. Global descriptors (GIST, NetVLAD) describe the whole image at once — the "spatial envelope." Easier to compute but less discriminative under viewpoint change. This lesson focuses on BoW (local descriptors), which is how DBoW2 and FAB-MAP work.

The retrieval framing decouples scalability from geometry. We pay for scalability by losing order: a BoW histogram doesn't care whether a "bench" visual word was in the top-left or bottom-right of the image. This is why geometric verification (Chapter 7) is mandatory — BoW finds fast candidates; geometry confirms them.

The analogy to text retrieval

The text analogy deserves to be pushed further because it is not just conceptual — the exact same algorithms run. In text retrieval: given a query document ("autonomous driving safety standards"), find the 10 most relevant documents in a corpus of 1 billion web pages. The solution: represent every document as a TF-IDF weighted sparse vector over a vocabulary of 50 000 common English words; precompute an inverted index; dot-product the query against the index to get candidate scores in milliseconds. Google's original PageRank paper relied on this exact infrastructure. Sivic & Zisserman (Video Google, ICCV 2003) directly applied it to images: visual words instead of text words, SIFT descriptor clusters instead of dictionary entries. The implementation is isomorphic — only the preprocessing (descriptor extraction + k-means) differs.

Precision vs. recall in loop-closure detection

In information retrieval, precision = fraction of retrieved images that are true matches; recall = fraction of true matches that were retrieved. For SLAM loop closure the tradeoff is asymmetric and brutal:

False positive (wrong loop closure): catastrophic — the optimizer gets told two different places are the same and distorts the entire map. Very hard to recover from.
False negative (missed loop closure): benign — the robot just accumulates a bit more drift, which the next correct loop closure can fix.

Therefore SLAM systems tune for high precision at the cost of recall. You might miss 70% of true revisits — that's fine. You must not add a single false constraint.

Misconception: "more loop closures = better SLAM." False. One wrong loop closure can permanently corrupt a map. Better to miss a loop closure than to add a false one. This is why geometric verification (Chapter 7) is not optional — it is the precision gate.

Approach	Scalability	Viewpoint change	Appearance change
Brute-force descriptor match	O(N) queries — fails at N>1000	Good (per-descriptor)	Poor
Bag-of-Words (this lesson)	O(log N) with vocab tree + inv. index	OK	Moderate with TF-IDF
NetVLAD / deep global	O(N) but compact embeddings	Good	Very good

In SLAM loop-closure detection, which error is more catastrophic and why?

False positive (adding a wrong loop closure) — it forces the optimizer to satisfy an impossible geometric constraint, permanently distorting the map False negative (missing a true loop closure) — the drift that was about to be corrected keeps accumulating Both are equally bad — the optimizer can detect and reject bad loop closures automatically

Chapter 2: Building the Visual Vocabulary — k-Means in Descriptor Space

A "visual word" is not a pixel, not a keypoint, and not an object. It is a cluster centroid in descriptor space. To understand this, you need to picture a 128-dimensional space where every SIFT descriptor from every training image is a point. Descriptors of similar patches (e.g., all left-side-of-a-window edges at similar scale) cluster together. The k-means algorithm finds K cluster centers — the visual words.

Building the vocabulary: step by step

Collect training data. Grab a diverse set of images (different environments, viewpoints, lighting). Not the runtime images — offline, before deployment. Typical: 10 000–100 000 images.
Extract descriptors. Run SIFT (or ORB) on every training image. Collect all descriptors — you might have 5–20 million 128-dim vectors.
Run k-means. Cluster all descriptors into K clusters. Each cluster center v₁, v₂, ..., v_K is a visual word. Typical K = 5 000–100 000.
Save the vocabulary. The vocabulary is just the K cluster centers. At runtime, any new descriptor is assigned to its nearest visual word by Euclidean distance in descriptor space.

The vocabulary is learned offline, once. At runtime the robot never re-runs k-means. It just uses the vocabulary as a lookup table: "given this descriptor, which of the K centroids is closest?" That lookup is fast, especially with a vocabulary tree (Chapter 5).

How many visual words? The K tradeoff

With K too small (say K = 50), many descriptors from different patches cluster into the same visual word — the histogram has poor resolution, everything looks like everything else. With K too large (say K = 1 000 000 without a tree), quantization is slow and the histogram is so sparse that two images of the same place rarely share any word (quantization noise puts the same physical patch into different nearby clusters). Empirically, K = 5 000–100 000 is the sweet spot. DBoW2 (ORB-SLAM2) uses K = 10⁶ via a vocabulary tree — large enough to be discriminative, fast enough via the tree to be real-time.

Worked numbers: quantizing a descriptor

Suppose we have a toy vocabulary with K = 4 visual words in 2D descriptor space (2D so we can draw it). The centroids are:

v₁ = (1.0, 2.0) v₂ = (4.0, 1.5) v₃ = (3.0, 5.0) v₄ = (6.5, 4.0)

A new descriptor d = (3.8, 1.8). Which visual word does it get?

dist(d, v₁) = √((3.8−1.0)² + (1.8−2.0)²) = √(7.84 + 0.04) = √7.88 ≈ 2.81
dist(d, v₂) = √((3.8−4.0)² + (1.8−1.5)²) = √(0.04 + 0.09) = √0.13 ≈ 0.36 ← nearest
dist(d, v₃) = √((3.8−3.0)² + (1.8−5.0)²) = √(0.64 + 10.24) = √10.88 ≈ 3.30
dist(d, v₄) = √((3.8−6.5)² + (1.8−4.0)²) = √(7.29 + 4.84) = √12.13 ≈ 3.48

Descriptor d is assigned to visual word 2 (centroid v₂). It contributes a count of +1 to word 2 in the image's histogram.

Descriptor Quantization — Drop a Descriptor, Watch It Snap to Its Visual Word

2D descriptor space with 4 visual word centroids (stars). Click anywhere in the canvas to place a descriptor (circle). It snaps to its nearest centroid. The distance to each centroid is shown.

Misconception: visual words are objects or textures. They are not. A visual word is a cluster in 128-dimensional descriptor space. Multiple different-looking patches can map to the same visual word if their SIFT descriptors are similar. And two patches that look similar to a human eye may map to different visual words. The vocabulary is learned from statistics, not semantics.

You have a visual vocabulary of K = 10 000 words, built from k-means clustering on SIFT descriptors. At runtime you detect a new SIFT descriptor d. How do you find which visual word it belongs to?

Re-run k-means on all descriptors including d to update the cluster assignments Compute Euclidean distance from d to each of the K centroids; assign d to the nearest one Run RANSAC between d and each centroid to find the best geometric fit Use the pixel coordinates of the keypoint that generated d to look up the vocabulary cell

Chapter 3: The Bag-of-Words Histogram

Once you have a vocabulary of K visual words, you can represent any image as a single vector of length K. Here's the recipe: detect ~500–2000 keypoints, compute a descriptor at each, assign each descriptor to its nearest visual word, count how many descriptors fell into each word. The result is a histogram h ∈ ℜ^K, where h[k] = number of descriptors assigned to visual word k.

This is exactly how text bag-of-words works for documents: ignore word order, just count how many times each word appears. The document "the cat sat on the mat" becomes the vector [cat:1, mat:1, on:1, sat:1, the:2, ...]. Order-less. Compact. Fast to compare.

Worked example: building a histogram from scratch

Vocabulary has K = 5 words. An image produces 8 descriptors, quantized as follows:

Descriptor	Nearest visual word
d₁	word 2
d₂	word 4
d₃	word 2
d₄	word 1
d₅	word 3
d₆	word 2
d₇	word 5
d₈	word 4

Raw histogram: h = [1, 3, 1, 2, 1]. Normalized (L1): h_norm = [1/8, 3/8, 1/8, 2/8, 1/8] = [0.125, 0.375, 0.125, 0.250, 0.125]. We always L1-normalize so that images with different numbers of keypoints are comparable. Note: we have thrown away all geometric information — we don't know if word-2 features were in the top-left or bottom-right. This is the "bag" in bag-of-words. The histogram is order-less.

python
import numpy as np
from sklearn.cluster import KMeans

# --- Offline: Build visual vocabulary ---
# training_descriptors: shape (N_total, 128) — all SIFT descriptors from training images
K = 500  # visual vocabulary size (toy: 500; production: 10k-100k)
kmeans = KMeans(n_clusters=K, random_state=42)
kmeans.fit(training_descriptors)
vocabulary = kmeans.cluster_centers_  # shape: (K, 128)

# --- Runtime: Build BoW histogram for a query image ---
def bow_histogram(descriptors, vocabulary):
    """descriptors: (N_desc, 128); returns L1-normalized histogram of shape (K,)"""
    K = vocabulary.shape[0]
    # For each descriptor, find its nearest visual word
    diffs = descriptors[:, np.newaxis, :] - vocabulary[np.newaxis, :, :]  # (N, K, 128)
    dists = np.linalg.norm(diffs, axis=-1)  # (N, K)
    word_ids = np.argmin(dists, axis=1)   # (N,) — one word per descriptor
    hist = np.bincount(word_ids, minlength=K).astype(np.float64)
    hist /= hist.sum() if hist.sum() > 0 else 1.0  # L1 normalize
    return hist

BoW Histogram Builder — Two Images Side by Side

Two synthetic "images" (left: outdoor scene, right: indoor scene) with descriptors mapped to 8 visual words. See how similar scenes produce similar histograms. Press Shuffle to randomize descriptors.

Comparing two images: the full worked calculation

Let's trace two images through the complete BoW process. K = 5. Image A has 8 descriptors, Image B has 10. After quantization:

A: words = [2, 4, 2, 1, 3, 2, 5, 4] h_A = [1, 3, 1, 2, 1] / 8 = [0.125, 0.375, 0.125, 0.250, 0.125]
B: words = [2, 2, 3, 1, 2, 4, 3, 2, 5, 1] h_B = [2, 4, 2, 1, 1] / 10 = [0.20, 0.40, 0.20, 0.10, 0.10]

Cosine similarity (raw TF, L2-normalized):

‖h_A‖₂ = √(0.125²+0.375²+0.125²+0.250²+0.125²) = √(0.016+0.141+0.016+0.063+0.016) = √0.252 ≈ 0.502
‖h_B‖₂ = √(0.04+0.16+0.04+0.01+0.01) = √0.26 ≈ 0.510
sim(A,B) = h_A·h_B / (‖h_A‖×‖h_B‖) = (0.025+0.150+0.025+0.025+0.013) / (0.502×0.510) ≈ 0.238 / 0.256 ≈ 0.93

These two images share most of their visual word distribution — a high similarity of 0.93, suggesting they show the same scene. A same-class word (word 2 appears 3 times in A and 4 times in B) dominates, consistent with both images showing, say, the same building facade with many edge-type features.

Misconception: BoW is a poor representation because it throws away geometry. This is true but it's a feature, not a bug. The order-less histogram is fast to compute and compare. For SLAM we do not need the geometry at the retrieval stage — we need it only for the top 1–5 candidates, which we verify geometrically in Chapter 7. Throwing away geometry first makes the search 1000× faster.

An image has 10 descriptors that quantize to visual words [3, 3, 1, 2, 3, 4, 1, 2, 5, 3]. What is h[3] (the count for word 3) and what is h[3] after L1 normalization?

h[3] = 3; normalized = 0.3 h[3] = 3; normalized = 0.5 h[3] = 4; normalized = 0.4 h[3] = 2; normalized = 0.2

Chapter 4: TF-IDF — Rewarding Distinctive Words

Imagine your visual vocabulary includes word 42, which corresponds to "straight horizontal edge at medium scale." This visual word appears in almost every image — buildings, roads, furniture all have horizontal edges. Matching on word 42 tells you almost nothing: it's the visual equivalent of the English word "the." If you rely on it, you'll retrieve irrelevant images that happen to have lots of horizontal edges.

The fix comes from text retrieval: TF-IDF weighting (Term Frequency × Inverse Document Frequency). Downweight words that appear everywhere (low IDF); upweight distinctive words that appear in only a few images (high IDF).

The formula

w_ik = tf_ik × idf_k

tf_ik = h_ik (normalized frequency of word k in image i)

idf_k = log( N ⁄ n_k )

where N is the total number of images in the database, and n_k is the number of database images that contain word k at all (n_k ≥ 1). After TF-IDF, image i is represented by the weighted vector w_i ∈ ℜ^K.

Worked numbers: TF-IDF in action

Database: N = 1000 images. Vocabulary: K = 5 words. Query image has normalized histogram h_q = [0.10, 0.30, 0.10, 0.40, 0.10]. Document frequencies:

Word k	n_k (images containing it)	idf_k = log(1000/n_k)
1	800	log(1.25) ≈ 0.22
2	50	log(20) ≈ 3.00
3	900	log(1.11) ≈ 0.10
4	10	log(100) ≈ 4.61
5	950	log(1.05) ≈ 0.05

TF-IDF weighted vector for the query: w_q = [0.10×0.22, 0.30×3.00, 0.10×0.10, 0.40×4.61, 0.10×0.05] = [0.022, 0.900, 0.010, 1.844, 0.005]. Notice: word 4 (rare, distinctive) now dominates even though it had the same raw count as word 2. Word 5 (ubiquitous) nearly vanishes. The retrieval system will now find images that share the same rare distinctive visual word — far more reliable than matching on common words.

TF-IDF Effect — Toggle Weighting On/Off

Raw BoW histogram (warm) vs TF-IDF weighted (teal). Toggle weighting and watch how common words shrink and rare distinctive words amplify. The ranking of a query against 4 database images can change.

TF-IDF OFF

Why the log in IDF?

Without the log, IDF = N/n_k. This grows linearly: a word appearing in 1 of 1 000 000 images would get IDF = 1 000 000, completely overwhelming all other words. The log compresses this: log(1 000 000) ≈ 13.8. The log also ensures that IDF = 0 when n_k = N (word in every image) and IDF is positive otherwise. In information-theoretic terms, IDF ≈ self-information: a word that occurs in fraction p = n_k/N of images carries −log(p) bits of information about which specific image it came from.

A second interpretation: imagine a random database image. How surprised are you to see visual word k in it? If k appears in 90% of images, not surprised at all (low IDF). If k appears in 0.1% of images, very surprised (high IDF). Surprise = discriminative power = how much this word helps you tell images apart.

python
import numpy as np

# Compute IDF vector from training set
def compute_idf(database_bow_binary, K):
    """
    database_bow_binary: (N_imgs, K) bool array — True if image contains word k
    Returns idf: (K,) array
    """
    N = database_bow_binary.shape[0]
    n_k = database_bow_binary.sum(axis=0).astype(float) + 1e-9  # avoid log(0)
    return np.log(N / n_k)  # shape: (K,)

# Apply TF-IDF and L2-normalize
def tfidf_vector(tf_hist, idf):
    """tf_hist: L1-normalized (K,); idf: (K,); returns L2-normalized (K,)"""
    w = tf_hist * idf
    norm = np.linalg.norm(w)
    return w / (norm + 1e-9)

IDF is computed offline from the training database. The n_k values come from counting how many training images contain each visual word. At runtime, the query just multiplies its TF histogram by the precomputed IDF vector — one element-wise multiply. Very fast.

Visual word A appears in 990 out of 1000 database images. Visual word B appears in 5 out of 1000. Which has higher IDF weight, and what does that mean for retrieval?

Word A (common words are more reliable for matching) Word B — idf_A = log(1000/990) ≈ 0.01 vs idf_B = log(1000/5) ≈ 5.3; word B is far more discriminative Both are equal — IDF only depends on term frequency within the query image

Chapter 5: Vocabulary Tree & Inverted Index — Scaling to Millions

Flat k-means with K = 100 000 visual words requires comparing each descriptor to 100 000 centroids — roughly 100 000 dot-products of 128-dim vectors. At 1000 descriptors per image, that's 10⁸ operations per frame, which is far too slow. The vocabulary tree solves this with a hierarchical k-means: instead of one big clustering, build a tree of branchings.

Hierarchical k-means: vocabulary tree structure

The tree has branching factor b (typically 10) and depth d (typically 6). The root clusters all descriptors into b groups. Each group is clustered into b sub-groups, and so on for d levels. Leaves are the visual words. Total visual words K = b^d (e.g., 10⁶ = 1 million). To quantize a descriptor, you traverse the tree top-down, picking the nearest centroid at each level. With b = 10 and d = 6: you compare to 10 + 10 + ... + 10 = 60 centroids (not 1 000 000). Speed: O(b × d) = O(60) instead of O(K) = O(1 000 000). That is a 16 000× speedup for quantization.

Cost (flat k-means): O(K) = O(b^d) per descriptor
Cost (vocab tree): O(b × d) per descriptor
Speedup: b^d ⁄ (b × d) = 10⁵ ⁄ 60 ≈ 1600× for b=10, d=6

Inverted index: fast candidate retrieval

Once images are represented as TF-IDF-weighted BoW vectors, comparing a query to a database naively requires computing cosine similarity against every stored vector — O(N × K) work. The inverted index (borrowed directly from web search) is the solution.

The inverted index is a dictionary: for each visual word k, store the list of database images that contain k, along with the TF-IDF weight. To retrieve candidates for a query:

For each visual word k with nonzero weight in the query, look up the inverted list for k — these are all database images that share this word with the query.
Accumulate similarity scores: for each (image_id, weight) in the inverted list, add tf_query,k × tf_db,k × idf_k² to that image's running score.
Return top-R images by score. Only images that share at least one visual word with the query are ever touched — images with zero shared words contribute zero and are never accessed.

The inverted index is why BoW is scalable. For a query with 1000 descriptors mapped to 1000 visual words, most database images share none of those words (if the vocabulary is large enough). You only compute similarity against the small subset that have at least one matching word. In practice this reduces the active set from millions of images to thousands.

python
# Inverted index structure
# inv_index[word_id] = list of (image_id, tfidf_weight) tuples
inv_index = {k: [] for k in range(K)}

# --- Build index (offline) ---
for img_id, bow_vec in enumerate(database_bow_vectors):
    for word_id, weight in enumerate(bow_vec):
        if weight > 0:
            inv_index[word_id].append((img_id, weight))

# --- Query (runtime) ---
def retrieve_candidates(query_bow, inv_index, top_k=5):
    scores = {}
    for word_id, q_weight in enumerate(query_bow):
        if q_weight <= 0: continue
        for img_id, db_weight in inv_index[word_id]:
            scores[img_id] = scores.get(img_id, 0.0) + q_weight * db_weight
    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]

Worked numbers: end-to-end scalability analysis

Let's put real numbers on the full pipeline for a system with N = 100 000 database images, K = 50 000 visual words (vocab tree b=10, d=5), and 1 000 descriptors per frame:

Step	Cost (flat k-means)	Cost (vocab tree + inv. index)
Quantize 1000 descriptors	1000 × 50 000 = 50M comparisons	1000 × (10×5) = 50 000 comparisons — 1000× faster
Build query histogram	O(K) = 50 000 ops	O(K) = same
Retrieve candidates	N × K dot-products = 5×10⁹	Only images sharing ≥1 word ≈ 500 images (est.) — 200× faster
Geometric verification	Up to N images to verify	Top-5 candidates only — 20 000× fewer
Total per frame	~50 billion operations — unusable	~10 million operations — ~20 ms real-time

This is why BoW with a vocabulary tree and inverted index is the backbone of real-time place recognition systems. Without these algorithmic components, the approach simply does not scale beyond toy datasets.

The "sparsity assumption" is the key. The inverted index only helps if most visual words are absent from most images. With K = 50 000 words and ~1 000 descriptors per image, each image touches at most 1 000 out of 50 000 words = 2% occupancy. Images that share a word at all are already a filtered subset. The sparser the histogram, the bigger the speedup.

Memory footprint of the inverted index

The inverted index stores, for each visual word k, a list of (image_id, tfidf_weight) pairs. With N = 100 000 images, 1 000 descriptors per image, and K = 50 000 words, the total number of (image_id, weight) entries is N × avg_words_per_image = 100 000 × 1 000 = 100 million entries. At 8 bytes each (4 for ID, 4 for float32 weight), that is 800 MB of RAM — feasible on a modern laptop. For N = 1 million images (full city map), this becomes 8 GB — at the boundary of RAM-based indexing. Production systems compress the index (8-bit quantized weights, gap-coded IDs) or use approximate nearest-neighbor structures (HNSW, FAISS) for the candidate retrieval stage.

A vocabulary tree has branching factor b = 6 and depth d = 5. How many visual words K does it have, and how many centroid comparisons does it take to quantize one descriptor?

K = 6×5 = 30 words; 30 comparisons K = 6⁵ = 7 776 words; 6×5 = 30 comparisons (vs 7 776 for flat) K = 6⁵ = 7 776 words; 7 776 comparisons (same as flat) K = 5⁶ = 15 625 words; 6 comparisons only

Chapter 6: Cosine Similarity & Ranking

After building TF-IDF weighted histograms, comparing two images reduces to a single vector operation: cosine similarity. Given query vector w_q and database image vector w_i, both L2-normalized:

sim(q, i) = w_q · w_i = ∑_k=1^K w_qk × w_ik

Range: [−1, 1], but since TF-IDF weights are non-negative, range is [0, 1]. A score of 1.0 means identical histograms (same place, same viewpoint). A score near 0 means no shared distinctive visual words (completely different scenes).

Worked numbers: cosine similarity

Suppose K = 5. After TF-IDF weighting and L2-normalization:

w_q = [0.05, 0.60, 0.02, 0.80, 0.01] (L2-norm ≈ 1.0)

w_A = [0.04, 0.55, 0.03, 0.82, 0.02] (same place, different viewpoint)

w_B = [0.50, 0.60, 0.50, 0.10, 0.30] (different place — aliased)

sim(q, A) = 0.05×0.04 + 0.60×0.55 + 0.02×0.03 + 0.80×0.82 + 0.01×0.02
= 0.002 + 0.330 + 0.0006 + 0.656 + 0.0002 ≈ 0.989

sim(q, B) = 0.05×0.50 + 0.60×0.60 + 0.02×0.50 + 0.80×0.10 + 0.01×0.30
= 0.025 + 0.360 + 0.010 + 0.080 + 0.003 ≈ 0.478

Image A (same place) scores 0.989 and is ranked first. Image B (different place) scores 0.478 and is ranked lower. But notice: B still has a non-trivial similarity score of 0.48 — because word 2 (a common word) contributes heavily. With better TF-IDF weighting (reducing word 2's influence), B would drop further. This is why TF-IDF matters.

Temporal consistency check

An additional layer of robustness: if image at time t is a true loop closure match to database image at time t₀, then images at times t−1, t−2 should also match images near t₀. If a strong similarity score appears for only one isolated frame in time with no neighboring matches, it's likely a false positive. DBoW2 implements a "temporal window" filter for this.

Similarity Ranking — Query vs Database of 6 Images

A query image has a TF-IDF histogram. Six database images have histograms. Cosine similarity bars rank the matches. The top match is highlighted. Move the slider to add perceptual aliasing — watch a false match rise in rank.

Aliasing level 0.00

Two images have TF-IDF vectors (after L2 normalization): w_q = [0.0, 0.7, 0.0, 0.71, 0.0] and w_i = [0.0, 0.7, 0.0, 0.71, 0.0]. What is cos sim(q, i)?

0.5 — equal vectors have similarity 0.5 by definition 0.0 — they share no unique features 1.0 — 0.7×0.7 + 0.71×0.71 = 0.49 + 0.504 = 0.994 ≈ 1.0 (identical L2-normalized vectors → dot-product = 1)

Chapter 7: Geometric Verification — The Precision Gate

BoW is order-less. Two entirely different places can produce similar histograms if they share common visual words — this is perceptual aliasing. A long corridor photographed from two different ends might share many "straight edge" and "uniform surface" visual words even though the camera poses are meters apart and geometrically incompatible.

Even with TF-IDF correctly down-weighting the common words, if the environment is sufficiently self-similar — think an airport terminal with repeating gate numbers and identical carpet patterns, or a long featureless parking garage — the residual alias score can easily exceed a loose threshold. Geometric verification is the only reliable way to separate these cases.

Geometric verification is the second stage: take the top-k BoW candidates and for each one, run full feature matching followed by geometric consistency checking. If the features can be explained by a valid camera geometry (a fundamental matrix from L7, estimated by RANSAC from L8), the match is confirmed as a true loop closure. If RANSAC finds too few inliers, the candidate is rejected — no loop closure added.

How many candidates should you verify?

The number k of top-BoW candidates passed to geometric verification is a precision-recall knob. k = 1: fastest, but you might miss the true match if the actual revisit is ranked 2nd (due to imperfect TF-IDF or quantization). k = 10: more likely to find the true match, but 10 geometric verifications at ~5 ms each = 50 ms extra per keyframe. Common choices: k = 3 to 5. If any one of the k candidates passes geometric verification, that match is accepted and no further candidates need to be checked — break at the first success.

The verification pipeline

BoW Top-5 candidates

Fast cosine similarity from inverted index

↓

Feature re-match

SIFT descriptors in query vs candidate: ratio-test matching (L6)

↓

RANSAC fundamental matrix

8-point algorithm with RANSAC (L7, L8) — find F consistent with most matches

↓

Inlier count

If ≥ threshold (e.g., 30 inliers): loop closure confirmed. Below: rejected.

↓

Add to SLAM factor graph

Relative pose T_q,db from essential matrix E (L7); optimizer runs

Perceptual aliasing — a worked example

Robot is in a symmetrical building with two identical-looking corridors. Corridor A (visited at t = 0) and Corridor B (current, t = 10 min) have the same flooring, walls, and lighting. BoW produces sim(A, B) = 0.93 — very high. But when you run RANSAC on matched SIFT features between A and B, you find only 4 inlier pairs (threshold = 30) because the actual 3D structure is incompatible — the floor tiles in A and B match descriptor-for-descriptor but are at different depths and orientations. Geometric verification rejects this false loop closure. Crisis averted.

Misconception: if BoW similarity is very high (>0.9), the match must be geometrically correct. Not true. Perceptual aliasing can produce near-1.0 BoW similarity between completely different places, especially in repetitive environments (corridors, warehouses, parking lots, staircases). BoW is necessary but not sufficient. Always verify geometrically.

Geometric Verification — True vs False Positive

Two scenarios: a true revisit (many RANSAC inliers) and a perceptual-aliasing false positive (few inliers). Adjust the inlier threshold slider to see where the decision boundary is. Red matches = outliers; teal matches = inliers.

Inlier threshold 20

Scenario True revisit

Appearance change — when BoW fails at the retrieval stage

Even with geometric verification, a deeper problem lurks: if the same place looks different enough that descriptors change, the visual words themselves change, and the BoW histogram changes. Day vs. night is the classic case. A wall lit by warm afternoon sun produces different SIFT gradients than the same wall under a streetlamp. The visual words may not even overlap. Similarity drops to near zero — not a false positive, a false negative. The loop closure is missed entirely.

Strategies to handle this: (a) vocabulary trained on multi-condition data — include images from different times of day, seasons, and weather; (b) sequence-based matching — compare sequences of frames rather than single frames; a sequence of 5 temporally consistent matches is far more reliable than any single match; (c) learning-based descriptors (NetVLAD, SuperPoint) that are trained to be invariant to appearance change; (d) probability-based detection (FAB-MAP) which models the joint distribution of word co-occurrences and is more robust to partial word-set changes.

python
# Geometric verification: full pipeline
import cv2
import numpy as np

def verify_loop_closure(query_img, candidate_img, inlier_threshold=25):
    """Returns (accepted: bool, n_inliers: int, F: ndarray or None)"""
    # Step 1: detect and describe keypoints
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(query_img, None)
    kp2, des2 = sift.detectAndCompute(candidate_img, None)
    if des1 is None or des2 is None: return False, 0, None

    # Step 2: ratio-test matching (Lowe's ratio test, L6)
    bf = cv2.BFMatcher()
    matches = bf.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    if len(good) < 8: return False, len(good), None

    # Step 3: RANSAC fundamental matrix (L7 + L8)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.999)
    if F is None or mask is None: return False, 0, None

    # Step 4: count inliers
    n_inliers = int(mask.ravel().sum())
    accepted = n_inliers >= inlier_threshold
    return accepted, n_inliers, F

# Usage in SLAM loop:
# top_candidates = retrieve_candidates(query_bow, inv_index, top_k=5)
# for img_id, bow_score in top_candidates:
#     accepted, n_in, F = verify_loop_closure(query_img, db_images[img_id])
#     if accepted:
#         add_loop_closure_constraint(factor_graph, current_pose, db_poses[img_id], F)
#         break  # one verified closure is enough per keyframe

RANSAC geometric verification finds 12 inlier feature matches between the query and the top BoW candidate. The threshold is 25 inliers. Should this be accepted as a loop closure?

Yes — 12 inliers is quite good and the BoW score was high No — 12 inliers is below the threshold of 25; reject to avoid a false loop closure; wait for a frame with more geometric evidence Yes — any positive inlier count means the match is geometrically valid

Chapter 8: Showcase — Full BoW Place Recognition Pipeline

This is the payoff simulation. A robot drives a loop and collects images. An inverted-index BoW database is built incrementally. At each step the current frame is queried against the database. Watch the similarity scores, TF-IDF weighting, and geometric verification all interact. Tune parameters to see how the system degrades under aliasing or improves with better vocabulary size.

What makes this simulation useful is that it lets you build intuition for parameter tradeoffs that are otherwise hard to reason about from theory alone. Try: large K + high aliasing — the threshold needs to be very high to stay clean. Small K + low aliasing — the threshold can be loose and you still get clean loop closure detection. These are the engineering judgments a SLAM practitioner makes when deploying to a new environment.

Three controls govern the behavior:

Vocabulary size K: larger vocabularies produce more distinctive visual words → fewer aliased matches → the similarity curve stays low until a true revisit. Smaller K causes many frames to score moderately (flatter curves, more false triggers).
Aliasing level: simulates a repetitive environment (warehouse aisles, long corridors). At 0.0 only the true revisit spike near the end of the loop crosses the threshold. At 0.8 multiple frames score high, requiring a higher threshold or more geometric caution.
Similarity threshold τ: the minimum BoW score to trigger geometric verification. Raising τ reduces false positives but may delay or miss the true loop closure frame. Watch what happens when you set τ too high — the loop closure fires late or not at all.

Place Recognition Simulator — Loop Closure Detection

Left: robot trajectory (grey=VO drift, teal=corrected). Right: BoW similarity scores as the robot queries its database. When a verified loop closure fires, the trajectory snaps back. Adjust vocabulary size and aliasing to see how performance changes.

Vocabulary size (K) 80

Aliasing level 0.2

Sim threshold τ 0.70

Chapter 9: Connections & Cheat Sheet

The complete BoW pipeline in one sentence per stage

Stage	What it does	Key number
Image retrieval framing	Treat place recognition as document search — find the nearest image in a database of past frames	O(log N) with vocab tree + inv. index
Visual vocabulary (k-means)	Cluster all training descriptors into K centroids = visual words	K = 5k–100k; built once offline
Quantization	Assign each runtime descriptor to its nearest visual word by Euclidean distance	O(b×d) per descriptor with vocab tree
BoW histogram	Count how many descriptors fell into each visual word; L1-normalize	K-dimensional vector; order-less
TF-IDF weighting	Downweight ubiquitous words; upweight rare distinctive ones	idf_k = log(N⁄n_k)
Inverted index	Fast candidate retrieval: only touch images that share at least one visual word with query	Reduces active set from millions to thousands
Cosine similarity	Dot-product of L2-normalized TF-IDF vectors	Range [0, 1]; 1 = identical histograms
Geometric verification	Run RANSAC + fundamental matrix on top-k BoW candidates; reject below inlier threshold	Threshold ~20–50 inliers; precision gate
Precision over recall	Tune for no false positives; missed loop closures are benign; false ones are catastrophic	One false loop closure can corrupt the full map

Where this fits in the VNAV curriculum

Place recognition is the bridge between odometry (L11) and full SLAM (L13). Without it, VO/VIO just drifts indefinitely. With loop closures, the optimizer gets global constraints that pull the map into metric accuracy. The full chain is:

L6: Feature Detection

SIFT descriptors that BoW ingests

↓

L7: Two-View Geometry

Fundamental matrix for geometric verification

↓

L8: RANSAC

Robust inlier counting in geometric verification

↓

L11: VIO

Produces the drifting trajectory that loop closure corrects

↓

L12: Place Recognition (this lesson)

Detects loop closures; adds constraints to factor graph

↓

L13: Full SLAM

Uses loop-closure constraints to eliminate drift globally

The deeper insight: memory as measurement

Place recognition is philosophically interesting: it is the mechanism by which a robot forms a spatial memory and uses it as a measurement. A new camera frame is not just data to process forward — it is a query into a temporal database of all past experience. The match is a measurement: "the robot is now where it was at time t₀." In factor-graph language, this is a binary constraint between two nodes in the pose graph that would otherwise only be connected through a long chain of odometry measurements. The loop closure creates a shortcut — a direct constraint across time — and the optimizer uses it to make the entire map self-consistent.

This is why BoW + geometric verification remains competitive even against modern deep learning approaches for SLAM: it is fast, runs on-device without GPU, requires no training on the specific deployment environment (only a generic vocabulary), and produces a verified geometric constraint with quantified covariance — exactly what the factor-graph optimizer needs. Deep learning descriptors (NetVLAD, SuperPoint+SuperGlue) have higher recall under appearance change, but their outputs must still be fed through the same geometric verification stage before being added to the graph.

Real implementations

DBoW2 (Galvez-Lopez & Tardos, 2012): the standard C++ BoW library used in ORB-SLAM2 and VINS-Mono. Uses a vocabulary tree trained on 9000 images. Includes temporal consistency scoring and geometric verification. FAB-MAP (Cummins & Newman, 2008): a probabilistic extension using a Chow-Liu tree model of visual word co-occurrence, which captures the statistical dependencies between visual words that flat BoW misses. NetVLAD (Arandjelovic et al., 2016): a deep learning approach that learns a global descriptor end-to-end from place recognition data — the "neural" evolution of BoW that handles appearance change much better.

Feynman test: Can you explain to a non-expert why a robot can recognize a place without knowing its precise GPS coordinates, using only a library of past images and a dictionary of visual words? And why a high similarity score alone is not enough — why you must confirm the match with geometry before trusting it? If you can — you've internalized this lesson.

Limitations and failure modes

Appearance change: Day/night, snow, autumn → descriptors change → visual words change → BoW histograms change → missed loop closures. Multi-session mapping and learning-based descriptors address this.
Perceptual aliasing: Corridors, parking lots, indoor spaces — repetitive visual content → false BoW matches → requires strong geometric verification.
Vocabulary out-of-distribution: Vocabulary trained on outdoor scenes may fail indoors. Best practice: train vocabulary on data from the target environment.
Scalability limits: Inverted index requires RAM proportional to database size. At millions of images, approximate nearest-neighbor methods and database pruning become necessary.
Viewpoint sensitivity: A 90° rotation of the camera changes which keypoints are detected and which visual words appear. SIFT is rotation-invariant for individual descriptors, but the histogram distribution changes with extreme viewpoint change. Systems mitigate this by adding loop-closure candidates from multiple recent frames (temporal window) rather than relying on any single frame.
Motion blur and feature density: A fast-moving robot or low-light camera produces blurry frames with few detectable keypoints. Fewer descriptors → sparser BoW histogram → reduced cosine similarity even for a true revisit. Some systems set a minimum keypoint count threshold and skip frames that fall below it.

Common parameter choices in practice

These numbers come from real deployed systems (ORB-SLAM2, VINS-Mono, OpenVSLAM):

Parameter	Typical range	Effect of increasing
Vocabulary size K	1k (fast/small) – 100k (discriminative)	More distinctive words → fewer false BoW matches → higher precision; slower to build (but offline)
Tree branching b	6 – 10	Shallower tree at same K → fewer comparisons per descriptor (but wider nodes)
Tree depth d	5 – 6	More levels → more leaf nodes → more precise quantization
BoW similarity threshold τ	0.6 – 0.9	Higher τ → fewer false positives → lower recall; tune based on environment (repetitive = higher τ)
Geometric inlier threshold	20 – 50 inliers	Higher threshold → fewer false loop closures; may miss loop closures if scene has few features
Top-k candidates to verify	3 – 10	More candidates → higher recall; more computation; usually top-5 is a good balance

Integration with factor-graph SLAM

When a loop closure is confirmed by geometric verification, the system has: the identity of the database frame (timestamp t₀, pose X₀ in the factor graph) and an estimate of the relative pose T_q,0 derived from the essential matrix E decomposition. This relative pose, with covariance derived from RANSAC inlier count and reprojection residuals, is added as a new binary edge between the current pose node X_t and the historical node X₀ in the factor graph. The next time the back-end optimizer runs (iSAM2 or g²o), it sees this new constraint and re-solves the entire trajectory — distributing the accumulated drift to minimize all constraint residuals simultaneously.

This is exactly the "snap-back" described in Chapter 0. The factor graph was an open chain of odometry edges; the loop closure adds the closing edge, turning it into a loop, and the optimizer finds the trajectory that minimizes drift around the full loop. With k correct loop closures, the final trajectory error falls as approximately 1/k of the original drift — each closure gives the optimizer more constraints to anchor the map.

A SLAM system using BoW place recognition is deployed in a large warehouse with identical-looking aisles. Which combination of design choices will give the most reliable loop-closure detection?

Large vocabulary (100k words), no geometric verification, low similarity threshold Small vocabulary (100 words), high similarity threshold, no TF-IDF Large vocabulary (100k), TF-IDF weighting, high similarity threshold, mandatory geometric verification with strict inlier count Medium vocabulary (1000 words), no TF-IDF, geometric verification with loose inlier threshold