Your robot just completed a 500-meter exploration loop. Visual odometry has accumulated 3 meters of drift — the map is warped. But when the robot turns a corner and its camera shows the same courtyard it left 10 minutes ago, a single recognition event can snap the whole map back into alignment. This is place recognition: matching one image against thousands of past frames, in milliseconds, without GPS. The engine behind this is the Bag-of-Visual-Words model — a direct translation of 50 years of text-retrieval research into image space, using k-means clustering, TF-IDF weighting, inverted indexes, and RANSAC-verified geometry. MIT 16.485 Lectures 21–22 by Luca Carlone.
Picture a ground robot tasked with mapping a university campus. It drives for 10 minutes, collecting thousands of camera frames, and builds a local map using visual odometry (L11). VO is good — each step is accurate. But errors compound over time (the drift math from L11: approximately 1% of path length). After 500 m, the estimated position is off by ~5 m. The map looks like a slightly distorted rubber sheet.
Now the robot turns a corner and its camera captures an image of the same courtyard it left 10 minutes ago. If the robot could recognize "I have been here before," it could add a loop-closure constraint to its factor graph: a measurement that says "position now = position at time t0." A global optimizer (SLAM, L13) would then pull the entire trajectory taut, distributing that 5-meter drift evenly around the loop and reducing final error to centimeters.
Given a query image q captured at the current time and a database D of images captured at all previous times, find the image in D that shows the same physical place as q — and do it fast enough to run in real time (tens of milliseconds), even when |D| is in the thousands or millions.
This is exactly image retrieval, a classical computer vision problem. The key insight from Sivic & Zisserman (Video Google, ICCV 2003) is that the entire text-retrieval machinery built over 50 years of web search — bag-of-words models, TF-IDF, inverted indexes — translates directly to images if you first map image features into a discrete "visual vocabulary."
The name "Video Google" is deliberate: the paper applied this to every frame of a movie and allowed users to query by example ("find all frames showing this actor's face"), demonstrating real-time retrieval in a 70 000-frame video. The robotics community adopted the same architecture for place recognition within a few years.
Return to the 500-meter exploration loop with 1% VO drift. The estimated endpoint is ~5 m off from the true start. After geometric verification confirms the loop closure, the SLAM back-end has the following constraints: ~1 000 odometry edges saying "I moved forward step-by-step" and 1 loop-closure edge saying "the endpoint is at the same position as the start." The optimizer distributes the 5 m error evenly: each waypoint shifts by about 5 mm. Final map error: on the order of millimeters — a 1000× improvement over raw VO. The only requirement: the recognition must be correct.
A SIFT descriptor is 128 bytes. A single image has ~500–2000 keypoints. 10 000 database images → up to 20 million descriptors to compare against, per query. At 1 ns per comparison, that is 20 ms just for a single nearest-neighbor lookup — and we haven't counted the O(N2) cost of verifying geometric consistency. At 100 000 images this becomes unusable. We need a smarter representation.
The key insight that unlocks scalable place recognition: stop thinking about this as geometry matching and start thinking about it as document search. In text retrieval, documents are represented as histograms of word frequencies, and a query document is compared to a database of histograms using fast dot-product similarity. Google can search billions of documents in milliseconds this way.
For images the analog is: replace documents with images, and replace text words with visual words — discrete tokens derived from local image descriptors. The image becomes a histogram of visual-word counts (order-less, just like a document's word counts). Matching is then a fast vector operation, not an exhaustive geometric search.
Local descriptors (SIFT, ORB, BRIEF — L6) describe small patches around keypoints. They are invariant to rotation and scale. They capture parts of the scene — edges, corners, blobs. Bag-of-Words uses these. Global descriptors (GIST, NetVLAD) describe the whole image at once — the "spatial envelope." Easier to compute but less discriminative under viewpoint change. This lesson focuses on BoW (local descriptors), which is how DBoW2 and FAB-MAP work.
The text analogy deserves to be pushed further because it is not just conceptual — the exact same algorithms run. In text retrieval: given a query document ("autonomous driving safety standards"), find the 10 most relevant documents in a corpus of 1 billion web pages. The solution: represent every document as a TF-IDF weighted sparse vector over a vocabulary of 50 000 common English words; precompute an inverted index; dot-product the query against the index to get candidate scores in milliseconds. Google's original PageRank paper relied on this exact infrastructure. Sivic & Zisserman (Video Google, ICCV 2003) directly applied it to images: visual words instead of text words, SIFT descriptor clusters instead of dictionary entries. The implementation is isomorphic — only the preprocessing (descriptor extraction + k-means) differs.
In information retrieval, precision = fraction of retrieved images that are true matches; recall = fraction of true matches that were retrieved. For SLAM loop closure the tradeoff is asymmetric and brutal:
Therefore SLAM systems tune for high precision at the cost of recall. You might miss 70% of true revisits — that's fine. You must not add a single false constraint.
| Approach | Scalability | Viewpoint change | Appearance change |
|---|---|---|---|
| Brute-force descriptor match | O(N) queries — fails at N>1000 | Good (per-descriptor) | Poor |
| Bag-of-Words (this lesson) | O(log N) with vocab tree + inv. index | OK | Moderate with TF-IDF |
| NetVLAD / deep global | O(N) but compact embeddings | Good | Very good |
A "visual word" is not a pixel, not a keypoint, and not an object. It is a cluster centroid in descriptor space. To understand this, you need to picture a 128-dimensional space where every SIFT descriptor from every training image is a point. Descriptors of similar patches (e.g., all left-side-of-a-window edges at similar scale) cluster together. The k-means algorithm finds K cluster centers — the visual words.
With K too small (say K = 50), many descriptors from different patches cluster into the same visual word — the histogram has poor resolution, everything looks like everything else. With K too large (say K = 1 000 000 without a tree), quantization is slow and the histogram is so sparse that two images of the same place rarely share any word (quantization noise puts the same physical patch into different nearby clusters). Empirically, K = 5 000–100 000 is the sweet spot. DBoW2 (ORB-SLAM2) uses K = 106 via a vocabulary tree — large enough to be discriminative, fast enough via the tree to be real-time.
Suppose we have a toy vocabulary with K = 4 visual words in 2D descriptor space (2D so we can draw it). The centroids are:
A new descriptor d = (3.8, 1.8). Which visual word does it get?
Descriptor d is assigned to visual word 2 (centroid v2). It contributes a count of +1 to word 2 in the image's histogram.
2D descriptor space with 4 visual word centroids (stars). Click anywhere in the canvas to place a descriptor (circle). It snaps to its nearest centroid. The distance to each centroid is shown.
Once you have a vocabulary of K visual words, you can represent any image as a single vector of length K. Here's the recipe: detect ~500–2000 keypoints, compute a descriptor at each, assign each descriptor to its nearest visual word, count how many descriptors fell into each word. The result is a histogram h ∈ ℜK, where h[k] = number of descriptors assigned to visual word k.
This is exactly how text bag-of-words works for documents: ignore word order, just count how many times each word appears. The document "the cat sat on the mat" becomes the vector [cat:1, mat:1, on:1, sat:1, the:2, ...]. Order-less. Compact. Fast to compare.
Vocabulary has K = 5 words. An image produces 8 descriptors, quantized as follows:
| Descriptor | Nearest visual word |
|---|---|
| d1 | word 2 |
| d2 | word 4 |
| d3 | word 2 |
| d4 | word 1 |
| d5 | word 3 |
| d6 | word 2 |
| d7 | word 5 |
| d8 | word 4 |
Raw histogram: h = [1, 3, 1, 2, 1]. Normalized (L1): hnorm = [1/8, 3/8, 1/8, 2/8, 1/8] = [0.125, 0.375, 0.125, 0.250, 0.125]. We always L1-normalize so that images with different numbers of keypoints are comparable. Note: we have thrown away all geometric information — we don't know if word-2 features were in the top-left or bottom-right. This is the "bag" in bag-of-words. The histogram is order-less.
python import numpy as np from sklearn.cluster import KMeans # --- Offline: Build visual vocabulary --- # training_descriptors: shape (N_total, 128) — all SIFT descriptors from training images K = 500 # visual vocabulary size (toy: 500; production: 10k-100k) kmeans = KMeans(n_clusters=K, random_state=42) kmeans.fit(training_descriptors) vocabulary = kmeans.cluster_centers_ # shape: (K, 128) # --- Runtime: Build BoW histogram for a query image --- def bow_histogram(descriptors, vocabulary): """descriptors: (N_desc, 128); returns L1-normalized histogram of shape (K,)""" K = vocabulary.shape[0] # For each descriptor, find its nearest visual word diffs = descriptors[:, np.newaxis, :] - vocabulary[np.newaxis, :, :] # (N, K, 128) dists = np.linalg.norm(diffs, axis=-1) # (N, K) word_ids = np.argmin(dists, axis=1) # (N,) — one word per descriptor hist = np.bincount(word_ids, minlength=K).astype(np.float64) hist /= hist.sum() if hist.sum() > 0 else 1.0 # L1 normalize return hist
Two synthetic "images" (left: outdoor scene, right: indoor scene) with descriptors mapped to 8 visual words. See how similar scenes produce similar histograms. Press Shuffle to randomize descriptors.
Let's trace two images through the complete BoW process. K = 5. Image A has 8 descriptors, Image B has 10. After quantization:
Cosine similarity (raw TF, L2-normalized):
These two images share most of their visual word distribution — a high similarity of 0.93, suggesting they show the same scene. A same-class word (word 2 appears 3 times in A and 4 times in B) dominates, consistent with both images showing, say, the same building facade with many edge-type features.
Imagine your visual vocabulary includes word 42, which corresponds to "straight horizontal edge at medium scale." This visual word appears in almost every image — buildings, roads, furniture all have horizontal edges. Matching on word 42 tells you almost nothing: it's the visual equivalent of the English word "the." If you rely on it, you'll retrieve irrelevant images that happen to have lots of horizontal edges.
The fix comes from text retrieval: TF-IDF weighting (Term Frequency × Inverse Document Frequency). Downweight words that appear everywhere (low IDF); upweight distinctive words that appear in only a few images (high IDF).
where N is the total number of images in the database, and nk is the number of database images that contain word k at all (nk ≥ 1). After TF-IDF, image i is represented by the weighted vector wi ∈ ℜK.
Database: N = 1000 images. Vocabulary: K = 5 words. Query image has normalized histogram hq = [0.10, 0.30, 0.10, 0.40, 0.10]. Document frequencies:
| Word k | nk (images containing it) | idfk = log(1000/nk) |
|---|---|---|
| 1 | 800 | log(1.25) ≈ 0.22 |
| 2 | 50 | log(20) ≈ 3.00 |
| 3 | 900 | log(1.11) ≈ 0.10 |
| 4 | 10 | log(100) ≈ 4.61 |
| 5 | 950 | log(1.05) ≈ 0.05 |
TF-IDF weighted vector for the query: wq = [0.10×0.22, 0.30×3.00, 0.10×0.10, 0.40×4.61, 0.10×0.05] = [0.022, 0.900, 0.010, 1.844, 0.005]. Notice: word 4 (rare, distinctive) now dominates even though it had the same raw count as word 2. Word 5 (ubiquitous) nearly vanishes. The retrieval system will now find images that share the same rare distinctive visual word — far more reliable than matching on common words.
Raw BoW histogram (warm) vs TF-IDF weighted (teal). Toggle weighting and watch how common words shrink and rare distinctive words amplify. The ranking of a query against 4 database images can change.
Without the log, IDF = N/nk. This grows linearly: a word appearing in 1 of 1 000 000 images would get IDF = 1 000 000, completely overwhelming all other words. The log compresses this: log(1 000 000) ≈ 13.8. The log also ensures that IDF = 0 when nk = N (word in every image) and IDF is positive otherwise. In information-theoretic terms, IDF ≈ self-information: a word that occurs in fraction p = nk/N of images carries −log(p) bits of information about which specific image it came from.
A second interpretation: imagine a random database image. How surprised are you to see visual word k in it? If k appears in 90% of images, not surprised at all (low IDF). If k appears in 0.1% of images, very surprised (high IDF). Surprise = discriminative power = how much this word helps you tell images apart.
python import numpy as np # Compute IDF vector from training set def compute_idf(database_bow_binary, K): """ database_bow_binary: (N_imgs, K) bool array — True if image contains word k Returns idf: (K,) array """ N = database_bow_binary.shape[0] n_k = database_bow_binary.sum(axis=0).astype(float) + 1e-9 # avoid log(0) return np.log(N / n_k) # shape: (K,) # Apply TF-IDF and L2-normalize def tfidf_vector(tf_hist, idf): """tf_hist: L1-normalized (K,); idf: (K,); returns L2-normalized (K,)""" w = tf_hist * idf norm = np.linalg.norm(w) return w / (norm + 1e-9)
Flat k-means with K = 100 000 visual words requires comparing each descriptor to 100 000 centroids — roughly 100 000 dot-products of 128-dim vectors. At 1000 descriptors per image, that's 108 operations per frame, which is far too slow. The vocabulary tree solves this with a hierarchical k-means: instead of one big clustering, build a tree of branchings.
The tree has branching factor b (typically 10) and depth d (typically 6). The root clusters all descriptors into b groups. Each group is clustered into b sub-groups, and so on for d levels. Leaves are the visual words. Total visual words K = bd (e.g., 106 = 1 million). To quantize a descriptor, you traverse the tree top-down, picking the nearest centroid at each level. With b = 10 and d = 6: you compare to 10 + 10 + ... + 10 = 60 centroids (not 1 000 000). Speed: O(b × d) = O(60) instead of O(K) = O(1 000 000). That is a 16 000× speedup for quantization.
Once images are represented as TF-IDF-weighted BoW vectors, comparing a query to a database naively requires computing cosine similarity against every stored vector — O(N × K) work. The inverted index (borrowed directly from web search) is the solution.
The inverted index is a dictionary: for each visual word k, store the list of database images that contain k, along with the TF-IDF weight. To retrieve candidates for a query:
python # Inverted index structure # inv_index[word_id] = list of (image_id, tfidf_weight) tuples inv_index = {k: [] for k in range(K)} # --- Build index (offline) --- for img_id, bow_vec in enumerate(database_bow_vectors): for word_id, weight in enumerate(bow_vec): if weight > 0: inv_index[word_id].append((img_id, weight)) # --- Query (runtime) --- def retrieve_candidates(query_bow, inv_index, top_k=5): scores = {} for word_id, q_weight in enumerate(query_bow): if q_weight <= 0: continue for img_id, db_weight in inv_index[word_id]: scores[img_id] = scores.get(img_id, 0.0) + q_weight * db_weight return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
Let's put real numbers on the full pipeline for a system with N = 100 000 database images, K = 50 000 visual words (vocab tree b=10, d=5), and 1 000 descriptors per frame:
| Step | Cost (flat k-means) | Cost (vocab tree + inv. index) |
|---|---|---|
| Quantize 1000 descriptors | 1000 × 50 000 = 50M comparisons | 1000 × (10×5) = 50 000 comparisons — 1000× faster |
| Build query histogram | O(K) = 50 000 ops | O(K) = same |
| Retrieve candidates | N × K dot-products = 5×109 | Only images sharing ≥1 word ≈ 500 images (est.) — 200× faster |
| Geometric verification | Up to N images to verify | Top-5 candidates only — 20 000× fewer |
| Total per frame | ~50 billion operations — unusable | ~10 million operations — ~20 ms real-time |
This is why BoW with a vocabulary tree and inverted index is the backbone of real-time place recognition systems. Without these algorithmic components, the approach simply does not scale beyond toy datasets.
The inverted index stores, for each visual word k, a list of (image_id, tfidf_weight) pairs. With N = 100 000 images, 1 000 descriptors per image, and K = 50 000 words, the total number of (image_id, weight) entries is N × avg_words_per_image = 100 000 × 1 000 = 100 million entries. At 8 bytes each (4 for ID, 4 for float32 weight), that is 800 MB of RAM — feasible on a modern laptop. For N = 1 million images (full city map), this becomes 8 GB — at the boundary of RAM-based indexing. Production systems compress the index (8-bit quantized weights, gap-coded IDs) or use approximate nearest-neighbor structures (HNSW, FAISS) for the candidate retrieval stage.
After building TF-IDF weighted histograms, comparing two images reduces to a single vector operation: cosine similarity. Given query vector wq and database image vector wi, both L2-normalized:
Range: [−1, 1], but since TF-IDF weights are non-negative, range is [0, 1]. A score of 1.0 means identical histograms (same place, same viewpoint). A score near 0 means no shared distinctive visual words (completely different scenes).
Suppose K = 5. After TF-IDF weighting and L2-normalization:
Image A (same place) scores 0.989 and is ranked first. Image B (different place) scores 0.478 and is ranked lower. But notice: B still has a non-trivial similarity score of 0.48 — because word 2 (a common word) contributes heavily. With better TF-IDF weighting (reducing word 2's influence), B would drop further. This is why TF-IDF matters.
An additional layer of robustness: if image at time t is a true loop closure match to database image at time t0, then images at times t−1, t−2 should also match images near t0. If a strong similarity score appears for only one isolated frame in time with no neighboring matches, it's likely a false positive. DBoW2 implements a "temporal window" filter for this.
A query image has a TF-IDF histogram. Six database images have histograms. Cosine similarity bars rank the matches. The top match is highlighted. Move the slider to add perceptual aliasing — watch a false match rise in rank.
BoW is order-less. Two entirely different places can produce similar histograms if they share common visual words — this is perceptual aliasing. A long corridor photographed from two different ends might share many "straight edge" and "uniform surface" visual words even though the camera poses are meters apart and geometrically incompatible.
Even with TF-IDF correctly down-weighting the common words, if the environment is sufficiently self-similar — think an airport terminal with repeating gate numbers and identical carpet patterns, or a long featureless parking garage — the residual alias score can easily exceed a loose threshold. Geometric verification is the only reliable way to separate these cases.
Geometric verification is the second stage: take the top-k BoW candidates and for each one, run full feature matching followed by geometric consistency checking. If the features can be explained by a valid camera geometry (a fundamental matrix from L7, estimated by RANSAC from L8), the match is confirmed as a true loop closure. If RANSAC finds too few inliers, the candidate is rejected — no loop closure added.
The number k of top-BoW candidates passed to geometric verification is a precision-recall knob. k = 1: fastest, but you might miss the true match if the actual revisit is ranked 2nd (due to imperfect TF-IDF or quantization). k = 10: more likely to find the true match, but 10 geometric verifications at ~5 ms each = 50 ms extra per keyframe. Common choices: k = 3 to 5. If any one of the k candidates passes geometric verification, that match is accepted and no further candidates need to be checked — break at the first success.
Robot is in a symmetrical building with two identical-looking corridors. Corridor A (visited at t = 0) and Corridor B (current, t = 10 min) have the same flooring, walls, and lighting. BoW produces sim(A, B) = 0.93 — very high. But when you run RANSAC on matched SIFT features between A and B, you find only 4 inlier pairs (threshold = 30) because the actual 3D structure is incompatible — the floor tiles in A and B match descriptor-for-descriptor but are at different depths and orientations. Geometric verification rejects this false loop closure. Crisis averted.
Two scenarios: a true revisit (many RANSAC inliers) and a perceptual-aliasing false positive (few inliers). Adjust the inlier threshold slider to see where the decision boundary is. Red matches = outliers; teal matches = inliers.
Even with geometric verification, a deeper problem lurks: if the same place looks different enough that descriptors change, the visual words themselves change, and the BoW histogram changes. Day vs. night is the classic case. A wall lit by warm afternoon sun produces different SIFT gradients than the same wall under a streetlamp. The visual words may not even overlap. Similarity drops to near zero — not a false positive, a false negative. The loop closure is missed entirely.
Strategies to handle this: (a) vocabulary trained on multi-condition data — include images from different times of day, seasons, and weather; (b) sequence-based matching — compare sequences of frames rather than single frames; a sequence of 5 temporally consistent matches is far more reliable than any single match; (c) learning-based descriptors (NetVLAD, SuperPoint) that are trained to be invariant to appearance change; (d) probability-based detection (FAB-MAP) which models the joint distribution of word co-occurrences and is more robust to partial word-set changes.
python # Geometric verification: full pipeline import cv2 import numpy as np def verify_loop_closure(query_img, candidate_img, inlier_threshold=25): """Returns (accepted: bool, n_inliers: int, F: ndarray or None)""" # Step 1: detect and describe keypoints sift = cv2.SIFT_create() kp1, des1 = sift.detectAndCompute(query_img, None) kp2, des2 = sift.detectAndCompute(candidate_img, None) if des1 is None or des2 is None: return False, 0, None # Step 2: ratio-test matching (Lowe's ratio test, L6) bf = cv2.BFMatcher() matches = bf.knnMatch(des1, des2, k=2) good = [m for m, n in matches if m.distance < 0.75 * n.distance] if len(good) < 8: return False, len(good), None # Step 3: RANSAC fundamental matrix (L7 + L8) pts1 = np.float32([kp1[m.queryIdx].pt for m in good]) pts2 = np.float32([kp2[m.trainIdx].pt for m in good]) F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.999) if F is None or mask is None: return False, 0, None # Step 4: count inliers n_inliers = int(mask.ravel().sum()) accepted = n_inliers >= inlier_threshold return accepted, n_inliers, F # Usage in SLAM loop: # top_candidates = retrieve_candidates(query_bow, inv_index, top_k=5) # for img_id, bow_score in top_candidates: # accepted, n_in, F = verify_loop_closure(query_img, db_images[img_id]) # if accepted: # add_loop_closure_constraint(factor_graph, current_pose, db_poses[img_id], F) # break # one verified closure is enough per keyframe
This is the payoff simulation. A robot drives a loop and collects images. An inverted-index BoW database is built incrementally. At each step the current frame is queried against the database. Watch the similarity scores, TF-IDF weighting, and geometric verification all interact. Tune parameters to see how the system degrades under aliasing or improves with better vocabulary size.
What makes this simulation useful is that it lets you build intuition for parameter tradeoffs that are otherwise hard to reason about from theory alone. Try: large K + high aliasing — the threshold needs to be very high to stay clean. Small K + low aliasing — the threshold can be loose and you still get clean loop closure detection. These are the engineering judgments a SLAM practitioner makes when deploying to a new environment.
Three controls govern the behavior:
Left: robot trajectory (grey=VO drift, teal=corrected). Right: BoW similarity scores as the robot queries its database. When a verified loop closure fires, the trajectory snaps back. Adjust vocabulary size and aliasing to see how performance changes.
| Stage | What it does | Key number |
|---|---|---|
| Image retrieval framing | Treat place recognition as document search — find the nearest image in a database of past frames | O(log N) with vocab tree + inv. index |
| Visual vocabulary (k-means) | Cluster all training descriptors into K centroids = visual words | K = 5k–100k; built once offline |
| Quantization | Assign each runtime descriptor to its nearest visual word by Euclidean distance | O(b×d) per descriptor with vocab tree |
| BoW histogram | Count how many descriptors fell into each visual word; L1-normalize | K-dimensional vector; order-less |
| TF-IDF weighting | Downweight ubiquitous words; upweight rare distinctive ones | idfk = log(N⁄nk) |
| Inverted index | Fast candidate retrieval: only touch images that share at least one visual word with query | Reduces active set from millions to thousands |
| Cosine similarity | Dot-product of L2-normalized TF-IDF vectors | Range [0, 1]; 1 = identical histograms |
| Geometric verification | Run RANSAC + fundamental matrix on top-k BoW candidates; reject below inlier threshold | Threshold ~20–50 inliers; precision gate |
| Precision over recall | Tune for no false positives; missed loop closures are benign; false ones are catastrophic | One false loop closure can corrupt the full map |
Place recognition is the bridge between odometry (L11) and full SLAM (L13). Without it, VO/VIO just drifts indefinitely. With loop closures, the optimizer gets global constraints that pull the map into metric accuracy. The full chain is:
Place recognition is philosophically interesting: it is the mechanism by which a robot forms a spatial memory and uses it as a measurement. A new camera frame is not just data to process forward — it is a query into a temporal database of all past experience. The match is a measurement: "the robot is now where it was at time t0." In factor-graph language, this is a binary constraint between two nodes in the pose graph that would otherwise only be connected through a long chain of odometry measurements. The loop closure creates a shortcut — a direct constraint across time — and the optimizer uses it to make the entire map self-consistent.
This is why BoW + geometric verification remains competitive even against modern deep learning approaches for SLAM: it is fast, runs on-device without GPU, requires no training on the specific deployment environment (only a generic vocabulary), and produces a verified geometric constraint with quantified covariance — exactly what the factor-graph optimizer needs. Deep learning descriptors (NetVLAD, SuperPoint+SuperGlue) have higher recall under appearance change, but their outputs must still be fed through the same geometric verification stage before being added to the graph.
DBoW2 (Galvez-Lopez & Tardos, 2012): the standard C++ BoW library used in ORB-SLAM2 and VINS-Mono. Uses a vocabulary tree trained on 9000 images. Includes temporal consistency scoring and geometric verification. FAB-MAP (Cummins & Newman, 2008): a probabilistic extension using a Chow-Liu tree model of visual word co-occurrence, which captures the statistical dependencies between visual words that flat BoW misses. NetVLAD (Arandjelovic et al., 2016): a deep learning approach that learns a global descriptor end-to-end from place recognition data — the "neural" evolution of BoW that handles appearance change much better.
These numbers come from real deployed systems (ORB-SLAM2, VINS-Mono, OpenVSLAM):
| Parameter | Typical range | Effect of increasing |
|---|---|---|
| Vocabulary size K | 1k (fast/small) – 100k (discriminative) | More distinctive words → fewer false BoW matches → higher precision; slower to build (but offline) |
| Tree branching b | 6 – 10 | Shallower tree at same K → fewer comparisons per descriptor (but wider nodes) |
| Tree depth d | 5 – 6 | More levels → more leaf nodes → more precise quantization |
| BoW similarity threshold τ | 0.6 – 0.9 | Higher τ → fewer false positives → lower recall; tune based on environment (repetitive = higher τ) |
| Geometric inlier threshold | 20 – 50 inliers | Higher threshold → fewer false loop closures; may miss loop closures if scene has few features |
| Top-k candidates to verify | 3 – 10 | More candidates → higher recall; more computation; usually top-5 is a good balance |
When a loop closure is confirmed by geometric verification, the system has: the identity of the database frame (timestamp t0, pose X0 in the factor graph) and an estimate of the relative pose Tq,0 derived from the essential matrix E decomposition. This relative pose, with covariance derived from RANSAC inlier count and reprojection residuals, is added as a new binary edge between the current pose node Xt and the historical node X0 in the factor graph. The next time the back-end optimizer runs (iSAM2 or g2o), it sees this new constraint and re-solves the entire trajectory — distributing the accumulated drift to minimize all constraint residuals simultaneously.
This is exactly the "snap-back" described in Chapter 0. The factor graph was an open chain of odometry edges; the loop closure adds the closing edge, turning it into a loop, and the optimizer finds the trajectory that minimizes drift around the full loop. With k correct loop closures, the final trajectory error falls as approximately 1/k of the original drift — each closure gives the optimizer more constraints to anchor the map.