A counter, a decoder, and a handful of gates once meant a board full of chips. There are two ways to collapse that board into one chip: a microcontroller that imitates the logic in software, one instruction at a time, or an FPGA that becomes the logic in silicon, with everything running at once. This chapter is about the second path: configuring hardware out of thousands of tiny lookup tables.
You are building a digital clock. The design calls for a 10-stage counter to divide down a crystal oscillator, a BCD-to-seven-segment decoder to drive the display digits, and a couple of glue gates to handle the colon blink and the alarm compare. Forty years ago you would have walked to the parts bin and pulled a 7490 decade counter, a 7447 decoder/driver, a 7408 quad AND, a 7432 quad OR: a dozen chips, a rat's nest of wires, and a board the size of a paperback.
No real product is built that way anymore. The whole design has to collapse onto one chip. There are exactly two ways to do that, and they are philosophically opposite.
The microcontroller is a sequential machine pretending to be your circuit. The FPGA is your circuit. That is the single idea this whole chapter unpacks. A field-programmable gate array (FPGA) is a sea of tiny configurable logic cells plus a programmable wiring matrix; you hand it a description of the hardware you want, and a synthesis tool melts that description down into cell-and-wire settings that make the chip behave exactly like your schematic.
Say the divide-by-10 stage must produce one output pulse for every 10 input clock pulses. On a microcontroller the CPU spins a counter variable: increment, compare to 10, branch, maybe reset, repeat, perhaps 5 instructions per pulse. On the FPGA, four flip-flops wired as a counter simply are the divider; on every clock edge they all update simultaneously in about one gate delay (nanoseconds), with no instructions involved.
Numbers make it vivid. Suppose the input clock is 12 MHz. The FPGA divider runs at the full 12 MHz because the flip-flops update at the clock edge directly. The microcontroller, running at 12 MHz but spending ~5 cycles of overhead per pulse, can only "service" the divider at 12 MHz ÷ 5 = 2.4 MHz if it does nothing else. Add the decoder and the gates to the software loop and that ceiling drops further. The FPGA does all three jobs at once with zero interference.
Three independent counters need to run from three different clocks. Pick a mode and press Run. In FPGA mode all three are real hardware and advance together. In microcontroller mode a single instruction pointer visits one counter at a time — watch it fall behind as the workload grows.
Before FPGAs there were PALs — programmable array logic, the first commercially successful programmable logic device (PLD). The PAL is the simplest possible programmable chip, and it teaches the central trick directly: any combinational function can be written as a sum of products, and a sum of products maps onto a grid of programmable connections.
Recall from Chapter 12: every Boolean function has a sum-of-products (SOP) form. Each product term is an AND of some inputs (each input appearing either true or complemented); the whole function is the OR of those product terms. For example, a 1-of-2 data selector that outputs A when SEL=1 and B when SEL=0 is:
OUT = SEL·A + SEL̅·B
Two product terms, OR'd together. That is exactly the shape a PAL implements. A PAL has two planes of wires. The AND plane takes every input and its complement and feeds them across a grid of horizontal product-term lines; at each crossing sits a programmable connection (historically a literal fuse). Blow or keep each fuse and you decide which literals feed each AND gate. The OR plane then sums selected product terms into each output. In a true PAL the AND plane is programmable and the OR plane is fixed; in its ancestor the PLA both were programmable.
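We need two product terms. Lay out six columns, one per literal: SEL, SEL̅, A, A̅, B, B̅. For the selector we want:
Term 1 = SEL·A: keep the fuses at SEL and A, blow the other four.
Term 2 = SEL̅·B: keep the fuses at SEL̅ and B, blow the other four.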
Check it: when SEL=1, term 1 = 1·A = A and term 2 = 0·B = 0, so OUT = A. When SEL=0, term 1 = 0 and term 2 = 1·B = B, so OUT = B. The fuse pattern is the truth table. Change which fuses you keep and you change the function — without rewiring a single physical trace.
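The fuse map can also be modeled in Verilog (the hardware description language introduced later in this chapter). The sketch below is a hypothetical model, not a description of any real PAL part: the module name `pal_selector`, the parameters `ROW1` and `ROW2`, and the literal ordering are all assumptions made for illustration. Each 6-bit mask records which fuses are kept in one product-term row.

```verilog
// Hypothetical model of one PAL output: two product-term rows OR'd together.
// Literal order in each 6-bit fuse mask (MSB to LSB): SEL, ~SEL, A, ~A, B, ~B.
// A 1 in the mask means "fuse kept" (that literal feeds the AND gate).
module pal_selector #(
    parameter [5:0] ROW1 = 6'b10_10_00,  // keep SEL and A   -> SEL & A
    parameter [5:0] ROW2 = 6'b01_00_10   // keep ~SEL and B  -> ~SEL & B
) (
    input  wire sel, a, b,
    output wire out
);
    wire [5:0] lit = {sel, ~sel, a, ~a, b, ~b};

    // A blown fuse disconnects its literal; this model treats a
    // disconnected AND input as logic 1 so it drops out of the product.
    wire term1 = &(lit | ~ROW1);
    wire term2 = &(lit | ~ROW2);

    assign out = term1 | term2;          // fixed OR plane
endmodule
```

Overriding `ROW1` and `ROW2` changes the function without touching the structure, which is the same point the fuse grid makes in hardware.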
Two inputs A and B (with complements) feed three product-term rows. Click the grid dots to connect a literal into a row (an AND term). Each connected row that is enabled feeds the OR output. The live truth table on the right shows OUT for all four input combinations. Try to build AND, then OR, then the selector.
The PAL scales badly. A single AND-OR plane works for a function of a few inputs, but a real design has thousands of signals; one giant plane would be enormous and slow. Two architectures grew out of the PAL to fix this, and they took opposite routes.
A complex programmable logic device (CPLD) is, roughly, a bunch of PAL-style blocks — called macrocells — stitched together by a central, predictable interconnect. Each macrocell is a sum-of-products engine (an AND-OR plane) feeding a flip-flop. CPLDs are the direct successors of the PAL: same SOP DNA, just replicated and interconnected. They are typically built on EEPROM/flash, so the configuration is non-volatile — it survives power-off, and the chip is ready the instant you apply power. Their timing is highly deterministic because every signal crosses the same predictable central matrix, which is why CPLDs are loved for glue logic, address decoding, and state machines that must wake up instantly.
An FPGA abandons the SOP plane entirely. Instead of macrocells it uses a fine-grained fabric of small lookup tables (LUTs) — tiny memories whose stored bits are a truth table — each paired with a flip-flop, and surrounded by a rich, flexible routing network. There are far more of these cells (modern parts hold 200,000 to several million), and the routing is much more general, which is what lets FPGAs implement huge, deeply pipelined designs. The price: the configuration lives in SRAM, which is volatile. Cut the power and the design evaporates; it must be reloaded at every power-up (more on this in Chapter 4 of this lesson).
| Property | CPLD | FPGA |
|---|---|---|
| Core logic element | Macrocell (sum-of-products AND-OR plane) | Lookup table (LUT) + flip-flop |
| Heritage | Successor to the PAL | Gate-array fabric, SRAM-configured |
| Capacity | Hundreds to a few thousand cells | 200,000 to several million blocks |
| Configuration memory | EEPROM / flash — non-volatile | SRAM — volatile, reloaded at power-up |
| Timing | Very deterministic (central matrix) | Depends on place & route; more variable |
| Sweet spot | Glue logic, decoders, instant-on state machines | Large parallel systems, DSP, video, SoC |
Our digital clock from Tab 0 needs a 10-stage divider, a decoder, and a few gates — a few dozen flip-flops and a hundred-ish gates. That fits comfortably in a small CPLD with maybe 64–128 macrocells, and the non-volatility means the clock starts keeping time the instant batteries go in, with no external memory. Now suppose we instead want to overlay an HD video clock with anti-aliased digits and a spectrum-analyzer alarm tone: thousands of logic cells, multipliers, deep pipelines — that is FPGA territory, and we accept a ~100 ms boot from external flash. The architecture follows the workload.
Left, a CPLD macrocell: a sum-of-products plane feeding one flip-flop. Right, a slice of FPGA fabric: many small LUT+FF cells joined by routing. Toggle the view to highlight where the configuration lives (volatile SRAM vs non-volatile flash).
Here is the most beautiful idea in the whole chapter, and it is almost embarrassingly simple. How does an FPGA implement any logic function of its inputs without having any actual gates wired up? It uses a lookup table — and a lookup table is just a tiny ROM.
Think back to ROM as addressable storage. A ROM with k address lines has $2^k$ locations; you put an address in, the byte at that location comes out. Now strip the ROM down to one bit wide. A k-input LUT is a $2^k$ × 1-bit ROM. Its k inputs are the address; the single bit stored at that address is the output. And here is the punchline: if you fill those $2^k$ stored bits with the truth table of the function you want, the LUT computes that function exactly. The stored bits don't encode the logic; they are the truth table.
Take k = 2. Then $2^k = 2^2 = 4$ memory cells, addressed by the input pair (A,B) in order 00, 01, 10, 11. Whatever four bits you store, you get that function. Toggle the four bits and you change the gate:
| Address (A B) | 00 | 01 | 10 | 11 | Gate |
|---|---|---|---|---|---|
| Stored bits | 0 | 0 | 0 | 1 | AND |
| Stored bits | 0 | 1 | 1 | 1 | OR |
| Stored bits | 0 | 1 | 1 | 0 | XOR |
| Stored bits | 1 | 1 | 1 | 0 | NAND |
Same four physical memory cells, same silicon, four completely different gates — chosen entirely by what you wrote into the cells. This is why an FPGA needs no fixed gates: every cell can become any 2-input gate by reloading four bits.
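The table can be sketched in Verilog (the hardware description language introduced later in this chapter) as a 4-bit memory read by the address {A, B}. This is a minimal illustrative model, not a vendor primitive: the module names `lut2` and `lut2_gates` and the parameter name `INIT` are assumptions, and bit 0 of `INIT` is taken to be the bit stored at address 00.

```verilog
// Hypothetical 2-input LUT: four stored bits addressed by {a, b}.
// INIT[0] is the bit for address 00, INIT[3] the bit for address 11.
module lut2 #(
    parameter [3:0] INIT = 4'b1000   // default: AND (only address 11 stores 1)
) (
    input  wire a, b,
    output wire out
);
    wire [3:0] mem = INIT;           // the stored truth table
    assign out = mem[{a, b}];        // a pure lookup; no gate is described
endmodule

// The four gates from the table, differing only in their stored bits.
module lut2_gates (
    input  wire a, b,
    output wire and_o, or_o, xor_o, nand_o
);
    lut2 #(.INIT(4'b1000)) u_and  (.a(a), .b(b), .out(and_o));
    lut2 #(.INIT(4'b1110)) u_or   (.a(a), .b(b), .out(or_o));
    lut2 #(.INIT(4'b0110)) u_xor  (.a(a), .b(b), .out(xor_o));
    lut2 #(.INIT(4'b0111)) u_nand (.a(a), .b(b), .out(nand_o));
endmodule
```

Each `INIT` value is the corresponding table row read right to left, because address 11 maps to the most significant bit.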
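If a 2-input LUT has 4 cells, and each cell is independently 0 or 1, then the number of distinct fillings is $2^4 = 16$. Sixteen possible configurations, and that is exactly the number of distinct Boolean functions of two variables (AND, OR, XOR, NAND, NOR, XNOR, the two constants, the four "pass/invert one input" functions, and so on). A 2-input LUT can be any of them. In general, a k-input LUT stores $2^k$ bits and can therefore be any of $2^{2^k}$ functions:
k = 2: 4 stored bits, $2^4 = 16$ functions
k = 4: 16 stored bits, $2^{16} = 65{,}536$ functions
k = 6: 64 stored bits, $2^{64} \approx 1.8 \times 10^{19}$ functions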
That last line is why modern FPGAs settled on the 6-input LUT: 64 configuration bits per cell buys you literally any function of six inputs. Multi-million-cell parts give you millions of these universal little machines, each ready to be any 6-input gate.
Four memory cells at addresses 00, 01, 10, 11. Click a cell to flip its stored bit. The named gate that results appears live, along with the configuration count (one of 16). Set the live inputs A and B to watch the address light up and the output read out of the table.
A bare LUT computes combinational logic, but real circuits need memory — flip-flops to hold state, to build counters and state machines. So the FPGA's basic repeating unit pairs a LUT with a flip-flop. This unit is the logic block (Xilinx calls a cluster of them a configurable logic block, or CLB; the fundamental cell is sometimes a "logic element").
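A logic block of this kind can be sketched in Verilog as a lookup followed by a flip-flop and an output selector. The sketch assumes a 4-input LUT and a single configuration bit that bypasses the flip-flop; the names `logic_block`, `LUT_INIT`, and `USE_FF` are hypothetical and do not correspond to any vendor's primitive.

```verilog
// Hypothetical logic block: a 4-input LUT feeding a flip-flop, with one
// configuration bit choosing the registered or the combinational output.
module logic_block #(
    parameter [15:0] LUT_INIT = 16'h8000, // truth table; default = 4-input AND
    parameter        USE_FF   = 1'b1      // 1: registered output, 0: bypass FF
) (
    input  wire       clk,
    input  wire [3:0] in,
    output wire       out
);
    wire [15:0] table_bits = LUT_INIT;
    wire        lut_out    = table_bits[in];  // combinational lookup

    reg ff = 1'b0;
    always @(posedge clk)
        ff <= lut_out;                        // flip-flop captures the LUT output

    assign out = USE_FF ? ff : lut_out;       // output mux set by a config bit
endmodule
```

With `USE_FF` set to 0 the block behaves as a plain gate; with it set to 1 the same lookup result appears one clock edge later, which is what a counter or state machine needs.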
Tile thousands to millions of these logic blocks across the chip, thread them with a programmable routing matrix (a switching network whose every connection is itself a configuration bit), and ring the edge with I/O blocks that connect the fabric to physical pins. That is the whole FPGA: logic blocks + routing + I/O. Modern parts also drop in hardened helpers — block RAM, DSP multipliers, and on SoC FPGAs, full ARM CPU cores — but the soul of the device is the configurable fabric.
Every one of those bits — the LUT contents, the flip-flop bypass settings, every routing switch — lives in SRAM cells. SRAM is fast and infinitely re-writable, but it forgets everything when power drops. The complete set of bits is the configuration bitstream, and on most FPGA boards it is stored in a small external EEPROM/flash chip. At power-up a tiny on-chip loader streams the bitstream in and the fabric "wakes up" as your circuit — typically in under ~200 ms.
Estimate crudely. Suppose a small FPGA has 10,000 logic blocks, each a 4-input LUT (16 bits) plus a few control bits, say ~20 configuration bits per block: that is 10,000 × 20 = 200,000 bits just for logic. Add routing — often several times the logic bits — and a real small part lands around 1–3 million configuration bits. Streaming 2,000,000 bits from a serial EEPROM at, say, 25 MHz takes 2,000,000 ÷ 25,000,000 = 0.08 s = 80 ms — comfortably inside the ~200 ms power-up budget. The arithmetic is why "FPGAs boot in a fraction of a second" is true.
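The load-time estimate reduces to a single division. As a worked equation, assuming the loader shifts in one bit per configuration clock (the symbols $N_{bits}$ and $f_{cfg}$ are introduced here only for illustration):

$$t_{load} = \frac{N_{bits}}{f_{cfg}} = \frac{2{,}000{,}000\ \text{bits}}{25{,}000{,}000\ \text{bits/s}} = 0.08\ \text{s} = 80\ \text{ms}$$

Under the same assumption, the 1 to 3 million bit range corresponds to roughly 40 ms to 120 ms, so the whole range stays inside the ~200 ms budget.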
A grid of CLBs (each a LUT + flip-flop) surrounded by I/O blocks and threaded with routing channels. Click two blocks to select a source and destination, then Wire them. Hit Configure to stream the bitstream in from external EEPROM — watch blocks light up as they load. Power-cycle to see the volatile config vanish.
You now have a chip full of universal LUTs and switchable wires. How do you tell it what circuit to become? You never set individual fuses or LUT bits by hand — that would be hopeless at scale. Instead you describe the hardware you want, and a synthesis tool maps your description onto LUTs and routing automatically. There are two ways to enter the description.
The intuitive way, especially for beginners: drag gate and flip-flop symbols onto a canvas and wire them, exactly as in Chapter 12. For our 1-of-2 selector you would place an AND, an inverter, another AND, and an OR, and wire them per OUT = SEL·A + SEL̅·B. The tool reads the drawing and synthesizes it. Schematics are wonderfully concrete — what you draw is what you get — but they do not scale: a 4-bit counter is a manageable drawing, a 32-bit CPU is an unreadable acre of symbols.
The scalable way: a hardware description language (HDL). You write text that describes the circuit's structure and behavior, and the synthesis tool figures out the gates. The two dominant HDLs are Verilog (C-like, terse, popular in commercial and consumer work) and VHDL (verbose, strongly typed, favored in aerospace and defense where the strictness catches bugs). Our selector in Verilog is just:
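As a minimal sketch, assuming ports named A, B, SEL, and OUT to match the equation, the selector is one continuous assignment inside a module wrapper (the module name `selector` is illustrative):

```verilog
// 1-of-2 data selector: OUT follows A when SEL is 1 and B when SEL is 0.
module selector (
    input  wire A, B, SEL,
    output wire OUT
);
    assign OUT = (SEL & A) | (~SEL & B);  // sum of two product terms
endmodule
```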
That single line describes the same hardware the four-gate schematic does — and a 32-bit counter is one more line (count <= count + 1;) instead of a wall of flip-flops. This is the great leap of the chapter: the HDL looks like software but describes hardware. The synthesis tool does not compile it into instructions; it compiles it into LUT contents and routing.
On a hobby board like the Elbert V2 (about $29.95, built around a Xilinx Spartan XC3S50A), you run this whole flow on your laptop in Xilinx ISE, click "generate," and push the resulting bitstream to the board over USB. Minutes later the $30 chip is your custom counter.
Step through the flow for the 1-of-2 selector. Press Next stage to advance from HDL text, through synthesis to gates, place & route onto the fabric, and finally the loaded bitstream. Toggle the entry method to compare schematic vs HDL at the start.
Let's get concrete with the language. Verilog organizes everything into modules — a module is a block of hardware with named input and output ports. Inside, you declare signals and describe how they relate. The mental model never changes: everything you write describes wires and the logic between them, all existing at once.
An always block describes logic that re-evaluates whenever something in its sensitivity list changes. The most important form, for sequential logic, triggers on a clock edge:
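A minimal sketch of such a clocked block, assuming a synchronous active-high reset named `rst` and a 4-bit register named `count` (the module name, the port names, and the reset style are illustrative choices):

```verilog
// 4-bit counter: every flip-flop updates on the rising edge of clk.
module counter4 (
    input  wire       clk,
    input  wire       rst,     // synchronous, active-high reset (assumed)
    output reg  [3:0] count
);
    always @(posedge clk) begin
        if (rst)
            count <= 4'd0;          // clear
        else
            count <= count + 4'd1;  // increment; wraps from 15 back to 0
    end
endmodule
```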
This describes a 4-bit counter: on every rising edge of clk, all four flip-flops update at once, either clearing to 0 or incrementing. The <= is the non-blocking assignment — it means "all these updates happen together at the clock edge," which is precisely how real flip-flops behave.
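Our digital clock board runs from a 12 MHz crystal, but the display digit-multiplexing wants about 1 kHz, and the seconds counter wants exactly 1 Hz. You build prescalers: counters that divide the clock by a fixed N, using $f_{out} = f_{in} / N$. The arithmetic, solving for $N = f_{in} / f_{out}$:
For 1 kHz: N = 12,000,000 Hz ÷ 1,000 Hz = 12,000
For 1 Hz: N = 12,000,000 Hz ÷ 1 Hz = 12,000,000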
So to get 1 kHz you build a counter that counts 12,000 input pulses and pulses its output once per count cycle (toggling the output on each wrap instead would give a square wave at half that rate, 500 Hz); for 1 Hz you count all the way to 12,000,000. In Verilog each is a few lines, and, crucially, both prescalers can run from the same 12 MHz clock at the same time, in separate always blocks, because they are separate hardware:
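A sketch under stated assumptions: the module name `clock_ticks`, the register names, and the choice of one-clock-wide tick outputs are illustrative. The 24-bit width for the 1 Hz divider is chosen because 12,000,000 lies between $2^{23}$ and $2^{24}$.

```verilog
// Two prescalers sharing one 12 MHz clock, each in its own always block.
module clock_ticks (
    input  wire clk,        // 12 MHz
    output reg  tick_1khz,  // one-clock-wide pulse, 1,000 times per second
    output reg  tick_1hz    // one-clock-wide pulse, once per second
);
    reg [13:0] div_khz = 14'd0;  // counts 0..11,999
    reg [23:0] div_hz  = 24'd0;  // counts 0..11,999,999

    initial begin
        tick_1khz = 1'b0;
        tick_1hz  = 1'b0;
    end

    always @(posedge clk) begin
        if (div_khz == 14'd11999) begin
            div_khz   <= 14'd0;
            tick_1khz <= 1'b1;
        end else begin
            div_khz   <= div_khz + 14'd1;
            tick_1khz <= 1'b0;
        end
    end

    always @(posedge clk) begin
        if (div_hz == 24'd11999999) begin
            div_hz   <= 24'd0;
            tick_1hz <= 1'b1;
        end else begin
            div_hz   <= div_hz + 24'd1;
            tick_1hz <= 1'b0;
        end
    end
endmodule
```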
(Note 12,000 distinct counts need a counter that reaches 11,999, requiring $\lceil \log_2 12000 \rceil = 14$ bits, hence [13:0]; the bit-width arithmetic matters, and the tool will warn you if you get it wrong.)
A 4-bit counter described by the always @(posedge clk) block above, plus a divide-by-N prescaler. Set the divisor N and press Step or Run; watch the 4-bit register and the divided "tick" output update on each modeled clock edge. The waveform shows that the prescaler output frequency is exactly clk ÷ N.
Everything converges here. The reason you reach for an FPGA over a microcontroller is the same reason this chapter exists: genuine parallelism. To prove it, we run a race — three counters on three independent clocks — in both worlds, and watch the microcontroller's single instruction pointer lose.
First, the engineering practice that makes big FPGA designs tractable: build small modules and instantiate them. Our clock needs three timers (digit-mux at 1 kHz, blink at 1 Hz, alarm-compare). Rather than write three nearly-identical counters, you write one parameterized prescaler module and instantiate it three times with different divisors:
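A sketch of that pattern, with several assumptions labeled: the names `prescaler`, `clock_timers`, `N`, and `WIDTH` are hypothetical, the counter width is passed in by hand, and the alarm-compare rate is not specified here, so 10 Hz stands in as a placeholder.

```verilog
// One parameterized divide-by-N prescaler.
module prescaler #(
    parameter integer N     = 12000,  // divide ratio: f_out = f_in / N
    parameter integer WIDTH = 14      // must satisfy 2**WIDTH >= N
) (
    input  wire clk,
    output reg  tick                  // one-clock-wide pulse every N clocks
);
    reg [WIDTH-1:0] count = 0;
    initial tick = 1'b0;

    always @(posedge clk) begin
        if (count == N - 1) begin
            count <= 0;
            tick  <= 1'b1;
        end else begin
            count <= count + 1'b1;
            tick  <= 1'b0;
        end
    end
endmodule

// Three instances of the same module, each with its own divisor.
module clock_timers (
    input  wire clk,          // 12 MHz crystal
    output wire mux_tick,     // 1 kHz digit multiplex
    output wire blink_tick,   // 1 Hz colon blink
    output wire alarm_tick    // alarm compare; 10 Hz is an assumed value
);
    prescaler #(.N(12000),    .WIDTH(14)) u_mux   (.clk(clk), .tick(mux_tick));
    prescaler #(.N(12000000), .WIDTH(24)) u_blink (.clk(clk), .tick(blink_tick));
    prescaler #(.N(1200000),  .WIDTH(21)) u_alarm (.clk(clk), .tick(alarm_tick));
endmodule
```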
Three instances, three real prescalers, all clocking off the same 12 MHz crystal at once. You also write a test fixture (testbench) — non-synthesizable Verilog that generates a fake clock and checks the outputs in a simulator before ever touching hardware, so you catch the off-by-one bit-width bug at your desk, not on the bench.
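A minimal testbench sketch is shown below. To keep it self-contained it carries its own small divider, `div_n`, as the device under test; that module, the testbench name `div_n_tb`, and the choice of a divide-by-10 check over 100 clocks are all assumptions made for illustration.

```verilog
`timescale 1ns/1ps

// Minimal device under test, included so the sketch stands alone.
module div_n #(
    parameter integer N     = 10,
    parameter integer WIDTH = 4
) (
    input  wire clk,
    output reg  tick
);
    reg [WIDTH-1:0] count = 0;
    initial tick = 1'b0;
    always @(posedge clk) begin
        if (count == N - 1) begin
            count <= 0;
            tick  <= 1'b1;
        end else begin
            count <= count + 1'b1;
            tick  <= 1'b0;
        end
    end
endmodule

// Testbench: not synthesizable; it exists only inside the simulator.
module div_n_tb;
    reg     clk   = 1'b0;
    wire    tick;
    integer ticks = 0;

    div_n #(.N(10), .WIDTH(4)) dut (.clk(clk), .tick(tick));

    always #41.667 clk = ~clk;       // fake ~12 MHz clock (83.33 ns period)

    initial begin
        repeat (100) begin
            @(posedge clk);
            #1;                      // let the registered output settle
            if (tick) ticks = ticks + 1;
        end
        if (ticks == 10)
            $display("PASS: %0d ticks in 100 clocks", ticks);
        else
            $display("FAIL: expected 10 ticks, saw %0d", ticks);
        $finish;
    end
endmodule
```

Shrinking `WIDTH` to 3 while leaving `N` at 10 makes the check fail, because a 3-bit count wraps at 7 and never reaches 9. That is the kind of bit-width slip a simulation catches early.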
Now suppose all three counters must advance every microsecond. On the FPGA, three counter modules update on their clock edges simultaneously — one gate delay each, fully overlapped. Total time to service all three per tick: essentially one clock period, because they happen at once.
On the microcontroller, the single CPU must visit each counter in turn. If servicing one counter costs ~5 instructions and the CPU runs at 12 MHz (one instruction ≈ 83 ns, assuming one instruction per clock cycle), three counters cost 3 × 5 = 15 instructions = 15 × 83 ns ≈ 1.25 µs per round, which is already longer than the 1 µs tick. The FPGA finished the same work in ~0.08 µs. As you add a fourth, fifth, sixth counter, the FPGA cost stays flat (more counters = more parallel hardware) while the microcontroller cost grows linearly, so it falls further and further behind and drops ticks.
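The comparison generalizes to n counters. As a rough model, assuming c instructions per counter and one instruction per clock period $t_{clk}$ (the symbols are introduced here only for illustration):

$$t_{\mu C}(n) = n \cdot c \cdot t_{clk}, \qquad t_{FPGA}(n) \approx t_{clk}$$

With $c = 5$ and $t_{clk} \approx 83.3$ ns, $n = 3$ gives about 1.25 µs and $n = 6$ gives about 2.5 µs, while the FPGA figure stays near 83 ns. Meeting a 1 µs tick requires $n \cdot 5 \cdot 83.3\ \text{ns} \le 1000\ \text{ns}$, so under these assumptions the microcontroller can keep up with at most two such counters.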
Three counters drive three clocks. In FPGA mode they advance together every tick. In µC mode a single instruction pointer (the highlighted marker) hops between counters, spending the set number of instructions on each — and falling progressively behind the FPGA's tick count. Add workload to widen the gap. The live tally shows ticks completed by each world.
Chapter 14 took one problem — collapse a board of logic chips onto one chip — and followed the programmable-logic answer all the way down: from the PAL's AND-OR plane, to the CPLD macrocell and the FPGA LUT, to the fabric, to HDL design entry and the parallel payoff. Here is the whole chapter on one page.
| Concept | Formula / fact | Worked value |
|---|---|---|
| LUT as ROM | k-input LUT = $2^k$ × 1-bit ROM | 6-input = 64 × 1-bit |
| Functions per LUT | $2^{2^k}$ | k=2→16; k=4→65,536; k=6→$1.8 \times 10^{19}$ |
| 2-input LUT configs | $2^4$ distinct fillings | 16 (AND, OR, XOR, NAND, …) |
| Sum of products | OUT = OR of AND-terms; fuses pick literals | OUT = SEL·A + SEL̅·B |
| Clock division | $f_{out} = f_{in} / N$ | 12 MHz÷12,000 = 1 kHz; ÷12,000,000 = 1 Hz |
| Config (FPGA) | Volatile SRAM, reloaded from EEPROM | boot <~200 ms |
| Config (CPLD) | Non-volatile flash, instant-on | 0 ms boot |
| Logic block | LUT + flip-flop (+ routing) | 200K–several-million per FPGA |
Vendors: Xilinx and Altera together held roughly 90% of the FPGA market through the period of this text. Tool: Xilinx ISE ran the describe→synthesize→place&route→bitstream flow.
HDLs: Verilog (C-like, terse, consumer/commercial) and VHDL (verbose, strongly typed, aerospace & defense). Output: a .bit bitstream.
Board: the Elbert V2 (~$29.95, Xilinx Spartan XC3S50A) makes the whole flow accessible on a hobby budget — describe a circuit, generate the bitstream, push it over USB.
SoC FPGAs add hard ARM CPU cores beside the fabric, letting you run software and custom parallel hardware on one die — the best of Chapters 13 and 14 together.
You can now read a Verilog module and see the parallel hardware behind the text, trace a function from truth table to LUT bits, and choose between a microcontroller and an FPGA on principle. Next, in Chapter 15, that parallel hardware goes to work driving motors.