Designing a stock exchange is not “build an order API and add Kafka.”
The hard part is preserving deterministic market behavior under bursty load while keeping fairness, auditability, recovery, and latency credible at the same time.
Most weak system design answers focus too much on horizontal scale and too little on sequencing, hot-path ownership, and failure fences. In an exchange, those are the design.
Start With the Boundary
A stock exchange is not the whole trading stack. If that boundary is blurry, the design usually becomes slower and more coupled than it needs to be.
Exchange responsibilities:
- accept orders from members or brokers
- apply venue rules
- maintain the order book
- match orders with deterministic priority
- publish executions and market data
- keep an auditable event trail
Broker responsibilities:
- customer onboarding and authentication
- account balances, credit, and margin
- portfolio-level exposure checks
- smart order routing across venues
Clearing and settlement responsibilities:
- novation, netting, and settlement workflows
- post-trade reconciliation
- downstream money and securities movement
Senior architect rule: do not drag broker or clearing workflows into the matching hot path. Every extra dependency adds latency variance, failure coupling, and operational ambiguity.
The Non-Negotiable Invariants
An exchange is not primarily a throughput system. It is a correctness system under latency pressure.
The invariants that matter most are:
- For a given symbol partition, there must be one authoritative processing order.
- Matching results must be deterministic and replayable.
- Price-time priority must be enforceable without ambiguity.
- Market data must be derived from the same committed state transitions as trades.
- Recovery must rebuild the exact book state, not an approximation.
- Safety controls must be able to stop damage faster than the market can amplify it.
That list should drive the architecture more than any database or messaging choice.
What Makes This System Hard
The exchange hot path combines design pressures that normally fight each other:
- very low latency
- deterministic ordering
- bursty traffic
- skewed symbol popularity
- strict audit requirements
- expensive failure recovery if you get sequencing wrong
The hottest symbol matters more than average venue traffic. That is one of the first places generic scaling advice breaks down.
If one symbol experiences a sudden spike because of earnings, a rumor, or index rebalancing, the system does not care that your average load is comfortable. The hottest book becomes the architecture test.
Capacity Thinking That Does Not Lie
A good estimate for an exchange is not just “messages per second.” It is at least these four numbers:
- peak inbound order events per second
- cancel or replace ratio
- hottest-symbol concentration
- outbound market data fan-out
Illustrative pressure model:
- inbound order events peak at
1M/s - cancels and replaces dominate during volatile periods
- top symbols can consume a disproportionate share of one partition
- outbound market data can exceed inbound volume by an order of magnitude once fan-out begins
That last point matters. Many teams design the matcher carefully and then under-design market data distribution. In reality, slow consumers, replay storms, and gap recovery can hurt you just as much as raw order ingress.
Architectural consequence: size the platform for hot partitions and market data amplification, not just for average API throughput.
Component Map
The exchange should be described as explicit components with explicit roles. That is what makes architecture review useful instead of diagram theater.
| Component | Responsibility | In hot path | Notes |
|---|---|---|---|
| Global DNS or traffic manager | route members to the active region or site | no | failover and traffic steering |
| L4 load balancer | distribute TCP sessions across gateway nodes | edge only | keep long-lived trading sessions stable |
| FIX / OUCH gateways | terminate protocol sessions and normalize commands | yes | keep parsing fast and predictable |
| Auth and session manager | authenticate member sessions and session recovery | yes, lightweight only | avoid remote lookups on every order |
| Reference data store | symbols, members, venue config, static limits | no | usually replicated into local caches |
| Local config or limit cache | fast access to venue rules, member state, symbol metadata | yes | cache is for reference, not truth |
| Venue controls service | kill switches, price bands, throttles, cancel-on-disconnect | yes | must fail safe |
| Sequencer per partition | assign authoritative event order | yes | the heart of fairness and replay |
| Matching engine shard | apply state transitions and generate fills | yes | single-writer by design |
| In-memory order book | resident state for one symbol or partition | yes | optimized for mutation, not SQL |
| Durable journal or WAL | committed event boundary and replay source | yes, but minimal | feeds standby and downstream systems |
| Warm standby | replay journal and take over after failover | no in steady state | promoted only with sequence fencing |
| Market data builder | turn committed engine output into book deltas and snapshots | off matcher thread | never let fan-out stall matching |
| Snapshot cache | serve market-data snapshots and recovery bootstrap | no | cache for distribution, not for matching |
| Kafka or event bus | downstream distribution for analytics, read models, and integrations | no | not the fairness-critical sequencer |
| Trade ledger store | reporting, historical queries, reconciled trade views | no | built from ordered truth |
| Surveillance and analytics | abuse detection, compliance, operational analytics | no | subscribe from committed stream |
| Clearing adapter | post-trade handoff to clearing and settlement workflows | no | asynchronous from matcher |
| Observability stack | metrics, logs, traces, alerts | no | make replay and failover visible |
| Operations console | kill switches, partition ownership, canary and failover controls | no | must be auditable and permissioned |
Two intentional choices in that table matter:
Kafkaexists, but off the matcher hot path- caches exist, but they are reference or distribution layers, not the source of truth for matching
High-Level System Design Diagram
If you want the full component picture, it should look closer to this:
flowchart LR
subgraph Ingress["Ingress and Session Edge"]
A[Members / Brokers]
B[Global DNS / Traffic Manager]
C[L4 Load Balancer]
D[FIX / OUCH Gateways]
E[Auth / Session Manager]
end
subgraph Control["Reference and Control Plane"]
F[Reference Data DB]
G[Local Config and Limit Cache]
H[Venue Controls<br/>Kill Switches / Price Bands / Throttles]
I[Operations Console]
end
subgraph HotPath["Low-Latency Matching Core"]
J[Sequencer per Partition]
K[Matching Engine Shard]
L[In-Memory Order Book]
M[Durable Journal / WAL]
N[Warm Standby / Replay]
end
subgraph MarketData["Market Data Plane"]
O[Market Data Builder]
P[Snapshot Cache]
Q[Market Data Gateways]
end
subgraph Async["Async Distribution and Read Side"]
R[Kafka]
S[Trade Ledger DB]
T[Surveillance / Analytics]
U[Clearing / Post Trade]
V[Reporting / Read APIs]
end
subgraph Ops["Observability"]
W[Metrics / Logs / Traces / Alerts]
end
A --> B --> C --> D --> E
E --> H
F --> G --> H
I --> H
H --> J --> K --> L
K --> M
M --> N
K --> O --> P --> Q
M --> R
R --> S
R --> T
R --> U
R --> V
D --> W
J --> W
K --> W
M --> W
O --> W
What this diagram is saying:
- the load balancer and gateways handle connection scale, not matching correctness
- caches sit next to control and market data, not as the live book authority
- Kafka is a downstream distribution backbone, not the primary fairness mechanism
- the durable journal is the bridge between low-latency matching and the rest of the platform
Hot Path Architecture
The hot path itself should still stay much smaller than the full platform:
flowchart LR
subgraph Edge["Member Edge"]
A[Members / Brokers]
B[FIX / OUCH Gateways]
C[Session Auth and Basic Validation]
D[Venue Controls<br/>Kill Switches / Price Bands / Throttles]
end
subgraph HotPath["Deterministic Matching Hot Path"]
E[Sequencer<br/>per Symbol Partition]
F[Matching Engine Shard<br/>Single Writer]
G[In-Memory Order Book]
H[Execution Reports / Drop Copy]
end
subgraph MarketData["Market Data Plane"]
I[Market Data Builder]
J[Top of Book / Depth / Snapshots]
K[Fanout and Gap Recovery]
end
subgraph Recovery["Durability and Recovery"]
L[Durable Event Journal]
M[Warm Standby / Replay]
end
subgraph Downstream["Post-Trade and Controls"]
N[Surveillance / Compliance]
O[Clearing / Post Trade]
end
A --> B --> C --> D --> E --> F --> G
F --> H
F --> I
F --> L
I --> J --> K
L --> M
L --> N
L --> O
The important thing here is not the boxes. It is the ordering of responsibility.
The diagram also shows the architectural separation that matters most:
- the matching hot path stays short and single-writer
- market data fan-out is isolated so slow consumers cannot stall matching
- durability and replay are explicit instead of being implied by logs
- surveillance and clearing consume the same ordered truth, but off the hot path
The hot path should do as little as possible before the event reaches the sequencer and matching shard:
- terminate session and protocol
- authenticate member
- apply lightweight structural validation
- enforce venue-level emergency controls
- sequence the event
- apply it to the in-memory book
- emit deterministic results
Everything else should be pushed out of the critical path unless correctness absolutely requires it there.
The Matching Engine Should Be a Deterministic State Machine
The safest mental model is not “service.” It is “single-writer state machine over a sequenced event stream.”
For each symbol or symbol partition:
- one engine instance owns the authoritative mutable book
- events enter in one total order
- the engine mutates in-memory state
- outputs are generated from those state transitions
That is why many serious trading systems still prefer a single-writer design per book. Not because parallelism is unknown, but because parallelizing one order book usually creates more ambiguity than value.
Minimal engine loop:
for each sequenced event:
validate against current venue rules
apply new/cancel/replace logic
generate fills if crossing exists
update top-of-book and depth state
emit execution reports and book deltas
persist or replicate the committed event boundary
If you cannot replay that loop and get the same state, your design is not ready.
Order Book Design
For a basic cash equities venue, a practical in-memory order book usually needs:
- bid side ordered by descending price
- ask side ordered by ascending price
- FIFO queue at each price level for time priority
- order-id index for fast cancel lookup
- compact metadata for member, side, quantity, and flags
The key is not the exact container type. The key is keeping the mutation path short, predictable, and allocation-light.
Design instinct:
- match in memory
- keep the hot book resident
- avoid remote lookups
- avoid per-event heap churn if you are on the JVM
If this is implemented on Java, low-allocation discipline is not optional. Object-heavy models and stop-the-world surprises are expensive in a latency-sensitive matcher. Java can work, but only if the team treats memory layout, allocation rate, and GC behavior as first-class design concerns.
Why Microservices Are Usually the Wrong Shape for the Hot Path
Microservices are good for ownership boundaries. They are often terrible for one sub-millisecond deterministic loop.
A naive microservice breakdown like this:
- order service
- risk service
- book service
- trade service
- market data service
looks modular on a whiteboard, but it creates:
- more network hops
- more serialization
- more retry complexity
- more tail-latency variance
- more places where sequencing can drift
Opinionated rule: the matching path should look more like a tightly integrated appliance than a chatty service mesh.
You can still have service boundaries around:
- onboarding
- reporting
- surveillance
- clearing integration
- analytics
Just do not turn the matcher into a distributed workflow engine.
Sequencing Is the Real Heart of the System
Most interviews talk about the matching engine. Senior architects talk about sequencing.
Without a clear sequencing strategy, everything downstream becomes suspect:
- fairness
- replay
- cancel races
- market data consistency
- standby recovery
You need a clear answer to this question:
Who assigns the authoritative order of events for a symbol partition?
Good default answer:
- one sequencer per partition
- one monotonically increasing sequence space
- one primary writer at a time
The sequencer does not need to know finance. It needs to establish undeniable order.
That is what lets the matcher stay deterministic and what lets standbys replay correctly.
Timestamps Are Not a Sequencing Strategy
Teams sometimes say, “We will use synchronized clocks and order by timestamp.” That is not strong enough for a fairness-critical matching system.
Why not:
- two gateways can still observe arrivals in different moments
- network jitter changes when messages become visible
- clock sync is good, but not perfect enough to be your final authority
- replay is much cleaner with explicit sequence numbers than with inferred time ordering
Use timestamps for:
- audit trails
- latency measurement
- surveillance analysis
- cross-system correlation
Use sequence numbers for:
- authoritative ordering
- replay
- failover fencing
- deterministic market data derivation
That distinction is easy to say and surprisingly important in design reviews.
Scale Out by Symbol, Not by Shared Locking
The natural way to scale an exchange is horizontal partitioning by symbol or instrument group.
That means:
- each partition gets its own sequencer and matching shard
- symbols are mapped to partitions
- partitions can move across machines during planned rebalance
- one symbol belongs to one active shard at a time
Why this works:
- price-time priority is preserved within a partition
- the hottest books are isolated
- recovery and failover stay local to a shard
Why a global shared book service is bad:
- one lock or one queue becomes the bottleneck
- contention collapse appears under burst traffic
- every symbol starts paying for every other symbol’s load
Related instinct on this site:
The Hard Problem: One Hot Symbol
Partitioning by symbol solves average scale. It does not magically solve the hottest symbol on the venue.
That is where many designs become unrealistic.
You cannot arbitrarily split one order book across many active writers without paying for coordination that damages either:
- latency
- simplicity
- determinism
- fairness
For one hot symbol, serious options are usually:
- scale the single shard vertically and keep the book local
- simplify order types or controls in the hottest path
- move some non-critical downstream work off the engine thread
- apply venue mechanisms like auctioning, throttling, or trading halts when market integrity matters more than raw throughput
What is usually a bad answer: “Let us make one symbol active-active across regions with distributed consensus.”
That is academically interesting and operationally dangerous for a low-latency venue.
Active-Active for One Book Is Usually the Wrong Trade-Off
For a stock exchange, active-active on the same symbol looks attractive in a slide deck because it promises availability. In practice, it often injects exactly the wrong kind of coupling into the hottest path.
Costs:
- extra coordination latency
- more complicated failure semantics
- harder replay and reconciliation
- more ambiguity during network partitions
Safer default:
- active-primary per partition
- warm or hot standby following the same event journal
- explicit failover with sequence fencing
This gives you a clean answer to “who owns the book right now?” That answer is worth more than theoretical multi-writer elegance.
Failure Recovery Must Be Designed Before Scale
Recovery is where immature exchange designs usually reveal themselves.
You need to answer these questions early:
- where is the authoritative event journal
- how does a standby know what was committed
- how do sessions recover after failover
- how are duplicate client retries handled
- how do you guarantee that replay rebuilds the same book
Healthy recovery model:
- sequenced input is durably captured
- matching results are derived deterministically
- standby replays the same sequence
- failover promotes only after a clear sequence boundary
- clients resynchronize using last-seen sequence numbers or session recovery rules
Design trade-off: synchronous durability on every event improves safety but increases latency. Asynchronous durability improves latency but expands the failure window.
There is no free answer here. You choose explicitly based on market integrity, regulatory expectations, and tolerated data-loss window.
Market Data Is a System of Its Own
A matching engine without a strong market data design is incomplete.
Market data usually needs:
- top-of-book and depth updates
- snapshots plus incremental feeds
- replay for missed gaps
- slow-consumer isolation
- different products for internal and external consumers
The crucial rule is simple: market data consumers must never be allowed to backpressure the matcher.
That means:
- the engine emits to an internal builder or ring buffer
- downstream fan-out is isolated from the core matcher
- slow sessions are dropped or resynchronized, not allowed to stall core processing
- replay and gap-fill are handled by dedicated recovery infrastructure
This is exactly the kind of boundary where unbounded queues become dishonest. If the fan-out path is overloaded, the design should make that visible quickly instead of storing overload as memory growth and latency.
Related reading:
Pre-Trade Controls Without Polluting the Hot Path
Not every risk check belongs inside the exchange core.
Broker-side checks usually include:
- client credit
- margin
- account permissions
- portfolio exposure
Exchange-side controls usually include:
- price collars and fat-finger checks
- cancel-on-disconnect
- throttling and kill switches
- self-trade prevention policies
- venue state transitions such as halt and auction modes
Architectural principle: keep the exchange hot path focused on venue-owned controls and deterministic matching. Do not make one new order depend on slow external account logic.
If you must consult external state in the hot path, you have already accepted a much slower and more failure-prone system.
Audit, Surveillance, and Post-Trade Should Be Fed From the Same Truth
A common bad pattern is rebuilding audit trails by stitching together logs from multiple services after the fact.
That is not good enough for an exchange.
Instead:
- audit should consume the same sequenced event stream or committed engine outputs
- surveillance should operate off durable, ordered market events
- clearing adapters should subscribe from the same post-trade truth
This reduces one of the nastiest operational problems: two teams arguing during an incident about which log is authoritative.
Senior architect instinct: make truth explicit at the architecture level, not just in incident documentation.
Performance Bottlenecks That Actually Matter
Exchange systems do not usually fail first because of raw CPU shortage. They fail because one of these becomes dominant:
- one sequencer or book becomes too hot
- lock contention appears in what should have been a single-writer path
- market data fan-out backlogs
- per-event allocations create GC turbulence
- retries or reconnect storms amplify load during an incident
- storage or replication jitter increases tail latency
That is why average latency is not enough. The metrics that matter more are:
- p99 and p999 order acknowledgment latency
- queue depth before sequencer and market data fan-out
- per-partition throughput
- cancel latency under burst
- replay catch-up time
- failover recovery time to known-good sequence
This is one of the clearest differences between CRUD thinking and market-infrastructure thinking.
What Serious Trading Venues Do Differently
The strongest real-world systems usually make a few disciplined choices:
- They keep the matching path short and deterministic.
- They use single-writer ownership per book or partition.
- They treat sequencing as a first-class subsystem, not a side effect.
- They separate market data distribution from core matching.
- They prefer explicit failover over ambiguous multi-writer coordination.
- They optimize for worst-case symbol pressure and tail latency, not only aggregate throughput.
- They drill replay, failover, and disconnect scenarios repeatedly.
This is also where “big tech thinking” can mislead people. A stock exchange is not a social feed. You do not win by casually making every component eventually consistent and massively parallel. You win by keeping the hot path precise, deterministic, and operationally understandable.
Common Bad Designs
These designs sound modern and still fail review:
Using a database transaction as the matching engine
The database is useful for durability and downstream views. It is usually too slow and too variable to be the core matching loop.
Putting Kafka or a general event bus directly in the fairness-critical hot path
General messaging platforms are excellent for many things. They are not automatically the right primitive for the lowest-latency authoritative matcher boundary.
Letting slow consumers back up the matcher
This turns external fan-out into internal market integrity risk.
Splitting one order book across many chatty services
That moves the hardest correctness problem into the network.
Designing for average traffic instead of hot-symbol bursts
The average will not be what breaks you on the bad day.
If I Had to Build Version 1
I would start with these choices:
- one region, not multi-region active-active
- cash equities only, not all asset classes
- limit, market, cancel, replace first; add complex order types later
- partition by symbol
- single primary sequencer and matcher per partition
- in-memory resident books
- deterministic event journal
- warm standby replaying the same partition stream
- dedicated market data builder and fan-out tier
- asynchronous downstream integration for surveillance and clearing
That is not because the system is “simple.” It is because version 1 should preserve the hardest invariants with the fewest moving parts.
The wrong ambition in an exchange is feature breadth before sequencing clarity.
Interview-Ready Answer Shape
If this comes up in a system design interview, I would structure the answer like this:
- Define the boundary: exchange, broker, and clearing are separate systems.
- State the invariant: one authoritative event order per symbol partition.
- Describe the hot path: gateway -> validation -> sequencer -> matching engine -> journal -> market data and execution outputs.
- Explain scale: partition by symbol, not by shared locking.
- Explain recovery: deterministic replay with primary-standby failover.
- Call out the real trade-off: active-active for one book hurts latency and determinism more than it helps.
- Close with market data isolation and operational controls.
That answer is stronger than listing caches, load balancers, and databases without saying who owns order.
Design Review Questions I Would Ask
Before approving this architecture, I would ask:
- Where is the single source of truth for event order?
- Can a cancel and a fill race produce ambiguity?
- What is the failover fence that prevents split-brain on one partition?
- Can market data lag or replay without ever slowing the matcher?
- What happens when one symbol becomes 20 times hotter than the rest?
- Which controls can stop damage immediately, and who owns them?
- Can we replay one day of traffic and reproduce the same book and trades?
If those answers are weak, the design is not mature yet.
Related Reading on This Site
- Shared Memory vs Message Passing in Java Applications
- Contention Collapse Under Load in Java Services
- Bounded Queues and Backpressure in Java Systems
- Work Queue Design Mistakes in Backend Systems
Key Takeaways
- A stock exchange is a deterministic sequencing problem under latency pressure, not a generic CRUD platform.
- The core architecture should preserve one authoritative event order per symbol partition.
- Single-writer matching per book is usually a better trade-off than fancy multi-writer coordination.
- Market data fan-out, replay, and failover deserve first-class design attention, not leftover infrastructure.
- The most senior design choice is often deciding what not to put in the hot path.
Comments