Designing a Stock Exchange System

Designing a stock exchange is not “build an order API and add Kafka.”

The hard part is preserving deterministic market behavior under bursty load while keeping fairness, auditability, recovery, and latency credible at the same time.

Most weak system design answers focus too much on horizontal scale and too little on sequencing, hot-path ownership, and failure fences. In an exchange, those are the design.

Start With the Boundary

A stock exchange is not the whole trading stack. If that boundary is blurry, the design usually becomes slower and more coupled than it needs to be.

Exchange responsibilities:

accept orders from members or brokers
apply venue rules
maintain the order book
match orders with deterministic priority
publish executions and market data
keep an auditable event trail

Broker responsibilities:

customer onboarding and authentication
account balances, credit, and margin
portfolio-level exposure checks
smart order routing across venues

Clearing and settlement responsibilities:

novation, netting, and settlement workflows
post-trade reconciliation
downstream money and securities movement

Senior architect rule: do not drag broker or clearing workflows into the matching hot path. Every extra dependency adds latency variance, failure coupling, and operational ambiguity.

The Non-Negotiable Invariants

An exchange is not primarily a throughput system. It is a correctness system under latency pressure.

The invariants that matter most are:

For a given symbol partition, there must be one authoritative processing order.
Matching results must be deterministic and replayable.
Price-time priority must be enforceable without ambiguity.
Market data must be derived from the same committed state transitions as trades.
Recovery must rebuild the exact book state, not an approximation.
Safety controls must be able to stop damage faster than the market can amplify it.

That list should drive the architecture more than any database or messaging choice.

What Makes This System Hard

The exchange hot path combines design pressures that normally fight each other:

very low latency
deterministic ordering
bursty traffic
skewed symbol popularity
strict audit requirements
expensive failure recovery if you get sequencing wrong

The hottest symbol matters more than average venue traffic. That is one of the first places generic scaling advice breaks down.

If one symbol experiences a sudden spike because of earnings, a rumor, or index rebalancing, the system does not care that your average load is comfortable. The hottest book becomes the architecture test.

Capacity Thinking That Does Not Lie

A good estimate for an exchange is not just “messages per second.” It is at least these four numbers:

peak inbound order events per second
cancel or replace ratio
hottest-symbol concentration
outbound market data fan-out

Illustrative pressure model:

inbound order events peak at 1M/s
cancels and replaces dominate during volatile periods
top symbols can consume a disproportionate share of one partition
outbound market data can exceed inbound volume by an order of magnitude once fan-out begins

That last point matters. Many teams design the matcher carefully and then under-design market data distribution. In reality, slow consumers, replay storms, and gap recovery can hurt you just as much as raw order ingress.

Architectural consequence: size the platform for hot partitions and market data amplification, not just for average API throughput.

Component Map

The exchange should be described as explicit components with explicit roles. That is what makes architecture review useful instead of diagram theater.

Component	Responsibility	In hot path	Notes
Global DNS or traffic manager	route members to the active region or site	no	failover and traffic steering
L4 load balancer	distribute TCP sessions across gateway nodes	edge only	keep long-lived trading sessions stable
FIX / OUCH gateways	terminate protocol sessions and normalize commands	yes	keep parsing fast and predictable
Auth and session manager	authenticate member sessions and session recovery	yes, lightweight only	avoid remote lookups on every order
Reference data store	symbols, members, venue config, static limits	no	usually replicated into local caches
Local config or limit cache	fast access to venue rules, member state, symbol metadata	yes	cache is for reference, not truth
Venue controls service	kill switches, price bands, throttles, cancel-on-disconnect	yes	must fail safe
Sequencer per partition	assign authoritative event order	yes	the heart of fairness and replay
Matching engine shard	apply state transitions and generate fills	yes	single-writer by design
In-memory order book	resident state for one symbol or partition	yes	optimized for mutation, not SQL
Durable journal or WAL	committed event boundary and replay source	yes, but minimal	feeds standby and downstream systems
Warm standby	replay journal and take over after failover	no in steady state	promoted only with sequence fencing
Market data builder	turn committed engine output into book deltas and snapshots	off matcher thread	never let fan-out stall matching
Snapshot cache	serve market-data snapshots and recovery bootstrap	no	cache for distribution, not for matching
Kafka or event bus	downstream distribution for analytics, read models, and integrations	no	not the fairness-critical sequencer
Trade ledger store	reporting, historical queries, reconciled trade views	no	built from ordered truth
Surveillance and analytics	abuse detection, compliance, operational analytics	no	subscribe from committed stream
Clearing adapter	post-trade handoff to clearing and settlement workflows	no	asynchronous from matcher
Observability stack	metrics, logs, traces, alerts	no	make replay and failover visible
Operations console	kill switches, partition ownership, canary and failover controls	no	must be auditable and permissioned

Two intentional choices in that table matter:

Kafka exists, but off the matcher hot path
caches exist, but they are reference or distribution layers, not the source of truth for matching

High-Level System Design Diagram

If you want the full component picture, it should look closer to this:

flowchart LR
    subgraph Ingress["Ingress and Session Edge"]
        A[Members / Brokers]
        B[Global DNS / Traffic Manager]
        C[L4 Load Balancer]
        D[FIX / OUCH Gateways]
        E[Auth / Session Manager]
    end

    subgraph Control["Reference and Control Plane"]
        F[Reference Data DB]
        G[Local Config and Limit Cache]
        H[Venue Controls<br/>Kill Switches / Price Bands / Throttles]
        I[Operations Console]
    end

    subgraph HotPath["Low-Latency Matching Core"]
        J[Sequencer per Partition]
        K[Matching Engine Shard]
        L[In-Memory Order Book]
        M[Durable Journal / WAL]
        N[Warm Standby / Replay]
    end

    subgraph MarketData["Market Data Plane"]
        O[Market Data Builder]
        P[Snapshot Cache]
        Q[Market Data Gateways]
    end

    subgraph Async["Async Distribution and Read Side"]
        R[Kafka]
        S[Trade Ledger DB]
        T[Surveillance / Analytics]
        U[Clearing / Post Trade]
        V[Reporting / Read APIs]
    end

    subgraph Ops["Observability"]
        W[Metrics / Logs / Traces / Alerts]
    end

    A --> B --> C --> D --> E
    E --> H
    F --> G --> H
    I --> H

    H --> J --> K --> L
    K --> M
    M --> N

    K --> O --> P --> Q
    M --> R
    R --> S
    R --> T
    R --> U
    R --> V

    D --> W
    J --> W
    K --> W
    M --> W
    O --> W

What this diagram is saying:

the load balancer and gateways handle connection scale, not matching correctness
caches sit next to control and market data, not as the live book authority
Kafka is a downstream distribution backbone, not the primary fairness mechanism
the durable journal is the bridge between low-latency matching and the rest of the platform

Hot Path Architecture

The hot path itself should still stay much smaller than the full platform:

flowchart LR
    subgraph Edge["Member Edge"]
        A[Members / Brokers]
        B[FIX / OUCH Gateways]
        C[Session Auth and Basic Validation]
        D[Venue Controls<br/>Kill Switches / Price Bands / Throttles]
    end

    subgraph HotPath["Deterministic Matching Hot Path"]
        E[Sequencer<br/>per Symbol Partition]
        F[Matching Engine Shard<br/>Single Writer]
        G[In-Memory Order Book]
        H[Execution Reports / Drop Copy]
    end

    subgraph MarketData["Market Data Plane"]
        I[Market Data Builder]
        J[Top of Book / Depth / Snapshots]
        K[Fanout and Gap Recovery]
    end

    subgraph Recovery["Durability and Recovery"]
        L[Durable Event Journal]
        M[Warm Standby / Replay]
    end

    subgraph Downstream["Post-Trade and Controls"]
        N[Surveillance / Compliance]
        O[Clearing / Post Trade]
    end

    A --> B --> C --> D --> E --> F --> G
    F --> H
    F --> I
    F --> L
    I --> J --> K
    L --> M
    L --> N
    L --> O

The important thing here is not the boxes. It is the ordering of responsibility.

The diagram also shows the architectural separation that matters most:

the matching hot path stays short and single-writer
market data fan-out is isolated so slow consumers cannot stall matching
durability and replay are explicit instead of being implied by logs
surveillance and clearing consume the same ordered truth, but off the hot path

The hot path should do as little as possible before the event reaches the sequencer and matching shard:

terminate session and protocol
authenticate member
apply lightweight structural validation
enforce venue-level emergency controls
sequence the event
apply it to the in-memory book
emit deterministic results

Everything else should be pushed out of the critical path unless correctness absolutely requires it there.

The Matching Engine Should Be a Deterministic State Machine

The safest mental model is not “service.” It is “single-writer state machine over a sequenced event stream.”

For each symbol or symbol partition:

one engine instance owns the authoritative mutable book
events enter in one total order
the engine mutates in-memory state
outputs are generated from those state transitions

That is why many serious trading systems still prefer a single-writer design per book. Not because parallelism is unknown, but because parallelizing one order book usually creates more ambiguity than value.

Minimal engine loop:

for each sequenced event:
  validate against current venue rules
  apply new/cancel/replace logic
  generate fills if crossing exists
  update top-of-book and depth state
  emit execution reports and book deltas
  persist or replicate the committed event boundary

If you cannot replay that loop and get the same state, your design is not ready.

Order Book Design

For a basic cash equities venue, a practical in-memory order book usually needs:

bid side ordered by descending price
ask side ordered by ascending price
FIFO queue at each price level for time priority
order-id index for fast cancel lookup
compact metadata for member, side, quantity, and flags

The key is not the exact container type. The key is keeping the mutation path short, predictable, and allocation-light.

Design instinct:

match in memory
keep the hot book resident
avoid remote lookups
avoid per-event heap churn if you are on the JVM

If this is implemented on Java, low-allocation discipline is not optional. Object-heavy models and stop-the-world surprises are expensive in a latency-sensitive matcher. Java can work, but only if the team treats memory layout, allocation rate, and GC behavior as first-class design concerns.

Why Microservices Are Usually the Wrong Shape for the Hot Path

Microservices are good for ownership boundaries. They are often terrible for one sub-millisecond deterministic loop.

A naive microservice breakdown like this:

order service
risk service
book service
trade service
market data service

looks modular on a whiteboard, but it creates:

more network hops
more serialization
more retry complexity
more tail-latency variance
more places where sequencing can drift

Opinionated rule: the matching path should look more like a tightly integrated appliance than a chatty service mesh.

You can still have service boundaries around:

onboarding
reporting
surveillance
clearing integration
analytics

Just do not turn the matcher into a distributed workflow engine.

Sequencing Is the Real Heart of the System

Most interviews talk about the matching engine. Senior architects talk about sequencing.

Without a clear sequencing strategy, everything downstream becomes suspect:

fairness
replay
cancel races
market data consistency
standby recovery

You need a clear answer to this question:

Who assigns the authoritative order of events for a symbol partition?

Good default answer:

one sequencer per partition
one monotonically increasing sequence space
one primary writer at a time

The sequencer does not need to know finance. It needs to establish undeniable order.

That is what lets the matcher stay deterministic and what lets standbys replay correctly.

Timestamps Are Not a Sequencing Strategy

Teams sometimes say, “We will use synchronized clocks and order by timestamp.” That is not strong enough for a fairness-critical matching system.

Why not:

two gateways can still observe arrivals in different moments
network jitter changes when messages become visible
clock sync is good, but not perfect enough to be your final authority
replay is much cleaner with explicit sequence numbers than with inferred time ordering

Use timestamps for:

audit trails
latency measurement
surveillance analysis
cross-system correlation

Use sequence numbers for:

authoritative ordering
replay
failover fencing
deterministic market data derivation

That distinction is easy to say and surprisingly important in design reviews.

Scale Out by Symbol, Not by Shared Locking

The natural way to scale an exchange is horizontal partitioning by symbol or instrument group.

That means:

each partition gets its own sequencer and matching shard
symbols are mapped to partitions
partitions can move across machines during planned rebalance
one symbol belongs to one active shard at a time

Why this works:

price-time priority is preserved within a partition
the hottest books are isolated
recovery and failover stay local to a shard

Why a global shared book service is bad:

one lock or one queue becomes the bottleneck
contention collapse appears under burst traffic
every symbol starts paying for every other symbol’s load

Related instinct on this site:

Contention Collapse Under Load in Java Services

The Hard Problem: One Hot Symbol

Partitioning by symbol solves average scale. It does not magically solve the hottest symbol on the venue.

That is where many designs become unrealistic.

You cannot arbitrarily split one order book across many active writers without paying for coordination that damages either:

latency
simplicity
determinism
fairness

For one hot symbol, serious options are usually:

scale the single shard vertically and keep the book local
simplify order types or controls in the hottest path
move some non-critical downstream work off the engine thread
apply venue mechanisms like auctioning, throttling, or trading halts when market integrity matters more than raw throughput

What is usually a bad answer: “Let us make one symbol active-active across regions with distributed consensus.”

That is academically interesting and operationally dangerous for a low-latency venue.

Active-Active for One Book Is Usually the Wrong Trade-Off

For a stock exchange, active-active on the same symbol looks attractive in a slide deck because it promises availability. In practice, it often injects exactly the wrong kind of coupling into the hottest path.

Costs:

extra coordination latency
more complicated failure semantics
harder replay and reconciliation
more ambiguity during network partitions

Safer default:

active-primary per partition
warm or hot standby following the same event journal
explicit failover with sequence fencing

This gives you a clean answer to “who owns the book right now?” That answer is worth more than theoretical multi-writer elegance.

Failure Recovery Must Be Designed Before Scale

Recovery is where immature exchange designs usually reveal themselves.

You need to answer these questions early:

where is the authoritative event journal
how does a standby know what was committed
how do sessions recover after failover
how are duplicate client retries handled
how do you guarantee that replay rebuilds the same book

Healthy recovery model:

sequenced input is durably captured
matching results are derived deterministically
standby replays the same sequence
failover promotes only after a clear sequence boundary
clients resynchronize using last-seen sequence numbers or session recovery rules

Design trade-off: synchronous durability on every event improves safety but increases latency. Asynchronous durability improves latency but expands the failure window.

There is no free answer here. You choose explicitly based on market integrity, regulatory expectations, and tolerated data-loss window.

Market Data Is a System of Its Own

A matching engine without a strong market data design is incomplete.

Market data usually needs:

top-of-book and depth updates
snapshots plus incremental feeds
replay for missed gaps
slow-consumer isolation
different products for internal and external consumers

The crucial rule is simple: market data consumers must never be allowed to backpressure the matcher.

That means:

the engine emits to an internal builder or ring buffer
downstream fan-out is isolated from the core matcher
slow sessions are dropped or resynchronized, not allowed to stall core processing
replay and gap-fill are handled by dedicated recovery infrastructure

This is exactly the kind of boundary where unbounded queues become dishonest. If the fan-out path is overloaded, the design should make that visible quickly instead of storing overload as memory growth and latency.

Pre-Trade Controls Without Polluting the Hot Path

Not every risk check belongs inside the exchange core.

Broker-side checks usually include:

client credit
margin
account permissions
portfolio exposure

Exchange-side controls usually include:

price collars and fat-finger checks
cancel-on-disconnect
throttling and kill switches
self-trade prevention policies
venue state transitions such as halt and auction modes

Architectural principle: keep the exchange hot path focused on venue-owned controls and deterministic matching. Do not make one new order depend on slow external account logic.

If you must consult external state in the hot path, you have already accepted a much slower and more failure-prone system.

Audit, Surveillance, and Post-Trade Should Be Fed From the Same Truth

A common bad pattern is rebuilding audit trails by stitching together logs from multiple services after the fact.

That is not good enough for an exchange.

Instead:

audit should consume the same sequenced event stream or committed engine outputs
surveillance should operate off durable, ordered market events
clearing adapters should subscribe from the same post-trade truth

This reduces one of the nastiest operational problems: two teams arguing during an incident about which log is authoritative.

Senior architect instinct: make truth explicit at the architecture level, not just in incident documentation.

Performance Bottlenecks That Actually Matter

Exchange systems do not usually fail first because of raw CPU shortage. They fail because one of these becomes dominant:

one sequencer or book becomes too hot
lock contention appears in what should have been a single-writer path
market data fan-out backlogs
per-event allocations create GC turbulence
retries or reconnect storms amplify load during an incident
storage or replication jitter increases tail latency

That is why average latency is not enough. The metrics that matter more are:

p99 and p999 order acknowledgment latency
queue depth before sequencer and market data fan-out
per-partition throughput
cancel latency under burst
replay catch-up time
failover recovery time to known-good sequence

This is one of the clearest differences between CRUD thinking and market-infrastructure thinking.

What Serious Trading Venues Do Differently

The strongest real-world systems usually make a few disciplined choices:

They keep the matching path short and deterministic.
They use single-writer ownership per book or partition.
They treat sequencing as a first-class subsystem, not a side effect.
They separate market data distribution from core matching.
They prefer explicit failover over ambiguous multi-writer coordination.
They optimize for worst-case symbol pressure and tail latency, not only aggregate throughput.
They drill replay, failover, and disconnect scenarios repeatedly.

This is also where “big tech thinking” can mislead people. A stock exchange is not a social feed. You do not win by casually making every component eventually consistent and massively parallel. You win by keeping the hot path precise, deterministic, and operationally understandable.

Common Bad Designs

These designs sound modern and still fail review:

Using a database transaction as the matching engine

The database is useful for durability and downstream views. It is usually too slow and too variable to be the core matching loop.

Putting Kafka or a general event bus directly in the fairness-critical hot path

General messaging platforms are excellent for many things. They are not automatically the right primitive for the lowest-latency authoritative matcher boundary.

Letting slow consumers back up the matcher

This turns external fan-out into internal market integrity risk.

Splitting one order book across many chatty services

That moves the hardest correctness problem into the network.

Designing for average traffic instead of hot-symbol bursts

The average will not be what breaks you on the bad day.

If I Had to Build Version 1

I would start with these choices:

one region, not multi-region active-active
cash equities only, not all asset classes
limit, market, cancel, replace first; add complex order types later
partition by symbol
single primary sequencer and matcher per partition
in-memory resident books
deterministic event journal
warm standby replaying the same partition stream
dedicated market data builder and fan-out tier
asynchronous downstream integration for surveillance and clearing

That is not because the system is “simple.” It is because version 1 should preserve the hardest invariants with the fewest moving parts.

The wrong ambition in an exchange is feature breadth before sequencing clarity.

Interview-Ready Answer Shape

If this comes up in a system design interview, I would structure the answer like this:

Define the boundary: exchange, broker, and clearing are separate systems.
State the invariant: one authoritative event order per symbol partition.
Describe the hot path: gateway -> validation -> sequencer -> matching engine -> journal -> market data and execution outputs.
Explain scale: partition by symbol, not by shared locking.
Explain recovery: deterministic replay with primary-standby failover.
Call out the real trade-off: active-active for one book hurts latency and determinism more than it helps.
Close with market data isolation and operational controls.

That answer is stronger than listing caches, load balancers, and databases without saying who owns order.

Design Review Questions I Would Ask

Before approving this architecture, I would ask:

Where is the single source of truth for event order?
Can a cancel and a fill race produce ambiguity?
What is the failover fence that prevents split-brain on one partition?
Can market data lag or replay without ever slowing the matcher?
What happens when one symbol becomes 20 times hotter than the rest?
Which controls can stop damage immediately, and who owns them?
Can we replay one day of traffic and reproduce the same book and trades?

If those answers are weak, the design is not mature yet.

Key Takeaways

A stock exchange is a deterministic sequencing problem under latency pressure, not a generic CRUD platform.
The core architecture should preserve one authoritative event order per symbol partition.
Single-writer matching per book is usually a better trade-off than fancy multi-writer coordination.
Market data fan-out, replay, and failover deserve first-class design attention, not leftover infrastructure.
The most senior design choice is often deciding what not to put in the hot path.

Share on

X Facebook LinkedIn Bluesky

Start With the Boundary

The Non-Negotiable Invariants

What Makes This System Hard

Capacity Thinking That Does Not Lie

Component Map

High-Level System Design Diagram

Hot Path Architecture

The Matching Engine Should Be a Deterministic State Machine

Order Book Design

Why Microservices Are Usually the Wrong Shape for the Hot Path

Sequencing Is the Real Heart of the System

Timestamps Are Not a Sequencing Strategy

Scale Out by Symbol, Not by Shared Locking

The Hard Problem: One Hot Symbol

Active-Active for One Book Is Usually the Wrong Trade-Off

Failure Recovery Must Be Designed Before Scale

Market Data Is a System of Its Own

Pre-Trade Controls Without Polluting the Hot Path

Audit, Surveillance, and Post-Trade Should Be Fed From the Same Truth

Performance Bottlenecks That Actually Matter

What Serious Trading Venues Do Differently

Common Bad Designs

Using a database transaction as the matching engine

Putting Kafka or a general event bus directly in the fairness-critical hot path

Letting slow consumers back up the matcher

Splitting one order book across many chatty services

Designing for average traffic instead of hot-symbol bursts

If I Had to Build Version 1

Interview-Ready Answer Shape

Design Review Questions I Would Ask

Key Takeaways

Share on

Comments

You may also enjoy

Linked List Patterns in Java — A Detailed Guide

Intervals Pattern in Java — A Detailed Guide

Heap and Priority Queue Pattern in Java — A Detailed Guide

Monotonic Queue (Deque) Pattern in Java — A Detailed Guide

Designing a Stock Exchange System

Start With the Boundary

The Non-Negotiable Invariants

What Makes This System Hard

Capacity Thinking That Does Not Lie

Component Map

High-Level System Design Diagram

Hot Path Architecture

The Matching Engine Should Be a Deterministic State Machine

Order Book Design

Why Microservices Are Usually the Wrong Shape for the Hot Path

Sequencing Is the Real Heart of the System

Timestamps Are Not a Sequencing Strategy

Scale Out by Symbol, Not by Shared Locking

The Hard Problem: One Hot Symbol

Active-Active for One Book Is Usually the Wrong Trade-Off

Failure Recovery Must Be Designed Before Scale

Market Data Is a System of Its Own

Pre-Trade Controls Without Polluting the Hot Path

Audit, Surveillance, and Post-Trade Should Be Fed From the Same Truth

Performance Bottlenecks That Actually Matter

What Serious Trading Venues Do Differently

Common Bad Designs

Using a database transaction as the matching engine

Putting Kafka or a general event bus directly in the fairness-critical hot path

Letting slow consumers back up the matcher

Splitting one order book across many chatty services

Designing for average traffic instead of hot-symbol bursts

If I Had to Build Version 1

Interview-Ready Answer Shape

Design Review Questions I Would Ask

Related Reading on This Site

Key Takeaways

Share on

Comments

You may also enjoy

Linked List Patterns in Java — A Detailed Guide

Intervals Pattern in Java — A Detailed Guide

Heap and Priority Queue Pattern in Java — A Detailed Guide

Monotonic Queue (Deque) Pattern in Java — A Detailed Guide