Pro tip: This simulator works best on desktop, where you get the full hands-on experience.
System Design Simulator: Build, Break, and Fix Databases
You've memorized the definitions. But can you explain why one service going offline crashes your entire platform? If not, you've never broken a distributed system. Let's fix that.
The simulator below is StreamScale, a fictional streaming platform with 6 microservices and 3 database options. Each service has dependencies. Each database has trade-offs. Your job is to keep customers happy while the system tries to fall apart.
StreamScale Infrastructure Simulator
Learn how companies like Netflix scale their systems. Click services to simulate failures!
Learning Mode: All values shown (latencies, costs, sync times) are simplified for educational purposes. Real-world systems have more nuanced behavior, but these approximations help illustrate the core concepts.
Key Concepts
Latency
Time to respond. Lower is better. <100ms feels instant.
Throughput
Requests/second the system handles. Scales with shards.
Availability
99.9% = 8.7hrs downtime/year. Replicas increase this.
Idempotency
Same request twice = same result. Safe to retry on failure.
Consensus
Replicas agreeing on data. Raft/Paxos algorithms ensure this.
Shard
Horizontal data split. Each shard holds a portion of data.
Replica
Data copy for fault tolerance. Primary + followers.
Circuit Breaker
Stops requests to failing services. Prevents cascades.
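To make that last definition concrete, here is a minimal circuit-breaker sketch in Python. It is illustrative only: the failure threshold, the cooldown, and the idea of wrapping an arbitrary `func` are assumptions for this example, not behavior the simulator exposes.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a cooldown period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        # If the breaker is open and the cooldown has not elapsed, fail fast.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping call to failing service")
            self.opened_at = None  # half-open: let one trial call through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the breaker
            raise
        else:
            self.failures = 0  # success resets the failure count
            return result
```

The shape of the pattern is the point: once a dependency keeps failing, callers stop waiting on it and fail fast, which is what keeps one red service from dragging its neighbors down with it.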
What You Just Learned (Even If You Didn't Realize It)
If you played with the simulator above, you now understand system design concepts at a visceral level that no whiteboard session can provide:
- You felt sharding. When you moved that slider from 1 to 4 shards and watched latency drop from 200ms to 50ms, that's not theory anymore. That's your intuition being calibrated.
- You caused cascade failures. When taking down HISTORY turned RECS, HOME, and PLAY red, you now understand why microservice dependency graphs matter more than individual service SLAs (a rough sketch of that propagation follows this list).
- You saw CAP theorem in action. When PostgreSQL made you wait 200ms for all replicas to sync, but Cassandra let you write instantly with a 5-second eventual consistency window, that's the CP vs. AP trade-off, lived.
- You understood why database choice matters. Cassandra's 1.8x throughput multiplier isn't just a number. You saw it save your system during a 100x traffic spike.
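Here is the promised sketch of how one failure propagates through a dependency graph. The edges below are assumptions chosen to match the HISTORY / RECS / HOME / PLAY behavior described above; the simulator's actual graph may differ.

```python
# Hypothetical StreamScale dependency graph: service -> services it calls.
# These edges are assumed for illustration, not taken from the simulator.
DEPENDS_ON = {
    "HOME": ["RECS"],
    "RECS": ["HISTORY"],
    "PLAY": ["HISTORY"],
    "HISTORY": [],
}

def impacted(failed: str) -> set[str]:
    """Return every service that is down or transitively depends on the failed one."""
    down = {failed}
    changed = True
    while changed:  # keep propagating until no new service turns red
        changed = False
        for service, deps in DEPENDS_ON.items():
            if service not in down and any(d in down for d in deps):
                down.add(service)
                changed = True
    return down

print(impacted("HISTORY"))  # all four services end up red from one failure
```

Notice that nothing in the failed service itself got worse; the outage spreads purely through the call graph, which is why dependency structure matters more than any single service's SLA.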
The Real World is Messier (But You're Ready)
Real production systems have caching layers (Redis, Memcached), message queues (Kafka, RabbitMQ), circuit breakers, retry policies, rate limiters, and sophisticated observability stacks. This simulator simplifies those away to focus on the fundamentals.
But those fundamentals (sharding, replication, consistency models, failure domains) are the foundation everything else builds on. When you're debugging a production incident at 3 AM, you won't have time to look up what "eventually consistent" means. You'll need to know it.
Frequently Asked Questions
Why does latency decrease when I add shards?
Each shard holds less data, so queries scan smaller datasets. With 4 shards, each query only searches 25% of the data. The formula: ~3ms reduction per additional shard (in this simplified model).
Why does adding replicas slightly increase latency?
Synchronous replication (when configured for strong consistency) waits for all replicas to confirm writes. More replicas = more network round-trips. The formula: ~5ms increase per additional replica. The trade-off is worth it for availability.
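Putting those two formulas together gives a rough picture of the simulator's simplified latency model. How the terms combine below is an assumption based on the stated numbers, not the simulator's actual source, and the 1ms floor is likewise an assumption to keep the toy model from going negative.

```python
def estimated_latency_ms(base_ms: float, shards: int = 1, replicas: int = 1) -> float:
    """Toy model: ~3ms saved per extra shard, ~5ms added per extra synchronous replica."""
    latency = base_ms - 3 * (shards - 1) + 5 * (replicas - 1)
    return max(1.0, latency)  # assumed floor so the toy model never goes below 1ms

# Using PostgreSQL's 35ms base latency from the next answer:
print(estimated_latency_ms(35, shards=4))              # 35 - 3*3       = 26.0 ms
print(estimated_latency_ms(35, shards=4, replicas=3))  # 35 - 9 + 5*2   = 36.0 ms
```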
What's the difference between PostgreSQL, MongoDB, and Cassandra in the simulator?
- PostgreSQL: 35ms base latency, 1.0x throughput, strong consistency (synchronous replication configurable).
- MongoDB: 25ms base latency, 1.3x throughput, tunable consistency.
- Cassandra: 20ms base latency, 1.8x throughput, asynchronous replication (eventual consistency).
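The consistency column is the part that changes behavior most. Below is a small illustrative sketch of why: a synchronous write acknowledges only after the slowest replica confirms, while an asynchronous write acknowledges after the local write and lets replicas catch up later. All timings here are made up for illustration.

```python
import random

def replica_rtt_ms() -> float:
    """Made-up round-trip time to one replica (30-80ms)."""
    return random.uniform(30, 80)

def synchronous_write_ack_ms(replicas: int) -> float:
    # Strong consistency: the client waits for the slowest replica to confirm.
    return max(replica_rtt_ms() for _ in range(replicas))

def asynchronous_write_ack_ms(replicas: int) -> float:
    # Eventual consistency: ack after the local write; more replicas don't slow
    # the ack, they only lengthen the window where reads can be stale.
    _ = replicas
    return 2.0  # illustrative local-write cost in ms

print(f"sync ack  (3 replicas): ~{synchronous_write_ack_ms(3):.0f} ms")
print(f"async ack (3 replicas): ~{asynchronous_write_ack_ms(3):.0f} ms")
```

That is the trade-off the Cassandra row is making: faster acknowledgements in exchange for a window of possibly stale reads.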
Are these numbers realistic?
They're simplified for educational purposes. Real-world latencies depend on hardware, network topology, query complexity, and caching. The relationships (sharding reduces latency, replication adds overhead, Cassandra is faster for writes) are directionally accurate.
Level Up Your System Design Skills
Get weekly interactive tutorials on distributed systems, databases, and backend architecture delivered to your inbox.
Subscribe Free