Pro tip: This simulator works best on desktop, where you get the full hands-on experience.
System Design Simulator: Build, Break, and Fix Databases
You've memorized the definitions. But can you explain why one service going offline crashes your entire platform? If not, you've never broken a distributed system. Let's fix that.
The simulator below is StreamScale, a fictional streaming platform with 6 microservices and 3 database options. Each service has dependencies. Each database has trade-offs. Your job is to keep customers happy while the system tries to fall apart.
StreamScale Infrastructure Simulator
Learn how companies like Netflix scale their systems. Click services to simulate failures!
Learning Mode: All values shown (latencies, costs, sync times) are simplified for educational purposes. Real-world systems have more nuanced behavior, but these approximations help illustrate the core concepts.
Key Concepts
Latency
Time to respond. Lower is better. <100ms feels instant.
Throughput
Requests/second the system handles. Scales with shards.
Availability
99.9% = 8.7hrs downtime/year. Replicas increase this.
Idempotency
Same request twice = same result. Safe to retry on failure.
Consensus
Replicas agreeing on data. Raft/Paxos algorithms ensure this.
Shard
Horizontal data split. Each shard holds a portion of data.
Replica
Data copy for fault tolerance. Primary + followers.
Circuit Breaker
Stops requests to failing services. Prevents cascades.
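To make that last definition concrete, here is a minimal circuit-breaker sketch in Python. It is illustrative only: the failure threshold, the cooldown, and the idea of wrapping an arbitrary `func` are assumptions for this example, not behavior the simulator exposes.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a cooldown period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        # If the breaker is open and the cooldown has not elapsed, fail fast.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping call to failing service")
            self.opened_at = None  # half-open: let one trial call through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the breaker
            raise
        else:
            self.failures = 0  # success resets the failure count
            return result
```

The shape of the pattern is the point: once a dependency keeps failing, callers stop waiting on it and fail fast, which is what keeps one red service from dragging its neighbors down with it.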
What You Just Learned (Even If You Didn't Realize It)
If you played with the simulator above, you now understand system design concepts at a visceral level that no whiteboard session can provide:
- You felt sharding. When you moved that slider from 1 to 4 shards and watched latency drop from 200ms to 50ms, that's not theory anymore. That's your intuition being calibrated.
- You caused cascade failures. When taking down HISTORY turned RECS, HOME, and PLAY red, you now understand why microservice dependency graphs matter more than individual service SLAs (a rough sketch of that propagation follows this list).
- You saw CAP theorem in action. When PostgreSQL made you wait 200ms for all replicas to sync, but Cassandra let you write instantly with a 5-second eventual consistency window, that's the CP vs. AP trade-off, lived.
- You understood why database choice matters. Cassandra's 1.8x throughput multiplier isn't just a number. You saw it save your system during a 100x traffic spike.
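Here is the promised sketch of how one failure propagates through a dependency graph. The edges below are assumptions chosen to match the HISTORY / RECS / HOME / PLAY behavior described above; the simulator's actual graph may differ.

```python
# Hypothetical StreamScale dependency graph: service -> services it calls.
# These edges are assumed for illustration, not taken from the simulator.
DEPENDS_ON = {
    "HOME": ["RECS"],
    "RECS": ["HISTORY"],
    "PLAY": ["HISTORY"],
    "HISTORY": [],
}

def impacted(failed: str) -> set[str]:
    """Return every service that is down or transitively depends on the failed one."""
    down = {failed}
    changed = True
    while changed:  # keep propagating until no new service turns red
        changed = False
        for service, deps in DEPENDS_ON.items():
            if service not in down and any(d in down for d in deps):
                down.add(service)
                changed = True
    return down

print(impacted("HISTORY"))  # all four services end up red from one failure
```

Notice that nothing in the failed service itself got worse; the outage spreads purely through the call graph, which is why dependency structure matters more than any single service's SLA.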
The Real World is Messier (But You're Ready)
Real production systems have caching layers (Redis, Memcached), message queues (Kafka, RabbitMQ), circuit breakers, retry policies, rate limiters, and sophisticated observability stacks. This simulator simplifies those away to focus on the fundamentals.
But those fundamentals (sharding, replication, consistency models, failure domains) are the foundation everything else builds on. When you're debugging a production incident at 3 AM, you won't have time to look up what "eventually consistent" means. You'll need to know it.
Frequently Asked Questions
Why does latency decrease when I add shards?
Each shard holds less data, so queries scan smaller datasets. With 4 shards, each query only searches 25% of the data. The formula: ~3ms reduction per additional shard (in this simplified model).
Why does adding replicas slightly increase latency?
Synchronous replication (when configured for strong consistency) waits for all replicas to confirm writes. More replicas = more network round-trips. The formula: ~5ms increase per additional replica. The trade-off is worth it for availability.
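Putting those two formulas together gives a rough picture of the simulator's simplified latency model. How the terms combine below is an assumption based on the stated numbers, not the simulator's actual source, and the 1ms floor is likewise an assumption to keep the toy model from going negative.

```python
def estimated_latency_ms(base_ms: float, shards: int = 1, replicas: int = 1) -> float:
    """Toy model: ~3ms saved per extra shard, ~5ms added per extra synchronous replica."""
    latency = base_ms - 3 * (shards - 1) + 5 * (replicas - 1)
    return max(1.0, latency)  # assumed floor so the toy model never goes below 1ms

# Using PostgreSQL's 35ms base latency from the next answer:
print(estimated_latency_ms(35, shards=4))              # 35 - 3*3       = 26.0 ms
print(estimated_latency_ms(35, shards=4, replicas=3))  # 35 - 9 + 5*2   = 36.0 ms
```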
What's the difference between PostgreSQL, MongoDB, and Cassandra in the simulator?
- PostgreSQL: 35ms base latency, 1.0x throughput, strong consistency (synchronous replication configurable).
- MongoDB: 25ms base latency, 1.3x throughput, tunable consistency.
- Cassandra: 20ms base latency, 1.8x throughput, asynchronous replication (eventual consistency).
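The consistency column is the part that changes behavior most. Below is a small illustrative sketch of why: a synchronous write acknowledges only after the slowest replica confirms, while an asynchronous write acknowledges after the local write and lets replicas catch up later. All timings here are made up for illustration.

```python
import random

def replica_rtt_ms() -> float:
    """Made-up round-trip time to one replica (30-80ms)."""
    return random.uniform(30, 80)

def synchronous_write_ack_ms(replicas: int) -> float:
    # Strong consistency: the client waits for the slowest replica to confirm.
    return max(replica_rtt_ms() for _ in range(replicas))

def asynchronous_write_ack_ms(replicas: int) -> float:
    # Eventual consistency: ack after the local write; more replicas don't slow
    # the ack, they only lengthen the window where reads can be stale.
    _ = replicas
    return 2.0  # illustrative local-write cost in ms

print(f"sync ack  (3 replicas): ~{synchronous_write_ack_ms(3):.0f} ms")
print(f"async ack (3 replicas): ~{asynchronous_write_ack_ms(3):.0f} ms")
```

That is the trade-off the Cassandra row is making: faster acknowledgements in exchange for a window of possibly stale reads.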
Are these numbers realistic?
They're simplified for educational purposes. Real-world latencies depend on hardware, network topology, query complexity, and caching. The relationships (sharding reduces latency, replication adds overhead, Cassandra is faster for writes) are directionally accurate.
Level Up Your System Design Skills
Get weekly interactive tutorials on distributed systems, databases, and backend architecture delivered to your inbox.
Subscribe Free