System Design Simulator: Build, Break, and Fix Databases
You've read the blog posts. You've memorized the definitions: "Sharding splits data horizontally, replication creates copies for redundancy." You can explain CAP theorem on a whiteboard.
But here's the question that separates those who memorize system design from those who understand it: Why did the Billing service just take down your entire platform when the User service went offline? If you can't answer that instantly, if you don't feel it in your bones, you've never actually broken a distributed system. Let's fix that.
- Sharding multiplies throughput: 4 shards ≈ 4x capacity. Each shard handles a slice of your data.
- Replicas multiply availability: With 2+ replicas, one server dying doesn't mean downtime.
- Database choice = trade-off choice: PostgreSQL: consistent but slower. Cassandra: fast but eventually consistent. MongoDB: you decide per-operation.
- Cascade failures are real: One service down can avalanche through your entire system. Watch it happen below.
- The simulator below lets you break things on purpose. Take services offline, trigger 100x traffic, and watch customer happiness plummet.
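The sharding takeaway can be made concrete with a toy hash-based router. This is a minimal sketch, not how the simulator itself routes; the function names and key format are invented for illustration:

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Route a key to a shard by hashing it, so keys spread evenly.

    A stable hash (MD5 here) means the same user always lands on the
    same shard, while different users scatter across all shards.
    """
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# With 4 shards, each shard serves roughly 25% of keys, so each machine
# stores and scans roughly a quarter of the data.
users = [f"user-{i}" for i in range(10_000)]
counts = [0] * 4
for u in users:
    counts[shard_for(u, 4)] += 1
```

Running this shows each of the 4 counters landing near 2,500: the "4 shards ≈ 4x capacity" claim is just this even spread, viewed from the throughput side.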
The Problem With How We Teach System Design
Open any system design book and you'll see diagrams with boxes and arrows. "Add a load balancer here. Shard the database there. Replicate for redundancy." Great. You can draw the architecture on a whiteboard. Test passed.
But here's what the diagrams don't tell you: what does it actually feel like when your carefully designed system melts down at 3 AM? What happens to latency when you add a 5th shard? Why does your "highly available" system show 94% availability when just one service goes offline?
Netflix runs ~700 microservices. When the User Preferences service hiccups, it doesn't just affect "remember my volume settings." The Recommendations service can't personalize. The Home page can't populate. The Play button starts timing out. That's a cascade failure, and you're about to cause one on purpose.
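The mechanics of a cascade are easy to model: treat services as a dependency graph and propagate failure to everything that transitively depends on the dead node. A minimal Python sketch; the service names echo StreamScale's, but the exact graph is my invention for illustration:

```python
# Hypothetical dependency graph: an edge A -> B means "A depends on B".
DEPS = {
    "Home":            ["Recommendations", "UserPrefs"],
    "Recommendations": ["UserPrefs", "History"],
    "Play":            ["History"],
    "UserPrefs":       [],
    "History":         [],
    "Billing":         ["UserPrefs"],
}

def impacted(down: str) -> set[str]:
    """Return every service degraded when `down` goes offline."""
    hit = {down}
    changed = True
    while changed:  # keep propagating until no new service is affected
        changed = False
        for svc, deps in DEPS.items():
            if svc not in hit and any(d in hit for d in deps):
                hit.add(svc)
                changed = True
    return hit

print(sorted(impacted("UserPrefs")))
# -> ['Billing', 'Home', 'Recommendations', 'UserPrefs']
```

Note what the output says: a "leaf" service like UserPrefs, with zero dependencies of its own, still takes Billing, Recommendations, and Home down with it. That asymmetry is why dependency graphs matter more than any single service's SLA.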
The StreamScale Mental Model
The simulator below is StreamScale, a fictional streaming platform with 6 microservices and 3 database options. Each service has dependencies. Each database has trade-offs. Your job is to keep customers happy while the system tries to fall apart.
Try these experiments:
- Experiment 1: Click any service, then toggle "Take Offline." Watch the customer mood change. Notice which other services turn red.
- Experiment 2: Hit the "100x Surge" button. Watch latency spike. Now add 4 shards and see it stabilize.
- Experiment 3: Switch to the "Data Consistency" scenario. Add a user with PostgreSQL vs. Cassandra. Watch the replication timing difference.
- Experiment 4: Click a database cluster to see its "Key Concepts": ACID properties, LSM-trees, write-ahead logging, and more.
StreamScale Infrastructure Simulator
Learn how companies like Netflix scale their systems. Click services to simulate failures!
Learning Mode: All values shown (latencies, costs, sync times) are simplified for educational purposes. Real-world systems have more nuanced behavior, but these approximations help illustrate the core concepts.
Key Concepts
Data Center
Physical servers in a building. $150/month per instance.
Shard
A horizontal slice of your data. Splits load across machines.
Replica
A backup copy of your data in another location.
Latency
Time to respond. Lower is better. Measured in milliseconds.
Throughput
Requests handled per second. Higher is better.
Availability
Uptime percentage. 99.9% = 8.7 hours downtime/year.
1. This is your streaming platform with 1 database shard.
2. Click any service (colored boxes) to open the Service Control Panel.
3. Click the database cluster (blue box) to see Key Concepts like ACID.
4. Move the Shards slider to 2 and watch data split across databases!
Live Infrastructure Map
What You Just Learned (Even If You Didn't Realize It)
If you played with the simulator above, you now understand system design concepts at a visceral level that no whiteboard session can provide:
- You felt sharding. When you moved that slider from 1 to 4 shards and watched latency drop from 200ms to 50ms, that's not theory anymore. That's your intuition being calibrated.
- You caused cascade failures. When taking down HISTORY turned RECS, HOME, and PLAY red, you now understand why microservice dependency graphs matter more than individual service SLAs.
- You saw CAP theorem in action. When PostgreSQL made you wait 200ms for all replicas to sync, but Cassandra let you write instantly with a 5-second eventual consistency window, that's the CP vs. AP trade-off, lived.
- You understood why database choice matters. Cassandra's 1.8x throughput multiplier isn't just a number. You saw it save your system during a 100x traffic spike.
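The CP-vs-AP trade-off you just lived has a one-function shape. A toy model, with invented round-trip numbers; the simulator's internals may compose this differently:

```python
# Toy model of the sync-vs-async replication trade-off.
# The per-replica round-trip times below are made up for illustration.

def write_latency(replica_rtts_ms: list[float], synchronous: bool) -> float:
    """Synchronous: the write blocks until the slowest replica acks.
    Asynchronous: the write returns immediately; replicas catch up later,
    so readers may briefly see stale data."""
    if synchronous:
        return max(replica_rtts_ms)  # strong consistency, pay the latency
    return 0.0  # instant ack, eventual consistency

rtts = [40.0, 55.0, 200.0]
print(write_latency(rtts, synchronous=True))   # 200.0 -> the PostgreSQL-style wait
print(write_latency(rtts, synchronous=False))  # 0.0   -> the Cassandra-style instant ack
```

One slow replica dictates the whole synchronous write: that `max()` is why the PostgreSQL configuration made you wait 200ms, and why Cassandra's instant ack comes bundled with a staleness window.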
The Real World is Messier (But You're Ready)
Real production systems have caching layers (Redis, Memcached), message queues (Kafka, RabbitMQ), circuit breakers, retry policies, rate limiters, and sophisticated observability stacks. This simulator simplifies those away to focus on the fundamentals.
But those fundamentals (sharding, replication, consistency models, failure domains) are the foundation everything else builds on. When you're debugging a production incident at 3 AM, you won't have time to look up what "eventually consistent" means. You'll need to know it.
Now you do.
Frequently Asked Questions
Why does latency decrease when I add shards?
Each shard holds less data, so queries scan smaller datasets. With 4 shards, each query only searches 25% of the data. The formula: ~3ms reduction per additional shard (in this simplified model).
Why does adding replicas slightly increase latency?
Synchronous replication (when configured for strong consistency) waits for all replicas to confirm writes. More replicas = more network round-trips. The formula: ~5ms increase per additional replica. The trade-off is worth it for availability.
What's the difference between PostgreSQL, MongoDB, and Cassandra in the simulator?
PostgreSQL: 35ms base latency, 1.0x throughput, strong consistency (synchronous replication configurable). MongoDB: 25ms latency, 1.3x throughput, tunable consistency. Cassandra: 20ms latency, 1.8x throughput, asynchronous replication (eventual consistency).
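The FAQ numbers fit a single back-of-the-envelope model: base latency per database, minus ~3ms per extra shard, plus ~5ms per extra replica. How the simulator actually composes these terms is my assumption; this sketch just makes the stated formulas executable:

```python
# Simplified latency model assembled from the FAQ numbers above.
# The replica penalty models synchronous replication; applying it
# uniformly to all three databases is a simplification on my part.
BASE_MS = {"PostgreSQL": 35, "MongoDB": 25, "Cassandra": 20}

def latency_ms(db: str, shards: int = 1, replicas: int = 1) -> int:
    ms = BASE_MS[db]
    ms -= 3 * (shards - 1)    # each extra shard scans less data
    ms += 5 * (replicas - 1)  # each extra replica adds a sync round-trip
    return max(ms, 1)         # floor: latency never reaches zero

print(latency_ms("PostgreSQL"))                       # 35
print(latency_ms("PostgreSQL", shards=4))             # 26
print(latency_ms("Cassandra", shards=4, replicas=3))  # 21
```

Playing with the arguments reproduces the simulator's behavior directionally: shards pull latency down, replicas push it up, and the database choice sets where you start.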
Are these numbers realistic?
They're simplified for educational purposes. Real-world latencies depend on hardware, network topology, query complexity, and caching. The relationships (sharding reduces latency, replication adds overhead, Cassandra is faster for writes) are directionally accurate.
Level Up Your System Design Skills
Get weekly interactive tutorials on distributed systems, databases, and backend architecture delivered to your inbox.
Subscribe Free