Lesson 6
Load Balancing & Basic Scaling
Understand how load balancers distribute traffic across multiple servers to handle more users and improve reliability.
20 min read · Beginner
When one server is not enough
Your application is growing. Response times are creeping up. The server CPU is pinned at 100%. You have two choices: buy a bigger machine (vertical scaling) or add more machines (horizontal scaling). For most web applications, horizontal scaling is the better long-term strategy — but it introduces a new question: how do you distribute incoming traffic across multiple servers?
That is the job of a load balancer. It sits in front of your servers, receives all incoming requests, and forwards each one to an available server.
Vertical vs horizontal scaling
Vertical scaling has a ceiling (biggest machine available). Horizontal scaling adds capacity by adding machines.
| Approach | How | Limit | Cost curve |
|---|---|---|---|
| Vertical | Upgrade CPU, RAM, disk on one machine | Hardware maximum | Expensive at top end |
| Horizontal | Add more identical machines | Requires a load balancer and stateless servers | Roughly linear, more predictable |
Vertical scaling is simpler (usually no code changes) but hits a ceiling. Horizontal scaling is how Netflix, Google, and Amazon handle billions of requests, but it requires a load balancer and servers that are stateless or keep their state in a shared store.
Multi-layer load balancing
Load balancing happens at multiple layers in a production system. Traffic passes through several balancing layers before reaching your application code:
| Layer | What it balances | Example |
|---|---|---|
| DNS | Traffic across regions/data centers | Route 53, Cloudflare |
| Application | HTTP requests across app servers | NGINX, AWS ALB, HAProxy |
| Database | Read queries across replicas | PostgreSQL read replicas |
How load balancers work
A load balancer is itself a server (or managed service) with a public IP address. Clients connect to the load balancer, not to individual application servers.
Client → Load Balancer → Server A
                       → Server B
                       → Server C
This setup gives you two major benefits:
- Capacity — three servers handle roughly three times the traffic of one
- Reliability — if Server B crashes, traffic flows to A and C with minimal disruption
Load balancing algorithms
| Algorithm | How it works | Best when |
|---|---|---|
| Round-robin | Each server gets the next request in turn | Servers are identical, requests are similar |
| Least connections | Sends to server with fewest active connections | Requests have varying processing times |
| IP hash | Same client IP goes to the same server while the server pool is unchanged | Sticky sessions needed |
| Weighted | More traffic to more powerful servers | Mixed hardware capacities |
| Random | Picks a server at random | Servers are identical and the balancer should keep no state |
For most starting applications, round-robin is perfectly adequate.
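The algorithms in the table can be sketched in a few lines each. This is a minimal illustration, not how any particular load balancer implements them; the class and function names (`RoundRobin`, `LeastConnections`, `ip_hash`, `weighted_pick`) and the server names are hypothetical.

```python
import hashlib
import itertools
import random


class RoundRobin:
    """Hands out servers in a fixed, repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Picks the server with the fewest requests currently in flight."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # caller must call release() when the request ends
        return server

    def release(self, server):
        self.active[server] -= 1


def ip_hash(servers, client_ip):
    """Maps a client IP to a server; the mapping changes if `servers` changes."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def weighted_pick(weights, rng=random):
    """Picks a server with probability proportional to its weight."""
    servers = list(weights)
    return rng.choices(servers, weights=[weights[s] for s in servers], k=1)[0]


servers = ["server-a", "server-b", "server-c"]
rr = RoundRobin(servers)
print([rr.pick() for _ in range(5)])
# ['server-a', 'server-b', 'server-c', 'server-a', 'server-b']
```

Round-robin needs no knowledge of what the servers are doing, which is why it is a reasonable default. Least connections has to track in-flight requests, so it costs a little more bookkeeping in exchange for handling uneven request durations.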
Sticky sessions
When your app stores session data in server memory (stateful), a user must always hit the same server. Sticky sessions (session affinity) route the same client to the same backend.
| Approach | How | Downside |
|---|---|---|
| Cookie-based stickiness | LB sets a cookie mapping to a server | Uneven load if users have different activity levels |
| IP hash | Hash client IP to pick server | Breaks when IPs change (mobile networks) |
| Shared session store | Store sessions in Redis | Best approach — any server can serve any user |
Recommendation: Use stateless APIs with tokens, or store sessions in Redis. Avoid sticky sessions when possible — they complicate deployments and create uneven load.
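To see why a shared session store removes the need for stickiness, consider this sketch. `SessionStore` is an in-memory stand-in for a shared store such as Redis (it does not use a Redis client), and `AppServer`, the method names, and the session ID are hypothetical.

```python
class SessionStore:
    """In-memory stand-in for a shared store such as Redis (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def set(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """Hypothetical app server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.set(session_id, {**session, "cart": cart})
        return f"{self.name} added {item}"

    def view_cart(self, session_id):
        return self.store.get(session_id).get("cart", [])


store = SessionStore()
server_a = AppServer("server-a", store)
server_b = AppServer("server-b", store)

server_a.add_to_cart("session-123", "book")  # first request lands on server A
print(server_b.view_cart("session-123"))     # next request lands on server B
# ['book']
```

Because neither server holds the cart in its own memory, the load balancer is free to send each request to any backend, and a server can be restarted or removed without losing sessions.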
Auto-scaling intro
Auto-scaling automatically adds or removes servers based on demand:
CPU > 70% for 5 min → add 2 servers
CPU < 30% for 10 min → remove 1 server
| Metric | Scale up when | Scale down when |
|---|---|---|
| CPU utilization | Sustained above 70% | Sustained below 30% |
| Request queue depth | Queue depth growing | Queue empty |
| Response time | p95 latency above target | p95 latency normal |
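Managed tooling such as AWS Auto Scaling groups (which add and remove instances) and the Kubernetes Horizontal Pod Autoscaler (which adds and removes pods) can apply rules like these automatically. Start with manual scaling until you understand your traffic patterns, then automate.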
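The two example CPU rules from this section can be written as a small decision function. This is a sketch of the idea only, not how AWS or Kubernetes evaluate their policies; `ScalingPolicy`, `desired_servers`, and the `min_servers`/`max_servers` bounds are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    scale_out_cpu: float = 70.0   # percent; add servers above this
    scale_out_minutes: int = 5
    scale_out_step: int = 2
    scale_in_cpu: float = 30.0    # percent; remove servers below this
    scale_in_minutes: int = 10
    scale_in_step: int = 1
    min_servers: int = 1          # assumption: never go below one server
    max_servers: int = 10         # assumption: arbitrary upper bound


def desired_servers(current, cpu_per_minute, policy=ScalingPolicy()):
    """Return the new server count.

    `cpu_per_minute` holds one average CPU reading per minute, oldest first.
    """
    recent_out = cpu_per_minute[-policy.scale_out_minutes:]
    recent_in = cpu_per_minute[-policy.scale_in_minutes:]

    sustained_high = (
        len(recent_out) == policy.scale_out_minutes
        and all(cpu > policy.scale_out_cpu for cpu in recent_out)
    )
    sustained_low = (
        len(recent_in) == policy.scale_in_minutes
        and all(cpu < policy.scale_in_cpu for cpu in recent_in)
    )

    if sustained_high:
        return min(current + policy.scale_out_step, policy.max_servers)
    if sustained_low:
        return max(current - policy.scale_in_step, policy.min_servers)
    return current


print(desired_servers(3, [80, 85, 90, 75, 88]))  # 5
```

Requiring every sample in the window to cross the threshold is one simple way to express "sustained"; it keeps a single spike from triggering a scaling action. Scaling in more slowly than scaling out, as the rules do, errs on the side of keeping capacity.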
Health checks
Load balancers continuously check if backend servers are healthy — usually by sending a periodic request to a /health endpoint. Unhealthy servers are removed from the rotation until they recover.
A good health endpoint returns:
{ "status": "ok", "database": "connected", "version": "1.2.3" }
Check dependencies (database, cache) — a server that cannot reach its database should not receive traffic.
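A health endpoint that checks its dependencies might look roughly like the sketch below. The function names, the `cache` field, and the use of HTTP 503 for an unhealthy server are assumptions for illustration; the dependency checks are passed in as plain callables so the example stays self-contained.

```python
def health_report(check_database, check_cache, version="1.2.3"):
    """Build the /health response. Returns (http_status, body)."""
    database_ok = check_database()
    cache_ok = check_cache()
    healthy = database_ok and cache_ok
    body = {
        "status": "ok" if healthy else "unavailable",
        "database": "connected" if database_ok else "disconnected",
        "cache": "connected" if cache_ok else "disconnected",  # assumed extra field
        "version": version,
    }
    # Assumption: 503 tells the load balancer to stop routing to this server.
    return (200 if healthy else 503), body


def healthy_servers(servers, probe):
    """Keep only servers whose probe returns HTTP 200, as a load balancer would."""
    return [server for server in servers if probe(server) == 200]


status, body = health_report(lambda: True, lambda: True)
print(status, body["status"])  # 200 ok
```

On the load balancer side, `healthy_servers` stands in for the periodic probe: a server that starts returning a non-200 status drops out of the list and rejoins once its checks pass again.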
In practice
Before adding a load balancer, confirm you actually need one. For many apps, a single well-provisioned server can handle thousands of concurrent users. Add load balancing when monitoring shows sustained high CPU or degrading response times, not because architecture diagrams look better with more boxes.
Key takeaways
- Load balancers distribute traffic across multiple servers for capacity and reliability
- For most web applications, horizontal scaling (more servers) is the better long-term strategy than vertical scaling
- Round-robin is a simple, effective algorithm for even distribution
- Avoid sticky sessions — use shared session storage instead
- Auto-scaling adds/removes servers based on metrics
- Health checks enable automatic failover
Common mistakes
- Scaling before you need to — premature optimization adds operational complexity
- Ignoring stateful sessions: stateful apps need either sticky sessions or shared session storage
- Forgetting about the database — adding app servers does not help if the database is the bottleneck
- No health checks — without them, traffic goes to crashed servers
Go deeper
- NGINX Load Balancing Guide — practical configuration reference
- AWS Well-Architected Framework — reliability and performance best practices
- High Scalability — real-world architecture case studies